Project Karl

Introduction

Project Karl encompasses a series of human evaluation tasks designed to assess the quality of machine-generated text, primarily focusing on machine translation (MT) outputs and responses generated by Large Language Models (LLMs) or AI chatbots. The evaluations aim to provide detailed feedback on aspects like accuracy, fluency, and adherence to specific error taxonomies or guidelines, often within the context of internationalization (i18n) efforts.

Types of tasks

This project consists of several different workflows or tasks, described below:

Quality Grading Task (SxS): In this workflow, we will evaluate the quality of automatically generated translations. Your task will be to assign each translation option a quality level on a scale from 0 to 6, based on two error categories: accuracy and fluency.
We invite you to have a look at these guidelines for more details:

MQM Task_Error Annotation: In this workflow, you will annotate all errors found in each of the translations that appear in the tool, based on the MQM scoring model.
All details can be found here:
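
For orientation only, below is a minimal Python sketch of how an MQM-style score is typically derived from annotated errors. The severity weights (minor = 1, major = 5, critical = 10) and the per-word normalization are common defaults from public MQM material, not necessarily the exact values used in this project; always follow the linked scoring model.

# Minimal sketch of an MQM-style score calculation (illustrative only).
# Severity weights and normalization are assumed common defaults,
# not necessarily this project's exact values.
SEVERITY_WEIGHTS = {"neutral": 0, "minor": 1, "major": 5, "critical": 10}

def mqm_score(errors, word_count):
    """Return a per-word quality score in [0, 1] from annotated errors.

    errors: list of (category, severity) tuples produced by the annotator.
    word_count: number of words in the evaluated translation segment.
    """
    penalty = sum(SEVERITY_WEIGHTS[severity] for _category, severity in errors)
    # Normalize the total penalty by segment length so long and short
    # segments are comparable, then invert so 1.0 means "no errors found".
    return max(0.0, 1.0 - penalty / word_count)

# Example: two minor fluency errors and one major accuracy error
# in a 50-word segment.
example_errors = [("fluency", "minor"), ("fluency", "minor"), ("accuracy", "major")]
print(mqm_score(example_errors, word_count=50))  # 1 - 7/50 = 0.86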

i18n S_S Evaluation project - LLM Prompt Evaluation: In this task, you are asked to evaluate the prompt and the responses from different AI chatbots and record your evaluations based on detailed guidelines from the client.
All needed information can be found here:

Type 2 i18n S_S Evals: Similar to the above request, but with some additional peculiarities that you will need to assess in order to complete this task.
See the entire documentation here:

LLM Grounding Factuality Task: In this task, you are asked to assess truthfulness or factuality by choosing the level of accuracy of the responses. For this, you will need to identify each claim that makes up the response.
The instruction document can be found here:
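
As an illustration of the claim-by-claim approach, the Python sketch below shows one hypothetical way a response could be broken into claims and rolled up into an overall accuracy label; the field names and the roll-up rule are assumptions for the example, not the project's schema or levels.

# Hypothetical record for a grounding/factuality judgment (illustrative only).
# Field names and the roll-up rule are assumptions, not the project's schema.
response_claims = [
    {"claim": "The Eiffel Tower is in Paris.",           "supported": True},
    {"claim": "It was completed in 1889.",               "supported": True},
    {"claim": "It is the tallest structure in Europe.",  "supported": False},
]

supported = sum(1 for c in response_claims if c["supported"])
ratio = supported / len(response_claims)

# Example roll-up into a coarse accuracy level; the real task uses the
# levels defined in the instruction document linked above.
if ratio == 1.0:
    accuracy_level = "fully accurate"
elif ratio >= 0.5:
    accuracy_level = "partially accurate"
else:
    accuracy_level = "mostly inaccurate"

print(accuracy_level)  # partially accurate (2 of 3 claims supported)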

Image Annotation Task: This task will be done in a different tool environment, where you will have to create a prompt and a response based on an image that you will be provided.
All information is here:

Prompt Evaluation and Rewrite: This task consists of reviewing a prompt based on 3 quality dimensions (self-contained, naturalness, and clear intent): if at least one of these parameters is not met, your task is also to rewrite the prompt so that it meets all 3 parameters.
All information is here:

Prompt Translation Task: In this project you will have to translate a given prompt and ensure the translated version meets the requirements stated in the guidelines. You do not need to review the prompt for correctness, accuracy, etc., as in other projects.
All information is here:

i18n Question Data Post-editing: In this task, you will need to assign the translation output a quality level based on the issues present in the localized output, and post-edit the final result to make it error-free.
All instructions are here:

Source Text Translation with Glossary Hints: In this project you will have to translate a given text into the required target language with proper grammar, punctuation and spelling, keeping the source meaning intact and using the given glossary terms.
All information is here:

Fluency and Adequacy Rating: The goal of this task is to evaluate machine-translated conversational data from a human standpoint. You will need to read the original text and the translated text to determine the quality, with adequacy and fluency as the main parameters.
Detailed instructions are here:

Transcribe Audio - Food Orders: For this project, you will hear audio of customers ordering at a fast-food drive-through. Your task will be to rate the audio transcript for accuracy, using the specific guidelines provided here:

Prompt and Response Generation - Indian Languages: In this project, you will need to create a prompt and response pair in your specific dialect of an Indic language. These prompts should be diverse and of different levels of severity.
See info here:

Severity and Safety Labeling: You will need to evaluate both prompts and responses based on how harmful the content is. You will have a definition for each safety attribute (hate speech, harassment, dangerous content, sexually explicit) for both prompt and response. You will also evaluate the quality (high or low) of each pair, as an overall assessment of both prompt and response, according to the criteria set for this task.
All the detailed guidelines are found here:

Severity and Safety Labeling: The evaluation encompasses three tasks for the prompt and for the response. The response will be evaluated with respect to the prompt. The annotations are based on the points below:
● Association of Safety attribute labels
● Severity for each associated safety attribute label
● Overall Quality of the Prompt & Response Pair
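
Purely as an illustration of how these three annotation parts fit together, the Python sketch below models one prompt/response pair with safety-attribute labels, per-attribute severities, and an overall quality judgment. All field names and the 0-3 severity scale are assumptions for the example, not the client's schema; follow the detailed guidelines linked above.

# Hypothetical annotation record for the Severity and Safety Labeling task
# (illustrative only; field names and the 0-3 severity scale are assumptions).
annotation = {
    "prompt": {
        # Safety attributes associated with the prompt, each with a severity.
        "safety_attributes": {
            "hate_speech": 0,        # 0 = none, 3 = most severe (assumed scale)
            "harassment": 2,
            "dangerous_content": 0,
            "sexually_explicit": 0,
        },
    },
    "response": {
        # The response is judged with respect to the prompt above.
        "safety_attributes": {
            "hate_speech": 0,
            "harassment": 0,
            "dangerous_content": 0,
            "sexually_explicit": 0,
        },
    },
    # Overall quality of the prompt & response pair, per the task criteria.
    "pair_quality": "high",
}

print(annotation["pair_quality"])  # high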

Manual SxS Human Evaluations: In this task, you will see a user prompt and two AI-generated responses (responses from two different AI models). You will assess each response individually along several dimensions, and at the end, you will select which response you think is better and provide an explanation of why you think it is better.

The dimensions are: Safety/Harmlessness, Writing Style, Verbosity, Instruction Following, Truthfulness, and Overall Quality. Then, you will need to rewrite the response based on the style guide provided in the instructions here:

Cultural Relevance Benchmarking Rubric: In this task, you will see a user prompt and an AI-generated response. You are required to evaluate the provided prompt and its corresponding response keeping Indian culture in mind. In other words, we need your help in identifying the “culturally (Indian) relevant” responses. In this task, both the prompt and the response are expected to be in Hindi only. Some of the prompts or responses may contain English words along with Hindi.
More detailed guidelines are here:

Localization

Human Translation and Review projects: This is a localization task that will be performed in Phrase as the TMS. There are no specific language style guides, but there are general linguistic considerations to be applied in your translations, explained here:

Tool Data Compute - LLM Projects

We will be working in a new tool, hosted and developed by the client, called Data Compute. To be able to use this tool, you will first need to create your own account. As always, security is key, so you might be requested to provide your email, your telephone number and some personal details that will not be revealed anywhere and are used solely for the purpose of creating your dedicated client account.
That account will be valid for the entire duration of the project, and you will receive the details from our team once all the information is passed on to you.
The step-by-step instructions, along with an FAQ document, are available at all times here:

Phrase as Translation Management System - Localization Projects

The TMS for Karl projects is Phrase (Memsource). Details on how to perform your translation or review, and the mandatory checklists, can be found here:

Phrase_General_linguists_guide

Rubric Review

The goal of this task is to review each individual rubric property for a prompt. Assume you are a teacher responsible for scoring student responses, and validate, correct, or remove the given rubrics accordingly.

More detailed guidelines are here:

Payment information: Payment information for translators
