Customizing Lighteval: A Deep Dive into Creating Tailored Evaluations

Your model outperforms the usual benchmarks—so how do you prove it? Lighteval lets you build custom evaluation tasks, metrics, and pipelines from scratch. This guide walks you through everything, from setup to advanced customization. Because true innovation needs its own measuring stick.

Introduction: Why Customize?

Imagine this: you’re working on evaluating a cutting-edge large language model (LLM) for a unique use case. The predefined tasks and metrics available in the ecosystem don’t quite cut it. Your model shines in ways that general benchmarks can’t capture, but how do you measure that brilliance?

Lighteval is your new best friend. Beyond its plug-and-play capabilities, Lighteval is like a toolkit with a "build-your-own-everything" mode. Whether you want to craft bespoke evaluation tasks, define new metrics, or design entirely custom pipelines, this guide will take you from basics to advanced customization, with plenty of examples to make it stick.

Part 1: The Basics of Customization

Before diving into customization, let’s revisit the building blocks.

  1. Task: The evaluation scenario or problem you want your model to solve (e.g., answering questions, summarizing, etc.).
  2. Metric: The criteria by which performance is measured (e.g., accuracy, BLEU score, etc.).
  3. Pipeline: The orchestration of tasks, models, and metrics to produce results.

Lighteval empowers you to tweak or redefine each of these components.

Part 2: Building a Custom Task

What is a Task?

A task in Lighteval represents the context, input, and expected output for your model. It could range from basic text classification to complex multi-turn dialogue evaluations.

Step-by-Step Guide to Creating a Task

1. Define a Prompt Function

This function acts as the translator between your raw dataset rows and the standardized Doc format that Lighteval feeds to the model.

from lighteval.tasks.requests import Doc

def my_prompt_function(line, task_name=None):
    """
    Converts dataset entries into a standardized format.

    Args:
        line: A single line from your dataset.
        task_name: Optional name of the task for tracking.

    Returns:
        A Doc object with query, choices, and gold standard answers.
    """
    return Doc(
        task_name=task_name,
        query=line["question"],
        choices=[f" {choice}" for choice in line["choices"]],
        gold_index=line["gold"],
    )
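
To sanity-check the function before wiring it into a task, call it on a hand-written row shaped like your dataset (the field names question, choices, and gold below are simply the ones the prompt function above expects):

sample_row = {
    "question": "What is the capital of France?",
    "choices": ["Berlin", "Paris", "Rome"],
    "gold": 1,  # index of the correct choice
}

doc = my_prompt_function(sample_row, task_name="my_custom_task")
print(doc.query)       # What is the capital of France?
print(doc.choices)     # [' Berlin', ' Paris', ' Rome']
print(doc.gold_index)  # 1
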
2. Configure the Task

Now, define your custom task using LightevalTaskConfig:

from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig

my_task = LightevalTaskConfig(
    name="my_custom_task",
    prompt_function=my_prompt_function,
    suite=["community"],                 # Suite the task belongs to (matches the run command below)
    hf_repo="myorg/mydataset",           # Path to your dataset on the Hugging Face Hub
    hf_subset="default",                 # Dataset configuration ("default" if there is only one)
    metric=[Metrics.loglikelihood_acc],  # Metric(s) for evaluation, taken from the Metrics enum
    evaluation_splits=["test"],          # Dataset splits to evaluate
)
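
LightevalTaskConfig accepts more options than shown above. Depending on your dataset and the kind of metric you use, you may also want fields like the following (a sketch reusing the imports above; field names and accepted values can vary between Lighteval versions, so check LightevalTaskConfig in your installed release):

my_generative_task = LightevalTaskConfig(
    name="my_generative_task",
    prompt_function=my_prompt_function,
    suite=["community"],
    hf_repo="myorg/mydataset",
    hf_subset="default",
    metric=[Metrics.exact_match],   # a generative metric instead of a loglikelihood one
    evaluation_splits=["test"],
    few_shots_split="validation",   # split to draw few-shot examples from
    few_shots_select="sequential",  # how few-shot examples are picked
    generation_size=256,            # max tokens to generate for generative metrics
    stop_sequence=["\n"],           # stop generating at a newline
)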

3. Register the Task

Keep the prompt function and the task config together in one Python file (for example my_custom_tasks.py) and expose the task through a module-level TASKS_TABLE list; this is the variable Lighteval looks for when you point it at the file with --custom-tasks:

TASKS_TABLE = [my_task]
4. Run the Evaluation

Here's where the lighteval accelerate command comes in. Open your terminal and run the evaluation, pointing Lighteval at the file that defines your task:

lighteval accelerate \
    "pretrained=my-fancy-model" \
    "community|my_custom_task|0|0" \
    --custom-tasks my_custom_tasks.py
  • accelerate: Runs the evaluation on your local machine or GPU(s) via Hugging Face Accelerate.
  • pretrained=my-fancy-model: Specifies the model to evaluate (either hosted on the Hugging Face Hub or stored locally).
  • community|my_custom_task|0|0: Refers to your task setup:
    • "community": The suite your task belongs to (e.g., community, lighteval, etc.).
    • "my_custom_task": The name of your task.
    • "0|0": The number of few-shot examples (0 = zero-shot) and whether to automatically reduce that number if the prompt exceeds the context window (0 = no).
  • --custom-tasks my_custom_tasks.py: The file that defines your task and its TASKS_TABLE.

This command is executed in your terminal. Ensure you’ve registered your task and set up your environment as explained above.
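
Putting the three code steps together, a minimal custom-task file might look like this (a sketch; the dataset name, field names, and metric are placeholders to adapt to your data):

# my_custom_tasks.py -- passed to Lighteval via --custom-tasks
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc


def my_prompt_function(line, task_name=None):
    # Map one dataset row to a Doc: query, answer choices, index of the correct choice.
    return Doc(
        task_name=task_name,
        query=line["question"],
        choices=[f" {choice}" for choice in line["choices"]],
        gold_index=line["gold"],
    )


my_task = LightevalTaskConfig(
    name="my_custom_task",
    prompt_function=my_prompt_function,
    suite=["community"],
    hf_repo="myorg/mydataset",
    hf_subset="default",
    metric=[Metrics.loglikelihood_acc],
    evaluation_splits=["test"],
)

# Lighteval reads this module-level list to discover your tasks.
TASKS_TABLE = [my_task]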

Part 3: Designing Custom Metrics

What is a Metric?

Metrics are the rulers by which you measure your model’s success. While Lighteval comes with several built-in metrics like accuracy and F1, custom use cases may demand tailored evaluation criteria.

Step-by-Step Guide to Creating Metrics

1. Define a Metric Function

A metric function computes the result for a single example. Let’s create a simple accuracy metric:

def my_custom_metric(predictions, formatted_doc, **kwargs):
    """
    Compares the model's prediction with the gold standard.

    Args:
        predictions: List of model outputs for this example.
        formatted_doc: The standardized Doc object for this example.
        **kwargs: Any additional arguments Lighteval passes along.

    Returns:
        A single score for the example: 1.0 if the first prediction matches
        the gold choice, 0.0 otherwise. Lighteval aggregates these scores
        with the corpus-level function defined below.
    """
    return float(predictions[0] == formatted_doc.choices[formatted_doc.gold_index])
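
A quick sanity check on a hand-built Doc (reusing the Doc import from Part 2) shows how the score behaves:

doc = Doc(
    task_name="my_custom_task",
    query="What is the capital of France?",
    choices=[" Berlin", " Paris", " Rome"],
    gold_index=1,
)
print(my_custom_metric(predictions=[" Paris"], formatted_doc=doc))   # 1.0 (correct)
print(my_custom_metric(predictions=[" Berlin"], formatted_doc=doc))  # 0.0 (wrong)
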
2. Register the Metric

Wrap your function into Lighteval’s SampleLevelMetric class:

from aenum import extend_enum

from lighteval.metrics.metrics import Metrics
from lighteval.metrics.utils.metric_utils import (
    MetricCategory,
    MetricUseCase,
    SampleLevelMetric,
)

my_metric = SampleLevelMetric(
    metric_name="my_custom_metric",
    higher_is_better=True,
    category=MetricCategory.GENERATIVE,  # the metric scores generated text
    use_case=MetricUseCase.ACCURACY,
    sample_level_fn=my_custom_metric,
    corpus_level_fn=lambda scores: sum(scores) / len(scores),  # Aggregation across samples
)

Register it:

extend_enum(Metrics, "my_custom_metric", my_metric)
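
Once registered, the metric is addressable like any built-in one, so you can point the task config from Part 2 at it (assuming the extend_enum call above runs before the task is built, e.g. by keeping the metric code in the same custom-tasks file):

my_task = LightevalTaskConfig(
    name="my_custom_task",
    prompt_function=my_prompt_function,
    suite=["community"],
    hf_repo="myorg/mydataset",
    hf_subset="default",
    metric=[Metrics.my_custom_metric],  # your custom metric instead of a built-in one
    evaluation_splits=["test"],
)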

Part 4: Advanced Scenarios

1. Evaluating Quantized Models

Quantization reduces model size and speeds up inference, sometimes at a small cost in quality. To load your model in 8-bit precision on the fly and evaluate it, use this command in your terminal:

lighteval accelerate \
    "pretrained=my-model,load_in_8bit=True" \
    "community|my_custom_task|0|0" \
    --custom-tasks my_custom_tasks.py
  • Where to Run It?: Execute this directly in your terminal.
  • What It Does: Loads your model with 8-bit quantization and evaluates it on the custom task you created.
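
For heavier compression, the same pattern works with 4-bit loading (assuming your Lighteval version exposes the load_in_4bit model argument; check the supported model arguments in your installed release):

lighteval accelerate \
    "pretrained=my-model,load_in_4bit=True" \
    "community|my_custom_task|0|0" \
    --custom-tasks my_custom_tasks.py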

2. Running Multi-GPU Evaluations

Models that don't fit on a single GPU can be split across several with model parallelism:

lighteval accelerate \
    "pretrained=my-large-model,model_parallel=True" \
    "community|my_custom_task|0|0" \
    --custom-tasks my_custom_tasks.py
  • Where to Run It?: Use this in your terminal after configuring your multi-GPU setup with accelerate config.
  • What It Does: Shards the model's weights across the available GPUs so a model too large for one device can still be evaluated.
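
If the model fits on a single GPU and you simply want more throughput, data parallelism (one model replica per GPU, each handling a slice of the examples) is launched through Accelerate itself; a sketch assuming 8 GPUs:

accelerate launch --multi_gpu --num_processes=8 -m \
    lighteval accelerate \
    "pretrained=my-fancy-model" \
    "community|my_custom_task|0|0" \
    --custom-tasks my_custom_tasks.py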

3. Comparing Models Side-by-Side

Compare multiple models on the same task by running the evaluation once per model:

for model in model1 model2 model3; do
    lighteval accelerate \
        "pretrained=$model" \
        "community|my_custom_task|0|0" \
        --custom-tasks my_custom_tasks.py \
        --output-dir ./results
done
  • Where to Run It?: Execute in the terminal. A single lighteval run evaluates one model, so loop over (or repeat the command for) each model you want to compare.
  • What It Does: Benchmarks model1, model2, and model3 on the same task and writes each run's results under ./results for side-by-side comparison.
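
To line the scores up afterwards, you can read each run's result file back in. A minimal sketch, assuming the default output layout (one results_<timestamp>.json per run under ./results, with a "results" section holding the scores):

import glob
import json

# Print the scores from every results_*.json written under ./results.
for path in sorted(glob.glob("./results/**/results_*.json", recursive=True)):
    with open(path) as f:
        run = json.load(f)
    print(path)
    print(json.dumps(run.get("results", {}), indent=2))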

Part 5: Debugging and Analysis

Saving Results

Lighteval writes results to a local output directory; add --save-details to also save per-sample details:

--output-dir ./results --save-details

Inspect the saved details, which are written as Parquet files under ./results/details/<model_name>/<timestamp>/:

from datasets import load_dataset

details = load_dataset(
    "parquet",
    data_files="./results/details/<model_name>/<timestamp>/<details_file>.parquet",
    split="train",  # load the rows directly rather than a DatasetDict
)
for detail in details:
    print(detail)
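
For error analysis it is often handier to pull the details into pandas and inspect the failures; a minimal sketch (column names depend on your task and metric, so look at df.columns first):

df = details.to_pandas()

# Inspect which fields the details file actually contains before filtering on them.
print(df.columns.tolist())
print(df.head())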

Conclusion

Customizing Lighteval unlocks unparalleled flexibility in evaluating LLMs. With clear workflows for creating tasks, metrics, and pipelines, you can fine-tune evaluations to match your exact needs.

Whether you’re benchmarking quantized models, running multi-GPU setups, or defining unique metrics, Lighteval has your back. Don’t forget to explore the official documentation for more inspiration and guidance.

Now it’s your turn—start building.

Cohorte Team

February 21, 2025