Getting Started with Lighteval: Your All-in-One LLM Evaluation Toolkit

Evaluating large language models is complex—Lighteval makes it easier. Test performance across multiple backends with precision and scalability. This guide takes you from setup to your first evaluation step by step.

Introduction: The Challenge of Evaluating LLMs

Evaluating large language models (LLMs) is no small feat. With diverse architectures, deployment environments, and use cases, assessing an LLM’s performance demands flexibility, precision, and scalability. That’s where Lighteval comes in: a comprehensive toolkit designed to simplify and enhance the evaluation process for LLMs across multiple backends, including transformers, vLLM, Nanotron, and more.

Whether you're an AI researcher, developer, or enthusiast, this guide will walk you through the essentials of Lighteval, from installation to running your first evaluation.

1. What Is Lighteval?

Lighteval is an evaluation toolkit that allows you to:

  • Evaluate LLMs across multiple backends like transformers, TGI, vLLM, and Nanotron.
  • Save detailed, sample-by-sample results to debug and explore model performance.
  • Customize tasks and metrics for tailored evaluations.
  • Store and share results on Hugging Face Hub, S3, or locally.
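
Every command-line example in this guide follows the same general shape: a backend subcommand, a model-argument string, and a task specification (some backends take a configuration file instead of an inline argument string):

lighteval <backend> "<model_args>" "<suite|task|num_fewshot|truncation_flag>"

The sections below fill in each of these pieces for the accelerate backend.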

2. Installation: Setting Up Lighteval

Installing from PyPI

To get started quickly, install Lighteval using pip:

pip install lighteval

Installing from Source

If you plan to contribute or need the latest features, clone the repository:

git clone https://github.com/huggingface/lighteval.git
cd lighteval
pip install -e .
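
Either way, you can verify that the command-line interface is available; this should print the subcommands and options supported by your installed version:

lighteval --help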

Optional Extras

Lighteval supports additional features via optional dependency groups; replace extras_group in the command below with the group you need:

pip install lighteval[extras_group]

Here are a few of the available groups (a combined install example follows this list):

  • tgi: Evaluate models served with Text Generation Inference.
  • nanotron: Evaluate Nanotron models.
  • s3: Save results to an S3 bucket.
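
Extras can also be combined in a single install. Note that some shells (zsh in particular) require quotes around the brackets:

pip install "lighteval[tgi,s3]"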

3. First Steps: Running Your First Evaluation

Lighteval provides multiple commands for different backends. Let’s walk through a basic evaluation using the accelerate backend.

Evaluating GPT-2 on TruthfulQA

Run this command to evaluate GPT-2:

lighteval accelerate \
    "pretrained=gpt2" \
    "leaderboard|truthfulqa:mc|0|0"

Breaking it down:

  • pretrained=gpt2: the model-argument string; pretrained names the Hugging Face model to evaluate.
  • leaderboard|truthfulqa:mc|0|0: the task specification, in the form suite|task|num_fewshot|truncation_flag: here, the truthfulqa:mc task from the leaderboard suite, run zero-shot with no automatic truncation of few-shot examples (a five-shot variant is shown just below).
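
For comparison, here is the same evaluation run with five few-shot examples, allowing Lighteval to truncate them automatically if the prompt would otherwise exceed the model's context window; only the task string changes:

lighteval accelerate \
    "pretrained=gpt2" \
    "leaderboard|truthfulqa:mc|5|1"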

Using Multiple GPUs

To spread the evaluation across multiple GPUs with data parallelism, first configure accelerate:

accelerate config

Then launch the evaluation:

accelerate launch --multi_gpu --num_processes=8 -m \
    lighteval accelerate \
    "pretrained=gpt2" \
    "leaderboard|truthfulqa:mc|0|0"

4. Advanced Features

Pipeline Parallelism

For models too large to fit on a single GPU, split the model across devices with pipeline parallelism:

lighteval accelerate \
    "pretrained=gpt2,model_parallel=True" \
    "leaderboard|truthfulqa:mc|0|0"

Customizing Model Arguments

Lighteval lets you adjust generation and tokenization behavior through additional model arguments:

  • max_gen_toks: Maximum tokens to generate.
  • add_special_tokens: Add special tokens to input sequences.

Example:

lighteval accelerate \
    "pretrained=gpt2,max_gen_toks=128,add_special_tokens=True" \
    "leaderboard|truthfulqa:mc|0|0"

5. Saving and Sharing Results

Save Results Locally

Results are saved in the directory specified by --output-dir:

--output-dir ./results
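
In context, the flag is simply appended to the evaluation command; the directory below is just an example location:

lighteval accelerate \
    "pretrained=gpt2" \
    "leaderboard|truthfulqa:mc|0|0" \
    --output-dir ./results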

Within the output directory, files are saved as:

  • Results: results/model_name/results_timestamp.json
  • Details: details/model_name/timestamp/details_task_timestamp.parquet
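
The results file is plain JSON, so the aggregated scores can be inspected with nothing more than the standard library. The paths below are illustrative; point the pattern at your own output directory and model name:

import glob
import json

# Illustrative pattern: adjust to your own --output-dir and model name
paths = sorted(glob.glob("./results/results/gpt2/results_*.json"))

# Pick one of the result files (here, the last one in sorted order)
with open(paths[-1]) as f:
    results = json.load(f)

print(list(results.keys()))  # see which top-level sections your version writes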

Push Results to Hugging Face Hub

Use the --push-to-hub option to upload results:

--push-to-hub --results-org my_org
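
Putting it together, a run that both saves locally and pushes to the Hub might look like this; my_org stands in for an organization or username you can write to:

lighteval accelerate \
    "pretrained=gpt2" \
    "leaderboard|truthfulqa:mc|0|0" \
    --output-dir ./results \
    --push-to-hub \
    --results-org my_org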

Export to TensorBoard

Visualize results in TensorBoard:

--push-to-tensorboard
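
As with the other saving options, this is an extra flag on the evaluation command; where the logs end up (locally under the output directory or on the Hugging Face Hub) depends on your Lighteval version and configuration:

lighteval accelerate \
    "pretrained=gpt2" \
    "leaderboard|truthfulqa:mc|0|0" \
    --output-dir ./results \
    --push-to-tensorboard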

6. Debugging and Analysis

Load the details file from an earlier run to inspect individual samples:

from datasets import load_dataset

# Path to a details file produced by an earlier run; adjust to your own output directory
details_path = "./results/details/gpt2/latest/details_truthfulqa.parquet"
details = load_dataset("parquet", data_files=details_path, split="train")

# Each record holds the per-sample inputs, model outputs, and metric values
for detail in details:
    print(detail)
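
Continuing from the snippet above, the same Dataset converts cleanly to a pandas DataFrame for filtering and aggregation. Column names vary by task and Lighteval version, so inspect them before relying on any particular field:

# Convert the loaded details to pandas for easier filtering and aggregation
df = details.to_pandas()

print(df.columns)  # which fields this task's details file contains
print(df.head())   # the first few evaluated samples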

7. Conclusion

With Lighteval, you can evaluate LLMs effortlessly, whether you’re comparing models, debugging tasks, or sharing results. Its versatility, from backend compatibility to detailed result tracking, makes it an indispensable tool in the LLM ecosystem.

For more information, visit the official Lighteval documentation.

Cohorte Team

February 19, 2025