What are evaluations?

If you are building applications with LLMs, it is important to evaluate how well your models perform. Evaluations help you understand model quality and identify areas for improvement. For example, if you are using GPT-3.5-Turbo to build a chatbot and want to switch to a different model, you can run evaluations to compare the two models and choose the one that works better for your application.

How to run evaluations

Langtrace relies on Inspect AI, an open-source framework for large language model evaluations created by the UK AI Safety Institute. You can run evaluations on annotated datasets as well as your own datasets.

Inspect provides many built-in components, including facilities for prompt engineering, tool usage, multi-turn dialog, and model graded evaluations. Extensions to Inspect (e.g. to support new elicitation and scoring techniques) can be provided by other Python packages.

To run evaluations, you need a dataset that has been annotated or curated with Langtrace. To annotate a dataset, follow the steps in the Annotate & Measure guide. Once you have an annotated dataset, follow the steps below to run evaluations.

  1. Set up a project on Langtrace, trace your LLM completions, and create a dataset from the traced completions. If you have not instrumented your application yet, see the tracing sketch below.
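
The following is a minimal sketch, assuming the langtrace-python-sdk package with its langtrace.init() entry point and the standard OpenAI client; the model name and prompt are placeholders, so check the current Langtrace SDK documentation for the exact setup.

import os

# Initialize Langtrace before importing the LLM client so that completions are traced
from langtrace_python_sdk import langtrace  # assumed import path for the Langtrace Python SDK
langtrace.init(api_key=os.environ["LANGTRACE_API_KEY"])

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)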

  2. Create an independent Python project and install Inspect AI:
pip install inspect-ai
  3. Add the LANGTRACE_API_KEY to your environment variables as shown below. Note: If you don’t have an API key, you can generate one by following the steps in the Generate API Key guide.
export LANGTRACE_API_KEY=<your-api-key>

If you are self-hosting, set the LANGTRACE_API_HOST environment variable to the URL of your Langtrace instance.

export LANGTRACE_API_HOST=<your-langtrace-instance-url>
  4. Copy the dataset ID from Langtrace and replace <datasetId> in the script below

  5. Write a simple evaluation script and save it in a file called example_eval.py

In the following example, we will run the dataset we created against different models and evaluate their performance using Inspect AI and Langtrace.

The script below defines a plan with a generate() step, which runs each sample in the dataset against the model being evaluated, and a self_critique() step, which uses gpt-4o to critique the output so the model can refine its answer.

The model_graded_fact() scorer then compares the model’s final output with the ground truth in the dataset and assigns a score based on how well the model performed.

from inspect_ai import Task, task
from inspect_ai.dataset import csv_dataset
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import self_critique, generate

@task
def example_eval():
    return Task(
        # Load the annotated dataset directly from Langtrace
        dataset=csv_dataset("langtracefs://<datasetId>"),
        plan=[
            generate(),                            # run each sample against the evaluated model
            self_critique(model="openai/gpt-4o")   # critique and refine the output using gpt-4o
        ],
        scorer=model_graded_fact()                 # grade the final output against the ground truth
    )
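
If you want to sanity-check the task on a handful of samples before a full run, Inspect’s --limit option restricts how many samples are evaluated; the model below is only an example:
inspect eval example_eval.py --model openai/gpt-4o-mini --limit 5 --log-dir langtracefs://<datasetId>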

  6. Run the evaluation script

Let’s run this against a few different models using the commands below:

inspect eval example_eval.py --model openai/gpt-3.5-turbo --log-dir langtracefs://<datasetId>
inspect eval example_eval.py --model openai/gpt-4o-mini --log-dir langtracefs://<datasetId>
inspect eval example_eval.py --model anthropic/claude-3-5-sonnet-20240620 --log-dir langtracefs://<datasetId>
inspect eval example_eval.py --model ollama/llama3.1 --log-dir langtracefs://<datasetId>
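
Equivalently, a simple shell loop (plain shell scripting, not an Inspect feature) runs the same evaluation across all four models:
for model in openai/gpt-3.5-turbo openai/gpt-4o-mini anthropic/claude-3-5-sonnet-20240620 ollama/llama3.1; do
  inspect eval example_eval.py --model "$model" --log-dir langtracefs://<datasetId>
done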

Note: To run the commands above, you will need to set API keys for the model providers you are running the generations against. Additionally, set the INSPECT_LOG_FORMAT=json environment variable before running the inspect command so that outputs are generated in JSON format and properly uploaded to Langtrace for reporting.
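
For example, OPENAI_API_KEY and ANTHROPIC_API_KEY are the standard environment variables read by those providers’ SDKs; set whichever ones match the models you run:
export OPENAI_API_KEY=<your-openai-api-key>
export ANTHROPIC_API_KEY=<your-anthropic-api-key>
export INSPECT_LOG_FORMAT=json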

  7. Once the evaluations are complete, you can view the results in the Langtrace dashboard by going to the evaluations page inside the dataset you ran the evaluations on.

The radar graph lets you compare the performance of the different models and see which one performed best.

  8. Additionally, you can configure the --log-dir as an environment variable as shown below:
export INSPECT_LOG_DIR=langtracefs://<datasetId>
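
With INSPECT_LOG_DIR set, Inspect uses it as the default log location, so you can drop the --log-dir flag from the commands above, for example:
inspect eval example_eval.py --model openai/gpt-4o-mini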

Additional options for configuring your environment can be found in the Inspect AI documentation.

You can read more about Inspect AI and its capabilities in the Inspect AI documentation.