Langtrace allows you to run evaluations on annotated datasets and get insights into the performance of your application.
Replace <datasetId> in the script below (example_eval.py) with the ID of your annotated dataset. The script defines a plan that includes a generate() step, which runs each sample in the dataset against the specified model, and a self_critique() step, which evaluates the performance of the model using the model_graded_fact() scorer.
The self_critique() step compares the model's output with the ground truth and assigns a score based on how well the model performed. It uses gpt-4o as the judge model.
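The full example_eval.py is provided with the docs; the following is only a minimal sketch of what such a task can look like with Inspect AI. The csv_dataset loader and the langtracefs://<datasetId> reference are assumptions here and may differ from the actual script.

```python
from inspect_ai import Task, task
from inspect_ai.dataset import csv_dataset
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import generate, self_critique


@task
def example_eval():
    return Task(
        # Assumption: the dataset reference below is illustrative; replace
        # <datasetId> with the ID of your annotated Langtrace dataset.
        dataset=csv_dataset("langtracefs://<datasetId>"),
        # generate() runs each sample against the model under evaluation;
        # self_critique() has gpt-4o critique the output against the target.
        plan=[generate(), self_critique(model="openai/gpt-4o")],
        # model_graded_fact() grades factual agreement with the ground truth.
        scorer=model_graded_fact(),
    )
```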
Set the INSPECT_LOG_FORMAT=json environment variable before running the inspect command to ensure outputs are generated in JSON format and properly uploaded to Langtrace for reporting.
You can also pass --log-dir to the inspect command, or set the log directory as an environment variable, as shown below:
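For example (a sketch only; the model name and log directory are placeholders, and the INSPECT_LOG_DIR variable is assumed to mirror the --log-dir flag):

```bash
# Write logs as JSON so they can be uploaded to Langtrace for reporting.
export INSPECT_LOG_FORMAT=json

# Assumption: INSPECT_LOG_DIR mirrors the --log-dir flag; alternatively,
# pass --log-dir ./eval-logs on the command line instead.
export INSPECT_LOG_DIR=./eval-logs

# Run the eval against the model under test (placeholder model shown).
inspect eval example_eval.py --model openai/gpt-4o-mini
```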