Evals (Testing LLM Output)

chatlas + inspect-ai

Inspect AI

  • Framework for evaluating and monitoring LLMs
  • An eval is defined by creating a task with three parts

Create a task

  1. dataset: test-case inputs and target responses
  2. solver: generates a response for each input (here, a chatlas chat instance wrapped as an Inspect AI solver)
  3. scorer: grades each response against its target

Task: dataset

input,target
What is 2 + 2?,4
What is 10 * 5?,50
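
Inspect AI loads this CSV directly; each row becomes a sample with input and target fields. A minimal sketch for checking the loaded data, assuming the file above is saved at the path used later in this lecture:

from inspect_ai.dataset import csv_dataset

# Each CSV row becomes a Sample with .input and .target
dataset = csv_dataset("code/lecture06/evals/my_eval_dataset.csv")
print(dataset[0].input)   # What is 2 + 2?
print(dataset[0].target)  # 4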

Task: solver + scorer

from chatlas import ChatOpenAI
from inspect_ai import Task, task
from inspect_ai.dataset import csv_dataset
from inspect_ai.scorer import model_graded_qa

# The chat instance whose responses we want to evaluate
chat = ChatOpenAI()


@task
def my_eval():
    return Task(
        # dataset: test-case inputs and target responses
        dataset=csv_dataset("code/lecture06/evals/my_eval_dataset.csv"),
        # solver: the chatlas chat, wrapped as an Inspect AI solver
        solver=chat.to_solver(),
        # scorer: an LLM judge grades each response against its target
        scorer=model_graded_qa(model="openai/gpt-4o-mini"),
    )
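
Tasks can also be run from Python instead of the CLI. A minimal sketch using Inspect AI's eval() function; depending on your setup you may also need to pass model= (or set INSPECT_EVAL_MODEL), since the solver and scorer here bring their own models:

from inspect_ai import eval

# Runs the task in-process and writes the same logs as the CLI
logs = eval(my_eval())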

Get eval results

inspect eval code/lecture06/evals/evals.py
inspect view
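
The first command runs the task and writes an eval log; the second opens a local, browser-based viewer for exploring the logged results, including per-sample transcripts and scores.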