Evals (Testing LLM Output)

chatlas + inspect-ai

Inspect AI

  • Framework for evaluating and monitoring LLMs
  • An eval is defined by creating a task with three parts

Create a task

  1. dataset: test-case inputs and target responses
  2. solver: generates a response for each input (here, a chatlas chat instance wrapped as an Inspect AI solver)
  3. scorer: grades each response against its target

Task: dataset

input,target
What is 2 + 2?,4
What is 10 * 5?,50
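
Inspect AI loads this CSV directly; each row becomes a sample with input and target fields. A minimal sketch for checking the loaded data, assuming the file above is saved at the path used later in this lecture:

from inspect_ai.dataset import csv_dataset

# Each CSV row becomes a Sample with .input and .target
dataset = csv_dataset("code/lecture06/evals/my_eval_dataset.csv")
print(dataset[0].input)   # What is 2 + 2?
print(dataset[0].target)  # 4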

Task: solver + scorer

from chatlas import ChatOpenAI
from inspect_ai import Task, task
from inspect_ai.dataset import csv_dataset
from inspect_ai.scorer import model_graded_qa

# The chat instance whose responses we want to evaluate
chat = ChatOpenAI()


@task
def my_eval():
    return Task(
        # dataset: test-case inputs and target responses
        dataset=csv_dataset("code/lecture06/evals/my_eval_dataset.csv"),
        # solver: the chatlas chat, wrapped as an Inspect AI solver
        solver=chat.to_solver(),
        # scorer: an LLM judge grades each response against its target
        scorer=model_graded_qa(model="openai/gpt-4o-mini"),
    )
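
Tasks can also be run from Python instead of the CLI. A minimal sketch using Inspect AI's eval() function; depending on your setup you may also need to pass model= (or set INSPECT_EVAL_MODEL), since the solver and scorer here bring their own models:

from inspect_ai import eval

# Runs the task in-process and writes the same logs as the CLI
logs = eval(my_eval())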

Get eval results

inspect eval code/lecture06/evals/evals.py
inspect view
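
The first command runs the task and writes an eval log; the second opens a local, browser-based viewer for exploring the logged results, including per-sample transcripts and scores.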