🤖 LLM Expect Instructions for LLMs¶
This file contains instructions for AI assistants (Cursor, Copilot, Windsurf, etc.) to correctly implement and use the LLM Expect SDK.
🚀 Core Identity¶
LLM Expect is a minimalist Python SDK for evaluating LLM functions using a decorator-based approach and JSONL datasets. It is designed to be as simple as pytest.
📦 Installation¶
🔑 Key Patterns¶
1. The Decorator Pattern¶
LLM Expect works by decorating a standalone function.
from llm_expect import llm_expect
@llm_expect(dataset="tests.jsonl")
def my_llm_function(prompt: str) -> str:
# Call LLM here
return "response"
2. Class Methods¶
LLM Expect supports decorating instance methods directly. The decorator handles self binding automatically.
class MyClass:
def __init__(self):
self.client = OpenAI()
@llm_expect(dataset="tests.jsonl")
def generate(self, prompt: str):
# 'self' is available here!
return self.client.generate(prompt)
# Usage
generator = MyClass()
result = generator.generate("Hello")
eval_result = generator.generate.run_eval()
📁 Dataset Construction (JSONL)¶
Datasets must be in JSONL (JSON Lines) format. Each line is a single test case.
Basic Structure¶
Input Formats¶
1. Single Argument (String)¶
If your function takes one argument (e.g., def func(prompt)):
2. Multiple Arguments (Dictionary)¶
If your function takes multiple arguments (e.g., def func(context, question)):
Expected Output Formats (Metrics)¶
You can combine multiple expectations in a single test case.
1. Reference (Exact Match)¶
Checks if output matches exactly (ignoring whitespace).
2. Contains (Keywords)¶
Checks if output contains ALL keywords (case-insensitive).
3. Regex (Pattern Matching)¶
Checks if output matches a regex pattern.
4. Schema (JSON Validation)¶
Validates that output is valid JSON and matches the schema.
"expected": {
"schema": {
"type": "object",
"required": ["name", "age"],
"properties": {
"name": {"type": "string"},
"age": {"type": "integer"}
}
}
}
5. Safety (Refusal Check)¶
Checks if the model refused a harmful prompt.
6. Custom Judge (LLM Evaluation)¶
Uses an LLM to grade the response based on a prompt.
⚙️ Configuration & Best Practices¶
Decorator Arguments¶
| Argument | Type | Description |
|---|---|---|
dataset |
str |
Path to JSONL file (Required) |
tests |
list[str] |
Metrics: ["accuracy", "schema_fidelity", "safety", "custom_judge"] |
judge_provider |
str |
"openai", "anthropic", etc. (Required for judge metrics) |
thresholds |
dict |
{"accuracy": 0.9} |
Environment Variables¶
OPENAI_API_KEY(for OpenAI judge)ANTHROPIC_API_KEY(for Anthropic judge)LLM_EXPECT_TESTS(default metrics)LLM_EXPECT_THRESHOLD(global threshold)
🧠 Common Pitfalls to Avoid¶
- Do not mock the LLM inside the decorated function. LLM Expect is for integration testing with real LLMs.
- Do not use
pytestdecorators on the same function. LLM Expect is its own test runner. - JSONL paths. Ensure the dataset path is relative to where the script is run, or use absolute paths.
- JSONL paths. Ensure the dataset path is relative to where the script is run, or use absolute paths.
📝 Complete Example Implementation¶
Here is a robust example showing how to implement LLM Expect with multiple metrics.
from llm_expect import llm_expect
import os
from openai import OpenAI
# 1. Define Logic (Class-based)
class StoryGenerator:
def __init__(self):
self.client = OpenAI()
@llm_expect(
dataset="stories.jsonl",
tests=["custom_judge", "safety"],
judge_provider="openai",
judge_model="gpt-4",
thresholds={"custom_judge": 0.8, "safety": 1.0}
)
def generate_story(self, topic: str, length: str) -> str:
response = self.client.chat.completions.create(
model="gpt-4",
messages=[
{"role": "system", "content": f"Write a {length} story."},
{"role": "user", "content": topic}
]
)
return response.choices[0].message.content
# 2. Run Evaluation
if __name__ == "__main__":
print("Running evaluation...")
generator = StoryGenerator()
results = generator.generate_story.run_eval()
if results['passed']:
print("✅ All tests passed!")
else:
print(f"❌ Failed. Success rate: {results['summary']['success_rate']:.1%}")
print(f"See details in: {results['run_dir']}")