LLM Expect¶
LLM Expect provides built-in support for using LLMs to evaluate other LLMs.
OpenAI¶
Use OpenAI models (GPT-4, GPT-3.5, etc.) as judges.
Configuration¶
from llm_expect import llm_expect

@llm_expect(
    dataset="tests.jsonl",
    tests=["custom_judge"],
    judge_provider="openai",
    judge_model="gpt-4"
)
def generate(prompt: str) -> str:
    # Your function
    pass
Environment Variables¶
Supported Models¶
- gpt-4
- gpt-4-turbo
- gpt-5.1
- gpt-3.5-turbo
Anthropic¶
Use Claude models as judges.
Configuration¶
from llm_expect import llm_expect

@llm_expect(
    dataset="tests.jsonl",
    tests=["custom_judge"],
    judge_provider="anthropic",
    judge_model="claude-3-opus-20240229"
)
def generate(prompt: str) -> str:
    # Your function
    pass
Environment Variables¶
Supported Models¶
- claude-3-opus-20240229
- claude-3-sonnet-20240229
- claude-3-haiku-20240307
AWS Bedrock¶
Use Bedrock-hosted models as judges.
Configuration¶
from llm_expect import llm_expect

@llm_expect(
    dataset="tests.jsonl",
    tests=["custom_judge"],
    judge_provider="bedrock",
    judge_model="anthropic.claude-3-sonnet-20240229-v1:0"
)
def generate(prompt: str) -> str:
    # Your function
    pass
Environment Variables¶
export AWS_ACCESS_KEY_ID=your-key
export AWS_SECRET_ACCESS_KEY=your-secret
export AWS_REGION=us-east-1
export LLM_EXPECT_JUDGE_MODEL=anthropic.claude-3-sonnet-20240229-v1:0
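A missing variable typically only surfaces once the first judge call fails, so it can help to check the environment before starting a run. The sketch below is not part of llm_expect: `missing_env_vars` is a hypothetical helper, and it assumes that `LLM_EXPECT_JUDGE_MODEL` is optional because the model can also be passed as `judge_model`. If your AWS credentials come from another source, such as a shared profile, this check does not apply.

```python
import os

# Variable names taken from the Bedrock section above.
# LLM_EXPECT_JUDGE_MODEL is left out on the assumption that it is optional.
REQUIRED_BEDROCK_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION",
)


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(REQUIRED_BEDROCK_VARS)
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All Bedrock variables are set.")
```

Passing a plain dictionary as `env` makes the helper easy to exercise in a unit test without touching the real environment.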
Judge Configuration Options¶
Model Selection¶
API Key¶
from llm_expect.models import JudgeConfig

judge_config = JudgeConfig(
    provider="openai",
    model="gpt-4",
    api_key="your-key"  # Or use environment variable
)
Custom Base URL¶
from llm_expect.models import JudgeConfig

judge_config = JudgeConfig(
    provider="openai",
    model="gpt-4",
    base_url="https://custom-api.example.com/v1"
)
Timeout and Retries¶
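Independently of whatever options the library exposes here, the retry pattern itself can be sketched in a few lines. This is a generic illustration and not llm_expect's API: `call_with_retries` and its parameters are hypothetical names, and the sketch covers retries with exponential backoff only. A timeout would normally be set on the underlying HTTP client.

```python
import time


def call_with_retries(fn, *, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); if it raises, retry up to `retries` more times.

    The wait before retry number k (starting at 0) is base_delay * 2**k seconds.
    The last exception is re-raised once all attempts are used up.
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise
            sleep(base_delay * (2 ** attempt))
```

The `sleep` parameter exists so that tests can record the delays instead of actually waiting.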
Temperature¶
from llm_expect.models import JudgeConfig

judge_config = JudgeConfig(
    provider="openai",
    model="gpt-4",
    temperature=0.0  # 0.0 for the most consistent output
)
Judge Evaluation Types¶
Custom Judge¶
Evaluate with custom criteria:
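The general shape of a criteria-based judge can be sketched without the library. The example below is not llm_expect's implementation: `build_judge_prompt` and `parse_verdict` are hypothetical helpers, and the PASS/FAIL reply format is an assumption chosen to keep parsing simple.

```python
def build_judge_prompt(criteria, prompt, output):
    """Assemble a judge prompt that asks for a PASS/FAIL verdict."""
    return (
        "You are evaluating a model response.\n"
        f"Criteria: {criteria}\n"
        f"Prompt: {prompt}\n"
        f"Response: {output}\n"
        "Reply with PASS or FAIL on the first line, then a one-sentence reason."
    )


def parse_verdict(reply):
    """Return True for PASS, False for FAIL; raise ValueError otherwise."""
    lines = reply.strip().splitlines()
    first = lines[0].strip().upper() if lines else ""
    if first.startswith("PASS"):
        return True
    if first.startswith("FAIL"):
        return False
    raise ValueError(f"Unrecognized verdict: {reply!r}")
```

Raising on an unrecognized reply, rather than defaulting to a failure, keeps a malformed judge response from being counted as a real test result.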
Instruction Adherence¶
Evaluate instruction following:
Safety Evaluation¶
Evaluate safety with an LLM judge:
Best Practices¶
- Use GPT-4 for complex evaluations: it is more reliable for nuanced judgments.
- Set temperature to 0.0: this keeps evaluations as consistent and repeatable as the provider allows.
- Cache results: enable caching to avoid redundant API calls.
- Monitor costs: every judge evaluation makes additional API calls.
- Test judge prompts: validate that the judge criteria match your needs before relying on the results.
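The caching and cost-monitoring points can be illustrated with a small wrapper. This is a library-independent sketch and not llm_expect's built-in caching: `CachedJudge` is a hypothetical class, and it assumes the judge is a callable that takes `(model, prompt, output)` and returns a verdict. Reusing a cached verdict is only reasonable when the judge runs at temperature 0.0.

```python
import hashlib
import json


class CachedJudge:
    """Wrap a judge callable so identical requests are only sent once."""

    def __init__(self, judge_fn):
        self._judge_fn = judge_fn  # callable(model, prompt, output) -> verdict
        self._cache = {}
        self.api_calls = 0  # number of calls that actually reached judge_fn

    def _key(self, model, prompt, output):
        # Hash the full request so the cache key stays small and unambiguous.
        payload = json.dumps([model, prompt, output], ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def __call__(self, model, prompt, output):
        key = self._key(model, prompt, output)
        if key not in self._cache:
            self.api_calls += 1
            self._cache[key] = self._judge_fn(model, prompt, output)
        return self._cache[key]


def stub_judge(model, prompt, output):
    """Stand-in for a real judge call: passes any non-empty output."""
    return len(output) > 0


judge = CachedJudge(stub_judge)
judge("gpt-4", "Say hi", "Hello!")
judge("gpt-4", "Say hi", "Hello!")  # served from the cache
print(judge.api_calls)
```

Comparing `api_calls` against the number of test cases gives a rough view of how many paid judge requests a run actually made.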