Product & Technology

The Complete Guide to LLM Evaluations

Part I - The Foundations & Reliability Checks

Understanding why traditional testing fails for AI, the two main categories of evals, how to use public benchmarks (MMLU, HumanEval, GSM8K, Chatbot Arena) for model selection, and deterministic evals (JSON validation, regex matching, code execution) for reliability checks.

18 Dec 2025 8 min read

About This Series

This guide is divided into three parts to take you from concept to production-grade implementation.

1. Introduction: Why can't we test AI like normal code?

Imagine you're a SaaS company with an AI-powered onboarding assistant. A new customer asks: "What's your monthly billing cycle?" Your agent responds: "We have a free trial of 2 weeks."

Or Worse: You're an insurance company with a claims assistant. A policyholder asks: "What's my coverage for accident claims?" Your agent responds: "IRDAI guidelines mandate all claims be settled within 7 days," when the actual regulation is more nuanced and claim-type specific. This hallucination could trigger customer complaints, escalations to the regulator, and legal liability.

Imagine you built an AI Agent to engage customers. A customer asks: "How do I return my order?" Your chatbot responds: "The capital of France is Paris."

This would be a disaster. But how do you catch it before it reaches your customers?

In traditional software, we test with simple rules: assert 2 + 2 == 4. Either it's true or false. Simple.

But with conversational AI, there can be multiple right answers. A customer might ask "How does your pricing scale?" and the correct answer could be:

  • "Pricing scales from $99/month for Starter to $999/month for Enterprise..."
  • "Our Starter plan is $99/month, Professional is $299/month..."
  • "Starting at $99 monthly for individual users, scaling up for teams..."

All three are correct, even though the wording differs. But in regulated industries like insurance or financial SaaS, compliance answers cannot vary; they must be exactly correct.

This is where Evals (evaluations) come in. They are the testing system for AI applications. This guide explains every type of eval available, what each one does, and when you'd actually use it.

Who This Is For: Product managers, engineers, business stakeholders, and anyone building or using LLM-powered systems who wants to understand how AI quality is measured.

2. The Two Main Categories of Evals

Before diving into specific evals, understand that there are two fundamentally different purposes for testing LLMs.

  • Model Selection Evals
  • System Quality Evals

Quick Decision Guide: Which Category Do You Need?

Are you choosing between foundation models?

✓ YES → Model Selection Evals

Use public benchmarks (MMLU, Chatbot Arena, etc.)

Run: Once during selection, then every 3-6 months

✗ NO → System Quality Evals

Use custom tests for your specific application

Run: Continuously, every time you change something

Two Main Categories of Evals

🎯 Category A: Model Selection

Purpose: Compare foundation models.

Question: "Which model is smarter for my use case?"

Examples: MMLU, HumanEval, Chatbot Arena

When to run: Once during model selection, then every 3–6 months.

⚙️ Category B: System Quality

Purpose: Test your specific application.

Question: "Does my system work correctly?"

Examples: Deterministic checks, the RAG triad, LLM-as-a-judge

When to run: Continuously (prompts, data, code, and production changes).

2.1. Category A: "Model Selection" Evals

These are public, standardized benchmarks that test a raw, out-of-the-box LLM on generic knowledge and reasoning.

  • Example: "Does GPT-4 know more than Claude?"
  • Use Case: Choosing which foundation model to buy for your application.
  • Choosing the right model(s): Standardized benchmarks compare leading models across various parameters and use cases. Pick a model that performs in the top 3 for your use case.
  • Timeline: Run once during initial model selection. Re-evaluate every 3-6 months.

2.2. Category B: "System Quality" Evals

These are custom, internal tests that check if your specific application works correctly with your data, policies, and compliance requirements.

  • Example: "Does our insurance claims AI Agent answer claim coverage questions accurately without hallucinating regulations?"
  • Use Case: Every day. Before pushing new code. After changing prompts. While monitoring production. Especially critical in regulated industries.
  • Timeline: Run continuously, every time you change something.

Deep Dive: Public Benchmarks (Model Selection)

These benchmarks answer the question: "Which model is smarter?"

Benchmark Cheat Sheet: Public Model Selection Evals

Use this to pick the right benchmark for what you want to measure.

MMLU

  • Measures: broad knowledge + reasoning
  • Best for: model selection (general capability)
  • Signal: higher is usually safer for complex reasoning

HumanEval / SWE-bench

  • Measures: code/SQL generation quality
  • Best for: assistants that output code/SQL
  • Signal: pass rate on real coding tasks

GSM8K

  • Measures: multi-step math / arithmetic
  • Best for: pricing, premiums, calculations
  • Signal: fewer numeric mistakes

Chatbot Arena

  • Measures: human preference in conversation
  • Best for: chat UX, helpfulness, tone, empathy
  • Signal: win-rate / ranking vs peers

Rule of thumb: use benchmarks to choose a base model, then optimize using your system evals.

1. Benchmark 1: MMLU (Massive Multitask Language Understanding)

The Analogy: When hiring a consultant, you want to know their general knowledge across many domains. MMLU tests the same thing for a model.

1.1. What It Tests

  • Business and finance knowledge
  • Legal and regulatory reasoning
  • Medicine and healthcare
  • Technology and engineering
  • Ethics and philosophy

How It Works: The test gives the model multiple-choice questions across 57 different subjects and checks if it picks the right answer. The main metric is accuracy (% correct) averaged across all subjects.

1.2. How Models Are Evaluated

Models are typically evaluated zero-shot or few-shot: they see the question (and sometimes a few examples) and must pick A/B/C/D.

Standard evaluation suites such as HELM fix the prompt format and decoding settings so different models are comparable.

1.3. Insurance Example

Question: What does "deductible" mean in insurance?

A) The amount paid by the insured before insurance covers the rest

B) The maximum amount the insurer will pay

C) The quarterly payment amount

D) The tax deduction for insurance payments

The Score: A percentage (0-100%). GPT-4 scores around 90%. Claude scores around 89%. A human expert scores around 90%.
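
Mechanically, grading is just comparing the letter the model picked against the answer key and averaging over all questions. Here is a toy sketch of that scoring loop (not the official MMLU harness; the data layout is simplified for illustration):

# Toy sketch of multiple-choice accuracy scoring (not the official MMLU harness).
questions = [
    {"id": "insurance-deductible", "answer": "A"},  # the example question above
    # ... thousands more questions across 57 subjects
]
model_answers = {"insurance-deductible": "A"}  # letter the model picked per question

correct = sum(1 for q in questions if model_answers.get(q["id"]) == q["answer"])
accuracy = correct / len(questions)
print(f"MMLU-style accuracy: {accuracy:.0%}")  # 100% on this one-question toy set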

1.4. Why It Matters

Higher MMLU score loosely correlates with "general intelligence." If you're building a system that needs to reason about complex topics, a model with a higher MMLU score is probably a safer bet.

  • SaaS: If you're building an AI assistant for SaaS product decisions (pricing recommendations, feature selection, onboarding guidance), a model with high MMLU is better at reasoning through complex scenarios.
  • Insurance: If your AI Agent needs to explain coverage, policies, or regulations, MMLU indirectly measures the model's ability to understand nuanced concepts. Higher scores generally mean fewer hallucinations.

A model that is weak on MMLU is a risk if you want it to handle complex, multi-step decisions.

1.5. Strengths and Limitations

Strengths

  • Broad coverage: 57 subjects → good "overall intelligence" snapshot.
  • Simple to interpret: 80% vs 60% is clearly meaningful.

Limitations

  • Not your domain: It doesn't know your exact SaaS pricing rules or insurance riders.
  • Multiple-choice only: Tests recognition, not open-ended explanation quality.
  • Contamination risk: Benchmarks are public; some questions may be memorized by frontier models.

1.6. How to Use It

  • MMLU tests generic knowledge. Your system probably cares about specific knowledge (your company's policies, products, etc.).
  • Use MMLU to separate tiers: e.g., discard models <60%; shortlist models >80%.
  • Do not agonize over tiny gaps (92% vs 91%); those won't decide whether your billing chatbot works.

1.7. Key Links

2. Benchmark 2: HumanEval / SWE-bench (For Code)

This benchmark is only needed if your AI Agent will produce code or SQL as output. For conversational agents that engage customers, it is not required.

2.1. What It Tests

HumanEval

  • 164 hand-written Python coding tasks (e.g., implement a function from a docstring).
  • A hidden unit-test suite checks whether the generated function behaves correctly.

SWE-bench

  • Real GitHub issues tied to real repositories; the model must produce a patch that makes all tests pass.
  • Much closer to how an AI developer tool would behave in reality.

In short, this measures whether the model can write correct Python code and solve real software engineering problems.

2.2. How It's Evaluated

HumanEval

  • The model generates one or more candidate solutions.
  • The pass@k metric measures the probability that at least one of the k samples passes all tests (a small estimator sketch follows below).

SWE-bench

  • The patch is applied to the repo.
  • The project's full test suite is executed.
  • The model passes if the tests pass without regressions.
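
To make pass@k concrete, here is a minimal sketch of the commonly used unbiased estimator (n candidate solutions are generated, c of them pass the hidden tests); exact harness details vary by leaderboard.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimate: probability that at least one of k samples
    # drawn from n generated candidates (c of which pass the tests) is correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 candidates generated, 3 pass the hidden unit tests
print(round(pass_at_k(n=10, c=3, k=1), 3))  # ≈ 0.3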

2.3. Example

A developer-focused SaaS wants its AI code assistant to write small utility functions and database migration scripts.

Task: Write a Python function to calculate monthly recurring revenue (MRR).

MRR = Number of customers × Average Revenue Per Account (ARPA)

def calculate_mrr(customers, arpa):
    # MRR = number of customers × average revenue per account
    return customers * arpa

# Test cases:
assert calculate_mrr(100, 50) == 5000    # ✓ PASS
assert calculate_mrr(200, 100) == 20000  # ✓ PASS

The evaluator runs this code against test cases. If it crashes or returns wrong values, it fails.

2.4. Strengths and Limitations

Strengths

HumanEval

  • Simple, clean benchmark for basic code generation.
  • Good early filter for "can this model write usable functions at all?".

SWE-bench

  • Realistic: Evaluates navigation, reading, and editing large codebases.
  • Good proxy for "can I trust this model to modify production code?".

Limitations

HumanEval

  • Very small (164 items) and nearly saturated by top models.
  • Doesn't reflect real project context (no multi-file edits, no long repos).

SWE-bench

  • Harder and more expensive to run.
  • Mostly relevant for orgs where AI will touch live code or migrations.

2.5. How to Use Them

  • If your AI will not write code/SQL → you can ignore HumanEval/SWE-bench.
  • If it will, use these to:
    • Shortlist coding-strong models.
    • Then wrap them with deterministic tests and staging before letting them touch production.

2.6. Key Links

3. Benchmark 3: GSM8K (Grade School Math 8K)

3.1. What It Tests

Can the model solve multi-step math problems?

  • 8.5K grade-school style math word problems (multi-step arithmetic and reasoning).
  • The model must output the final numeric answer, usually graded by exact match.
  • These problems look simple, but they expose weaknesses in multi-step logic, unit handling, and careful arithmetic: the same skills you need for billing, pricing, and premiums.

3.2. How It's Evaluated

  • Often run with chain-of-thought prompting (model explains steps then produces the answer).
  • Leaderboards report accuracy with and without reasoning steps.
  • Some derivative benchmarks (e.g., GSM-Symbolic) push models harder on symbolic reasoning and robustness.
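
Here is a minimal sketch of the exact-match grading idea, assuming we simply take the last number in the model's response as its final answer (real harnesses parse the answer more carefully):

import re

def extract_final_number(text: str):
    # Simple heuristic: treat the last number in the response as the final answer.
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None

def exact_match(model_output: str, gold_answer: float) -> bool:
    predicted = extract_final_number(model_output)
    return predicted is not None and abs(predicted - gold_answer) < 1e-6

print(exact_match("She pays 2 × $21 = $42 in total", 42.0))  # True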

3.3. When You Need It

  • If your Agent needs to work with numbers to calculate pricing, premiums, loan amounts, etc.
  • If your SaaS calculates pricing, discounts, or ROI projections in AI features.
  • If your insurance platform calculates claim amounts, premiums, or coverage limits.

3.4. Example

3.4.1. SaaS Example

Question: A SaaS company offers a 20% discount to annual subscribers. If the monthly price is $99 and a customer signs up for annual billing, how much do they pay upfront?

Calculation:

Monthly: $99 × 12 = $1,188

Annual with 20% discount: $1,188 × 0.8 = $950.40

Correct Answer: $950.40
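
If you later want to sanity-check this kind of pricing arithmetic deterministically in your own evals, it is easy to encode; the figures below are just the example above.

monthly_price = 99.00
annual_list_price = monthly_price * 12        # $1,188.00
annual_discounted = annual_list_price * 0.80  # 20% annual-billing discount

assert round(annual_discounted, 2) == 950.40
print(f"Upfront annual charge: ${annual_discounted:,.2f}")  # $950.40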

3.4.2. Insurance Example

Question: A policyholder has a $500,000 life insurance policy. The annual premium is $2,500. They've paid premiums for 3 years (5% returns annually). How much total have they paid with interest impact?

Premiums with 5% annual growth: $2,500 × (1.05^3) …

This is more complex than it seems. Get it wrong = wrong customer bill.

3.5. Strengths and Limitations

Strengths

  • Very focused test of multi-step numerical reasoning.
  • Good "sanity check" before letting a model touch anything involving money.

Limitations

  • Near-saturation at the top; many frontier models now exceed 90–95%.
  • Single-number score; doesn't show where the model fails (edge cases, off-by-one, rounding, etc.).

3.6. How to Use It

  • If a model is weak on GSM8K, do not trust it for pricing, discounts, claim amounts, or premium calculations.
  • If two models both perform well, use your own domain-specific test cases (pricing tables, premium formulas, discount rules) to choose between them.

3.7. Key Links

4. Benchmark 4: Chatbot Arena (LMSYS)

Analogy: A live talent show where users vote which AI "feels" better.

4.1. What It Tests

Chatbot Arena measures perceived conversational quality:

  • Helpfulness and clarity
  • Coherence and following instructions
  • Overall user preference in side-by-side comparisons

4.2. When You Need It

If you're building a conversational AI (chatbot, assistant, tutor), Chatbot Arena is your best signal for model selection.

  • SaaS: If you're building a conversational product (onboarding assistant, sales chatbot, customer support), Arena rankings predict how customers will feel about your AI.
  • Insurance: Claims and coverage conversations are high-stakes, and customers need to trust the AI Agent. Chatbot Arena tells you which model sounds more credible and trustworthy when handling these sensitive questions.

4.3. How It Works

  • A user enters a prompt and sees two anonymized model responses.
  • They pick which response is better (A/B or "tie").
  • LMSYS converts many such votes into an Elo-style rating per model.
  • This yields a crowd-sourced ranking of conversational models.
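
For intuition, here is a toy Elo-style update from a single vote (the real leaderboard uses a more sophisticated statistical model, and the ratings and model names below are made up for illustration):

def expected_score(rating_a: float, rating_b: float) -> float:
    # Probability that A beats B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    # Update both ratings after one human vote (tie handling omitted for brevity).
    score_a = 1.0 if a_won else 0.0
    exp_a = expected_score(rating_a, rating_b)
    return rating_a + k * (score_a - exp_a), rating_b - k * (score_a - exp_a)

# Hypothetical vote: "model_a" (rated 1200) beats "model_b" (rated 1250)
print(elo_update(1200.0, 1250.0, a_won=True))  # model_a gains exactly what model_b loses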

4.4. SaaS Example

User: "I'm confused about when I'll be charged. Help me understand the billing?"

Model A:

"You'll be billed monthly on the same date each month. If you signed up on the 15th, you'll be charged on the 15th of every subsequent month. You can change your billing date anytime in Account Settings → Billing. Charges appear as 'TechCorp Inc' on your statement."

Model B:

"Billing happens monthly. Let me clarify:

1. Billing date: Same date you signed up (the 15th in your case)

2. Where to see it: Account Settings → Billing Dashboard shows your next billing date

3. Need to change it? You can reschedule anytime; there's no penalty

4. Your bank statement will show 'TechCorp Inc' as the merchant

If you need to pause your subscription temporarily, that's also available under Billing."

Human votes: Model B ✓ (More structured, answers follow-up questions, actionable)

4.5. Strengths and Limitations

Strengths

  • Captures subjective aspects (tone, empathy, structure) that MMLU/GSM8K miss.
  • Continuously updated as new models are added.
  • Reflects diverse, organic prompts from real users, not just benchmark questions.

Limitations

  • Audience skew: Early adopters and AI enthusiasts may not match your actual customer base.
  • Prompt mix: Arena prompts are generic; your use case (insurance claims, SaaS billing) might be more specific.
  • Biases: Verbose answers and certain styles tend to win more votes even when not more accurate.

4.6. How to Use It

Use Chatbot Arena to pick a shortlist of conversationally strong models.

Then evaluate those 2–3 models with your own:

  • Deterministic checks (JSON, safety, compliance).
  • RAG triad (context relevance, faithfulness, answer relevance) on your policies and product docs.

4.7. Key Links

Deep Dive: System Quality Evals

Now we move from "which model is smart" to "does my system work?"

What's Different?

  • In Category A, you're testing the model alone.
  • In Category B, you're testing the entire system, which includes:
    • Your custom prompts
    • Your data retrieval system
    • Your LLM
    • Your post-processing logic

The Three Types of System Evals

  • Deterministic Evals (The Simple, Fast Ones)
  • Context-Aware Evals (The RAG Triad)
  • LLM-as-a-Judge (The Flexible One)

Continue reading: In the next article, we'll dive deep into each of these three system eval types, starting with Deterministic Evals.

Eval Timeline: When to Run What

🚀 Initial Setup

Run Model Selection benchmarks (MMLU, Chatbot Arena)

One-time: Choose your foundation model

⚡ Every Code Change

Run Deterministic evals (JSON validation, regex checks)

Instant, free, catches format errors

🔄 Continuous Monitoring

Run System Quality evals (RAG triad, LLM-as-judge)

5% of production traffic, catch quality issues

1. Eval Type 1: Deterministic Evals (The Simple, Fast Ones)

Use rigid rules to check things that have only one correct answer or must follow strict format rules.

Deterministic Evals = Quick Quality Gates

JSON Validation

  • Required fields exist
  • Types match
  • Values in valid ranges

⚡ Speed: Instant

💰 Cost: Free

Regex / Keyword Checks

  • Contains required phrases
  • Avoids forbidden terms
  • Follows formatting rules

⚡ Speed: Instant

💰 Cost: Free

Code Execution

  • SQL/code runs without errors
  • Returns expected results
  • Passes test cases

⚡ Speed: Milliseconds

💰 Cost: Free

Analogy: Like a compliance checklist. Either your response includes the required legal disclaimer or it doesn't. No gray area.

These evals work much like traditional code test cases: there is one definitive correct answer, so the check is a simple pass/fail.

Cost of Deterministic Evals

  • Operational cost: Very low; the checks run directly in your system without any LLM calls
  • Engineering cost: Typically 1-2 hours of effort to write the test cases
  • Maintenance: Must be kept in sync with the underlying system

1.1. Example 1a: JSON Schema Validation

SaaS: Your onboarding assistant must output a JSON object with user setup progress:

{
  "account_status": "created",
  "payment_verified": true,
  "workspace_configured": false,
  "setup_percentage": 67,
  "next_step": "configure_workspace"
}

1.1.1. The Eval

Check that:

  • All required fields exist
  • Data types match (string, number, boolean)
  • Numbers are in valid ranges (e.g., setup_percentage between 0 and 100)
  • Status values are from approved list (only "created", "pending", "active" allowed)

1.1.2. How It Works

import json

response = model_output  # raw text returned by the model
try:
    data = json.loads(response)
    assert data["account_status"] in ["created", "pending", "active"]
    assert isinstance(data["payment_verified"], bool)
    assert isinstance(data["workspace_configured"], bool)
    assert isinstance(data["setup_percentage"], int)
    assert 0 <= data["setup_percentage"] <= 100
    assert isinstance(data["next_step"], str)
    print("✓ PASS")
except (json.JSONDecodeError, KeyError, AssertionError) as e:
    print(f"✗ FAIL: {e}")

1.1.3. Why It Matters

Integrating AI with existing systems: If your SaaS frontend expects JSON and receives malformed text, the dashboard breaks. If your insurance CRM receives invalid claim data, claim routing fails. This eval catches those bugs before they reach production systems.

Real Cost Impact (Insurance): An insurance company's claims system expected numerical claim amounts. One AI model was returning "claim_amount": "around fifty thousand". The CRM rejected it, creating a manual workaround that took hours per day.

1.2. Example 1b: Keyword / Regex Matching

1.2.1. The Scenario

  • SaaS: Your SaaS product must never suggest competitors by name in onboarding. You also must always include your pricing URL.
  • Insurance: Your insurance chatbot must never claim specific coverage without disclaimer. It must always include the policy document reference.

1.2.2. The Eval

Check that the response doesn't contain forbidden text (competitor names like "competitor_name1", "competitor_name2", etc.) and does include required strings (your pricing URL or the policy document reference). A small code sketch follows after the examples below.

1.2.3. Real Examples

  • Safety check: Response doesn't contain curse words
  • Compliance check: Response doesn't request personal information
  • Brand check: Response mentions your product name, not just "our product"
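
Here is a minimal sketch of such a check, using placeholder values for the forbidden terms and required strings (swap in your own competitor names, banned phrases, pricing URL, or policy reference):

import re

# Placeholder lists for illustration; replace with your own terms.
FORBIDDEN_PATTERNS = [r"\bcompetitor_name1\b", r"\bcompetitor_name2\b"]
REQUIRED_SUBSTRINGS = ["https://example.com/pricing"]

def keyword_eval(response: str) -> bool:
    for pattern in FORBIDDEN_PATTERNS:
        if re.search(pattern, response, flags=re.IGNORECASE):
            print(f"✗ FAIL: forbidden pattern matched: {pattern}")
            return False
    for required in REQUIRED_SUBSTRINGS:
        if required not in response:
            print(f"✗ FAIL: missing required text: {required}")
            return False
    print("✓ PASS")
    return True

keyword_eval("Our plans start at $99/month. Full details: https://example.com/pricing")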

1.3. Example 1c: Code Execution

1.3.1. The Scenario

Your agent writes SQL queries to fetch customer data.

1.3.2. The Eval

Actually run the SQL against a test database and check if it returns results without errors.

1.3.3. How It Works

sql_query = model_output                  # SQL generated by the model
test_db = connect_to_test_database()      # connection to a safe, non-production test database
try:
    results = test_db.execute(sql_query)
    if len(results) > 0:
        print("✓ PASS: Query executed and returned results")
    else:
        print("✗ FAIL: Query executed but returned no results")
except Exception as e:
    print(f"✗ FAIL: Query crashed - {e}")

1.4. Using Deterministic Evals

Use them for:

  • Format validation (is it JSON?)
  • Safety checks (no bad words?)
  • Compliance checks (doesn't mention competitors?)
  • Executable code (does the SQL run?)
  • Exact matching (does it include the product name?)

Don't use them for (subjective reasoning):

  • Checking if the answer is helpful
  • Checking if it's accurate
  • Checking if the tone is right
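
In practice, these deterministic checks usually live in your normal test suite so they run on every code or prompt change. A minimal pytest-style sketch, where call_agent is a hypothetical stand-in for your own agent entry point:

import json
import pytest

def call_agent(question: str) -> str:
    # Hypothetical stand-in for your real agent; replace with your own entry point.
    return json.dumps({"next_step": "configure_workspace"})

@pytest.mark.parametrize("question", [
    "How do I return my order?",
    "What's your monthly billing cycle?",
])
def test_agent_output_is_valid_json(question):
    response = call_agent(question)
    data = json.loads(response)      # fails the test if the output isn't valid JSON
    assert "next_step" in data       # required field from your schema

def test_no_competitor_mentions():
    response = call_agent("How do you compare to other tools?")
    assert "competitor_name1" not in response.lower()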

Coming Next: Part II - The Cognitive Engine

Now that you understand the foundations (model selection benchmarks and deterministic evals), Part II dives deep into the advanced evaluation techniques that power production AI systems.

🔍 Context-Aware Evals (The RAG Triad)

Context Relevance: Did we retrieve the right documents?

Faithfulness: Did the model stick to the docs (no hallucinations)?

Answer Relevance: Did we actually answer what the user asked?

⚖️ LLM-as-a-Judge

Learn how to use LLMs to evaluate subjective qualities like helpfulness, tone, and clarity. Master direct scoring, pairwise comparison, and chain-of-thought reasoning techniques.

📊 Summary: Which Eval to Use When

A comprehensive decision tree to help you choose the right eval type for your specific use case.

Continue to Part II →