Product & Technology

The Complete Guide to LLM Evaluations

Part II - The Cognitive Engine

Master the RAG Triad (context relevance, faithfulness, answer relevance) and LLM-as-a-judge patterns (direct scoring, pairwise comparison, chain-of-thought reasoning). Learn how to evaluate context-aware systems and handle judge bias.

18 Dec 2025 12 min read

About This Series

This guide is divided into three parts to take you from concept to production-grade implementation.

Eval Type 2: Context-Aware Evals (The RAG Triad)

The RAG Triad: Three Critical Checks

πŸ”

Context Relevance

Did we retrieve the right documents?

🎭

Faithfulness

Did the model stick to the docs (no made-up facts)?

βœ…

Answer Relevance

Did we actually answer what the user asked?

Your system retrieves documents and then generates an answer based on them. We need to check both steps. This is the fundamental difference between a user chatting with a general-purpose LLM like ChatGPT and your AI agent engaging on top of your own knowledge base.

For these evals, both LLMs and humans can judge the output, but it is recommended to use human judges for at least the first few iterations.

1. Overview

1.1. The Scenario (SaaS Onboarding)

User asks: "What's your return policy?"

System retrieves: Documents about returns

System generates: "We offer 30-day returns..."

We need to check:

  • Step 2: Did we retrieve the right documents?
  • Step 3: Did we answer correctly based on those documents?

1.2. What is RAG?

RAG stands for "Retrieval Augmented Generation."

Without RAG

User Question β†’ LLM β†’ Answer

(LLM uses only pre-trained knowledge)

The LLM might make up facts because it doesn't have access to your company's specific policies.

With RAG

Customer Question β†’ Search Your Knowledge Base β†’ Retrieve Documents β†’ LLM + Question + Retrieved Documents β†’ Answer

The system first finds your actual policies, then feeds them to the LLM. This greatly reduces hallucinations (made-up facts).
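To make the flow concrete, here is a minimal sketch of a RAG pipeline in Python. It assumes the OpenAI Python client and a tiny in-memory knowledge base with naive keyword matching; in practice you would swap in your real retriever (vector database, BM25, etc.) and whichever model you deploy.

# A minimal RAG sketch, assuming the OpenAI Python client and a toy in-memory
# knowledge base. The model name and retriever are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

KNOWLEDGE_BASE = [
    "Pause Policy: Available for monthly plans. 30-day pause limit.",
    "Refund Policy: Annual plans are refundable within 14 days of purchase.",
]

def retrieve(question: str, top_k: int = 2) -> list[str]:
    # Naive keyword-overlap scoring; a real system would use vector search.
    words = set(question.lower().split())
    ranked = sorted(KNOWLEDGE_BASE,
                    key=lambda doc: len(words & set(doc.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def answer_with_rag(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = (f"Answer using only this context:\n{context}\n\n"
              f"Question: {question}")
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(answer_with_rag("Can I pause my subscription?"))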

How RAG Works: Without vs With

❌ Without RAG

1) User question

The user asks something specific.

2) LLM (guessing)

The model answers from training data.

3) Answer (risk)

Higher hallucination risk.

Example

User asks: "How long do I have to file a claim?"

LLM (guessing): "Insurance companies typically give you 30 days…"

Result: βœ— WRONG (Your company gives 60 days, your competitor gives 30)

βœ… With RAG

1) User question

Same question…

2) Search knowledge base

Find relevant internal docs.

3) Retrieve docs

Attach the best chunks as context.

4) LLM + docs

Generate grounded output.

5) Accurate answer

Lower hallucination risk.

Same example (grounded)

Retrieved: "Claim filing deadline: 60 days from loss date"

LLM (informed): "You have 60 days from the date of loss to file your claim."

Result: βœ“ CORRECT

1.3. Cost

Cost depends on the number of tokens in the LLM prompt, the user query, the retrieved information, and the final LLM response, plus the cost of verification (manual or by an LLM judge). A rough per-request sketch follows the list below.

1.3.1. Cost Drivers
  • LLM prompt length
  • Number of responses generated
  • The amount of context you retrieve. If you retrieve 5 documents instead of 2, your token count and costs multiply. This is why a high Context Relevance score is crucial: it ensures you're not paying to process and evaluate irrelevant information.
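To make these drivers concrete, here is a rough back-of-the-envelope calculation. The per-token prices and token counts below are made-up placeholders; plug in your provider's actual rates.

# Rough per-request RAG cost. Prices below are placeholders, not real rates.
PRICE_PER_1K_INPUT = 0.0005   # assumed $ per 1K input tokens
PRICE_PER_1K_OUTPUT = 0.0015  # assumed $ per 1K output tokens

def request_cost(prompt_tokens: int, retrieved_tokens: int,
                 response_tokens: int) -> float:
    input_tokens = prompt_tokens + retrieved_tokens
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT \
        + (response_tokens / 1000) * PRICE_PER_1K_OUTPUT

# Retrieving 5 chunks (~2,500 tokens) instead of 2 (~1,000 tokens)
# roughly doubles the input cost of every single request.
print(request_cost(prompt_tokens=400, retrieved_tokens=1000, response_tokens=300))
print(request_cost(prompt_tokens=400, retrieved_tokens=2500, response_tokens=300))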

The RAG Pipeline: Where Each Eval Fits

1

User Query

"What's your return policy?"

↓
2

πŸ” Context Relevance Eval

Did we retrieve the right documents?

Checks: Query β†’ Retrieved Docs match

↓
3

LLM Generates Answer

Uses retrieved docs to create response

↓

🎭 Faithfulness Eval

Did LLM stick to the docs?

Checks: Answer β†’ Docs consistency

βœ… Answer Relevance Eval

Did we answer the question?

Checks: Query β†’ Answer match

2. Eval 2a: Context Relevance (Did We Retrieve the Right Stuff?)

The first step is to find the correct documents or data related to the user query. Unless this step is right, the output will be garbage regardless of everything that happens downstream.

Question: When the customer asked about billing changes, did the system retrieve documents about billing, not unrelated articles?

2.1. How It Works

A judge (either a simple rule or an LLM) looks at:

  • The user's question
  • The retrieved documents

And decides: "Is this document relevant to answering the question?"
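Here is a minimal sketch of such an LLM judge, assuming the OpenAI Python client; the prompt wording and model name are illustrative choices, not a fixed standard.

# Sketch of an LLM judge for context relevance. Prompt and model name are
# illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

def judge_context_relevance(question: str, document: str) -> dict:
    prompt = (
        "Does the document below help answer the user's question? "
        'Reply as JSON: {"relevant": true or false, "reasoning": "..."}\n\n'
        f"Question: {question}\n\nDocument: {document}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

print(judge_context_relevance(
    "How do I change my billing cycle?",
    "Security Features – Two-factor authentication setup",
))  # expected: relevant = false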

2.2. Example

2.2.1. SaaS Example

Customer question: "How do I change my billing cycle?"

Retrieved doc 1: "Account Settings Guide – Changing your billing cycle" β†’ βœ“ RELEVANT

Retrieved doc 2: "Security Features – Two‑factor authentication setup" β†’ βœ— IRRELEVANT (About security, not billing)

Retrieved doc 3: "Enterprise Plan – For organizations with 500+ users" β†’ βœ— IRRELEVANT (About plans, not billing cycle changes)

2.2.2. Insurance Example

Policyholder question: "Are mental health treatments covered?"

Retrieved doc 1: "Policy Section 5 – Mental Health Coverage Details" β†’ βœ“ RELEVANT

Retrieved doc 2: "Claims Process – How to File a Claim" β†’ ⚠️ SOMEWHAT RELEVANT (Related but not specifically about coverage)

Retrieved doc 3: "Car Rental Coverage – Included with auto policies" β†’ βœ— IRRELEVANT (Different coverage type)

2.3. How to Improve Context Relevance

Your system retrieves irrelevant documents. For a SaaS billing question, it pulls up marketing blog posts. For an insurance claim, it retrieves the wrong policyholder's documents.

Problems and solutions:
Poor Chunking Strategy: Your documents are split into chunks that are too large (diluting the meaning) or too small (losing context).
  • For structured documents like insurance policies, use semantic chunking that splits by section or paragraph, not by a fixed number of characters.
  • Tools like LangChain and LlamaIndex have advanced splitters.
  • Experiment with different chunk sizes.
Simple Vector Search is Not Enough: A user query for "annual plan cost" might not semantically match a document chunk that only contains a pricing table.
  • Implement Hybrid Search: Combine semantic (vector) search with traditional keyword search (like BM25). This ensures that exact matches for critical terms (like product names or policy numbers) are always found. (A minimal keyword-search sketch follows this list.)
  • Use Query Expansion: Use an LLM to rewrite the user's query into several variations before searching. For "annual plan cost," it might generate "yearly subscription price," "12-month billing," and "annual license fee," increasing the chances of a match.
Context-Blind Retrieval: Retrieval happens based only on the latest user query, ignoring user details and conversation history.
  • History-Aware Retrieval: Condense the conversation into a standalone query before searching.
  • Memory-Based Retrieval: Leverage long-term user memory and session context to retrieve relevant information.
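As referenced above, here is a minimal sketch of the keyword half of hybrid search, assuming the rank_bm25 package (pip install rank-bm25). Fusing these scores with vector-search scores (for example via reciprocal rank fusion) is omitted for brevity.

# Keyword retrieval with BM25 via the rank_bm25 package. A hybrid system would
# combine these scores with vector-search scores (e.g., reciprocal rank fusion).
from rank_bm25 import BM25Okapi

docs = [
    "Annual subscription pricing: $120 per year, billed each January.",
    "Security Features – Two-factor authentication setup.",
    "Monthly plan: $12 per month, cancel anytime.",
]
tokenized_docs = [doc.lower().split() for doc in docs]
bm25 = BM25Okapi(tokenized_docs)

query_tokens = "annual plan cost".lower().split()
print(bm25.get_top_n(query_tokens, docs, n=1))
# Exact keyword matches ("annual") are found even when embeddings might miss them.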

3. Eval 2b: Faithfulness (Did the LLM Make Stuff Up?)

Check whether the LLM is using the retrieved information to respond to the user, or whether it is making things up.

Question: The LLM gave an answer. Is that answer supported by the documents we retrieved?

3.1. How It Works

Take three things:

  • The user's question
  • The documents retrieved
  • The LLM's answer

A judge checks: "Is everything in the answer actually in the documents? Or did the LLM hallucinate?"
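Here is a minimal sketch of a faithfulness judge, again assuming the OpenAI Python client; the prompt and model name are illustrative assumptions, not the only way to do this.

# Sketch of a faithfulness judge: flag claims in the answer that are not
# supported by the retrieved documents. Prompt and model are assumptions.
import json
from openai import OpenAI

client = OpenAI()

def judge_faithfulness(question: str, documents: str, answer: str) -> dict:
    prompt = (
        "List every factual claim in the answer that is NOT supported by the documents. "
        'Reply as JSON: {"unsupported_claims": [...], "faithful": true or false}\n\n'
        f"Question: {question}\n\nDocuments:\n{documents}\n\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

docs = "We accept credit cards (Visa, Mastercard, Amex) and bank transfers (ACH)."
print(judge_faithfulness(
    "What payment methods do you accept?", docs,
    "We accept Visa, Mastercard, Amex, PayPal, and bank transfers.",
))  # expected: faithful = false, with PayPal flagged as unsupported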

3.2. SaaS Example

Customer question: "What payment methods do you accept?"

Retrieved information: "We accept credit cards (Visa, Mastercard, Amex) and bank transfers (ACH). We do NOT accept PayPal or checks."

Answer A: "We accept Visa, Mastercard, and Amex credit cards, plus bank transfers." β†’ βœ“ FAITHFUL

Answer B: "We accept Visa, Mastercard, Amex, PayPal, and bank transfers." β†’ βœ— Hallucinates (Added PayPal)

Answer C: "We accept Visa, Mastercard, and Amex. We also accept Google Pay and Apple Pay." β†’ βœ— Hallucinates (Not in docs)

3.3. Why It's Critical

This catches hallucinations: one of the biggest problems with LLMs. They can sound very confident while being completely wrong.

For a SaaS onboarding assistant, faithfulness prevents customers from getting wrong instructions. For an insurance agent, it prevents customers from being misled about coverage, which could result in denied claims and customer rage (or worse, regulatory fines).

Real Cost (SaaS): One SaaS company discovered their AI was telling customers "You can change your billing cycle anytime," when the actual policy was "Annual plans can't be changed mid-cycle." This caused support tickets and refund requests.

3.4. How to Improve Faithfulness

The model is hallucinating, making up facts not supported by the retrieved documents. It invents SaaS features that don't exist or promises insurance coverage your company doesn't offer.

Problems and solutions:
The Prompt is Too Creative: Your prompt encourages the model to "be helpful" or "act like an expert" without strictly grounding it.
  • Strengthen the Grounding Prompt: Be explicit. Change your prompt from "Answer the user's question using the provided context" to:
    • "You are a fact-based assistant. Answer the user's question using only the information from the provided documents. Do not add any information that is not explicitly stated in the context. If the answer is not in the documents, say 'I do not have that information in our knowledge base.'"
Contradictory or Noisy Context: The retrieval step returned multiple documents with conflicting information (e.g., an old and a new pricing page). The LLM gets confused and invents a "middle ground" answer.
  • Improve your Context Relevance: If you only retrieve relevant, up-to-date documents, the model has a cleaner source of truth to work with. Implement better document versioning and metadata.
The Model is Too Powerful/Creative: Newer, highly capable models are sometimes more prone to "creative reasoning" that can lead to subtle hallucinations.
  • Experiment with the model's temperature setting: A lower temperature (e.g., `0.1` or `0.0`) makes the output more deterministic and less creative, forcing it to stick closer to the source text. (A short sketch follows this list.)
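Here is a short sketch combining the stricter grounding prompt with a low temperature, assuming the OpenAI Python client; the model name is a placeholder for whatever model you deploy.

# Sketch of grounded generation: strict system prompt + temperature 0.0.
# Model name is an illustrative assumption.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a fact-based assistant. Answer the user's question using only the "
    "information from the provided documents. Do not add any information that is "
    "not explicitly stated in the context. If the answer is not in the documents, "
    "say 'I do not have that information in our knowledge base.'"
)

def grounded_answer(question: str, context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.0,  # more deterministic; sticks closer to the source text
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Documents:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content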

4. Eval 2c: Answer Relevance (Did the LLM Actually Answer the Question?)

The AI retrieved the correct information and the response was grounded in that information, but did it actually answer the user's query?

Question: The LLM gave an answer. But does it actually address what the user asked?

4.1. How It Works

Take two things:

  • The user's original question
  • The LLM's answer

A judge checks: "Does the answer address the question?"
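An LLM judge works here too, but a cheap first-pass proxy is to compare embeddings of the question and the answer. This is only a heuristic (off-topic answers usually score lower, but not always); the sketch assumes the sentence-transformers package, and the model name is just an example.

# Cheap heuristic for answer relevance: cosine similarity between the question
# and the answer embeddings. Model name is an example; an LLM judge is more reliable.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def answer_relevance_score(question: str, answer: str) -> float:
    question_emb, answer_emb = model.encode([question, answer], convert_to_tensor=True)
    return float(util.cos_sim(question_emb, answer_emb))

q = "Can I upgrade my coverage while mid-policy?"
print(answer_relevance_score(q, "Yes, interim increases are approved within 24 hours."))
print(answer_relevance_score(q, "Your policy is comprehensive and provides excellent protection."))
# The off-topic second answer should typically score lower.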

4.2. Insurance Example

Policyholder question: "Can I upgrade my coverage while mid-policy?"

Retrieved info: "Policy modifications allowed during renewal period… For urgent increases, contact sales… Interim increases approved within 24 hours for additional premium."

Answer A: "Yes… during renewal… for urgent requests call sales…" β†’ βœ“ RELEVANT

Answer B: "Your policy is comprehensive and provides excellent protection." β†’ βœ— IRRELEVANT

Answer C: "Many customers ask about coverage changes…" β†’ βœ— IRRELEVANT

4.3. How to Improve Answer Relevance

The answer is factually correct and faithful to the documents, but it doesn't actually answer the user's specific question. A user asks "How do I add a user?" and the agent explains what a user is.

Problems and solutions:
The Prompt Ignores the Original Question: The prompt might be overly focused on summarizing the context instead of relating it back to the user's intent.
  • Refine the Prompt to Prioritize the Question: Change your prompt from "Summarize the provided context" to:
    • "Read the user's question and the provided context. Formulate a direct answer to the user's question based on the information in the context."
The Model is "Distracted" by Irrelevant Context: Even if one retrieved document is relevant, others might be noise. The model might latch onto the noisy documents and generate an answer based on them.
  • Implement a Re-ranking Step: After retrieving an initial set of documents (e.g., 10), use a lightweight model or algorithm to re-rank them based on their relevance to the specific query. Pass only the top 2-3 most relevant documents to the final generation model. Note that re-ranking can add some cost and latency to the response. (A sketch follows this list.)
The Question is Ambiguous: The user asks something vague like "Is it good?" and the model doesn't know what to focus on.
  • Program the LLM to ask clarifying questions: If the Answer Relevance score is consistently low for a certain type of query, it might be a signal that your system needs to handle ambiguity better. Prompt it to respond with: "When you say 'good,' are you asking about price, features, or user reviews?"
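Here is a minimal sketch of the re-ranking step mentioned above, assuming the sentence-transformers package; the cross-encoder model name is just an example.

# Re-rank retrieved documents with a cross-encoder and keep only the best few.
# The model name is an example re-ranker, not a requirement.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, docs: list[str], top_k: int = 3) -> list[str]:
    scores = reranker.predict([(query, doc) for doc in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

# Retrieve broadly first (e.g., 10 candidates), then pass only the top 2-3
# re-ranked documents to the generation model.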

5. The RAG Triad in Action

Think of a SaaS customer asking "Can I pause my subscription?"

5.1. Step 1: Retrieve

System searches and finds: "Pause Policy: Available for monthly plans. 30-day pause limit. Paused time doesn't count toward an annual commitment."

Eval: Context Relevance = βœ“ PASS (Found the right information)

5.2. Step 2: Generate

LLM says: "Yes, you can pause your subscription for up to 30 days. Monthly plans can pause anytime."

Eval 1: Faithfulness = βœ“ PASS (Answer matches the document)

Eval 2: Answer Relevance = βœ“ PASS (Directly answers "can I pause?")

All Evals Pass β†’ Ship to customers

Eval Type 3: LLM-as-a-Judge (The Flexible One)

Since evaluating subjective qualities is hard and cannot be done manually at scale, ask a different LLM to judge the output.

The Analogy: You can't judge your own homework. You ask a teacher to grade it.

When to Use LLM-as-a-Judge

βœ“ Use LLM Judge For:

  • βœ“ Subjective quality (helpfulness, tone, clarity)
  • βœ“ A/B testing prompts or models
  • βœ“ Production monitoring at scale
  • βœ“ Custom criteria (brand voice, empathy)

βœ— Don't Use LLM Judge For:

  • βœ— Format validation (use deterministic)
  • βœ— Exact matching (use regex/keywords)
  • βœ— RAG faithfulness (use RAG triad)
  • βœ— When cost/latency is critical

How It Works (Conceptually)

Step 1

Your Application LLM

The Student generates a response to the user.

↓

Step 2

Evaluation LLM

The Judge reads the response and grades it against your criteria.

↓

Output

Score / Label

A numeric score (e.g., 1–4) or a pass/fail label + reasoning.

In short: Student β†’ Answer β†’ Judge β†’ Score

Important: Why Can a Judge LLM Grade But Not Generate?

"If the LLM can judge responses, why can't it generate perfect responses?"

Grading is easier than creating. Recognizing whether a finished answer meets explicit criteria is a simpler, more constrained task than producing that answer from scratch.

1. Direct Scoring (Rating One Response)

1.1. Scenario

Your Product Research Agent responds to a customer question. You want to know: "Is this response helpful?"

1.2. How It Works

prompt = """You are a helpful evaluator. Rate the following response for helpfulness.
User Question: {question}
Response: {response}

A helpful response:
- Directly answers the question
- Provides actionable information
- Uses simple language

Rate on a scale of 1-4:
- 1 = Not helpful at all
- 2 = Somewhat helpful but incomplete
- 3 = Helpful and mostly complete
- 4 = Extremely helpful and thorough

Return JSON: {"score": X, "reasoning": "why"}
"""
judge_llm.evaluate(prompt)

# Returns: {"score": 3, 
            "reasoning": "Addresses the main concern but could include more alternatives"}

1.3. Examples

Question: "How do I reset my password?"

Response: "Go to login page, click 'forgot password', enter email, check inbox for reset link."

Judge score: 4/4 βœ“ (Clear, direct, actionable)

Question: "Is this product good?"

Response: "Our products are made with high-quality materials."

Judge score: 2/4 (Didn't answer specifically about this product)

1.4. Advantages

  • Flexible (you define any criteria you want)
  • Fast (one evaluation per response)
  • Cheap (costs pennies per evaluation)

1.5. Disadvantages

  • Not perfect (the judge might be wrong)
  • Takes some prompt engineering (you need to teach the judge what "helpful" means)

2. Pairwise Comparison (A vs B)

2.1. Scenario

You changed a prompt and want to know whether the new version is better than the old one.

2.2. When to Use

  • A/B testing prompts
  • Choosing between two models
  • Benchmarking improvements

2.3. How It Works

prompt = """You are an impartial judge. Compare these two responses to the same question.
Question: {question}
Response A: {old_response}
Response B: {new_response}

Which response is better?
Consider: accuracy, helpfulness, clarity, tone.
Return JSON: {"winner": "A" or "B", "reasoning": "why"}
"""

judge_llm.compare(prompt)

# Returns: {"winner": "B", 
            "reasoning": "More detailed and includes examples"}

2.4. Real Example

Question: "How much does shipping cost?"

Old response (A): "Shipping depends on location."

New response (B): "Standard US shipping is $5–10 depending on weight and region. Express 2‑day shipping is $15–20. International rates vary by country."

Judge: "B is better. Provides specific information instead of vague answers."

3. Teaching the Judge: Chain-of-Thought Reasoning

3.1. Problem

If you ask an LLM to score something 1-5, it sometimes struggles. What's the difference between a "3" and a "4"?

3.2. Solution

Ask the judge LLM to think step-by-step first.

3.3. Example

BAD: Vague instructions

prompt = "Rate this response for quality: 1-5"""

GOOD: Step-by-step reasoning

prompt = """
Evaluate this response step-by-step:
1. Does it answer the user's question? Yes/No
2. Is the information accurate based on our knowledge? Yes/No
3. Is it clear and easy to understand? Yes/No
4. Does it have the right tone for customer support? Yes/No

Then score 1-5:
1 = Fails 3 or more criteria
2 = Fails 2 criteria
3 = Fails 1 criterion
4 = Passes all criteria but could be better
5 = Excellent on all criteria
"""

3.4. Why It Works

Breaking down complex judgment into smaller steps makes the judge more consistent and accurate.

3.5. Handling Judge Bias

Even good judges have biases:

3.5.1. Bias 1: Verbosity Bias

The judge prefers longer answers, even if they're just fluff.

  • Answer A: "Return policies vary by product."
  • Answer B: "We have different return windows depending on product type. Electronics have 30 days, clothing has 45 days, furniture has 60 days..."

The judge almost always picks B (even if A was more accurate for the question).

Fix: Tell the judge "Conciseness is good. Longer doesn't mean better."

3.5.2. Bias 2: Position Bias

The judge prefers whichever answer appears first.

Test both orders:

  • (A, B): Judge picks A
  • (B, A): Judge picks B (flipped!)

Fix: Always randomize the order, or test both ways and count only consistent picks.
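A small sketch of the "test both ways" fix. `judge_pair` is a hypothetical placeholder for your own pairwise-comparison call (like the prompt above) that returns "A" or "B".

# Run the pairwise judge in both orders and only count consistent verdicts.
# `judge_pair(question, first, second)` is a hypothetical wrapper around your
# pairwise-comparison prompt that returns "A" or "B" for the better response.
def consistent_winner(question: str, response_a: str, response_b: str):
    verdict_ab = judge_pair(question, response_a, response_b)  # order (A, B)
    verdict_ba = judge_pair(question, response_b, response_a)  # order (B, A)

    # Map the swapped-order verdict back to the original labels.
    verdict_ba_mapped = "A" if verdict_ba == "B" else "B"

    if verdict_ab == verdict_ba_mapped:
        return verdict_ab  # consistent pick, safe to count
    return None            # position-dependent pick: discard or re-run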

3.5.3. Bias 3: Self-Preference Bias

GPT-4 judges tend to favor outputs generated by GPT-4 over outputs from other LLMs, and Claude judges favor Claude outputs.

Fix: Use multiple different judges and take a majority vote for critical decisions.
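A tiny sketch of the majority-vote fix. The judge functions here are hypothetical placeholders for your own wrappers around different providers' models.

# Majority vote across several judge LLMs. `judges` is a list of hypothetical
# wrapper functions, each returning a verdict string such as "pass" or "fail".
from collections import Counter

def majority_verdict(question: str, answer: str, judges) -> str:
    votes = [judge(question, answer) for judge in judges]
    return Counter(votes).most_common(1)[0][0]

# judges = [gpt_judge, claude_judge, gemini_judge]  # your own wrappers
# verdict = majority_verdict(question, answer, judges)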

Summary: Which Eval to Use When?

Which Eval Should I Use?

Does your system use document retrieval (RAG)?

YES β†’ Use the RAG triad

  • Context relevance
  • Faithfulness
  • Answer relevance

NO β†’ Skip RAG evals

Use deterministic checks and/or LLM-as-judge depending on what you need to measure.

Does your output need a specific format?

YES β†’ Deterministic evals

  • JSON schema validation
  • Regex / keyword rules

NO β†’ Continue

Format checks may be optional.

Do you need to judge subjective quality (helpfulness, tone)?

YES β†’ LLM-as-a-judge

Define criteria and score/label the output.

NO β†’ Deterministic might be enough

Ship with fast rules + monitor edge cases.

Are you choosing between foundation models?

YES β†’ Run benchmarks

  • MMLU (reasoning)
  • Chatbot Arena (conversational)

NO β†’ Focus on system evals

Your evals should match your product's failure modes.

| Eval Type | What It Checks | Cost | Speed | When to Use |
| --- | --- | --- | --- | --- |
| MMLU (Benchmark) | Does the model know general knowledge? | Free | N/A | Model selection only |
| Chatbot Arena (Benchmark) | Does the model feel human-like? | Free | N/A | Model selection for conversational AI |
| JSON Validation (Deterministic) | Is output properly formatted? | Free | Instant | Format compliance, safety gates |
| Regex Matching (Deterministic) | Does text contain/avoid keywords? | Free | Instant | Safety, brand compliance |
| Code Execution (Deterministic) | Does generated SQL/code run? | Free | Milliseconds | When LLM writes executable code |
| Context Relevance (RAG) | Did retrieval find relevant docs? | Low | Seconds | RAG systems |
| Faithfulness (RAG) | Did the LLM make up facts? | Medium | Seconds | Preventing hallucinations |
| Answer Relevance (RAG) | Did the LLM answer the right question? | Medium | Seconds | Conversational AI, Q&A |
| Direct Scoring (LLM Judge) | Score response on custom criteria | Medium | Seconds | Production monitoring, quality checks |
| Pairwise Comparison (LLM Judge) | Which response is better? | Higher | Seconds | A/B testing, prompt iteration |

Coming Next: Part III - The Strategy & Implementation Guide

Now that you understand all the eval types, Part III shows you how to actually implement them in production. Get hands-on guidance, avoid common pitfalls, and learn which tools to use.

🎯 Implementation Strategy by Use Case

Learn which evals to run for Product Discovery Agents, Product Research Agents, and Sales Reachout Agents. Know what to skip and why.

πŸ› οΈ How to Build Your First Eval

Step-by-step hands-on guide: from picking your first eval to deploying and monitoring in production. Includes code examples and a 90-day rollout roadmap.

⚠️ Common Mistakes & How to Avoid Them

Learn from real-world failures: over-optimizing for benchmarks, using weak judges, running too many evals, and more.

πŸ”§ Choosing Tools

Compare building your own vs. using frameworks (DeepEval, Ragas, Promptfoo) vs. hosted platforms (Evidently Cloud, LangSmith).

Continue to Part III β†’