Product & Technology
The Complete Guide to LLM Evaluations
Part II - The Cognitive Engine
Master the RAG Triad (context relevance, faithfulness, answer relevance) and LLM-as-a-judge patterns (direct scoring, pairwise comparison, chain-of-thought reasoning). Learn how to evaluate context-aware systems and handle judge bias.
About This Series
This guide is divided into three parts to take you from concept to production-grade implementation.
The Foundations
Focus: Buying vs. Building, Model Selection (MMLU/GSM8K), and Safety Checks.
Best for: Decision-makers establishing their AI stack.
The Cognitive Engine
You are here. Focus: Advanced RAG Evaluation (Context/Faithfulness) and LLM-as-a-Judge.
Best for: Engineers, Data Scientists, and Product Managers building AI agents.
Operationalizing AI Quality
Focus: Strategy by Use Case, Hands-on Implementation, and Tooling.
Best for: Tech Leads and Product Leaders managing production AI.
Eval Type 2: Context-Aware Evals (The RAG Triad)
The RAG Triad: Three Critical Checks
Context Relevance
Did we retrieve the right documents?
Faithfulness
Did the model stick to the docs (no made-up facts)?
Answer Relevance
Did we actually answer what the user asked?
Your system retrieves documents and then generates an answer based on them. We need to check both steps. This is the fundamental difference between a user querying an open LLM like ChatGPT and a user engaging with your AI agent.
For this eval, both LLMs and humans can judge the output, but it is recommended to use human judges for at least the first few iterations.
1. Overview
1.1. The Scenario (SaaS Onboarding)
User asks: "What's your return policy?"
System retrieves: Documents about returns
System generates: "We offer 30-day returns..."
We need to check:
- Step 2: Did we retrieve the right documents?
- Step 3: Did we answer correctly based on those documents?
1.2. What is RAG?
RAG stands for "Retrieval Augmented Generation."
Without RAG
User Question → LLM → Answer
(LLM uses only pre-trained knowledge)
The LLM might make up facts because it doesn't have access to your company's specific policies.
With RAG
Customer Question → Search Your Knowledge Base → Retrieve Documents → LLM + Question + Retrieved Documents → Answer
The system first finds your actual policies, then feeds them to the LLM. This prevents hallucinations (making stuff up).
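To make the pipeline concrete, here is a minimal retrieve-then-generate sketch in Python. It assumes the OpenAI SDK; `search_knowledge_base`, the model name, and the prompt wording are placeholders for your own stack.

```python
# A minimal retrieve-then-generate sketch (not production code).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def search_knowledge_base(query: str, top_k: int = 3) -> list[str]:
    """Placeholder: swap in your vector store, keyword index, or hybrid search."""
    raise NotImplementedError


def answer_with_rag(question: str) -> str:
    docs = search_knowledge_base(question)
    context = "\n\n".join(docs)
    prompt = (
        "Answer the customer's question using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```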
How RAG Works: Without vs With
❌ Without RAG
1) User question
The user asks something specific.
2) LLM (guessing)
The model answers from training data.
3) Answer (risk)
Higher hallucination risk.
Example
User asks: "How long do I have to file a claim?"
LLM (guessing): "Insurance companies typically give you 30 daysβ¦"
Result: β WRONG (Your company gives 60 days, your competitor gives 30)
β With RAG
1) User question
Same question…
2) Search knowledge base
Find relevant internal docs.
3) Retrieve docs
Attach the best chunks as context.
4) LLM + docs
Generate grounded output.
5) Accurate answer
Lower hallucination risk.
Same example (grounded)
Retrieved: "Claim filing deadline: 60 days from loss date"
LLM (informed): "You have 60 days from the date of loss to file your claim."
Result: ✅ CORRECT
1.3. Cost
Cost depends on the number of tokens in the LLM prompt, the user query, the retrieved information, and the final LLM response, plus the cost of verification (manual or by an LLM).
1.3.1. Cost Drivers
- LLM Prompt length
- # responses generated
- The amount of context you retrieve. If you retrieve 5 documents instead of 2, your token count and cost multiply. This is why a high Context Relevance score is crucial: it ensures you're not paying to process and evaluate irrelevant information. A rough per-request estimate is sketched below.
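A back-of-the-envelope sketch of how retrieval depth drives cost per request; every token count and price below is an illustrative assumption, not real pricing.

```python
# Back-of-the-envelope cost per RAG request. All token counts and prices
# are illustrative assumptions; substitute your model's actual pricing.
PRICE_PER_1K_INPUT_TOKENS = 0.005   # USD, assumed
PRICE_PER_1K_OUTPUT_TOKENS = 0.015  # USD, assumed

SYSTEM_PROMPT_TOKENS = 400
USER_QUERY_TOKENS = 50
TOKENS_PER_RETRIEVED_DOC = 600
OUTPUT_TOKENS = 250


def cost_per_request(num_docs: int) -> float:
    input_tokens = SYSTEM_PROMPT_TOKENS + USER_QUERY_TOKENS + num_docs * TOKENS_PER_RETRIEVED_DOC
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS \
        + (OUTPUT_TOKENS / 1000) * PRICE_PER_1K_OUTPUT_TOKENS


print(f"2 docs: ${cost_per_request(2):.4f} per request")
print(f"5 docs: ${cost_per_request(5):.4f} per request")  # more retrieval -> more input tokens -> higher cost
```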
The RAG Pipeline: Where Each Eval Fits
User Query
"What's your return policy?"
Context Relevance Eval
Did we retrieve the right documents?
Checks: Query ↔ Retrieved Docs match
LLM Generates Answer
Uses retrieved docs to create response
Faithfulness Eval
Did LLM stick to the docs?
Checks: Answer ↔ Docs consistency
Answer Relevance Eval
Did we answer the question?
Checks: Query ↔ Answer match
2. Eval 2a: Context Relevance (Did We Retrieve the Right Stuff?)
The first step is to find the correct documents or data related to the user query. Unless this step is correct, the output will be garbage regardless of how well the later steps perform.
Question: When the customer asked about billing changes, did the system retrieve documents about billing, not unrelated articles?
2.1. How It Works
A judge (either a simple rule or an LLM) looks at:
- The user's question
- The retrieved documents
And decides: "Is this document relevant to answering the question?"
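A minimal sketch of such a judge, assuming the OpenAI Python SDK and an illustrative model name; the prompt wording and JSON shape are choices for this example, not a standard.

```python
# A minimal LLM-judge sketch for context relevance.
import json

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def judge_context_relevance(question: str, document: str) -> dict:
    prompt = f"""You are evaluating a retrieval system.

Question: {question}
Retrieved document: {document}

Is this document relevant to answering the question?
Return JSON: {{"relevant": true/false, "reasoning": "why"}}"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)


# Example:
# judge_context_relevance("How do I change my billing cycle?",
#                         "Security Features - Two-factor authentication setup")
# -> {"relevant": false, "reasoning": "..."}
```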
2.2. Example
2.2.1. SaaS Example
Customer question: "How do I change my billing cycle?"
Retrieved doc 1: "Account Settings Guide β Changing your billing cycle" β β RELEVANT
Retrieved doc 2: "Security Features β Twoβfactor authentication setup" β β IRRELEVANT (About security, not billing)
Retrieved doc 3: "Enterprise Plan β For organizations with 500+ users" β β IRRELEVANT (About plans, not billing cycle changes)
2.2.2. Insurance Example
Policyholder question: "Are mental health treatments covered?"
Retrieved doc 1: "Policy Section 5 β Mental Health Coverage Details" β β RELEVANT
Retrieved doc 2: "Claims Process β How to File a Claim" β β οΈ SOMEWHAT RELEVANT (Related but not specifically about coverage)
Retrieved doc 3: "Car Rental Coverage β Included with auto policies" β β IRRELEVANT (Different coverage type)
2.3. How to improve Context Relevance
Your system retrieves irrelevant documents. For a SaaS billing question, it pulls up marketing blog posts. For an insurance claim, it retrieves the wrong policyholder's documents.
| Problem | Solution |
| --- | --- |
| Poor Chunking Strategy: Your documents are split into chunks that are too large (diluting the meaning) or too small (losing context). | Tune chunk size and overlap, and split along semantic boundaries (sections, headings) rather than fixed character counts. |
| Simple Vector Search is Not Enough: A user query for "annual plan cost" might not semantically match a document chunk that only contains a pricing table. | Combine vector search with keyword search (hybrid retrieval) and rerank the top candidates before passing them to the LLM. |
| Context-Blind Retrieval: Retrieval happens only on the raw user query, ignoring user details and conversation history. | Rewrite the query before retrieval, folding in conversation history and known user attributes (plan, region, policy type). |
3. Eval 2b: Faithfulness (Did the LLM Make Stuff Up?)
Check whether the LLM is using the retrieved information to respond to the user, or making things up.
Question: The LLM gave an answer. Is that answer supported by the documents we retrieved?
3.1. How It Works
Take three things:
- The user's question
- The documents retrieved
- The LLM's answer
A judge checks: "Is everything in the answer actually in the documents? Or did the LLM hallucinate?"
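A minimal faithfulness judge under the same assumptions (OpenAI SDK, illustrative model and prompt). Asking the judge to list unsupported claims, rather than return a bare yes/no, makes failures easier to debug.

```python
# A minimal faithfulness-judge sketch.
import json

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def judge_faithfulness(context: str, answer: str) -> dict:
    prompt = f"""You are checking an AI answer for hallucinations.

Context (retrieved documents):
{context}

Answer to check:
{answer}

List every claim in the answer that is NOT supported by the context.
Return JSON: {{"unsupported_claims": ["..."], "faithful": true/false}}"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
```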
3.2. SaaS Example
Customer question: "What payment methods do you accept?"
Retrieved information: "We accept credit cards (Visa, Mastercard, Amex) and bank transfers (ACH). We do NOT accept PayPal or checks."
Answer A: "We accept Visa, Mastercard, and Amex credit cards, plus bank transfers." β β FAITHFUL
Answer B: "We accept Visa, Mastercard, Amex, PayPal, and bank transfers." β β Hallucinates (Added PayPal)
Answer C: "We accept Visa, Mastercard, and Amex. We also accept Google Pay and Apple Pay." β β Hallucinates (Not in docs)
3.3. Why It's Critical
This catches hallucinations: one of the biggest problems with LLMs. They can sound very confident while being completely wrong.
For a SaaS onboarding assistant, faithfulness prevents customers from getting wrong instructions. For an insurance Agent, it prevents customers from being misled about coverage, which could result in denied claims and customer rage (or worse, regulatory fines).
Real Cost (SaaS): One SaaS company discovered their AI was telling customers "You can change your billing cycle anytime," when the actual policy was "Annual plans can't be changed mid-cycle." This caused support tickets and refund requests.
3.4. How to improve Faithfulness
The model is hallucinating, making up facts not supported by the retrieved documents. It invents SaaS features that don't exist or promises insurance coverage your company doesn't offer.
| Problem | Solution |
| --- | --- |
| The Prompt is Too Creative: Your prompt encourages the model to "be helpful" or "act like an expert" without strictly grounding it. | Instruct the model to answer only from the provided context and to say "I don't know" when the context doesn't cover the question; lower the temperature. |
| Contradictory or Noisy Context: The retrieval step returned multiple documents with conflicting information (e.g., an old and a new pricing page). The LLM gets confused and invents a "middle ground" answer. | Deduplicate and version your knowledge base, retire stale documents, and rerank so only the most authoritative chunks reach the prompt. |
| The Model is Too Powerful/Creative: Newer, highly capable models are sometimes more prone to "creative reasoning" that can lead to subtle hallucinations. | Tighten the grounding instructions, reduce temperature, and gate responses on a faithfulness eval before they reach customers. |
4. Eval 2c: Answer Relevance (Did the LLM Actually Answer the Question?)
The AI retrieved the correct information and the response was built from it, but did it actually answer the user's query?
Question: The LLM gave an answer. But does it actually address what the user asked?
4.1. How It Works
Take two things:
- The user's original question
- The LLM's answer
A judge checks: "Does the answer address the question?"
4.2. Insurance Example
Policyholder question: "Can I upgrade my coverage while mid-policy?"
Retrieved info: "Policy modifications allowed during renewal period⦠For urgent increases, contact sales⦠Interim increases approved within 24 hours for additional premium."
Answer A: "Yesβ¦ during renewalβ¦ for urgent requests call salesβ¦" β β RELEVANT
Answer B: "Your policy is comprehensive and provides excellent protection." β β IRRELEVANT
Answer C: "Many customers ask about coverage changesβ¦" β β IRRELEVANT
4.3. How to improve Answer Relevance
The answer is factually correct and faithful to the documents, but it doesn't actually answer the user's specific question. A user asks "How do I add a user?" and the agent explains what a user is.
| Problem | Solution |
| --- | --- |
| The Prompt Ignores the Original Question: The prompt might be overly focused on summarizing the context instead of relating it back to the user's intent. | Restate the user's question in the prompt and instruct the model to answer it directly before adding supporting detail. |
| The Model is "Distracted" by Irrelevant Context: Even if one retrieved document is relevant, others might be noise. The model might latch onto the noisy documents and generate an answer based on them. | Retrieve fewer, higher-quality chunks (fix context relevance first) and rerank so the most relevant document appears at the top of the prompt. |
| The Question is Ambiguous: The user asks something vague like "Is it good?" and the model doesn't know what to focus on. | Have the agent ask a clarifying question, or use conversation history to resolve what "it" refers to before answering. |
5. The RAG Triad in Action
Think of a SaaS customer asking "Can I pause my subscription?"
5.1. Step 1: Retrieve
System searches and finds: "Pause Policy: Available for monthly plans. 30-day pause limit. Paused time doesn't count toward an annual commitment."
Eval: Context Relevance = ✅ PASS (Found the right information)
5.2. Step 2: Generate
LLM says: "Yes, you can pause your subscription for up to 30 days. Monthly plans can pause anytime."
Eval 1: Faithfulness = ✅ PASS (Answer matches the document)
Eval 2: Answer Relevance = ✅ PASS (Directly answers "can I pause?")
All Evals Pass → Ship to customers
Eval Type 3: LLM-as-a-Judge (The Flexible One)
Since evaluating subjective qualities is hard and cannot be done at scale manually, ask a different LLM to judge the output.
The Analogy: You can't judge your own homework. You ask a teacher to grade it.
When to Use LLM-as-a-Judge
✅ Use LLM Judge For:
- ✅ Subjective quality (helpfulness, tone, clarity)
- ✅ A/B testing prompts or models
- ✅ Production monitoring at scale
- ✅ Custom criteria (brand voice, empathy)
❌ Don't Use LLM Judge For:
- ❌ Format validation (use deterministic)
- ❌ Exact matching (use regex/keywords)
- ❌ RAG faithfulness (use RAG triad)
- ❌ When cost/latency is critical
How It Works (Conceptually)
Step 1
Your Application LLM
The Student generates a response to the user.
Step 2
Evaluation LLM
The Judge reads the response and grades it against your criteria.
Output
Score / Label
A numeric score (e.g., 1–4) or a pass/fail label + reasoning.
In short: Student → Answer → Judge → Score
Important: Why Can a Judge LLM Grade But Not Generate?
"If the LLM can judge responses, why can't it generate perfect responses?"
Grading is easier than creating. Checking an existing answer against clear criteria is a much narrower task than producing the best possible answer from scratch, which is why a judge model can be reliable even when its own generations would be imperfect.
1. Direct Scoring (Rating One Response)
1.1. Scenario
Your Product Research Agent responds to a customer question. You want to know: "Is this response helpful?"
1.2. How It Works
prompt = """You are a helpful evaluator. Rate the following response for helpfulness.
User Question: {question}
Response: {response}
A helpful response:
- Directly answers the question
- Provides actionable information
- Uses simple language
Rate on a scale of 1-4:
- 1 = Not helpful at all
- 2 = Somewhat helpful but incomplete
- 3 = Helpful and mostly complete
- 4 = Extremely helpful and thorough
Return JSON: {"score": X, "reasoning": "why"}
"""
judge_llm.evaluate(prompt)
# Returns: {"score": 3,
"reasoning": "Addresses the main concern but could include more alternatives"} 1.3. Examples
Question: "How do I reset my password?"
Response: "Go to login page, click 'forgot password', enter email, check inbox for reset link."
Judge score: 4/4 ✅ (Clear, direct, actionable)
Question: "Is this product good?"
Response: "Our products are made with high-quality materials."
Judge score: 2/4 (Didn't answer specifically about this product)
1.4. Advantages
- Flexible (you define any criteria you want)
- Fast (one evaluation per response)
- Cheap (costs pennies per evaluation)
1.5. Disadvantages
- Not perfect (the judge might be wrong)
- Takes some prompt engineering (you need to teach the judge what "helpful" means)
2. Pairwise Comparison (A vs B)
2.1. Scenario
You changed a prompt. Want to know if the new version is better than the old one?
2.2. When to Use
- A/B testing prompts
- Choosing between two models
- Benchmarking improvements
2.3. How It Works
prompt = """You are an impartial judge. Compare these two responses to the same question.
Question: {question}
Response A: {old_response}
Response B: {new_response}
Which response is better?
Consider: accuracy, helpfulness, clarity, tone.
Return JSON: {"winner": "A" or "B", "reasoning": "why"}
"""
judge_llm.compare(prompt)
# Returns: {"winner": "B",
"reasoning": "More detailed and includes examples"} 2.4. Real Example
Question: "How much does shipping cost?"
Old response (A): "Shipping depends on location."
New response (B): "Standard US shipping is $5β10 depending on weight and region. Express 2βday shipping is $15β20. International rates vary by country."
Judge: "B is better. Provides specific information instead of vague answers."
3. Teaching the Judge: Chain-of-Thought Reasoning
3.1. Problem
If you ask an LLM to score something 1-5, it sometimes struggles. What's the difference between a "3" and a "4"?
3.2. Solution
Ask AI to think step-by-step first.
3.3. Example
BAD: Vague instructions
prompt = "Rate this response for quality: 1-5""" GOOD: Step-by-step reasoning
prompt = """
Evaluate this response step-by-step:
1. Does it answer the user's question? Yes/No
2. Is the information accurate based on our knowledge? Yes/No
3. Is it clear and easy to understand? Yes/No
4. Does it have the right tone for customer support? Yes/No
Then score 1-5:
1 = Fails 3 or more criteria
2 = Fails 2 criteria
3 = Fails 1 criterion
4 = Passes all criteria but could be better
5 = Excellent on all criteria
""" 3.4. Why It Works
Breaking down complex judgment into smaller steps makes the judge more consistent and accurate.
3.5. Handling Judge Bias
Even good judges have biases:
3.5.1. Bias 1: Verbosity Bias
The judge prefers longer answers, even if they're just fluff.
- Answer A: "Return policies vary by product."
- Answer B: "We have different return windows depending on product type. Electronics have 30 days, clothing has 45 days, furniture has 60 days..."
The judge almost always picks B (even if A was more accurate for the question).
Fix: Tell the judge "Conciseness is good. Longer doesn't mean better."
3.5.2. Bias 2: Position Bias
The judge prefers whichever answer appears first.
Test both orders:
- (A, B): Judge picks A
- (B, A): Judge picks B (flipped!)
Fix: Always randomize the order, or test both ways and count only consistent picks.
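A small sketch of the test-both-ways fix: run the comparison in both orders and only count verdicts that survive the swap. The `judge` parameter is a stand-in for whatever pairwise judging call you use.

```python
# Order-swap check: a verdict only counts if it holds with the answers swapped.
from typing import Callable, Optional


def consistent_winner(judge: Callable[[str, str, str], str],
                      question: str, answer_a: str, answer_b: str) -> Optional[str]:
    # `judge(question, first, second)` should return "first" or "second".
    verdict_ab = judge(question, answer_a, answer_b)  # A shown first
    verdict_ba = judge(question, answer_b, answer_a)  # B shown first
    if verdict_ab == "first" and verdict_ba == "second":
        return "A"  # A wins regardless of position
    if verdict_ab == "second" and verdict_ba == "first":
        return "B"  # B wins regardless of position
    return None  # verdict flipped with position: treat as a tie, not a signal
```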
3.5.3. Bias 3: Self-Preference Bias
GPT-4 judges tend to favor outputs from the same model family over those from other LLMs; Claude judges favor Claude outputs.
Fix: Use multiple different judges and take a majority vote for critical decisions.
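A sketch of the multi-judge fix; each entry in `judges` stands in for a call to a different judge model.

```python
# Majority vote across several different judge models (GPT, Claude, open-weights, ...).
from collections import Counter
from typing import Callable, Sequence


def majority_verdict(judges: Sequence[Callable[[str, str], str]],
                     question: str, answer: str) -> str:
    votes = [judge(question, answer) for judge in judges]  # e.g. ["pass", "pass", "fail"]
    label, _count = Counter(votes).most_common(1)[0]
    return label
```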
Summary: Which Eval to Use When?
Which Eval Should I Use?
Does your system use document retrieval (RAG)?
YES → Use the RAG triad
- Context relevance
- Faithfulness
- Answer relevance
NO → Skip RAG evals
Use deterministic checks and/or LLM-as-judge depending on what you need to measure.
Does your output need a specific format?
YES → Deterministic evals
- JSON schema validation
- Regex / keyword rules
NO → Continue
Format checks may be optional.
Do you need to judge subjective quality (helpfulness, tone)?
YES → LLM-as-a-judge
Define criteria and score/label the output.
NO → Deterministic might be enough
Ship with fast rules + monitor edge cases.
Are you choosing between foundation models?
YES → Run benchmarks
- MMLU (reasoning)
- Chatbot Arena (conversational)
NO → Focus on system evals
Your evals should match your product's failure modes.
| Eval Type | What It Checks | Cost | Speed | When to Use |
| --- | --- | --- | --- | --- |
| MMLU (Benchmark) | Does the model know general knowledge? | Free | N/A | Model selection only |
| Chatbot Arena (Benchmark) | Does the model feel human-like? | Free | N/A | Model selection for conversational AI |
| JSON Validation (Deterministic) | Is output properly formatted? | Free | Instant | Format compliance, safety gates |
| Regex Matching (Deterministic) | Does text contain/avoid keywords? | Free | Instant | Safety, brand compliance |
| Code Execution (Deterministic) | Does generated SQL/code run? | Free | Milliseconds | When the LLM writes executable code |
| Context Relevance (RAG) | Did retrieval find relevant docs? | Low | Seconds | RAG systems |
| Faithfulness (RAG) | Did the LLM make up facts? | Medium | Seconds | Preventing hallucinations |
| Answer Relevance (RAG) | Did the LLM answer the right question? | Medium | Seconds | Conversational AI, Q&A |
| Direct Scoring (LLM Judge) | Score a response on custom criteria | Medium | Seconds | Production monitoring, quality checks |
| Pairwise Comparison (LLM Judge) | Which response is better? | Higher | Seconds | A/B testing, prompt iteration |
Coming Next: Part III - The Strategy & Implementation Guide
Now that you understand all the eval types, Part III shows you how to actually implement them in production. Get hands-on guidance, avoid common pitfalls, and learn which tools to use.
Implementation Strategy by Use Case
Learn which evals to run for Product Discovery Agents, Product Research Agents, and Sales Reachout Agents. Know what to skip and why.
How to Build Your First Eval
Step-by-step hands-on guide: from picking your first eval to deploying and monitoring in production. Includes code examples and a 90-day rollout roadmap.
Common Mistakes & How to Avoid Them
Learn from real-world failures: over-optimizing for benchmarks, using weak judges, running too many evals, and more.
Choosing Tools
Compare building your own vs. using frameworks (DeepEval, Ragas, Promptfoo) vs. hosted platforms (Evidently Cloud, LangSmith).