Product & Technology

The Complete Guide to LLM Evaluations

Part III - The Strategy & Implementation Guide

Practical implementation guide: use case strategies, step-by-step instructions for building your first eval, common mistakes to avoid, tool selection (build vs. frameworks vs. hosted), and a 90-day rollout roadmap.

18 Dec 2025 10 min read

About This Series

This guide is divided into three parts to take you from concept to production-grade implementation.

1. Implementation Strategy by Use Case

Now you understand what each eval does. Here's which ones to actually run based on what you're building.

1.1. Use Case 1: Product Discovery Agent

Mission: Hook users with personalized product recommendations.

1.1.1. Key Evals

  • Deterministic: Does recommendation include product ID? (JSON validation)
  • Deterministic: Is recommendation within 500 characters? (Engagement hook must be snappy)
  • LLM Judge: Is the recommendation relevant to the user's stated need?

1.1.2. Evals to Skip

  • Faithfulness (You're not summarizing docs, you're matching patterns)
  • Context Relevance (Recommendation isn't based on retrieved documents)
  • MMLU score (Doesn't matter if model knows history; matters if it matches products to needs)

Why: Discovery is about engagement and relevance, not accuracy. Simple checks plus quick LLM scoring are enough (see the sketch below).
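
For example, the two deterministic checks above could look like this. A minimal sketch: the JSON field names (product_id, recommendation) and the exact character limit are assumptions, not a fixed schema.

import json

MAX_CHARS = 500  # the engagement hook must stay snappy

def eval_recommendation(response: str) -> bool:
    # Deterministic checks: valid JSON, has a product ID, hook within the limit.
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or not data.get("product_id"):
        return False
    return len(data.get("recommendation", "")) <= MAX_CHARS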

1.2. Use Case 2: Product Research Agent (Assistant)

Mission: Answer customer questions about products accurately.

1.2.1. Key Evals

  • Context Relevance: Did retrieval find docs about this product?
  • Faithfulness: Did the LLM stick to the product manual? (critical!)
  • Answer Relevance: Did it answer this specific question?
  • LLM Judge (Custom): Is the tone friendly? (see the sketch below)

1.2.2. Evals to Skip

  • Creativity metrics (You want boring, accurate facts, not creative content)
  • MMLU (Doesn't measure whether it knows your products)
  • Verbosity checks (detail isn't a problem here; customers like thorough product info)

Why: Customers are reading product details. Trust is everything. Every claim must be verifiable.
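
As an example of the custom judge above, the tone check can be a short rubric prompt. A minimal sketch, assuming a judge_llm client with a generate() method (the same placeholder used later in this guide); the rubric wording is illustrative.

def eval_friendly_tone(response: str) -> bool:
    # Custom LLM judge: grade tone only, not factual accuracy.
    prompt = f"""You are reviewing a customer-facing answer about a product.
Is the tone friendly and helpful? Ignore factual accuracy.
Answer with exactly one word: "yes" or "no".

Answer to review:
{response}
"""
    verdict = judge_llm.generate(prompt)  # judge_llm: placeholder judge-model client
    return verdict.strip().lower().startswith("yes")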

1.3. Use Case 3: Sales Reachout Agent (Analyst)

Mission: Analyze chat interactions and score purchase intent 1-100.

1.3.1. Key Evals

  • Deterministic: Does output JSON parse correctly? (CRM needs valid JSON)
  • Deterministic: Is intent_score between 1 and 100?
  • Custom Eval: Compare agent's intent score vs human labels on 50 chats (accuracy metric)

1.3.2. Evals to Skip

  • Faithfulness (This isn't summarizing documents; it's analyzing signals)
  • Chatbot Arena (This backend system doesn't chat with humans)
  • Context Relevance (No document retrieval here)

Why: This agent is analytical, not conversational. What matters is structural correctness (valid JSON) and analytical accuracy (does the intent score match human judgment?). Both checks are sketched below.
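
A minimal sketch of the checks above. The intent_score field name comes from the bullets; the tolerance for "matching" a human label is an assumption you should tune.

import json

def eval_intent_output(response: str) -> bool:
    # Structural check: valid JSON with an intent_score in [1, 100].
    try:
        data = json.loads(response)
        return 1 <= data["intent_score"] <= 100
    except (json.JSONDecodeError, TypeError, KeyError):
        return False

def intent_accuracy(agent_scores, human_labels, tolerance=10):
    # Custom eval: fraction of chats where the agent's score lands within
    # `tolerance` points of the human label.
    hits = sum(abs(a - h) <= tolerance for a, h in zip(agent_scores, human_labels))
    return hits / len(human_labels)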

2. How to Build Your First Eval (Hands-On Guide)

You've read all the types. Now: how do you actually implement one?

2.1. Step 1: Pick Your First Eval (Start Simple)

Don't try to evaluate everything. Pick one thing your system absolutely must do.

Examples:

  • "Product name must appear in every response"
  • "Sentiment score must be 1-10"
  • "No competitor names"

2.2. Step 2: Gather 50 Real Examples

  • Go into production logs (or user testing) and collect 50 actual inputs/outputs.

2.2.1. Format

  • Input: "What's your return policy?"
  • Output: "You can return within 30 days..."
  • Expected: Should mention "30 days"
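
In code, that format is just a list of records (or a JSONL file). A sketch; the field names are assumptions:

examples = [
    {
        "input": "What's your return policy?",
        "output": "You can return within 30 days...",
        "expected": "30 days",  # what a correct answer must mention
    },
    # ... 49 more real examples from your logs
]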

2.3. Step 3: Label Them Manually

Go through all 50 and decide what "correct" means.

Example for "Faithfulness":

  • Document: "30-day return policy"
  • Response: "You can return within 30 days"
    • Label: ✓ FAITHFUL
  • Response: "You can return within 60 days"
    • Label: ✗ NOT FAITHFUL

2.4. Step 4: Write the Eval Logic

2.4.1. For deterministic evals

def eval_contains_product_name(response, product_name):
    # Pass if the product name appears anywhere in the response (case-insensitive).
    return product_name.lower() in response.lower()

2.4.2. For LLM judges

# "judge_llm" is a placeholder for whatever client you use to call your
# judge model (GPT-4, Claude, etc.).
prompt = f"""Check if this response mentions the return policy window.
Response: {response}
Document: {document}
Is the response faithful? Return "yes" or "no".
"""
verdict = judge_llm.generate(prompt)
score = verdict.strip().lower() == "yes"  # normalize to a pass/fail boolean

2.5. Step 5: Run the Eval on Your 50 Examples

  • Count how many examples your eval scores the same way you labeled them.
  • Target: 80%+ agreement with your manual labels before deploying.
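
Counting agreement takes only a few lines once your labeled examples are in a list. A sketch, assuming the examples format from Step 2 plus a label field holding your manual judgment from Step 3, and an eval_fn like the ones in Step 4:

# ex["label"] holds your manual True/False judgment from Step 3 (assumed field).
results = [eval_fn(ex["output"]) == ex["label"] for ex in examples]
accuracy = sum(results) / len(results)
print(f"Eval agrees with your labels on {accuracy:.0%} of examples")  # target: 80%+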

2.6. Step 6: Deploy and Monitor

  • In production, run your eval on 1-5% of outputs.
  • Plot the score over time. If it drops, something's broken.
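
This can be as simple as a random sample plus a logged score. A minimal sketch; run_eval and log_metric stand in for your own eval function and metrics client:

import random

SAMPLE_RATE = 0.05  # score ~5% of outputs

def maybe_score(response: str) -> None:
    if random.random() < SAMPLE_RATE:
        passed = run_eval(response)           # your eval from Step 4
        log_metric("eval_pass", int(passed))  # plot over time; alert on drops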

3. Common Mistakes & How to Avoid Them

3.1. Mistake 1: Over-Optimizing for Public Benchmarks

  • The Problem: You read that GPT-4 scores 92% on MMLU and Claude scores 89%. You switch to Claude to save costs, but your app gets worse.
  • Why: MMLU doesn't test what your app cares about. You optimized for the wrong metric.
  • The Fix: Use MMLU only for initial model selection. After choosing a model, optimize for your evals.

3.2. Mistake 2: Judging with the Wrong Judge

  • The Problem: You use a weak judge (like GPT-3.5) to evaluate your system, and it gives inaccurate grades.
  • The Fix:
    • Use a capable judge (GPT-4, Claude 3.5)
    • Or combine multiple judges and vote
    • Validate the judge against human labels before trusting it
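
Validating the judge is the same agreement check from Step 5, just pointed at the judge itself. A sketch, assuming human labels stored as "yes"/"no" and a judge_verdict function that wraps your judge prompt:

def judge_agreement(labeled_examples, judge_verdict):
    # labeled_examples: [{"response": ..., "human_label": "yes" or "no"}, ...] (assumed format)
    matches = sum(judge_verdict(ex["response"]) == ex["human_label"] for ex in labeled_examples)
    return matches / len(labeled_examples)

# Only trust the judge if agreement with humans is high; the exact threshold
# (e.g., 80%) is a judgment call.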

3.3. Mistake 3: Eval Isn't Catching Real Problems

  • The Problem: Your eval says "all responses are helpful" but customers complain they're useless.
  • Why: Your eval is wrong or incomplete.
  • The Fix: Sample 10 random responses and have a human read them. If humans disagree with your eval, redesign it.

3.4. Mistake 4: Running Too Many Evals

  • The Problem: You're running 20 different evaluations. It takes 10 minutes per response. Your system becomes slow.
  • The Fix:
    • Start with 2-3 critical evals
    • Add more only if problems emerge
    • Run expensive evals (LLM judges) only on a sample (e.g., 5% of traffic)
    • Run cheap evals (regex, JSON) on 100%
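
One way to structure this is a tiered check: cheap deterministic evals on every response, the LLM judge only on a sample. A sketch with placeholder functions (cheap_checks, llm_judge_eval, log_metric):

import random

JUDGE_SAMPLE_RATE = 0.05  # the expensive judge runs on ~5% of traffic

def score_response(response: str) -> None:
    # Tier 1: cheap checks (regex, JSON) on 100% of responses.
    log_metric("cheap_eval_pass", int(cheap_checks(response)))

    # Tier 2: LLM judge on a small sample only.
    if random.random() < JUDGE_SAMPLE_RATE:
        log_metric("judge_pass", int(llm_judge_eval(response)))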

3.5. Mistake 5: Static Evals Don't Adapt

  • The Problem: You built an eval 6 months ago. Your system evolved. The eval is now measuring the wrong things.
  • The Fix: Review your evals quarterly. Update them based on:
    • New failure patterns you've seen
    • User feedback
    • Changing business priorities

4. Choosing Tools

You can build evals yourself, or use existing frameworks.

4.1. Option 1: Build It Yourself

# JSON validation: output must parse and contain a score between 1 and 100.
import json

try:
    data = json.loads(response)
    assert 1 <= data["score"] <= 100
    print("PASS")
except (json.JSONDecodeError, KeyError, TypeError, AssertionError):
    print("FAIL")

  • Pros: Full control, no external dependencies, free.
  • Cons: You have to write everything from scratch.

4.2. Option 2: Use Open-Source Frameworks

4.2.1. DeepEval (Python)

  • Pre-built RAG evals (faithfulness, relevance, etc.)
  • LLM judge templates
  • Easy to customize
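
For example, a faithfulness check in DeepEval looks roughly like this (a sketch based on DeepEval's documented metrics API; check the current docs, since the library evolves, and the example inputs are made up):

from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What's your return policy?",
    actual_output="You can return within 30 days.",
    retrieval_context=["Our policy allows returns within 30 days of purchase."],
)

metric = FaithfulnessMetric(threshold=0.7)
metric.measure(test_case)
print(metric.score, metric.reason)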

4.2.2. Ragas (Python)

  • Specifically for RAG systems
  • Automatic eval without manual labeling
  • Good defaults

4.2.3. Promptfoo (JavaScript, Cloud)

  • Configuration-based (no code)
  • Great for A/B testing prompts
  • Good visualization

4.3. Option 3: Use Hosted Platforms

4.3.1. Evidently Cloud

  • Run evals without building infrastructure
  • Built-in dashboards
  • Integrates with CI/CD

4.3.2. LangSmith (by LangChain)

  • Evaluation + monitoring
  • Works with LangChain applications
  • Hosted solution

5. Conclusion: Your Eval Strategy

Here's a simple rollout plan you can actually execute:

Your 90-Day Eval Implementation Roadmap

Day 1: Choose your model

Run public benchmarks once (MMLU + Chatbot Arena if conversational). Pick your foundation model.

Day 2–3: Gather real data

Collect ~50 real examples from your app and label what "correct" looks like.

Week 1: Build your first eval

Start simple: JSON validation or keyword rules. Run on your 50 examples (target ~80%+ accuracy).

Week 2: Deploy & monitor

Run on ~5% of production traffic. Track the score over time and alert on drops.

Month 2: Add RAG evals (if applicable)

If you use retrieval, add faithfulness + relevance checks to reduce hallucinations.

Month 3: Optimize & scale

Add pairwise comparison for prompt iteration. Review human feedback and refine criteria.
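
Pairwise comparison can reuse your judge setup: show the judge two candidate answers to the same input and ask which is better. A minimal sketch using the same placeholder judge_llm client as earlier; in practice, run each pair twice with the order swapped to reduce position bias.

def pairwise_winner(question: str, answer_a: str, answer_b: str) -> str:
    # Ask the judge to pick the better of two candidate answers.
    prompt = f"""Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

Which answer is more helpful and accurate? Reply with exactly "A" or "B"."""
    verdict = judge_llm.generate(prompt).strip().upper()
    return "A" if verdict.startswith("A") else "B"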

Golden Rule: Don't build the perfect eval system. Build the simplest eval system that catches the most important failures. You can always add more later.

The best eval is the one you actually use.