LangChain Series (6): RAG Evaluation, or How Do You Know It Is Good Enough?

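🎯 Audience: engineers who already have a RAG system running and want quantitative metrics to measure and keep improving retrieval quality

⏱️ Reading time: about 25 minutes

💬 This article covers the core metrics of the RAGAS evaluation framework and how to use them, how to build an evaluation dataset, and how to A/B compare different RAG strategies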

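1. Why Quantitative Evaluation Matters

"It feels okay" is not an engineering answer. Without quantitative metrics, RAG optimization tends to stall on questions like these:

You changed chunk_size and cannot tell whether things got better or worse
You added Reranking and latency went up by 200 ms, but it is unclear whether the accuracy gain is worth it
You swapped the Embedding model, the two models "feel about the same", and you cannot decide

Quantitative evaluation solves one core problem: every optimization is backed by data, and technical decisions stop depending on gut feeling.

RAG evaluation covers three dimensions: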


RAG Evaluation Dimensions

  +------------------+    +------------------+    +------------------+
  |   Retrieval      |    |   Generation     |    |   End-to-End     |
  |                  |    |                  |    |                  |
  | Context Recall   |    | Faithfulness     |    | Answer           |
  | Context          |    | Answer           |    | Correctness      |
  | Precision        |    | Relevancy        |    |                  |
  +------------------+    +------------------+    +------------------+
         |                        |                        |
   "Did we retrieve          "Did the LLM           "Is the final
    the right chunks?"        use them correctly?"    answer right?"

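2. The RAGAS Evaluation Framework

RAGAS (Retrieval Augmented Generation Assessment) is one of the most widely used RAG evaluation frameworks. Its defining feature is that an LLM does the scoring instead of human raters, so each dimension of a RAG system can be quantified without large-scale manual annotation.

2.1 What Is ground_truth

Before the metrics themselves, you need to understand ground_truth (the reference answer), because it determines which metrics you can use.

ground_truth is the "ideal answer" written in advance by a human for each evaluation question. It represents the answer a perfectly performing RAG system should give.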

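# ground_truth example:
question    = "What is the pipe operator in LangChain's LCEL?"
ground_truth = (
    "LCEL (LangChain Expression Language) uses the | operator to chain Runnable components "
    "into a RunnableSequence. | is an overload of Python's __or__ method and is semantically "
    "equivalent to a Unix pipe. The output of each component automatically becomes the input "
    "of the next, and every Runnable supports invoke, batch, stream, and ainvoke."
)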

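What is ground_truth used for? RAGAS uses it to measure two things:

The two uses of ground_truth:

  1. Measuring retrieval quality (Context Recall)
     How much of the knowledge stated in ground_truth is covered by the retrieved context?
     → The more is covered, the more complete the retrieval

  2. Measuring answer quality (Answer Correctness)
     How much do the facts in the generated answer overlap with ground_truth?
     → The more they overlap, the more accurate the answer

Why is ground_truth hard to obtain at scale?

Maintaining ground_truth means a human has to write an ideal answer for every question that might be asked of the knowledge base. With thousands of documents and an open-ended question space, that cost is very high. This is why RAGAS also provides reference-free metrics that need no ground_truth and are suited to cheap day-to-day monitoring.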

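2.2 Installation

pip install ragas langchain-openai langchain-chroma datasets python-dotenv pandas matplotlib

The examples in this article use the ragas 0.1-style API (module-level metric objects and the question / answer / contexts / ground_truth columns). Later ragas releases renamed these columns and changed the result object, so pin the version you test against or adapt the names.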

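2.3 Overview of RAGAS Metrics

RAGAS metrics fall into two groups: those that need ground_truth and those that do not.

Metrics that need ground_truth (reference-based):

Metric	Dimension	How it is computed	Score range
`Context Recall`	Retrieval: was the relevant content retrieved	Share of ground truth sentences that the context supports	0~1, higher is better
`Context Precision`	Retrieval: precision of what was retrieved	Share of retrieved chunks relevant to the ground truth (rank-aware)	0~1, higher is better
`Context Entity Recall`	Retrieval: named-entity recall	Share of ground truth entities that appear in the context	0~1, higher is better
`Noise Sensitivity`	Robustness: how much retrieved chunks lead the answer astray	Share of answer claims that are incorrect when checked against the ground truth	0~1, lower is better
`Answer Correctness`	End to end: is the answer correct	Weighted sum of factual overlap with the ground truth (F1) and semantic similarity	0~1, higher is better

Metrics that do not need ground_truth (reference-free):

Metric	Dimension	How it is computed	Score range
`Faithfulness`	Generation: is the answer faithful to the retrieved content	Share of answer statements that can be inferred from the context	0~1, higher is better
`Answer Relevancy`	Generation: is the answer on topic	Mean similarity between the original question and questions generated back from the answer	0~1, higher is better

💡 Practical advice: in production it is hard to maintain ground_truth for every question, so day-to-day monitoring should rely first on the two reference-free metrics, Faithfulness + Answer Relevancy. Build an evaluation set and compute the full reference-based suite only for major version iterations.

How the four core metrics relate: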


                    Retrieval Quality
                   /                 \
        Context Recall          Context Precision
        (did we get              (are retrieved
         everything?)             chunks relevant?)
                   \                 /
                    Generation Quality
                   /                 \
           Faithfulness         Answer Relevancy
           (did LLM stick        (does answer
            to context?)          address question?)

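2.4 How Each Metric Is Computed

Faithfulness

Faithfulness = statements inferable from the context / total statements in the answer

Example:
The answer contains 5 statements:
  ✅ "LCEL uses the | operator"                  <- supported by the context
  ✅ "The Runnable interface has an invoke method" <- supported by the context
  ✅ "The batch method supports concurrency"      <- supported by the context
  ❌ "LCEL is 10x faster than legacy Chains"      <- not mentioned in the context (hallucination)
  ✅ "with_retry retries automatically"           <- supported by the context

Faithfulness = 4/5 = 0.8

Low Faithfulness → the LLM introduced information from outside the context (hallucination) while generating the answer; tighten the Prompt constraints.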
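The arithmetic in the example above can be sketched in a few lines. This is only the final ratio; in RAGAS the per-statement verdicts come from LLM calls, which are replaced here by a hand-written list. The function name `faithfulness_score` is illustrative and not part of the ragas API.

```python
def faithfulness_score(verdicts: list[bool]) -> float:
    """Fraction of answer statements that the retrieved context supports."""
    if not verdicts:
        raise ValueError("need at least one statement")
    return sum(verdicts) / len(verdicts)


# The five statements from the example; True means "supported by the context".
verdicts = [True, True, True, False, True]
score = faithfulness_score(verdicts)
print(f"faithfulness = {score:.2f}")
```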

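Answer Relevancy

Answer Relevancy = mean(cosine_similarity(generated_questions, original_question))

How it is computed:
1. Given the answer, have the LLM generate N questions that could have led to it
2. Compute the Embedding cosine similarity between each generated question and the original question
3. Take the mean

Intuition: a good answer addresses only what was asked.
If the questions inferred from the answer look nothing like the original question, the answer has drifted off topic.

Low Answer Relevancy → the answer may be correct but is off topic (too much filler, or it answers a different question).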
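Steps 2 and 3 can be illustrated with plain Python. This sketch assumes the generated questions have already been embedded; the two-dimensional vectors are toy values chosen for readability, and `answer_relevancy_score` is a hypothetical helper, not the ragas implementation.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def answer_relevancy_score(question_vec, generated_vecs) -> float:
    """Mean similarity between the original question and the generated questions."""
    sims = [cosine(question_vec, g) for g in generated_vecs]
    return sum(sims) / len(sims)


original = [1.0, 0.0]                               # embedding of the original question
generated = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # embeddings of 3 generated questions
score = answer_relevancy_score(original, generated)
print(f"answer_relevancy = {score:.3f}")
```

The second generated question is orthogonal to the original one, which is what an off-topic answer looks like in embedding space, and it drags the mean down.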

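Context Recall

Context Recall = ground truth sentences supported by the context / total ground truth sentences

Requires ground_truth (the reference answer).

Example:
ground_truth contains 3 key points:
  ✅ "RunnableParallel runs multiple branches concurrently" <- present in the retrieved context
  ✅ "Results are merged into a dict and returned"          <- present in the retrieved context
  ❌ "lambda_mult controls diversity"                       <- missing from the context

Context Recall = 2/3 = 0.67

Low Context Recall → retrieval missed key information; improve the chunking strategy or raise k.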

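Context Precision

Context Precision = Σ (precision@k × relevance_k) / number of relevant chunks

Rank-aware: a relevant chunk ranked near the top contributes more than one ranked lower

Example: 4 chunks retrieved, with the following relevance (1 = relevant, 0 = not relevant)
  Rank 1: relevant (1)     -> precision@1 = 1/1 = 1.0
  Rank 2: not relevant (0) -> not counted
  Rank 3: relevant (1)     -> precision@3 = 2/3 = 0.67
  Rank 4: not relevant (0) -> not counted

Context Precision = (1.0 + 0.67) / 2 = 0.83
(the denominator is the number of relevant chunks, 2, not the total number of chunks, 4)

Low Context Precision → too many irrelevant chunks were retrieved, or the Reranker failed to move the relevant ones to the top.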
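The rank-aware formula is easier to see in code. The sketch below takes a list of 0/1 relevance labels in rank order (in RAGAS those labels come from an LLM judgment; here they are given by hand) and reproduces the worked example. The name `context_precision_score` is illustrative.

```python
def context_precision_score(relevance: list[int]) -> float:
    """Mean of precision@k over the ranks that hold a relevant chunk."""
    hits = 0        # relevant chunks seen so far
    total = 0.0     # running sum of precision@k at relevant ranks
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


print(round(context_precision_score([1, 0, 1, 0]), 2))  # the example above
print(round(context_precision_score([0, 1, 0, 1]), 2))  # same chunks, ranked worse
```

Both lists contain two relevant chunks out of four, yet the second scores lower because the relevant chunks sit at ranks 2 and 4. That is the effect a Reranker is meant to fix.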

Answer Correctness

Answer Correctness = weights[0] * F1_score + weights[1] * semantic_similarity
                   (default weights = [0.75, 0.25])

F1_score: factual overlap between answer and ground_truth (TP / (TP + 0.5*(FP+FN)))
semantic_similarity: Embedding cosine similarity between answer and ground_truth

Intuition: the answer must be factually accurate (F1), and how close its wording is to the reference also counts (Embedding similarity)
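A small numeric sketch of the weighted sum above. The TP/FP/FN counts and the similarity value are made-up inputs; in RAGAS they come from LLM statement classification and an embedding model. The function names are illustrative.

```python
def factual_f1(tp: int, fp: int, fn: int) -> float:
    """F1 over factual statements: TP / (TP + 0.5 * (FP + FN))."""
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom else 0.0


def answer_correctness_score(tp, fp, fn, semantic_similarity, weights=(0.75, 0.25)):
    """Weighted sum of factual F1 and embedding similarity."""
    return weights[0] * factual_f1(tp, fp, fn) + weights[1] * semantic_similarity


# 3 facts match the reference, 1 fact is wrong, 1 reference fact is missing.
score = answer_correctness_score(tp=3, fp=1, fn=1, semantic_similarity=0.9)
print(f"answer_correctness = {score:.4f}")
```

With these inputs F1 is 3 / (3 + 1) = 0.75, so the score is 0.75 × 0.75 + 0.25 × 0.9 = 0.7875. Fluent wording that matches the reference cannot compensate much for missing or wrong facts.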

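Noise Sensitivity

Noise Sensitivity = incorrect claims in the answer / total claims in the answer

RAGAS checks each claim in the answer against ground_truth and against the retrieved chunks, so this metric needs ground_truth. It reflects how often the system is led into wrong statements by what it retrieved:
- Low Noise Sensitivity: the Prompt constraints work and the model ignores unrelated content
- High Noise Sensitivity: the model is easily led astray by distracting chunks; strengthen the Prompt constraints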

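3. Evaluating Without ground_truth

Building high-quality ground_truth takes a lot of manual effort. When you cannot provide it, the following alternatives are available:

3.1 Use Only Reference-Free Metrics

from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# questions, answers, contexts: parallel lists collected from your RAG system
# (contexts holds one list of chunk strings per question)
eval_dataset_no_gt = Dataset.from_dict({
    "question": questions,
    "answer":   answers,
    "contexts": contexts,
    # no ground_truth column
})

result = evaluate(
    dataset=eval_dataset_no_gt,
    metrics=[faithfulness, answer_relevancy],   # reference-free metrics only
    llm=ChatOpenAI(model="gpt-4o-mini"),
    embeddings=OpenAIEmbeddings(model="text-embedding-3-small"),
)
print(result)
# {'faithfulness': 0.84, 'answer_relevancy': 0.89}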

3.2 LLM-as-Judge: Custom Evaluation Dimensions

When the built-in RAGAS metrics do not cover a business requirement, you can implement a custom evaluator with an LLM and have it score RAG output against your own criteria:

import statistics

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import JsonOutputParser

judge_prompt = ChatPromptTemplate.from_messages([
    (
        "system",
        "You are a RAG quality evaluation expert. Score the result against the criteria below.\n\n"
        "Criteria (0~10 each):\n"
        "1. completeness: does the answer cover every aspect of the question\n"
        "2. conciseness: is the answer concise, with no redundant information\n"
        "3. groundedness: is the answer well supported by the context\n\n"
        "Output JSON:\n"
        '{{"completeness": score, "conciseness": score, "groundedness": score, '
        '"reasoning": "brief justification"}}'
    ),
    (
        "human",
        "Question: {question}\n\n"
        "Retrieved context:\n{context}\n\n"
        "RAG answer: {answer}"
    ),
])

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
judge_chain = judge_prompt | llm | JsonOutputParser()

def llm_judge_eval(questions, answers, contexts_list):
    """Score RAG outputs on several dimensions with an LLM"""
    scores = []
    for q, a, ctx in zip(questions, answers, contexts_list):
        context_str = "\n\n---\n\n".join(ctx)
        score = judge_chain.invoke({
            "question": q,
            "context": context_str[:2000],  # truncate (by characters) to stay under the token limit
            "answer": a,
        })
        scores.append(score)

    # Summary statistics
    for dim in ["completeness", "conciseness", "groundedness"]:
        vals = [s[dim] for s in scores if dim in s]
        if not vals:  # the judge never returned this key
            continue
        print(f"{dim:<15} mean={statistics.mean(vals):.2f}  "
              f"min={min(vals)}  max={max(vals)}")
    return scores

scores = llm_judge_eval(questions, answers, contexts_list)

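⚠️ LLM-as-Judge has known biases. Position bias: some models favor whichever option is shown first. Verbosity bias: in A/B comparisons some models favor the longer answer. When evaluating, randomize the order of the question-answer pairs and score each sample twice, then average.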
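One common way to apply that advice in a pairwise A/B comparison is to judge each pair twice with the presentation order swapped and average the two results. The sketch below assumes a hypothetical `judge(question, first, second)` callable that returns the probability that the answer shown first is better; `biased_stub` stands in for a real LLM call so the example runs offline.

```python
from statistics import mean


def debiased_preference(judge, question: str, answer_a: str, answer_b: str) -> float:
    """Preference for answer A (0..1), averaged over both presentation orders."""
    a_first = judge(question, answer_a, answer_b)   # A shown in the first slot
    b_first = judge(question, answer_b, answer_a)   # B shown in the first slot
    return mean([a_first, 1.0 - b_first])


def biased_stub(question: str, first: str, second: str) -> float:
    # Stand-in for an LLM judge: the answers are equally good (0.5),
    # but the judge favours the first slot by 0.2.
    return 0.5 + 0.2


pref = debiased_preference(biased_stub, "q", "answer A", "answer B")
print(f"preference for A = {pref:.2f}")
```

A single call would report 0.70 in favour of whichever answer came first; averaging over both orders cancels the slot bias and returns 0.50. It does not remove verbosity bias, which needs a separate control such as capping answer length.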

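4. Building the Evaluation Dataset

RAGAS evaluation needs the following data:

from datasets import Dataset

# Minimal evaluation dataset structure
eval_data = {
    "question":    [...],   # list of user questions
    "answer":      [...],   # answers generated by the RAG system
    "contexts":    [...],   # retrieved chunks (one list per question)
    "ground_truth":[...],   # reference answers (needed only by reference-based metrics such as
                            # Context Recall, Context Precision, and Answer Correctness)
}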

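4.1 Option 1: Build It by Hand (Small Scale, High Quality)

Suited to precise evaluation of core scenarios; 20~50 items are enough:

# Hand-written test cases
manual_eval_data = [
    {
        "question": "What is the pipe operator in LangChain's LCEL?",
        "ground_truth": "LCEL uses the | operator to chain Runnable components; "
                        "the output of the left component automatically becomes the input of the right one. "
                        "This is implemented by overloading Python's __or__ method.",
    },
    {
        "question": "What is the difference between ChatPromptTemplate and PromptTemplate?",
        "ground_truth": "PromptTemplate is for single-turn text input and fits plain-text LLMs. "
                        "ChatPromptTemplate is for multi-turn chat, is made up of several messages (system/human/ai), "
                        "fits Chat Models, and is the everyday default.",
    },
    # ...more test cases
]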

4.2 Option 2: Generate It with an LLM (Large Scale, Fast)

Suited to covering many knowledge points quickly, but the quality needs manual spot checks:


from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, TextLoader

# Load the documents
loader = DirectoryLoader("./docs/", glob="**/*.md", loader_cls=TextLoader)
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
chunks = splitter.split_documents(docs)

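# Use the LLM to generate question-answer pairs from each chunk
generate_prompt = ChatPromptTemplate.from_messages([
    (
        "system",
        "You are an expert at generating evaluation datasets. From the given document excerpt, "
        "generate 2 representative question-answer pairs.\n\n"
        "Requirements:\n"
        "- The answer to each question must be found in the excerpt\n"
        "- The questions should cover the core knowledge points of the document\n"
        "- The answers should be concise, accurate, and based directly on the document\n\n"
        "Output strictly as a JSON array:\n"
        '[{{"question": "...", "ground_truth": "..."}}, ...]'
    ),
    ("human", "Document excerpt:\n\n{chunk}"),
])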

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3)
parser = JsonOutputParser()
generate_chain = generate_prompt | llm | parser

# Generate in batch (sample a subset of chunks to keep the cost down)
import random
sampled_chunks = random.sample(chunks, min(30, len(chunks)))

all_qa_pairs = []
for chunk in sampled_chunks:
    try:
        qa_pairs = generate_chain.invoke({"chunk": chunk.page_content})
        all_qa_pairs.extend(qa_pairs)
    except Exception as e:
        print(f"Generation failed: {e}")
        continue

print(f"Generated {len(all_qa_pairs)} question-answer pairs")

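⚠️ Roughly 10%~20% of auto-generated pairs tend to be low quality (questions too trivial, answers wrong). Before use, manually spot-check 10~15 of them and filter out the clearly bad samples.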
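Before the manual spot check, a cheap structural filter can remove the most obvious failures. The sketch below is one possible pre-filter, not part of RAGAS or LangChain; the function name and the character thresholds are assumptions, and the thresholds in particular should be tuned to your language and domain (Chinese text needs much smaller values than English).

```python
def filter_qa_pairs(pairs, min_question_chars=8, min_answer_chars=15):
    """Drop malformed, too-short, or duplicate question/answer pairs."""
    seen = set()
    kept = []
    for pair in pairs:
        if not isinstance(pair, dict):          # the LLM returned something unexpected
            continue
        question = str(pair.get("question", "")).strip()
        truth = str(pair.get("ground_truth", "")).strip()
        if len(question) < min_question_chars or len(truth) < min_answer_chars:
            continue
        key = question.lower()                  # case-insensitive duplicate check
        if key in seen:
            continue
        seen.add(key)
        kept.append({"question": question, "ground_truth": truth})
    return kept


raw = [
    {"question": "What does the | operator do in LCEL?",
     "ground_truth": "It chains Runnable components into a RunnableSequence."},
    {"question": "what does the | operator do in LCEL?",
     "ground_truth": "Same question as above, differing only in letter case."},
    {"question": "LCEL?", "ground_truth": "The question is too short to be useful."},
    {"question": "How are parallel branches merged?", "ground_truth": "dict"},
    "not a dict",
]
clean = filter_qa_pairs(raw)
print(f"kept {len(clean)} of {len(raw)} pairs")
```

This only catches structural problems. Wrong answers and trivially easy questions still need the human review described above.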

5. Running the RAGAS Evaluation

5.1 Building the Full Evaluation Pipeline


from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)
from datasets import Dataset
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

load_dotenv()

# --------- The RAG system under evaluation ---------
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

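rag_prompt = ChatPromptTemplate.from_messages([
    (
        "system",
        "Answer strictly from the context below and do not use information outside it.\n"
        "If the context is not enough to answer, output: 'No relevant information found in the documents.'\n\n"
        "Context: {context}"
    ),
    ("human", "{question}"),
])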

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
parser = StrOutputParser()

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {
        "context":  retriever | format_docs,
        "question": RunnablePassthrough(),
    }
    | rag_prompt | llm | parser
)

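# --------- Collect the data needed for evaluation ---------
answer_chain = rag_prompt | llm | parser  # generation step only

def run_rag_and_collect(questions: list[str]) -> dict:
    """Run RAG and collect the answer and contexts for each question"""
    answers = []
    contexts = []

    for question in questions:
        # Retrieve once, so the recorded contexts are exactly what the LLM saw
        retrieved_docs = retriever.invoke(question)
        context_texts = [doc.page_content for doc in retrieved_docs]

        # Generate the answer from those same chunks
        answer = answer_chain.invoke({
            "context": format_docs(retrieved_docs),
            "question": question,
        })

        answers.append(answer)
        contexts.append(context_texts)

    return {"answer": answers, "contexts": contexts}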

# --------- Prepare the evaluation dataset ---------
questions = [qa["question"] for qa in all_qa_pairs]
ground_truths = [qa["ground_truth"] for qa in all_qa_pairs]

rag_results = run_rag_and_collect(questions)

eval_dataset = Dataset.from_dict({
    "question":     questions,
    "answer":       rag_results["answer"],
    "contexts":     rag_results["contexts"],
    "ground_truth": ground_truths,
})

# --------- Run the RAGAS evaluation ---------
result = evaluate(
    dataset=eval_dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_recall,
        context_precision,
    ],
    llm=ChatOpenAI(model="gpt-4o-mini"),
    embeddings=OpenAIEmbeddings(model="text-embedding-3-small"),
)

print(result)
# {'faithfulness': 0.82, 'answer_relevancy': 0.91,
#  'context_recall': 0.74, 'context_precision': 0.68}

5.2 Converting to a DataFrame for Deeper Analysis


import pandas as pd

df = result.to_pandas()
print(df.columns.tolist())
# ['question', 'answer', 'contexts', 'ground_truth',
#  'faithfulness', 'answer_relevancy', 'context_recall', 'context_precision']

# --------- Overall distribution ---------
print(df[["faithfulness", "answer_relevancy", "context_recall", "context_precision"]].describe())

# --------- Worst-performing questions per metric ---------
for metric in ["faithfulness", "context_recall", "context_precision"]:
    worst = df.nsmallest(3, metric)[["question", metric]]
    print(f"\n3 lowest {metric} scores:")
    for _, row in worst.iterrows():
        print(f"  [{row[metric]:.3f}] {row['question']}")

# --------- Spot systemic problems: questions where several metrics are low at once ---------
df["avg_score"] = df[["faithfulness", "answer_relevancy",
                       "context_recall", "context_precision"]].mean(axis=1)
critical = df[df["avg_score"] < 0.5].sort_values("avg_score")
print(f"\nQuestions with an average score < 0.5: {len(critical)}")
print(critical[["question", "avg_score"]].to_string())

# --------- Correlation: does low context_precision go with low faithfulness? ---------
correlation = df[["context_precision", "faithfulness"]].corr()
print(f"\nCorrelation between context_precision and faithfulness: {correlation.iloc[0,1]:.3f}")
# A positive correlation suggests that retrieval precision affects faithfulness
# (noisy chunks can trigger hallucination); it does not prove causation

5.3 Visualizing the Results

import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams['font.family'] = 'Arial Unicode MS'  # CJK-capable font on macOS; use one installed on your system

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
metrics = ["faithfulness", "answer_relevancy", "context_recall", "context_precision"]
colors = ["#4C9BE8", "#E87C4C", "#4CE87C", "#E84C9B"]

for ax, metric, color in zip(axes.flatten(), metrics, colors):
    ax.hist(df[metric].dropna(), bins=20, color=color, edgecolor="white", alpha=0.85)
    ax.axvline(df[metric].mean(), color="black", linestyle="--", linewidth=1.5,
               label=f"mean={df[metric].mean():.3f}")
    ax.set_title(metric)
    ax.set_xlabel("Score")
    ax.set_ylabel("Count")
    ax.legend()
    ax.set_xlim(0, 1)

plt.suptitle("RAG Evaluation Score Distribution", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.savefig("rag_eval_distribution.png", dpi=150)
plt.show()

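6. A/B Comparing RAG Strategies

The biggest payoff of quantitative evaluation is that it supports strategy comparison. The following shows how to compare two RAG configurations systematically:
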

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision
from datasets import Dataset

def evaluate_rag_config(retriever, rag_chain, questions, ground_truths, name: str):
    """Evaluate one RAG configuration and return its metric scores"""
    answers, contexts = [], []

    for question in questions:
        retrieved_docs = retriever.invoke(question)
        contexts.append([doc.page_content for doc in retrieved_docs])
        answers.append(rag_chain.invoke(question))

    dataset = Dataset.from_dict({
        "question":     questions,
        "answer":       answers,
        "contexts":     contexts,
        "ground_truth": ground_truths,
    })

    result = evaluate(
        dataset=dataset,
        metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
        llm=ChatOpenAI(model="gpt-4o-mini"),
        embeddings=OpenAIEmbeddings(model="text-embedding-3-small"),
    )

    print(f"\n=== {name} ===")
    for metric, score in result.items():
        print(f"  {metric:<25} {score:.4f}")
    return result

# --------- Config A: basic RAG (chunk=500, k=4) ---------
retriever_a = Chroma(
    persist_directory="./chroma_db_a",
    embedding_function=embeddings,
).as_retriever(search_kwargs={"k": 4})

# --------- Config B: advanced RAG (chunk=1000, k=10 + Reranking top_n=3) ---------
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

reranker = CrossEncoderReranker(
    model=HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-v2-m3"),
    top_n=3,
)
retriever_b = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=Chroma(
        persist_directory="./chroma_db_b",
        embedding_function=embeddings,
    ).as_retriever(search_kwargs={"k": 10}),
)

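# --------- Run the comparison ---------
# Each configuration needs its own chain: the global rag_chain is wired to the original
# retriever, so reusing it would generate both sets of answers from the same chunks.
def build_rag_chain(retriever):
    return (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | rag_prompt | llm | parser
    )

result_a = evaluate_rag_config(retriever_a, build_rag_chain(retriever_a), questions, ground_truths, "Basic RAG")
result_b = evaluate_rag_config(retriever_b, build_rag_chain(retriever_b), questions, ground_truths, "Advanced RAG + Reranking")

# --------- Print the comparison table ---------
metrics = ["faithfulness", "answer_relevancy", "context_recall", "context_precision"]
print("\n=== A/B Comparison ===")
print(f"{'Metric':<25} {'Basic RAG':>12} {'Advanced RAG':>14} {'Delta':>8}")
print("-" * 62)
for m in metrics:
    a, b = result_a[m], result_b[m]
    delta = b - a
    print(f"{m:<25} {a:>12.4f} {b:>14.4f} {delta:>+8.4f}")  # "+" forces an explicit sign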

Sample output:


=== A/B Comparison ===
Metric                    Basic RAG  Advanced RAG    Delta
--------------------------------------------------------------
faithfulness               0.7823        0.8541   +0.0718
answer_relevancy           0.8912        0.9034   +0.0122
context_recall             0.6341        0.7892   +0.1551
context_precision          0.5823        0.7134   +0.1311
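A delta between two averages says little on its own when the evaluation set is small. One way to check whether an improvement is more than noise is a percentile bootstrap over the per-question deltas. The sketch below uses only the standard library; the score lists are made-up illustrative values, and in practice they would be the per-row metric columns of the two evaluation results for the same questions in the same order.

```python
import random
from statistics import mean


def bootstrap_delta_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean per-question delta (B - A)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both configurations must be scored on the same questions")
    deltas = [b - a for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)   # fixed seed so the result is reproducible
    resampled_means = sorted(
        mean(rng.choices(deltas, k=len(deltas))) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(deltas), lower, upper


# Hypothetical per-question context_recall for configs A and B.
recall_a = [0.60, 0.70, 0.50, 0.80, 0.65, 0.70, 0.55, 0.75]
recall_b = [0.72, 0.74, 0.66, 0.81, 0.70, 0.82, 0.60, 0.79]

observed, lower, upper = bootstrap_delta_ci(recall_a, recall_b)
print(f"mean delta = {observed:+.4f}, 95% CI = [{lower:+.4f}, {upper:+.4f}]")
```

If the interval excludes zero, the improvement is unlikely to be an artifact of which questions happened to be in the set. If it straddles zero, collect more questions before acting on the delta.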

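7. Building a Continuous Evaluation System

A single evaluation is a snapshot; only continuous evaluation reveals degradation. Below is a lightweight continuous evaluation setup suited to production.

7.1 Online Sampled Evaluation

Evaluating every request is too expensive. Instead, sample a fraction of requests and evaluate them asynchronously in batches:
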

import random
import asyncio
from datetime import datetime
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from datasets import Dataset

class RAGEvaluationSampler:
    """Sample RAG requests and evaluate them in batches"""

    def __init__(self, rag_chain, retriever, sample_rate: float = 0.05):
        self.rag_chain = rag_chain
        self.retriever = retriever
        self.sample_rate = sample_rate  # fraction of requests to sample (default 5%)
        self.pending_samples = []

    def should_sample(self) -> bool:
        return random.random() < self.sample_rate

    def record(self, question: str, answer: str, contexts: list[str]):
        """Record one sample for later batch evaluation"""
        self.pending_samples.append({
            "question": question,
            "answer": answer,
            "contexts": contexts,
            "timestamp": datetime.now().isoformat(),
        })

    def flush_and_evaluate(self, min_batch: int = 10):
        """Run a batch evaluation once enough samples have accumulated"""
        if len(self.pending_samples) < min_batch:
            return None

        batch = self.pending_samples[:min_batch]
        self.pending_samples = self.pending_samples[min_batch:]

        dataset = Dataset.from_dict({
            "question": [s["question"] for s in batch],
            "answer":   [s["answer"] for s in batch],
            "contexts": [s["contexts"] for s in batch],
        })

        result = evaluate(
            dataset=dataset,
            metrics=[faithfulness, answer_relevancy],
            llm=ChatOpenAI(model="gpt-4o-mini"),
            embeddings=OpenAIEmbeddings(model="text-embedding-3-small"),
        )

        # Write the result to your monitoring system (printed here; in practice send it to Prometheus/Grafana or similar)
        print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M')}] "
              f"faithfulness={result['faithfulness']:.3f}  "
              f"answer_relevancy={result['answer_relevancy']:.3f}  "
              f"(n={min_batch})")
        return result

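# Usage example
sampler = RAGEvaluationSampler(rag_chain, retriever, sample_rate=0.1)

def handle_question(question: str) -> str:
    """Handle a user question and sample it for evaluation"""
    retrieved_docs = retriever.invoke(question)
    contexts = [doc.page_content for doc in retrieved_docs]
    answer = rag_chain.invoke(question)

    # Record the sample
    if sampler.should_sample():
        sampler.record(question, answer, contexts)
        # NOTE: this call runs inline and blocks the request while RAGAS evaluates;
        # in production, trigger it from a background worker or a scheduled job instead
        sampler.flush_and_evaluate(min_batch=10)

    return answer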

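7.2 Version Comparison Board

Record the evaluation result of every iteration to build a version history that makes optimization progress easy to track:
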

import json
import os
from datetime import datetime

EVAL_HISTORY_FILE = "./eval_history.jsonl"

def save_eval_result(version: str, config: dict, result: dict):
    """Append an evaluation result to the history file"""
    record = {
        "version": version,
        "timestamp": datetime.now().isoformat(),
        "config": config,
        "metrics": {k: round(v, 4) for k, v in result.items()},
    }
    with open(EVAL_HISTORY_FILE, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

def print_eval_history():
    """Print the version history as a comparison table"""
    if not os.path.exists(EVAL_HISTORY_FILE):
        print("No evaluation history yet")
        return

    records = []
    with open(EVAL_HISTORY_FILE, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                records.append(json.loads(line))

    def fmt(value, width):
        # Older records may lack a metric; print "-" instead of failing on the float format
        return f"{value:>{width}.4f}" if isinstance(value, (int, float)) else f"{'-':>{width}}"

    print(f"\n{'Version':<12} {'Time':<20} {'Faithfulness':>14} {'Ans.Relevancy':>14} "
          f"{'Ctx.Recall':>11} {'Ctx.Precision':>14}")
    print("-" * 86)
    for r in records:
        m = r["metrics"]
        print(f"{r['version']:<12} {r['timestamp'][:19]:<20} "
              f"{fmt(m.get('faithfulness'), 14)} "
              f"{fmt(m.get('answer_relevancy'), 14)} "
              f"{fmt(m.get('context_recall'), 11)} "
              f"{fmt(m.get('context_precision'), 14)}")

# Record after every iteration
save_eval_result(
    version="v1.2-reranker",
    config={"chunk_size": 1000, "k": 10, "reranker": "bge-reranker-v2-m3", "top_n": 3},
    result=result_b,
)
print_eval_history()

Sample output:

Version        Time                 Faithfulness  Ans.Relevancy  Ctx.Recall  Ctx.Precision
--------------------------------------------------------------------------------------
v1.0-baseline  2026-04-20 10:32:00         0.7823         0.8912      0.6341         0.5823
v1.1-chunk1000 2026-04-22 14:15:00         0.7991         0.8934      0.7102         0.6241
v1.2-reranker  2026-04-24 09:40:00         0.8541         0.9034      0.7892         0.7134
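Once a history like this exists, regressions can be flagged automatically instead of read off the table. The sketch below compares each version with the one before it and reports any metric that dropped by more than a tolerance. The record layout mirrors the history format above, the `tolerance` default is an arbitrary assumption, and the `v1.3-hypothetical` entry is invented purely to show a regression.

```python
def find_regressions(history, tolerance=0.02):
    """Return (version, metric, delta) for every metric that dropped by more than `tolerance`."""
    regressions = []
    for prev, curr in zip(history, history[1:]):
        for metric, value in curr["metrics"].items():
            before = prev["metrics"].get(metric)
            if before is not None and before - value > tolerance:
                regressions.append((curr["version"], metric, round(value - before, 4)))
    return regressions


history = [
    {"version": "v1.0-baseline",
     "metrics": {"faithfulness": 0.7823, "context_recall": 0.6341}},
    {"version": "v1.1-chunk1000",
     "metrics": {"faithfulness": 0.7991, "context_recall": 0.7102}},
    # Invented entry: recall improves slightly but faithfulness falls.
    {"version": "v1.3-hypothetical",
     "metrics": {"faithfulness": 0.7500, "context_recall": 0.7150}},
]

for version, metric, delta in find_regressions(history):
    print(f"{version}: {metric} changed by {delta:+.4f}")
```

A check like this fits naturally into CI: fail the build when the list is non-empty, so a change that trades faithfulness for recall is a deliberate decision and not an accident.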

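Low metric	Root cause	Optimization direction
`faithfulness` low	The LLM introduces hallucinations into the answer	Tighten the context constraint in the Prompt; use a larger model
`answer_relevancy` low	The answer is off topic or padded	Tighten the output format constraints in the Prompt; add length/format limits
`context_recall` low	Retrieval missed key content	Increase `k`; improve the chunking strategy; widen recall with Multi-query
`context_precision` low	Many irrelevant chunks were retrieved	Decrease `k`; add a score_threshold; re-rank with Reranking
`context_recall` and `context_precision` both low	Poor chunk splitting	Revisit chunk_size and separators (see Part 4, section 3.3)

📝 In real projects, context_recall and context_precision pull against each other: raising k improves recall but hurts precision, and vice versa. Reranking is one of the few techniques that improves both (a larger k raises recall, then re-ranking restores precision).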

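8. Common Pitfalls and Best Practices

Pitfall 1: the evaluation set is too small to support conclusions

# ❌ Only 5~10 test items: scores swing widely and do not reflect real quality
# faithfulness=0.8 on one set of 5 may become 0.5 on another set of 5

# ✅ At least 30 items, covering different question types
# - Factual questions ("What is X")
# - How-to questions ("How do I do X")
# - Comparison questions ("What is the difference between X and Y")
# - Boundary questions (the documents contain no answer)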
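A rough calculation shows why 5 items are not enough. The sketch below treats each question as a pass/fail trial with pass rate p, which is a simplifying assumption (RAGAS scores are continuous, so real variance is usually somewhat lower), and prints the approximate 95% margin of error of the mean for several sample sizes.

```python
import math


def standard_error(p: float, n: int) -> float:
    """Standard error of the mean of n pass/fail scores with pass rate p."""
    return math.sqrt(p * (1 - p) / n)


for n in (5, 30, 100):
    margin = 1.96 * standard_error(0.8, n)   # normal-approximation 95% margin
    print(f"n={n:>3}: mean 0.80 ± {margin:.2f}")
```

Under this assumption a set of 5 items gives a margin of about ±0.35, so 0.8 and 0.5 are statistically indistinguishable. At 30 items the margin shrinks to about ±0.14, and at 100 to about ±0.08.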

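Pitfall 2: the evaluation set does not match real user questions

# ❌ Testing only "ideal questions" and ignoring vague or colloquial phrasing from real users
test_cases = [
    "What is the pipe operator in LangChain's LCEL?",  # too polished
]

# ✅ Include both polished and realistic phrasing
test_cases = [
    "What is the pipe operator in LangChain's LCEL?",     # polished
    "what's that vertical bar thing in langchain for",    # colloquial
    "how do I string a chain together",                   # vague
]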

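Pitfall 3: poor ground_truth distorts the metrics

# ❌ ground_truth is too terse, so reference-based scores stop being meaningful
ground_truth = "connect with |"
# Too short: it contains almost nothing for the retrieved chunks or the answer to be checked against

# ✅ ground_truth should fully cover the key knowledge points
ground_truth = (
    "LCEL uses the | operator (an overload of Python's __or__ method) to chain Runnable components "
    "into a RunnableSequence. The output of each component automatically becomes the input of the next. "
    "Every Runnable supports invoke, batch, stream, and ainvoke."
)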

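Pitfall 4: RAGAS evaluation costs more LLM calls than expected

# RAGAS makes several LLM calls per sample (each metric is evaluated independently)
# 30 samples × 4 metrics × about 3 LLM calls on average ≈ 360 LLM calls

# ✅ During development: use gpt-4o-mini to keep evaluation cheap (good enough for iteration)
result = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
    llm=ChatOpenAI(model="gpt-4o-mini"),  # avoid gpt-4o here
)

# ✅ Run the full evaluation only at key points (before release, after a major version change)
# For day-to-day iteration, use the hit-rate test (see Part 4, section 3.3, step 2) instead of a full RAGAS run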
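The back-of-the-envelope estimate above is easy to turn into a helper for budgeting a run in advance. In this sketch the average of 3 calls per metric comes from the text, while the tokens-per-call figure and the price are placeholders to replace with your own measurements and your provider's current pricing.

```python
def estimate_llm_calls(n_samples: int, n_metrics: int, calls_per_metric: float = 3) -> float:
    """Approximate number of LLM calls for one RAGAS run."""
    return n_samples * n_metrics * calls_per_metric


def estimate_cost_usd(n_calls: float, tokens_per_call: float, usd_per_million_tokens: float) -> float:
    """Approximate cost, given an average token count per call and a flat token price."""
    return n_calls * tokens_per_call * usd_per_million_tokens / 1_000_000


calls = estimate_llm_calls(n_samples=30, n_metrics=4)
# Placeholder numbers: 1,500 tokens per call at 0.50 USD per million tokens.
cost = estimate_cost_usd(calls, tokens_per_call=1500, usd_per_million_tokens=0.50)
print(f"about {calls:.0f} LLM calls, roughly ${cost:.2f} per run")
```

The number of calls grows linearly with both the sample count and the metric count, so dropping the reference-based metrics from daily runs cuts the bill roughly in half.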

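9. Interpreting the Metrics and Choosing What to Optimize

Once you have RAGAS scores, the key question is not "are the scores high" but "where are they low, and where should I start".

9.1 Locating the Root Cause of Each Low Score

Low metric	Most likely root cause	What to check first
`Context Recall` low	Relevant documents were not retrieved	k too small; chunks too large, diluting the relevant content; no Multi-query; Embedding model mismatched with the domain
`Context Precision` low	Too many irrelevant chunks retrieved	k too large; no score_threshold; no Reranking; chunks too fine-grained, adding noise
`Faithfulness` low	The LLM generated content the context does not support (hallucination)	System Prompt lacks an "answer only from the provided documents" constraint; temperature too high; context so long that the model drifts
`Answer Relevancy` low	The answer is unrelated to the question or too long	Prompt lacks format constraints; the model tends to add extra explanation; no answer length limit
`Answer Correctness` low	The answer is factually wrong	ground_truth incomplete (check your own data); Context Recall low (upstream problem); generation model not capable enough
`Noise Sensitivity` high	Irrelevant chunks interfered with generation	Reranking does not filter noise; score_threshold too low; Prompt does not tell the model to ignore irrelevant content

9.2 The Tension Between Recall and Precision

Context Recall and Context Precision naturally pull in opposite directions: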

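   k increases ──────────────────────────────────────────►

   Context Recall    ████████████████████████████████
   (keeps rising)      the more you retrieve, the less likely you miss key content

   Context Precision ████████████
   (keeps falling)     the more you retrieve, the larger the share of noisy chunks

   Sweet spot            ↑
                       k=5~8 is usually the sweet spot (depends on the dataset)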

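This is why simply raising k is not a good strategy. The value of Reranking is that, after recalling with a high k, it uses re-ranking to pull precision back up:

   Strategy comparison (illustrative):

   k=3,  no Reranking:   Recall=0.62  Precision=0.78  <- precise but prone to misses
   k=10, no Reranking:   Recall=0.85  Precision=0.52  <- high recall but noisy
   k=10, with Reranking: Recall=0.84  Precision=0.79  <- both balanced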

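9.3 A Systematic Optimization Path

When a metric is off, work through the checks in this order and avoid tuning parameters at random:

   Step 1: check Context Recall
   ├── < 0.7? Fix retrieval recall first
   │   ├── Increase k (from 3→5→8)
   │   ├── Add Multi-query (Part 5)
   │   └── Review chunk_size (split if too large, merge if too small)
   └── ≥ 0.7? Continue to Step 2

   Step 2: check Context Precision
   ├── < 0.6? Fix the noise problem
   │   ├── Add a score_threshold to filter low-scoring chunks
   │   ├── Introduce Reranking (Cross-encoder)
   │   └── Decrease k, or balance with Reranking together with Step 1
   └── ≥ 0.6? Continue to Step 3

   Step 3: check Faithfulness
   ├── < 0.8? Tighten the generation-side constraints
   │   ├── Strengthen the System Prompt: "Answer only from the documents below; say so when you cannot answer"
   │   ├── Lower temperature (if it is already 0, consider switching models)
   │   └── Shorten the context; keep only the top-3 after Reranking
   └── ≥ 0.8? Continue to Step 4

   Step 4: check Answer Relevancy
   ├── < 0.8? Improve the output format
   │   ├── State the expected answer length and format in the Prompt
   │   └── Add a negative constraint: "Do not add information beyond the scope of the question"
   └── ≥ 0.8? The system is healthy; record the baseline scores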
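The decision tree above can be encoded as a small triage function that always returns the first failing step, which keeps the order of checks from being skipped under time pressure. The thresholds are the ones in the tree; the function name and the action strings are illustrative, and the two score dictionaries reuse the A/B sample output from section 6.

```python
# (metric, threshold, first action to try), in the order the tree prescribes
TRIAGE_ORDER = [
    ("context_recall",    0.7, "Fix recall: raise k, add Multi-query, review chunk_size"),
    ("context_precision", 0.6, "Fix noise: add score_threshold, add Reranking, lower k"),
    ("faithfulness",      0.8, "Tighten generation: stricter System Prompt, shorter context"),
    ("answer_relevancy",  0.8, "Tighten output: specify length/format, add negative constraints"),
]


def next_action(scores: dict) -> tuple:
    """Return (metric, action) for the first metric below its threshold, else (None, message)."""
    for metric, threshold, action in TRIAGE_ORDER:
        if scores[metric] < threshold:
            return metric, action
    return None, "All checks pass: record these scores as the baseline"


basic = {"faithfulness": 0.7823, "answer_relevancy": 0.8912,
         "context_recall": 0.6341, "context_precision": 0.5823}
advanced = {"faithfulness": 0.8541, "answer_relevancy": 0.9034,
            "context_recall": 0.7892, "context_precision": 0.7134}

print(next_action(basic))
print(next_action(advanced))
```

For the basic configuration the function points at Context Recall even though precision and faithfulness are also below threshold, because fixing retrieval first usually changes the downstream numbers.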

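9.4 Summary Table

Stage	Core metric	Healthy baseline	What to check first when low
Retrieval recall	`Context Recall`	≥ 0.75	k, chunking strategy, Multi-query
Retrieval precision	`Context Precision`	≥ 0.65	score_threshold, Reranking, smaller k
Generation faithfulness	`Faithfulness`	≥ 0.85	Prompt constraints, model choice, shorter context
Generation relevance	`Answer Relevancy`	≥ 0.80	Prompt format constraints, answer length control
End to end	`Answer Correctness`	≥ 0.70	Combined work on retrieval recall and generation faithfulness
Noise resistance	`Noise Sensitivity`	≤ 0.20	Reranking, score_threshold

💡 The "healthy baseline" values above are rules of thumb and vary widely by domain and task type. Establish a historical baseline for your own system first, then judge improvements by the delta instead of chasing absolute scores.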

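10. Summary

🎯 The value of RAG evaluation is not in "getting a score" but in giving every optimization a quantifiable basis. Establish baseline scores → change the strategy → re-evaluate → compare the delta: this loop is the most reliable path for taking a RAG system from "works" to "works well".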

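Next Up

This wraps up the three RAG articles. The next major module is about letting the LLM use tools.

Part 7, "Tools: Giving the LLM Hands and Feet", will cover how to define tools, the mechanics behind Tool Calling / Function Calling, structured tool inputs, and error handling, which together form the foundation for building Agents.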