28. Paper RAG Agent Dev Log: Fixing LLM Rerank Parsing, Fallback, and Verifiability

Preface

The rerank built earlier in this project is a fragile LLM black box. Looking back at the code, rag_system.py:rerank asks the LLM to output an index list like [2,0,1] and then parses it with ast.literal_eval. That approach has two problems:

  1. If the LLM returns the list inside a ```python code fence, or as prose like "I think the order is [2,0,1]", we fall back straight to the original order (return list(range(len(chunks)))), so the rerank does nothing (see the sketch after this list).
  2. No log is emitted when the fallback fires, so I have no idea how many queries actually failed to rerank.
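
Here is a minimal sketch of that failure mode: ast.literal_eval only accepts a bare Python literal, so any wrapping raises and the silent fallback kicks in.

python
import ast

# Only the bare literal parses; fenced or prose-wrapped outputs raise.
for raw in ["[2, 0, 1]",
            "```python\n[2, 0, 1]\n```",
            "I think the order is [2, 0, 1]."]:
    try:
        print(ast.literal_eval(raw.strip()))
    except (ValueError, SyntaxError) as e:
        print(f"fallback triggered for {raw!r}: {e}")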

Because of these two problems, I can't demonstrate that the rerank is effective at all. Today covers two things:

  1. Make rerank parsing more robust, so the LLM may return wrapped formats such as:
    • a plain [2, 0, 1]

    • prose like "I think the order is [2,0,1]"
    • a ```python code fence around [2,0,1]
  2. Make the rerank fallback log itself. Whenever parsing fails, record explicitly:
    • the raw LLM output
    • why it fell back
    • whether the fallback used the original order

Round 1 improvements

Project layers touched

  • RAG layer: app/rag_system.py
  • Trace layer: context_metrics gains rerank_used, rerank_fallback, and rerank_error

The first pass rewrites rag_system.py:rerank in full:

python
def rerank(self, query, chunks, return_trace=False):
    prompt = f"""You are a ranking assistant.

Query:
{query}

Rank the following passages from most relevant to least relevant.

Passages:
"""

    for i, c in enumerate(chunks):
        prompt += f"\n[{i}] {c}\n"

    prompt += (
        "\nReturn ONLY a Python list of indices in sorted order, like [2,0,1]. "
        "No explanation, no code fences."
    )

    response = client.chat.completions.create(
        model=CHAT_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )

    raw_output = response.choices[0].message.content.strip()

    import ast
    import re

    cleaned = raw_output

    # 1. Clean markdown code fences, such as:
    # ```python
    # [2, 0, 1]
    # ```
    if "```" in cleaned:
        match = re.search(r"```(?:python|json)?\s*(\[.*?\])\s*```", cleaned, re.DOTALL)
        if match:
            cleaned = match.group(1)

    # 2. Extract first bracket list from prose, such as:
    # "I think the order is [2, 0, 1]."
    bracket_match = re.search(r"\[[\d,\s]+\]", cleaned)
    if bracket_match:
        cleaned = bracket_match.group(0)

    rerank_trace = {
        "rerank_used": False,
        "rerank_fallback": False,
        "rerank_error": "",
        "rerank_raw_output": raw_output[:300],
        "rerank_cleaned_output": cleaned[:300],
        "rerank_indices": [],
    }

    try:
        result = ast.literal_eval(cleaned)

        if not isinstance(result, list):
            raise ValueError(f"Rerank output is not a list: {type(result)}")

        if not all(isinstance(i, int) for i in result):
            raise ValueError(f"Rerank output contains non-integer indices: {result}")

        if not all(0 <= i < len(chunks) for i in result):
            raise ValueError(f"Rerank output contains out-of-range indices: {result}")

        # Deduplicate while preserving order
        seen = set()
        deduped = []
        for i in result:
            if i not in seen:
                deduped.append(i)
                seen.add(i)

        # If the LLM returned an incomplete index list, append the missing indices
        # at the end so the size of best_chunks stays stable
        missing = [i for i in range(len(chunks)) if i not in seen]
        final_indices = deduped + missing

        rerank_trace.update({
            "rerank_used": True,
            "rerank_fallback": False,
            "rerank_error": "",
            "rerank_indices": final_indices,
        })

        logger.info(f"[rerank] success, parsed_indices={final_indices[:self.rerank_k]}")

        if return_trace:
            return final_indices, rerank_trace

        return final_indices

    except Exception as e:
        fallback_indices = list(range(len(chunks)))

        rerank_trace.update({
            "rerank_used": False,
            "rerank_fallback": True,
            "rerank_error": str(e),
            "rerank_indices": fallback_indices,
        })

        logger.warning(
            f"[rerank] FALLBACK to original order. "
            f"reason={e}, raw_output={raw_output[:300]}"
        )

        if return_trace:
            return fallback_indices, rerank_trace

        return fallback_indices
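
As a quick sanity check of the dedup-and-fill step, here is a standalone sketch (not how the project factors it) showing that duplicates are dropped and missing indices are appended:

python
def dedup_and_fill(result, n):
    # Standalone copy of the dedup-and-fill step, for illustration only.
    seen, deduped = set(), []
    for i in result:
        if i not in seen:
            deduped.append(i)
            seen.add(i)
    # Append whatever the LLM omitted so the list always covers range(n).
    return deduped + [i for i in range(n) if i not in seen]

assert dedup_and_fill([2, 0, 2], 4) == [2, 0, 1, 3]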

What's different from the previous version is the regex extraction applied to the LLM's response, which catches outputs that wrap the index list, such as "I think the order is [2,0,1]":

python
# 1. Clean markdown code fences, such as:
# ```python
# [2, 0, 1]
# ```
if "```" in cleaned:
    match = re.search(r"```(?:python|json)?\s*(\[.*?\])\s*```", cleaned, re.DOTALL)
    if match:
        cleaned = match.group(1)

# 2. Extract first bracket list from prose, such as:
# "I think the order is [2, 0, 1]."
bracket_match = re.search(r"\[[\d,\s]+\]", cleaned)
if bracket_match:
    cleaned = bracket_match.group(0)
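
Pulled out into a standalone helper (hypothetical; the project keeps these steps inline), the two cleaning passes can be checked against the wrapped formats from the intro:

python
import re

def clean_rerank_output(raw: str) -> str:
    # Hypothetical standalone copy of the two cleaning steps above.
    cleaned = raw.strip()
    if "```" in cleaned:
        match = re.search(r"```(?:python|json)?\s*(\[.*?\])\s*```", cleaned, re.DOTALL)
        if match:
            cleaned = match.group(1)
    bracket_match = re.search(r"\[[\d,\s]+\]", cleaned)
    if bracket_match:
        cleaned = bracket_match.group(0)
    return cleaned

assert clean_rerank_output("```python\n[2, 0, 1]\n```") == "[2, 0, 1]"
assert clean_rerank_output("I think the order is [2, 0, 1].") == "[2, 0, 1]"
assert clean_rerank_output("[2,0,1]") == "[2,0,1]"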

The other addition is rerank_trace, a record of what the rerank actually did:

python
rerank_trace = {
    "rerank_used": False,       # whether the rerank actually took effect
    "rerank_fallback": False,   # whether parsing failed and the fallback fired
    "rerank_error": "",
    "rerank_raw_output": raw_output[:300],
    "rerank_cleaned_output": cleaned[:300],
    "rerank_indices": [],
}
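
For example, if the model answered in free prose with no bracket list anywhere, the trace would look roughly like this (illustrative values, not captured output):

python
# Illustrative trace after a parse failure on 3 chunks (values are made up):
{
    "rerank_used": False,
    "rerank_fallback": True,
    "rerank_error": "invalid syntax (<unknown>, line 1)",
    "rerank_raw_output": "Passage 2 is the most relevant, then passage 0 ...",
    "rerank_cleaned_output": "Passage 2 is the most relevant, then passage 0 ...",
    "rerank_indices": [0, 1, 2],
}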

Then update rag_system.py:ask_with_trace, replacing:

python
texts = [c["text"] for c in retrieved]
sorted_indices = self.rerank(question, texts)
best_chunks = [retrieved[i] for i in sorted_indices[:self.rerank_k]]

with:

python
texts = [c["text"] for c in retrieved]
sorted_indices, rerank_trace = self.rerank(
    question,
    texts,
    return_trace=True
)
best_chunks = [retrieved[i] for i in sorted_indices[:self.rerank_k]]

Finally, merge rerank_trace into context_metrics:

python
context_metrics.update({
    "distance_gate_passed": distance_sufficient,
    "llm_relevance_check": relevance_metrics.get("llm_relevance_check"),
    "llm_relevance_verdict": relevance_metrics.get("llm_relevance_verdict"),
    "llm_relevance_reason": relevance_metrics.get("llm_relevance_reason"),
    "llm_relevance_error": relevance_metrics.get("llm_relevance_error"),
    "rerank_used": rerank_trace.get("rerank_used"),
    "rerank_fallback": rerank_trace.get("rerank_fallback"),
    "rerank_error": rerank_trace.get("rerank_error"),
    "rerank_indices": rerank_trace.get("rerank_indices", [])[:self.rerank_k],
    "rerank_raw_output": rerank_trace.get("rerank_raw_output", ""),
})
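
This is what makes the rerank verifiable: once these fields land in every trace, the fallback rate becomes countable. A sketch, assuming traces are stored one JSON object per line (the file name and layout are assumptions, not the project's actual storage):

python
import json

# Hypothetical aggregation over stored traces (one JSON trace per line assumed).
used = fallback = 0
with open("traces.jsonl") as f:
    for line in f:
        metrics = json.loads(line).get("context_metrics", {})
        used += bool(metrics.get("rerank_used"))
        fallback += bool(metrics.get("rerank_fallback"))
print(f"rerank succeeded: {used}, fell back: {fallback}")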

Test results

| # | Question | Expected Tool | Rerank Used | Rerank Fallback | Distance Gate | LLM Relevance Gate | Final Context Sufficient | Notes |
|---|----------|---------------|-------------|-----------------|---------------|--------------------|--------------------------|-------|
| 1 | What is the main contribution of paper1? | rag | true | false | true | false | false | Rerank parsed successfully, but the relevance gate was conservative and rejected the context. This should be refined later. |
| 2 | What is the difference between paper1 and paper2? | rag | true | false | true | false | false | Rerank parsed successfully. The retrieved chunks mainly covered Paper1, so the system refused to answer the comparison. |
| 3 | What does paper1 say about reinforcement learning for robot navigation? | rag | true | false | true | false | false | The key unrelated-question case. The distance gate passed, but the LLM relevance gate correctly rejected the retrieved chunks. |
| 4 | What does paper1 do in their study? | rag | true | false | true | true | true | Rerank parsed successfully. The context passed both gates and the system generated an evidence-based answer. |

Summary stats

  • Total test questions: 4
  • Rerank fallback count: 0 / 4
  • Rerank parse successes: 4 / 4
  • LLM relevance gate said NO: 3 / 4
  • LLM relevance gate said YES: 1 / 4
  • Unrelated questions correctly blocked by the relevance gate: 1 / 1

A new problem

| # | Question | LLM Relevance Gate | Should Be | Correct? |
|---|----------|--------------------|-----------|----------|
| 1 | What is the main contribution of paper1? | false | true | ❌ |
| 2 | What is the difference between paper1 and paper2? | false | true | ❌ |
| 3 | What does paper1 say about RL for robot navigation? | false | false | ✅ |
| 4 | What does paper1 do in their study? | true | true | ✅ |

2 of the 4 questions are false negatives: a 50% error rate. Questions 1 and 2 are staple questions for any RAG system ("what is the main contribution", "what is the difference between the two papers"), and the LLM gate rejected both.

Root cause: the LLM relevance judge uses a single prompt for two completely different kinds of question.

  • "What does paper1 say about concrete topic X?" ← the gate handles this well
  • "What is paper1's main contribution?" ← the gate handles this badly, because a broad query like "main contribution" rarely maps directly onto any single concrete passage

Fix plan

  1. Replace the hard AND with a softer decision:
python
# Current behavior (in ask_with_trace):
context_sufficient = distance_sufficient and context_relevant

# Change to:
if not distance_sufficient:
    # Distance gate failed: definitely insufficient
    context_sufficient = False
elif not context_relevant:
    # Distance passed but the LLM judged it irrelevant: soft warning, no hard reject
    context_sufficient = True
    context_metrics["soft_warning"] = "LLM gate flagged low relevance, but distance gate passed."
else:
    context_sufficient = True
  2. Make the gate aware of the question type:
python
# Add question classification to the relevance prompt: lenient for broad questions, strict for specific ones:
prompt = f"""You are a strict relevance judge for a RAG system.

First, classify the question type:
- BROAD: asks about overall content (e.g., "main contribution", "what is X about", "differences between")
- SPECIFIC: asks about a concrete fact, method, or topic

Then judge:
- For BROAD questions: reply YES if passages are clearly from/about the referenced paper(s).
- For SPECIFIC questions: reply YES only if passages directly mention the specific topic.

Question: {question}
Retrieved passages:
{preview}

Reply format:
TYPE: BROAD|SPECIFIC
VERDICT: YES|NO - reason
"""

Round 2 improvements

Project layers touched

  • RAG layer: app/rag_system.py
  • Trace layer: context_metrics gains question_type, gate_mode, and soft_warning

The second pass rewrites rag_system.py:assess_context_relevance_with_llm in full:

python
def assess_context_relevance_with_llm(self, question, retrieved_chunks):
    """
    LLM-based relevance gate with question type awareness.

    FAISS distance only tells which chunks are nearest in vector space.
    This gate checks whether the retrieved passages are useful for the current question.

    Question types:
    - BROAD: overall paper content, main contribution, what the paper does
    - SPECIFIC: concrete method/topic/fact, whether a paper mentions something
    - COMPARISON: differences/comparison between papers or methods
    """
    if not retrieved_chunks:
        return False, {
            "llm_question_type": "UNKNOWN",
            "llm_gate_mode": "hard",
            "llm_relevance_check": False,
            "llm_relevance_verdict": "NO",
            "llm_relevance_reason": "No chunks retrieved.",
            "llm_relevance_error": "",
            "llm_soft_warning": "",
        }

    preview = ""
    preview_chunks = retrieved_chunks[:RELEVANCE_GATE_PREVIEW_CHUNKS]

    for i, c in enumerate(preview_chunks, start=1):
        source = c.get("source", "unknown")
        distance = c.get("distance")
        text = c.get("text", "")

        snippet = text[:350]

        preview += (
            f"[Chunk {i}]\n"
            f"Source: {source}\n"
            f"Distance: {distance}\n"
            f"Text: {snippet}\n\n"
        )

    prompt = f"""
				You are a relevance judge for a RAG system.
				
				Your job is NOT to answer the user's question.
				Your job is to judge whether the retrieved passages are useful enough for answering it.
				
				First classify the question type:
				
				- BROAD:
				  The question asks about overall paper content, main contribution, summary, motivation,
				  what the paper does, or the general study direction.
				  Examples:
				  "What is the main contribution of this paper?"
				  "What does paper1 do in their study?"
				  "What is this paper about?"
				
				- SPECIFIC:
				  The question asks about a concrete method, fact, topic, term, formula, dataset, metric,
				  or whether the paper mentions a specific subject.
				  Examples:
				  "What does paper1 say about reinforcement learning?"
				  "Does paper2 mention GAN?"
				  "What attention mechanism is used?"
				
				- COMPARISON:
				  The question asks about differences, comparison, similarities, or contrast between
				  two or more papers, methods, or systems.
				  Examples:
				  "What is the difference between paper1 and paper2?"
				  "Compare the methods in these two papers."
				
				Judging rules:
				
				- For BROAD questions:
				  Reply YES if the passages are clearly from or about the referenced paper(s) and contain
				  abstract, method, contribution, experiment, or conclusion information.
				  Do NOT require the exact words "main contribution" to appear.
				
				- For SPECIFIC questions:
				  Reply YES only if the passages directly mention or support the concrete topic asked.
				  Reply NO if the passages are about a different topic.
				
				- For COMPARISON questions:
				  Reply YES only if the passages contain enough information about the compared items.
				  If the passages only cover one side of the comparison, reply NO.
				
				Question:
				{question}
				
				Retrieved passages:
				{preview}
				
				Return exactly two lines:
				
				TYPE: BROAD|SPECIFIC|COMPARISON
				VERDICT: YES|NO - short reason
			"""

    try:
        response = client.chat.completions.create(
            model=CHAT_MODEL,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )

        verdict = response.choices[0].message.content.strip()

        import re

        type_match = re.search(
            r"TYPE:\s*(BROAD|SPECIFIC|COMPARISON)",
            verdict,
            re.IGNORECASE
        )

        verdict_match = re.search(
            r"VERDICT:\s*(YES|NO)\s*[-:]\s*(.*)",
            verdict,
            re.IGNORECASE | re.DOTALL
        )

        if not type_match or not verdict_match:
            logger.warning(f"[relevance_gate] unexpected verdict format: {verdict}")

            return True, {
                "llm_question_type": "UNKNOWN",
                "llm_gate_mode": "distance_only_fallback",
                "llm_relevance_check": True,
                "llm_relevance_verdict": verdict[:200],
                "llm_relevance_reason": "Unexpected judge output format. Falling back to distance-based result.",
                "llm_relevance_error": "unexpected_verdict_format",
                "llm_soft_warning": "LLM relevance judge returned unexpected format.",
            }

        question_type = type_match.group(1).upper()
        yes_or_no = verdict_match.group(1).upper()
        reason = verdict_match.group(2).strip()

        is_relevant = yes_or_no == "YES"

        logger.info(
            f"[relevance_gate] type={question_type}, "
            f"verdict={yes_or_no}, reason={reason[:120]}"
        )

        return is_relevant, {
            "llm_question_type": question_type,
            "llm_gate_mode": "typed_relevance_gate",
            "llm_relevance_check": is_relevant,
            "llm_relevance_verdict": f"{yes_or_no} - {reason}"[:200],
            "llm_relevance_reason": reason[:200],
            "llm_relevance_error": "",
            "llm_soft_warning": "",
        }

    except Exception as e:
        logger.warning(f"[relevance_gate] LLM relevance check failed: {e}")

        return True, {
            "llm_question_type": "UNKNOWN",
            "llm_gate_mode": "distance_only_fallback",
            "llm_relevance_check": True,
            "llm_relevance_verdict": "FALLBACK",
            "llm_relevance_reason": "LLM relevance check failed. Falling back to distance-based result.",
            "llm_relevance_error": str(e),
            "llm_soft_warning": "LLM relevance judge failed, so the system used distance gate only.",
        }
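
A sketch of a call site (the rag instance and chunk values are hypothetical; the chunk fields match the preview-building code above):

python
chunks = [
    {"source": "paper1.pdf", "distance": 0.31, "text": "We propose a retrieval-augmented ..."},
    {"source": "paper1.pdf", "distance": 0.35, "text": "Our main contribution is ..."},
]

is_relevant, relevance_metrics = rag.assess_context_relevance_with_llm(
    "What is the main contribution of paper1?", chunks
)
# Expect TYPE=BROAD and, under the new judging rules, VERDICT=YES.
print(relevance_metrics["llm_question_type"], relevance_metrics["llm_relevance_verdict"])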

Besides the fleshed-out prompt, what's new in this version are the question_type, gate_mode, and soft_warning fields:

python
if not retrieved_chunks:
    return False, {
        "llm_question_type": "UNKNOWN",
        "llm_gate_mode": "hard",
        "llm_relevance_check": False,
        "llm_relevance_verdict": "NO",
        "llm_relevance_reason": "No chunks retrieved.",
        "llm_relevance_error": "",
        "llm_soft_warning": "",
    }

Then modify the hard constraint in app/rag_system.py:ask_with_trace, replacing:

python
context_metrics.update({
    "distance_gate_passed": distance_sufficient,
    "llm_relevance_check": relevance_metrics.get("llm_relevance_check"),
    "llm_relevance_verdict": relevance_metrics.get("llm_relevance_verdict"),
    "llm_relevance_reason": relevance_metrics.get("llm_relevance_reason"),
    "llm_relevance_error": relevance_metrics.get("llm_relevance_error"),
    "rerank_used": rerank_trace.get("rerank_used"),
    "rerank_fallback": rerank_trace.get("rerank_fallback"),
    "rerank_error": rerank_trace.get("rerank_error"),
    "rerank_indices": rerank_trace.get("rerank_indices", [])[:self.rerank_k],
    "rerank_raw_output": rerank_trace.get("rerank_raw_output", ""),
})

context_sufficient = distance_sufficient and context_relevant

if context_sufficient:
    context_metrics["final_sufficiency_reason"] = (
        "Context passed both the distance gate and the LLM relevance gate."
    )
elif not distance_sufficient:
    context_metrics["final_sufficiency_reason"] = (
        "Context failed the distance-based retrieval gate."
    )
elif not context_relevant:
    context_metrics["final_sufficiency_reason"] = (
        "Context passed the distance gate but failed the LLM relevance gate."
    )
else:
    context_metrics["final_sufficiency_reason"] = (
        "Context failed the lightweight sufficiency check."
    )

with:

python
question_type = relevance_metrics.get("llm_question_type", "UNKNOWN")
llm_relevance_error = relevance_metrics.get("llm_relevance_error", "")

context_metrics.update({
    "distance_gate_passed": distance_sufficient,
    "llm_question_type": question_type,
    "llm_gate_mode": relevance_metrics.get("llm_gate_mode"),
    "llm_relevance_check": relevance_metrics.get("llm_relevance_check"),
    "llm_relevance_verdict": relevance_metrics.get("llm_relevance_verdict"),
    "llm_relevance_reason": relevance_metrics.get("llm_relevance_reason"),
    "llm_relevance_error": llm_relevance_error,
    "llm_soft_warning": relevance_metrics.get("llm_soft_warning", ""),
    "rerank_used": rerank_trace.get("rerank_used"),
    "rerank_fallback": rerank_trace.get("rerank_fallback"),
    "rerank_error": rerank_trace.get("rerank_error"),
    "rerank_indices": rerank_trace.get("rerank_indices", [])[:self.rerank_k],
    "rerank_raw_output": rerank_trace.get("rerank_raw_output", ""),
})

if not distance_sufficient:   # distance gate failed: definitely insufficient
    context_sufficient = False
    context_metrics["final_sufficiency_reason"] = (
        "Context failed the distance-based retrieval gate."
    )

elif llm_relevance_error:
    # The judge failed or returned an unexpected format.
    # Don't hard-block the answer; fall back to the distance gate and surface a warning.
    context_sufficient = True
    context_metrics["final_sufficiency_reason"] = (
        "LLM relevance judge was unavailable or malformed; using distance gate as fallback."
    )

elif context_relevant:
    context_sufficient = True
    context_metrics["final_sufficiency_reason"] = (
        "Context passed both the distance gate and the typed LLM relevance gate."
    )

elif question_type == "BROAD":
    # Broad questions like "main contribution" can be wrongly vetoed by a strict judge.
    # Treat that as a soft warning rather than a hard block.
    context_sufficient = True
    context_metrics["llm_soft_warning"] = (
        "LLM relevance gate flagged low relevance for a BROAD question, "
        "but the system allowed generation because the distance gate passed."
    )
    context_metrics["final_sufficiency_reason"] = (
        "Context passed the distance gate; BROAD-question relevance warning was treated as soft."
    )

else:
    # SPECIFIC and COMPARISON questions still treat the relevance gate as a hard constraint.
    context_sufficient = False
    context_metrics["final_sufficiency_reason"] = (
        f"Context passed the distance gate but failed the typed LLM relevance gate for {question_type} question."
    )
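
The tiered decision is easy to unit test if you lift it into a pure function. A sketch under the same field semantics (not how the project actually factors it):

python
def decide_sufficiency(distance_ok, relevance_error, relevant, question_type):
    """Hypothetical pure-function version of the tiered gate above."""
    if not distance_ok:
        return False, "distance_gate_failed"
    if relevance_error:
        return True, "judge_unavailable_distance_fallback"
    if relevant:
        return True, "both_gates_passed"
    if question_type == "BROAD":
        return True, "broad_soft_warning"
    return False, f"typed_gate_failed_{question_type}"

# BROAD false negatives now pass with a soft warning; SPECIFIC ones stay blocked.
assert decide_sufficiency(True, "", False, "BROAD")[0] is True
assert decide_sufficiency(True, "", False, "SPECIFIC")[0] is False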

If this post helped you, consider leaving a like~
Full code: https://github.com/1186141415/Paper-RAG-Agent-with-LangGraph

linux·安全