28. Paper RAG Agent Dev Log: Fixing LLM Rerank Parsing, Fallback, and Verifiability

Preface

The rerank built earlier in this project is a fragile LLM black box. Looking back at the code, rag_system.py:rerank asks the LLM to output an index list like [2,0,1] and then parses it with ast.literal_eval. That approach has two problems:

  1. If the LLM returns the list inside a ```python code fence, or as prose like "I think the order is [2,0,1]", we fall back straight to the original order (return list(range(len(chunks)))), so the rerank does nothing (see the sketch after this list).
  2. No log is emitted when the fallback fires, so I have no idea how many queries actually failed to rerank.
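
Here is a minimal sketch of that failure mode: ast.literal_eval only accepts a bare Python literal, so any wrapping raises and the silent fallback kicks in.

python
import ast

# Only the bare literal parses; fenced or prose-wrapped outputs raise.
for raw in ["[2, 0, 1]",
            "```python\n[2, 0, 1]\n```",
            "I think the order is [2, 0, 1]."]:
    try:
        print(ast.literal_eval(raw.strip()))
    except (ValueError, SyntaxError) as e:
        print(f"fallback triggered for {raw!r}: {e}")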

Because of these two problems, I can't demonstrate that the rerank is effective at all. Today covers two things:

  1. Make rerank parsing more robust, so the LLM may return wrapped formats such as:
    • a plain [2, 0, 1]

    • prose like "I think the order is [2,0,1]"
    • a ```python code fence around [2,0,1]
  2. Make the rerank fallback log itself. Whenever parsing fails, record explicitly:
    • the raw LLM output
    • why it fell back
    • whether the fallback used the original order

Round 1 improvements

Project layers touched

  • RAG layer: app/rag_system.py
  • Trace layer: context_metrics gains rerank_used, rerank_fallback, and rerank_error

The first pass rewrites rag_system.py:rerank in full:

python
def rerank(self, query, chunks, return_trace=False):
    prompt = f"""You are a ranking assistant.

Query:
{query}

Rank the following passages from most relevant to least relevant.

Passages:
"""

    for i, c in enumerate(chunks):
        prompt += f"\n[{i}] {c}\n"

    prompt += (
        "\nReturn ONLY a Python list of indices in sorted order, like [2,0,1]. "
        "No explanation, no code fences."
    )

    response = client.chat.completions.create(
        model=CHAT_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )

    raw_output = response.choices[0].message.content.strip()

    import ast
    import re

    cleaned = raw_output

    # 1. Clean markdown code fences, such as:
    # ```python
    # [2, 0, 1]
    # ```
    if "```" in cleaned:
        match = re.search(r"```(?:python|json)?\s*(\[.*?\])\s*```", cleaned, re.DOTALL)
        if match:
            cleaned = match.group(1)

    # 2. Extract first bracket list from prose, such as:
    # "I think the order is [2, 0, 1]."
    bracket_match = re.search(r"\[[\d,\s]+\]", cleaned)
    if bracket_match:
        cleaned = bracket_match.group(0)

    rerank_trace = {
        "rerank_used": False,
        "rerank_fallback": False,
        "rerank_error": "",
        "rerank_raw_output": raw_output[:300],
        "rerank_cleaned_output": cleaned[:300],
        "rerank_indices": [],
    }

    try:
        result = ast.literal_eval(cleaned)

        if not isinstance(result, list):
            raise ValueError(f"Rerank output is not a list: {type(result)}")

        if not all(isinstance(i, int) for i in result):
            raise ValueError(f"Rerank output contains non-integer indices: {result}")

        if not all(0 <= i < len(chunks) for i in result):
            raise ValueError(f"Rerank output contains out-of-range indices: {result}")

        # Deduplicate while preserving order
        seen = set()
        deduped = []
        for i in result:
            if i not in seen:
                deduped.append(i)
                seen.add(i)

        # If the LLM returned an incomplete index list, append the missing indices
        # at the end so the size of best_chunks stays stable
        missing = [i for i in range(len(chunks)) if i not in seen]
        final_indices = deduped + missing

        rerank_trace.update({
            "rerank_used": True,
            "rerank_fallback": False,
            "rerank_error": "",
            "rerank_indices": final_indices,
        })

        logger.info(f"[rerank] success, parsed_indices={final_indices[:self.rerank_k]}")

        if return_trace:
            return final_indices, rerank_trace

        return final_indices

    except Exception as e:
        fallback_indices = list(range(len(chunks)))

        rerank_trace.update({
            "rerank_used": False,
            "rerank_fallback": True,
            "rerank_error": str(e),
            "rerank_indices": fallback_indices,
        })

        logger.warning(
            f"[rerank] FALLBACK to original order. "
            f"reason={e}, raw_output={raw_output[:300]}"
        )

        if return_trace:
            return fallback_indices, rerank_trace

        return fallback_indices
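
As a quick sanity check of the dedup-and-fill step, here is a standalone sketch (not how the project factors it) showing that duplicates are dropped and missing indices are appended:

python
def dedup_and_fill(result, n):
    # Standalone copy of the dedup-and-fill step, for illustration only.
    seen, deduped = set(), []
    for i in result:
        if i not in seen:
            deduped.append(i)
            seen.add(i)
    # Append whatever the LLM omitted so the list always covers range(n).
    return deduped + [i for i in range(n) if i not in seen]

assert dedup_and_fill([2, 0, 2], 4) == [2, 0, 1, 3]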

What's different from the previous version is the regex extraction applied to the LLM's response, which catches outputs that wrap the index list, such as "I think the order is [2,0,1]":

python
# 1. Clean markdown code fences, such as:
# ```python
# [2, 0, 1]
# ```
if "```" in cleaned:
    match = re.search(r"```(?:python|json)?\s*(\[.*?\])\s*```", cleaned, re.DOTALL)
    if match:
        cleaned = match.group(1)

# 2. Extract first bracket list from prose, such as:
# "I think the order is [2, 0, 1]."
bracket_match = re.search(r"\[[\d,\s]+\]", cleaned)
if bracket_match:
    cleaned = bracket_match.group(0)
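
Pulled out into a standalone helper (hypothetical; the project keeps these steps inline), the two cleaning passes can be checked against the wrapped formats from the intro:

python
import re

def clean_rerank_output(raw: str) -> str:
    # Hypothetical standalone copy of the two cleaning steps above.
    cleaned = raw.strip()
    if "```" in cleaned:
        match = re.search(r"```(?:python|json)?\s*(\[.*?\])\s*```", cleaned, re.DOTALL)
        if match:
            cleaned = match.group(1)
    bracket_match = re.search(r"\[[\d,\s]+\]", cleaned)
    if bracket_match:
        cleaned = bracket_match.group(0)
    return cleaned

assert clean_rerank_output("```python\n[2, 0, 1]\n```") == "[2, 0, 1]"
assert clean_rerank_output("I think the order is [2, 0, 1].") == "[2, 0, 1]"
assert clean_rerank_output("[2,0,1]") == "[2,0,1]"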

The other addition is rerank_trace, a record of what the rerank actually did:

python
rerank_trace = {
    "rerank_used": False,       # whether the rerank actually took effect
    "rerank_fallback": False,   # whether parsing failed and the fallback fired
    "rerank_error": "",
    "rerank_raw_output": raw_output[:300],
    "rerank_cleaned_output": cleaned[:300],
    "rerank_indices": [],
}
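
For example, if the model answered in free prose with no bracket list anywhere, the trace would look roughly like this (illustrative values, not captured output):

python
# Illustrative trace after a parse failure on 3 chunks (values are made up):
{
    "rerank_used": False,
    "rerank_fallback": True,
    "rerank_error": "invalid syntax (<unknown>, line 1)",
    "rerank_raw_output": "Passage 2 is the most relevant, then passage 0 ...",
    "rerank_cleaned_output": "Passage 2 is the most relevant, then passage 0 ...",
    "rerank_indices": [0, 1, 2],
}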

Then update rag_system.py:ask_with_trace, replacing:

python
texts = [c["text"] for c in retrieved]
sorted_indices = self.rerank(question, texts)
best_chunks = [retrieved[i] for i in sorted_indices[:self.rerank_k]]

with:

python
texts = [c["text"] for c in retrieved]
sorted_indices, rerank_trace = self.rerank(
    question,
    texts,
    return_trace=True
)
best_chunks = [retrieved[i] for i in sorted_indices[:self.rerank_k]]

Finally, merge rerank_trace into context_metrics:

python
context_metrics.update({
    "distance_gate_passed": distance_sufficient,
    "llm_relevance_check": relevance_metrics.get("llm_relevance_check"),
    "llm_relevance_verdict": relevance_metrics.get("llm_relevance_verdict"),
    "llm_relevance_reason": relevance_metrics.get("llm_relevance_reason"),
    "llm_relevance_error": relevance_metrics.get("llm_relevance_error"),
    "rerank_used": rerank_trace.get("rerank_used"),
    "rerank_fallback": rerank_trace.get("rerank_fallback"),
    "rerank_error": rerank_trace.get("rerank_error"),
    "rerank_indices": rerank_trace.get("rerank_indices", [])[:self.rerank_k],
    "rerank_raw_output": rerank_trace.get("rerank_raw_output", ""),
})
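
This is what makes the rerank verifiable: once these fields land in every trace, the fallback rate becomes countable. A sketch, assuming traces are stored one JSON object per line (the file name and layout are assumptions, not the project's actual storage):

python
import json

# Hypothetical aggregation over stored traces (one JSON trace per line assumed).
used = fallback = 0
with open("traces.jsonl") as f:
    for line in f:
        metrics = json.loads(line).get("context_metrics", {})
        used += bool(metrics.get("rerank_used"))
        fallback += bool(metrics.get("rerank_fallback"))
print(f"rerank succeeded: {used}, fell back: {fallback}")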

Test results

| # | Question | Expected Tool | Rerank Used | Rerank Fallback | Distance Gate | LLM Relevance Gate | Final Context Sufficient | Notes |
|---|----------|---------------|-------------|-----------------|---------------|--------------------|--------------------------|-------|
| 1 | What is the main contribution of paper1? | rag | true | false | true | false | false | Rerank parsed successfully, but the relevance gate was conservative and rejected the context. This should be refined later. |
| 2 | What is the difference between paper1 and paper2? | rag | true | false | true | false | false | Rerank parsed successfully. The retrieved chunks mainly covered Paper1, so the system refused to answer the comparison. |
| 3 | What does paper1 say about reinforcement learning for robot navigation? | rag | true | false | true | false | false | The key unrelated-question case. The distance gate passed, but the LLM relevance gate correctly rejected the retrieved chunks. |
| 4 | What does paper1 do in their study? | rag | true | false | true | true | true | Rerank parsed successfully. The context passed both gates and the system generated an evidence-based answer. |

Summary stats

  • Total test questions: 4
  • Rerank fallback count: 0 / 4
  • Rerank parse successes: 4 / 4
  • LLM relevance gate said NO: 3 / 4
  • LLM relevance gate said YES: 1 / 4
  • Unrelated questions correctly blocked by the relevance gate: 1 / 1

A new problem

| # | Question | LLM Relevance Gate | Should Be | Correct? |
|---|----------|--------------------|-----------|----------|
| 1 | What is the main contribution of paper1? | false | true | ❌ |
| 2 | What is the difference between paper1 and paper2? | false | true | ❌ |
| 3 | What does paper1 say about RL for robot navigation? | false | false | ✅ |
| 4 | What does paper1 do in their study? | true | true | ✅ |

2 of the 4 questions are false negatives: a 50% error rate. Questions 1 and 2 are staple questions for any RAG system ("what is the main contribution", "what is the difference between the two papers"), and the LLM gate rejected both.

Root cause: the LLM relevance judge uses a single prompt for two completely different kinds of question.

  • "What does paper1 say about concrete topic X?" ← the gate handles this well
  • "What is paper1's main contribution?" ← the gate handles this badly, because a broad query like "main contribution" rarely maps directly onto any single concrete passage

Fix plan

  1. Replace the hard AND with a softer decision:
python
# Current behavior (in ask_with_trace):
context_sufficient = distance_sufficient and context_relevant

# Change to:
if not distance_sufficient:
    # Distance gate failed: definitely insufficient
    context_sufficient = False
elif not context_relevant:
    # Distance passed but the LLM judged it irrelevant: soft warning, no hard reject
    context_sufficient = True
    context_metrics["soft_warning"] = "LLM gate flagged low relevance, but distance gate passed."
else:
    context_sufficient = True
  2. Make the gate aware of the question type:
python
# Add question classification to the relevance prompt: lenient for broad questions, strict for specific ones:
prompt = f"""You are a strict relevance judge for a RAG system.

First, classify the question type:
- BROAD: asks about overall content (e.g., "main contribution", "what is X about", "differences between")
- SPECIFIC: asks about a concrete fact, method, or topic

Then judge:
- For BROAD questions: reply YES if passages are clearly from/about the referenced paper(s).
- For SPECIFIC questions: reply YES only if passages directly mention the specific topic.

Question: {question}
Retrieved passages:
{preview}

Reply format:
TYPE: BROAD|SPECIFIC
VERDICT: YES|NO - reason
"""

Round 2 improvements

Project layers touched

  • RAG layer: app/rag_system.py
  • Trace layer: context_metrics gains question_type, gate_mode, and soft_warning

The second pass rewrites rag_system.py:assess_context_relevance_with_llm in full:

python
def assess_context_relevance_with_llm(self, question, retrieved_chunks):
    """
    LLM-based relevance gate with question type awareness.

    FAISS distance only tells which chunks are nearest in vector space.
    This gate checks whether the retrieved passages are useful for the current question.

    Question types:
    - BROAD: overall paper content, main contribution, what the paper does
    - SPECIFIC: concrete method/topic/fact, whether a paper mentions something
    - COMPARISON: differences/comparison between papers or methods
    """
    if not retrieved_chunks:
        return False, {
            "llm_question_type": "UNKNOWN",
            "llm_gate_mode": "hard",
            "llm_relevance_check": False,
            "llm_relevance_verdict": "NO",
            "llm_relevance_reason": "No chunks retrieved.",
            "llm_relevance_error": "",
            "llm_soft_warning": "",
        }

    preview = ""
    preview_chunks = retrieved_chunks[:RELEVANCE_GATE_PREVIEW_CHUNKS]

    for i, c in enumerate(preview_chunks, start=1):
        source = c.get("source", "unknown")
        distance = c.get("distance")
        text = c.get("text", "")

        snippet = text[:350]

        preview += (
            f"[Chunk {i}]\n"
            f"Source: {source}\n"
            f"Distance: {distance}\n"
            f"Text: {snippet}\n\n"
        )

    prompt = f"""
				You are a relevance judge for a RAG system.
				
				Your job is NOT to answer the user's question.
				Your job is to judge whether the retrieved passages are useful enough for answering it.
				
				First classify the question type:
				
				- BROAD:
				  The question asks about overall paper content, main contribution, summary, motivation,
				  what the paper does, or the general study direction.
				  Examples:
				  "What is the main contribution of this paper?"
				  "What does paper1 do in their study?"
				  "What is this paper about?"
				
				- SPECIFIC:
				  The question asks about a concrete method, fact, topic, term, formula, dataset, metric,
				  or whether the paper mentions a specific subject.
				  Examples:
				  "What does paper1 say about reinforcement learning?"
				  "Does paper2 mention GAN?"
				  "What attention mechanism is used?"
				
				- COMPARISON:
				  The question asks about differences, comparison, similarities, or contrast between
				  two or more papers, methods, or systems.
				  Examples:
				  "What is the difference between paper1 and paper2?"
				  "Compare the methods in these two papers."
				
				Judging rules:
				
				- For BROAD questions:
				  Reply YES if the passages are clearly from or about the referenced paper(s) and contain
				  abstract, method, contribution, experiment, or conclusion information.
				  Do NOT require the exact words "main contribution" to appear.
				
				- For SPECIFIC questions:
				  Reply YES only if the passages directly mention or support the concrete topic asked.
				  Reply NO if the passages are about a different topic.
				
				- For COMPARISON questions:
				  Reply YES only if the passages contain enough information about the compared items.
				  If the passages only cover one side of the comparison, reply NO.
				
				Question:
				{question}
				
				Retrieved passages:
				{preview}
				
				Return exactly two lines:
				
				TYPE: BROAD|SPECIFIC|COMPARISON
				VERDICT: YES|NO - short reason
			"""

    try:
        response = client.chat.completions.create(
            model=CHAT_MODEL,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )

        verdict = response.choices[0].message.content.strip()

        import re

        type_match = re.search(
            r"TYPE:\s*(BROAD|SPECIFIC|COMPARISON)",
            verdict,
            re.IGNORECASE
        )

        verdict_match = re.search(
            r"VERDICT:\s*(YES|NO)\s*[-:]\s*(.*)",
            verdict,
            re.IGNORECASE | re.DOTALL
        )

        if not type_match or not verdict_match:
            logger.warning(f"[relevance_gate] unexpected verdict format: {verdict}")

            return True, {
                "llm_question_type": "UNKNOWN",
                "llm_gate_mode": "distance_only_fallback",
                "llm_relevance_check": True,
                "llm_relevance_verdict": verdict[:200],
                "llm_relevance_reason": "Unexpected judge output format. Falling back to distance-based result.",
                "llm_relevance_error": "unexpected_verdict_format",
                "llm_soft_warning": "LLM relevance judge returned unexpected format.",
            }

        question_type = type_match.group(1).upper()
        yes_or_no = verdict_match.group(1).upper()
        reason = verdict_match.group(2).strip()

        is_relevant = yes_or_no == "YES"

        logger.info(
            f"[relevance_gate] type={question_type}, "
            f"verdict={yes_or_no}, reason={reason[:120]}"
        )

        return is_relevant, {
            "llm_question_type": question_type,
            "llm_gate_mode": "typed_relevance_gate",
            "llm_relevance_check": is_relevant,
            "llm_relevance_verdict": f"{yes_or_no} - {reason}"[:200],
            "llm_relevance_reason": reason[:200],
            "llm_relevance_error": "",
            "llm_soft_warning": "",
        }

    except Exception as e:
        logger.warning(f"[relevance_gate] LLM relevance check failed: {e}")

        return True, {
            "llm_question_type": "UNKNOWN",
            "llm_gate_mode": "distance_only_fallback",
            "llm_relevance_check": True,
            "llm_relevance_verdict": "FALLBACK",
            "llm_relevance_reason": "LLM relevance check failed. Falling back to distance-based result.",
            "llm_relevance_error": str(e),
            "llm_soft_warning": "LLM relevance judge failed, so the system used distance gate only.",
        }
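
A sketch of a call site (the rag instance and chunk values are hypothetical; the chunk fields match the preview-building code above):

python
chunks = [
    {"source": "paper1.pdf", "distance": 0.31, "text": "We propose a retrieval-augmented ..."},
    {"source": "paper1.pdf", "distance": 0.35, "text": "Our main contribution is ..."},
]

is_relevant, relevance_metrics = rag.assess_context_relevance_with_llm(
    "What is the main contribution of paper1?", chunks
)
# Expect TYPE=BROAD and, under the new judging rules, VERDICT=YES.
print(relevance_metrics["llm_question_type"], relevance_metrics["llm_relevance_verdict"])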

Besides the fleshed-out prompt, what's new in this version are the question_type, gate_mode, and soft_warning fields:

python
if not retrieved_chunks:
    return False, {
        "llm_question_type": "UNKNOWN",
        "llm_gate_mode": "hard",
        "llm_relevance_check": False,
        "llm_relevance_verdict": "NO",
        "llm_relevance_reason": "No chunks retrieved.",
        "llm_relevance_error": "",
        "llm_soft_warning": "",
    }

Then modify the hard constraint in app/rag_system.py:ask_with_trace, replacing:

python
context_metrics.update({
    "distance_gate_passed": distance_sufficient,
    "llm_relevance_check": relevance_metrics.get("llm_relevance_check"),
    "llm_relevance_verdict": relevance_metrics.get("llm_relevance_verdict"),
    "llm_relevance_reason": relevance_metrics.get("llm_relevance_reason"),
    "llm_relevance_error": relevance_metrics.get("llm_relevance_error"),
    "rerank_used": rerank_trace.get("rerank_used"),
    "rerank_fallback": rerank_trace.get("rerank_fallback"),
    "rerank_error": rerank_trace.get("rerank_error"),
    "rerank_indices": rerank_trace.get("rerank_indices", [])[:self.rerank_k],
    "rerank_raw_output": rerank_trace.get("rerank_raw_output", ""),
})

context_sufficient = distance_sufficient and context_relevant

if context_sufficient:
    context_metrics["final_sufficiency_reason"] = (
        "Context passed both the distance gate and the LLM relevance gate."
    )
elif not distance_sufficient:
    context_metrics["final_sufficiency_reason"] = (
        "Context failed the distance-based retrieval gate."
    )
elif not context_relevant:
    context_metrics["final_sufficiency_reason"] = (
        "Context passed the distance gate but failed the LLM relevance gate."
    )
else:
    context_metrics["final_sufficiency_reason"] = (
        "Context failed the lightweight sufficiency check."
    )

with:

python
question_type = relevance_metrics.get("llm_question_type", "UNKNOWN")
llm_relevance_error = relevance_metrics.get("llm_relevance_error", "")

context_metrics.update({
    "distance_gate_passed": distance_sufficient,
    "llm_question_type": question_type,
    "llm_gate_mode": relevance_metrics.get("llm_gate_mode"),
    "llm_relevance_check": relevance_metrics.get("llm_relevance_check"),
    "llm_relevance_verdict": relevance_metrics.get("llm_relevance_verdict"),
    "llm_relevance_reason": relevance_metrics.get("llm_relevance_reason"),
    "llm_relevance_error": llm_relevance_error,
    "llm_soft_warning": relevance_metrics.get("llm_soft_warning", ""),
    "rerank_used": rerank_trace.get("rerank_used"),
    "rerank_fallback": rerank_trace.get("rerank_fallback"),
    "rerank_error": rerank_trace.get("rerank_error"),
    "rerank_indices": rerank_trace.get("rerank_indices", [])[:self.rerank_k],
    "rerank_raw_output": rerank_trace.get("rerank_raw_output", ""),
})

if not distance_sufficient:   # distance gate failed: definitely insufficient
    context_sufficient = False
    context_metrics["final_sufficiency_reason"] = (
        "Context failed the distance-based retrieval gate."
    )

elif llm_relevance_error:
    # The judge failed or returned an unexpected format.
    # Don't hard-block the answer; fall back to the distance gate and surface a warning.
    context_sufficient = True
    context_metrics["final_sufficiency_reason"] = (
        "LLM relevance judge was unavailable or malformed; using distance gate as fallback."
    )

elif context_relevant:
    context_sufficient = True
    context_metrics["final_sufficiency_reason"] = (
        "Context passed both the distance gate and the typed LLM relevance gate."
    )

elif question_type == "BROAD":
    # Broad questions like "main contribution" can be wrongly vetoed by a strict judge.
    # Treat that as a soft warning rather than a hard block.
    context_sufficient = True
    context_metrics["llm_soft_warning"] = (
        "LLM relevance gate flagged low relevance for a BROAD question, "
        "but the system allowed generation because the distance gate passed."
    )
    context_metrics["final_sufficiency_reason"] = (
        "Context passed the distance gate; BROAD-question relevance warning was treated as soft."
    )

else:
    # SPECIFIC and COMPARISON questions still treat the relevance gate as a hard constraint.
    context_sufficient = False
    context_metrics["final_sufficiency_reason"] = (
        f"Context passed the distance gate but failed the typed LLM relevance gate for {question_type} question."
    )
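
The tiered decision is easy to unit test if you lift it into a pure function. A sketch under the same field semantics (not how the project actually factors it):

python
def decide_sufficiency(distance_ok, relevance_error, relevant, question_type):
    """Hypothetical pure-function version of the tiered gate above."""
    if not distance_ok:
        return False, "distance_gate_failed"
    if relevance_error:
        return True, "judge_unavailable_distance_fallback"
    if relevant:
        return True, "both_gates_passed"
    if question_type == "BROAD":
        return True, "broad_soft_warning"
    return False, f"typed_gate_failed_{question_type}"

# BROAD false negatives now pass with a soft warning; SPECIFIC ones stay blocked.
assert decide_sufficiency(True, "", False, "BROAD")[0] is True
assert decide_sufficiency(True, "", False, "SPECIFIC")[0] is False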

If this post helped you, consider leaving a like~
Full code: https://github.com/1186141415/Paper-RAG-Agent-with-LangGraph

linux·安全