25. Paper RAG Agent Optimization Log: Upload Feedback, Calculator Safety, and Chunk Parameter Tuning

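- [1. Why Not Keep Adding Features Today](#1. Why Not Keep Adding Features Today)
- [2. Optimization 1: Show the Knowledge Base Rebuild Result After a PDF Upload](#2. Optimization 1: Show the Knowledge Base Rebuild Result After a PDF Upload)
- [3. Optimization 2: Replace the Bare eval in calculator_tool](#3. Optimization 2: Replace the Bare eval in calculator_tool)
- [4. Optimization 3: Tune chunk_size So Paper Splits Better Match Semantic Granularity](#4. Optimization 3: Tune chunk_size So Paper Splits Better Match Semantic Granularity)
- [5. The Engineering Value of These Optimizations](#5. The Engineering Value of These Optimizations)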

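1. Why Not Keep Adding Features Today

Today I did not add any more complex Agent nodes. Instead, I did a round of tightening-up work on a few issues in the project that are most likely to affect real-world use.

The project already supports PDF upload, knowledge base rebuild, RAG retrieval, LangGraph tool routing, and Agent Trace display. Rather than piling on more features, the more important job today was to make the existing ones more reliable and easier to explain.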

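2. Optimization 1: Show the Knowledge Base Rebuild Result After a PDF Upload

The original problem: after a user uploads a PDF, the Django backend does call FastAPI's /reload_kb endpoint to rebuild the knowledge base, but the page never clearly shows whether the rebuild succeeded.

This causes two problems.

First, the user cannot tell whether the newly uploaded PDF has actually entered the knowledge base.

Second, when debugging, it is hard to tell whether a problem lies in the upload step, the knowledge base rebuild step, or the later question-answering step.

So today I extended the post-upload feedback: the page now shows whether the reload succeeded, along with fields such as total_docs and total_chunks.

This step touches two files:

- Django view layer: django_shell/documents/views.py
- Django template layer: django_shell/templates/documents/upload.html

Concretely, the upload_page function now captures the reload result and passes it on to the template: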

```python
import os

import requests
from django.shortcuts import render

# DATA_DIR (the folder that stores uploaded PDFs) is assumed to be
# defined elsewhere in this module.
FASTAPI_URL = "http://127.0.0.1:8000"


def upload_page(request):
    message = None
    error = None
    reload_result = None

    if request.method == "POST":
        file = request.FILES.get("paper_file")

        if file:
            try:
                if not file.name.lower().endswith(".pdf"):
                    error = "Only PDF files are supported."
                else:
                    os.makedirs(DATA_DIR, exist_ok=True)

                    save_path = os.path.join(DATA_DIR, file.name)

                    # Write the upload to disk chunk by chunk.
                    with open(save_path, "wb+") as f:
                        for chunk in file.chunks():
                            f.write(chunk)

                    message = f"File uploaded: {file.name}"

                    try:
                        # timeout=(connect, read): rebuilding can be slow,
                        # so the read timeout is generous.
                        response = requests.post(
                            f"{FASTAPI_URL}/reload_kb",
                            timeout=(5, 180)
                        )

                        if response.status_code == 200:
                            reload_result = response.json()
                        else:
                            error = (
                                "File uploaded, but reload_kb failed. "
                                f"Status code: {response.status_code}, "
                                f"Response: {response.text}"
                            )

                    except requests.exceptions.ReadTimeout:
                        error = (
                            "File uploaded, but knowledge base reload timed out. "
                            "FastAPI may still be rebuilding the knowledge base in the background. "
                            "Please check the FastAPI terminal logs."
                        )

                    except requests.exceptions.ConnectionError:
                        error = (
                            "File uploaded, but Django could not connect to FastAPI. "
                            f"Please make sure FastAPI is running at {FASTAPI_URL}."
                        )

                    except Exception as e:
                        error = f"File uploaded, but reload_kb request failed: {e}"

            except Exception as e:
                error = str(e)
        else:
            error = "No file selected"

    # List the PDFs already stored so the page can display them.
    files = []
    try:
        os.makedirs(DATA_DIR, exist_ok=True)

        for f in os.listdir(DATA_DIR):
            if f.lower().endswith(".pdf"):
                files.append(f)

    except Exception as e:
        print("list files error:", e)

    return render(
        request,
        "documents/upload.html",
        {
            "message": message,
            "error": error,
            "reload_result": reload_result,
            "files": files
        }
    )
```
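The status handling in the view could also be pulled out into a small pure function, which would make it testable without a running FastAPI server. The following is only a minimal sketch under that assumption: `interpret_reload_response` is a hypothetical helper name that does not exist in the project, and the payload values are made-up examples.

```python
def interpret_reload_response(status_code, payload, body_text=""):
    """Map a /reload_kb response to a (reload_result, error) pair.

    Hypothetical helper: it mirrors the two branches in upload_page,
    where a 200 response carries the JSON payload and any other status
    becomes an error message for the template.
    """
    if status_code == 200:
        return payload, None

    error = (
        "File uploaded, but reload_kb failed. "
        f"Status code: {status_code}, "
        f"Response: {body_text}"
    )
    return None, error


# Example values only; the real payload would come from response.json().
ok_result, ok_error = interpret_reload_response(
    200, {"status": "success", "total_docs": 3, "total_chunks": 42}
)
bad_result, bad_error = interpret_reload_response(
    500, None, "Internal Server Error"
)
```

With a helper like this, the view would only need to call it with `response.status_code`, the parsed JSON, and `response.text`, and each branch could be checked in a unit test.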

Next, the upload page django_shell/templates/documents/upload.html gets a new card that displays the reload result:

```html
{% if reload_result %}
    <div class="reload-box">
        <h4>Knowledge Base Reload Result</h4>

        <p>
            <strong>Status:</strong>
            {{ reload_result.status }}
        </p>

        <p>
            <strong>Message:</strong>
            {{ reload_result.message }}
        </p>

        {# "is not None" keeps the line visible when the count is 0 #}
        {% if reload_result.total_docs is not None %}
            <p>
                <strong>Total Docs:</strong>
                {{ reload_result.total_docs }}
            </p>
        {% endif %}

        {% if reload_result.total_chunks is not None %}
            <p>
                <strong>Total Chunks:</strong>
                {{ reload_result.total_chunks }}
            </p>
        {% endif %}
    </div>
{% endif %}
```

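3. Optimization 2: Replace the Bare eval in calculator_tool

If calculator_tool passes the expression straight to eval, it has an obvious security hole. When a user submits code instead of arithmetic, eval executes that code, which is a code-injection vulnerability. The tool therefore needs restrictions on what expressions it accepts and a clear boundary on what it is allowed to execute.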
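To make the risk concrete, here is a small, harmless demonstration. The payload string is an illustrative example, not something taken from the project; it only reads the current working directory, but the same mechanism would let an attacker run arbitrary code.

```python
import ast

# Illustrative payload: not arithmetic, but perfectly valid Python.
payload = "__import__('os').getcwd()"

# Bare eval executes it and returns the working directory.
leaked = eval(payload)

# Parsing the same string with ast shows why a whitelist catches it:
# the top-level node is a function call, not a number or an operator.
tree = ast.parse(payload, mode="eval")
top_level_node = type(tree.body).__name__

print(type(leaked).__name__, top_level_node)
```

An evaluator that walks the parsed tree and accepts only number and operator nodes never reaches the point of executing such a call.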

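So today I changed calculator_tool to a safer way of handling expressions: it supports only a limited set of basic calculations and returns a clear error message for invalid input.

The value of this change is not in building a sophisticated calculator, but in making tool calls meet basic security requirements.

This fix belongs to:

- Tools layer
- Security fix

The main file to change is app/tools.py. The added constraints make the tool evaluate only a few specific kinds of expression. As expressions get more complex I am considering moving this to an MCP tool, but at this stage I am sticking with a local tool and simply fixing the bug: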

```python
import ast
import operator as op


def calculator_tool(expression):
    """
    Safe calculator tool for basic arithmetic.

    Supported:
    - numbers
    - +, -, *, /
    - parentheses
    - decimal points
    - spaces

    This intentionally avoids raw eval on unrestricted user input.
    """
    # Whitelist: AST operator node type -> function that implements it.
    allowed_operators = {
        ast.Add: op.add,
        ast.Sub: op.sub,
        ast.Mult: op.mul,
        ast.Div: op.truediv,
        ast.USub: op.neg,
        ast.UAdd: op.pos,
    }

    def eval_node(node):
        if isinstance(node, ast.Expression):
            return eval_node(node.body)

        if isinstance(node, ast.Constant):
            # bool is a subclass of int, so reject True/False explicitly.
            if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
                return node.value
            raise ValueError("Only numbers are allowed.")

        if isinstance(node, ast.BinOp):
            operator_type = type(node.op)
            if operator_type not in allowed_operators:
                raise ValueError("Unsupported operator.")
            left = eval_node(node.left)
            right = eval_node(node.right)
            return allowed_operators[operator_type](left, right)

        if isinstance(node, ast.UnaryOp):
            operator_type = type(node.op)
            if operator_type not in allowed_operators:
                raise ValueError("Unsupported unary operator.")
            operand = eval_node(node.operand)
            return allowed_operators[operator_type](operand)

        # Names, calls, attributes, and every other node type end up here.
        raise ValueError("Invalid expression.")

    try:
        tree = ast.parse(str(expression), mode="eval")
        result = eval_node(tree)

        # Show whole-number floats such as 4.0 as "4".
        if isinstance(result, float) and result.is_integer():
            result = int(result)

        return str(result)

    except ZeroDivisionError:
        return "Invalid expression: division by zero."

    except Exception:
        return "Invalid expression. Only basic arithmetic is supported."
```
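As an extra layer in front of the AST walk, the input could be screened by length and character set before it is parsed at all. This is a sketch of one possible pre-check, not part of the project's code: `is_plausible_arithmetic`, `MAX_EXPRESSION_LENGTH`, and the limit of 200 characters are all assumptions chosen for illustration.

```python
import re

MAX_EXPRESSION_LENGTH = 200  # assumed limit, tune as needed

# Digits, the four operators, parentheses, decimal points, and spaces:
# the same set listed in the calculator_tool docstring.
_ALLOWED_CHARS = re.compile(r"[0-9+\-*/(). ]+")


def is_plausible_arithmetic(expression):
    """Cheap screen to run before parsing (hypothetical helper)."""
    text = str(expression)
    if not text.strip() or len(text) > MAX_EXPRESSION_LENGTH:
        return False
    return _ALLOWED_CHARS.fullmatch(text) is not None
```

For example, `is_plausible_arithmetic("(1.5 + 2) * 3")` is true, while `is_plausible_arithmetic("__import__('os')")` is false. A character screen like this is deliberately coarse: it still lets `2**3` through, because that string uses only allowed characters, so the operator whitelist in the AST walk remains the real safeguard.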

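4. Optimization 3: Tune chunk_size So Paper Splits Better Match Semantic Granularity

The chunk_size in data_loader.py used to be on the small side. For English academic papers, chunks that are too small tend to cut a complete semantic paragraph into pieces, so later retrieval results come back with incomplete context.

So today I adjusted chunk_size and overlap so that a single chunk covers a more complete semantic fragment of the paper.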
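To get a feel for the effect, the number of chunks a sliding window produces can be estimated from its step size (chunk_size minus overlap). The sketch below compares a 200/50 setting, the small values that appear in the earlier code, with the new 700/120 setting. The 30,000-character document length is an assumed example, not a measurement from the project, and `estimate_chunk_count` is a hypothetical helper.

```python
import math


def estimate_chunk_count(text_length, chunk_size, overlap):
    """Number of window start positions for a character-based splitter."""
    step = chunk_size - overlap
    return math.ceil(text_length / step)


ASSUMED_TEXT_LENGTH = 30_000  # hypothetical paper length in characters

old_count = estimate_chunk_count(ASSUMED_TEXT_LENGTH, chunk_size=200, overlap=50)
new_count = estimate_chunk_count(ASSUMED_TEXT_LENGTH, chunk_size=700, overlap=120)

print(old_count, new_count)  # 200 52
```

Under these assumptions the same text yields roughly a quarter as many chunks, each 3.5 times longer, so each retrieved chunk carries more surrounding context.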

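This change matters a lot for a RAG system, because chunk quality directly affects:

- the recall quality of vector retrieval
- the quality of the candidates passed to rerank
- whether the final answer gets complete context

This step belongs to:

- RAG layer
- Data processing layer
- Retrieval-quality tightening

The file to modify is app/data_loader.py, as follows: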

```python
import os
import re

from pypdf import PdfReader

# For papers, use chunk_size=700 and overlap=120 so that a single chunk
# captures a reasonably complete semantic fragment of the paper.
DEFAULT_CHUNK_SIZE = 700
DEFAULT_CHUNK_OVERLAP = 120


def split_text(text, chunk_size=DEFAULT_CHUNK_SIZE, overlap=DEFAULT_CHUNK_OVERLAP):
    """
    Split text into overlapping character-based chunks.

    Current implementation uses character length instead of token length.
    The default chunk size is tuned for English academic papers:
    - 700 characters keeps more complete local context than 200 characters.
    - 120 characters overlap helps preserve continuity across chunks.
    """
    if not text:
        return []

    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive.")

    if overlap < 0:
        raise ValueError("overlap must be non-negative.")

    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size.")

    chunks = []
    # Each window starts `step` characters after the previous one.
    step = chunk_size - overlap

    for i in range(0, len(text), step):
        chunk = text[i:i + chunk_size].strip()
        if chunk:
            chunks.append(chunk)

    return chunks


def process_documents(documents):
    all_chunks = []

    for doc in documents:
        # Use the module defaults (700/120); the earlier hard-coded
        # chunk_size=200, overlap=50 would override the new settings.
        chunks = split_text(doc["text"])

        for c in chunks:
            all_chunks.append({
                "text": c,
                "source": doc["source"]
            })

    return all_chunks
```
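A quick way to sanity-check the splitter is to confirm two properties on synthetic text: no chunk exceeds chunk_size, and each full-size chunk ends with the `overlap` characters that begin the next one. The sketch below uses a condensed copy of the slicing loop (without the validation and `.strip()` calls) so that it runs on its own; `sliding_chunks` is a hypothetical name used only here.

```python
def sliding_chunks(text, chunk_size=700, overlap=120):
    """Condensed copy of the slicing loop, for illustration only."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# Synthetic text without whitespace, so stripping would not change it.
sample = "".join(chr(ord("a") + (i % 26)) for i in range(2000))
chunks = sliding_chunks(sample)

# Property 1: no chunk is longer than chunk_size.
sizes_ok = all(len(c) <= 700 for c in chunks)

# Property 2: every full-size chunk ends with the 120 characters
# that start the following chunk.
overlaps_ok = all(
    prev[-120:] == nxt[:120]
    for prev, nxt in zip(chunks, chunks[1:])
    if len(prev) == 700
)
```

For the 2,000-character sample the windows start at 0, 580, 1160, and 1740, giving four chunks, the last of which is shorter than 700 characters.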

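5. The Engineering Value of These Optimizations

None of these changes is a big new feature, but each is the kind of engineering detail that moves a project from "it runs" to "more reliable":

- Upload feedback: improves visibility for the user and makes debugging faster
- calculator security fix: reduces the security risk of tool calls
- chunk parameter tuning: improves the semantic completeness of RAG retrieval

These changes show that the project is not only about getting RAG and the Agent to run, but about gradually closing the gaps that real AI application engineering runs into.

If this post helped you, feel free to give it a like.
Full code: https://github.com/1186141415/Paper-RAG-Agent-with-LangGraph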
