RAG Basics in Half an Hour

Building a Local RAG System from Scratch

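The core idea of RAG (Retrieval-Augmented Generation): instead of letting the LLM answer from "memory", first retrieve the relevant passages from your own documents, then generate the answer based on those passages.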

Overall Pipeline

PDF → extract text → chunk → embed → store
                                       ↓
user question → embed question → similarity search → take top 5 → build prompt → LLM generates answer

Technology Choices

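Stage	Tool
PDF parsing	PyMuPDF
Text chunking	Regex sentence splitting + fixed number of sentences per chunk
Embedding	sentence-transformers (all-MiniLM-L6-v2, 384 dimensions)
Vector storage	numpy (.npy file)
Similarity scoring	Dot product
LLM generation	Ollama + qwen2.5:7b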

Step 1: PDF Text Extraction

Use PyMuPDF to extract the text page by page, replacing the line breaks introduced by the PDF layout with spaces.

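import fitz  # PyMuPDF is imported under the name "fitz"

def extract_text_from_pdf(pdf_path: str) -> list[dict]:
    """Return one {"page_number", "text"} dict per non-empty page."""
    doc = fitz.open(pdf_path)
    pages = []
    for i, page in enumerate(doc):
        # Layout line breaks become spaces so sentences are not cut mid-line.
        text = page.get_text().replace("\n", " ").strip()
        if text:  # skip pages with no extractable text
            pages.append({"page_number": i + 1, "text": text})  # 1-based page numbers
    doc.close()
    return pages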
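
Replacing every "\n" with a space can leave words that were hyphenated across a line break in two pieces, and it can leave doubled spaces. The sketch below shows one possible cleanup pass that could run on the raw page text first. The helper name clean_page_text is hypothetical and not part of the original code, and the hyphen rule is a heuristic: it would also glue together a genuine compound such as "low-fat" if it happened to break at the hyphen.

```python
import re

def clean_page_text(raw: str) -> str:
    """Hypothetical helper: tidy raw page text before chunking."""
    # Rejoin words hyphenated across a line break: "nutri-\ntion" -> "nutrition".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Collapse every remaining run of whitespace (including newlines) into one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(clean_page_text("Good nutri-\ntion is  key.\nEat well."))
# prints: Good nutrition is key. Eat well.
```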

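Step 2: Text Chunking

Split the text into sentences on . ? ! and group every 10 sentences into one chunk. Chunks of 30 characters or fewer are discarded, since they are usually header and footer noise.

Why chunk at all? A full page of text is too long: embedding models handle long inputs poorly, and retrieval only needs the short passage that is most relevant.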

import re

def split_into_sentences(text: str) -> list[str]:
    # Split on whitespace that follows ., ? or ! (the punctuation stays attached).
    sentences = re.split(r'(?<=[.?!])\s+', text)
    return [s.strip() for s in sentences if s.strip()]

def chunk_pages(pages: list[dict], chunk_size: int = 10) -> list[dict]:
    """Group each page's sentences into chunks of up to chunk_size sentences."""
    all_chunks = []
    for page in pages:
        sentences = split_into_sentences(page["text"])
        # Chunks never cross a page boundary, so each one keeps a single page number.
        for i in range(0, len(sentences), chunk_size):
            chunk_sentences = sentences[i:i + chunk_size]
            chunk_text = " ".join(chunk_sentences)
            if len(chunk_text) > 30:  # drop very short chunks (header/footer noise)
                all_chunks.append({
                    "page_number": page["page_number"],
                    "chunk_text": chunk_text,
                })
    return all_chunks
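
To see what the sentence regex actually does, here is a small check on a made-up sample string (the sample text is an assumption for illustration). It also exposes a known limitation of punctuation-based splitting: an abbreviation such as "Dr." is treated as the end of a sentence.

```python
import re

def split_into_sentences(text: str) -> list[str]:
    # Same regex as above: split on whitespace that follows ., ? or !
    sentences = re.split(r'(?<=[.?!])\s+', text)
    return [s.strip() for s in sentences if s.strip()]

sample = "Vitamin C aids iron absorption. Is that enough? No! See Dr. Lee for details."
pieces = split_into_sentences(sample)
for p in pieces:
    print(repr(p))
# 'Vitamin C aids iron absorption.'
# 'Is that enough?'
# 'No!'
# 'See Dr.'
# 'Lee for details.'
```

For a tutorial-sized corpus this noise is usually tolerable, because a chunk holds ten sentences and a wrong split only shifts a boundary slightly.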

Step 3: Embedding

Convert each chunk into a 384-dimensional vector. Texts with similar meaning end up close together in the vector space.

from sentence_transformers import SentenceTransformer
import numpy as np

def embed_chunks(chunks: list[dict]) -> np.ndarray:
    """Encode every chunk's text; the result has one row per chunk."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    texts = [c["chunk_text"] for c in chunks]
    # Row i of the returned array is the vector for chunks[i].
    embeddings = model.encode(texts, show_progress_bar=True)
    return embeddings

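The vectors themselves mean nothing to a human reader, but they are meaningful for computation: the more closely two vectors point in the same direction, the larger their dot product, which indicates closer semantic relevance. Strictly speaking, that comparison is only fair when the vectors have the same length; for unit-length vectors the dot product equals cosine similarity.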
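
A toy illustration of the relationship between dot product and direction, using made-up 2-D vectors instead of real embeddings (all values here are assumptions for the example). It shows why vector length matters: a longer vector can win on raw dot product even though it points further away from the query.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product of the two vectors after scaling each to unit length."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 0.0])
same_direction = np.array([2.0, 0.0])   # parallel to the query
long_but_off = np.array([3.0, 4.0])     # longer, but points elsewhere

# Raw dot products: the longer vector scores higher despite the worse angle.
print(float(np.dot(query, same_direction)), float(np.dot(query, long_but_off)))  # 2.0 3.0

# Cosine similarity removes the effect of length and ranks by direction only.
print(cosine_similarity(query, same_direction))  # 1.0
print(cosine_similarity(query, long_but_off))    # 0.6
```

If the embedding model already returns unit-length vectors, the two scores coincide and the plain dot product is enough.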

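Step 4: Semantic Search

The user's question is also converted into a vector, its dot product with every chunk vector is computed, and the 5 highest-scoring chunks are kept.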

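import numpy as np

def search(query, embeddings, chunks, model, top_k=5):
    """Return the top_k chunks whose vectors score highest against the query."""
    query_embedding = model.encode([query])[0]
    # One dot product per chunk: (n_chunks, dim) · (dim,) -> (n_chunks,)
    scores = np.dot(embeddings, query_embedding)
    # argsort is ascending, so reverse it and keep the first top_k indices.
    top_indices = np.argsort(scores)[::-1][:top_k]
    results = []
    for i in top_indices:
        results.append({
            "score": float(scores[i]),
            "page_number": chunks[i]["page_number"],
            "text": chunks[i]["chunk_text"],
        })
    return results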
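
The scoring and ranking lines can be checked by hand on a tiny example. The sketch below uses hypothetical 2-D "embeddings" for four chunks (real ones have 384 dimensions) and keeps the top 2 instead of the top 5.

```python
import numpy as np

# Hypothetical 2-D "embeddings" for four chunks.
embeddings = np.array([
    [0.0, 1.0],   # chunk 0
    [1.0, 0.0],   # chunk 1
    [0.6, 0.8],   # chunk 2
    [0.8, 0.6],   # chunk 3
])
query_embedding = np.array([1.0, 0.0])

scores = np.dot(embeddings, query_embedding)   # one score per chunk
top_indices = np.argsort(scores)[::-1][:2]     # indices of the 2 highest scores

print(scores.tolist())        # [0.0, 1.0, 0.6, 0.8]
print(top_indices.tolist())   # [1, 3]
```

Sorting every score is fine at this scale; it only becomes a bottleneck once the number of chunks grows very large.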

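Key point: the question and the documents must be embedded with the same model. Otherwise they do not share a coordinate system and their distances cannot be compared.

Step 5: RAG: Retrieval + Generation

Insert the retrieved chunks into the prompt as context, and have the LLM answer based on that content.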

import requests

def generate(prompt: str, model: str = "qwen2.5:7b") -> str:
    """Send the prompt to a local Ollama server and return the full reply."""
    resp = requests.post("http://localhost:11434/api/generate", json={
        "model": model,
        "prompt": prompt,
        "stream": False,  # one JSON response instead of a token stream
    }, timeout=300)
    resp.raise_for_status()  # fail loudly if Ollama is not running or the model is missing
    return resp.json()["response"]

def rag(query, embeddings, chunks, model):
    # Here `model` is the embedding model; generate() uses its own default LLM name.
    results = search(query, embeddings, chunks, model, top_k=5)
    context = "\n\n".join([
        f"[Page {r['page_number']}]: {r['text']}" for r in results
    ])
    prompt = f"""Based on the following context, answer the question.
If the context doesn't contain enough information, say so.

Context:
{context}

Question: {query}

Answer:"""
    return generate(prompt)

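This is the essence of RAG: the LLM itself does not know what is in your documents, but once you feed it the relevant passages it can answer from those facts instead of making things up from its training memory.

Project Structure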
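
Because the retrieved passages are pasted straight into the prompt, their total size matters: a local model has a finite context window. One possible guard, sketched below, is to stop adding chunks once a character budget is reached. The helper name build_context and the max_chars parameter are hypothetical and not part of the original code, and counting characters is only a rough stand-in for counting tokens.

```python
def build_context(results: list[dict], max_chars: int = 2000) -> str:
    """Hypothetical helper: join retrieved chunks until a character budget is hit."""
    parts = []
    used = 0
    for r in results:  # results are assumed to be sorted best-first
        entry = f"[Page {r['page_number']}]: {r['text']}"
        if parts and used + len(entry) > max_chars:
            break  # always keep at least one chunk, then stop at the budget
        parts.append(entry)
        used += len(entry)
    return "\n\n".join(parts)

results = [
    {"score": 0.9, "page_number": 3, "text": "a" * 50},
    {"score": 0.8, "page_number": 7, "text": "b" * 50},
    {"score": 0.7, "page_number": 9, "text": "c" * 50},
]
context = build_context(results, max_chars=130)
print(context.count("[Page"))   # 2
```

Each entry here is 60 characters long, so two fit inside the 130-character budget and the third, lowest-scoring one is dropped.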

my-rag/
├── extract.py     # extract text from the PDF
├── chunk.py       # split the text into chunks
├── embed.py       # embed the chunks and save the vectors
├── search.py      # semantic search
├── rag.py         # full RAG pipeline
└── data/
    ├── human-nutrition-text.pdf
    ├── embeddings.npy
    └── chunks.json
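
The data/ folder implies that vectors and chunk metadata are stored as two separate files. The sketch below shows one way that save and load step could look; the helper names save_index and load_index are hypothetical, and the demo writes dummy data to a temporary directory rather than to the real data/ folder.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

def save_index(data_dir: Path, embeddings: np.ndarray, chunks: list[dict]) -> None:
    """Hypothetical helper: write vectors and chunk metadata side by side."""
    np.save(data_dir / "embeddings.npy", embeddings)
    with open(data_dir / "chunks.json", "w", encoding="utf-8") as f:
        json.dump(chunks, f, ensure_ascii=False)

def load_index(data_dir: Path) -> tuple[np.ndarray, list[dict]]:
    """Hypothetical helper: read both files back and check they line up."""
    embeddings = np.load(data_dir / "embeddings.npy")
    with open(data_dir / "chunks.json", encoding="utf-8") as f:
        chunks = json.load(f)
    # Row i of the matrix must describe chunks[i], so the lengths have to match.
    assert len(embeddings) == len(chunks)
    return embeddings, chunks

# Round trip with dummy data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    dummy_chunks = [{"page_number": 1, "chunk_text": "hello"},
                    {"page_number": 2, "chunk_text": "world"}]
    save_index(data_dir, np.zeros((2, 384), dtype=np.float32), dummy_chunks)
    loaded_embeddings, loaded_chunks = load_index(data_dir)
    print(loaded_embeddings.shape, loaded_chunks[1]["chunk_text"])
```

Keeping the two files in the same order is what lets search() map a row index back to its page number and text.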

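Possible Improvements

This is a minimal implementation. A production-grade RAG system would also add:

Vector database (Chroma/FAISS/Milvus): fast retrieval over millions of entries
Chunking improvements: split on semantic boundaries, and overlap adjacent chunks so information is not cut off at a boundary
Rerank: re-sort the retrieved results with a second model to improve precision
Multi-turn conversation: include the chat history to support follow-up questions
Hybrid retrieval: combine vector search with keyword search so each covers the other's weaknesses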
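
Of these, chunk overlap is the easiest to try on top of the existing sentence-based chunking. The following is a minimal sketch, assuming the text has already been split into sentences; the function name chunk_with_overlap and its overlap parameter are hypothetical and not part of the original code.

```python
def chunk_with_overlap(sentences: list[str], chunk_size: int = 10, overlap: int = 2) -> list[str]:
    """Group sentences into chunks that share `overlap` sentences with the previous chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + chunk_size]))
        if start + chunk_size >= len(sentences):
            break  # the last window already reached the end of the text
    return chunks

sentences = [f"S{i}." for i in range(1, 8)]   # S1. ... S7.
for c in chunk_with_overlap(sentences, chunk_size=4, overlap=1):
    print(c)
# S1. S2. S3. S4.
# S4. S5. S6. S7.
```

Sentence S4 appears in both chunks, so a fact that spans the boundary between S4 and S5 is still retrievable as one piece. The cost is a somewhat larger index, since overlapping sentences are embedded more than once.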