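🧠 LangChain + RAG End to End: Building Your AI Knowledge Base
This article walks you through building an AI knowledge base system from scratch with LangChain and RAG (Retrieval-Augmented Generation). It covers the full workflow: environment setup, document loading, vector storage, prompt engineering, and building the question-answering chain, with complete code included.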
✅ 1. Environment Setup
Install the dependencies
bash
pip install langchain langchain-community langchain-openai faiss-cpu sentence-transformers pypdf python-dotenv
Prepare an OpenAI API key
Create a .env file in the project root:
env
OPENAI_API_KEY=your_openai_api_key_here
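A forgotten or placeholder key usually only surfaces when the first OpenAI request fails, so it can help to check for it at startup. The sketch below is one way to do that with the standard library only. The helper name `require_env` is hypothetical and not part of LangChain or python-dotenv, and the sketch assumes `load_dotenv()` has already copied the `.env` entries into the process environment.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    # Treat the placeholder from the sample .env file as "not configured".
    if not value or value == "your_openai_api_key_here":
        raise RuntimeError(f"{name} is missing or still set to the placeholder value")
    return value


if __name__ == "__main__":
    # In the real project, call load_dotenv() before this check.
    try:
        require_env("OPENAI_API_KEY")
        print("OPENAI_API_KEY is set")
    except RuntimeError as err:
        print("Configuration problem:", err)
```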
✅ 2. Load Documents (PDF Example)
python
# Document loaders live in the langchain-community package installed above
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("example.pdf")
docs = loader.load()
✅ 3. Split the Text into Chunks
python
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(docs)
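To see what `chunk_size=500` and `chunk_overlap=50` mean in practice, here is a simplified sketch of a fixed-size sliding window over characters. It is not how `RecursiveCharacterTextSplitter` is implemented (that splitter also tries to break on separators such as paragraphs and sentences), and the function name `split_with_overlap` is hypothetical. It only illustrates how consecutive chunks share their boundary text.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "".join(str(i % 10) for i in range(1200))
    pieces = split_with_overlap(sample)
    print([len(p) for p in pieces])            # [500, 500, 300]
    print(pieces[0][-50:] == pieces[1][:50])   # True: 50 shared characters
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some text twice.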
✅ 4. Embed and Store (FAISS + Sentence Transformers)
python
# Embedding and vector store integrations live in langchain-community
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embeddings)
✅ 5. Build the Retriever
python
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
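Conceptually, a retriever with `k=3` embeds the query and returns the three chunks whose vectors lie closest to it. The toy sketch below shows that idea with a brute-force scan over hand-written 2-D vectors. The function names are hypothetical, ranking by Euclidean distance is an assumption made for illustration (the metric depends on how the index is configured), and a real FAISS index uses far more efficient search than this loop.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec: Sequence[float],
          doc_vecs: Sequence[Sequence[float]],
          k: int = 3) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs for the k vectors nearest to the query."""
    scored = [(i, l2_distance(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1])  # smaller distance = more similar
    return scored[:k]


if __name__ == "__main__":
    # Hand-written stand-ins for chunk embeddings (real ones have hundreds of dimensions).
    doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    query_vec = [1.0, 0.0]
    for index, distance in top_k(query_vec, doc_vecs, k=3):
        print(index, round(distance, 3))
```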
✅ 6. Prompt Engineering (Custom Template)
python
from langchain_core.prompts import PromptTemplate
template = """
You are an AI assistant specialized in answering questions based on the provided context.
Use the following context to answer the question at the end.
Context:
{context}
Question:
{question}
Answer:
"""
prompt = PromptTemplate(template=template, input_variables=["context", "question"])
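It can be useful to look at the final text the model receives. The sketch below fills a shortened copy of the template with plain `str.format`, joining the retrieved chunks with a blank line between them. The `build_prompt` helper and the blank-line separator are assumptions for illustration. Inside the chain, LangChain performs this substitution for you.

```python
from typing import List

TEMPLATE = """Use the following context to answer the question at the end.
Context:
{context}
Question:
{question}
Answer:
"""


def build_prompt(chunks: List[str], question: str, separator: str = "\n\n") -> str:
    """Join the retrieved chunks into one context string and fill the template."""
    context = separator.join(chunk.strip() for chunk in chunks)
    return TEMPLATE.format(context=context, question=question)


if __name__ == "__main__":
    retrieved = [
        "RAG retrieves relevant text before generating an answer.",
        "FAISS stores the embedding vectors.",
    ]
    print(build_prompt(retrieved, "What does RAG do?"))
```

Printing the filled prompt like this is a quick way to check whether a poor answer comes from the retrieved context or from the wording of the template.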
✅ 7. Build the QA Chain (with OpenAI)
python
from langchain.chains import RetrievalQA
from langchain_openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
llm = OpenAI(temperature=0)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
retriever=retriever,
return_source_documents=True,
chain_type_kwargs={"prompt": prompt}
)
✅ 8. Ask a Test Question
python
query = "What is the main idea of the document?"
# invoke() replaces the deprecated direct call qa_chain({...})
result = qa_chain.invoke({"query": query})
print("Answer:", result["result"])
print("Sources:", [doc.metadata for doc in result["source_documents"]])
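Printing raw metadata dictionaries gets noisy when several chunks come from the same page. The sketch below turns them into short, de-duplicated citations. It assumes each metadata dictionary carries a `source` path and a zero-based `page` number, which is what I would expect from a PDF loader but is worth confirming against your own output. The helper name `format_sources` is hypothetical.

```python
from typing import Any, Dict, List


def format_sources(metadatas: List[Dict[str, Any]]) -> List[str]:
    """Turn chunk metadata into de-duplicated, human-readable citations."""
    seen = set()
    lines = []
    for meta in metadatas:
        source = meta.get("source", "unknown")
        page = meta.get("page")
        key = (source, page)
        if key in seen:
            continue  # several chunks can come from the same page
        seen.add(key)
        if isinstance(page, int):
            # Assumption: page numbers are zero-based, so add 1 for display.
            lines.append(f"{source}, page {page + 1}")
        else:
            lines.append(str(source))
    return lines


if __name__ == "__main__":
    sample = [
        {"source": "example.pdf", "page": 0},
        {"source": "example.pdf", "page": 0},
        {"source": "example.pdf", "page": 4},
    ]
    for line in format_sources(sample):
        print(line)
```

With the chain from this section, the input would be `[doc.metadata for doc in result["source_documents"]]`.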
✅ 9. Complete Code (Ready to Run)
python
# main.py
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain_openai import OpenAI
load_dotenv()
# 1. Load the document
loader = PyPDFLoader("example.pdf")
docs = loader.load()
# 2. Split into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(docs)
# 3. Embed the chunks and build the vector store
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embeddings)
# 4. Retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
# 5. Prompt template
template = """
You are an AI assistant specialized in answering questions based on the provided context.
Use the following context to answer the question at the end.
Context:
{context}
Question:
{question}
Answer:
"""
prompt = PromptTemplate(template=template, input_variables=["context", "question"])
# 6. QA chain
llm = OpenAI(temperature=0)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
retriever=retriever,
return_source_documents=True,
chain_type_kwargs={"prompt": prompt}
)
# 7. Ask a question (invoke() replaces the deprecated direct call)
query = "What is the main idea of the document?"
result = qa_chain.invoke({"query": query})
print("Answer:", result["result"])
✅ 10. Possible Extensions
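- Build a web interface with Streamlit
- Support more document formats (.docx, .md, .txt)
- Switch to ChatOpenAI to support multi-turn conversations
- Replace FAISS with Chroma or Weaviate
- Add reranking to improve retrieval quality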
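As a rough idea of what the reranking step looks like, the sketch below re-orders retrieved chunks by a second relevance score before they are passed to the prompt. The Jaccard word-overlap score is only a toy stand-in chosen so the example runs without extra dependencies, and the function names are hypothetical. A production reranker would typically score each query and chunk pair with a dedicated model.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def rerank(query: str, candidates: List[str], top_n: int = 3) -> List[str]:
    """Re-order candidate chunks by Jaccard word overlap with the query."""
    query_tokens = tokenize(query)

    def score(text: str) -> float:
        tokens = tokenize(text)
        if not query_tokens or not tokens:
            return 0.0
        return len(query_tokens & tokens) / len(query_tokens | tokens)

    # sorted() is stable, so ties keep the retriever's original order.
    return sorted(candidates, key=score, reverse=True)[:top_n]


if __name__ == "__main__":
    retrieved = [
        "FAISS stores vectors",
        "The main idea of the document is RAG",
        "Unrelated text",
    ]
    print(rerank("main idea of the document", retrieved, top_n=1))
```

A common pattern is to retrieve a larger candidate set (for example `k=10`) and keep only the top few after reranking.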
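✅ 11. Recommended GitHub Template (Downloadable and Runnable)
I have prepared an open-source template repository for you. It includes:
- Upload support for multiple document formats
- A Streamlit-based web UI
- Two embedding modes: local embeddings and OpenAI
- Conversation history
👉 GitHub repository link (fork it and adapt it as needed)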