Retrieval-Augmented Generation (RAG) with LangChain, OpenAI, and FAISS

Reference: RAG with LangChain (BGE documentation)

Install dependencies

bash
pip install langchain langchain_community langchain_openai langchain_huggingface faiss-cpu pypdf

Get an OpenAI API key

API keys - OpenAI API: https://platform.openai.com/api-keys
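The script in this post writes the key directly into the source file, which is easy to leak when the file is shared or committed. As a hedged alternative, the sketch below reads `OPENAI_API_KEY` from the environment and only prompts when it is missing. `resolve_api_key` and its parameters are hypothetical names introduced for illustration; the only assumption about the library is that `langchain_openai` picks up the `OPENAI_API_KEY` environment variable.

```python
import os
from getpass import getpass


def resolve_api_key(env=None, prompt_fn=getpass):
    """Return OPENAI_API_KEY from the environment, prompting only if it is unset.

    `resolve_api_key` is an illustrative helper, not part of LangChain or OpenAI.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # Fall back to an interactive prompt that does not echo the key.
        key = prompt_fn("OpenAI API key: ").strip()
    if not key:
        raise RuntimeError("No OpenAI API key provided")
    return key
```

To use it, replace the hardcoded assignment with `os.environ["OPENAI_API_KEY"] = resolve_api_key()` before creating the model.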

Full code with comments

LangChainDemo.py

python
# Set the OpenAI API key (replace the placeholder with your own key)
import os
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

# 1. Initialize the OpenAI model
from langchain_openai.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-4o-mini")

# Test the OpenAI call (no retrieval yet)
response = llm.invoke("What does M3-Embedding stand for?")
print(response.content)

# 2. Load the PDF document
from langchain_community.document_loaders import PyPDFLoader

# Or download the paper and put a path to the local file instead
loader = PyPDFLoader("https://arxiv.org/pdf/2402.03216")
docs = loader.load()
print(docs[0].metadata)

# 3. Split the text
from langchain.text_splitter import RecursiveCharacterTextSplitter

# initialize a splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # Maximum size of chunks to return
    chunk_overlap=150,  # number of overlap characters between chunks
)

# use the splitter to split our paper
corpus = splitter.split_documents(docs)
print("Number of chunks after splitting:", len(corpus))

# 4. Initialize the embedding model
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(
    model_name="BAAI/bge-base-en-v1.5",
    encode_kwargs={"normalize_embeddings": True},
)

# 5. Build the vector database
from langchain_community.vectorstores import FAISS

vectordb = FAISS.from_documents(corpus, embedding_model)

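# (Optional) save the vector database to a local directory.
# Make sure the target directory is writable.
if not os.path.exists("vectorstore.db"):
    vectordb.save_local("vectorstore.db")
    # Report success only when the index was actually written
    print("Vector database saved")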

# 6. Create the retrieval chain
from langchain_core.prompts import ChatPromptTemplate

template = """
You are a Q&A chat bot.
Use the given context only, answer the question.

<context>
{context}
</context>

Question: {input}
"""

# Create a prompt template
prompt = ChatPromptTemplate.from_template(template)

from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

doc_chain = create_stuff_documents_chain(llm, prompt)
# Create retriever for later use
retriever = vectordb.as_retriever(search_kwargs={"k": 3})  # k = number of chunks to retrieve
chain = create_retrieval_chain(retriever, doc_chain)

# 7. Run the query
response = chain.invoke({"input": "What does M3-Embedding stand for?"})

# print the answer only
print("\nAnswer:", response['answer'])
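Steps 3 to 6 of the script hide most of the mechanics inside library calls. The following is a toy, standard-library-only sketch of the same idea: split text into overlapping chunks, turn each chunk into a unit-length vector, and return the k chunks most similar to the query. Everything in it is an assumption made for illustration: `split_with_overlap`, `embed`, `similarity`, and `top_k` are hypothetical helpers, the bag-of-words vectors stand in for the BGE model, the fixed-size windows ignore the separator logic of `RecursiveCharacterTextSplitter`, and the brute-force scan stands in for the FAISS index.

```python
import math
import re
from collections import Counter


def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character windows sharing chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    stop = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


def embed(text):
    """Toy embedding: L2-normalized bag-of-words counts (stand-in for a real model)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def similarity(a, b):
    """Dot product; equals cosine similarity because both vectors have unit length."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def top_k(query, chunks, k=3):
    """Brute-force search: score every chunk against the query and keep the k best."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: similarity(q, embed(c)), reverse=True)
    return ranked[:k]


# Neighbouring chunks share one character, so text cut at a boundary
# still appears whole in at least one chunk when the overlap is large enough.
print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']

docs = ["cats chase mice", "dogs chase cats", "stocks fell today"]
print(top_k("do dogs chase cats", docs, k=2))
# ['dogs chase cats', 'cats chase mice']
```

Because every vector has length 1, ranking by dot product, cosine similarity, or Euclidean distance gives the same order, which is the practical effect of `normalize_embeddings=True` in the script.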

Run

bash
python LangChainDemo.py

Output

M3-Embedding refers to "Multimodal, Multi-Task, and Multi-Lingual" embedding techniques that integrate information from multiple modalities (such as text, images, and audio), support multiple tasks (like classification, generation, or translation), and can operate across multiple languages. This approach helps in building versatile models capable of understanding and generating information across various contexts and formats.

If you are looking for a specific context or application of M3-Embedding, please provide more details!
{'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-07-01T00:26:51+00:00', 'author': '', 'keywords': '', 'moddate': '2024-07-01T00:26:51+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'https://arxiv.org/pdf/2402.03216', 'total_pages': 18, 'page': 0, 'page_label': '1'}
Number of chunks after splitting: 87
Vector database saved

Answer: M3-Embedding stands for Multi-Linguality, Multi-Functionality, and Multi-Granularity.
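The two answers in the output differ because of what the model was shown. The first comes from the bare `llm.invoke` call in step 1, where the model guesses and gets the expansion wrong. The second comes from the chain in step 7, where the retrieved chunks of the paper are placed into the `{context}` slot of the prompt before the model is called. A minimal sketch of that "stuffing" step, assuming the chunks are simply joined with blank lines, is given here. `build_prompt` is a hypothetical helper and the chunk strings are placeholders; the real `create_stuff_documents_chain` also handles document formatting and chat message types.

```python
# Same template text as in the script.
TEMPLATE = """
You are a Q&A chat bot.
Use the given context only, answer the question.

<context>
{context}
</context>

Question: {input}
"""


def build_prompt(question, retrieved_chunks, separator="\n\n"):
    """Join the retrieved chunks and substitute them into the template."""
    context = separator.join(retrieved_chunks)
    return TEMPLATE.format(context=context, input=question)


# Placeholder chunks; in the script these would be the k=3 retrieved passages.
chunks = ["(text of retrieved chunk 1)", "(text of retrieved chunk 2)"]
print(build_prompt("What does M3-Embedding stand for?", chunks))
```

Printing the assembled prompt like this is also a quick way to check whether a wrong answer is caused by poor retrieval or by the model ignoring good context.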