Retrieval-Augmented Generation (RAG) with LangChain, OpenAI, and FAISS

Reference: "RAG with LangChain", BGE documentation

Install dependencies

bash
pip install langchain_community langchain_openai langchain_huggingface faiss-cpu pypdf

Get an OpenAI API key

API keys - OpenAI API: https://platform.openai.com/api-keys

Complete code with comments

LangChainDemo.py

python
# Set the OpenAI API key; replace the placeholder with your own key
import os
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

# 1. Initialize the OpenAI model
from langchain_openai.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-4o-mini")

# Test the OpenAI call (no retrieval yet, so the model answers from its own knowledge)
response = llm.invoke("What does M3-Embedding stand for?")
print(response.content)

# 2. Load the PDF document (PyPDFLoader requires the pypdf package)
from langchain_community.document_loaders import PyPDFLoader

# Or download the paper and put a path to the local file instead
loader = PyPDFLoader("https://arxiv.org/pdf/2402.03216")
docs = loader.load()
print(docs[0].metadata)

# 3. Split the text
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize a splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # maximum chunk size, in characters
    chunk_overlap=150,  # number of characters shared between adjacent chunks
)

# Use the splitter to split the paper
corpus = splitter.split_documents(docs)
print("Number of chunks after splitting:", len(corpus))

# 4. Initialize the embedding model
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(
    model_name="BAAI/bge-base-en-v1.5",
    encode_kwargs={"normalize_embeddings": True},  # return unit-length vectors
)

# 5. Build the vector database
from langchain_community.vectorstores import FAISS

vectordb = FAISS.from_documents(corpus, embedding_model)

# (Optional) Save the vector database to a local directory
# (make sure the directory is writable); skipped if the path already exists
if not os.path.exists("vectorstore.db"):
    vectordb.save_local("vectorstore.db")
    print("Vector database saved")

# 6. Create the retrieval chain
from langchain_core.prompts import ChatPromptTemplate

template = """
You are a Q&A chat bot.
Use the given context only, answer the question.

<context>
{context}
</context>

Question: {input}
"""

# Create a prompt template
prompt = ChatPromptTemplate.from_template(template)

# These imports assume a LangChain 0.x release (legacy chains were moved out of `langchain` in 1.0)
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

doc_chain = create_stuff_documents_chain(llm, prompt)
# Create a retriever for later use
retriever = vectordb.as_retriever(search_kwargs={"k": 3})  # number of chunks to retrieve; adjust as needed
chain = create_retrieval_chain(retriever, doc_chain)

# 7. Run the query
response = chain.invoke({"input": "What does M3-Embedding stand for?"})

# Print the answer only
print("\nAnswer:", response['answer'])
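
To make steps 4 to 6 less of a black box, the toy sketch below shows roughly what the chain does conceptually: normalize the vectors, rank chunks by inner product against the query, keep the top k=3, and paste them into the prompt's {context} slot. It uses only the standard library. The 3-dimensional vectors, the chunk strings, and the helpers normalize and top_k are made up for illustration; they are not BGE output or LangChain/FAISS APIs. Joining the chunks with a blank line is an assumption about how the stuffing step concatenates documents.

```python
import math

# Same prompt text as in the script above
TEMPLATE = """
You are a Q&A chat bot.
Use the given context only, answer the question.

<context>
{context}
</context>

Question: {input}
"""


def normalize(vec):
    """Scale a vector to unit length (the effect normalize_embeddings=True asks for)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query_vec, chunk_vecs, k=3):
    """Return the indices of the k chunks with the highest inner product (hypothetical helper)."""
    q = normalize(query_vec)
    scores = [sum(a * b for a, b in zip(q, normalize(v))) for v in chunk_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Made-up chunks and made-up 3-dimensional "embeddings"
chunks = [
    "chunk about M3",
    "chunk about training data",
    "chunk about evaluation",
    "chunk about related work",
]
chunk_vecs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.7, 0.2, 0.3],
    [0.0, 0.1, 0.9],
]
query_vec = [1.0, 0.0, 0.1]

# Retrieve the three closest chunks, then stuff them into the prompt
hits = top_k(query_vec, chunk_vecs, k=3)
context = "\n\n".join(chunks[i] for i in hits)
prompt_text = TEMPLATE.format(context=context, input="What does M3-Embedding stand for?")
print(prompt_text)
```

Because every vector has unit length, the inner product equals the cosine similarity, so the ranking does not depend on vector magnitude. The least similar chunk ("chunk about related work") never reaches the prompt, which is why the model can only answer from what was retrieved.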

Run

bash
python LangChainDemo.py

Output

text
M3-Embedding refers to "Multimodal, Multi-Task, and Multi-Lingual" embedding techniques that integrate information from multiple modalities (such as text, images, and audio), support multiple tasks (like classification, generation, or translation), and can operate across multiple languages. This approach helps in building versatile models capable of understanding and generating information across various contexts and formats.

If you are looking for a specific context or application of M3-Embedding, please provide more details!
{'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-07-01T00:26:51+00:00', 'author': '', 'keywords': '', 'moddate': '2024-07-01T00:26:51+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'https://arxiv.org/pdf/2402.03216', 'total_pages': 18, 'page': 0, 'page_label': '1'}
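Number of chunks after splitting: 87
Vector database saved

Answer: M3-Embedding stands for Multi-Linguality, Multi-Functionality, and Multi-Granularity.

The first answer comes from the bare model call in step 1 and is a guess; the final answer comes from the retrieval chain and matches the definition given in the paper.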
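
The chunk count reported in the output depends on chunk_size and chunk_overlap. As a rough illustration of what overlap means, here is a simplified fixed-window splitter. It is not how RecursiveCharacterTextSplitter works internally (that class tries to break on separators such as paragraph and line breaks before falling back to smaller units). sliding_chunks is a hypothetical helper, and the tiny sizes are chosen only to keep the example readable.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows; each window starts
    chunk_size - chunk_overlap characters after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(text, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
```

Each piece repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. A larger overlap improves that coverage at the cost of more chunks to embed and store.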