Reference: RAG with LangChain (BGE documentation)
Install dependencies
bash
pip install langchain_community langchain_openai langchain_huggingface faiss-cpu pypdf
Get an OpenAI API key
API keys - OpenAI API: https://platform.openai.com/api-keys
Full code with comments
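The script in this post writes the key directly into the source file, which is convenient for a demo but easy to leak. A minimal sketch of the alternative, assuming the key has already been exported in the shell as `OPENAI_API_KEY`; the helper name `require_api_key` is hypothetical and not part of any library:

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing.

    Hypothetical helper for illustration only. `env` defaults to the
    process environment; pass a plain dict to exercise it in isolation.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    # Treat the tutorial placeholder as "not set" as well.
    if not key or key == "YOUR_API_KEY":
        raise RuntimeError(f"{name} is not set; export it before running the script")
    return key


if __name__ == "__main__":
    try:
        require_api_key()
        print("OPENAI_API_KEY found")
    except RuntimeError as err:
        print(err)
```

With the variable exported, `ChatOpenAI` should pick the key up from the environment on its own, so the `os.environ[...] = "YOUR_API_KEY"` assignment can be dropped.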
python
# For openai key
import os
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
# 1. Initialize the OpenAI model
from langchain_openai.chat_models import ChatOpenAI
llm = ChatOpenAI(model_name="gpt-4o-mini")
# Test the OpenAI call (no retrieval yet, so the model answers from its own knowledge)
response = llm.invoke("What does M3-Embedding stand for?")
print(response.content)
# 2. Load the PDF document
from langchain_community.document_loaders import PyPDFLoader
# Or download the paper and put a path to the local file instead
loader = PyPDFLoader("https://arxiv.org/pdf/2402.03216")
docs = loader.load()
print(docs[0].metadata)
# 3. Split the text
from langchain.text_splitter import RecursiveCharacterTextSplitter
# initialize a splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # Maximum size of chunks to return
    chunk_overlap=150,  # Number of overlapping characters between chunks
)
# use the splitter to split our paper
corpus = splitter.split_documents(docs)
print("Number of chunks after splitting:", len(corpus))
# 4. Initialize the embedding model
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
embedding_model = HuggingFaceEmbeddings(
    model_name="BAAI/bge-base-en-v1.5",
    encode_kwargs={"normalize_embeddings": True},
)
# 5. Build the vector database
from langchain_community.vectorstores import FAISS
vectordb = FAISS.from_documents(corpus, embedding_model)
# (Optional) Save the vector database to a local directory
# (make sure the directory is writable)
if not os.path.exists("vectorstore.db"):
    vectordb.save_local("vectorstore.db")
    print("Vector database saved")
# 6. Create the retrieval chain
from langchain_core.prompts import ChatPromptTemplate
template = """
You are a Q&A chat bot.
Use the given context only, answer the question.
<context>
{context}
</context>
Question: {input}
"""
# Create a prompt template
prompt = ChatPromptTemplate.from_template(template)
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
doc_chain = create_stuff_documents_chain(llm, prompt)
# Create retriever for later use
retriever = vectordb.as_retriever(search_kwargs={"k": 3})  # k = number of chunks to retrieve
chain = create_retrieval_chain(retriever, doc_chain)
# 7. Run the query
response = chain.invoke({"input": "What does M3-Embedding stand for?"})
# Print the answer only
print("\nAnswer:", response['answer'])
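Step 4 sets `normalize_embeddings` to `True`, and step 5 hands the vectors to FAISS. As far as I know, the LangChain FAISS wrapper builds a flat L2 (Euclidean) index by default, so normalization is what makes that distance behave like cosine similarity: for unit vectors, the squared L2 distance equals 2 - 2·cos. The sketch below checks this on toy 3-dimensional vectors in plain Python; the vectors and names are made up for illustration and no library is involved:

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy "embeddings" (illustrative values); bge-base-en-v1.5 vectors have 768 dimensions.
query = normalize([1.0, 2.0, 0.5])
chunks = {
    "chunk_a": normalize([0.9, 2.1, 0.4]),
    "chunk_b": normalize([-1.0, 0.3, 2.0]),
    "chunk_c": normalize([2.0, 1.0, 1.0]),
}

# Rank once by highest cosine similarity and once by smallest L2 distance.
by_cosine = sorted(chunks, key=lambda n: cosine(query, chunks[n]), reverse=True)
by_l2 = sorted(chunks, key=lambda n: squared_l2(query, chunks[n]))

print("ranked by cosine:", by_cosine)
print("ranked by L2:    ", by_l2)
```

Both rankings come out identical, which is why a top-`k` search over normalized vectors returns the same chunks under either metric.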
Run
bash
python LangChainDemo.py
Output
text
M3-Embedding refers to "Multimodal, Multi-Task, and Multi-Lingual" embedding techniques that integrate information from multiple modalities (such as text, images, and audio), support multiple tasks (like classification, generation, or translation), and can operate across multiple languages. This approach helps in building versatile models capable of understanding and generating information across various contexts and formats.
If you are looking for a specific context or application of M3-Embedding, please provide more details!
{'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-07-01T00:26:51+00:00', 'author': '', 'keywords': '', 'moddate': '2024-07-01T00:26:51+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'https://arxiv.org/pdf/2402.03216', 'total_pages': 18, 'page': 0, 'page_label': '1'}
Number of chunks after splitting: 87
Vector database saved
Answer: M3-Embedding stands for Multi-Linguality, Multi-Functionality, and Multi-Granularity.
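The first answer in the output comes from the bare model and gets the expansion wrong; the last one comes from the retrieval chain and matches the paper. To see which chunks drove the second answer, the dictionary returned by `create_retrieval_chain` also carries the retrieved documents under the `"context"` key. A minimal sketch of a helper that lists them; `summarize_sources` is a hypothetical name, and the stand-in response below uses placeholder text so the block runs without an API key:

```python
from types import SimpleNamespace


def summarize_sources(response, preview_chars=80):
    """Return one line per retrieved chunk: rank, page index, text preview.

    Hypothetical helper. Assumes response["context"] is a list of objects
    exposing `.metadata` (dict) and `.page_content` (str), as LangChain
    Document objects do.
    """
    lines = []
    for rank, doc in enumerate(response.get("context", []), start=1):
        page = doc.metadata.get("page", "?")  # 0-based page index from the loader
        preview = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"[{rank}] page {page}: {preview}")
    return lines


# Stand-in for a real chain response; the texts are placeholders, not paper content.
fake_response = {
    "input": "placeholder question",
    "context": [
        SimpleNamespace(metadata={"page": 0}, page_content="placeholder   text of\nthe first chunk"),
        SimpleNamespace(metadata={}, page_content="second chunk"),
    ],
    "answer": "placeholder answer",
}

for line in summarize_sources(fake_response):
    print(line)
```

In the real script the call would be `summarize_sources(response)` right after `chain.invoke(...)`.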