Easily Query PDF Files with Chainlit, LangChain, and Elasticsearch

In my previous article, "Elasticsearch: Chat with multiple PDFs | LangChain Python app tutorial (free LLMs and embeddings)", I described in detail how to use Streamlit, LangChain, Elasticsearch, and OpenAI to chat with PDFs. In this article, I use Chainlit to show how to query a PDF file with LangChain and Elasticsearch.

For convenience, my code can be downloaded from GitHub - liu-xiao-guo/langchain-openai-chainlit: Chat with your documents (pdf, csv, text) using Openai model, LangChain and Chainlit.

Installation

Install Elasticsearch and Kibana

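If you have not yet installed your own Elasticsearch and Kibana, install them before continuing.

When installing, choose Elastic Stack 8.x. On first startup, Elasticsearch 8.x prints its security setup information to the terminal, including the generated password for the elastic user; we need that password later as ES_PASSWORD.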

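Copy the Elasticsearch certificate

We copy the Elasticsearch certificate to the current directory. The certificate must end up in the directory from which the app is launched, because pdf_qa.py loads ./http_ca.crt: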

$ pwd
/Users/liuxg/python/elser
$ cp ~/elastic/elasticsearch-8.12.0/config/certs/http_ca.crt .
$ ls http_ca.crt
http_ca.crt
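
As a quick check that the copied certificate works, the sketch below uses only the Python standard library to call the cluster's root endpoint over HTTPS. It is a minimal sketch, not part of the project: the helper names basic_auth_header and check_cluster are hypothetical, and it assumes Elasticsearch listens on https://localhost:9200, as in pdf_qa.py.

```python
import base64
import json
import ssl
import urllib.request


def basic_auth_header(user: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def check_cluster(url: str, ca_file: str, user: str, password: str) -> dict:
    """Call the Elasticsearch root endpoint, verifying TLS against ca_file."""
    # Trust only the CA certificate copied from the Elasticsearch config directory.
    context = ssl.create_default_context(cafile=ca_file)
    request = urllib.request.Request(
        url, headers={"Authorization": basic_auth_header(user, password)}
    )
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return json.load(response)
```

With a running cluster, check_cluster("https://localhost:9200", "./http_ca.crt", "elastic", "<your password>")["version"]["number"] should return the cluster version. A certificate error at this point usually means http_ca.crt is not the file from the running installation.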

Install the Python dependencies

In the current directory, run the following commands:

python3 -m venv .venv
source .venv/bin/activate

Then run the following commands:

$ pwd
/Users/liuxg/python/langchain-openai-chainlit
$ source .venv/bin/activate
(.venv) $ pip3 install -r requirements.txt
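
To confirm that the install step worked, a small sketch like the following can report which of the modules imported by pdf_qa.py are still missing from the active virtual environment. The module list is taken from the import statements in pdf_qa.py; the helper name missing_modules is hypothetical.

```python
from importlib.util import find_spec

# Top-level modules imported by pdf_qa.py (taken from its import statements).
REQUIRED_MODULES = [
    "langchain",
    "langchain_openai",
    "langchain_community",
    "chainlit",
    "PyPDF2",
    "elasticsearch",
    "dotenv",
]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


print("missing:", missing_modules(REQUIRED_MODULES) or "none")
```

Run it with the virtual environment activated; any name it lists still needs to be installed.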

Run the application

For more about Chainlit, see Overview - Chainlit; I won't repeat it here. The code of pdf_qa.py is as follows:

pdf_qa.py

# Import necessary modules and define env variables

# from langchain.embeddings.openai import OpenAIEmbeddings
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_openai import ChatOpenAI
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
import os
import chainlit as cl
import PyPDF2
from io import BytesIO

from pprint import pprint
# from langchain.vectorstores import ElasticsearchStore
from langchain_community.vectorstores import ElasticsearchStore
from elasticsearch import Elasticsearch

from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
ES_USER = os.getenv("ES_USER")
ES_PASSWORD = os.getenv("ES_PASSWORD")
elastic_index_name = "pdf_docs"

# text_splitter and system template

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

system_template = """Use the following pieces of context to answer the users question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
ALWAYS return a "SOURCES" part in your answer.
The "SOURCES" part should be a reference to the source of the document from which you got your answer.

Example of your response should be:

```
The answer is foo
SOURCES: xyz
```

Begin!
----------------
{summaries}"""

messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}"),
]
prompt = ChatPromptTemplate.from_messages(messages)
chain_type_kwargs = {"prompt": prompt}


@cl.on_chat_start
async def on_chat_start():

    # To show a welcome image, add an element such as
    # cl.Image(name=..., display="inline", path=...) to this list.
    elements = []
    await cl.Message(content="Hello there, Welcome to AskAnyQuery related to Data!", elements=elements).send()
    files = None

    # Wait for the user to upload a PDF file
    while files is None:
        files = await cl.AskFileMessage(
            content="Please upload a PDF file to begin!",
            accept=["application/pdf"],
            max_size_mb=20,
            timeout=180,
        ).send()

    file = files[0]

    msg = cl.Message(content=f"Processing `{file.name}`...")
    await msg.send()

    # Read the PDF file from the path Chainlit saved it to
    with open(file.path, 'rb') as f:
        pdf_content = f.read()
    pdf_stream = BytesIO(pdf_content)
    pdf = PyPDF2.PdfReader(pdf_stream)
    pdf_text = ""
    for page in pdf.pages:
        # Guard against pages for which no text can be extracted
        pdf_text += page.extract_text() or ""

    # Split the text into chunks
    texts = text_splitter.split_text(pdf_text)

    # Create metadata for each chunk
    metadatas = [{"source": f"{i}-pl"} for i in range(len(texts))]

    # Create the embedding model used by the Elasticsearch vector store
    embeddings = OpenAIEmbeddings()

    url = f"https://{ES_USER}:{ES_PASSWORD}@localhost:9200"

    connection = Elasticsearch(
        hosts=[url],
        ca_certs="./http_ca.crt",
        verify_certs=True
    )

    docsearch = None

    if not connection.indices.exists(index=elastic_index_name):
        print("The index does not exist, going to generate embeddings")
        docsearch = await cl.make_async(ElasticsearchStore.from_texts)(
            texts,
            embedding=embeddings,
            es_url=url,
            es_connection=connection,
            index_name=elastic_index_name,
            es_user=ES_USER,
            es_password=ES_PASSWORD,
            metadatas=metadatas
        )
    else:
        # Reuse the existing index; the uploaded PDF is not indexed again in this branch
        print("The index already existed")

        docsearch = ElasticsearchStore(
            es_connection=connection,
            embedding=embeddings,
            es_url=url,
            index_name=elastic_index_name,
            es_user=ES_USER,
            es_password=ES_PASSWORD
        )

    # Create a chain that uses the Elasticsearch vector store
    # and the chat prompt defined above
    chain = RetrievalQAWithSourcesChain.from_chain_type(
        ChatOpenAI(temperature=0),
        chain_type="stuff",
        retriever=docsearch.as_retriever(search_kwargs={"k": 4}),
        chain_type_kwargs=chain_type_kwargs,
    )

    # Save the metadata and texts in the user session
    cl.user_session.set("metadatas", metadatas)
    cl.user_session.set("texts", texts)

    # Let the user know that the system is ready
    msg.content = f"Processing `{file.name}` done. You can now ask questions!"
    await msg.update()

    cl.user_session.set("chain", chain)


@cl.on_message
async def main(message: cl.Message):

    chain = cl.user_session.get("chain")  # type: RetrievalQAWithSourcesChain
    print("chain type: ", type(chain))
    cb = cl.AsyncLangchainCallbackHandler(
        stream_final_answer=True, answer_prefix_tokens=["FINAL", "ANSWER"]
    )
    cb.answer_reached = True

    # Debug output: the incoming message and its fields
    print("message: ", message)
    pprint(vars(message))
    print(message.content)
    res = await chain.acall(message.content, callbacks=[cb])

    answer = res["answer"]
    sources = res["sources"].strip()
    source_elements = []

    # Get the metadata and texts from the user session
    metadatas = cl.user_session.get("metadatas")
    all_sources = [m["source"] for m in metadatas]
    texts = cl.user_session.get("texts")

    print("texts: ", texts)

    if sources:
        found_sources = []

        # Add the sources to the message
        for source in sources.split(","):
            source_name = source.strip().replace(".", "")
            # Get the index of the source; skip names that match no chunk
            try:
                index = all_sources.index(source_name)
            except ValueError:
                continue
            text = texts[index]
            found_sources.append(source_name)
            # Create the text element referenced in the message
            source_elements.append(cl.Text(content=text, name=source_name))

        if found_sources:
            answer += f"\nSources: {', '.join(found_sources)}"
        else:
            answer += "\nNo sources found"

    if cb.has_streamed_final_answer:
        cb.final_stream.elements = source_elements
        await cb.final_stream.update()
    else:
        await cl.Message(content=answer, elements=source_elements).send()
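
The way pdf_qa.py labels chunks and later maps the model's SOURCES string back to them can be tried in isolation with the sketch below. It is deliberately simplified: split_with_overlap cuts fixed-size character windows, whereas RecursiveCharacterTextSplitter prefers to break at separators such as paragraph and line breaks, so the real chunks will differ. Both function names are hypothetical; only the "{i}-pl" label format and the comma and period handling come from the listing above.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into character windows where neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


def resolve_sources(sources, texts):
    """Map a SOURCES string such as '0-pl, 2-pl.' back to the chunks it names."""
    labels = [f"{i}-pl" for i in range(len(texts))]  # same labels as in pdf_qa.py
    found = {}
    for source in sources.split(","):
        name = source.strip().replace(".", "")
        if name in labels:  # ignore names that match no chunk
            found[name] = texts[labels.index(name)]
    return found


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)                                        # ['abcd', 'defg', 'ghij']
print(resolve_sources("0-pl, 2-pl., 9-pl", chunks))  # {'0-pl': 'abcd', '2-pl': 'ghij'}
```

Unknown labels such as 9-pl are dropped silently, which mirrors the try/except ValueError branch in the on_message handler.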

We can run it with the following commands:

export ES_USER="elastic"
export ES_PASSWORD="xnLj56lTrH98Lf_6n76y"
export OPENAI_API_KEY="YourOpenAiKey"

chainlit run pdf_qa.py -w

(.venv) $ chainlit run pdf_qa.py -w
2024-02-14 10:58:30 - Loaded .env file
2024-02-14 10:58:33 - Your app is available at http://localhost:8000
2024-02-14 10:58:34 - Translation file for en not found. Using default translation en-US.
2024-02-14 10:58:35 - 2 changes detected
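
pdf_qa.py reads ES_USER, ES_PASSWORD, and OPENAI_API_KEY with os.getenv, so a missing variable only surfaces later as a connection or authentication error. A small pre-flight check along these lines can catch that earlier. This is a sketch under assumptions: the helper name missing_settings is hypothetical, and it only inspects the process environment, so values that live solely in a .env file are not seen unless load_dotenv() has been called first.

```python
import os

# Variables that pdf_qa.py reads via os.getenv.
REQUIRED_SETTINGS = ("ES_USER", "ES_PASSWORD", "OPENAI_API_KEY")


def missing_settings(env):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


missing = missing_settings(os.environ)
if missing:
    print("Please set:", ", ".join(missing))
```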

We first choose the PDF file that ships with the project. We can then ask questions such as:

Is sample PDF download critical to an organization?

Does comprehensive PDF testing have various advantages?