Question mutiple pdf‘s using openai, pinecone, langchain

题意:使用 OpenAI、Pinecone 和 LangChain 对多个 PDF 文件进行提问。

问题背景:

I am trying to ask questions against a multiple pdf using pinecone and openAI but I dont know how to.

我正在尝试使用 Pinecone 和 OpenAI 对多个 PDF 文件进行提问,但我不知道该怎么做。

The code below works for asking questions against one document. but I would like to have multiple documents to ask questions against:

下面的代码可以用于对一个文档进行提问,但我想要能够对多个文档提问:

python 复制代码
# process_message.py
from flask import request
import pinecone
# from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import ElasticVectorSearch, Pinecone, Weaviate, FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
import os
import json
# from constants.company import file_company_id_column, file_location_column, file_name_column
from services.files import FileFireStorage
from middleware.auth import check_authorization
import configparser
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


def process_message():
    
    # Create a ConfigParser object and read the config.ini file
    config = configparser.ConfigParser()
    config.read('config.ini')
    # Retrieve the value of OPENAI_API_KEY
    openai_key = config.get('openai', 'OPENAI_API_KEY')
    pinecone_env_key = config.get('pinecone', 'PINECONE_ENVIRONMENT')
    pinecone_api_key = config.get('pinecone', 'PINECONE_API_KEY')


    loader = PyPDFLoader("docs/ops.pdf")
    data = loader.load()
    # data = body['data'][1]['name']
    # Print information about the loaded data
    print(f"You have {len(data)} document(s) in your data")
    print(f"There are {len(data[30].page_content)} characters in your document")

    # Chunk your data up into smaller documents
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
    texts = text_splitter.split_documents(data)
   

    embeddings = OpenAIEmbeddings(openai_api_key=openai_key)

    pinecone.init(api_key=pinecone_api_key, environment=pinecone_env_key)
    index_name = "pdf-chatbot"  # Put in the name of your Pinecone index here

    docsearch = Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name)
    # Query those docs to get your answer back
    llm = OpenAI(temperature=0, openai_api_key=openai_key)
    chain = load_qa_chain(llm, chain_type="stuff")

    query = "Are there any other documents listed in this document?"
    docs = docsearch.similarity_search(query)
    answer = chain.run(input_documents=docs, question=query)
    print(answer)

    return answer

I added as many comments as I could there. I got this information from

我在代码中添加了尽可能多的注释。我从以下来源获取了这些信息:https://www.youtube.com/watch?v=h0DHDp1FbmQ

I tried to look at other stackoverflow questions about this but could not find anything similar

我试图查看其他与此相关的 Stack Overflow 问题,但没有找到类似的内容。

问题解决:

You can load multiple PDFS with PyPDFDirectoryLoader

你可以使用 `PyPDFDirectoryLoader` 加载多个 PDF 文件。

相关推荐
蹦蹦跳跳真可爱5891 小时前
Python----计算机视觉处理(Opencv:道路检测之提取车道线)
python·opencv·计算机视觉
hello_simon2 小时前
在线小白工具,PPT转PDF支持多种热门工具,支持批量转换,操作简单,高效适合各种需求
pdf·html·powerpoint·excel·pdf转html·excel转pdf格式
Tanecious.3 小时前
机器视觉--python基础语法
开发语言·python
ALe要立志成为web糕手4 小时前
SESSION_UPLOAD_PROGRESS 的利用
python·web安全·网络安全·ctf
Tttian6225 小时前
Python办公自动化(3)对Excel的操作
开发语言·python·excel
蹦蹦跳跳真可爱5896 小时前
Python----机器学习(KNN:使用数学方法实现KNN)
人工智能·python·机器学习
独好紫罗兰6 小时前
洛谷题单2-P5713 【深基3.例5】洛谷团队系统-python-流程图重构
开发语言·python·算法
DREAM.ZL7 小时前
基于python的电影数据分析及可视化系统
开发语言·python·数据分析
碧海饮冰8 小时前
LangChain/Eliza框架在使用场景上的异同,Eliza通过配置实现功能扩展的例子
langchain·eliza
Uncertainty!!8 小时前
python函数装饰器
开发语言·python·装饰器