LangChain(4)检索增强 Retrieval Augmentation

LangChain(4)检索增强 Retrieval Augmentation

Large Language Models (LLMs) 的能力或者知识来自两方面:模型在训练时候的输入;模型训练好后以提示词方式输入到模型中的知识source knowledge。检索增强就是指后期输入到模型中的附加信息。

文本分段

按顺序安装包:

复制代码
!pip install -qU \
    datasets==2.12.0 \
    apache_beam \
    mwparserfromhell
 
 !pip install -qU \
  langchain==0.0.162 \
  openai==0.27.7 \
  tiktoken==0.4.0 \
  "pinecone-client[grpc]"==2.2.2
python 复制代码
from datasets import load_dataset
# 下载维基百科资料
data = load_dataset("wikipedia", "20220301.simple", split='train[:10000]')

# 分词工具
import tiktoken
tiktoken.encoding_for_model('gpt-3.5-turbo')

import tiktoken
tokenizer = tiktoken.get_encoding('cl100k_base')
# 计算分词后的token数 create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

# 使用 RecursiveCharacterTextSplitter 将整段文本分割,限定每个片段的最大token数
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=20,
    length_function=tiktoken_len, #计量token数
    separators=["\n\n", "\n", " ", ""]
)

# 使用方式
chunks = text_splitter.split_text(data[6]['text'])[:3]
# 计算token数
tiktoken_len(chunks[0])

构建 Embedding

python 复制代码
import os
# 设置OPENAI_API_KEY  get openai api key from platform.openai.com
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or 'OPENAI_API_KEY'

from langchain.embeddings.openai import OpenAIEmbeddings
# 向量化的模型
model_name = 'text-embedding-ada-002'

embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)

# 测试文本
texts = [
    'this is the first chunk of text',
    'then another second chunk of text is here']

res = embed.embed_documents(texts)
print(len(res), len(res[0]))
>>>2 1536 # 向量长度为 1536

存储向量

使用 Pinecone 存储向量。

python 复制代码
index_name = 'langchain-retrieval-augmentation'

import pinecone

# find API key in console at app.pinecone.io
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY') or 'PINECONE_API_KEY'
# find ENV (cloud region) next to API key in console
PINECONE_ENVIRONMENT = os.getenv('PINECONE_ENVIRONMENT') or 'PINECONE_ENVIRONMENT'

pinecone.init(
    api_key=YOUR_API_KEY,
    environment=YOUR_ENV
)

if index_name not in pinecone.list_indexes():
    # we create a new index
    pinecone.create_index(
        name=index_name,
        metric='cosine',
        dimension=len(res[0])  # 1536 dim of text-embedding-ada-002
        )

# 连接库索引
index = pinecone.GRPCIndex(index_name)
print(index.describe_index_stats()) # 库索引统计信息

>>>{'dimension': 1536,
 'index_fullness': 0.1,
 'namespaces': {'': {'vector_count': 27437}},
 'total_vector_count': 27437}

按批将数据插入索引库中

python 复制代码
from tqdm.auto import tqdm
from uuid import uuid4
# 批量大小
batch_limit = 100

texts = []
metadatas = []

for i, record in enumerate(tqdm(data)):
    # 维基百科中文本原始信息 first get metadata fields for this record
    metadata = {
        'wiki-id': str(record['id']),
        'source': record['url'],
        'title': record['title']
    }
    # 文本分段 now we create chunks from the record text
    record_texts = text_splitter.split_text(record['text'])
    # 为每一个分段文本创建元信息:j第几个片段 text片段文本 其它几个维基百科字段:wiki-id、source、title  create individual metadata dicts for each chunk
    record_metadatas = [{"chunk": j, "text": text, **metadata} for j, text in enumerate(record_texts)]
    # append these to current batches
    texts.extend(record_texts)
    metadatas.extend(record_metadatas)
    # if we have reached the batch_limit we can add texts
    if len(texts) >= batch_limit:
        ids = [str(uuid4()) for _ in range(len(texts))]
        embeds = embed.embed_documents(texts)
        index.upsert(vectors=zip(ids, embeds, metadatas))
        texts = []
        metadatas = []

if len(texts) > 0:
    ids = [str(uuid4()) for _ in range(len(texts))]
    embeds = embed.embed_documents(texts)
    index.upsert(vectors=zip(ids, embeds, metadatas))

向量查询

python 复制代码
from langchain.vectorstores import Pinecone

text_field = "text" # 需要查询出来的字段
# 向量化的模型
model_name = 'text-embedding-ada-002'
embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)

# switch back to normal index for langchain
index = pinecone.Index(index_name)

vectorstore = Pinecone(
    index, embed.embed_query, text_field
)

# 查询信息
query = "who was Benito Mussolini?"
vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

检索信息结合LLM

python 复制代码
from langchain.chains import RetrievalQA

# completion llm
llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name='gpt-3.5-turbo',
    temperature=0.0
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

print(qa.run(query))

>>>'Benito Mussolini was an Italian politician and journalist who served as the Prime Minister of Italy from 1922 until 1943.'

有时 LLM 回答不着边,没有完全按照提供的信息回答,可以通过 RetrievalQAWithSourcesChain 使得回答更可信,模型会返回参考的来源信息

python 复制代码
from langchain.chains import RetrievalQAWithSourcesChain

qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)
print(qa_with_sources(query))

>>>{'question': 'who was Benito Mussolini?',
 'answer': 'Benito Mussolini was an Italian politician and journalist who was the Prime Minister of Italy from 1922 until 1943.', 
 'sources': 'https://simple.wikipedia.org/wiki/Benito%20Mussolini, https://simple.wikipedia.org/wiki/Fascism'}

参考:
Fixing Hallucination with Knowledge Bases
Retrieval Augmentation

相关推荐
羑悻的小杀马特1 小时前
OpenCV 引擎:驱动实时应用开发的科技狂飙
人工智能·科技·opencv·计算机视觉
蹦蹦跳跳真可爱5892 小时前
Python----计算机视觉处理(Opencv:道路检测之提取车道线)
python·opencv·计算机视觉
Tanecious.4 小时前
机器视觉--python基础语法
开发语言·python
guanshiyishi4 小时前
ABeam 德硕 | 中国汽车市场(2)——新能源车的崛起与中国汽车市场机遇与挑战
人工智能
ALe要立志成为web糕手4 小时前
SESSION_UPLOAD_PROGRESS 的利用
python·web安全·网络安全·ctf
极客天成ScaleFlash5 小时前
极客天成NVFile:无缓存直击存储性能天花板,重新定义AI时代并行存储新范式
人工智能·缓存
澳鹏Appen6 小时前
AI安全:构建负责任且可靠的系统
人工智能·安全
Tttian6226 小时前
Python办公自动化(3)对Excel的操作
开发语言·python·excel
蹦蹦跳跳真可爱5896 小时前
Python----机器学习(KNN:使用数学方法实现KNN)
人工智能·python·机器学习
视界宝藏库7 小时前
多元 AI 配音软件,打造独特音频体验
人工智能