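Hi everyone, I'm Yufei. In earlier posts I shared some tips and lessons learned from using RAG. Today I'll cover two more advanced approaches to text splitting and retrieval, which I hope you find useful. Both come from LangChain's official examples, so if you're interested you can also go straight to the official documentation (Retrievers | LangChain).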

1. Parent Document Retrieval (Parent Document Retriever)

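When splitting text, there are two competing considerations:

Fine-grained chunks: when a small chunk is encoded by an embedding model, the resulting vector represents the chunk's meaning more accurately, and the chunk contains less noise.

Coarse-grained chunks: after retrieval, the chunks are passed to an LLM to generate the answer. Fine-grained chunks have lost some of their surrounding context by this point, so part of the meaning is missing, whereas larger chunks preserve it.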
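To make the trade-off concrete, here is a minimal sketch that splits the same text at two granularities. The `split_text` helper and the sample text are illustrative assumptions, not LangChain code; a real splitter such as RecursiveCharacterTextSplitter also tries to break on natural separators rather than at fixed offsets.

```python
from typing import List


def split_text(text: str, chunk_size: int) -> List[str]:
    """Hypothetical helper: cut text into consecutive chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Toy text, assumed for illustration only
text = "RAG retrieves relevant chunks first. " * 20

fine_chunks = split_text(text, chunk_size=100)    # focused meaning, little context
coarse_chunks = split_text(text, chunk_size=400)  # more context, more noise

print(len(fine_chunks), len(coarse_chunks))
```

The smaller chunk size yields more, shorter pieces to embed, while the larger one yields fewer pieces that each carry more surrounding text.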

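LangChain's Parent Document Retriever combines chunks of different granularity in a single retrieval pipeline. It works as follows:

First, two text splitters divide the text into parent chunks and child chunks. A vector store holds the child chunks, and an in-memory store holds the parent chunks. We then create a ParentDocumentRetriever, pass in the splitters and stores defined above, and call its add_documents method to add the documents to the retriever.

At query time, we call get_relevant_documents. Under the hood, this runs a vector search over the child chunks, reads the parent ID recorded with each matching child chunk, and returns the corresponding parent chunks to the user.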
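The flow described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not LangChain's implementation: `ToyParentRetriever`, `split_words`, `embed`, and `cosine` are hypothetical names, chunk sizes are counted in words, and a bag-of-words count vector stands in for a real embedding model.

```python
import math
import re
import uuid
from collections import Counter
from typing import Dict, List, Tuple


def split_words(text: str, chunk_size: int) -> List[str]:
    """Split text into chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase word counts instead of a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyParentRetriever:
    def __init__(self, parent_size: int, child_size: int) -> None:
        self.parent_size = parent_size
        self.child_size = child_size
        self.docstore: Dict[str, str] = {}                # parent_id -> parent text
        self.child_index: List[Tuple[Counter, str]] = []  # (child vector, parent_id)

    def add_documents(self, texts: List[str]) -> None:
        for text in texts:
            for parent in split_words(text, self.parent_size):
                parent_id = str(uuid.uuid4())
                self.docstore[parent_id] = parent
                # Only the child chunks are embedded; each one remembers its parent.
                for child in split_words(parent, self.child_size):
                    self.child_index.append((embed(child), parent_id))

    def retrieve(self, query: str, k: int = 2) -> List[str]:
        query_vec = embed(query)
        ranked = sorted(self.child_index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
        parent_ids: List[str] = []
        for _, parent_id in ranked[:k]:  # top-k child hits
            if parent_id not in parent_ids:  # several children may share one parent
                parent_ids.append(parent_id)
        return [self.docstore[pid] for pid in parent_ids]


texts = [
    "the court welcomed justice breyer today and thanked him for his long service",
    "the essay describes how painting and programming share many habits of mind",
]
retriever = ToyParentRetriever(parent_size=20, child_size=4)
retriever.add_documents(texts)
results = retriever.retrieve("justice breyer", k=1)
print(results[0])
```

The query matches a four-word child chunk, but the caller receives the whole parent text, which is the behavior the section describes.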

Below is the LangChain example code:


from langchain.retrievers import ParentDocumentRetriever
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.document_loaders import TextLoader

loaders = [
    TextLoader('../../paul_graham_essay.txt'),
    TextLoader('../../state_of_the_union.txt'),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

# This text splitter is used to create the parent documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
# This text splitter is used to create the child documents
# It should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="split_parents", embedding_function=OpenAIEmbeddings())
# The storage layer for the parent documents
store = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore, 
    docstore=store, 
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

retriever.add_documents(docs)

retrieved_docs = retriever.get_relevant_documents("justice breyer")

2. Multi-Vector Retrieval (MultiVector Retriever)

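The idea behind parent document retrieval is to use similarity search to find a small, semantically similar chunk and then return the larger chunk that supplies its full context. This helps the model understand the context and produce a higher-quality answer.

Multi-vector retrieval, by contrast, builds several vectors for the same document, each from a different perspective. For example:

Splitting: split the document into chunks of different sizes and build a vector for each chunk.

Summaries: create a summary for each document and embed it alongside the document (or in place of it). An LLM can be used to write the summary. A summary generally captures the meaning of the whole document while being more concise than the original. When using an LLM for summarization, you need to evaluate its output quality; generally speaking, models in the 6B to 7B range do not perform very well here.

Hypothetical questions: for each document, create hypothetical questions that it is well suited to answer, and embed them alongside the document. This is a classic case of working backwards: given an existing document, have the LLM propose several questions the document can answer, then encode those questions as vectors and store them. The underlying assumption is that if a document can answer a given question, it can also answer similar questions. This assumption holds in most cases, so the approach is usable in practice.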
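The common thread in these three approaches is that several vectors point back to one stored document. Here is a minimal sketch of that idea in plain Python. It is not LangChain's implementation: the document IDs, the hand-written summaries and questions (standing in for LLM output), and the `embed`, `cosine`, and `retrieve` helpers are all assumptions for illustration, and word counts replace a real embedding model.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase word counts instead of a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Full documents, stored by ID (the "docstore").
docstore: Dict[str, str] = {
    "doc-a": "Long speech text. It thanks Justice Breyer for his service and covers many other topics.",
    "doc-b": "Long essay text. It recounts learning to program and to paint before starting a company.",
}

# Several "views" per document; hand-written stand-ins for what an LLM would generate.
views: List[Tuple[str, str, str]] = [
    ("doc-a", "summary", "a speech honoring a retiring supreme court justice"),
    ("doc-a", "question", "who did the president thank for serving on the court?"),
    ("doc-b", "summary", "an essay about programming, painting and startups"),
    ("doc-b", "question", "what did the author work on before starting a company?"),
]

# Only the views are embedded; each vector carries the ID of its source document.
index: List[Tuple[Counter, str]] = [(embed(text), doc_id) for doc_id, _kind, text in views]


def retrieve(query: str, k: int = 2) -> List[str]:
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    doc_ids: List[str] = []
    for _, doc_id in ranked[:k]:
        if doc_id not in doc_ids:  # two views of one document count once
            doc_ids.append(doc_id)
    return [docstore[doc_id] for doc_id in doc_ids]


hits = retrieve("who was thanked for serving on the court?", k=1)
print(hits[0])
```

The query is phrased like one of the stored questions, so it matches that view, and the lookup by ID returns the full document rather than the short question text.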


import uuid
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.document_loaders import TextLoader
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.document import Document
from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser


def func1():
    # Approach 1: split each document into smaller chunks
    child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
    sub_docs = []
    for i, doc in enumerate(docs):
        _id = doc_ids[i]
        _sub_docs = child_text_splitter.split_documents([doc])
        for _doc in _sub_docs:
            _doc.metadata[id_key] = _id
        sub_docs.extend(_sub_docs)
    return sub_docs

def func2():
    # Approach 2: generate a summary for each document
    chain = (
        {"doc": lambda x: x.page_content}
        | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
        | ChatOpenAI(max_retries=0)
        | StrOutputParser()
    )
    summaries = chain.batch(docs, {"max_concurrency": 5})
    summary_docs = [Document(page_content=s,metadata={id_key: doc_ids[i]}) for i, s in enumerate(summaries)]
    return summary_docs

def func3():
    # Approach 3: generate hypothetical questions for each document
    functions = [
        {
        "name": "hypothetical_questions",
        "description": "Generate hypothetical questions",
        "parameters": {
            "type": "object",
            "properties": {
            "questions": {
                "type": "array",
                "items": {
                    "type": "string"
                },
            },
            },
            "required": ["questions"]
        }
        }
    ]
    chain = (
        {"doc": lambda x: x.page_content}
        # Only asking for 3 hypothetical questions, but this could be adjusted
        | ChatPromptTemplate.from_template("Generate a list of 3 hypothetical questions that the below document could be used to answer:\n\n{doc}")
        | ChatOpenAI(max_retries=0, model="gpt-4").bind(functions=functions, function_call={"name": "hypothetical_questions"})
        | JsonKeyOutputFunctionsParser(key_name="questions")
    )
    hypothetical_questions = chain.batch(docs, {"max_concurrency": 5})
    question_docs = []
    for i, question_list in enumerate(hypothetical_questions):
        question_docs.extend([Document(page_content=s,metadata={id_key: doc_ids[i]}) for s in question_list])
    return question_docs

def get_docs(func_num):
    if func_num==1:
        return func1()
    elif func_num==2:
        return func2()
    elif func_num==3:
        return func3()
    else:
        return []

# Retrieval pipeline
loaders = [
    TextLoader('../../paul_graham_essay.txt'),
    TextLoader('../../state_of_the_union.txt'),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)

# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents",
    embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore, 
    docstore=store, 
    id_key=id_key,
)

doc_ids = [str(uuid.uuid4()) for _ in docs]
# Pick the indexing strategy: 1 = smaller chunks, 2 = summaries, 3 = hypothetical questions
func_num = 1
candidate_docs = get_docs(func_num)

retriever.vectorstore.add_documents(candidate_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

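# Searching the vector store directly returns the small indexed item (chunk, summary, or question)
retriever.vectorstore.similarity_search("justice breyer")[0]

# Querying the retriever instead returns the full document stored under the matching doc_id
retriever.get_relevant_documents("justice breyer")[0]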

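Walking with Yufei

Yufei
My day job is recommendation algorithms.
I hope to use my writing and media work to build a life where I keep working without holding a regular job.
WeChat: 1060687688
Feel free to reach out and say hello 🫰

That's all for this post. If you found it useful, likes and comments are welcome 🫰. It's a new day; may the sunshine fall on your face.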
