RAG from Scratch (Part 3): Five Levels of Chunking, from Character Splitting to Semantic Awareness

Video: www.youtube.com/watch?v=8OJ...
Code: github.com/FullStackRe...
Text-splitting visualization tool: chunkviz.up.railway.app/
community.fullstackretrieval.com/introductio...

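Preface

Hello everyone. This is the third article in our RAG series. The first two looked at RAG from the query side: query translation, which improves retrieval quality by rewriting the question; routing, which picks the right data source for a question; and query construction for structured and unstructured data. The next step in optimizing the overall RAG pipeline is the index. As a reminder, an index is built as follows:

Load documents
Chunk
Embed
Save to a vector database

This is the general indexing process. Document loading is only preparation and needs little discussion; it is from the second step, chunking, that choices start to affect the quality of the index. This article therefore focuses on that first real stage of index building, text chunking, and walks through five levels of chunking from easy to hard.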

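Chunking Theory

First, a question: why is chunking needed at all when building LLM applications? There are two main reasons:

The LLM's context window is limited, so not all of the text fits.
Too much information is itself a form of noise and hurts LLM performance.

This raises another question: "What is the best way to pass the LLM the data it needs?" Inside a RAG pipeline, that becomes "What form should the data take so that it helps retrieval?" The answer is chunking.

Text chunking (also simply called chunking) is the process of splitting data into smaller pieces. Its purpose is to help the LLM in downstream tasks.

In the basic RAG flow, retrieval can be seen as the act of collecting the right information for the LLM. The source data sits at the bottom of the flow diagram and ultimately ends up in the knowledge base.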

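Core Concepts

Text splitting rests on one idea: cut the text into chunks of a given size, optionally with some overlap between adjacent chunks. Two parameters capture this:

chunk_size is the size of each chunk, measured in tokens or characters.
chunk_overlap is the amount of text shared by consecutive chunks, which preserves context across chunk boundaries.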
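To make the two parameters concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. The function name chunk_by_size and the sample string are assumptions for illustration only; this is not LangChain's implementation.

```python
def chunk_by_size(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
overlap_chunks = chunk_by_size(sample, chunk_size=10, chunk_overlap=4)
print(overlap_chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk starts with the last 4 characters of the previous one, which is what chunk_overlap buys: text near a boundary appears in both neighbours.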

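LangChain text splitters expose two main methods:

create_documents()
split_documents()

Both follow the same underlying logic but expose different interfaces: create_documents() accepts a list of text strings, while split_documents() accepts existing Document objects.

The Document object is LangChain's basic container for text data. It usually has two parts:

Content (page_content): the main text, such as a paragraph or an article.

Metadata (metadata): additional information such as source, author, or time, used for context enrichment and retrieval optimization.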

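Level 1: Character Splitting with CharacterTextSplitter

How It Works

Character splitting is the most basic kind of splitting. As the name suggests, it cuts a static string purely by character count.

Chunk size: the number of characters in each chunk.
Chunk overlap: the number of characters shared between adjacent chunks, mainly to avoid scattering one piece of context across different chunks.

The figure above shows text split by a character splitter: it is divided into 3 chunks of 35 characters each, with 4 characters of overlap between adjacent chunks.

Character splitting is simple and easy to understand, but it is rigid and ignores the structure of the text entirely. In the figure above, Chunk #1 ends with the letter t, in the middle of a word.

Code Implementation

LangChain ships several ready-to-use splitters; character splitting corresponds to CharacterTextSplitter. Defining a splitter only requires chunk_size and chunk_overlap, although other options can be configured as well.

python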

# Text to split
text = "This is the text I would like to chunk up. It is the example text for this exercise"

from langchain.text_splitter import CharacterTextSplitter

# Define the splitter: the separator to split on, plus chunk_size and chunk_overlap.
# strip_whitespace controls whether whitespace is trimmed from both ends of each chunk;
# with False, the trailing space at the end of chunk 2 is kept.
text_splitter = CharacterTextSplitter(chunk_size = 35, chunk_overlap=0, separator='', strip_whitespace=False)

# Split the text
text_splitter.create_documents([text])
"""
[Document(page_content='This is the text I would like to ch'), Document(page_content='unk up. It is the example text for '), Document(page_content='this exercise')]
"""

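Splitting with CharacterTextSplitter, we can see whole words cut across two different chunks.

The separator Parameter

separator specifies the delimiter. In LangChain it defaults to "\n\n"; passing an empty string, as in the example above, allows a cut at any position as long as chunk_size is satisfied. Any other string can be specified manually.

python

# Use 'ch' as the separator
text_splitter = CharacterTextSplitter(chunk_size = 35, chunk_overlap=0, separator='ch')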

text_splitter.create_documents([text])

"""
[Document(page_content='This is the text I would like to'), Document(page_content='unk up. It is the example text for this exercise')]
"""

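With ch as the separator, the same text is split into 2 chunks, and the separator itself is not included in the chunks.

python

documents = text_splitter.create_documents([text])
len(documents[0].page_content)  # 32

Here the first chunk has length 32, below the specified 35. That is because CharacterTextSplitter works in two steps:

Split the text on the given separator (paragraph breaks, line breaks, or any other rule) into pieces.
Merge adjacent pieces back together as long as the merged text stays within chunk_size. A single piece that is already longer than chunk_size is kept whole, which is why the second chunk above exceeds 35 characters.

As a result, a chunk's length does not necessarily equal the configured chunk_size.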
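The 32-character first chunk can be reproduced with a small sketch that splits on a single separator and then greedily merges neighbouring pieces up to chunk_size. The function split_and_merge is a simplified stand-in written for this article, not LangChain's actual code, and it ignores chunk_overlap.

```python
def split_and_merge(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on one separator, then greedily merge neighbours up to chunk_size."""
    pieces = text.split(separator) if separator else list(text)
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep growing the current chunk
            continue
        if current.strip():
            chunks.append(current.strip())
        current = piece  # start a new chunk, even if this piece is oversized
    if current.strip():
        chunks.append(current.strip())
    return chunks


text = "This is the text I would like to chunk up. It is the example text for this exercise"
chunks = split_and_merge(text, separator="ch", chunk_size=35)
lengths = [len(c) for c in chunks]
print(chunks, lengths)
```

The first piece fits within 35 characters and is emitted after trimming its trailing space; the second piece has no further "ch" to split on, so it is emitted whole even though it is longer than chunk_size.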

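Limitations

CharacterTextSplitter's design rests on two mechanisms:

Single separator: only one separator can be specified (such as \n\n or \n). If the text does not contain it, there is no split point and the text comes back as one oversized chunk.
Fixed chunk size: text is cut toward a fixed length with no awareness of its structure, which can break semantic coherence.

These design choices are what limit CharacterTextSplitter: it may cut a single word across two chunks.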

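Level 2: Recursive Character Splitting with RecursiveCharacterTextSplitter

How It Works

Simple character splitting has one problem: it ignores the inherent structure of the document and may cut a word across two chunks.

To address this, recursive splitting builds on simple character splitting and adds two core ideas:

Multiple separators. The following are supported by default, in descending priority:
- \n\n: double newline, usually used to split paragraphs.
- \n: single newline, used to split lines (such as list items or lines in a code block).
- " ": space, splits by word.
- "": empty string, splits by character.
Recursive splitting. The separators are tried in priority order, refining the split from coarse to fine granularity.

The full workflow of RecursiveCharacterTextSplitter is:

Initialization:
Specify the target chunk size (chunk_size), the overlap between chunks (chunk_overlap), and the list of separators (separators).
First-level split:
Use the highest-priority separator (such as \n\n) to split the text into paragraphs.
Size check:
Check each resulting piece. If a piece is larger than chunk_size, go one level down and split it again with the next separator (such as \n).
Recursion:
Repeat until every piece is no larger than chunk_size or the separators are exhausted.
Forced truncation:
Pieces that cannot be split any further are cut by character count to the target size.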
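The workflow above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: the function recursive_split is hypothetical, it ignores chunk_overlap, and it is not LangChain's actual implementation.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text so no chunk exceeds chunk_size, preferring coarse separators."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # No separator left: forced truncation by character count.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # merge small neighbours back together
            continue
        if current.strip():
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too large: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks


doc = "aaa bbb ccc\n\nddd eee"
result = recursive_split(doc, chunk_size=7)
print(result)
# ['aaa bbb', 'ccc', 'ddd eee']
```

The first paragraph is too long for 7 characters, so it falls through to the space separator and is split by words; the second paragraph fits and is kept whole.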

Code Implementation

Suppose the text to split is:

python

text = """

One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.

  

Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.

  

It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]

"""

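Define a text splitter here, specifying the chunk size and overlap.

python

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size = 450, chunk_overlap=0)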

text_splitter.create_documents([text])

"""
[Document(page_content="One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear."), Document(page_content='Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor\'s, you don\'t get half as many customers. You get no customers, and you go out of business.'), Document(page_content="It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]")]
"""

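Printing the chunks shows that, although the paragraphs differ in length, chunk_size=450 splits the text cleanly into 3 chunks, one per paragraph.

Raising chunk_size further to 469 does not change the result: the splitter cuts at paragraph boundaries (\n\n) first and merges adjacent paragraphs only when the combined text still fits within chunk_size. This is the recursive splitting of RecursiveCharacterTextSplitter at work.

In practice, RecursiveCharacterTextSplitter should be the first choice even for plain text.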

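Level 3: Document-Specific Splitting

So far we have covered character splitting and recursive character splitting, both suited to ordinary text documents. For documents in specific formats such as Markdown, PDF, or code, more fine-grained splitting strategies are needed.

Splitting Markdown Documents

Markdown is a widely used lightweight markup language. Its documents have rich structure: headings, lists, code blocks, and more. To split Markdown well, we should exploit that hierarchy. Headings, for example, naturally delimit semantic units, and the content under a heading usually belongs to one topic, so headings are the most important split points.

MarkdownTextSplitter splits along this inherent structure by defining multiple separators, including:

\n#{1,6}: headings (H1 to H6), highest priority
\n\n: paragraphs, used to split semantic units
"```\n": code blocks, used to split technical content
\n\\*\\*\\*+\n and \n---+\n: horizontal rules, used to split sections
\n: single newline, used to split lines
" ": space, used to split words
"": empty string, used to split characters

The Markdown text splitter works as follows:

Initialize the splitter with chunk_size and the separator priorities.
Split the text recursively in priority order:
- If a piece is larger than chunk_size, split it again with the next separator.
- If a piece is smaller than chunk_size, keep it.
Repeat until every piece satisfies chunk_size or all separators have been used.
Merge small pieces and return the final list of chunks.

This recursive approach handles nested structure well and keeps each chunk semantically coherent while respecting chunk_size. Special structures such as nested lists or complex tables still need further work. PDF documents, covered later in this section, are more complex again and require handling text, images, and tables together.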
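As an illustration of why headings make good split points, the sketch below cuts a Markdown string just before every heading line using only the standard library. The function split_by_headings is a hypothetical helper, not part of LangChain, and it covers only the first (highest-priority) step; it does not enforce chunk_size and would also split on heading-like lines inside code blocks.

```python
import re


def split_by_headings(markdown: str) -> list[str]:
    """Cut a Markdown string just before every ATX heading (# to ######)."""
    # Zero-width split: match the start of a line that is followed by 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [s.strip() for s in sections if s.strip()]


md = "# Fun in California\n## Driving\nTry driving on the 1\n## Hiking\nGo to Yosemite\n"
sections = split_by_headings(md)
for section in sections:
    print(repr(section))
```

Each section keeps its heading together with the body underneath it, which is the property the heading separator is meant to preserve.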

Code Implementation

The following code builds a splitter for Markdown documents.

python

from langchain.text_splitter import MarkdownTextSplitter

splitter = MarkdownTextSplitter(chunk_size = 40, chunk_overlap=0)

Split the document:

python

markdown_text = """
# Fun in California
## Driving 
Try driving on the 1 down to San Diego
### Food
Make sure to eat a burrito while you're there
## Hiking
Go to Yosemite
"""

splitter.create_documents([markdown_text])
"""
[Document(page_content='# Fun in California\n## Driving'), Document(page_content='Try driving on the 1 down to San Diego'), Document(page_content='### Food'), 
Document(page_content="Make sure to eat a burrito while you're"), Document(page_content='there'), 
Document(page_content='## Hiking\nGo to Yosemite')]
"""

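Splitting Python Code

Python code has its own syntactic structure, such as classes, functions, and indented blocks, which a generic text splitter cannot handle well. LangChain therefore provides PythonCodeTextSplitter specifically for splitting Python code.

The core idea of PythonCodeTextSplitter is to split on classes first, then on functions and indented functions, and finally on lines and words. This layered approach helps keep each chunk semantically complete. The supported separators are:

\nclass: splits classes
\ndef: splits functions
\n\tdef: splits indented functions
\n\n: splits paragraphs
\n: splits lines
" ": splits words
"": splits characters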
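The first two separators can be illustrated with a short standard-library sketch that cuts source code before each top-level class or def. The helper split_top_level and the sample code are assumptions for illustration; the real splitter additionally merges pieces up to chunk_size and falls back to the finer separators.

```python
import re


def split_top_level(source: str) -> list[str]:
    """Cut Python source before each top-level `class` or `def` statement."""
    # A newline followed directly by "class " or "def " marks a top-level definition;
    # indented methods do not match because of their leading spaces.
    blocks = re.split(r"\n(?=class |def )", source)
    return [b.strip("\n") for b in blocks if b.strip()]


code = (
    "import math\n"
    "\n"
    "class Circle:\n"
    "    def area(self, r):\n"
    "        return math.pi * r * r\n"
    "\n"
    "def double(x):\n"
    "    return 2 * x\n"
)
blocks = split_top_level(code)
for block in blocks:
    print(block)
    print("-----")
```

The method area stays inside the Circle block, so the class remains one unit.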

Code Implementation

python

from langchain.text_splitter import PythonCodeTextSplitter

python_text = """
class Person:
  def __init__(self, name, age):
    self.name = name
    self.age = age
    
p1 = Person("John", 36)

for i in range(10):
    print (i)
"""

python_splitter = PythonCodeTextSplitter(chunk_size=100, chunk_overlap=0)

python_splitter.create_documents([python_text])
"""
[Document(page_content='class Person:\n def __init__(self, name, age):\n self.name = name\n self.age = age'), 
Document(page_content='p1 = Person("John", 36)\n\nfor i in range(10):\n print (i)')]
"""

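In the result, the first chunk contains the complete Person class with its initializer, which is a semantically complete unit. The second chunk contains the remaining code: the object instantiation and the loop.

Splitting JavaScript Code

Splitting JavaScript is similar to splitting Python, just with a different set of separators.

\nfunction: splits function declarations
\nconst: splits constant declarations
\nlet: splits block-scoped variable declarations
\nvar: splits variable declarations
\nclass: splits class definitions
\nif: splits if statements
\nfor: splits for loops
\nwhile: splits while loops
\nswitch: splits switch statements
\ncase: splits case clauses
\ndefault: splits default clauses
\n\n: splits paragraphs
\n: splits lines
" ": splits words
"": splits characters

Most of these separators correspond to the common kinds of code blocks in JavaScript.

python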

from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

javascript_text = """
// Function is called, the return value will end up in x
let x = myFunction(4, 3);

function myFunction(a, b) {
// Function returns the product of a and b
  return a * b;
}
"""

js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=65, chunk_overlap=0
)

js_splitter.create_documents([javascript_text])
"""
[Document(page_content='// Function is called, the return value will end up in x'), Document(page_content='let x = myFunction(4, 3);'), Document(page_content='function myFunction(a, b) {'), Document(page_content='// Function returns the product of a and b\n return a * b;\n}')]
"""

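Unlike the earlier examples, this one uses a more general way of defining a splitter: passing the language argument to from_language.

Splitting PDF Documents

PDF is a common data format that often contains complex content such as images and tables. It therefore needs a targeted splitting strategy so that each chunk stays semantically complete.

Handling Tables

Tables in a PDF carry important information and need special handling. The first step is to extract the tables from the PDF.

Unstructured is an open-source Python library for extracting and processing content from unstructured documents (PDF, Word, HTML, images, and so on).

python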

import os
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import elements_to_json

filename = "static/SalesforceFinancial.pdf"

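# Extract the elements from the PDF
elements = partition_pdf(
    filename=filename,
    # Unstructured Helpers
    # Use the high-resolution strategy, which identifies document structure more accurately
    strategy="hi_res",
    # Automatically infer the structure of tables in the PDF
    infer_table_structure=True,
    # Use the YOLOX model for element detection
    model_name="yolox"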
)
"""
[<unstructured.documents.elements.NarrativeText at 0x2bdecbfd0>, <unstructured.documents.elements.NarrativeText at 0x2ad356cd0>, <unstructured.documents.elements.NarrativeText at 0x2ba4e0cd0>, <unstructured.documents.elements.NarrativeText at 0x2af4148d0>, <unstructured.documents.elements.NarrativeText at 0x2ba846c90>, <unstructured.documents.elements.NarrativeText at 0x2ba8450d0>, <unstructured.documents.elements.NarrativeText at 0x2ba844990>, <unstructured.documents.elements.NarrativeText at 0x2ba944490>, <unstructured.documents.elements.Table at 0x2ba947b50>, <unstructured.documents.elements.NarrativeText at 0x2ad3d4f50>, <unstructured.documents.elements.Text at 0x2ba969f50>, <unstructured.documents.elements.Text at 0x177a411d0>]
"""

The parse yields some text elements and a table element. Here we focus on the parsed table.

python

elements[-4].metadata.text_as_html

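Its HTML is shown in the figure below.

After parsing, the table is passed to the LLM as HTML. A rendered table is easy for people to read, but the opposite holds for an LLM: it works better with formats that appear heavily in its pre-training corpus, such as Markdown and HTML.

Handling Images

Images are another complex element in PDFs, and handling them is more involved.

Recall that in the basic RAG flow, the query and the document chunks are embedded with the same embedding model before semantic search. Images and text, however, are usually embedded by different models, so their vectors live in different spaces (often with different dimensions) and cannot be compared directly.

So how can images be turned into something that semantic search can use? Some options:

Use a multimodal model to generate a summary of the image, then embed the summary.
Get an embedding of the image itself (for example, use CLIP to embed both text and images).
Use the image directly.

The following code uses a multimodal model to summarize an image and feeds the summary into the semantic search flow. First, load the PDF and parse its elements as before; the code below extracts tables, images, and other elements from the PDF.

python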

#!pip3 install "unstructured[all-docs]"

from typing import Any

from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

# Placeholder (not in the original): set this to the PDF you want to parse
filepath = "path/to/document.pdf"

# Get elements
raw_pdf_elements = partition_pdf(
    # Path of the PDF file to parse
    filename=filepath,

    # Extract images embedded in the PDF; they are saved to the directory given below
    extract_images_in_pdf=True,

    # Use a layout model (such as YOLOX) to detect tables and extract their structure and content.
    # The model also tries to detect titles (subsections), which later serve as chunk boundaries.
    infer_table_structure=True,

    # Chunk by title: each subsection (title) in the PDF is treated as its own chunk
    chunking_strategy="by_title",
    # Hard maximum of 4000 characters per chunk
    max_characters=4000,
    # Try to start a new chunk once the current one reaches about 3800 characters
    new_after_n_chars=3800,
    # Try to merge chunks shorter than 2000 characters into neighbours, to avoid tiny chunks
    combine_text_under_n_chars=2000,
    # Directory where extracted images are saved
    image_output_dir_path="static/pdfImages/",
)

Next, generate a summary for this image.

python

from langchain.chat_models import ChatOpenAI
from langchain.schema.messages import HumanMessage
import os
from dotenv import load_dotenv
from PIL import Image
import base64
import io

load_dotenv()

llm = ChatOpenAI(model="gpt-4-vision-preview")

# Function to convert image to base64
def image_to_base64(image_path):
    with Image.open(image_path) as image:
        buffered = io.BytesIO()
        image.save(buffered, format=image.format)
        img_str = base64.b64encode(buffered.getvalue())
        return img_str.decode('utf-8')

image_str = image_to_base64("static/pdfImages/figure-15-6.jpg")

chat = ChatOpenAI(model="gpt-4-vision-preview",
                  max_tokens=1024)

msg = chat.invoke(
    [
        HumanMessage(
            content=[
                {"type": "text", "text" : "Please give a summary of the image provided. Be descriptive"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image_str}"
                    },
                },
            ]
        )
    ]
)

msg.content
# 'The image shows a baking tray with pieces of fried chicken arranged to roughly mimic the continents on Earth as seen from space. The largest piece in the center is intended to represent Africa and Eurasia, while smaller pieces are meant to symbolize the Americas, Australia, and possibly Antarctica. There is text above the image which says, "Sometimes I just look at pictures of the earth from space and I marvel at how beautiful it all is." This text is likely meant to be humorous, as it juxtaposes the grandeur of Earth from space with a whimsical arrangement of chicken on a baking sheet, suggesting a playful comparison between the two.'

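The code above does two things:

Encodes the image as base64.
Passes the base64-encoded image to the model (gpt-4-vision-preview here) to generate a summary.

At retrieval time, the summary is used in place of the image itself. The rest of the flow is unchanged.

For Level 3 chunking, the key point is that the data type determines the chunking strategy. The sections above showed strategies for code in different programming languages, PDFs, and other formats. Ultimately, the goal of chunking is to keep similar items together so that the LLM receives context it can use to answer the question.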

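Level 4: Semantic Chunking

How It Works

The first three levels consider only physical position. They rest on the assumption that the content of a paragraph is semantically similar. But what if it is not? If the information in the document is itself disorganized, recursive character splitting has little value.

The core idea of semantic chunking is to split based on the semantic similarity of the text. It works as follows:

Generate an embedding (vector representation) for each sentence of the document to capture its meaning.
Compare the embeddings and chunk according to semantic distance.

Two common ways to compare the embeddings are:

Hierarchical clustering with positional reward. The idea is to cluster the embeddings; the embeddings that land in one cluster form one chunk. Short sentences that follow a long sentence should stay with that long sentence, which is why a positional reward is added.
Finding breakpoints between consecutive sentences. This method looks for breakpoints by comparing the embedding distance of consecutive sentences.
- Compare the embedding distance between sentence 1 and sentence 2.
- Compare the embedding distance between sentence 2 and sentence 3.
- Where the distance between two sentences is large, that is a split point.
- In practice, a sliding window (for example, of size 3) can improve the result:
  - Group sentences 1, 2, 3 and compare them with sentences 4, 5, 6.
  - Next, group sentences 2, 3, 4 and compare them with sentences 5, 6, 7.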
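The breakpoint idea can be checked on toy data before bringing in a real embedding model. The sketch below assumes hypothetical 2-D vectors in place of real sentence embeddings and computes the cosine distance between each consecutive pair; the largest gap marks the most likely split point.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)


# Hypothetical embeddings: sentences 0-1 share one topic, sentences 2-3 another.
toy_embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
]

gaps = [
    cosine_distance(toy_embeddings[i], toy_embeddings[i + 1])
    for i in range(len(toy_embeddings) - 1)
]
largest_gap = max(range(len(gaps)), key=gaps.__getitem__)
print(gaps, largest_gap)
```

Here the gap between sentence 1 and sentence 2 dominates, so the text would be cut after sentence 1.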

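Code Implementation (Finding Breakpoints Between Consecutive Sentences)

Load the document and split it into single sentences on punctuation.

python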

with open('../../data/PGEssays/mit.txt') as file:
    essay = file.read()

import re

# Splitting the essay on '.', '?', and '!'
single_sentences_list = re.split(r'(?<=[.?!])\s+', essay)
print (f"{len(single_sentences_list)} sentences were found")
# 317 sentences were found

The essay consists of 317 sentences. Next, convert each sentence into a dict with a sentence field holding its text and an index field holding its position.

python

sentences = [{'sentence': x, 'index' : i} for i, x in enumerate(single_sentences_list)]
sentences[:3]
"""
[{'sentence': '\n\nWant to start a startup?', 'index': 0},
 {'sentence': 'Get funded by\nY Combinator.', 'index': 1},
 {'sentence': 'October 2006(This essay is derived from a talk at MIT.)\nTill recently graduating seniors had two choices: get a job or go\nto grad school.',
  'index': 2}]
"""

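Group the sentences. In practice, what gets compared is not the embedding of a single sentence. Instead, a window is defined, the sentences inside it are treated as one group, and the distances between the embeddings of adjacent groups are compared. The combine_sentences function below merges several sentences: it takes the list of sentence dicts and adds a combined_sentence field holding all the sentences within one window, which is the text that will be embedded later.

python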

def combine_sentences(sentences, buffer_size=1):
    # Go through each sentence dict
    for i in range(len(sentences)):

        # Create a string that will hold the sentences which are joined
        combined_sentence = ''

        # Add sentences before the current one, based on the buffer size.
        for j in range(i - buffer_size, i):
            # Check if the index j is not negative (to avoid index out of range like on the first one)
            if j >= 0:
                # Add the sentence at index j to the combined_sentence string
                combined_sentence += sentences[j]['sentence'] + ' '

        # Add the current sentence
        combined_sentence += sentences[i]['sentence']

        # Add sentences after the current one, based on the buffer size
        for j in range(i + 1, i + 1 + buffer_size):
            # Check if the index j is within the range of the sentences list
            if j < len(sentences):
                # Add the sentence at index j to the combined_sentence string
                combined_sentence += ' ' + sentences[j]['sentence']

        # Then add the whole thing to your dict
        # Store the combined sentence in the current sentence dict
        sentences[i]['combined_sentence'] = combined_sentence

    return sentences

sentences = combine_sentences(sentences)
sentences[:3]
"""
[{'sentence': '\n\nWant to start a startup?', 'index': 0, 'combined_sentence': '\n\nWant to start a startup? Get funded by\nY Combinator.'}, {'sentence': 'Get funded by\nY Combinator.', 'index': 1, 'combined_sentence': '\n\nWant to start a startup? Get funded by\nY Combinator. October 2006(This essay is derived from a talk at MIT.)\nTill recently graduating seniors had two choices: get a job or go\nto grad school.'}, {'sentence': 'October 2006(This essay is derived from a talk at MIT.)\nTill recently graduating seniors had two choices: get a job or go\nto grad school.', 'index': 2, 'combined_sentence': 'Get funded by\nY Combinator. October 2006(This essay is derived from a talk at MIT.)\nTill recently graduating seniors had two choices: get a job or go\nto grad school. I think there will increasingly be a third option:\nto start your own startup.'}]
"""

Get an embedding for each group. First bring in OpenAI's embedding model, compute an embedding for each combined sentence, and store it under the combined_sentence_embedding key of the dict.

python

from langchain.embeddings import OpenAIEmbeddings
oaiembeds = OpenAIEmbeddings()

embeddings = oaiembeds.embed_documents([x['combined_sentence'] for x in sentences])

for i, sentence in enumerate(sentences):
    sentence['combined_sentence_embedding'] = embeddings[i]

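Compute the distance between embeddings. Here we use the cosine similarity between embeddings as the similarity measure, and 1 - cosine similarity (the cosine distance) as the distance between sentences.

python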

from sklearn.metrics.pairwise import cosine_similarity

def calculate_cosine_distances(sentences):
    distances = []
    for i in range(len(sentences) - 1):
        embedding_current = sentences[i]['combined_sentence_embedding']
        embedding_next = sentences[i + 1]['combined_sentence_embedding']
        
        # Calculate cosine similarity
        similarity = cosine_similarity([embedding_current], [embedding_next])[0][0]
        
        # Convert to cosine distance
        distance = 1 - similarity

        # Append cosine distance to the list
        distances.append(distance)

        # Store distance in the dictionary
        sentences[i]['distance_to_next'] = distance

    # Optionally handle the last sentence
    # sentences[-1]['distance_to_next'] = None  # or a default value

    return distances, sentences
    
distances, sentences = calculate_cosine_distances(sentences)

distances[:3]
# [0.08081114249044896, 0.02726339916925502, 0.04722227403602797]

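Determine the breakpoints. Raw distance values are not very intuitive, so we plot them. On close inspection, the outliers in the plot are likely breakpoints. There are many ways to pick breakpoints; here we use the 95th percentile of the distances as the threshold. The final split is shown in the figure below: the red line is the distance threshold, points above it are marked as breakpoints, and each color is one chunk. The essay is split into 17 chunks in total.

python

import numpy as np

# Use the 95th percentile of the distances as the threshold
breakpoint_percentile_threshold = 95
# Compute the threshold
breakpoint_distance_threshold = np.percentile(distances, breakpoint_percentile_threshold) # If you want more chunks, lower the percentile cutoff

# Find the breakpoints from the threshold; each one is the end index of a chunk
indices_above_thresh = [i for i, x in enumerate(distances) if x > breakpoint_distance_threshold] # The indices of those breakpoints on your list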


# Initialize the start index
start_index = 0

# Create a list to hold the grouped sentences
chunks = []

# Iterate through the breakpoints to slice the sentences
for index in indices_above_thresh:
    # The end index is the current breakpoint
    end_index = index

    # Slice the sentence_dicts from the current start index to the end index
    group = sentences[start_index:end_index + 1]
    combined_text = ' '.join([d['sentence'] for d in group])
    chunks.append(combined_text)
    
    # Update the start index for the next group
    start_index = index + 1

# The last group, if any sentences remain
if start_index < len(sentences):
    combined_text = ' '.join([d['sentence'] for d in sentences[start_index:]])
    chunks.append(combined_text)

# chunks now contains the chunked text

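It should be stressed that this is still an experimental method and should be evaluated in a real setting before use. Unlike the earlier methods, it does not require chunk_size or chunk_overlap, although oversized chunks can still be split recursively if needed.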
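The percentile-based breakpoint rule used in the code above is easy to check by hand on a toy example. The sketch below is standard-library only; the helper names, the toy distances, and the 70th-percentile cutoff are assumptions chosen so the numbers are easy to follow. The percentile helper uses linear interpolation between neighbouring ranks, which is also the default behaviour of np.percentile.

```python
def percentile(values, q):
    """q-th percentile with linear interpolation between neighbouring ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def group_by_breakpoints(items, distances, q):
    """Start a new group after every position whose distance exceeds the q-th percentile."""
    threshold = percentile(distances, q)
    groups, start = [], 0
    for i, distance in enumerate(distances):
        if distance > threshold:
            groups.append(items[start:i + 1])  # item i closes the current group
            start = i + 1
    if start < len(items):
        groups.append(items[start:])  # whatever remains after the last breakpoint
    return groups


toy_sentences = ["s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7"]
# distances[i] is the distance between sentence i and sentence i + 1
toy_distances = [0.05, 0.04, 0.40, 0.03, 0.06, 0.35, 0.02]
groups = group_by_breakpoints(toy_sentences, toy_distances, q=70)
print(groups)
# [['s0', 's1', 's2'], ['s3', 's4', 's5'], ['s6', 's7']]
```

Lowering q lowers the threshold and produces more, smaller chunks; raising it does the opposite.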

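Level 5: Agentic Chunking

The last method amounts to building an agent whose sole job is chunking.

How Humans Chunk

First, consider how a person breaks a long document into smaller, more digestible parts:

Prepare tools:
Usually scratch paper or a notepad.
Start at the top of the document:
Begin at the top and treat the first part as a chunk of its own (no chunks exist yet).
Evaluate sentence by sentence:
Read down the document sentence by sentence or paragraph by paragraph, judging whether the current content belongs to the current chunk. If the meaning or topic shifts, start a new chunk.
Repeat:
Repeat these steps until the end of the document, leaving it divided into semantically related chunks.

How It Works

How can this process be implemented? One way is to build an agent and let the LLM make the decision: does the current content belong to the current chunk?

This raises another question: how should the different parts of the document be presented to the LLM? This paper introduces the concept of a "proposition": several propositions are extracted from each sentence and passed to the model as context.

A proposition is defined as an atomic expression in the text (one that cannot be decomposed further). Each proposition encapsulates one distinct fact and is presented in a concise, self-contained natural-language form.

For example, from "Greg went to the park. He likes walking" we can extract two propositions: ['Greg went to the park.', 'Greg likes walking']. If the text were simply split on punctuation and "He likes walking" were fed to the model on its own, there would be no way to know who "He" refers to.

Code Implementation

Get propositions from the document. After loading the essay, split it into paragraphs as a first pass, call the LLM to extract propositions from each paragraph, and parse the result into a list of strings.

python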

import os

from langchain.output_parsers.openai_tools import JsonOutputToolsParser
from langchain_community.chat_models import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda
from langchain.chains import create_extraction_chain
from typing import Optional, List
from langchain.chains import create_extraction_chain_pydantic
from langchain_core.pydantic_v1 import BaseModel
from langchain import hub

obj = hub.pull("wfh/proposal-indexing")
llm = ChatOpenAI(model='gpt-4-1106-preview', openai_api_key = os.getenv("OPENAI_API_KEY", 'YouKey'))

# use it in a runnable
runnable = obj | llm

# Pydantic data class
class Sentences(BaseModel):
    sentences: List[str]
    
# Extraction
extraction_chain = create_extraction_chain_pydantic(pydantic_schema=Sentences, llm=llm)

def get_propositions(text):
    runnable_output = runnable.invoke({
    	"input": text
    }).content
    
    propositions = extraction_chain.run(runnable_output)[0].sentences
    return propositions
    
    
with open('../../data/PGEssays/superlinear.txt') as file:
    essay = file.read()
    
paragraphs = essay.split("\n\n")
len(paragraphs)

# Generate propositions for only 5 paragraphs, as an example
essay_propositions = []

for i, para in enumerate(paragraphs[:5]):
    propositions = get_propositions(para)
    
    essay_propositions.extend(propositions)
    print (f"Done with {i}")

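Call the LLM to decide whether the current proposition belongs to an existing chunk. This is handled mainly by AgenticChunker; the full code is here, and a diagram later on shows the main idea behind it.

python

from agentic_chunker import AgenticChunker
ac = AgenticChunker()

ac.add_propositions(essay_propositions)
"""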
Adding: 'The month is October.'
No chunks, creating a new one
Created new chunk (fc52f): Date & Times

Adding: 'The year is 2023.'
Chunk Found (fc52f), adding to: Date & Times

Adding: 'I did not understand the degree to which the returns for performance are superlinear when I was a child.'
No chunks found
Created new chunk (a4a7e): Effort-Reward Relationship

Adding: 'The returns for performance are superlinear.'
Chunk Found (a4a7e), adding to: Effort-Reward Relationship

Adding: 'Understanding the degree to which the returns for performance are superlinear is one of the most important things.'
Chunk Found (a4a7e), adding to: Superlinear Returns in Performance

Adding: 'Teachers and coaches implicitly told us the returns were linear.'
No chunks found
Created new chunk (38e4a): Education & Coaching Returns

Adding: 'Teachers and coaches meant well.'
No chunks found
Created new chunk (0402d): Educational Approaches

Adding: 'The phrase 'You get out what you put in' was heard a thousand times.'
Chunk Found (38e4a), adding to: Education & Coaching Returns
...
Chunk Found (a4a7e), adding to: Superlinear Returns & Societal Impact

Adding: 'Understanding the concept of superlinear returns will be the wave that ambitious individuals surf on.'
Chunk Found (a4a7e), adding to: Superlinear Returns in Various Domains
"""

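The printed log shows the process:

It receives the proposition 'The month is October.'. No chunk exists yet, so it creates a new chunk, fc52f. When a chunk is created, an LLM writes a summary and a title based on all the propositions the chunk contains.
Propositions are then added one at a time. For 'The year is 2023.', an existing chunk matches its topic, so it is assigned to fc52f.
For the next proposition, no existing chunk matches its topic, so a new chunk is created to hold it. This continues until all propositions have been processed.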

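Inside `AgenticChunker`

Internally, AgenticChunker mainly stores the chunks built so far together with their propositions. Its basic structure is shown in the figure: each chunk has a unique ID, all the propositions it contains, a title, a summary, and a position index.

Each time a new proposition arrives, AgenticChunker processes it as follows:

If there are no chunks yet, create a new chunk and store the proposition in it.
Otherwise, look for a chunk related to the proposition. If one is found, add the proposition to it; if not, create a new chunk.
Every time a proposition is added and its chunk is settled, update the summary or title from the information at hand (the proposition list, the current summary, and the current title).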
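The control flow of these steps can be sketched without any LLM. Everything below is hypothetical and is not the real AgenticChunker code: the class ToyAgenticChunker, the keyword_match rule that stands in for the LLM's "which chunk does this belong to?" decision, and the title function that stands in for the LLM-written title and summary.

```python
from typing import Callable, Optional


class ToyAgenticChunker:
    """Keeps chunks as {chunk_id: {"title": str, "propositions": [str]}}."""

    def __init__(self,
                 find_chunk: Callable[[dict, str], Optional[str]],
                 make_title: Callable[[list], str]):
        self.chunks = {}
        self.find_chunk = find_chunk  # stand-in for the LLM's chunk-selection call
        self.make_title = make_title  # stand-in for the LLM's title/summary update

    def add_proposition(self, proposition: str) -> str:
        # Step 1 and 2: no chunks yet, or no related chunk found -> create a new one.
        chunk_id = self.find_chunk(self.chunks, proposition) if self.chunks else None
        if chunk_id is None:
            chunk_id = f"chunk-{len(self.chunks)}"
            self.chunks[chunk_id] = {"title": "", "propositions": []}
        chunk = self.chunks[chunk_id]
        chunk["propositions"].append(proposition)
        # Step 3: refresh the title from everything the chunk now contains.
        chunk["title"] = self.make_title(chunk["propositions"])
        return chunk_id


def keyword_match(chunks, proposition):
    """Hypothetical rule: reuse a chunk that shares a long word with the proposition."""
    words = {w.strip(".,").lower() for w in proposition.split() if len(w) > 5}
    for chunk_id, chunk in chunks.items():
        seen = {w.strip(".,").lower() for p in chunk["propositions"] for w in p.split()}
        if words & seen:
            return chunk_id
    return None


chunker = ToyAgenticChunker(keyword_match, make_title=lambda props: props[0][:30])
ids = [chunker.add_proposition(p) for p in [
    "The month is October.",
    "The returns for performance are superlinear.",
    "Teachers told us the returns were linear.",
    "Superlinear returns appear in business.",
]]
print(ids)
```

In a real setup, find_chunk and make_title would each be an LLM call; keeping them as injected callables makes the bookkeeping easy to test separately from the model.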

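Summary

This article looked at optimizing the chunking stage of index building and introduced five levels of chunking:

Character splitting: splits on a single given separator; with an empty separator it cuts strictly by character count. Its drawback is that it ignores the structure of the text entirely.
Recursive character splitting: takes several separators and splits in priority order, recursing so that the resulting chunks come as close to chunk_size as possible.
Document-specific splitting: builds on recursive splitting and provides different strategies for different data formats. The focus here was PDF, which can contain text, images, tables, and more; choose the chunking approach according to your needs.
Semantic chunking: the first three levels all work on physical structure, splitting on separators, headings, and other common structural features. Semantic chunking instead splits according to whether sentences are semantically similar (here implemented with embeddings and the distances between them). It gives better results but is slower and more expensive.
Agentic chunking: uses an LLM to mimic how a person chunks. The document is distilled into propositions, each an atomic statement that cannot be split further, and the LLM decides which chunk each proposition belongs to.

Closing Notes

This article turned out longer than I expected. It focused on optimization at the chunking stage; the next one will cover the remaining stages of indexing. Stay tuned.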
