一、引言:为什么需要文本分块?
在自然语言处理和信息检索中,处理长文档时面临以下挑战:
- 上下文窗口限制:语言模型(如BERT、GPT)有固定的输入长度限制
- 计算效率:输入过长会使计算开销急剧增加(例如自注意力机制的计算量随序列长度呈平方级增长)
- 信息精度:过长文本可能导致关键信息被稀释
- 存储与检索优化:嵌入模型输出固定维度的向量,块的粒度直接影响向量库的存储规模和检索精度
分块的核心目标:在保持语义连贯性的前提下,将长文本分解为可管理的片段。
二、固定大小分块(Fixed-size Chunking)
基本概念
按照预定义的固定字符数、单词数或标记数分割文本。
实现方式
python
# 字符级固定分块示例
def fixed_size_chunking(text, chunk_size=500, overlap=50):
    """按固定字符数分块,相邻块之间保留 overlap 个字符的重叠。"""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):  # 已到文本末尾,避免再产生多余的尾部小块
            break
        start = end - overlap  # 回退 overlap 个字符,形成重叠部分
    return chunks

if __name__ == "__main__":
    sample_text = "这是一个用于测试固定大小分块的示例文本。" * 100  # 示例文本
    chunks = fixed_size_chunking(sample_text, chunk_size=100, overlap=20)
    for i, chunk in enumerate(chunks):
        print(f"Chunk {i+1}:\n{chunk}\n")
参数配置
- chunk_size:典型值为 256、512、1024 个标记
- overlap:重叠比例(通常为 10%~20%),防止边界信息丢失;按标记数分块的示意见下方代码
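上面的示例按字符数分块;若希望直接按标记数控制块大小,可参考下面的示意(假设使用 tiktoken 的 cl100k_base 编码,需要 pip install tiktoken,实际应换成所用模型对应的分词器):
python
# 按标记(token)数的固定分块示意,假设使用 tiktoken
import tiktoken

def token_fixed_chunking(text, chunk_size=512, overlap=64, encoding_name="cl100k_base"):
    enc = tiktoken.get_encoding(encoding_name)
    token_ids = enc.encode(text)  # 文本 -> 标记 id 序列
    chunks = []
    start = 0
    while start < len(token_ids):
        end = start + chunk_size
        chunks.append(enc.decode(token_ids[start:end]))  # 标记 id -> 文本
        if end >= len(token_ids):
            break
        start = end - overlap  # 相邻块之间保留 overlap 个标记的重叠
    return chunks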
优缺点分析
优点:
- 实现简单,计算高效
- 易于批处理和并行化
- 适合结构简单的文本
缺点:
- 可能切断完整句子或段落
- 忽略语义边界
- 对结构复杂文档不友好
适用场景
- 预处理结构简单的文档
- 需要快速原型开发
- 文档结构不重要的情况
三、滑动窗口分块(Sliding Window Chunking)
基本概念
使用固定大小的窗口在文本上滑动,每次移动固定步长。
实现方式
python
def sliding_window_chunking(text, window_size=400, step=200):
    """按单词滑动窗口分块:窗口大小 window_size,步长 step(重叠 = window_size - step)。"""
    tokens = text.split()  # 简单按空格分词,也可替换为更复杂的标记化
    chunks = []
    for i in range(0, len(tokens), step):
        chunk = " ".join(tokens[i:i + window_size])
        chunks.append(chunk)
    return chunks

if __name__ == "__main__":
    sample_text = "This is a sample text to demonstrate the sliding window chunking method. " * 50
    chunks = sliding_window_chunking(sample_text, window_size=20, step=10)
    for i, chunk in enumerate(chunks):
        print(f"Chunk {i+1}:\n{chunk}\n")
窗口与步长策略
- 窗口大小:决定每个块的信息量
- 步长:决定块之间的重叠程度
- 平衡策略:大窗口+大步长(低重叠) vs 小窗口+小步长(高重叠),重叠率换算见下方示例
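窗口大小与步长的关系可以用重叠率量化:重叠标记数 = window_size − step,重叠率 = 重叠标记数 / window_size。下面的小例子仅作换算示意:
python
# 重叠率换算:overlap_ratio = (window_size - step) / window_size
for window_size, step in [(400, 200), (400, 350)]:
    overlap_tokens = window_size - step
    overlap_ratio = overlap_tokens / window_size
    print(f"window={window_size}, step={step} -> 重叠 {overlap_tokens} 个标记,重叠率 {overlap_ratio:.1%}")
# 输出:
# window=400, step=200 -> 重叠 200 个标记,重叠率 50.0%
# window=400, step=350 -> 重叠 50 个标记,重叠率 12.5%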
变体:加权滑动窗口
- 给窗口中心位置更高权重
- 边缘位置权重逐渐降低
- 适用于注意力机制模型(示意代码见下)
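加权滑动窗口在上文只是思路性描述,下面给出一个最小示意(三角形权重函数为本文的假设,实际可按任务调整),为窗口内每个标记生成中心高、边缘低的位置权重:
python
import numpy as np

def weighted_sliding_window(tokens, window_size=400, step=200):
    """返回 (窗口标记列表, 位置权重数组) 的列表;权重采用假设的三角形衰减。"""
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + window_size]
        n = len(window)
        center = (n - 1) / 2
        # 中心位置权重约为 1,向两侧线性衰减到约 0.1
        weights = 1.0 - 0.9 * np.abs(np.arange(n) - center) / max(center, 1)
        chunks.append((window, weights))
    return chunks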
优缺点
优点:
- 确保上下文连续性
- 减少边界信息损失
- 适合序列建模任务
缺点:
- 计算开销较大
- 可能产生大量重叠内容
- 存储效率不高
适用场景
- 需要保持局部上下文的任务
- 序列标注和命名实体识别
- 时间序列数据分析
四、基于语义的分块(Semantic Chunking)
核心思想
根据文本的语义边界而非固定长度进行分割。
实现策略
1. 句子边界检测
python
import spacy
nlp = spacy.load("en_core_web_sm")
def semantic_chunk_by_sentence(text, max_chunk_length=500):
    doc = nlp(text)
    chunks = []
    current_chunk = []
    current_length = 0
    for sent in doc.sents:
        sent_length = len(sent.text)
        if current_length + sent_length > max_chunk_length and current_chunk:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sent.text]
            current_length = sent_length
        else:
            current_chunk.append(sent.text)
            current_length += sent_length
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
if __name__ == "__main__":
sample_text = ("Natural language processing (NLP) is a subfield of artificial intelligence "
"concerned with the interactions between computers and human (natural) languages. "
"It focuses on enabling computers to understand, interpret, and generate human language "
"in a way that is valuable. Applications of NLP include machine translation, sentiment analysis, "
"and chatbots.")
chunks = semantic_chunk_by_sentence(sample_text, max_chunk_length=100)
for i, chunk in enumerate(chunks):
print(f"Chunk {i+1}:\n{chunk}\n")
运行结果:
bash
Chunk 1:
Natural language processing (NLP) is a subfield of artificial intelligence concerned with the interactions between computers and human (natural) languages.
Chunk 2:
It focuses on enabling computers to understand, interpret, and generate human language in a way that is valuable.
Chunk 3:
Applications of NLP include machine translation, sentiment analysis, and chatbots.
2. 语义相似度聚类
- 使用句子嵌入计算相似度
- 聚类算法(如k-means)分组
- 动态确定分块边界
3. 主题建模方法
- 使用LDA、BERTopic等识别主题变化
- 在主题转换处分割
安装必要的库:
bash
# 基础依赖
pip install numpy scikit-learn
# 句子分割
pip install spacy
python -m spacy download en_core_web_sm
pip install nltk
# 语义嵌入
pip install sentence-transformers
# 主题建模
pip install bertopic
pip install lda # 可选,用于LDA
# 可视化(可选)
pip install matplotlib seaborn
代码实现:
python
# 安装必要的库
# pip install spacy sentence-transformers scikit-learn bertopic umap-learn hdbscan nltk
# python -m spacy download en_core_web_sm
# python -m nltk.downloader punkt
import numpy as np
import re
from typing import List, Tuple, Dict, Any
import warnings
warnings.filterwarnings('ignore')
class TextPreprocessor:
"""文本预处理工具"""
@staticmethod
def clean_text(text: str) -> str:
"""清理文本"""
# 移除多余空格和换行
text = re.sub(r'\s+', ' ', text)
# 移除特殊字符(保留标点)
text = re.sub(r'[^\w\s.,!?;:\-\'\"]', '', text)
return text.strip()
@staticmethod
def split_into_sentences(text: str) -> List[str]:
"""将文本分割成句子(基础方法)"""
# 使用正则表达式分句
sentences = re.split(r'(?<=[.!?])\s+', text)
return [s.strip() for s in sentences if s.strip()]
@staticmethod
def calculate_optimal_chunks(text_length: int,
target_chunk_size: int = 500,
min_chunk_size: int = 100,
max_chunk_size: int = 1000) -> int:
"""计算最优分块数量"""
if text_length <= target_chunk_size:
return 1
chunks = text_length // target_chunk_size
remainder = text_length % target_chunk_size
if remainder > min_chunk_size:
chunks += 1
elif chunks > 1 and remainder < min_chunk_size:
# 重新分配
avg_size = text_length / (chunks - 1)
if avg_size <= max_chunk_size:
chunks -= 1
return max(1, chunks)
# 策略1:句子边界检测分块
class SentenceBoundaryChunker:
"""基于句子边界的分块策略"""
def __init__(self, model_type: str = 'spacy'):
"""
初始化分块器
Args:
model_type: 使用的模型类型 ('spacy', 'nltk', 'regex')
"""
self.model_type = model_type
self.nlp = None
if model_type == 'spacy':
try:
import spacy
self.nlp = spacy.load("en_core_web_sm")
print("✓ 使用spaCy模型")
except Exception as e:
print(f"spaCy加载失败: {e}")
print("降级到nltk...")
self.model_type = 'nltk'
if model_type == 'nltk':
try:
import nltk
try:
nltk.data.find('tokenizers/punkt')
except LookupError:
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
self.sent_tokenize = sent_tokenize
print("✓ 使用nltk模型")
except Exception as e:
print(f"nltk加载失败: {e}")
print("使用正则表达式分句...")
self.model_type = 'regex'
def split_sentences(self, text: str) -> List[str]:
"""分割句子"""
text = TextPreprocessor.clean_text(text)
if self.model_type == 'spacy' and self.nlp:
doc = self.nlp(text)
return [sent.text.strip() for sent in doc.sents]
elif self.model_type == 'nltk':
return self.sent_tokenize(text)
else:
return TextPreprocessor.split_into_sentences(text)
def chunk_by_sentence(self,
text: str,
max_chunk_length: int = 500,
overlap_sentences: int = 1) -> List[str]:
"""
基于句子边界分块
Args:
text: 输入文本
max_chunk_length: 最大块长度(字符数)
overlap_sentences: 重叠的句子数
Returns:
分块列表
"""
sentences = self.split_sentences(text)
chunks = []
current_chunk = []
current_length = 0
for i, sent in enumerate(sentences):
sent_length = len(sent)
# 如果添加当前句子会超过长度限制,且当前块不为空
if current_length + sent_length > max_chunk_length and current_chunk:
# 保存当前块
chunks.append(" ".join(current_chunk))
# 创建新块,包含重叠的句子
if overlap_sentences > 0:
# 取前一个块的最后几个句子作为重叠
overlap = current_chunk[-overlap_sentences:] if len(current_chunk) > overlap_sentences else current_chunk
current_chunk = overlap + [sent]
current_length = sum(len(s) for s in overlap) + sent_length
else:
current_chunk = [sent]
current_length = sent_length
else:
current_chunk.append(sent)
current_length += sent_length
# 添加最后一个块
if current_chunk:
chunks.append(" ".join(current_chunk))
return chunks
def chunk_with_semantic_awareness(self,
text: str,
max_chunk_length: int = 500,
semantic_threshold: float = 0.3) -> List[str]:
"""
增强的语义感知分块(结合语义相似度)
Args:
text: 输入文本
max_chunk_length: 最大块长度
semantic_threshold: 语义相似度阈值
Returns:
分块列表
"""
sentences = self.split_sentences(text)
if len(sentences) <= 1:
return [text]
# 计算句子间相似度(简单版本)
from sentence_transformers import SentenceTransformer
try:
model = SentenceTransformer('all-MiniLM-L6-v2')
sentence_embeddings = model.encode(sentences)
# 计算余弦相似度
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(sentence_embeddings)
chunks = []
current_chunk = []
current_length = 0
for i, sent in enumerate(sentences):
sent_length = len(sent)
# 检查语义边界
is_semantic_boundary = False
if i > 0 and i < len(sentences) - 1:
# 如果与前一句的相似度低于阈值,可能是语义边界
prev_similarity = similarity_matrix[i, i-1]
next_similarity = similarity_matrix[i, i+1] if i+1 < len(sentences) else 1.0
if prev_similarity < semantic_threshold and next_similarity > semantic_threshold:
# 与后一句更相似,可能是新主题开始
is_semantic_boundary = True
# 检查长度边界
is_length_boundary = current_length + sent_length > max_chunk_length and current_chunk
if (is_semantic_boundary or is_length_boundary) and current_chunk:
chunks.append(" ".join(current_chunk))
current_chunk = [sent]
current_length = sent_length
else:
current_chunk.append(sent)
current_length += sent_length
if current_chunk:
chunks.append(" ".join(current_chunk))
return chunks
except ImportError:
print("sentence-transformers未安装,使用基础分句方法")
return self.chunk_by_sentence(text, max_chunk_length)
# 策略2:语义相似度聚类分块
class SemanticClusteringChunker:
"""基于语义相似度聚类的分块策略"""
def __init__(self, embedding_model: str = 'all-MiniLM-L6-v2'):
"""
初始化聚类分块器
Args:
embedding_model: 句子嵌入模型名称
"""
self.embedding_model_name = embedding_model
self.model = None
def load_embedding_model(self):
"""加载嵌入模型"""
if self.model is None:
try:
from sentence_transformers import SentenceTransformer
self.model = SentenceTransformer(self.embedding_model_name)
print(f"✓ 加载嵌入模型: {self.embedding_model_name}")
except ImportError:
raise ImportError("请安装sentence-transformers: pip install sentence-transformers")
def embed_sentences(self, sentences: List[str]) -> np.ndarray:
"""将句子转换为嵌入向量"""
self.load_embedding_model()
return self.model.encode(sentences)
def cluster_sentences(self,
embeddings: np.ndarray,
n_clusters: int = None,
method: str = 'kmeans') -> np.ndarray:
"""
对句子嵌入进行聚类
Args:
embeddings: 句子嵌入向量
n_clusters: 聚类数量(None表示自动确定)
method: 聚类方法 ('kmeans', 'hdbscan', 'agglomerative')
Returns:
每个句子的聚类标签
"""
if method == 'kmeans':
from sklearn.cluster import KMeans
if n_clusters is None:
# 使用肘部法则确定最佳聚类数
inertias = []
max_clusters = min(10, len(embeddings))
for k in range(1, max_clusters + 1):
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(embeddings)
inertias.append(kmeans.inertia_)
# 简单的肘部法则实现
diffs = np.diff(inertias)
diff_ratios = diffs[1:] / diffs[:-1]
n_clusters = np.argmax(diff_ratios < 0.5) + 2 if len(diff_ratios) > 0 else 2
n_clusters = min(max_clusters, max(2, n_clusters))
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
return kmeans.fit_predict(embeddings)
elif method == 'hdbscan':
try:
import hdbscan
clusterer = hdbscan.HDBSCAN(min_cluster_size=2,
metric='euclidean',
cluster_selection_epsilon=0.5)
return clusterer.fit_predict(embeddings)
except ImportError:
print("hdbscan未安装,使用kmeans")
return self.cluster_sentences(embeddings, n_clusters, 'kmeans')
elif method == 'agglomerative':
from sklearn.cluster import AgglomerativeClustering
if n_clusters is None:
n_clusters = min(5, len(embeddings) // 3)
clustering = AgglomerativeClustering(n_clusters=n_clusters)
return clustering.fit_predict(embeddings)
else:
raise ValueError(f"不支持的聚类方法: {method}")
def chunk_by_clustering(self,
text: str,
sentence_chunker: SentenceBoundaryChunker = None,
n_clusters: int = None,
clustering_method: str = 'kmeans',
merge_small_clusters: bool = True,
min_cluster_size: int = 2) -> List[str]:
"""
基于聚类结果分块
Args:
text: 输入文本
sentence_chunker: 句子分割器
n_clusters: 聚类数量
clustering_method: 聚类方法
merge_small_clusters: 是否合并小聚类
min_cluster_size: 最小聚类大小
Returns:
分块列表
"""
if sentence_chunker is None:
sentence_chunker = SentenceBoundaryChunker()
# 分割句子
sentences = sentence_chunker.split_sentences(text)
if len(sentences) <= 1:
return [text]
# 获取句子嵌入
embeddings = self.embed_sentences(sentences)
# 聚类
cluster_labels = self.cluster_sentences(embeddings, n_clusters, clustering_method)
# 合并小聚类(如果启用)
if merge_small_clusters:
cluster_labels = self._merge_small_clusters(cluster_labels, min_cluster_size)
# 根据聚类结果创建分块
chunks_dict = {}
for sent, label in zip(sentences, cluster_labels):
if label not in chunks_dict:
chunks_dict[label] = []
chunks_dict[label].append(sent)
# 按聚类顺序排序并合并句子
chunks = []
for label in sorted(chunks_dict.keys()):
chunk_text = " ".join(chunks_dict[label])
chunks.append(chunk_text)
return chunks
def _merge_small_clusters(self, labels: np.ndarray, min_size: int = 2) -> np.ndarray:
"""合并小聚类到最近的大聚类"""
from collections import Counter
label_counts = Counter(labels)
small_clusters = [label for label, count in label_counts.items()
if count < min_size and label != -1]
if not small_clusters:
return labels
# 计算聚类中心
unique_labels = np.unique(labels)
label_to_center = {}
for label in unique_labels:
if label == -1: # 噪声点
continue
mask = labels == label
if mask.any():
# 简单使用标签索引作为代理
label_to_center[label] = np.mean(np.where(mask)[0])
# 合并小聚类
new_labels = labels.copy()
for small_label in small_clusters:
if small_label not in label_to_center:
continue
# 找到最近的大聚类
small_center = label_to_center[small_label]
distances = []
for label, center in label_to_center.items():
if label != small_label and label_counts[label] >= min_size:
distances.append((abs(center - small_center), label))
if distances:
distances.sort()
nearest_label = distances[0][1]
new_labels[labels == small_label] = nearest_label
return new_labels
def hierarchical_chunking(self,
text: str,
sentence_chunker: SentenceBoundaryChunker = None,
max_chunk_size: int = 500,
min_chunk_size: int = 100) -> List[str]:
"""
层次化分块:先聚类,再按大小调整
Args:
text: 输入文本
sentence_chunker: 句子分割器
max_chunk_size: 最大块大小
min_chunk_size: 最小块大小
Returns:
分块列表
"""
if sentence_chunker is None:
sentence_chunker = SentenceBoundaryChunker()
# 第一步:基于聚类分块
sentences = sentence_chunker.split_sentences(text)
if len(sentences) <= 1:
return [text]
# 确定聚类数量
n_clusters = TextPreprocessor.calculate_optimal_chunks(
sum(len(s) for s in sentences),
target_chunk_size=(max_chunk_size + min_chunk_size) // 2
)
# 聚类分块
cluster_chunks = self.chunk_by_clustering(
text, sentence_chunker, n_clusters, 'kmeans'
)
# 第二步:调整块大小
final_chunks = []
for chunk in cluster_chunks:
if len(chunk) <= max_chunk_size:
final_chunks.append(chunk)
else:
# 对过大的块进行进一步分割
sub_sentences = sentence_chunker.split_sentences(chunk)
sub_chunk = []
sub_length = 0
for sent in sub_sentences:
sent_length = len(sent)
if sub_length + sent_length > max_chunk_size and sub_chunk:
final_chunks.append(" ".join(sub_chunk))
sub_chunk = [sent]
sub_length = sent_length
else:
sub_chunk.append(sent)
sub_length += sent_length
if sub_chunk:
final_chunks.append(" ".join(sub_chunk))
return final_chunks
# 策略3:主题建模分块
class TopicModelingChunker:
"""基于主题建模的分块策略"""
def __init__(self, model_type: str = 'bertopic'):
"""
初始化主题分块器
Args:
model_type: 主题模型类型 ('bertopic', 'lda')
"""
self.model_type = model_type
self.topic_model = None
def chunk_by_bertopic(self,
text: str,
sentence_chunker: SentenceBoundaryChunker = None,
nr_topics: str = 'auto',
min_topic_size: int = 2) -> List[Tuple[str, int]]:
"""
使用BERTopic进行主题分块
Args:
text: 输入文本
sentence_chunker: 句子分割器
nr_topics: 主题数量 ('auto'或整数)
min_topic_size: 最小主题大小
Returns:
(分块文本, 主题标签) 元组列表
"""
if sentence_chunker is None:
sentence_chunker = SentenceBoundaryChunker()
# 分割句子
sentences = sentence_chunker.split_sentences(text)
if len(sentences) <= 1:
return [(text, -1)]
try:
from bertopic import BERTopic
# 创建主题模型
self.topic_model = BERTopic(
nr_topics=nr_topics,
min_topic_size=min_topic_size,
language='english',
calculate_probabilities=True,
verbose=False
)
# 拟合模型
topics, probabilities = self.topic_model.fit_transform(sentences)
# 根据主题分组
topic_to_sentences = {}
for sent, topic in zip(sentences, topics):
if topic not in topic_to_sentences:
topic_to_sentences[topic] = []
topic_to_sentences[topic].append(sent)
# 创建分块
chunks = []
for topic in sorted(topic_to_sentences.keys()):
chunk_text = " ".join(topic_to_sentences[topic])
chunks.append((chunk_text, topic))
return chunks
except ImportError:
print("BERTopic未安装,使用LDA替代")
return self.chunk_by_lda(text, sentence_chunker)
def chunk_by_lda(self,
text: str,
sentence_chunker: SentenceBoundaryChunker = None,
n_topics: int = 5) -> List[Tuple[str, int]]:
"""
使用LDA进行主题分块
Args:
text: 输入文本
sentence_chunker: 句子分割器
n_topics: 主题数量
Returns:
(分块文本, 主题标签) 元组列表
"""
if sentence_chunker is None:
sentence_chunker = SentenceBoundaryChunker()
# 分割句子
sentences = sentence_chunker.split_sentences(text)
if len(sentences) <= 1:
return [(text, 0)]
try:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import nltk
from nltk.corpus import stopwords
# 下载停用词(如果需要)
try:
nltk.data.find('corpora/stopwords')
except LookupError:
nltk.download('stopwords')
# 创建文档-词矩阵
stop_words = stopwords.words('english')
vectorizer = CountVectorizer(
max_df=0.95,
min_df=2,
stop_words=stop_words,
max_features=1000
)
doc_term_matrix = vectorizer.fit_transform(sentences)
# 训练LDA模型
lda = LatentDirichletAllocation(
n_components=min(n_topics, len(sentences)),
random_state=42,
max_iter=10,
learning_method='online'
)
lda.fit(doc_term_matrix)
# 获取每个句子的主题
topic_distributions = lda.transform(doc_term_matrix)
sentence_topics = topic_distributions.argmax(axis=1)
# 根据主题分组
topic_to_sentences = {}
for sent, topic in zip(sentences, sentence_topics):
if topic not in topic_to_sentences:
topic_to_sentences[topic] = []
topic_to_sentences[topic].append(sent)
# 创建分块
chunks = []
for topic in sorted(topic_to_sentences.keys()):
chunk_text = " ".join(topic_to_sentences[topic])
chunks.append((chunk_text, topic))
return chunks
except Exception as e:
print(f"LDA处理失败: {e}")
# 退回使用句子分块
chunks = sentence_chunker.chunk_by_sentence(text)
return [(chunk, 0) for chunk in chunks]
def dynamic_topic_chunking(self,
text: str,
max_chunk_size: int = 500,
topic_change_threshold: float = 0.3) -> List[str]:
"""
动态主题分块:在主题变化处分割
Args:
text: 输入文本
max_chunk_size: 最大块大小
topic_change_threshold: 主题变化阈值
Returns:
分块列表
"""
# 使用滑动窗口检测主题变化
sentence_chunker = SentenceBoundaryChunker()
sentences = sentence_chunker.split_sentences(text)
if len(sentences) <= 1:
return [text]
try:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
chunks = []
current_chunk = []
current_length = 0
for i, sent in enumerate(sentences):
sent_length = len(sent)
# 检查主题变化
is_topic_change = False
if i > 0:
# 计算当前句子与前一句的相似度
similarity = cosine_similarity(
embeddings[i].reshape(1, -1),
embeddings[i-1].reshape(1, -1)
)[0][0]
if similarity < topic_change_threshold:
is_topic_change = True
# 检查长度边界
is_length_boundary = current_length + sent_length > max_chunk_size and current_chunk
if (is_topic_change or is_length_boundary) and current_chunk:
chunks.append(" ".join(current_chunk))
current_chunk = [sent]
current_length = sent_length
else:
current_chunk.append(sent)
current_length += sent_length
if current_chunk:
chunks.append(" ".join(current_chunk))
return chunks
except ImportError:
print("sentence-transformers未安装,使用基础分句")
return sentence_chunker.chunk_by_sentence(text, max_chunk_size)
# 高级语义分块器(集成所有策略)
class AdvancedSemanticChunker:
"""高级语义分块器(集成所有策略)"""
def __init__(self, strategy: str = 'auto'):
"""
初始化高级分块器
Args:
strategy: 分块策略 ('auto', 'sentence', 'clustering', 'topic')
"""
self.strategy = strategy
self.sentence_chunker = SentenceBoundaryChunker()
self.clustering_chunker = SemanticClusteringChunker()
self.topic_chunker = TopicModelingChunker()
# 策略选择器
self.strategy_selector = {
'sentence': self._chunk_by_sentence,
'clustering': self._chunk_by_clustering,
'topic': self._chunk_by_topic,
'hybrid': self._chunk_hybrid
}
def chunk(self,
text: str,
max_chunk_size: int = 500,
min_chunk_size: int = 100,
strategy: str = None) -> Dict[str, Any]:
"""
智能分块
Args:
text: 输入文本
max_chunk_size: 最大块大小
min_chunk_size: 最小块大小
strategy: 分块策略(覆盖初始化设置)
Returns:
包含分块结果和元数据的字典
"""
if strategy is None:
strategy = self.strategy
# 自动选择策略
if strategy == 'auto':
strategy = self._select_strategy(text)
# 执行分块
if strategy in self.strategy_selector:
chunks = self.strategy_selector[strategy](text, max_chunk_size, min_chunk_size)
else:
# 默认使用句子分块
chunks = self._chunk_by_sentence(text, max_chunk_size, min_chunk_size)
# 分析结果
chunk_sizes = [len(chunk) for chunk in chunks]
return {
'chunks': chunks,
'num_chunks': len(chunks),
'avg_chunk_size': np.mean(chunk_sizes) if chunk_sizes else 0,
'min_chunk_size': min(chunk_sizes) if chunk_sizes else 0,
'max_chunk_size': max(chunk_sizes) if chunk_sizes else 0,
'strategy_used': strategy
}
def _select_strategy(self, text: str) -> str:
"""自动选择最适合的分块策略"""
# 简单的启发式规则
text_length = len(text)
sentences = self.sentence_chunker.split_sentences(text)
num_sentences = len(sentences)
if text_length < 1000:
return 'sentence'
elif num_sentences > 20:
return 'clustering'
else:
return 'hybrid'
def _chunk_by_sentence(self, text: str, max_size: int, min_size: int) -> List[str]:
"""句子边界分块"""
return self.sentence_chunker.chunk_by_sentence(text, max_size)
def _chunk_by_clustering(self, text: str, max_size: int, min_size: int) -> List[str]:
"""聚类分块"""
return self.clustering_chunker.hierarchical_chunking(
text, self.sentence_chunker, max_size, min_size
)
def _chunk_by_topic(self, text: str, max_size: int, min_size: int) -> List[str]:
"""主题分块"""
chunks_with_topics = self.topic_chunker.dynamic_topic_chunking(text, max_size)
# 只返回文本,不返回主题标签
return chunks_with_topics if isinstance(chunks_with_topics[0], str) else \
[chunk for chunk, _ in chunks_with_topics]
def _chunk_hybrid(self, text: str, max_size: int, min_size: int) -> List[str]:
"""混合策略分块"""
# 先用主题分块
topic_chunks = self._chunk_by_topic(text, max_size * 2, min_size)
# 如果主题分块结果过大,再进行细分
final_chunks = []
for chunk in topic_chunks:
if len(chunk) <= max_size:
final_chunks.append(chunk)
else:
# 使用聚类进一步分割
sub_chunks = self.clustering_chunker.hierarchical_chunking(
chunk, self.sentence_chunker, max_size, min_size
)
final_chunks.extend(sub_chunks)
return final_chunks
# 示例用法
def run_demo():
"""运行演示"""
# 示例文本
sample_text = """
Artificial Intelligence (AI) is transforming various industries.
Machine learning algorithms can now recognize patterns in data that were previously undetectable.
Natural Language Processing (NLP) is a subfield of AI focused on language understanding.
Recent advances in transformer models like BERT and GPT have revolutionized NLP tasks.
Computer vision enables machines to interpret and understand visual information.
Deep learning models can now achieve human-level performance on image classification tasks.
Robotics combines AI with mechanical engineering to create intelligent machines.
Autonomous vehicles use sensors and AI algorithms to navigate without human intervention.
The ethical implications of AI are becoming increasingly important.
Researchers are developing frameworks for responsible AI development and deployment.
Future trends in AI include explainable AI, federated learning, and neuromorphic computing.
These technologies promise to make AI more transparent, privacy-preserving, and efficient.
"""
print("=" * 80)
print("基于语义分块策略演示")
print("=" * 80)
# 创建分块器
print("\n1. 句子边界分块策略")
sentence_chunker = SentenceBoundaryChunker()
sentence_chunks = sentence_chunker.chunk_by_sentence(sample_text, max_chunk_length=200)
for i, chunk in enumerate(sentence_chunks):
print(f"\n块 {i+1} ({len(chunk)} 字符):")
print(f"{chunk[:150]}..." if len(chunk) > 150 else chunk)
print("\n" + "=" * 80)
print("2. 语义聚类分块策略")
try:
clustering_chunker = SemanticClusteringChunker()
cluster_chunks = clustering_chunker.hierarchical_chunking(
sample_text, sentence_chunker, max_chunk_size=250
)
for i, chunk in enumerate(cluster_chunks):
print(f"\n块 {i+1} ({len(chunk)} 字符):")
print(f"{chunk[:150]}..." if len(chunk) > 150 else chunk)
except Exception as e:
print(f"聚类分块失败: {e}")
print("需要安装: pip install sentence-transformers scikit-learn")
print("\n" + "=" * 80)
print("3. 主题建模分块策略")
try:
topic_chunker = TopicModelingChunker()
topic_chunks = topic_chunker.dynamic_topic_chunking(sample_text, max_chunk_size=250)
for i, chunk in enumerate(topic_chunks):
print(f"\n块 {i+1} ({len(chunk)} 字符):")
print(f"{chunk[:150]}..." if len(chunk) > 150 else chunk)
except Exception as e:
print(f"主题分块失败: {e}")
print("需要安装: pip install sentence-transformers")
print("\n" + "=" * 80)
print("4. 高级智能分块器")
advanced_chunker = AdvancedSemanticChunker(strategy='auto')
result = advanced_chunker.chunk(sample_text, max_chunk_size=300, min_chunk_size=50)
print(f"使用的策略: {result['strategy_used']}")
print(f"分块数量: {result['num_chunks']}")
print(f"平均块大小: {result['avg_chunk_size']:.0f} 字符")
for i, chunk in enumerate(result['chunks']):
print(f"\n块 {i+1} ({len(chunk)} 字符):")
print(f"{chunk[:150]}..." if len(chunk) > 150 else chunk)
print("\n" + "=" * 80)
print("5. 性能对比")
# 测试不同策略的性能
strategies = ['sentence', 'clustering', 'topic', 'hybrid']
for strategy in strategies:
try:
advanced_chunker.strategy = strategy
result = advanced_chunker.chunk(sample_text, max_chunk_size=300)
print(f"\n策略 '{strategy}': {result['num_chunks']} 个块, "
f"平均大小: {result['avg_chunk_size']:.0f} 字符")
except Exception as e:
print(f"\n策略 '{strategy}' 失败: {e}")
def benchmark_strategies():
"""基准测试不同分块策略"""
import time
# 更长的测试文本
test_text = """
The history of artificial intelligence dates back to ancient times, with myths and stories about artificial beings endowed with intelligence.
The modern field of AI research was founded at a workshop at Dartmouth College in 1956.
Early AI research focused on symbolic approaches and problem-solving.
The 1970s and 1980s saw the rise of expert systems, which attempted to capture human expertise in specific domains.
Machine learning emerged as a dominant approach in the 1990s, with statistical methods becoming more prominent.
The development of deep learning in the 2010s, powered by increased computational resources and large datasets, led to breakthroughs in computer vision, natural language processing, and other areas.
Current AI applications span numerous industries. In healthcare, AI assists in diagnosis, drug discovery, and personalized treatment plans.
In finance, algorithms detect fraud, automate trading, and provide personalized financial advice.
Autonomous vehicles combine computer vision, sensor fusion, and control systems to navigate roads safely.
Robotics applications range from manufacturing and logistics to domestic assistance and surgery.
Natural language processing has enabled virtual assistants, real-time translation, and sentiment analysis.
Large language models like GPT-4 can generate human-like text and engage in conversations.
Ethical considerations in AI include bias in algorithms, privacy concerns, and the potential for job displacement.
Researchers are developing techniques for explainable AI to make algorithms more transparent and accountable.
Future directions include artificial general intelligence (AGI), which aims to create machines with human-like cognitive abilities.
Other areas of research include neuromorphic computing, quantum machine learning, and AI safety.
The societal impact of AI continues to be debated, with discussions about regulation, economic effects, and long-term implications for humanity.
International collaborations and standards are emerging to guide the responsible development of AI technologies.
"""
print("\n" + "=" * 80)
print("分块策略性能基准测试")
print("=" * 80)
chunker = AdvancedSemanticChunker()
strategies = ['sentence', 'clustering', 'topic', 'hybrid']
for strategy in strategies:
try:
start_time = time.time()
# 运行多次取平均
runs = 3
for _ in range(runs):
result = chunker.chunk(test_text, strategy=strategy)
avg_time = (time.time() - start_time) / runs
print(f"\n策略: {strategy:15} | "
f"时间: {avg_time:.3f}s | "
f"块数: {result['num_chunks']:3d} | "
f"平均大小: {result['avg_chunk_size']:.0f}")
except Exception as e:
print(f"\n策略: {strategy:15} | 失败: {str(e)[:50]}")
if __name__ == "__main__":
print("基于语义的分块策略完整演示")
print("=" * 80)
# 运行演示
run_demo()
# 运行性能测试(可选)
# benchmark_strategies()
运行结果:
(1)run_demo()
bash
基于语义的分块策略完整演示
================================================================================
================================================================================
基于语义分块策略演示
================================================================================
1. 句子边界分块策略
✓ 使用spaCy模型
块 1 (160 字符):
Artificial Intelligence AI is transforming various industries. Machine learning algorithms can now recognize patterns in data that were previously und...
块 2 (184 字符):
Machine learning algorithms can now recognize patterns in data that were previously undetectable. Natural Language Processing NLP is a subfield of AI ...
块 3 (173 字符):
Natural Language Processing NLP is a subfield of AI focused on language understanding. Recent advances in transformer models like BERT and GPT have re...
块 4 (167 字符):
Recent advances in transformer models like BERT and GPT have revolutionized NLP tasks. Computer vision enables machines to interpret and understand vi...
块 5 (172 字符):
Computer vision enables machines to interpret and understand visual information. Deep learning models can now achieve human-level performance on image...
块 6 (172 字符):
Deep learning models can now achieve human-level performance on image classification tasks. Robotics combines AI with mechanical engineering to create...
块 7 (170 字符):
Robotics combines AI with mechanical engineering to create intelligent machines. Autonomous vehicles use sensors and AI algorithms to navigate without...
块 8 (157 字符):
Autonomous vehicles use sensors and AI algorithms to navigate without human intervention. The ethical implications of AI are becoming increasingly imp...
块 9 (152 字符):
The ethical implications of AI are becoming increasingly important. Researchers are developing frameworks for responsible AI development and deploymen...
块 10 (176 字符):
Researchers are developing frameworks for responsible AI development and deployment. Future trends in AI include explainable AI, federated learning, a...
块 11 (182 字符):
Future trends in AI include explainable AI, federated learning, and neuromorphic computing. These technologies promise to make AI more transparent, pr...
================================================================================
2. 语义聚类分块策略
(首次运行时会自动下载 all-MiniLM-L6-v2 模型权重,下载进度条输出从略)
✓ 加载嵌入模型: all-MiniLM-L6-v2
块 1 (172 字符):
Computer vision enables machines to interpret and understand visual information. Deep learning models can now achieve human-level performance on image...
块 2 (89 字符):
Autonomous vehicles use sensors and AI algorithms to navigate without human intervention.
块 3 (244 字符):
The ethical implications of AI are becoming increasingly important. Researchers are developing frameworks for responsible AI development and deploymen...
块 4 (90 字符):
These technologies promise to make AI more transparent, privacy-preserving, and efficient.
块 5 (143 字符):
Artificial Intelligence AI is transforming various industries. Robotics combines AI with mechanical engineering to create intelligent machines.
块 6 (184 字符):
Machine learning algorithms can now recognize patterns in data that were previously undetectable. Natural Language Processing NLP is a subfield of AI ...
块 7 (86 字符):
Recent advances in transformer models like BERT and GPT have revolutionized NLP tasks.
================================================================================
3. 主题建模分块策略
✓ 使用spaCy模型
块 1 (160 字符):
Artificial Intelligence AI is transforming various industries. Machine learning algorithms can now recognize patterns in data that were previously und...
块 2 (173 字符):
Natural Language Processing NLP is a subfield of AI focused on language understanding. Recent advances in transformer models like BERT and GPT have re...
块 3 (172 字符):
Computer vision enables machines to interpret and understand visual information. Deep learning models can now achieve human-level performance on image...
块 4 (238 字符):
Robotics combines AI with mechanical engineering to create intelligent machines. Autonomous vehicles use sensors and AI algorithms to navigate without...
块 5 (176 字符):
Researchers are developing frameworks for responsible AI development and deployment. Future trends in AI include explainable AI, federated learning, a...
块 6 (90 字符):
These technologies promise to make AI more transparent, privacy-preserving, and efficient.
================================================================================
4. 高级智能分块器
✓ 使用spaCy模型
✓ 使用spaCy模型
✓ 加载嵌入模型: all-MiniLM-L6-v2
使用的策略: hybrid
分块数量: 6
平均块大小: 168 字符
块 1 (160 字符):
Artificial Intelligence AI is transforming various industries. Machine learning algorithms can now recognize patterns in data that were previously und...
块 2 (173 字符):
Natural Language Processing NLP is a subfield of AI focused on language understanding. Recent advances in transformer models like BERT and GPT have re...
块 3 (172 字符):
Computer vision enables machines to interpret and understand visual information. Deep learning models can now achieve human-level performance on image...
块 4 (170 字符):
Robotics combines AI with mechanical engineering to create intelligent machines. Autonomous vehicles use sensors and AI algorithms to navigate without...
块 5 (244 字符):
The ethical implications of AI are becoming increasingly important. Researchers are developing frameworks for responsible AI development and deploymen...
块 6 (90 字符):
These technologies promise to make AI more transparent, privacy-preserving, and efficient.
================================================================================
5. 性能对比
策略 'sentence': 6 个块, 平均大小: 236 字符
策略 'clustering': 4 个块, 平均大小: 253 字符
✓ 使用spaCy模型
策略 'topic': 5 个块, 平均大小: 202 字符
✓ 使用spaCy模型
策略 'hybrid': 6 个块, 平均大小: 168 字符
(2)benchmark_strategies()
bash
基于语义的分块策略完整演示
================================================================================
================================================================================
分块策略性能基准测试
================================================================================
✓ 使用spaCy模型
策略: sentence | 时间: 0.041s | 块数: 6 | 平均大小: 424
✓ 加载嵌入模型: all-MiniLM-L6-v2
策略: clustering | 时间: 3.000s | 块数: 7 | 平均大小: 296
✓ 使用spaCy模型
✓ 使用spaCy模型
✓ 使用spaCy模型
策略: topic | 时间: 5.964s | 块数: 7 | 平均大小: 296
✓ 使用spaCy模型
✓ 使用spaCy模型
✓ 使用spaCy模型
策略: hybrid | 时间: 6.318s | 块数: 7 | 平均大小: 296
关键技术
- 句子分割质量:依赖NLP工具准确性
- 嵌入模型选择:Sentence-BERT、Instructor等
- 相似度阈值:需要调优的超参数,不同阈值下的分块数量差异见下方示例
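相似度阈值没有普适取值,下面给出一个阈值扫描的最小示意(沿用上文的 all-MiniLM-L6-v2 嵌入模型,阈值候选列表为假设值),统计不同阈值下相邻句子相似度低于阈值的"语义边界"数量,辅助调参:
python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def sweep_thresholds(sentences, thresholds=(0.2, 0.3, 0.4, 0.5)):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(sentences)
    # 相邻句子两两之间的余弦相似度
    sims = [cosine_similarity(embeddings[i:i+1], embeddings[i+1:i+2])[0][0]
            for i in range(len(sentences) - 1)]
    for t in thresholds:
        boundaries = sum(1 for s in sims if s < t)  # 低于阈值视为语义边界
        print(f"阈值 {t:.1f}: {boundaries} 个边界 -> 约 {boundaries + 1} 个块")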
优缺点
优点:
- 保持语义完整性
- 提高下游任务性能
- 更符合人类阅读习惯
缺点:
- 计算成本高
- 依赖语言特定工具
- 实现复杂度高
适用场景
- 文档摘要生成
- 问答系统
- 语义搜索应用
五、基于标题的递归分块(Recursive Chunking by Headings)
核心思想
利用文档的结构信息(标题、章节)进行层次化分割。
文档结构分析
1. 标题层级识别
bash
H1: 第一章
H2: 第一节
H3: 1.1小节
H2: 第二节
H1: 第二章
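上述标题层级通常需要先从原始文本中解析出来。下面是一个从 Markdown 文本提取扁平标题列表的最小示意(基于正则表达式,字段与后文 create_document_structure_from_headings 所需的 'level'、'title'、'content' 保持一致):
python
import re
from typing import List, Dict, Any

def parse_markdown_headings(text: str) -> List[Dict[str, Any]]:
    """把 Markdown 文本解析为 [{'level', 'title', 'content'}, ...] 的扁平列表。"""
    headings = []
    current = None
    for line in text.splitlines():
        match = re.match(r'^(#{1,6})\s+(.+)$', line)
        if match:
            if current is not None:
                headings.append(current)
            current = {'level': len(match.group(1)),
                       'title': match.group(2).strip(),
                       'content': ''}
        elif current is not None:
            current['content'] += line + '\n'
    if current is not None:
        headings.append(current)
    return headings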
2. 递归分割算法
python
from typing import List, Dict, Any, Tuple, Optional
from dataclasses import dataclass
@dataclass
class HeadingNode:
"""文档节点结构"""
level: int # 标题层级,1表示最高级
title: str # 标题文本
content: str # 该标题下的内容(不包含子标题内容)
children: List['HeadingNode'] # 子标题节点
class DocumentStructure:
"""文档结构树"""
def __init__(self, root_nodes: List[HeadingNode] = None):
self.root_nodes = root_nodes or []
def add_root_node(self, node: HeadingNode):
self.root_nodes.append(node)
def recursive_heading_chunking(
doc_structure: DocumentStructure,
max_chunk_size: int = 1000,
min_chunk_size: int = 100,
overlap_size: int = 50,
include_metadata: bool = True
) -> List[Dict[str, Any]]:
"""
基于标题的递归分块算法
Args:
doc_structure: 文档结构树
max_chunk_size: 最大块大小(按词数计)
min_chunk_size: 最小块大小(避免过小的块)
overlap_size: 块之间的重叠大小(用于保持上下文)
include_metadata: 是否包含元数据(标题路径、层级等)
Returns:
分块列表,每个块包含内容和元数据
"""
chunks = []
def count_words(text: str) -> int:
"""简单的词数统计"""
return len(text.split())
def semantic_chunk_by_sentence(
text: str,
max_size: int,
min_size: int,
overlap: int
) -> List[str]:
"""
按句子进行语义分块(简化版)
实际应用中可以使用NLP库进行更准确的分句
"""
# 简单的句子分割(按句号、问号、感叹号分割)
sentences = []
current_sentence = ""
for char in text:
current_sentence += char
if char in '.!?。!?':
sentences.append(current_sentence.strip())
current_sentence = ""
if current_sentence.strip():
sentences.append(current_sentence.strip())
# 如果没有找到句子边界,按长度分割
if not sentences:
sentences = [text[i:i+500] for i in range(0, len(text), 500)]
# 合并句子到块中
chunks = []
current_chunk = ""
current_word_count = 0
for sentence in sentences:
sentence_word_count = count_words(sentence)
if current_word_count + sentence_word_count > max_size and current_chunk:
# 保存当前块
if current_word_count >= min_size:
chunks.append(current_chunk)
# 创建重叠块(保留部分上下文)
if overlap > 0:
# 简单实现:保留最后几个句子作为重叠
sentences_in_chunk = current_chunk.split('.')
overlap_text = '.'.join(sentences_in_chunk[-3:]) if len(sentences_in_chunk) > 3 else current_chunk
current_chunk = overlap_text + " "
current_word_count = count_words(overlap_text)
else:
current_chunk = ""
current_word_count = 0
else:
# 如果块太小,继续添加句子
pass
current_chunk += sentence + " "
current_word_count += sentence_word_count
# 添加最后一个块
if current_chunk and current_word_count >= min_size:
chunks.append(current_chunk.strip())
return chunks
def process_node(
node: HeadingNode,
current_path: List[Tuple[int, str]] = None,
parent_content: str = ""
):
"""递归处理节点"""
nonlocal chunks
current_path = current_path or []
# 构建完整的标题路径
path_with_current = current_path + [(node.level, node.title)]
# 构建上下文:父级内容(可选,用于提供上下文)
context = parent_content if len(parent_content) < 200 else parent_content[-200:] # 限制上下文长度
# 构建当前节点的完整内容
if context:
full_content = context + "\n" + node.title + "\n" + node.content
else:
full_content = node.title + "\n" + node.content
word_count = count_words(full_content)
# 决策:是否需要进一步分块
if word_count > max_chunk_size:
if node.children:
# 有子标题:递归处理子节点,传递当前内容作为部分上下文
for child in node.children:
# 传递给子节点的父内容包括当前标题和部分内容
child_parent_content = f"{node.title}\n{node.content[:500]}" # 限制传递给子节点的内容长度
process_node(child, path_with_current, child_parent_content)
else:
# 无子标题:使用语义分块
sub_chunks = semantic_chunk_by_sentence(
node.content,
max_chunk_size,
min_chunk_size,
overlap_size
)
for i, sub_chunk in enumerate(sub_chunks):
# 构建块内容
chunk_content = node.title + "\n" + sub_chunk
# 构建元数据
metadata = {
'heading_path': [title for _, title in path_with_current],
'heading_levels': [level for level, _ in path_with_current],
'current_heading': node.title,
'current_level': node.level,
'chunk_index': i,
'total_chunks': len(sub_chunks),
'word_count': count_words(chunk_content),
'chunk_type': 'semantic'
}
chunks.append({
'content': chunk_content,
'metadata': metadata if include_metadata else {}
})
else:
# 内容大小合适,作为完整块
metadata = {
'heading_path': [title for _, title in path_with_current],
'heading_levels': [level for level, _ in path_with_current],
'current_heading': node.title,
'current_level': node.level,
'chunk_index': 0,
'total_chunks': 1,
'word_count': word_count,
'chunk_type': 'full_heading'
}
chunks.append({
'content': full_content,
'metadata': metadata if include_metadata else {}
})
# 处理子节点
if node.children:
for child in node.children:
process_node(child, path_with_current, node.title)
# 从根节点开始处理
for root_node in doc_structure.root_nodes:
process_node(root_node)
return chunks
def create_document_structure_from_headings(
headings: List[Dict[str, Any]]
) -> DocumentStructure:
"""
从扁平化的标题列表构建文档结构树
Args:
headings: 标题列表,每个元素包含 'level', 'title', 'content'
Returns:
DocumentStructure 对象
"""
# 创建根节点
doc_structure = DocumentStructure()
# 使用栈来构建树结构
stack = []
current_parent = None
for heading in headings:
node = HeadingNode(
level=heading['level'],
title=heading['title'],
content=heading.get('content', ''),
children=[]
)
# 寻找父节点
while stack and stack[-1].level >= node.level:
stack.pop()
if stack:
# 添加到父节点的children中
stack[-1].children.append(node)
else:
# 这是一个根节点
doc_structure.add_root_node(node)
# 将当前节点压入栈
stack.append(node)
return doc_structure
# 使用示例
def example_usage():
# 示例:创建一个文档结构
headings = [
{'level': 1, 'title': '第一章:人工智能简介', 'content': '人工智能是...'},
{'level': 2, 'title': '1.1 人工智能的历史', 'content': '人工智能的历史可以追溯到...' * 100}, # 长内容
{'level': 3, 'title': '1.1.1 早期发展', 'content': '早期的AI研究集中在...'},
{'level': 3, 'title': '1.1.2 现代进展', 'content': '近年来,深度学习...' * 200}, # 很长的内容
{'level': 2, 'title': '1.2 人工智能的应用', 'content': 'AI在各个领域都有应用...'},
{'level': 1, 'title': '第二章:机器学习基础', 'content': '机器学习是AI的一个分支...'},
]
# 构建文档结构
doc_structure = create_document_structure_from_headings(headings)
# 进行递归分块
chunks = recursive_heading_chunking(
doc_structure=doc_structure,
max_chunk_size=500, # 较小的块大小用于演示
min_chunk_size=50,
overlap_size=20,
include_metadata=True
)
# 输出结果
print(f"总共生成 {len(chunks)} 个块:\n")
for i, chunk in enumerate(chunks):
print(f"块 {i+1}:")
print(f"标题路径: {' -> '.join(chunk['metadata']['heading_path'])}")
print(f"内容预览: {chunk['content'][:100]}...")
print(f"词数: {chunk['metadata']['word_count']}")
print(f"分块类型: {chunk['metadata']['chunk_type']}")
print("-" * 50)
def optimize_chunks_for_embedding(chunks: List[Dict], max_tokens: int = 512) -> List[Dict]:
"""
为嵌入模型优化分块
Args:
chunks: 原始分块列表
max_tokens: 嵌入模型的最大token数
Returns:
优化后的分块列表
"""
optimized = []
for chunk in chunks:
content = chunk['content']
metadata = chunk['metadata']
# 如果内容太长,进一步分割
if len(content.split()) > max_tokens * 0.75: # 留一些余量
# 按段落分割
paragraphs = [p.strip() for p in content.split('\n') if p.strip()]
current_chunk = ""
for para in paragraphs:
if len((current_chunk + "\n" + para).split()) > max_tokens * 0.75:
if current_chunk:
optimized.append({
'content': current_chunk.strip(),
'metadata': {**metadata, 'sub_chunk': True}
})
current_chunk = para
else:
current_chunk += "\n" + para if current_chunk else para
if current_chunk:
optimized.append({
'content': current_chunk.strip(),
'metadata': {**metadata, 'sub_chunk': True}
})
else:
optimized.append(chunk)
return optimized
if __name__ == "__main__":
# 运行示例
example_usage()
# 演示优化
print("\n\n优化分块示例:")
headings = [
{'level': 1, 'title': '长文档测试', 'content': 'Next stop is Jinghai Road.' * 200}
]
doc_structure = create_document_structure_from_headings(headings)
chunks = recursive_heading_chunking(doc_structure, max_chunk_size=300)
optimized = optimize_chunks_for_embedding(chunks, max_tokens=200)
print(f"原始分块: {len(chunks)} 个,优化后: {len(optimized)} 个")
结果:
bash
总共生成 6 个块:
块 1:
标题路径: 第一章:人工智能简介
内容预览: 第一章:人工智能简介
人工智能是......
词数: 2
分块类型: full_heading
--------------------------------------------------
块 2:
标题路径: 第一章:人工智能简介 -> 1.1 人工智能的历史
内容预览: 第一章:人工智能简介
1.1 人工智能的历史
人工智能的历史可以追溯到...人工智能的历史可以追溯到...人工智能的历史可以追溯到...人工智能的历史可以追溯到...人工智能的历史可以追溯到...人工...
词数: 4
分块类型: full_heading
--------------------------------------------------
块 3:
标题路径: 第一章:人工智能简介 -> 1.1 人工智能的历史 -> 1.1.1 早期发展
内容预览: 1.1 人工智能的历史
1.1.1 早期发展
早期的AI研究集中在......
词数: 5
分块类型: full_heading
--------------------------------------------------
块 4:
标题路径: 第一章:人工智能简介 -> 1.1 人工智能的历史 -> 1.1.2 现代进展
内容预览: 1.1 人工智能的历史
1.1.2 现代进展
近年来,深度学习...近年来,深度学习...近年来,深度学习...近年来,深度学习...近年来,深度学习...近年来,深度学习...近年来,深度学习......
词数: 5
分块类型: full_heading
--------------------------------------------------
块 5:
标题路径: 第一章:人工智能简介 -> 1.2 人工智能的应用
内容预览: 第一章:人工智能简介
1.2 人工智能的应用
AI在各个领域都有应用......
词数: 4
分块类型: full_heading
--------------------------------------------------
块 6:
标题路径: 第二章:机器学习基础
内容预览: 第二章:机器学习基础
机器学习是AI的一个分支......
词数: 2
分块类型: full_heading
--------------------------------------------------
优化分块示例:
原始分块: 4 个,优化后: 7 个
多级分块策略
- 第一级:按主要章节分割
- 第二级:按子章节分割
- 第三级:对长段落进行语义分割
元数据保留策略
- 保持标题层级信息
- 维护文档结构关系
- 添加块间链接信息(见下方示例)
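其中"块间链接信息"可以在分块完成后统一补充。下面是一个最小示意(prev_chunk_id、next_chunk_id 等字段名为假设命名),可与上文 recursive_heading_chunking 返回的块结构配合使用,便于检索后向前后扩展上下文:
python
from typing import List, Dict, Any

def add_chunk_links(chunks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """为每个块的 metadata 补充相邻块链接(prev_chunk_id / next_chunk_id 为示意字段)。"""
    for i, chunk in enumerate(chunks):
        meta = chunk.setdefault('metadata', {})
        meta['chunk_id'] = i
        meta['prev_chunk_id'] = i - 1 if i > 0 else None
        meta['next_chunk_id'] = i + 1 if i < len(chunks) - 1 else None
    return chunks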
优缺点
优点:
- 保持文档逻辑结构
- 便于导航和检索
- 适合技术文档和学术论文
缺点:
- 依赖文档格式规范
- 需要解析标题结构
- 对非结构化文档效果有限
适用场景
- 技术文档处理
- 学术论文分析
- 法律合同解析
六、分块策略对比与选择指南
策略对比表
| 策略类型 | 保持语义 | 计算效率 | 实现复杂度 | 适用文本类型 |
|------|------|------|-------|---------|
| 固定大小 | 低 | 高 | 低 | 简单、均匀文本 |
| 滑动窗口 | 中 | 中 | 中 | 序列数据 |
| 语义分块 | 高 | 低 | 高 | 自然语言文档 |
| 标题递归 | 高 | 中 | 高 | 结构化文档 |
选择标准
- 考虑文档类型:
  - 小说/文章:语义分块
  - 技术文档:标题递归
  - 日志数据:滑动窗口
- 考虑任务需求:
  - 检索任务:需要语义连贯
  - 分类任务:可接受固定分块
  - 生成任务:需要完整上下文
- 考虑资源限制:
  - 计算资源有限:选择简单策略
  - 存储成本敏感:减少重叠
  - 延迟要求高:优化分块速度
混合策略
在实践中,常使用混合策略:
python
import spacy
import re
from typing import List, Dict, Union, Tuple
from dataclasses import dataclass
# 加载spacy模型
try:
nlp = spacy.load("en_core_web_sm")
except:
print("正在下载spaCy模型...")
import subprocess
subprocess.run(["python", "-m", "spacy", "download", "en_core_web_sm"],
capture_output=True, text=True)
nlp = spacy.load("en_core_web_sm")
@dataclass
class Chunk:
"""分块数据类"""
content: str
metadata: Dict = None
chunk_type: str = "content"
def __post_init__(self):
if self.metadata is None:
self.metadata = {}
self.length = len(self.content)
self.word_count = len(self.content.split())
def __repr__(self):
return f"Chunk(type={self.chunk_type}, words={self.word_count}, chars={self.length})"
class DocumentNode:
"""文档节点类,用于构建文档树结构"""
def __init__(self, title: str = "", content: str = "", level: int = 0):
self.title = title
self.content = content
self.level = level # 标题级别,0为根节点
self.children = []
def add_child(self, child_node):
self.children.append(child_node)
def __repr__(self):
return f"DocumentNode('{self.title}', level={self.level}, children={len(self.children)})"
class HybridChunker:
"""混合分块策略"""
def __init__(self,
max_chunk_size: int = 1000, # 最大块大小(单词数)
min_chunk_size: int = 50, # 最小块大小
window_size: int = 400, # 滑动窗口大小
overlap: int = 100, # 滑动窗口重叠
sentence_model: str = "en_core_web_sm"):
"""
初始化混合分块器
Args:
max_chunk_size: 最大分块大小(单词数)
min_chunk_size: 最小分块大小
window_size: 滑动窗口大小
overlap: 滑动窗口重叠大小
"""
self.max_chunk_size = max_chunk_size
self.min_chunk_size = min_chunk_size
self.window_size = window_size
self.overlap = overlap
self.nlp = nlp
def has_clear_headings(self, text: str) -> bool:
"""
判断文档是否有清晰的标题结构
Args:
text: 输入文本
Returns:
bool: 是否有清晰标题
"""
# 检测Markdown风格标题
markdown_headings = re.findall(r'^(#{1,6})\s+.+', text, re.MULTILINE)
# 检测数字标题(如1., 1.1, 2.等)
numbered_headings = re.findall(r'^\d+(\.\d+)*\s+.+', text, re.MULTILINE)
# 检测大写标题(全大写且长度适中的行)
uppercase_headings = re.findall(r'^[A-Z][A-Z\s]{5,50}$', text, re.MULTILINE)
# 如果有足够多的标题标记,则认为有清晰结构
total_headings = len(markdown_headings) + len(numbered_headings) + len(uppercase_headings)
# 计算标题密度(每100行中的标题数)
lines = text.split('\n')
non_empty_lines = [line.strip() for line in lines if line.strip()]
if non_empty_lines:
heading_density = total_headings / len(non_empty_lines)
return heading_density > 0.05 # 如果超过5%的行是标题,则认为有清晰结构
return False
def parse_headings_structure(self, text: str) -> List[DocumentNode]:
"""
解析文档的标题结构
Args:
text: 输入文本
Returns:
List[DocumentNode]: 文档树结构
"""
lines = text.split('\n')
root_nodes = []
node_stack = [] # 用于跟踪当前节点路径
for line in lines:
line = line.rstrip()
# 跳过空行
if not line.strip():
continue
# 检测Markdown标题
md_match = re.match(r'^(#{1,6})\s+(.+)$', line)
if md_match:
level = len(md_match.group(1)) # #的数量表示级别
title = md_match.group(2).strip()
node = DocumentNode(title=title, level=level)
# 根据级别找到父节点
while node_stack and node_stack[-1].level >= level:
node_stack.pop()
if node_stack:
node_stack[-1].add_child(node)
else:
root_nodes.append(node)
node_stack.append(node)
continue
# 检测数字标题
num_match = re.match(r'^(\d+(\.\d+)*)\s+(.+)$', line)
if num_match:
# 通过点号数量确定级别
level = num_match.group(1).count('.') + 1
title = num_match.group(3).strip()
node = DocumentNode(title=title, level=level)
# 根据级别找到父节点
while node_stack and node_stack[-1].level >= level:
node_stack.pop()
if node_stack:
node_stack[-1].add_child(node)
else:
root_nodes.append(node)
node_stack.append(node)
continue
# 检测大写标题行
if line.isupper() and 10 <= len(line) <= 100:
level = 1 # 顶级标题
node = DocumentNode(title=line, level=level)
# 重置栈,因为大写标题通常是顶级标题
node_stack = [node]
root_nodes.append(node)
continue
# 普通内容行,添加到当前节点
if node_stack:
if node_stack[-1].content:
node_stack[-1].content += '\n' + line
else:
node_stack[-1].content = line
# 如果没有检测到标题结构,创建一个根节点包含所有内容
if not root_nodes:
root_node = DocumentNode(title="Document", content=text, level=0)
root_nodes.append(root_node)
return root_nodes
def semantic_chunk_by_sentence(self, text: str) -> List[Chunk]:
"""
基于句子的语义分块
Args:
text: 输入文本
Returns:
List[Chunk]: 分块列表
"""
doc = nlp(text)
chunks = []
current_chunk = []
current_length = 0
for sent in doc.sents:
sent_text = sent.text.strip()
if not sent_text:
continue
sent_word_count = len(sent_text.split())
# 如果当前块加上新句子超过最大长度,并且当前块不为空,则保存当前块
if current_length + sent_word_count > self.max_chunk_size and current_chunk:
chunk_text = " ".join(current_chunk)
if len(chunk_text.split()) >= self.min_chunk_size:
chunks.append(Chunk(content=chunk_text,
metadata={"chunking_method": "semantic_sentence"}))
current_chunk = [sent_text]
current_length = sent_word_count
else:
current_chunk.append(sent_text)
current_length += sent_word_count
# 添加最后一个块
if current_chunk:
chunk_text = " ".join(current_chunk)
if len(chunk_text.split()) >= self.min_chunk_size:
chunks.append(Chunk(content=chunk_text,
metadata={"chunking_method": "semantic_sentence"}))
return chunks
def sliding_window_chunking(self, text: str) -> List[Chunk]:
"""
滑动窗口分块
Args:
text: 输入文本
Returns:
List[Chunk]: 分块列表
"""
words = text.split()
chunks = []
if len(words) <= self.window_size:
# 如果文本本身小于窗口大小,直接返回
return [Chunk(content=text, metadata={"chunking_method": "sliding_window"})]
# 使用滑动窗口分割
start = 0
while start < len(words):
end = min(start + self.window_size, len(words))
chunk_words = words[start:end]
chunk_text = " ".join(chunk_words)
chunks.append(Chunk(content=chunk_text,
metadata={"chunking_method": "sliding_window"}))
# 移动窗口,考虑重叠
start += self.window_size - self.overlap
return chunks
def recursive_heading_chunking(self, nodes: List[DocumentNode]) -> List[Chunk]:
"""
基于标题的递归分块
Args:
nodes: 文档节点列表
Returns:
List[Chunk]: 分块列表
"""
chunks = []
def process_node(node: DocumentNode, parent_titles: List[str] = None):
if parent_titles is None:
parent_titles = []
# 构建当前标题路径
current_titles = parent_titles + [node.title] if node.title else parent_titles
title_path = " > ".join(current_titles) if current_titles else ""
# 如果有内容,处理内容
if node.content:
full_content = node.content
if title_path:
full_content = f"{title_path}\n\n{full_content}"
word_count = len(full_content.split())
# 如果内容超过最大限制,进一步分割
if word_count > self.max_chunk_size:
# 如果有子节点,优先按子节点分割
if node.children:
for child in node.children:
process_node(child, current_titles)
else:
# 否则使用语义分块
sub_chunks = self.semantic_chunk_by_sentence(node.content)
for i, chunk in enumerate(sub_chunks):
chunk_title = f"{title_path} (Part {i+1})" if title_path else f"Part {i+1}"
chunk_metadata = chunk.metadata.copy()
chunk_metadata["title_path"] = chunk_title
chunks.append(Chunk(
content=chunk.content,
metadata=chunk_metadata
))
else:
# 内容大小合适,直接添加
chunks.append(Chunk(
content=full_content,
metadata={
"title_path": title_path,
"chunking_method": "recursive_heading"
}
))
# 处理子节点
for child in node.children:
process_node(child, current_titles)
# 处理所有根节点
for node in nodes:
process_node(node)
return chunks
def hybrid_chunking_strategy(self, document: str) -> List[Chunk]:
"""
混合分块策略
Args:
document: 输入文档文本
Returns:
List[Chunk]: 分块列表
"""
print("开始混合分块策略...")
# 第一步:尝试按标题分割
if self.has_clear_headings(document):
print("检测到清晰标题结构,使用递归标题分块...")
doc_structure = self.parse_headings_structure(document)
chunks = self.recursive_heading_chunking(doc_structure)
else:
# 第二步:没有清晰标题,使用语义分块
print("未检测到清晰标题结构,使用语义分块...")
chunks = self.semantic_chunk_by_sentence(document)
# 第三步:确保块大小合理
final_chunks = []
for chunk in chunks:
if chunk.word_count > self.max_chunk_size:
print(f"块过大 ({chunk.word_count} 词),使用滑动窗口分割...")
# 对过大的块使用滑动窗口
sub_chunks = self.sliding_window_chunking(chunk.content)
# 保留原始元数据
for sub_chunk in sub_chunks:
if sub_chunk.metadata.get("title_path"):
sub_chunk.metadata["original_title"] = sub_chunk.metadata["title_path"]
if chunk.metadata:
sub_chunk.metadata.update(chunk.metadata)
sub_chunk.metadata["chunking_method"] = "sliding_window"
final_chunks.extend(sub_chunks)
elif chunk.word_count < self.min_chunk_size:
print(f"块过小 ({chunk.word_count} 词),尝试与相邻块合并...")
# 尝试与下一个块合并(如果存在)
final_chunks.append(chunk)
else:
final_chunks.append(chunk)
# 第四步:合并过小的块
merged_chunks = self._merge_small_chunks(final_chunks)
print(f"分块完成!共生成 {len(merged_chunks)} 个块。")
return merged_chunks
def _merge_small_chunks(self, chunks: List[Chunk]) -> List[Chunk]:
"""
合并过小的块
Args:
chunks: 分块列表
Returns:
List[Chunk]: 合并后的分块列表
"""
if not chunks:
return []
merged_chunks = []
current_chunk = chunks[0]
for i in range(1, len(chunks)):
next_chunk = chunks[i]
# 如果当前块太小,尝试与下一个块合并
if current_chunk.word_count < self.min_chunk_size:
combined_content = current_chunk.content + "\n\n" + next_chunk.content
combined_word_count = len(combined_content.split())
# 如果合并后不超过最大限制,则合并
if combined_word_count <= self.max_chunk_size:
# 合并元数据
merged_metadata = current_chunk.metadata.copy()
merged_metadata.update(next_chunk.metadata)
merged_metadata["chunking_method"] = "merged"
current_chunk = Chunk(
content=combined_content,
metadata=merged_metadata
)
else:
# 如果合并后太大,保持当前块,开始新的合并
merged_chunks.append(current_chunk)
current_chunk = next_chunk
else:
# 当前块大小合适,直接添加
merged_chunks.append(current_chunk)
current_chunk = next_chunk
# 添加最后一个块
if current_chunk.word_count >= self.min_chunk_size or not merged_chunks:
merged_chunks.append(current_chunk)
else:
# 如果最后一个块太小,尝试与上一个块合并
if merged_chunks:
last_chunk = merged_chunks[-1]
combined_content = last_chunk.content + "\n\n" + current_chunk.content
combined_word_count = len(combined_content.split())
if combined_word_count <= self.max_chunk_size:
merged_metadata = last_chunk.metadata.copy()
merged_metadata.update(current_chunk.metadata)
merged_metadata["chunking_method"] = "merged"
merged_chunks[-1] = Chunk(
content=combined_content,
metadata=merged_metadata
)
else:
merged_chunks.append(current_chunk)
return merged_chunks
# 示例文档
def create_sample_documents():
"""创建示例文档"""
# 文档1:有清晰标题结构的文档
document_with_headings = """
# Artificial Intelligence: A Comprehensive Overview
## Introduction to AI
Artificial Intelligence (AI) is the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. The term may also be applied to any machine that exhibits traits associated with a human mind such as learning and problem-solving.
## Machine Learning
Machine learning is a subset of AI that enables systems to learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
### Supervised Learning
Supervised learning is a type of machine learning where the model is trained on labeled data. Each training example is paired with an output label, and the algorithm learns to map inputs to outputs.
### Unsupervised Learning
Unsupervised learning deals with unlabeled data. The system tries to learn the patterns and structure from the input data without any explicit supervision.
## Deep Learning
Deep learning is a subset of machine learning that uses neural networks with multiple layers (deep neural networks) to analyze various factors of data. These neural networks attempt to simulate the behavior of the human brain---albeit far from matching its ability---allowing it to "learn" from large amounts of data.
## Natural Language Processing
Natural Language Processing (NLP) is a branch of AI that helps computers understand, interpret and manipulate human language. NLP draws from many disciplines, including computer science and computational linguistics, in its pursuit to fill the gap between human communication and computer understanding.
## Computer Vision
Computer vision is an interdisciplinary field that deals with how computers can gain high-level understanding from digital images or videos. From the perspective of engineering, it seeks to automate tasks that the human visual system can do.
## Conclusion
AI continues to evolve and impact various sectors including healthcare, finance, transportation, and entertainment. The future of AI holds immense potential but also presents significant ethical and societal challenges that need to be addressed.
"""
# 文档2:没有清晰标题结构的文档
document_without_headings = """
Artificial intelligence has become increasingly important in today's world. From voice assistants to recommendation systems, AI technologies are transforming how we live and work. The field of AI encompasses a wide range of techniques and approaches, each with its own strengths and limitations.
Machine learning algorithms can analyze vast amounts of data to identify patterns and make predictions. These algorithms are used in various applications, from fraud detection to medical diagnosis. Deep learning, a subset of machine learning, has achieved remarkable success in areas such as image recognition and natural language processing.
Natural language processing enables machines to understand and generate human language. This technology powers chatbots, translation services, and text analysis tools. Recent advances in transformer models have significantly improved the quality of machine-generated text.
Computer vision allows machines to interpret visual information from the world. This technology is used in facial recognition systems, autonomous vehicles, and quality control in manufacturing. Convolutional neural networks have been particularly successful in computer vision tasks.
Reinforcement learning involves training agents to make decisions through trial and error. This approach has been used to develop AI systems that can play complex games like Go and chess at superhuman levels. Reinforcement learning also has applications in robotics and resource management.
The ethical implications of AI are becoming increasingly important as these technologies become more pervasive. Issues such as bias in algorithms, privacy concerns, and the impact on employment need to be carefully considered. Developing transparent and accountable AI systems is crucial for building trust and ensuring beneficial outcomes.
Looking ahead, the future of AI will likely involve more sophisticated systems that can reason, learn, and adapt in complex environments. Research in areas such as explainable AI and AI safety will be important for addressing current limitations and ensuring responsible development.
Collaboration between researchers, policymakers, and industry stakeholders will be essential for harnessing the potential of AI while mitigating risks. Ongoing dialogue about the societal impact of AI will help guide the development of these technologies in ways that benefit humanity.
"""
    return {
        "with_headings": document_with_headings,
        "without_headings": document_without_headings
    }

def demonstrate_hybrid_chunking():
    """演示混合分块策略"""
    print("=" * 70)
    print("混合分块策略演示")
    print("=" * 70)

    # 创建分块器实例
    chunker = HybridChunker(
        max_chunk_size=300,  # 最大300词
        min_chunk_size=30,   # 最小30词
        window_size=250,     # 滑动窗口250词
        overlap=50           # 重叠50词
    )

    # 获取示例文档
    documents = create_sample_documents()

    # 测试有标题的文档
    print("\n1. 处理有清晰标题结构的文档:")
    print("-" * 50)
    chunks1 = chunker.hybrid_chunking_strategy(documents["with_headings"])
    print(f"\n生成的块数量: {len(chunks1)}")
    for i, chunk in enumerate(chunks1, 1):
        print(f"\n块 {i}:")
        print(f" 类型: {chunk.metadata.get('chunking_method', 'unknown')}")
        print(f" 大小: {chunk.word_count} 词")
        print(f" 标题路径: {chunk.metadata.get('title_path', 'N/A')}")
        print(f" 预览: {chunk.content[:100]}...")

    # 测试没有标题的文档
    print("\n\n2. 处理没有清晰标题结构的文档:")
    print("-" * 50)
    chunks2 = chunker.hybrid_chunking_strategy(documents["without_headings"])
    print(f"\n生成的块数量: {len(chunks2)}")
    for i, chunk in enumerate(chunks2, 1):
        print(f"\n块 {i}:")
        print(f" 类型: {chunk.metadata.get('chunking_method', 'unknown')}")
        print(f" 大小: {chunk.word_count} 词")
        print(f" 预览: {chunk.content[:100]}...")

    # 测试混合分块策略的各个组件
    print("\n\n3. 测试分块策略的各个组件:")
    print("-" * 50)

    # 测试标题检测
    print("\na) 标题检测:")
    print(f" 文档1是否有清晰标题: {chunker.has_clear_headings(documents['with_headings'])}")
    print(f" 文档2是否有清晰标题: {chunker.has_clear_headings(documents['without_headings'])}")

    # 测试语义分块
    print("\nb) 语义分块测试:")
    test_text = "This is a sentence. This is another sentence. And here is a third one."
    semantic_chunks = chunker.semantic_chunk_by_sentence(test_text)
    print(f" 测试文本: '{test_text}'")
    print(f" 生成的块数: {len(semantic_chunks)}")

    # 测试滑动窗口分块
    print("\nc) 滑动窗口分块测试:")
    long_text = " ".join(["Word" + str(i) for i in range(1000)])  # 创建1000个词的文本
    window_chunks = chunker.sliding_window_chunking(long_text)
    print(f" 长文本词数: 1000")
    print(f" 滑动窗口生成的块数: {len(window_chunks)}")
    print(f" 第一个块大小: {window_chunks[0].word_count} 词")
    print(f" 块之间的重叠: {chunker.overlap} 词")

    return chunks1, chunks2

def compare_chunking_methods():
    """比较不同分块方法的效果"""
    print("\n\n" + "=" * 70)
    print("分块方法比较")
    print("=" * 70)

    documents = create_sample_documents()
    test_document = documents["with_headings"]

    # 创建不同的分块器
    hybrid_chunker = HybridChunker(max_chunk_size=300)

    # 只使用语义分块
    semantic_chunks = hybrid_chunker.semantic_chunk_by_sentence(test_document)
    # 只使用滑动窗口分块
    window_chunks = hybrid_chunker.sliding_window_chunking(test_document)
    # 使用混合分块
    hybrid_chunks = hybrid_chunker.hybrid_chunking_strategy(test_document)

    print("\n不同分块方法的结果对比:")
    print("-" * 50)
    print(f"文档总词数: {len(test_document.split())}")
    print(f"语义分块生成的块数: {len(semantic_chunks)}")
    print(f"滑动窗口分块生成的块数: {len(window_chunks)}")
    print(f"混合分块生成的块数: {len(hybrid_chunks)}")

    print("\n块大小分布:")
    methods = [
        ("语义分块", semantic_chunks),
        ("滑动窗口", window_chunks),
        ("混合分块", hybrid_chunks)
    ]
    for method_name, chunks in methods:
        if chunks:
            sizes = [chunk.word_count for chunk in chunks]
            avg_size = sum(sizes) / len(sizes)
            min_size = min(sizes)
            max_size = max(sizes)
            print(f"\n{method_name}:")
            print(f" 平均大小: {avg_size:.1f} 词")
            print(f" 最小大小: {min_size} 词")
            print(f" 最大大小: {max_size} 词")
print(f" 大小标准差: {sum((s - avg_size)**2 for s in sizes)/len(sizes):.1f}")

def main():
    """主函数"""
    print("混合分块策略示例")
    print("=" * 70)
    print("本示例展示了结合多种分块技术的混合策略:")
    print("1. 标题检测与结构分析")
    print("2. 基于句子的语义分块")
    print("3. 滑动窗口分块")
    print("4. 块大小优化与合并")
    print("=" * 70)

    # 演示混合分块策略
    chunks1, chunks2 = demonstrate_hybrid_chunking()

    # 比较不同分块方法
    compare_chunking_methods()

    print("\n" + "=" * 70)
    print("混合分块策略的优势总结:")
    print("=" * 70)
    print("✓ 自适应: 根据文档结构选择最合适的分块方法")
    print("✓ 保持语义: 优先在句子边界分割")
    print("✓ 保持结构: 尊重文档的标题层级")
    print("✓ 大小优化: 确保块大小在合理范围内")
    print("✓ 元数据丰富: 保留分块方法和结构信息")
    print("✓ 鲁棒性: 适用于各种类型的文档")

if __name__ == "__main__":
    main()
运行结果:
bash
混合分块策略示例
======================================================================
本示例展示了结合多种分块技术的混合策略:
1. 标题检测与结构分析
2. 基于句子的语义分块
3. 滑动窗口分块
4. 块大小优化与合并
======================================================================
======================================================================
混合分块策略演示
======================================================================
1. 处理有清晰标题结构的文档:
--------------------------------------------------
开始混合分块策略...
检测到清晰标题结构,使用递归标题分块...
分块完成!共生成 8 个块。
生成的块数量: 8
块 1:
类型: recursive_heading
大小: 53 词
标题路径: Artificial Intelligence: A Comprehensive Overview > Introduction to AI
预览: Artificial Intelligence: A Comprehensive Overview > Introduction to AI
Artificial Intelligence (AI)...
块 2:
类型: recursive_heading
大小: 48 词
标题路径: Artificial Intelligence: A Comprehensive Overview > Machine Learning
预览: Artificial Intelligence: A Comprehensive Overview > Machine Learning
Machine learning is a subset o...
块 3:
类型: recursive_heading
大小: 45 词
标题路径: Artificial Intelligence: A Comprehensive Overview > Machine Learning > Supervised Learning
预览: Artificial Intelligence: A Comprehensive Overview > Machine Learning > Supervised Learning
Supervis...
块 4:
类型: recursive_heading
大小: 34 词
标题路径: Artificial Intelligence: A Comprehensive Overview > Machine Learning > Unsupervised Learning
预览: Artificial Intelligence: A Comprehensive Overview > Machine Learning > Unsupervised Learning
Unsupe...
块 5:
类型: recursive_heading
大小: 57 词
标题路径: Artificial Intelligence: A Comprehensive Overview > Deep Learning
预览: Artificial Intelligence: A Comprehensive Overview > Deep Learning
Deep learning is a subset of mach...
块 6:
类型: recursive_heading
大小: 51 词
标题路径: Artificial Intelligence: A Comprehensive Overview > Natural Language Processing
预览: Artificial Intelligence: A Comprehensive Overview > Natural Language Processing
Natural Language Pr...
块 7:
类型: recursive_heading
大小: 45 词
标题路径: Artificial Intelligence: A Comprehensive Overview > Computer Vision
预览: Artificial Intelligence: A Comprehensive Overview > Computer Vision
Computer vision is an interdisc...
块 8:
类型: recursive_heading
大小: 41 词
标题路径: Artificial Intelligence: A Comprehensive Overview > Conclusion
预览: Artificial Intelligence: A Comprehensive Overview > Conclusion
AI continues to evolve and impact va...
2. 处理没有清晰标题结构的文档:
--------------------------------------------------
开始混合分块策略...
未检测到清晰标题结构,使用语义分块...
分块完成!共生成 2 个块。
生成的块数量: 2
块 1:
类型: semantic_sentence
大小: 296 词
预览: Artificial intelligence has become increasingly important in today's world. From voice assistants to...
块 2:
类型: semantic_sentence
大小: 40 词
预览: Collaboration between researchers, policymakers, and industry stakeholders will be essential for har...
3. 测试分块策略的各个组件:
--------------------------------------------------
a) 标题检测:
文档1是否有清晰标题: True
文档2是否有清晰标题: False
b) 语义分块测试:
测试文本: 'This is a sentence. This is another sentence. And here is a third one.'
生成的块数: 0
c) 滑动窗口分块测试:
长文本词数: 1000
滑动窗口生成的块数: 5
第一个块大小: 250 词
块之间的重叠: 50 词
======================================================================
分块方法比较
======================================================================
开始混合分块策略...
检测到清晰标题结构,使用递归标题分块...
块过小 (48 词),尝试与相邻块合并...
块过小 (45 词),尝试与相邻块合并...
块过小 (34 词),尝试与相邻块合并...
块过小 (45 词),尝试与相邻块合并...
块过小 (41 词),尝试与相邻块合并...
分块完成!共生成 5 个块。
不同分块方法的结果对比:
--------------------------------------------------
文档总词数: 334
语义分块生成的块数: 2
滑动窗口分块生成的块数: 1
混合分块生成的块数: 5
块大小分布:
语义分块:
平均大小: 167.0 词
最小大小: 53 词
最大大小: 281 词
大小标准差: 114.0
滑动窗口:
平均大小: 334.0 词
最小大小: 334 词
最大大小: 334 词
大小标准差: 0.0
混合分块:
平均大小: 74.8 词
最小大小: 51 词
最大大小: 93 词
大小标准差: 18.8
======================================================================
混合分块策略的优势总结:
======================================================================
✓ 自适应: 根据文档结构选择最合适的分块方法
✓ 保持语义: 优先在句子边界分割
✓ 保持结构: 尊重文档的标题层级
✓ 大小优化: 确保块大小在合理范围内
✓ 元数据丰富: 保留分块方法和结构信息
✓ 鲁棒性: 适用于各种类型的文档
七、最佳实践与调优建议
1. 分块大小优化
- 分析任务性能与块大小的关系
- 考虑模型上下文窗口限制
- 平衡信息密度与计算效率(可用下面的参数扫描示意快速对比不同配置)
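下面给出一个假设性的最小示意(fixed_chunk、sweep_chunk_sizes 等函数与参数均为演示而设,并非前文 HybridChunker 的一部分):对候选 chunk_size 做参数扫描,统计块数与平均块大小,并过滤掉超出模型上下文窗口上限的配置。
python
# 假设性示例:扫描不同 chunk_size,观察块数与平均块大小
def fixed_chunk(text, chunk_size, overlap):
    """简单的词级固定分块(仅用于演示)"""
    words = text.split()
    step = max(chunk_size - overlap, 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

def sweep_chunk_sizes(text, candidate_sizes, overlap_ratio=0.1, context_limit=512):
    """对候选 chunk_size 做扫描,跳过超过上下文窗口限制的配置"""
    report = []
    for size in candidate_sizes:
        if size > context_limit:  # 超出模型上下文窗口,直接跳过
            continue
        chunks = fixed_chunk(text, size, overlap=int(size * overlap_ratio))
        avg_words = sum(len(c.split()) for c in chunks) / len(chunks)
        report.append({"chunk_size": size, "num_chunks": len(chunks), "avg_words": round(avg_words, 1)})
    return report

if __name__ == "__main__":
    sample = "word " * 2000  # 2000 个词的示例文本
    for row in sweep_chunk_sizes(sample, [128, 256, 512, 1024]):
        print(row)
实际调优时,可以把这里的统计指标替换成下游任务(如检索命中率)的评估结果,再在满足上下文限制的配置中择优。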
2. 重叠策略调优
- 关键信息区域增加重叠
- 非关键区域减少重叠
- 动态重叠策略:按内容重要性调整重叠量(示意见下)
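下面是按内容重要性调整重叠大小的一个简化示意(keyword_density、dynamic_overlap_chunking 均为假设的演示函数,用含数字或首字母大写的词粗略近似"关键信息"),只用于说明思路:
python
import re

def keyword_density(text):
    """粗略估计关键信息密度:含数字或首字母大写的词所占比例(启发式)"""
    words = text.split()
    if not words:
        return 0.0
    key_words = [w for w in words if re.search(r"\d", w) or w[:1].isupper()]
    return len(key_words) / len(words)

def dynamic_overlap_chunking(text, chunk_size=200, low_overlap=20, high_overlap=60, threshold=0.15):
    """关键信息密度高的区域使用更大的重叠,降低边界信息丢失的风险"""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunk_words = words[start:start + chunk_size]
        chunks.append(" ".join(chunk_words))
        # 根据当前块尾部的关键信息密度,决定与下一块的重叠量
        tail = " ".join(chunk_words[-50:])
        overlap = high_overlap if keyword_density(tail) >= threshold else low_overlap
        start += max(chunk_size - overlap, 1)
    return chunks

if __name__ == "__main__":
    text = ("Revenue grew 23% in Q3 2024. " * 20) + ("the team met to discuss future plans. " * 40)
    result = dynamic_overlap_chunking(text, chunk_size=80, low_overlap=10, high_overlap=30)
    print(f"生成块数: {len(result)}")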
3. 评估指标
- 信息完整性:关键概念是否被分割(简单的检查示意见下)
- 检索精度:分块对检索任务的影响
- 处理效率:分块速度和内存使用
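以信息完整性为例,下面是一个最小检查示意(term_integrity 为假设的函数名,关键术语列表也是人为给定的):统计分块后仍完整出现在至少一个块中的关键术语比例,比例越低,说明越多概念被切断在块边界上。
python
def term_integrity(chunks, key_terms):
    """关键术语只要完整出现在任意一个块中,即视为被完整保留"""
    if not key_terms:
        return 1.0
    kept = [t for t in key_terms if any(t.lower() in c.lower() for c in chunks)]
    return len(kept) / len(key_terms)

if __name__ == "__main__":
    # 假设的分块结果:第二个术语恰好被切断在两个块的边界上
    chunks = [
        "Deep learning uses neural netw",
        "orks with many layers for representation learning.",
    ]
    key_terms = ["deep learning", "neural networks", "representation learning"]
    print(f"关键术语完整保留率: {term_integrity(chunks, key_terms):.0%}")  # 预期约 67%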
4. 实际应用注意事项
- 处理多语言文本
- 处理特殊格式(代码、公式),避免在代码块或公式中间切断(保护代码块的示意见下)
- 考虑隐私和安全需求
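针对上面提到的特殊格式,下面给出保护 Markdown 代码块不被切断的一个简化示意(正则与 split_protecting_code 均为演示用的假设写法):先把围栏代码块整体抽出、单独成块,再对剩余文本做常规分块。
python
import re

CODE_BLOCK_RE = re.compile(r"```.*?```", re.DOTALL)  # 匹配 Markdown 围栏代码块

def split_protecting_code(text, chunk_size=100):
    """将 ``` 包裹的代码块单独成块,其余文本按固定词数分块"""
    chunks = []
    # 先抽取所有代码块,保证它们不会被从中间切断
    chunks.extend(CODE_BLOCK_RE.findall(text))
    # 对去掉代码块后的普通文本做简单的词级分块
    words = CODE_BLOCK_RE.sub(" ", text).split()
    for i in range(0, len(words), chunk_size):
        chunks.append(" ".join(words[i:i + chunk_size]))
    return chunks

if __name__ == "__main__":
    doc = ("Some intro text about the function. "
           "```python\nprint('hello world')\n```\n"
           "And some follow-up explanation. ") * 3
    for i, c in enumerate(split_protecting_code(doc, chunk_size=20), 1):
        print(f"块 {i}: {c[:40]!r}")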
八、未来发展趋势
- 自适应分块:根据内容和任务动态调整分块策略
- 跨文档分块:考虑多个文档间的关联性
- 多模态分块:处理文本、图像、表格混合内容
- 增量式分块:流式数据处理中的实时分块(简化示意见下)
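以增量式分块为例,下面是一个面向流式文本的简化示意(StreamingChunker 及其接口均为假设):不断接收新到达的文本片段,按句号切出完整句子并缓冲,凑满目标词数后输出一个块。
python
class StreamingChunker:
    """增量式分块的简化示意:按句号切分句子,累积到目标词数后输出一个块"""

    def __init__(self, target_words=50):
        self.target_words = target_words
        self.pending_sentences = []  # 尚未凑满一个块的完整句子
        self.partial = ""            # 尾部可能不完整的句子片段

    def feed(self, piece):
        """接收新到达的文本片段,返回本次可以输出的块列表"""
        text = self.partial + piece
        parts = text.split(". ")
        # 最后一段可能是不完整的句子,留到下一次 feed 再处理
        self.partial = parts.pop()
        self.pending_sentences.extend(p + "." for p in parts if p.strip())

        chunks, current, count = [], [], 0
        for sent in self.pending_sentences:
            current.append(sent)
            count += len(sent.split())
            if count >= self.target_words:
                chunks.append(" ".join(current))
                current, count = [], 0
        self.pending_sentences = current  # 剩余句子继续缓冲
        return chunks

    def flush(self):
        """流结束时输出剩余内容"""
        rest = " ".join(self.pending_sentences + ([self.partial] if self.partial.strip() else []))
        self.pending_sentences, self.partial = [], ""
        return [rest] if rest.strip() else []

if __name__ == "__main__":
    chunker = StreamingChunker(target_words=20)
    stream = ["This is the first sentence. " * 10, "More text arrives in a later batch. " * 10]
    out = []
    for piece in stream:
        out.extend(chunker.feed(piece))
    out.extend(chunker.flush())
    print(f"共输出 {len(out)} 个块")
真实的流式场景还需要处理乱序、多语言断句和时间窗口等问题,这里仅保留最核心的缓冲与边界判断逻辑。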
总结
文本分块是NLP预处理的关键步骤,没有"一刀切"的最佳策略。选择合适的分块策略需要综合考虑:
- 文档特性
- 任务需求
- 资源约束
- 性能目标
建议从简单策略开始,逐步迭代优化,最终可能采用混合策略以达到最佳效果。在实际应用中,持续监控和评估分块策略对下游任务的影响至关重要。