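I. Parsing Every Document Format

The customer's data breaks down as follows:

| Format | Count | Share | Main difficulty |
|------|------|------|---------|
| PDF (text-based) | 42,000 | 35% | Table extraction, multi-column layout |
| PDF (scanned) | 18,000 | 15% | OCR accuracy, layout reconstruction |
| Word (.docx) | 31,000 | 26% | Embedded objects, complex styles |
| Excel | 15,000 | 12% | Multiple sheets, merged cells |
| PPT | 8,000 | 7% | Text-box order |
| Plain text/HTML | 6,000 | 5% | Format cleanup |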

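1.1 Text-based PDFs: PDFPlumber + coordinate-based reconstruction

I started with PyPDF2. The code was short, but the results were very poor: on a two-column PDF, the output put the first line of the left column next to the first line of the right column, then the second line of the left column, so the reading order was completely scrambled.

PDFPlumber exposes the coordinates of every character, and the core idea is to use those coordinates to restore the reading order: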

```python
import pdfplumber

def parse_pdf_with_layout(pdf_path: str) -> str:
    text_blocks = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Page dimensions (not used below, but handy for column detection)
            page_width = page.width
            page_height = page.height

            # Extract every word together with its position
            words = page.extract_words(
                x_tolerance=3,           # horizontal merge tolerance
                y_tolerance=3,           # vertical merge tolerance
                keep_blank_chars=False,
                use_text_flow=True       # follow the PDF's text-flow order
            )

            # Sort by y coordinate (top to bottom), then by x
            words_sorted = sorted(words, key=lambda w: (w['top'], w['x0']))

            # Group words into lines
            current_line = []
            current_top = None
            for w in words_sorted:
                if current_top is None or abs(w['top'] - current_top) < 5:
                    current_line.append(w['text'])
                else:
                    text_blocks.append(' '.join(current_line))
                    current_line = [w['text']]
                # Track the latest word's top; if it stayed None, every word
                # would fall into the first branch and the page would be one line.
                current_top = w['top']
            if current_line:
                text_blocks.append(' '.join(current_line))
    return '\n'.join(text_blocks)
```

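x_tolerance and y_tolerance are the two key parameters. The defaults are small (3 in current pdfplumber releases), and headings set in a large font may need a value of up to 5, otherwise a single word gets split into separate characters. The tuning signal: check whether the parsed text contains words broken up by spaces, such as "今 天 天 气 不 错" (a six-character phrase that should contain no spaces). If it does, raise x_tolerance.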
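
One caveat worth illustrating: sorting every word on a page by (top, x0), as parse_pdf_with_layout does, still interleaves the two columns of a double-column page. The sketch below is one possible way to restore column order, not the code used in this project. It assumes exactly two columns split at the page midline and no full-width headings; `order_two_columns` and `line_tol` are hypothetical names. The input dicts use the `text`, `x0`, `x1` and `top` keys that `extract_words()` returns.

```python
from typing import Dict, List

def order_two_columns(words: List[Dict], page_width: float,
                      line_tol: float = 5.0) -> List[str]:
    """Return text lines in reading order for a simple two-column page."""
    mid = page_width / 2
    # Assign each word to a column by the horizontal centre of its box.
    left = [w for w in words if (w["x0"] + w["x1"]) / 2 < mid]
    right = [w for w in words if (w["x0"] + w["x1"]) / 2 >= mid]

    def to_lines(column: List[Dict]) -> List[str]:
        lines: List[List[Dict]] = []
        # Walk top to bottom; a word joins the current line when its top
        # is within line_tol of the previous word's top.
        for w in sorted(column, key=lambda w: w["top"]):
            if lines and abs(w["top"] - lines[-1][-1]["top"]) < line_tol:
                lines[-1].append(w)
            else:
                lines.append([w])
        # Inside a line, order words left to right.
        return [" ".join(x["text"] for x in sorted(line, key=lambda x: x["x0"]))
                for line in lines]

    # Read the whole left column first, then the right column.
    return to_lines(left) + to_lines(right)
```

Real pages need a better column detector than a fixed midline (for example, looking for a vertical band with no words), but the read-left-column-then-right structure stays the same.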

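**Tables**: PDFPlumber's extract_table() works well on ruled tables but is essentially useless on borderless ones (the kind aligned with spaces). My approach is to try extract_table() first and, if the returned table has fewer than 3 rows, fall back to extract_words() and parse the layout from coordinates myself: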

```python
from typing import List

import numpy as np
from sklearn.cluster import DBSCAN

def extract_table_smart(page) -> List[List[str]]:
    table = page.extract_table({
        'vertical_strategy': 'lines',     # ruled tables: use the drawn lines
        'horizontal_strategy': 'lines'
    })
    # Accept the result only if it has at least 3 rows
    if table and len(table) >= 3:
        return table

    # Borderless table: infer column boundaries from word coordinates
    words = page.extract_words()
    if not words:
        return []

    # Cluster the x coordinates to find column boundaries
    x_coords = sorted(set(w['x0'] for w in words))
    x_clusters = DBSCAN(eps=10, min_samples=5).fit(
        np.array(x_coords).reshape(-1, 1)
    )
    # ... split each line's text according to the clustering result
```

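This approach covers more than 90% of tables. The remaining complex ones (under 10%) are handled manually or converted to images and stored as attachments.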
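
The row-splitting step is elided in extract_table_smart, so here is a minimal, dependency-free sketch of how it might look. It is not the project's implementation: it swaps DBSCAN for a simple one-dimensional gap grouping (neighbouring x0 values within `eps` form a group, and groups with at least `min_samples` members become columns). `find_column_starts` and `words_to_rows` are hypothetical names, and the sketch assumes left-aligned columns.

```python
from bisect import bisect_right
from typing import Dict, List

def find_column_starts(words: List[Dict], eps: float = 10.0,
                       min_samples: int = 2) -> List[float]:
    """Group sorted x0 values whose neighbours lie within eps; keep dense groups."""
    groups: List[List[float]] = []
    for x in sorted(w["x0"] for w in words):
        if groups and x - groups[-1][-1] <= eps:
            groups[-1].append(x)
        else:
            groups.append([x])
    # The smallest x0 of each dense group marks where a column starts.
    return [g[0] for g in groups if len(g) >= min_samples]

def words_to_rows(words: List[Dict], eps: float = 10.0,
                  line_tol: float = 3.0) -> List[List[str]]:
    """Turn positioned words into a list of rows, one cell per detected column."""
    starts = find_column_starts(words, eps)
    if not starts:
        return []
    # Group words into rows by their top coordinate.
    rows: List[List[Dict]] = []
    for w in sorted(words, key=lambda w: w["top"]):
        if rows and abs(w["top"] - rows[-1][0]["top"]) < line_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    table = []
    for row in rows:
        cells: List[List[str]] = [[] for _ in starts]
        for w in sorted(row, key=lambda w: w["x0"]):
            # A word belongs to the last column that starts at or before its x0.
            col = max(bisect_right(starts, w["x0"]) - 1, 0)
            cells[col].append(w["text"])
        table.append([" ".join(c) for c in cells])
    return table
```

Because a word that does not sit on a dense x position is attached to the nearest column on its left, multi-word cells stay together as long as enough rows share the same starting x.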

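1.2 Scanned PDFs: PaddleOCR + layout analysis

A scanned PDF is essentially a collection of images. I tested three engines: Tesseract, EasyOCR and PaddleOCR.

| Engine | Accuracy | Layout handling | (column header lost in source) |
|---------|-----------|---------|------------|
| Tesseract (Chinese-trained) | 72% | Poor | 0.5 |
| EasyOCR | 81% | Fair | 0.2 |
| PaddleOCR | 93% | Good | 0.3 |

PaddleOCR won. The key is its layout-analysis model, PP-Structure, which distinguishes four kinds of region: body text, headings, tables and images: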

```python
from paddleocr import PaddleOCR

ocr = PaddleOCR(
    use_angle_cls=True,      # enable text-direction classification
    lang='ch',
    use_gpu=True,
    show_log=False,
    det_db_thresh=0.3,       # detection threshold
    det_db_box_thresh=0.5,
    rec_batch_num=6          # recognise in batches for speed
)

def ocr_with_layout(image_path: str) -> dict:
    result = ocr.ocr(image_path, cls=True)
    # result[0] structure: [[box, (text, confidence)], ...]
    texts = []
    for line in (result[0] or []):   # guard: result[0] can be None on an empty page
        text = line[1][0]
        confidence = line[1][1]
        if confidence > 0.5:         # drop low-confidence lines
            texts.append(text)
    return {
        'text': '\n'.join(texts),
        'raw_result': result
    }
```

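The 0.5 confidence threshold came from repeated trial and error. Set it too high and blurry but correct text is lost; set it too low and a pile of garbage gets in.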
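
To make that trade-off concrete, here is a small sketch of how one might sweep the threshold over OCR output and count what survives. The sample lines are invented for illustration and `keep_confident` / `sweep_thresholds` are hypothetical helpers; the only thing taken from the code above is the `[box, (text, confidence)]` line shape.

```python
from typing import Dict, Iterable, List, Tuple

OcrLine = Tuple[object, Tuple[str, float]]  # (box, (text, confidence))

def keep_confident(lines: Iterable[OcrLine], threshold: float = 0.5) -> List[str]:
    """Return the text of every line whose confidence exceeds the threshold."""
    return [text for _box, (text, conf) in lines if conf > threshold]

def sweep_thresholds(lines: List[OcrLine],
                     thresholds: Iterable[float]) -> Dict[float, int]:
    """Count how many lines survive at each candidate threshold."""
    return {t: len(keep_confident(lines, t)) for t in thresholds}

# Invented sample: two clean lines, one blurry-but-correct line, one garbage line.
sample_lines: List[OcrLine] = [
    (None, ("Model MC-2023", 0.98)),
    (None, ("Rated power 7.5kW", 0.62)),
    (None, ("Operating temp", 0.48)),
    (None, ("#@%", 0.21)),
]

for threshold, kept in sweep_thresholds(sample_lines, [0.3, 0.5, 0.7]).items():
    print(f"threshold={threshold}: {kept} lines kept")
```

Running a sweep like this on a hand-labelled sample of pages shows where correct lines start to disappear and where garbage starts to leak in.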

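**Performance**: there are 18,000 scanned PDFs averaging 15 pages each, and a single machine would need about 45 days to get through them. In the end I distributed the work with Ray across 6 machines: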

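```python
import ray

ray.init()  # start Ray locally; pass address="auto" to join a running cluster

@ray.remote(num_gpus=0.5)
def process_pdf_remote(pdf_path):
    # Each task handles one PDF. process_scanned_pdf is the single-PDF
    # OCR routine (not shown in this post).
    return process_scanned_pdf(pdf_path)

# Distributed scheduling; pdf_list holds the paths of the scanned PDFs
futures = [process_pdf_remote.remote(p) for p in pdf_list]
results = ray.get(futures)
```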

With 6 machines, the full run actually took 7 days.
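
As a quick sanity check on those numbers, the arithmetic can be written out. This is only a back-of-the-envelope sketch using the figures quoted above; it assumes perfectly linear scaling, which real clusters rarely achieve exactly.

```python
# Figures quoted in the text
docs = 18_000
pages_per_doc = 15
single_machine_days = 45
machines = 6

# Total workload in pages
pages = docs * pages_per_doc

# Implied single-machine cost per page, in seconds
seconds_per_page = single_machine_days * 86_400 / pages

# Ideal wall-clock time if the work splits evenly across machines
ideal_days = single_machine_days / machines

print(f"{pages} pages, {seconds_per_page:.1f} s/page, ideal {ideal_days} days")
```

The ideal figure of 7.5 days is in line with the 7 days observed, given that the 45-day single-machine number was itself an estimate.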

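1.3 Word documents: python-docx + unpacking the archive

The main pitfall with Word is that embedded objects (embedded Excel charts, Visio diagrams) cannot be read directly through python-docx. A .docx file is a ZIP archive, though, so embedded workbooks can be read straight out of it.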

```python
import io
import zipfile

import pandas as pd
from docx import Document

def parse_docx(docx_path: str) -> dict:
    doc = Document(docx_path)

    # 1. Body paragraphs: keep the style name alongside the text,
    #    because to_markdown() reads p['style'] and p['text']
    paragraphs = [
        {'text': p.text, 'style': p.style.name}
        for p in doc.paragraphs if p.text.strip()
    ]

    # 2. Tables
    tables = []
    for table in doc.tables:
        table_data = []
        for row in table.rows:
            row_data = [cell.text.strip() for cell in row.cells]
            table_data.append(row_data)
        tables.append(table_data)

    # 3. Embedded objects (open the .docx as a ZIP archive)
    embedded_texts = []
    with zipfile.ZipFile(docx_path, 'r') as zf:
        for name in zf.namelist():
            # Embedded objects live under word/embeddings/
            if name.startswith('word/embeddings/') and name.endswith('.xlsx'):
                # Read the embedded workbook into memory and parse it with pandas
                excel_data = zf.read(name)
                df = pd.read_excel(io.BytesIO(excel_data), sheet_name=None)
                for sheet_name, sheet_df in df.items():
                    embedded_texts.append(
                        f"Table {sheet_name}:\n{sheet_df.to_string()}"
                    )

    return {
        'paragraphs': paragraphs,
        'tables': tables,
        'embedded': embedded_texts
    }
```

**Style information**: the heading level comes from p.style.name and is used later for semantic chunking:

```python
for p in doc.paragraphs:
    style = p.style.name
    if 'Heading 1' in style:
        level = 1
    elif 'Heading 2' in style:
        level = 2
    ...
```
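
The if/elif chain above can be collapsed into a single pattern match. The following is a minimal sketch, assuming the built-in style names have the form "Heading N"; `heading_level` is a hypothetical helper, not part of python-docx.

```python
import re
from typing import Optional

# Matches built-in heading style names such as "Heading 1" or "Heading 3".
_HEADING_RE = re.compile(r"^Heading (\d+)$")

def heading_level(style_name: str) -> Optional[int]:
    """Return the heading level for a style name, or None for non-heading styles."""
    match = _HEADING_RE.match(style_name.strip())
    return int(match.group(1)) if match else None
```

It would be called as `heading_level(p.style.name)` inside the paragraph loop. Matching the whole name, rather than testing with `in`, keeps styles whose names merely contain "Heading 1" from being treated as level-1 headings.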

1.4 Excel: multiple sheets + merged cells

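```python
import pandas as pd

def parse_excel(excel_path: str) -> dict:
    all_sheets = {}
    xl = pd.ExcelFile(excel_path)
    for sheet_name in xl.sheet_names:
        # pandas keeps a merged cell's value only in its top-left cell;
        # the rest of the merged range comes back as NaN.
        df = xl.parse(sheet_name, header=None)

        # Drop fully empty rows and columns first. After a forward fill
        # they would no longer be empty and could not be detected.
        df = df.dropna(how='all')
        df = df.dropna(how='all', axis=1)

        # Forward-fill merged cells (NaN/NaT take the previous value).
        # Note that this also fills cells that were genuinely empty.
        df = df.ffill(axis=0)
        df = df.ffill(axis=1)

        # Convert to text, keeping a hint of the row/column structure
        text_parts = [f"=== {sheet_name} ==="]
        for _, row in df.iterrows():
            row_text = ' | '.join(str(c) if pd.notna(c) else '' for c in row)
            text_parts.append(row_text)
        all_sheets[sheet_name] = '\n'.join(text_parts)
    return all_sheets
```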
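
A blanket forward fill cannot tell a merged cell from a cell that was simply left empty. If the merged ranges themselves are known, only those ranges need filling. The sketch below shows that idea on a plain list-of-lists grid. It is an illustration rather than the code used here: `fill_merged` and the 0-based inclusive `(first_row, first_col, last_row, last_col)` tuples are assumptions made for the example. With openpyxl, the ranges would typically come from the worksheet's `merged_cells.ranges` (which use 1-based bounds), but that wiring is left out.

```python
from typing import Any, List, Tuple

# (first_row, first_col, last_row, last_col), 0-based and inclusive
Range = Tuple[int, int, int, int]

def fill_merged(grid: List[List[Any]], merged: List[Range]) -> List[List[Any]]:
    """Copy each merged range's top-left value into every cell of that range."""
    out = [row[:] for row in grid]          # leave the input untouched
    for r0, c0, r1, c1 in merged:
        value = grid[r0][c0]
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                out[r][c] = value
    return out

# "Device" spans two columns in row 0; "MC-2023" spans two rows in column 0.
grid = [
    ["Device", None, "Spec"],
    ["MC-2023", "Power", "7.5kW"],
    [None, "Temp", None],
]
filled = fill_merged(grid, [(0, 0, 0, 1), (1, 0, 2, 0)])
```

The empty cell at the bottom right stays empty because it is not part of any merged range, which is exactly the case a blanket forward fill gets wrong.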

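1.5 Unified output format: why Markdown

Once every format is parsed, the output is normalised to Markdown. The reasons:

- **Heading hierarchy**: `# ## ###` express document structure directly, which the chunker can exploit
- **Table syntax**: `| col1 | col2 |` is easier for an LLM to interpret than a plain-text table
- **Code blocks**: code and commands in technical documents are preserved
- **LLM-friendly**: training corpora contain a lot of Markdown, so models are familiar with the format

Conversion example: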

```python
def to_markdown(parsed: dict, source_type: str) -> str:
    # is_title, detect_heading_level, clean_title and table_to_markdown
    # are helper functions that are not shown in this post.
    md_parts = []
    if source_type == 'pdf':
        # Infer the heading level from how each title line looks
        for line in parsed['text'].split('\n'):
            if is_title(line):
                level = detect_heading_level(line)
                md_parts.append(f"{'#' * level} {clean_title(line)}")
            else:
                md_parts.append(line)
    elif source_type == 'docx':
        for p in parsed['paragraphs']:
            if p.get('style', '').startswith('Heading'):
                # "Heading 2" -> 2
                level = int(p['style'].split()[-1])
                md_parts.append(f"{'#' * level} {p['text']}")
            else:
                md_parts.append(p['text'])
        # Convert tables to Markdown
        for table in parsed['tables']:
            md_parts.append(table_to_markdown(table))
    return '\n\n'.join(md_parts)
```
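
to_markdown calls table_to_markdown without showing it. A minimal sketch of what such a helper could look like follows. It is an assumption rather than the original implementation: it treats the first row as the header, pads short rows, and escapes pipe characters inside cells.

```python
from typing import Any, List

def table_to_markdown(table: List[List[Any]]) -> str:
    """Render a list of rows as a Markdown pipe table; row 0 is the header."""
    if not table:
        return ""
    width = max(len(row) for row in table)

    def fmt(row: List[Any]) -> str:
        cells = [
            ("" if c is None else str(c)).replace("|", "\\|").replace("\n", " ").strip()
            for c in row
        ]
        cells += [""] * (width - len(cells))   # pad short rows
        return "| " + " | ".join(cells) + " |"

    header, *body = table
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)

print(table_to_markdown([["Param", "Value"], ["Rated power", "7.5kW"]]))
```

Both python-docx tables (lists of cell strings) and pdfplumber tables (which can contain None for empty cells) fit this input shape.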

II. Iterating on the Chunking Strategy

2.1 Version 1: fixed-length chunking (failed)

The initial approach:

```python
def fixed_chunk(text: str, chunk_size=512, overlap=50):
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = ' '.join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks
```

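**How it failed**:

- A technical specification, "Device model MC-2023, rated power 7.5 kW, operating temperature range -10℃ to 60℃", was cut in half
- A search for "rated power" could only match the chunk holding the second half, while the device model sat in the first half, so recall failed
- 20% of the test queries missed results for this reason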

2.2 Version 2: splitting on headings (partial success)

This version uses Markdown's heading hierarchy:

```python
import re
from typing import Dict, List

def split_by_headers(text: str) -> List[Dict]:
    # Split on #, ##, ### ... headings
    sections = []
    current_header = None
    current_content = []
    for line in text.split('\n'):
        if re.match(r'^#{1,6}\s+', line):
            # New heading: save the previous section
            if current_content:
                sections.append({
                    'header': current_header,
                    'content': '\n'.join(current_content)
                })
            current_header = line
            current_content = []
        else:
            current_content.append(line)
    if current_content:
        sections.append({
            'header': current_header,
            'content': '\n'.join(current_content)
        })
    return sections
```

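**Problem**: many documents have no heading structure (data exported from Excel, plain text from scanned PDFs), so splitting on headings does nothing for them.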

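2.3 Final version: semantic chunking + parent-child documents

**Core logic**:

1. Make a first cut at sentence boundaries
2. Use semantic similarity to decide whether adjacent sentences merge
3. Build a two-level parent/child chunk structure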
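
The first step, cutting at sentence boundaries, can be sketched on its own. This is a minimal example assuming Python 3.7 or later (where `re.split` splits on zero-width matches); `split_sentences` is a hypothetical standalone name. The ASCII full stop is deliberately left out of the boundary set so that values like "7.5kW" are not cut.

```python
import re
from typing import List

# Split after Chinese or ASCII sentence-ending punctuation, or after a newline.
_SENTENCE_END = re.compile(r'(?<=[。！？!?\n])\s*')

def split_sentences(text: str) -> List[str]:
    """Split text into sentences, keeping the closing punctuation on each one."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]

sample = "设备型号MC-2023。额定功率7.5kW！\n工作温度范围-10℃~60℃"
print(split_sentences(sample))
```

Keeping the punctuation attached to each sentence means the pieces can later be joined back together without inserting separators.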

```python
import re
from typing import Dict, List

import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticChunker:
    def __init__(self, min_size=200, max_size=800, overlap=50):
        self.min_size = min_size
        self.max_size = max_size
        self.overlap = overlap
        self.encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

    def split(self, text: str, title: str = '') -> List[Dict]:
        # 1. If the document has heading structure, split on headings first
        #    (split_by_headers is the function from section 2.2)
        if title:
            sections = split_by_headers(text)
        else:
            sections = [{'header': '', 'content': text}]

        # 2. Split each section into sentences (Chinese sentence boundaries),
        # 3. then merge the sentences semantically
        chunks = []
        for section in sections:
            sentences = self._split_sentences(section['content'])
            for c in self._semantic_merge(sentences):
                chunks.append({
                    'content': c,
                    'header': section['header'],
                    'type': 'parent' if len(c) > 500 else 'child'
                })
        return chunks

    def _split_sentences(self, text: str) -> List[str]:
        # Sentence boundaries: full stop, question mark, exclamation mark, newline
        pattern = r'(?<=[。！？!?\n])\s*'
        sents = re.split(pattern, text)
        return [s.strip() for s in sents if s.strip()]

    def _semantic_merge(self, sentences: List[str]) -> List[str]:
        if not sentences:
            return []
        if len(sentences) == 1:
            return sentences

        # Embed every sentence
        embs = self.encoder.encode(sentences)

        # Greedy merge: adjacent sentences with high similarity are joined
        chunks = []
        current_chunk = [sentences[0]]
        current_len = len(sentences[0])
        for i in range(1, len(sentences)):
            # Similarity between this sentence and the previous one
            sim = self._cosine_similarity(embs[i - 1], embs[i])
            new_len = current_len + len(sentences[i])

            # Decide whether to merge
            if sim > 0.7 and new_len < self.max_size:
                current_chunk.append(sentences[i])
                current_len = new_len
            else:
                # If the current chunk is still too small, merge anyway
                if current_len < self.min_size and i < len(sentences) - 1:
                    current_chunk.append(sentences[i])
                    current_len = new_len
                else:
                    chunks.append(''.join(current_chunk))
                    current_chunk = [sentences[i]]
                    current_len = len(sentences[i])
        if current_chunk:
            chunks.append(''.join(current_chunk))
        return chunks

    @staticmethod
    def _cosine_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```
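
The greedy merge in _semantic_merge is easier to follow on toy data. Below is a stripped-down sketch that takes precomputed vectors instead of calling the encoder and leaves out the min_size forced merge, so it is a simplification of the class above rather than a replacement. `greedy_merge` and the two-dimensional "embeddings" are made up for illustration.

```python
import math
from typing import List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def greedy_merge(sentences: List[str], embs: List[Sequence[float]],
                 sim_threshold: float = 0.7, max_size: int = 800) -> List[str]:
    """Join each sentence to the running chunk while it resembles its predecessor."""
    if not sentences:
        return []
    chunks: List[str] = []
    current = [sentences[0]]
    current_len = len(sentences[0])
    for i in range(1, len(sentences)):
        sim = cosine(embs[i - 1], embs[i])
        new_len = current_len + len(sentences[i])
        if sim > sim_threshold and new_len < max_size:
            current.append(sentences[i])
            current_len = new_len
        else:
            # Topic shift (or size limit): close the chunk and start a new one.
            chunks.append("".join(current))
            current = [sentences[i]]
            current_len = len(sentences[i])
    chunks.append("".join(current))
    return chunks

# Two sentences about the same topic, then a topic change.
toy_sentences = ["A1.", "A2.", "B1."]
toy_embs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
print(greedy_merge(toy_sentences, toy_embs))
```

Only adjacent sentences are compared, so the algorithm is linear in the number of sentences once the embeddings are available.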

**Parent-child structure**:

```python
from typing import Dict, List

def build_parent_child(chunks: List[Dict]) -> List[Dict]:
    """
    Merge runs of adjacent small chunks into a parent; keep the children for retrieval.
    """
    result = []
    i = 0
    while i < len(chunks):
        chunk = chunks[i]

        # A chunk that is already parent-sized is kept as is
        if chunk['type'] == 'parent':
            result.append({
                'id': f"parent_{i}",
                'content': chunk['content'],
                'is_parent': True,
                'children': []
            })
            i += 1
            continue

        # Collect consecutive children and merge them into one parent
        start = i
        children = []
        current_parent_content = []
        while (i < len(chunks) and chunks[i]['type'] != 'parent'
               and len(''.join(current_parent_content)) < 1000):
            # Incoming chunks carry no 'id' field, so derive one from the index
            children.append({'id': f"child_{i}", 'content': chunks[i]['content']})
            current_parent_content.append(chunks[i]['content'])
            i += 1
        parent_content = ''.join(current_parent_content)
        parent_id = f"parent_{start}"

        # Link every child to its parent
        for child in children:
            result.append({
                'id': child['id'],
                'content': child['content'],
                'is_parent': False,
                'parent_id': parent_id
            })

        # The parent is stored separately
        result.append({
            'id': parent_id,
            'content': parent_content,
            'is_parent': True,
            'children': [c['id'] for c in children]
        })
    return result
```
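
How the two levels are used at query time is not spelled out here. A common pattern, shown below as an assumption rather than a description of this system, is to search over the small child chunks and then hand the LLM the larger parent each hit belongs to. `expand_to_parents` is a hypothetical helper that works on records shaped like the output of build_parent_child (`id`, `content`, `is_parent`, `parent_id`).

```python
from typing import Dict, List

def expand_to_parents(hit_ids: List[str], records: List[Dict]) -> List[str]:
    """Map retrieved chunk ids to their parents' content, de-duplicated, in hit order."""
    by_id = {r["id"]: r for r in records}
    seen = set()
    contexts: List[str] = []
    for hit_id in hit_ids:
        record = by_id[hit_id]
        parent_id = record["id"] if record["is_parent"] else record["parent_id"]
        if parent_id not in seen:
            seen.add(parent_id)
            contexts.append(by_id[parent_id]["content"])
    return contexts

# Hand-written records in the same shape build_parent_child produces.
records = [
    {"id": "child_0", "content": "Model MC-2023.", "is_parent": False, "parent_id": "parent_0"},
    {"id": "child_1", "content": "Rated power 7.5kW.", "is_parent": False, "parent_id": "parent_0"},
    {"id": "parent_0", "content": "Model MC-2023.Rated power 7.5kW.", "is_parent": True,
     "children": ["child_0", "child_1"]},
    {"id": "parent_2", "content": "A long standalone section.", "is_parent": True, "children": []},
]
print(expand_to_parents(["child_1", "child_0", "parent_2"], records))
```

With this pattern, a query for "rated power" that hits child_1 still returns the parent text containing the device model, which is the failure the fixed-length version ran into.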

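2.4 Tuning the chunking parameters

I ran a parameter search on a 500-item test set:

| min_size | max_size | overlap | Recall |
|----------|----------|---------|-------|
| 100 | 500 | 50 | 0.82 |
| 150 | 600 | 50 | 0.84 |
| 200 | 800 | 50 | 0.85 |
| 250 | 1000 | 50 | 0.83 |
| 200 | 800 | 80 | 0.85 |
| 200 | 800 | 30 | 0.84 |

Final choice: min_size=200, max_size=800, overlap=50. This combination reaches the best recall in the search (0.85, tied with overlap=80), and the resulting chunk size suits the downstream reranker (no more than 512 tokens).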

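III. Deduplication and Quality Filtering

3.1 Near-duplicate removal

Enterprise documents contain a lot of repeated content (the same document stored separately by different departments). I used MinHash for near-duplicate detection: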

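```python
from typing import List

from datasketch import MinHash, MinHashLSH

def _minhash(chunk: str, num_perm: int = 128) -> MinHash:
    # Build the MinHash signature of one chunk.
    # NOTE: whitespace tokens are coarse for Chinese text; character
    # n-grams or jieba tokens are a common alternative.
    m = MinHash(num_perm=num_perm)
    for word in chunk.split():
        m.update(word.encode('utf8'))
    return m

def deduplicate_chunks(chunks: List[str], threshold=0.85) -> List[str]:
    lsh = MinHashLSH(threshold=threshold, num_perm=128)

    # Dedup in a single pass: keep the first occurrence, drop later
    # near-duplicates. Inserting every chunk into the index up front would
    # make each chunk match itself, and re-inserting a key raises an error.
    unique = []
    for i, chunk in enumerate(chunks):
        m = _minhash(chunk)
        # Query for similar chunks that were already kept
        similar = lsh.query(m)
        if not similar:
            unique.append(chunk)
            lsh.insert(f"doc_{i}", m)
    return unique
```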

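The threshold is 0.85, so chunks that are at least 85% similar count as duplicates. Across the 120,000 documents, deduplication cut the chunk count from 81 million to 73 million, a reduction of about 10%.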
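
MinHash estimates the Jaccard similarity between two token sets, so the choice of tokens matters. deduplicate_chunks tokenises with chunk.split(), and a Chinese sentence without spaces becomes a single token under that scheme. The sketch below computes exact Jaccard similarity over character 3-grams to show the alternative. It is an illustration, not the project's code, and `shingles` and `jaccard` are hypothetical helper names.

```python
from typing import Set

def shingles(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of text, ignoring whitespace."""
    compact = "".join(text.split())
    if len(compact) < n:
        return {compact} if compact else set()
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}

def jaccard(a: str, b: str, n: int = 3) -> float:
    """Exact Jaccard similarity of two texts over their character n-grams."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

original = "设备型号MC-2023，额定功率7.5kW，工作温度范围-10℃~60℃"
near_copy = original + "。"
unrelated = "本季度销售报告见附件，请各部门查收并反馈"

print(len(original.split()))               # number of whitespace tokens
print(jaccard(original, near_copy) > 0.85)
print(jaccard(original, unrelated) > 0.85)
```

To use the same idea with datasketch, each shingle would be fed to `m.update(shingle.encode('utf8'))` in place of the whitespace tokens.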

3.2 Quality filtering

Filtering out low-quality chunks:

```python
from collections import Counter

import jieba

def filter_quality(chunk: str) -> bool:
    # 1. Length filter
    if len(chunk) < 50:
        return False

    # 2. Character mix (stated rule: drop chunks whose Chinese/English/digit
    #    share is below 30%). As written, this check rejects chunks that are
    #    more than 90% ASCII.
    ascii_count = sum(1 for c in chunk if ord(c) < 128)
    if ascii_count / len(chunk) > 0.9:
        return False  # pure garbage or pure symbols

    # 3. Share of the single most frequent character
    counter = Counter(chunk)
    max_freq = max(counter.values())
    if max_freq / len(chunk) > 0.5:
        return False  # dominated by one repeated character

    # 4. Does it contain meaningful Chinese words?
    words = jieba.lcut(chunk)
    content_words = [w for w in words if len(w) > 1]
    if len(content_words) < 3:
        return False

    return True
```

This filter removed about 8% of chunks as low quality, mostly headers and footers, blank pages and garbled pages from scanned PDFs.
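
The character-ratio rule in filter_quality is described as dropping chunks where Chinese, English and digit characters make up less than 30% of the text, while the code tests the ASCII share instead. For comparison, here is a minimal sketch of the rule as described. `meaningful_ratio` is a hypothetical helper; it relies on `str.isalnum()`, which is true for CJK characters as well as Latin letters and digits.

```python
def meaningful_ratio(chunk: str) -> float:
    """Share of characters that are letters (including CJK) or digits."""
    if not chunk:
        return 0.0
    meaningful = sum(1 for c in chunk if c.isalnum())
    return meaningful / len(chunk)

def passes_char_ratio(chunk: str, min_ratio: float = 0.3) -> bool:
    """Keep a chunk only if enough of it is letters or digits."""
    return meaningful_ratio(chunk) >= min_ratio

print(passes_char_ratio("额定功率7.5kW"))
print(passes_char_ratio("----====||||...."))
```

Unlike the ASCII-share test, this version does not penalise chunks that are mostly English text or code.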

IV. The Complete Pipeline

```python
from typing import Dict, List

import pdfplumber

class DocumentProcessor:
    # The _parse_docx, _parse_excel, _parse_pptx, _parse_text, _ocr_pdf,
    # _parse_text_pdf and _to_markdown methods are not shown here.

    def __init__(self, config: dict):
        self.config = config
        self.chunker = SemanticChunker(
            min_size=config.get('min_chunk_size', 200),
            max_size=config.get('max_chunk_size', 800)
        )

    def process(self, file_path: str) -> List[Dict]:
        # 1. Detect the format
        ext = file_path.split('.')[-1].lower()

        # 2. Parse
        if ext == 'pdf':
            raw = self._parse_pdf(file_path)
        elif ext == 'docx':
            raw = self._parse_docx(file_path)
        elif ext in ['xlsx', 'xls']:
            raw = self._parse_excel(file_path)
        elif ext == 'pptx':
            raw = self._parse_pptx(file_path)
        else:
            raw = self._parse_text(file_path)

        # 3. Convert to Markdown
        md_content = self._to_markdown(raw, ext)

        # 4. Semantic chunking
        chunks = self.chunker.split(md_content)

        # 5. Parent-child structure
        final = build_parent_child(chunks)

        # 6. Quality filtering
        final = [c for c in final if filter_quality(c['content'])]
        return final

    def _parse_pdf(self, path: str) -> dict:
        # Decide whether this is a scanned PDF
        if self._is_scanned_pdf(path):
            return self._ocr_pdf(path)
        else:
            return self._parse_text_pdf(path)

    def _is_scanned_pdf(self, path: str) -> bool:
        # Quick check with pdfplumber: no text, or very little, on page 1
        with pdfplumber.open(path) as pdf:
            page = pdf.pages[0]
            text = page.extract_text()
            return not text or len(text.strip()) < 50
```
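
The extension check in process() can also be written as a lookup table, which keeps the format list in one place. The sketch below is an alternative under stated assumptions, not the pipeline's actual code: `PARSER_BY_EXT` and `detect_kind` are hypothetical names, and anything unrecognised falls back to the plain-text parser just as the else branch does.

```python
from pathlib import Path

# Map file suffixes to the parser family that handles them.
PARSER_BY_EXT = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".xlsx": "excel",
    ".xls": "excel",
    ".pptx": "pptx",
}

def detect_kind(file_path: str) -> str:
    """Return the parser family for a path, defaulting to plain text."""
    suffix = Path(file_path).suffix.lower()
    return PARSER_BY_EXT.get(suffix, "text")

print(detect_kind("reports/2023/spec.v2.PDF"))
```

Using `Path(...).suffix` looks only at the final path component, so a dot in a directory name cannot be mistaken for a file extension.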

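V. Statistics

Data distribution after processing roughly 80 million chunks:

| Metric | Value |
|------|------|
| Total documents | 124,000 |
| Total chunks (after deduplication) | 73 million |
| Average chunk length | 486 characters |
| Average chunks per document | 588 |
| Total processing time | 8 days (6 GPU machines) |
| Invalid chunks filtered out | 8.2% |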
