Document Parsing in Practice: A Guide to Cleaning and Extracting PDF, Word, and HTML

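① Setting Up the Parsing Environment and Installing the Core Libraries

Before writing any code, set up the environment. Use a virtual environment to isolate the project's dependencies so that library versions do not clash between projects.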

bash

# Create a virtual environment (Python 3.8+)
python -m venv doc_parser_env

# Activate it (Windows)
doc_parser_env\Scripts\activate
# Activate it (Mac/Linux)
source doc_parser_env/bin/activate

Then install, in one go, the core libraries used in the rest of this guide:

bash

pip install pdfplumber pypdf python-docx beautifulsoup4 lxml openpyxl pandas camelot-py tabula-py ftfy chardet

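Here is what each library is for:

pdfplumber: the workhorse for extracting text and tables from PDFs
pypdf: structural PDF operations (merge, split, rotate, encrypt, and so on)
python-docx: reading and writing Word (.docx) documents
beautifulsoup4: HTML parsing and tag stripping
lxml: a parser backend for BeautifulSoup that parses faster than the built-in one
openpyxl: reading and writing Excel files, used here to export table data
pandas: data processing, used together with table extraction
camelot / tabula-py: dedicated PDF table extraction
ftfy: repairing garbled (mojibake) text
chardet: detecting file encodings

Pitfall: camelot depends on Ghostscript, which Windows users have to download and install separately and add to the PATH environment variable. If you would rather skip that, start with tabula-py instead (it needs a Java runtime). The OCR and progress-bar examples later on also use pdf2image, pytesseract, and tqdm, which are not in the command above and have to be installed separately.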
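
After installing, it can help to confirm that every library is actually importable, since several pip package names differ from the module names you import. The following is a minimal sketch using only the standard library; the name-to-module mapping is written out by hand here and the function name `find_missing` is just an illustration.

```python
import importlib.util

# Maps the pip package name to the module name you actually import.
# This mapping is maintained by hand for the libraries listed above.
PACKAGES = {
    "pdfplumber": "pdfplumber",
    "pypdf": "pypdf",
    "python-docx": "docx",
    "beautifulsoup4": "bs4",
    "lxml": "lxml",
    "openpyxl": "openpyxl",
    "pandas": "pandas",
    "camelot-py": "camelot",
    "tabula-py": "tabula",
    "ftfy": "ftfy",
    "chardet": "chardet",
}

def find_missing(packages=PACKAGES):
    """Return the pip names whose import module cannot be found."""
    missing = []
    for pip_name, module_name in packages.items():
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(module_name) is None:
            missing.append(pip_name)
    return missing

missing = find_missing()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All core libraries are importable.")
```

This only checks that the modules can be located; it does not verify external dependencies such as Ghostscript or Java.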

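② A Quick Tour of the Three Document Formats

Before writing code, spend two minutes on how these three formats differ underneath; it will save you a lot of detours.

PDF (Portable Document Format): PDF was designed for fixed layout, so a page looks the same on whatever device opens it. The price is that the text inside a PDF is not necessarily stored in reading order; it may be scattered character fragments that are positioned by coordinates. This is why reading a PDF directly from code often yields text in the wrong order. PDFs also come in two kinds: text-based ones (exported from Word or similar, where the text can be selected) and scanned ones (which are really just images).

Word (.docx): a .docx file is essentially a ZIP archive containing a set of XML files. It is therefore structured by nature: paragraphs are paragraphs, tables are tables, headings are headings. Reading it with python-docx is comparatively tidy.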
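
To make the "ZIP archive of XML files" point concrete, here is a small sketch that reads paragraph text straight out of the container with the standard library, without python-docx. The helper names are illustrative, and `make_demo_docx` builds a stripped-down stand-in for a real file (a real .docx contains more parts than `word/document.xml`).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(source):
    """Read paragraph texts directly from the ZIP container.

    `source` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as zf:
        xml_bytes = zf.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # Each paragraph (<w:p>) stores its text in one or more <w:t> runs
        paragraphs.append("".join(t.text or "" for t in p.iter(W_NS + "t")))
    return paragraphs

def make_demo_docx():
    """Build a minimal stand-in for a .docx in memory (demo only)."""
    body = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body>'
        "<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t> world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", body)
    buf.seek(0)
    return buf

print(docx_paragraphs(make_demo_docx()))
```

Note that this walks every `<w:p>` element, including paragraphs nested inside table cells; python-docx separates those for you, which is one reason to prefer it in practice.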

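HTML: HTML is structured too, with nested tags expressing the hierarchy. But web pages mix in a lot of noise such as styles, scripts, and ads, so the goal here is to strip the tags and keep only the body text.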

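③ Extracting PDF Content and Cleaning Out the Noise

3.1 Extracting text with pdfplumber

pdfplumber is one of the most convenient libraries for extracting PDF text, and it copes well with Chinese documents.

python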

import pdfplumber
import re

def extract_pdf_text(pdf_path):
    full_text = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text()
            if text:
                full_text.append(text)
    return '\n'.join(full_text)

# Usage
text = extract_pdf_text('document.pdf')
print(text[:500])

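3.2 Cleaning common noise out of PDF text

Text extracted from a PDF usually carries clutter: page numbers, running headers and footers, end-of-line hyphenation, and stray line breaks. The cleaning function below handles page numbers, hyphenation, broken paragraphs, and extra whitespace:

python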

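import re

# A line that holds nothing but a page number, e.g. "- 1 -", "Page 1 of 10", "第1页".
# \u2013 and \u2014 are the en dash and the em dash.
PAGE_NUMBER_LINE = re.compile(
    r'^[ \t]*[-\u2013\u2014]*[ \t]*'
    r'(?:第[ \t]*\d+[ \t]*页|Page[ \t]*\d+(?:[ \t]*of[ \t]*\d+)?|\d+)'
    r'[ \t]*[-\u2013\u2014]*[ \t]*$',
    flags=re.I | re.M,
)
SENTENCE_END = ('。', '！', '？', '.', '!', '?')

def clean_pdf_text(text):
    # 1. Remove page numbers, but only when they stand alone on a line,
    #    so that digits inside ordinary sentences are left untouched
    text = PAGE_NUMBER_LINE.sub('', text)

    # 2. Repair hyphenation (a trailing '-' followed by a line break)
    text = re.sub(r'-\n', '', text)

    # 3. Merge paragraphs broken by line wraps: if the previous line does not
    #    end with sentence-ending punctuation, append the next line to it.
    #    Blank lines are kept as paragraph breaks.
    merged = []
    for line in text.split('\n'):
        if merged and merged[-1] and line and not merged[-1].endswith(SENTENCE_END):
            merged[-1] += line  # joined without a space, which suits Chinese text
        else:
            merged.append(line)
    text = '\n'.join(merged)

    # 4. Collapse extra whitespace
    text = re.sub(r' +', ' ', text)
    text = re.sub(r'\n{3,}', '\n\n', text)

    return text.strip()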
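
Running headers and footers are mentioned above but are hard to catch with a fixed regex, because their wording differs from document to document. One common heuristic, sketched here under the assumption that you keep the text of each page as a separate string, is to drop lines that recur on most pages. The function name and the `min_ratio` threshold are illustrative choices, not part of any library.

```python
from collections import Counter

def strip_repeated_lines(pages, min_ratio=0.6):
    """Remove lines that recur on most pages (likely headers or footers).

    pages: list of page texts, one string per page.
    min_ratio: a line is dropped when it appears on at least this
               fraction of the pages (an arbitrary starting value).
    """
    if len(pages) < 3:
        # Too few pages to tell a header from ordinary text
        return list(pages)

    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page
        counts.update({ln.strip() for ln in page.split("\n") if ln.strip()})

    threshold = min_ratio * len(pages)
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [ln for ln in page.split("\n") if ln.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned

pages = [
    "Annual Report 2023\nRevenue grew this year.\nConfidential",
    "Annual Report 2023\nCosts were stable.\nConfidential",
    "Annual Report 2023\nOutlook is positive.\nConfidential",
]
print(strip_repeated_lines(pages))
```

Headers that embed a changing page number (for example "Annual Report 2023 | 7") will not match exactly; stripping digits before counting is one way to extend the idea.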

3.3 Structural operations with pypdf

If all you need is merging, splitting, encrypting, and similar operations, pypdf is the better fit:

python

from pypdf import PdfReader, PdfWriter

# Merge several PDFs. Recent pypdf releases do this with PdfWriter.append();
# the older PdfMerger class was deprecated and later removed.
merger = PdfWriter()
merger.append('file1.pdf')
merger.append('file2.pdf')
merger.write('merged.pdf')
merger.close()

# Split a PDF (one file per page)
reader = PdfReader('big_file.pdf')
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f'page_{i+1}.pdf', 'wb') as f:
        writer.write(f)

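④ Reading Structured Data from Word Documents

4.1 Extracting paragraph text

python-docx is much simpler to use than the PDF tools, because Word documents are structured by nature:

python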

from docx import Document

def extract_docx_text(file_path):
    doc = Document(file_path)
    return '\n'.join([para.text for para in doc.paragraphs])

text = extract_docx_text('report.docx')

4.2 Extracting table data

Tables are often the most valuable structured data in a Word document:

python

def extract_docx_tables(file_path):
    doc = Document(file_path)
    all_tables = []
    for table in doc.tables:
        table_data = []
        for row in table.rows:
            row_data = [cell.text.strip() for cell in row.cells]
            table_data.append(row_data)
        all_tables.append(table_data)
    return all_tables

tables = extract_docx_tables('report.docx')
for i, table in enumerate(tables):
    print(f"Table {i+1}: {len(table)} rows")
    for row in table[:3]:  # preview the first 3 rows
        print(row)

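4.3 Handling merged cells

Merged cells are the biggest headache in Word table extraction. python-docx does not directly tell you which cells are merged; you have to work that out yourself. One simple approach, for tables where the continuation cells of a merged region come back empty, is to fill each empty cell with the nearest non-empty value to its left:

python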

def extract_table_with_merged(table):
    rows_data = []
    for row in table.rows:
        row_data = []
        for cell in row.cells:
            text = cell.text.strip()
            if text:
                row_data.append(text)
            elif row_data:
                # Empty cell: fill it with the nearest non-empty value to its left
                row_data.append(row_data[-1])
            else:
                row_data.append('')
        rows_data.append(row_data)
    return rows_data

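Note: this only handles simple horizontal merges, and it will also fill cells that are genuinely empty. For complex merges that span rows and columns, consider working with python-docx's underlying XML, or save the document as a PDF and process it with pdfplumber.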
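
With python-docx, a horizontally merged cell is typically returned once for every grid column it spans, so its text tends to show up repeated rather than empty. The sketch below collapses such repeats by comparing the underlying XML element of neighbouring cells. It relies on the private `_tc` attribute of python-docx cells, which is an assumption about library internals and may change; the demo uses stand-in objects so that it runs without a document.

```python
from types import SimpleNamespace

def unique_cell_texts(cells):
    """Return the cell texts of one row, skipping repeats of a merged cell.

    Assumption: each cell exposes a private `_tc` attribute (its underlying
    XML element), as python-docx cells do. Neighbouring entries that share
    the same `_tc` are treated as one merged cell.
    """
    texts = []
    previous_tc = None
    for cell in cells:
        tc = getattr(cell, "_tc", None)
        if tc is not None and tc is previous_tc:
            continue  # the same merged cell seen again: skip it
        texts.append(cell.text.strip())
        previous_tc = tc
    return texts

# Stand-in objects for demonstration; real code would pass `row.cells`
merged_tc, plain_tc = object(), object()
row_cells = [
    SimpleNamespace(text="Quarterly total", _tc=merged_tc),
    SimpleNamespace(text="Quarterly total", _tc=merged_tc),
    SimpleNamespace(text="42", _tc=plain_tc),
]
print(unique_cell_texts(row_cells))
```

Collapsing repeats changes the number of columns in that row, so keep the repeated values instead if downstream code expects a rectangular table.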

4.4 Handling the legacy .doc format

python-docx only handles .docx, so a .doc file has to be converted first. On Windows you can use pywin32 to drive a locally installed copy of Word:

python

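# Windows with Microsoft Word installed (pip install pywin32)
import os
import win32com.client as win32

def extract_doc_text(file_path):
    word = win32.gencache.EnsureDispatch('Word.Application')
    try:
        # Word resolves relative paths against its own working directory,
        # so pass an absolute path
        doc = word.Documents.Open(os.path.abspath(file_path))
        try:
            return doc.Content.Text
        finally:
            doc.Close()
    finally:
        # Quit Word even if opening or reading the document failed
        word.Quit()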

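A cross-platform option is to convert the file with LibreOffice, either through its own headless mode (soffice --headless --convert-to docx input.doc, assuming soffice is on your PATH) or through the separate unoconv wrapper around it:

bash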

unoconv -f docx input.doc
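
If you want to trigger the conversion from Python rather than from the shell, a thin `subprocess` wrapper is enough. This is a sketch that assumes LibreOffice is installed and its `soffice` executable is on PATH; the function names are illustrative.

```python
import shutil
import subprocess
from pathlib import Path

def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Build the LibreOffice headless command that converts a file to .docx."""
    return [
        soffice,
        "--headless",
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(doc_path),
    ]

def convert_doc_to_docx(doc_path, out_dir=None, timeout=120):
    """Convert a .doc file and return the expected path of the new .docx.

    Assumes LibreOffice is installed and `soffice` is on PATH.
    """
    doc_path = Path(doc_path)
    out_dir = Path(out_dir) if out_dir else doc_path.parent
    if shutil.which("soffice") is None:
        raise RuntimeError("soffice not found on PATH; install LibreOffice first")
    # check=True raises CalledProcessError if the conversion fails
    subprocess.run(
        build_convert_command(doc_path, out_dir),
        check=True,
        timeout=timeout,
    )
    return out_dir / (doc_path.stem + ".docx")

print(build_convert_command("input.doc", "converted"))
```

The returned path can then be passed to `extract_docx_text` from section 4.1.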

⑤ Stripping HTML Tags and Cleaning the Text

5.1 Extracting plain text with BeautifulSoup

The most common combination for HTML parsing is BeautifulSoup + lxml:

python

from bs4 import BeautifulSoup

def extract_html_text(html_content):
    soup = BeautifulSoup(html_content, 'lxml')
    
    # Remove script, style, and other tags that carry no body text
    for tag in soup(['script', 'style', 'head', 'meta', 'noscript']):
        tag.decompose()
    
    # Extract the text, separating blocks with newlines
    text = soup.get_text(separator='\n', strip=True)
    return text

# Read from a file
with open('page.html', 'r', encoding='utf-8') as f:
    html = f.read()
text = extract_html_text(html)

5.2 Further cleaning: removing extra blank lines and special characters

python

import re

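def clean_html_text(text):
    # Collapse runs of blank lines
    text = re.sub(r'\n\s*\n', '\n\n', text)
    # Strip leading and trailing whitespace from each line and drop empty lines
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    text = '\n'.join(lines)
    # HTML entities (&nbsp;, &amp;, etc.) have already been decoded by BeautifulSoup
    return text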
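
The function above deals with blank lines; the "special characters" part usually means invisible or look-alike characters that survive entity decoding, such as the no-break space produced by `&nbsp;` or zero-width characters. Here is one possible sketch using only the standard library; the function name and the set of characters removed are choices made for this example.

```python
import re
import unicodedata

# Zero-width space, zero-width non-joiner/joiner, word joiner, and BOM
INVISIBLE = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"), None)

def normalize_special_chars(text):
    """Normalize invisible and look-alike characters left over from HTML."""
    # NFKC folds compatibility characters: full-width letters and digits
    # become ASCII, and the no-break space (U+00A0) becomes a plain space
    text = unicodedata.normalize("NFKC", text)
    # Delete zero-width characters
    text = text.translate(INVISIBLE)
    # Collapse runs of spaces and tabs within a line
    return re.sub(r"[ \t]+", " ", text)

sample = "Price:\u00a0\uff11\uff12\uff13\u200b USD"
print(normalize_special_chars(sample))  # prints: Price: 123 USD
```

Be aware that NFKC also converts full-width punctuation such as ， and ： to their ASCII forms. If Chinese punctuation must be preserved, replace `\u00a0` explicitly and skip the NFKC step.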

5.3 Keeping selected tags

Sometimes you do not want to drop every tag, for example when you want to keep the heading hierarchy:

python

def extract_html_with_headers(html_content):
    soup = BeautifulSoup(html_content, 'lxml')
    
    # Remove noise tags
    for tag in soup(['script', 'style', 'nav', 'footer', 'aside']):
        tag.decompose()
    
    # Extract headings and body paragraphs
    result = []
    for tag in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p']):
        if tag.name.startswith('h'):
            result.append(f"【{tag.name.upper()}】{tag.get_text(strip=True)}")
        else:
            result.append(tag.get_text(strip=True))
    
    return '\n'.join(result)

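⑥ A Unified Workflow for Multiple Formats

In real work the documents arrive mixed: a PDF today, a Word file tomorrow, a scraped web page the day after. Write a single entry-point function that picks the parser based on the file extension:

python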

import os
import re

def parse_document(file_path):
    ext = os.path.splitext(file_path)[1].lower()
    
    if ext == '.pdf':
        raw_text = extract_pdf_text(file_path)
        return clean_pdf_text(raw_text)
    
    elif ext == '.doc':
        # python-docx cannot open .doc files; convert to .docx first (see section 4.4)
        raise ValueError("Convert .doc to .docx before parsing (see section 4.4)")
    
    elif ext == '.docx':
        raw_text = extract_docx_text(file_path)
        # Word documents usually carry little noise, so light cleaning is enough
        return re.sub(r'\n{3,}', '\n\n', raw_text)
    
    elif ext in ['.html', '.htm']:
        with open(file_path, 'r', encoding='utf-8') as f:
            html = f.read()
        text = extract_html_text(html)
        return clean_html_text(text)
    
    else:
        raise ValueError(f"Unsupported format: {ext}")

# Batch processing
def batch_parse(folder_path):
    results = {}
    for filename in os.listdir(folder_path):
        file_path = os.path.join(folder_path, filename)
        try:
            results[filename] = parse_document(file_path)
        except Exception as e:
            print(f"Failed to parse {filename}: {e}")
            results[filename] = None
    return results

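⑦ Fixing Special-Character and Encoding Problems

Encoding problems are among the most common and most tedious pitfalls in document parsing.

7.1 Detecting a file's encoding

Use chardet to detect the encoding automatically:

python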

import chardet

def detect_encoding(file_path):
    with open(file_path, 'rb') as f:
        raw_data = f.read(10000)  # the first ~10 KB is usually enough for a guess
    result = chardet.detect(raw_data)
    return result['encoding'], result['confidence']

# Usage
encoding, confidence = detect_encoding('unknown.txt')
print(f"Detected encoding: {encoding}, confidence: {confidence}")

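7.2 Repairing mojibake with ftfy

Some files open as pure gibberish, for example "æ–‡æ¡£" (what the UTF-8 bytes of 文档 look like when decoded as Windows-1252). ftfy can often repair this in one call:

python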

from ftfy import fix_text

# Fix a single string
garbled = 'âœ” No problems'
fixed = fix_text(garbled)
print(fixed)  # prints: ✔ No problems

# Fix a whole document's text
def fix_document_text(text):
    return fix_text(text)
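
It helps to know where this kind of gibberish comes from: text is encoded as UTF-8 and then decoded with a different single-byte encoding. The sketch below reproduces and reverses that mistake with the standard library alone, assuming the wrong encoding was Windows-1252 (`cp1252`); the function names are illustrative.

```python
def make_mojibake(text, wrong_encoding="cp1252"):
    """Encode as UTF-8, then decode with the wrong single-byte encoding."""
    return text.encode("utf-8").decode(wrong_encoding)

def undo_mojibake(garbled, wrong_encoding="cp1252"):
    """Reverse the mistake: re-encode with the wrong codec, decode as UTF-8."""
    return garbled.encode(wrong_encoding).decode("utf-8")

garbled = make_mojibake("文档")
print(garbled)                 # prints: æ–‡æ¡£
print(undo_mojibake(garbled))  # prints: 文档
```

The manual round trip only works when you know the wrong encoding and every byte maps cleanly in it (cp1252 leaves a few byte values undefined). ftfy's value is that it guesses the encoding and copes with those gaps for you.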

7.3 Tolerant decoding when reading files

python

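def safe_read_text(file_path):
    # Try common encodings in order; utf-8 goes first because it is strict,
    # so a successful decode is a strong signal
    encodings = ['utf-8', 'gbk', 'gb2312', 'gb18030', 'big5', 'shift-jis']
    
    for encoding in encodings:
        try:
            with open(file_path, 'r', encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            continue
    
    # If every attempt fails, fall back to the detected encoding
    # (chardet may return None, in which case utf-8 is assumed)
    encoding, _ = detect_encoding(file_path)
    with open(file_path, 'r', encoding=encoding or 'utf-8', errors='ignore') as f:
        return f.read()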

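⑧ Extracting Tables from Complex Layouts

Table extraction is the most technically demanding part of document parsing, especially for tables inside PDFs.

8.1 Extracting tables with pdfplumber (suited to bordered tables)

python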

import pdfplumber

def extract_tables_pdfplumber(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        all_tables = []
        for page in pdf.pages:
            tables = page.extract_tables()
            for table in tables:
                # Skip empty tables and tables with only one row
                if table and len(table) > 1:
                    all_tables.append(table)
    return all_tables
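
The tables returned above are plain lists of rows, and cells can be `None` or contain line breaks. Before exporting, it is often convenient to tidy the cells and key each row by the header. The following is a sketch that assumes the first row is the header; the function name and the `col_N` fallback names are made up for this example.

```python
def table_to_records(table):
    """Convert a list-of-rows table into a list of dicts keyed by the header.

    Assumes the first row is the header. Cells may be None (for cells with
    no extracted text) or contain line breaks.
    """
    def clean(cell):
        if cell is None:
            return ""
        # Collapse internal line breaks and repeated spaces into single spaces
        return " ".join(str(cell).split())

    # Give unnamed header cells a positional fallback name
    header = [clean(c) or f"col_{i}" for i, c in enumerate(table[0], 1)]
    records = []
    for row in table[1:]:
        values = [clean(c) for c in row]
        if not any(values):
            continue  # skip rows that are entirely empty
        # Pad short rows so every record has all header keys
        values += [""] * (len(header) - len(values))
        records.append(dict(zip(header, values)))
    return records

raw = [
    ["Name", "Unit\nprice", None],
    ["Widget", "3.50", "in stock"],
    [None, None, None],
    ["Gadget", None],
]
print(table_to_records(raw))
```

The resulting list of dicts can be passed directly to `pandas.DataFrame(...)` if you want to continue in pandas.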

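8.2 Extracting tables with Camelot (suited to complex tables)

Camelot offers two modes:

lattice: for grid tables with clearly drawn ruling lines
stream: for tables without ruling lines, where columns are separated by whitespace

python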

import camelot

# lattice mode: PDFs whose tables have ruling lines
tables = camelot.read_pdf('table_with_lines.pdf', flavor='lattice')
df = tables[0].df  # each table exposes its data as a pandas DataFrame

# stream mode: PDFs whose tables have no ruling lines
tables = camelot.read_pdf('table_no_lines.pdf', flavor='stream')

# Export to CSV
tables[0].to_csv('output.csv')

Pitfall: Camelot needs Ghostscript. To install it:

Windows: download and install Ghostscript, then add it to PATH

Mac: brew install ghostscript

Linux: sudo apt-get install ghostscript

8.3 Extracting tables with tabula-py (depends on Java, but tends to be more stable)

python

import tabula

# Read every table in the PDF (returns a list of DataFrames)
tables = tabula.read_pdf('table.pdf', pages='all')

# Restrict to specific pages
tables = tabula.read_pdf('table.pdf', pages='1-3')

# Preview each DataFrame in the list
for df in tables:
    print(df.head())

8.4 Exporting to Excel

python

from openpyxl import Workbook

def tables_to_excel(tables, output_path):
    wb = Workbook()
    for i, table in enumerate(tables):
        ws = wb.create_sheet(title=f'Table_{i+1}')
        for row_idx, row in enumerate(table, 1):
            for col_idx, cell in enumerate(row, 1):
                ws.cell(row=row_idx, column=col_idx, value=cell)
    # Remove the empty default sheet, but only if at least one table sheet
    # was added, because a workbook must keep at least one sheet
    if tables and 'Sheet' in wb.sheetnames:
        wb.remove(wb['Sheet'])
    wb.save(output_path)

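⑨ Common Parsing Errors and How to Troubleshoot Them

Symptom	Likely cause	Fix
PDF text extraction returns nothing	Scanned PDF (pure images)	Combine with OCR (Tesseract + pdf2image)
PDF text comes out in the wrong order	Multi-column layout	Re-sort using pdfplumber's coordinate information
Error when reading a Word file	The file is .doc, not .docx	Convert with LibreOffice or pywin32
HTML parsing hangs	Very large page or unclosed tags	Switch to the 'lxml' parser, or set a time limit
UnicodeDecodeError	Wrong encoding	Detect with chardet, then re-read with the correct encoding
Camelot finds no tables	Ruling lines too faint or absent	Switch to stream mode, or tune the parameters
Out of memory	File too large	Read page by page; use a generator instead of loading everything at once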
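
For the multi-column case in the table above, "re-sorting by coordinates" can be sketched as follows. The function takes word dictionaries with `text`, `x0`, and `top` keys, which is the shape pdfplumber's `page.extract_words()` returns, and splits the page at its horizontal midpoint. The midpoint split and the `line_tolerance` value are simplifying assumptions that only suit a plain two-column layout.

```python
def reorder_two_columns(words, page_width, line_tolerance=3):
    """Put the words of a two-column page into reading order.

    words: dicts with 'text', 'x0' and 'top' keys.
    page_width: page width; its midpoint is used as the column boundary.
    line_tolerance: maximum vertical gap for words to count as one line
                    (an arbitrary starting value).
    """
    middle = page_width / 2
    left = [w for w in words if w["x0"] < middle]
    right = [w for w in words if w["x0"] >= middle]

    lines = []
    for column in (left, right):  # the whole left column first, then the right
        current, current_top = [], None
        for w in sorted(column, key=lambda w: (w["top"], w["x0"])):
            if current_top is not None and abs(w["top"] - current_top) > line_tolerance:
                lines.append(current)  # vertical jump: start a new line
                current = []
            if not current:
                current_top = w["top"]
            current.append(w)
        if current:
            lines.append(current)

    # Within each line, order the words left to right
    return "\n".join(
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    )

words = [
    {"text": "Left", "x0": 50, "top": 100},
    {"text": "Right", "x0": 320, "top": 100},
    {"text": "one", "x0": 80, "top": 101},
    {"text": "one", "x0": 360, "top": 100.5},
    {"text": "Left", "x0": 50, "top": 120},
    {"text": "two", "x0": 80, "top": 120},
]
print(reorder_two_columns(words, page_width=600))
```

With pdfplumber this would be called as `reorder_two_columns(page.extract_words(), page.width)`. Pages that mix full-width headings with two-column body text need a more careful split than the midpoint.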

OCR example for scanned PDFs:

python

from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(pdf_path, lang='chi_sim+eng'):
    images = convert_from_path(pdf_path, dpi=300)
    text = ''
    for i, img in enumerate(images):
        # Grayscale plus binarization to improve recognition
        img = img.convert('L').point(lambda x: 0 if x < 140 else 255)
        text += f"--- Page {i+1} ---\n"
        text += pytesseract.image_to_string(img, lang=lang)
    return text

⑩ Writing Batch Scripts and Optimizing Performance

10.1 A basic batch-processing script

python

import os
from concurrent.futures import ProcessPoolExecutor, as_completed

def process_single_file(file_path):
    """Process one file (defined at module level so worker processes can import it)."""
    try:
        text = parse_document(file_path)
        # Save the result next to the source file
        output_path = file_path + '.txt'
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(text)
        return file_path, True, len(text)
    except Exception as e:
        return file_path, False, str(e)

def batch_process_parallel(folder_path, max_workers=4):
    """Process a folder of documents in parallel."""
    files = []
    for root, dirs, filenames in os.walk(folder_path):
        for f in filenames:
            if f.lower().endswith(('.pdf', '.docx', '.doc', '.html', '.htm')):
                files.append(os.path.join(root, f))
    
    print(f"Found {len(files)} documents")
    
    results = []
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        future_to_file = {
            executor.submit(process_single_file, file_path): file_path
            for file_path in files
        }
        for future in as_completed(future_to_file):
            file_path, success, info = future.result()
            if success:
                print(f"✓ {os.path.basename(file_path)}: {info} characters")
            else:
                print(f"✗ {os.path.basename(file_path)}: {info}")
            results.append((file_path, success))
    
    return results

# The guard is required on platforms that spawn worker processes (e.g. Windows)
if __name__ == '__main__':
    batch_process_parallel('docs')  # 'docs' is a placeholder folder name

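10.2 Performance tips

Use multiple processes rather than threads: because of Python's GIL, CPU-bound work only benefits from multiple cores when it runs in separate processes
Process large files in chunks: handle a large PDF page by page instead of loading it all into memory
Cache intermediate results: if the same document is parsed repeatedly, store the parsed output and reuse it
Pick a sensible DPI: for OCR, 300 DPI is a common balance between accuracy and speed
Limit concurrency in batch jobs: one to two times the number of CPU cores is usually enough; more than that mostly adds context-switching overhead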
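
The caching tip can be sketched with the standard library alone: key the cache on a hash of the file's contents, so an unchanged file is never parsed twice. The function names, the JSON cache layout, and the `fake_parse` stand-in are all assumptions made for this example; in practice you would pass `parse_document` as `parse_func`.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of the file contents, read in chunks to keep memory use flat."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def parse_with_cache(path, parse_func, cache_dir):
    """Return parse_func(path), reusing a cached result if the file is unchanged."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / (file_digest(path) + ".json")
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))["text"]
    text = parse_func(path)
    cache_file.write_text(
        json.dumps({"text": text}, ensure_ascii=False), encoding="utf-8"
    )
    return text

# Demo with a stand-in parser that records how often it is called
calls = []
def fake_parse(path):
    calls.append(path)
    return Path(path).read_text(encoding="utf-8").upper()

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "a.txt"
    src.write_text("hello", encoding="utf-8")
    first = parse_with_cache(src, fake_parse, Path(tmp) / "cache")
    second = parse_with_cache(src, fake_parse, Path(tmp) / "cache")
    print(first, second, len(calls))  # prints: HELLO HELLO 1
```

Hashing the contents rather than the file name means renamed copies share one cache entry, at the cost of reading each file once to compute the digest.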

python

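# Batch processing with a progress bar (pip install tqdm)
from concurrent.futures import ProcessPoolExecutor, as_completed
from tqdm import tqdm

def batch_process_with_progress(files, max_workers=4):
    # files: list of document paths, collected as in batch_process_parallel
    results = []
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(process_single_file, f) for f in files]
        for future in tqdm(as_completed(futures), total=len(files), desc="Processing documents"):
            results.append(future.result())
    return results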

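That covers the full document-parsing workflow. From environment setup through handling each of the three formats, and on to the unified workflow, encoding repair, table extraction, and batch optimization, it covers most of the situations that come up in day-to-day work. When you hit a specific problem, first pin down the document type (text-based or scanned, .doc or .docx, well-formed HTML or not), then pick the tool that fits; it will save a lot of wasted effort.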