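PDF Text Extraction Compared in Depth: Rule-Based vs. Model-Based Implementations

🧩 Overview

This article compares two common approaches to PDF text extraction: a traditional rule-based approach built on the PyPDF2 library, and a model-based approach built on the unstructured library. By walking through two concrete implementations, rule_base.py and unstructured_processor.py, it covers the core idea behind each approach, where each one fits, and their trade-offs.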

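📌 1. Module Roles

rule_base.py (rule-based): This module provides a lightweight, fast way to extract text from PDFs. It parses the PDF's internal structure directly and pulls out the text content. It works best on documents that are structurally simple, mostly plain text, and need no deep layout understanding, for example the running text of a paper, report, or book.
unstructured_processor.py (model-based): This module aims for higher-fidelity, structure-aware extraction. Besides extracting text, it identifies elements such as titles, paragraphs, list items, and tables, and keeps them in document order. It is particularly suited to business documents with complex layouts, mixed text and images, or many tables, as well as scans and slide decks, which makes it a good fit for building a high-quality knowledge base.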

🔢 2. Inputs and Outputs

`rule_base.py`


# Make sure the PyPDF2 package is installed
try:
    import PyPDF2
except ImportError:
    import sys

    sys.exit("Please install the PyPDF2 module first, using: pip install PyPDF2")


def extract_text_from_pdf(filename, page_num):
    """Return the text of one page (zero-based index), or a message string on failure."""
    try:
        with open(filename, 'rb') as pdf_file:
            reader = PyPDF2.PdfReader(pdf_file)
            # Reject negative indexes as well, since page_num is documented as zero-based
            if 0 <= page_num < len(reader.pages):
                page = reader.pages[page_num]
                text = page.extract_text()
                if text:
                    return text
                else:
                    return "No text found on this page."
            else:
                return f"Page number {page_num} is out of range. This document has {len(reader.pages)} pages."
    except Exception as e:
        return f"An error occurred: {str(e)}"


if __name__ == '__main__':
    # Example usage
    filename = "test.pdf"
    page_num = 5
    text = extract_text_from_pdf(filename, page_num)

    print('--------------------------------------------------')
    print(f"Text from file '{filename}' on page {page_num}:")
    print(text if text else "No text available on the selected page.")
    print('--------------------------------------------------')

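Inputs:
- filename (str): path to the PDF file (the example passes the relative path "test.pdf"; an absolute path works as well).
- page_num (int): zero-based index of the page to extract text from.
Output:
- text (str): the plain text extracted from the requested page. If the page contains no text, the page index is out of range, or an exception occurs, the function returns a message string describing the problem instead.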
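
Because page_num is zero-based while people usually talk about pages starting from 1, a small conversion helper can avoid off-by-one mistakes. The sketch below is an illustration only; the name to_page_index is hypothetical and not part of rule_base.py.

```python
def to_page_index(page_number: int, page_count: int) -> int:
    """Convert a 1-based page number into a 0-based index, validating the range."""
    if not 1 <= page_number <= page_count:
        raise ValueError(
            f"Page {page_number} is out of range; the document has {page_count} pages."
        )
    return page_number - 1


# Page 6 of a 10-page document corresponds to index 5
print(to_page_index(6, 10))
```

With this helper, asking for page 6 yields index 5, the same page the example script reads with page_num = 5. In practice page_count would come from len(reader.pages).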

`unstructured_processor.py`


import tempfile
import os
from unstructured.partition.pdf import partition_pdf
from unstructured.partition.image import partition_image
import json
from unstructured.staging.base import elements_to_json
from rich.progress import Progress, SpinnerColumn, TextColumn
from rich import print
from bs4 import BeautifulSoup


class UnstructuredProcessor(object):
    def __init__(self):
        # Constructor: initializes an UnstructuredProcessor instance (no state is needed)
        pass

    def extract_data(self, file_path, strategy, model_name, options, local=True, debug=False):
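        """
        Extract data from the given file.

        :param file_path: str, path of the file to process.
        :param strategy: str, the partitioning strategy used to extract data.
        :param model_name: str, name of the layout model; this example uses the YOLOX object-detection model ("yolox").
        :param options: collection of option names (the example passes a list) or None; controls what is extracted and returned.
        :param local: bool, whether the task runs locally, in which case a progress spinner is shown. Defaults to True.
        :param debug: bool, if True, prints extra debugging information and writes the intermediate JSON next to the input file. Defaults to False.

        Execution flow:
        - Calls `invoke_pipeline_step`, a higher-order method that takes a lambda plus a few other arguments.
        - The lambda calls `process_file`, which partitions the file using the given strategy and model name.
        - Besides running the lambda, `invoke_pipeline_step` reports progress: a spinner when `local` is True, a plain printed message otherwise.
        - The elements returned by `process_file` are then passed to `load_text_data`, which produces `content` and `table_content`.
        """

        # Run the extraction step: partition the file into elements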
        elements = self.invoke_pipeline_step(
            lambda: self.process_file(file_path, strategy, model_name),
            "Extracting elements from the document...",
            local
        )

        if debug:
            new_extension = 'json'  # You can change this to any extension you want
            new_file_path = self.change_file_extension(file_path, new_extension)

            content, table_content = self.invoke_pipeline_step(
                lambda: self.load_text_data(elements, new_file_path, options),
                "Loading text data...",
                local
            )
        else:
            with tempfile.TemporaryDirectory() as temp_dir:
                temp_file_path = os.path.join(temp_dir, "file_data.json")

                content, table_content = self.invoke_pipeline_step(
                    lambda: self.load_text_data(elements, temp_file_path, options),
                    "Loading text data...",
                    local
                )

        if debug:
            print("Data extracted from the document:")
            print(content)
            print("\n")
            print("Table content extracted from the document:")
            if table_content:
                print(len(table_content))
            print(table_content)

        print(f"content: {content}")
        print(f"table_content: {table_content}")
        return content, table_content

    def process_file(self, file_path, strategy, model_name):
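        """
        Process a file and extract its elements. Supports PDF and image files.

        :param file_path: str, path of the file to process.
        :param strategy: str, the partitioning strategy; it determines how the file is processed and affects the result.
        :param model_name: str, name of the layout model; this example uses "yolox".

        Execution flow:
        - Initialize `elements` to None; it will hold the extracted elements.
        - Check the file extension and call the matching partition function:
          - For a PDF (.pdf), call `partition_pdf` with:
            - `filename`: the file path.
            - `strategy`: the extraction strategy.
            - `infer_table_structure`: whether to infer table structure; set to True here.
            - `hi_res_model_name`: name of the model used by the hi_res strategy.
            - `languages`: OCR languages; set to Simplified Chinese (`chi_sim`) here.
          - For an image (.jpg, .jpeg, .png), call `partition_image` with the same kind of arguments.
        - Return the extracted `elements` (None if the extension is not supported).

        :return: the elements extracted from the file.
        """

        # Holds the extracted elements; stays None for unsupported extensions
        elements = None
        # Choose the partition function based on the file extension
        # partition_pdf documentation: https://docs.unstructured.io/open-source/core-functionality/partitioning#partition-pdf

        # The hi_res strategy combined with infer_table_structure=True tends to recognize tables well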
        if file_path.lower().endswith('.pdf'):
            elements = partition_pdf(
                filename=file_path,
                # The strategy kwarg controls how the PDF is processed. Available strategies: "auto", "hi_res", "ocr_only", and "fast"
                strategy=strategy,
                infer_table_structure=True,
                hi_res_model_name=model_name,
                languages=['chi_sim']
            )
        elif file_path.lower().endswith(('.jpg', '.jpeg', '.png')):
            # Process an image file
            elements = partition_image(
                filename=file_path,
                strategy=strategy,
                infer_table_structure=True,
                hi_res_model_name=model_name,
                languages=['chi_sim']
            )

        return elements

    def change_file_extension(self, file_path, new_extension, suffix=None):
        # Make sure the new extension starts with a dot
        if not new_extension.startswith('.'):
            new_extension = '.' + new_extension

        # Split off the current extension. os.path.splitext only looks at the last
        # path component, so a dot in a directory name (e.g. "docs.v1/report") is left alone.
        base = os.path.splitext(file_path)[0]

        # Concatenate the base with the optional suffix and the new extension
        if suffix is None:
            new_file_path = base + new_extension
        else:
            new_file_path = base + "_" + suffix + new_extension

        return new_file_path

    def load_text_data(self, elements, file_path, options):
        # Write the elements to a JSON file by hand, with ensure_ascii=False so non-ASCII text stays readable
        with open(file_path, 'w', encoding='utf-8') as file:
            json.dump([e.to_dict() for e in elements], file, ensure_ascii=False)

        content, table_content = None, None

        if options is None:
            content = self.process_json_file(file_path)

        if options and "tables" in options and "unstructured" in options:
            content = self.process_json_file(file_path, "form")
            table_content = self.process_json_file(file_path, "table")

        return content, table_content

    def process_json_file(self, file_path, option=None):
        # Read the JSON file
        with open(file_path, 'r', encoding='utf-8') as file:
            data = json.load(file)

        # Iterate over the JSON data and extract required elements
        extracted_elements = []
        for entry in data:
            if entry["type"] == "Table" and (option is None or option == "table" or option == "form"):
                table_data = entry["metadata"]["text_as_html"]
                if option == "table" and self.table_has_header(table_data):
                    extracted_elements.append(table_data)
                if option is None or option == "form":
                    extracted_elements.append(table_data)
            elif entry["type"] == "Title" and (option is None or option == "form"):
                extracted_elements.append(entry["text"])
                # Narrative text
            elif entry["type"] == "NarrativeText" and (option is None or option == "form"):
                extracted_elements.append(entry["text"])
                # Uncategorized text
            elif entry["type"] == "UncategorizedText" and (option is None or option == "form"):
                extracted_elements.append(entry["text"])
            elif entry["type"] == "ListItem" and (option is None or option == "form"):
                extracted_elements.append(entry["text"])
            elif entry["type"] == "Image" and (option is None or option == "form"):
                extracted_elements.append(entry["text"])

        if option is None or option == "form":
            # Convert list to single string with two new lines between each element
            extracted_data = "\n\n".join(extracted_elements)
            return extracted_data
     
        return extracted_elements

    def invoke_pipeline_step(self, task_call, task_description, local):
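        """
        Run one pipeline step, either locally or in a non-local environment.

        :param task_call: callable, a zero-argument function or lambda that performs the actual task.
        :param task_description: str, description of the task, shown next to the spinner or printed.
        :param local: bool, whether the task runs locally. If True, a progress spinner is shown; if False, only the task description is printed.

        Execution flow:
        - If `local` is True, use the `Progress` context manager to show a live progress display.
          - `SpinnerColumn()`: adds a spinning indicator.
          - `TextColumn("[progress.description]{task.description}")`: adds a text column that shows the task description.
          - `transient=False`: the display stays on screen after the task finishes.
          - Add a task to the display, then call `task_call()` to do the work; the result is stored in `ret`.
        - If `local` is False, print the task description without a progress display, then call `task_call()`; the result is stored in `ret` as well.

        :return: the result returned by `task_call()`.
        """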
        if local:
            # When running locally, show a spinner next to the task description
            with Progress(
                    SpinnerColumn(),
                    TextColumn("[progress.description]{task.description}"),
                    transient=False,
            ) as progress:
                # Add a progress task; total=None marks it as indeterminate
                progress.add_task(description=task_description, total=None)
                # Run the task and capture its result
                ret = task_call()
        else:
            print(task_description)
            ret = task_call()

        return ret

    def table_has_header(self, table_html):
        soup = BeautifulSoup(table_html, 'html.parser')
        table = soup.find('table')

        # No <table> element at all: treat it as having no header
        if table is None:
            return False

        # Check if the table contains a <thead> tag
        if table.find('thead'):
            return True

        # Check if the table contains any <th> tags inside the table (in case there's no <thead>)
        if table.find_all('th'):
            return True

        return False


if __name__ == "__main__":
    processor = UnstructuredProcessor()

    # Extract text and table data from the PDF
    content, table_content = processor.extract_data(
        'test.pdf',
        'hi_res',       # strategy
        'yolox',        # layout model, see https://github.com/Megvii-BaseDetection/YOLOX
        ['tables', 'unstructured'],  # options
        True,           # local
        True)           # debug

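Inputs:
- file_path (str): path to a PDF or image file.
- strategy (str): the unstructured partitioning strategy (e.g. "hi_res", "fast").
- model_name (str): name of the layout model (mainly relevant to the hi_res strategy).
- options (list or None): controls what is returned. With None, only content is produced and table_content is None; when both "tables" and "unstructured" are present, content and table_content are both produced.
- local (bool): whether to show a progress spinner while each step runs.
- debug (bool): whether to print the extracted data and write the intermediate JSON next to the input file.
Outputs:
- content (str): the extracted body content joined with blank lines (titles, paragraphs, list items, and so on, plus table HTML in document order).
- table_content (list[str]): the extracted tables that have a header, each stored as an HTML string.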

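🔧 3. Core Logic

`rule_base.py` (PyPDF2-based)

The core logic is straightforward and relies on PyPDF2's ability to parse the PDF file format.

Open the file: open the given PDF in binary read mode ('rb').
Create the reader: initialize a PyPDF2.PdfReader object, which parses the document structure of the PDF.
Locate the page: use page_num to fetch the matching page object from the reader's pages list.
Extract the text: call the page object's extract_text() method. It walks the page's content stream, picks out the text-showing operations, and concatenates their strings into a single result.
Handle errors: an explicit range check covers an out-of-range page index, and a broad except clause turns other failures, such as a missing file, into a message string.

In essence this approach is "decoding": it reads and interprets the text data according to the PDF specification, with no visual analysis involved.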
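
To make the "decoding" idea concrete, here is a deliberately naive sketch that pulls strings out of a hand-written, uncompressed content stream. It is not how PyPDF2 is implemented: real extraction also has to deal with compressed streams, font encodings, escape sequences, and text positioning. The sample stream and the function name naive_extract_text are assumptions made for illustration.

```python
import re

# Hand-written, uncompressed content stream fragment (an assumption for illustration).
# Tj shows one string, TJ shows an array of strings, Td moves the text position.
CONTENT_STREAM = b"""
BT
/F1 12 Tf
72 720 Td
(Hello, ) Tj
(PDF world) Tj
0 -14 Td
[(Sec) -20 (ond line)] TJ
ET
"""

# A PDF literal string: parentheses around characters, allowing backslash escapes.
STRING_LITERAL = re.compile(rb"\(((?:\\.|[^\\()])*)\)")


def naive_extract_text(stream: bytes) -> str:
    """Concatenate the strings shown by Tj/TJ, one operator per line (simplified)."""
    parts = []
    for raw_line in stream.splitlines():
        line = raw_line.strip()
        if line.endswith(b"Tj") or line.endswith(b"TJ"):
            for match in STRING_LITERAL.finditer(line):
                # Escape sequences are left as-is in this simplified version
                parts.append(match.group(1).decode("latin-1"))
        elif line.endswith(b"Td") and parts:
            # Treat a position move after some text as a line break
            parts.append("\n")
    return "".join(parts)


print(naive_extract_text(CONTENT_STREAM))
```

The sketch shows why this style of extraction knows nothing about columns or tables: it only sees strings in the order they appear in the stream.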

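`unstructured_processor.py` (unstructured-based)

This approach is more involved: it combines computer vision (CV) with natural language processing (NLP).

Partitioning: this is the core of the unstructured library. The file is passed to partition_pdf or partition_image. With the hi_res strategy, each page is analyzed as an image.
Layout detection: under the hi_res strategy, an object-detection model (YOLOX in this script; Detectron2-based models are another option) locates the regions of the page, such as headers, footers, titles, paragraph text, images, and tables.
OCR: for detected text regions or scanned PDFs, an OCR engine (such as Tesseract) converts the text in the image into machine-readable text.
Element serialization: each detected block becomes a structured element with a type (such as Title, NarrativeText, ListItem, Table) and its content.
Filtering and assembly: the script filters the partitioned elements by type (the entry["type"] checks in process_json_file) and joins the selected texts in document order into content. Table elements are also collected separately as HTML (the text_as_html metadata field) into table_content; the full script keeps only tables that have a header.

In essence this approach is "recognize and reconstruct": it mimics how a person reads, first understanding the layout and then extracting the content, which is how it preserves rich structural information.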
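
The filtering and assembly step can be tried without installing unstructured by working on plain dictionaries. The sketch below assumes entries shaped like the script's intermediate JSON (a "type", a "text", and for tables a "metadata" entry with "text_as_html"). The sample data, the TEXT_TYPES set, and the name split_elements are hypothetical, and the logic is simplified: unlike the full script, it does not add table HTML to the text and does not check for a header.

```python
# Hypothetical sample shaped like the script's intermediate JSON entries
SAMPLE_ELEMENTS = [
    {"type": "Title", "text": "Quarterly Report"},
    {"type": "NarrativeText", "text": "Revenue grew in the third quarter."},
    {
        "type": "Table",
        "text": "Region Revenue",
        "metadata": {
            "text_as_html": "<table><tr><th>Region</th><th>Revenue</th></tr></table>"
        },
    },
    {"type": "Footer", "text": "Page 1"},
    {"type": "ListItem", "text": "Costs were flat."},
]

# Element types treated as body text in this sketch
TEXT_TYPES = {"Title", "NarrativeText", "UncategorizedText", "ListItem"}


def split_elements(elements):
    """Return (joined body text, list of table HTML strings) in document order."""
    texts, tables = [], []
    for entry in elements:
        if entry["type"] in TEXT_TYPES:
            texts.append(entry["text"])
        elif entry["type"] == "Table":
            # text_as_html may be missing when table structure was not inferred
            html = entry.get("metadata", {}).get("text_as_html")
            if html:
                tables.append(html)
    return "\n\n".join(texts), tables


content, tables = split_elements(SAMPLE_ELEMENTS)
print(content)
print(len(tables))
```

Elements of other types, such as the Footer entry here, are simply dropped, which is how page furniture stays out of the assembled text.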

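💻 4. Code Walkthrough

To make the implementation details of both approaches clearer, the core code is shown below with line-by-line comments.

`rule_base.py` code walkthrough

The implementation is straightforward and relies entirely on PyPDF2.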


# Import PyPDF2, which handles the PDF parsing
import PyPDF2

# Takes the PDF file name and a zero-based page index
def extract_text_from_pdf(filename, page_num):
    try:
        # Open the PDF in binary read mode ('rb'); 'with' closes the file afterwards
        with open(filename, 'rb') as pdf_file:
            # Create a PdfReader that reads and parses the PDF content
            reader = PyPDF2.PdfReader(pdf_file)
            
            # Check that the requested page index is within range
            if 0 <= page_num < len(reader.pages):
                # Fetch the page object at that index
                page = reader.pages[page_num]
                # Extract all text on the page
                text = page.extract_text()
                # Return the text if any was found, otherwise a message
                return text if text else "No text found on this page."
            else:
                # Page index out of range: return an error message
                return f"Page number {page_num} is out of range."
    except Exception as e:
        # Catch any exception (missing file, corrupted file, and so on) and return it as a message
        return f"An error occurred: {str(e)}"

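`unstructured_processor.py` code walkthrough

The core of this implementation is the call to the unstructured library's partition_pdf function, followed by processing of the structured elements it returns. The listing below is a condensed version of the full script.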


# Import partition_pdf from the unstructured.partition.pdf module
from unstructured.partition.pdf import partition_pdf

# Processing function: takes the file path and the partitioning strategy
def process_file_with_unstructured(file_path, strategy='hi_res'):
    try:
        # Partition the PDF and identify its elements
        elements = partition_pdf(
            filename=file_path,               # path of the PDF to process
            strategy=strategy,                # 'hi_res' runs a layout-detection model on each page
            infer_table_structure=True,       # infer table structure so tables come back as HTML
            hi_res_model_name="yolox",        # object-detection model used by the hi_res strategy
            languages=["chi_sim", "eng"]      # languages the document may contain; used by OCR
        )

        # Accumulators for the body text and the tables
        text_content = ""
        table_content = []
        # Element types treated as body text
        text_types = ["Title", "NarrativeText", "ListItem"]

        # Walk through every detected element
        for el in elements:
            # Body text: one of the types listed above
            if el.category in text_types:
                # Append the element's text, separated by a blank line
                text_content += el.text + "\n\n"
            # Table element
            elif el.category == "Table":
                # Append the table's HTML representation to the list
                table_content.append(el.metadata.text_as_html)
        
        # Return the assembled text and the list of tables
        return text_content, table_content

    except Exception as e:
        # Catch any exception raised during processing and return it as a message
        return f"An error occurred: {str(e)}", []

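🧪 5. Testing Suggestions

Comparison test: run both methods on the same PDF (one containing plain text, a multi-column layout, tables, and images) and compare the results for completeness, accuracy, and formatting.
rule_base.py edge cases:
- An encrypted or corrupted PDF file.
- A scanned PDF that contains only images and no embedded text (expected: no text is extracted).
- A page index that is out of range.
unstructured_processor.py scenarios:
- How different strategy values ("fast" vs. "hi_res") affect extraction quality and speed.
- Documents with complex tables that span pages.
- Documents that mix Chinese and English.
- Whether the extracted table HTML renders correctly.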
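
As a first automated pass at the last item in the list above, the table HTML can be checked structurally before anyone looks at it in a browser. The sketch below uses only the standard library; the names TableShapeParser and check_table_html are hypothetical. It is a strict check that assumes explicit closing tags, and rows with different cell counts are reported rather than rejected, since colspan can make them legitimate.

```python
from html.parser import HTMLParser


class TableShapeParser(HTMLParser):
    """Counts the cells in each <tr> and tracks whether table tags are balanced."""

    TRACKED = {"table", "thead", "tbody", "tr", "th", "td"}

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tracked tags
        self.rows = []        # number of cells seen in each row
        self.balanced = True  # becomes False on a mismatched closing tag

    def handle_starttag(self, tag, attrs):
        if tag not in self.TRACKED:
            return
        self.stack.append(tag)
        if tag == "tr":
            self.rows.append(0)
        elif tag in ("td", "th") and self.rows:
            self.rows[-1] += 1

    def handle_endtag(self, tag):
        if tag not in self.TRACKED:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.balanced = False


def check_table_html(table_html):
    """Return (well_formed, cells_per_row) for one table HTML string."""
    parser = TableShapeParser()
    parser.feed(table_html)
    parser.close()
    well_formed = parser.balanced and not parser.stack and bool(parser.rows)
    return well_formed, parser.rows


sample = (
    "<table><thead><tr><th>A</th><th>B</th></tr></thead>"
    "<tbody><tr><td>1</td><td>2</td></tr></tbody></table>"
)
print(check_table_html(sample))
```

A check like this catches truncated or mis-nested tables early; visual rendering still needs a manual look.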

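💡 6. Further Notes and Summary

Feature	`rule_base.py` (PyPDF2)	`unstructured_processor.py` (unstructured)
Core technique	Parsing of PDF internal objects	Computer vision (CV) + OCR + NLP
Input coverage	Text of digitally native PDFs only	Digitally native PDFs, scanned PDFs, and images
Structural information	Layout and element types are lost	Preserves titles, lists, tables, and other structure
Accuracy	Accurate on plain text flows, but easily thrown off by multi-column layouts and figures	Generally higher on complex layouts; distinguishes and extracts separate content blocks
Speed	Very fast	Slower, especially with the `hi_res` strategy, which runs deep-learning models
Dependencies	Lightweight, only `PyPDF2`	Heavier; depends on several libraries such as PyTorch and Detectron2
Best fit	Quickly extracting simple, text-only reports or books	Building a high-quality knowledge base that keeps structural information

Summary:

Which approach to choose depends on what you need. If the task is to pull content quickly out of a large number of uniformly formatted, text-only documents, PyPDF2 is an efficient, lightweight choice. If you are building an application that has to understand and use document structure (for example question answering or RAG), the model-driven approach of unstructured is the more capable option, and it gives the downstream NLP steps a much better data foundation.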
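
The two approaches can also be combined: try the cheap extractor first and fall back to the structured one only when it returns too little text, as happens with scanned pages. The sketch below is one possible arrangement and is not part of either script. The function extract_with_fallback, the min_chars threshold, and the stand-in extractors are all assumptions; real callers would pass wrappers around the two modules.

```python
from typing import Callable, Tuple


def extract_with_fallback(
    path: str,
    fast_extract: Callable[[str], str],
    structured_extract: Callable[[str], str],
    min_chars: int = 50,
) -> Tuple[str, str]:
    """Return (method_used, text), preferring the fast extractor when it yields enough text."""
    text = fast_extract(path)
    if len(text.strip()) >= min_chars:
        return "fast", text
    # Too little text: likely a scan or an image-only page, so use the slower path
    return "structured", structured_extract(path)


# Stand-in extractors so the sketch runs without any PDF library installed
def fake_fast(path: str) -> str:
    return "" if path.endswith("scan.pdf") else "plain text " * 10


def fake_structured(path: str) -> str:
    return "text recovered by layout detection and OCR"


print(extract_with_fallback("report.pdf", fake_fast, fake_structured)[0])
print(extract_with_fallback("scan.pdf", fake_fast, fake_structured)[0])
```

Note that rule_base.py reports problems as message strings rather than exceptions, so a real wrapper would need to recognize those messages instead of relying on length alone.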
