[Python] A Comparison of Common PDF Extraction Libraries

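Python offers several libraries for extracting content from PDF files, each with its own strengths and weaknesses. The sections below cover five commonly used ones, with pros, cons, and example code for each: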

pdfplumber
PyMuPDF (fitz)
PyPDF2
PDFMiner
Camelot
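
Before trying the examples, it can help to check which of these libraries are already installed. The sketch below uses only the standard library. The import names and PyPI package names in the mapping are the ones these projects are usually published under; the helper name installed_pdf_libraries is made up for this illustration.

```python
from importlib.util import find_spec

# Import name -> pip package name (as usually published on PyPI).
PDF_LIBRARIES = {
    "pdfplumber": "pdfplumber",
    "fitz": "PyMuPDF",
    "PyPDF2": "PyPDF2",
    "pdfminer": "pdfminer.six",
    "camelot": "camelot-py",
}


def installed_pdf_libraries():
    """Return {import_name: True/False} without importing the packages."""
    return {name: find_spec(name) is not None for name in PDF_LIBRARIES}


if __name__ == "__main__":
    for name, available in installed_pdf_libraries().items():
        hint = "installed" if available else f"missing (pip install {PDF_LIBRARIES[name]})"
        print(f"{name:<12} {hint}")
```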

1. pdfplumber

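Pros:

Easy to use, with a simple and intuitive API.
Extracts text, tables, and images.
Provides tools for post-processing text, such as text search and line detection.
Supports multi-page PDF files.

Cons:

On complex PDF files, extraction quality may fall short of other libraries.
Relatively slow.

Example code:

Suppose we have a PDF file, example.pdf, that contains both text and tables.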

import pdfplumber
import pandas as pd

pdf_path = 'example.pdf'
data = []

with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        if text:
            print(f"Page {page.page_number}:")
            print(text)

        # Extract tables
        tables = page.extract_tables()
        for table in tables:
            df = pd.DataFrame(table[1:], columns=table[0])
            data.append(df)
            print(df)

# Combine all tables into a single DataFrame
if data:
    all_tables = pd.concat(data, ignore_index=True)
    print("All extracted tables:")
    print(all_tables)
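
Tables extracted this way often contain empty cells (returned as None) and blank or repeated header names, which can make the DataFrame step above awkward. The following is a minimal sketch of a cleanup step, assuming the first row of each table is its header. The helper name table_to_records and the column_N and name_N naming scheme are illustrative choices, not part of pdfplumber.

```python
def table_to_records(table):
    """Convert one extracted table (a list of rows) into a list of dicts.

    Assumes the first row is the header. Empty cells (None) become "",
    and blank or repeated header names are made unique so that no
    column is silently overwritten.
    """
    if not table:
        return []

    headers = []
    seen = {}
    for index, cell in enumerate(table[0]):
        name = (cell or "").strip() or f"column_{index}"
        count = seen.get(name, 0)
        seen[name] = count + 1
        headers.append(name if count == 0 else f"{name}_{count}")

    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(headers, cells)))
    return records
```

The result can be passed straight to pandas, for example pd.DataFrame(table_to_records(table)).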

2. PyMuPDF (fitz)

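Pros:

High performance and fast.
Extracts many kinds of elements, including text, images, and annotations.
Can also modify PDF documents, for example by adding text, images, and annotations.

Cons:

Documentation and examples are relatively sparse.
Can be somewhat complex for beginners.

Example code: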

import fitz  # PyMuPDF

pdf_path = 'example.pdf'
document = fitz.open(pdf_path)

for page_num in range(document.page_count):
    page = document.load_page(page_num)
    text = page.get_text()
    print(f"Page {page_num + 1}:")
    print(text)

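    # Extract images
    for img in page.get_images():
        xref = img[0]
        base_image = document.extract_image(xref)
        image_bytes = base_image["image"]
        image_ext = base_image["ext"]  # actual format, e.g. "png" or "jpeg"

        # Use the real extension instead of assuming every image is a PNG
        with open(f"image_{page_num + 1}_{xref}.{image_ext}", "wb") as image_file:
            image_file.write(image_bytes)
        print(f"Extracted image from page {page_num + 1}, image reference {xref}")

document.close()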
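
Plain get_text() returns text in the order it is stored in the file, which is not always the reading order. As a rough sketch, the blocks returned by page.get_text("blocks") can be sorted by position instead. This assumes each block is a tuple (x0, y0, x1, y1, text, block_no, block_type) with block_type 0 for text. The helper name and the rounding rule are illustrative choices and will not suit multi-column layouts.

```python
def text_blocks_in_reading_order(blocks):
    """Return the text of text blocks sorted top-to-bottom, then left-to-right.

    `blocks` is assumed to be the list returned by page.get_text("blocks"),
    where each item is (x0, y0, x1, y1, text, block_no, block_type) and
    block_type 0 marks a text block (1 marks an image block).
    """
    text_blocks = [b for b in blocks if b[6] == 0 and b[4].strip()]
    # Round y0 so that blocks on (almost) the same line sort by x0.
    text_blocks.sort(key=lambda b: (round(b[1]), b[0]))
    return [b[4].strip() for b in text_blocks]
```

With a real page, this would be called as text_blocks_in_reading_order(page.get_text("blocks")).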

3. PyPDF2

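Pros:

Makes it easy to merge, split, and rotate PDF files.
Lightweight, with few dependencies.
Supports encrypting and decrypting PDF files.

Cons:

Weak support for text extraction and processing.
Does not support image extraction.

Example code: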

# Uses the PyPDF2 3.x names (PdfReader, PdfWriter, PdfMerger); the older
# PdfFileReader/PdfFileWriter/PdfFileMerger names were removed in 3.0.
import PyPDF2

pdf_path = 'example.pdf'
with open(pdf_path, 'rb') as file:
    reader = PyPDF2.PdfReader(file)
    for page_num, page in enumerate(reader.pages):
        text = page.extract_text()
        print(f"Page {page_num + 1}:")
        print(text)

# Example: Merging two PDFs
merger = PyPDF2.PdfMerger()
merger.append('example1.pdf')
merger.append('example2.pdf')
merger.write('merged.pdf')
merger.close()

# Example: Splitting a PDF (keep the first half of the pages)
input_pdf = PyPDF2.PdfReader('example.pdf')
output_pdf = PyPDF2.PdfWriter()
for page_num in range(len(input_pdf.pages) // 2):
    output_pdf.add_page(input_pdf.pages[page_num])
with open('split.pdf', 'wb') as output_file:
    output_pdf.write(output_file)
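
The split above keeps only the first half of the document. A more general variant, sketched below, writes every fixed-size run of pages to its own file. The function names, the prefix parameter, and the output file naming are assumptions made for this illustration; the PyPDF2 calls are the 3.x names (PdfReader, PdfWriter, add_page).

```python
def page_chunks(total_pages, chunk_size):
    """Yield (start, stop) page-index pairs covering total_pages, stop exclusive."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, total_pages, chunk_size):
        yield start, min(start + chunk_size, total_pages)


def split_pdf(pdf_path, chunk_size, prefix="part"):
    """Write every chunk of pages to its own file and return the file names."""
    # Imported here so page_chunks() can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(pdf_path)
    chunks = page_chunks(len(reader.pages), chunk_size)
    written = []
    for number, (start, stop) in enumerate(chunks, 1):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        out_path = f"{prefix}_{number}.pdf"
        with open(out_path, "wb") as out_file:
            writer.write(out_file)
        written.append(out_path)
    return written
```

Keeping the page arithmetic in its own function makes it easy to check on its own, separately from any file I/O.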

4. PDFMiner

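Pros:

Very powerful text extraction.
Handles complex PDF structures.
Provides detailed parsing of PDF documents.

Cons:

Relatively complex and hard to get started with.
Slow.

Example code: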

# The maintained PDFMiner fork is installed from PyPI as pdfminer.six
import io

from pdfminer.high_level import extract_text, extract_text_to_fp

pdf_path = 'example.pdf'

# Extract text to a string
text = extract_text(pdf_path)
print(text)

# Extract text to a file-like object
output_string = io.StringIO()
with open(pdf_path, 'rb') as file:
    extract_text_to_fp(file, output_string)
print(output_string.getvalue())
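
Both calls above return the whole document as one string. To work page by page, one option is to split that string afterwards. The sketch below assumes that pages in the extracted text are separated by a form feed character, which is how pdfminer.six's plain text output appears to mark page ends; the helper name split_pages is made up for this example.

```python
def split_pages(text):
    """Split extracted text into one string per page.

    Assumes pages are separated by the form feed character (chr(12)).
    """
    pages = text.split("\f")
    # A trailing form feed leaves an empty last item; drop it.
    if pages and not pages[-1].strip():
        pages.pop()
    return [page.strip() for page in pages]
```

For example, split_pages(extract_text(pdf_path)) would give a list with one entry per page, provided the assumption about the separator holds for the file at hand.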

5. Camelot

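Pros:

Built specifically for extracting tables from PDF files.
Offers Stream and Lattice modes to handle different kinds of tables.
Extracted tables convert easily to pandas DataFrames.

Cons:

Only extracts tables; it does not support other kinds of PDF content.
Depends on third-party tools (such as Ghostscript).

Example code: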

import camelot

pdf_path = 'example.pdf'
# pages defaults to '1', so request all pages explicitly; flavor is 'stream' or 'lattice'
tables = camelot.read_pdf(pdf_path, pages='all', flavor='stream')

for table in tables:
    print(f"Table on page {table.page}:")
    print(table.df)  # DataFrame of the extracted table

# Save tables to a CSV file
for i, table in enumerate(tables):
    table.to_csv(f'table_{i}.csv')
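
Stream mode in particular can report "tables" that are really just aligned paragraphs, so it may be worth filtering the results before saving them. The sketch below relies on the parsing_report dictionary that Camelot attaches to each table and its "accuracy" entry (a percentage). The helper name and the threshold of 80 are arbitrary choices for illustration.

```python
def filter_by_accuracy(tables, min_accuracy=80.0):
    """Keep tables whose parsing report meets the accuracy threshold.

    Each table is assumed to expose a `parsing_report` dict with an
    "accuracy" entry (a percentage), as Camelot's Table objects do.
    """
    return [
        table
        for table in tables
        if table.parsing_report.get("accuracy", 0.0) >= min_accuracy
    ]
```

With the tables read above, good_tables = filter_by_accuracy(tables) would keep only the higher-confidence results.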

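Summary

pdfplumber is easy to use and suits general-purpose text and table extraction; it is especially good with tables.
PyMuPDF (fitz) is powerful and fast, which suits performance-sensitive work, particularly on PDFs where images and annotations need processing.
PyPDF2 suits merging, splitting, and rotating PDF files, but its text extraction is weak, so it fits work on a document's structure better than work on its content.
PDFMiner offers the most powerful text extraction and handles complex PDF structures, but it is relatively complex and slow; it suits cases that need detailed parsing of PDF content.
Camelot is dedicated to table extraction and suits tabular data in PDFs, especially when tables need to be converted into structured data.

The right library depends on your specific needs and on how complex the PDF documents are. If you only need text and tables, pdfplumber and Camelot are good choices. If you need high performance or must handle images and annotations, consider PyMuPDF (fitz). If you need to handle complex PDF structures, PDFMiner is the most powerful tool. PyPDF2 fits structural operations on PDF files, such as merging and splitting.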
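
The recommendations in this summary can also be written down as a small lookup table, which is sometimes handy in project notes or tooling. The sketch below only encodes the advice given in this article; the task labels and the function name recommend are invented for the example.

```python
# Task label -> libraries suggested in this article's summary.
RECOMMENDATIONS = {
    "text": ["pdfplumber", "PDFMiner"],
    "tables": ["pdfplumber", "Camelot"],
    "images": ["PyMuPDF"],
    "annotations": ["PyMuPDF"],
    "speed": ["PyMuPDF"],
    "complex_layout": ["PDFMiner"],
    "merge_split": ["PyPDF2"],
}


def recommend(*tasks):
    """Return the suggested libraries for the given tasks, without duplicates."""
    unknown = [task for task in tasks if task not in RECOMMENDATIONS]
    if unknown:
        raise KeyError(f"unknown task(s): {unknown}")
    ordered = []
    for task in tasks:
        for library in RECOMMENDATIONS[task]:
            if library not in ordered:
                ordered.append(library)
    return ordered
```

For instance, recommend("tables", "images") lists pdfplumber, Camelot, and PyMuPDF, in that order.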