Automatically Detecting and Merging Cross-Page Tables in PDFs with Python


- Preface
- 1. Required libraries and installation
- 2. A doc2pdf function for converting Microsoft Word documents to PDF
- 3. Functions for locating tables in a PDF, especially cross-page tables
- 4. Detecting and merging cross-page tables in PDF files
- 5. Summary

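Preface

When processing large numbers of PDF documents that contain tabular data, a common challenge is that a table may span several pages. Merging such tables by hand is time-consuming and error-prone. With Python and a few libraries the process can be automated: detect the tables that continue across a page break, then merge them.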

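1. Required libraries and installation

We use the following Python libraries:

pandas: data processing and analysis.

camelot: extracts tables from PDF documents.

pdfplumber: reads text and layout information from PDF documents.

win32com.client: converts Word documents to PDF (Windows only; provided by the pywin32 package).

Make sure these libraries are installed in your environment. You can install them with: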

pip install pandas camelot-py pdfplumber pywin32

Then import the libraries and define the global settings:

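import os
from os import walk

import camelot  # works on text-based PDFs only, not scanned documents
import pandas as pd
import pdfplumber
from win32com.client import Dispatch  # Windows only; requires Microsoft Word

# Folder containing the documents to process
path = r'E:\data\table_json-master'
# Thresholds, as fractions of the page height, used to decide whether a
# table sits at the top or at the bottom of a page
topthreshold = 0.2
dthreshold = 0.8
# Word's constant for the PDF output format (wdFormatPDF)
wdFormatPDF = 17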

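Overview of the steps

Detecting table boundaries

With pdfplumber we can get the bounding box of every table on a page. This tells us whether a table sits at the top or the bottom of the page, and therefore whether it may continue across a page break.

Table detection and cross-page check

camelot is used to extract the tables in the PDF. For each page we check whether its last table ends near the bottom of the page and whether the first table on the next page starts near the top. If both hold, the table may span the page break.

Merging cross-page tables

If a table is found to span pages and the column counts match, we read those pages again with camelot, combine the table data into a single DataFrame, and save the result as CSV and JSON files.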
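The column-count check and the merge step can be tried without any PDF. The sketch below is only an illustration: each table fragment is represented as a plain list of rows, and the helper names columns_match and merge_fragments are hypothetical, not part of camelot or pandas. In the actual pipeline the fragments are DataFrames and the concatenation is done with pd.concat.

```python
def columns_match(upper_rows, lower_rows):
    """True if the last row of the upper fragment and the first row of
    the lower fragment have the same number of cells."""
    if not upper_rows or not lower_rows:
        return False
    return len(upper_rows[-1]) == len(lower_rows[0])


def merge_fragments(upper_rows, lower_rows):
    """Append the lower fragment's rows to the upper fragment's rows,
    or return None when the column counts differ."""
    if not columns_match(upper_rows, lower_rows):
        return None
    return upper_rows + lower_rows


# Hypothetical fragments: the table starts on one page and continues on the next
page1 = [["id", "name"], ["1", "Alice"]]
page2 = [["2", "Bob"], ["3", "Carol"]]
merged = merge_fragments(page1, page2)
print(merged)
```

Matching column counts are a heuristic, not proof that two fragments belong together, which is why the position checks described above are applied as well.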

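2. A doc2pdf function for converting Microsoft Word documents to PDF

Converting Word documents to PDF this way suits batch jobs, where many Word documents need to be converted automatically. Note that it can be slow, depending on the number and size of the documents, because each conversion starts Word, loads the document, saves it as PDF, and then closes both the document and the Word application.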

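def doc2pdf(input_file):
    # Word's COM interface expects absolute paths
    input_file = os.path.abspath(input_file)
    # splitext handles both .doc and .docx; str.replace(".docx", ".pdf")
    # would leave a .doc file name unchanged
    output_file = os.path.splitext(input_file)[0] + ".pdf"
    word = Dispatch('Word.Application')
    try:
        doc = word.Documents.Open(input_file)
        try:
            doc.SaveAs(output_file, FileFormat=wdFormatPDF)
        finally:
            doc.Close()
    finally:
        # Quit Word even if opening or saving fails
        word.Quit()


# Walk the folder and convert every Word file in it to PDF
for root, dirs, filenames in walk(path):
    for file in filenames:
        if file.endswith(".doc") or file.endswith(".docx"):
            doc2pdf(os.path.join(root, file))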

win32com.client works only on Windows and depends on Microsoft Word being installed on the machine.
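One detail worth checking before a batch run is how the output file name is derived. A plain string replacement of ".docx" leaves a .doc file name unchanged, so deriving the name from the file's extension is safer. The helper below is a minimal sketch using only the standard library; the name pdf_path_for and the sample file names are hypothetical.

```python
import os


def pdf_path_for(word_file):
    """Return the path of the PDF that corresponds to a .doc or .docx file."""
    base, ext = os.path.splitext(word_file)
    if ext.lower() not in (".doc", ".docx"):
        raise ValueError(f"not a Word document: {word_file}")
    return base + ".pdf"


print(pdf_path_for("report.docx"))            # report.pdf
print(pdf_path_for("legacy.doc"))             # legacy.pdf
# For comparison: the replacement finds no ".docx" and changes nothing
print("legacy.doc".replace(".docx", ".pdf"))  # legacy.doc
```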

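3. Functions for locating tables in a PDF, especially cross-page tables

get_table_bounds(page):
Takes a pdfplumber page object, finds all tables on the page with find_tables(), and returns a list of their bounding boxes. Each bounding box is a 4-tuple (x0, top, x1, bottom), that is left, top, right, bottom, with top and bottom measured downward from the top of the page.
is_first_table_at_top_of_page(page):
Takes a page object, reads the page height, gets the bounds of the first table, and checks whether the table's top edge lies within the upper part of the page given by topthreshold (top < topthreshold * page height). If so, the table probably starts at the top of the page. The function returns a dictionary holding this boolean; if the page has no tables, the value is False.
is_last_table_at_bottom_of_page(page):
Like is_first_table_at_top_of_page, but checks whether the bottom edge of the last table lies below the dthreshold mark (bottom > dthreshold * page height), to decide whether the table reaches the bottom of the page.
is_table_ending_on_page(pdf_path, page, npage):
Takes the PDF file path, the pdfplumber page object, and the zero-based index npage of that page. It reads the tables on that page with camelot (camelot page numbers start at 1, hence npage + 1). If there are tables, it checks whether the last table on the page is close to the bottom and returns a boolean together with the number of columns in that table's last row.
is_table_starting_on_page(pdf_path, page, npage):
The counterpart of is_table_ending_on_page. Called with the page object and index of the following page, it checks whether the first table on that page starts at the top and returns a boolean together with the number of columns in that table's first row.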

def is_first_table_at_top_of_page(page):
    page_height = page.height
    table_bounds_list = get_table_bounds(page)
    if table_bounds_list:
        first_table_bounds = table_bounds_list[0]
        is_first_table_at_top = first_table_bounds[1] < topthreshold * page_height
        return {
            "is_first_table_at_top": is_first_table_at_top,
        }
    else:
        return {
            "is_first_table_at_top": False,
        }


def is_last_table_at_bottom_of_page(page):
    page_height = page.height
    table_bounds_list = get_table_bounds(page)
    if table_bounds_list:
        last_table_bounds = table_bounds_list[-1]
        is_last_table_at_bottom = last_table_bounds[3] > dthreshold * page_height
        return {
            "is_last_table_at_bottom": is_last_table_at_bottom,
        }
    else:
        return {
            "is_last_table_at_bottom": False,
        }


def get_table_bounds(page):
    table_bounds_list = []
    tables = page.find_tables()
    for table in tables:
        x0, top, x1, bottom = table.bbox
        table_bounds_list.append((x0, top, x1, bottom))
    return table_bounds_list


def is_table_ending_on_page(pdf_path, page, npage):
    # npage is the zero-based index of `page`; camelot page numbers start at 1
    tables = camelot.read_pdf(pdf_path, pages=str(npage + 1))
    if not tables:
        return False, 0
    # Column count of the last row of the last table on the page
    table = tables[-1]
    last_row = table.df.iloc[-1]
    last_table_result = is_last_table_at_bottom_of_page(page)
    return last_table_result['is_last_table_at_bottom'], len(last_row)


def is_table_starting_on_page(pdf_path, page, npage):
    # `page` and `npage` must refer to the same page: the one that may
    # continue a table from the previous page
    tables = camelot.read_pdf(pdf_path, pages=str(npage + 1))
    if not tables:
        return False, 0
    # Column count of the first row of the first table on the page
    table = tables[0]
    first_row = table.df.iloc[0]
    first_table_result = is_first_table_at_top_of_page(page)
    return first_table_result['is_first_table_at_top'], len(first_row)
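The position tests reduce to comparing one coordinate of a bounding box with a fraction of the page height. The following sketch isolates that comparison so it can be tried without a PDF. The function names, the sample bounding boxes, and the page height of 842 points (roughly A4) are assumptions for illustration; as in pdfplumber, top and bottom are measured downward from the top of the page.

```python
TOP_THRESHOLD = 0.2     # same value as topthreshold
BOTTOM_THRESHOLD = 0.8  # same value as dthreshold


def starts_near_top(bbox, page_height, threshold=TOP_THRESHOLD):
    """True if the table's top edge lies in the upper part of the page."""
    x0, top, x1, bottom = bbox
    return top < threshold * page_height


def ends_near_bottom(bbox, page_height, threshold=BOTTOM_THRESHOLD):
    """True if the table's bottom edge lies in the lower part of the page."""
    x0, top, x1, bottom = bbox
    return bottom > threshold * page_height


page_height = 842
table_on_page_1 = (50, 600, 545, 800)  # bottom edge at 800, limit is 673.6
table_on_page_2 = (50, 60, 545, 300)   # top edge at 60, limit is 168.4

print(ends_near_bottom(table_on_page_1, page_height))  # True
print(starts_near_top(table_on_page_2, page_height))   # True
```

With thresholds of 0.2 and 0.8, a table preceded by a long heading block or followed by a tall footer can fall outside the bands, so the values may need tuning per document layout.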

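4. Detecting and merging cross-page tables in PDF files

Get the list of PDF files: a list comprehension over os.listdir(path) keeps every file name that ends in .pdf.
Iterate over the PDF files: for each file, print a message saying which file is being processed, build the full path, and read all tables in the document with camelot.read_pdf.
Check for cross-page tables: open the PDF with pdfplumber and iterate over every page except the last. For each page, use is_table_ending_on_page and is_table_starting_on_page to check whether a table ends on that page and another starts on the next one. If so, and the two tables have the same number of columns (which suggests they are parts of one table), record the pair of page indices.
Merge cross-page tables: if any pairs were found, the code merges them. A list current_merge tracks the run of consecutive pages to merge. When a pair starts a new run, the previous run is merged and saved first.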

# List all PDF files in the folder
pdf_files = [f for f in os.listdir(path) if f.endswith('.pdf')]
print(pdf_files)


def save_merged(pdf_path, pdf_file, pages):
    # Re-read the pages of one run (camelot pages are 1-based) and concatenate
    # the tables found on them. Note that this includes every table on those
    # pages, not only the one that crosses the page break.
    page_numbers = [p + 1 for p in pages]
    merged_table = camelot.read_pdf(pdf_path, pages=",".join(map(str, page_numbers)))
    full_df = pd.concat([table.df for table in merged_table], ignore_index=True)
    base = f'{pdf_file}_merged_{"_".join(map(str, page_numbers))}'
    full_df.to_csv(f'{base}.csv', index=False)
    full_df.to_json(f'{base}.json', orient='records')
    print(f'Merged table saved as {base}.csv and {base}.json')


# Process each PDF file with camelot
for pdf_file in pdf_files:
    cross_page_tables = []
    print(f'Starting table detection for {pdf_file}')
    pdf_path = os.path.join(path, pdf_file)
    tables = camelot.read_pdf(pdf_path, pages='all')
    # Export all tables in the PDF as JSON
    tables.export(f'{pdf_file}.json', f='json')  # json, excel, html, sqlite, markdown
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages[:-1]):
            # The "starting" check must look at the next page, not the current one
            next_page = pdf.pages[i + 1]
            table_ending, ending_col_count = is_table_ending_on_page(pdf_path, page, i)
            table_starting, starting_col_count = is_table_starting_on_page(pdf_path, next_page, i + 1)
            if table_ending and table_starting and ending_col_count == starting_col_count:
                cross_page_tables.append((i, i + 1))
                print(f" Table in {pdf_file} spans pages {i + 1} and {i + 2}; the last row and the next page's first row have the same column count")

    # Group consecutive page pairs into runs and merge each run
    current_merge = []
    for (start_page, end_page) in cross_page_tables:
        if current_merge and start_page == current_merge[-1]:
            current_merge.append(end_page)
        else:
            if current_merge:
                save_merged(pdf_path, pdf_file, current_merge)
            # Start a new run with both pages of the pair
            current_merge = [start_page, end_page]
    # Merge the final run, which the loop above does not flush
    if current_merge:
        save_merged(pdf_path, pdf_file, current_merge)
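The run-tracking step is easy to get wrong, so it can help to test it separately from any PDF handling. The function below is a minimal sketch of that grouping; the name group_page_runs and the sample pairs are hypothetical, and the page indices are zero-based as in the loop over pdf.pages.

```python
def group_page_runs(pairs):
    """Group (page, page + 1) pairs into runs of consecutive pages.

    A pair extends the current run when its first page equals the run's
    last page; otherwise it starts a new run.
    """
    runs = []
    for start_page, end_page in pairs:
        if runs and runs[-1][-1] == start_page:
            runs[-1].append(end_page)
        else:
            runs.append([start_page, end_page])
    return runs


# A table spanning pages 0-2 and another spanning pages 4-5
pairs = [(0, 1), (1, 2), (4, 5)]
print(group_page_runs(pairs))  # [[0, 1, 2], [4, 5]]
```

Collecting all runs first and merging afterwards means that every run, including the last one, is handled in the same way.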

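5. Summary

With the steps above, large numbers of PDF documents can be processed automatically: cross-page tables are detected and merged without manual work, which saves time and avoids the errors that manual merging tends to introduce. The approach is especially useful when many documents have to be analyzed regularly.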