A Roundup of Python-Based PDF Parsers

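Most published scientific literature currently exists as PDF, a lightweight and ubiquitous file format that preserves a consistent text layout and formatting. For human readers, the clean, consistent layout of a PDF aids reading: it is easy to skim a paper and pick out headings and figures. For a computer, however, a PDF is a very noisy file that generally carries no information about the logical structure of its text. We therefore want to re-extract the text, images, tables, annotations, table of contents, and other data from these published PDFs and turn them into structured information for machine learning, for example in natural language processing (NLP) or large language models (LLMs), the applications that currently demand the largest volumes of text.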

1. Nougat

Nougat (Neural Optical Understanding for Academic Documents) is a model from Meta based on a Vision Transformer (ViT). It performs optical character recognition (OCR) to convert scientific papers into a markup language.

Latest release: August 22, 2023
GitHub address: facebookresearch/nougat (Implementation of Nougat Neural Optical Understanding for Academic Documents)
Project page: Nougat

1.1 Installation

The dependencies and requirements to check before installing are:

python_requires=">=3.7",
"transformers>=4.25.1",
"timm==0.5.4",
"orjson",
"opencv-python-headless",
"datasets[vision]",
"lightning>=2.0.0,<2022",
"nltk",
"python-Levenshtein",
"sentencepiece",
"sconf>=0.2.3",
"albumentations>=1.0.0",
"pypdf>=3.1.0",
"pypdfium2"

Install:

# create a new environment
conda create -n nougat python=3.9
# activate the environment
conda activate nougat

# from pip:
pip install nougat-ocr

# or from github repository
pip install git+https://github.com/facebookresearch/nougat

1.2 Testing

nougat path/to/file.pdf --out output_directory

1.3 Usage

usage: nougat [-h] [--batchsize BATCHSIZE] [--checkpoint CHECKPOINT] [--model MODEL] [--out OUT]
              [--recompute] [--markdown] [--no-skipping] pdf [pdf ...]

positional arguments:
  pdf                   PDF(s) to process.

options:
  -h, --help            show this help message and exit
  --batchsize BATCHSIZE, -b BATCHSIZE
                        Batch size to use.
  --checkpoint CHECKPOINT, -c CHECKPOINT
                        Path to checkpoint directory.
  --model MODEL_TAG, -m MODEL_TAG
                        Model tag to use.
  --out OUT, -o OUT     Output directory.
  --recompute           Recompute already computed PDF, discarding previous predictions.
  --full-precision      Use float32 instead of bfloat16. Can speed up CPU conversion for some setups.
  --no-markdown         Do not add postprocessing step for markdown compatibility.
  --markdown            Add postprocessing step for markdown compatibility (default).
  --no-skipping         Don't apply failure detection heuristic.
  --pages PAGES, -p PAGES
                        Provide page numbers like '1-4,7' for pages 1 through 4 and page 7. Only works
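
For batch runs it can help to drive the CLI from Python. The sketch below assembles a command line using only flags that appear in the help text above. `build_nougat_command` and `run_nougat` are hypothetical helper names, not part of nougat, and the sketch assumes the `nougat` executable is on `PATH`.

```python
import subprocess
from pathlib import Path
from typing import List, Optional, Sequence


def build_nougat_command(
    pdfs: Sequence[str],
    out_dir: str,
    batch_size: Optional[int] = None,
    pages: Optional[str] = None,
    recompute: bool = False,
    no_skipping: bool = False,
) -> List[str]:
    """Assemble a nougat command line (hypothetical helper, not a nougat API)."""
    cmd = ["nougat", *pdfs, "--out", out_dir]
    if batch_size is not None:
        cmd += ["--batchsize", str(batch_size)]
    if pages is not None:
        # Page spec in the CLI's own notation, e.g. "1-4,7"
        cmd += ["--pages", pages]
    if recompute:
        cmd.append("--recompute")
    if no_skipping:
        cmd.append("--no-skipping")
    return cmd


def run_nougat(pdfs: Sequence[str], out_dir: str, **options) -> Path:
    """Run the CLI and return the output directory; raises if nougat fails."""
    subprocess.run(build_nougat_command(pdfs, out_dir, **options), check=True)
    return Path(out_dir)
```

Building the argument list separately from running it keeps the command easy to log and inspect before any GPU time is spent, for example `build_nougat_command(["paper.pdf"], "out", pages="1-4,7")`.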

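1.4 Strengths and Limitations

1. Nougat's training data is almost entirely English-language literature, so its accuracy on other languages remains unverified. Chinese in particular differs greatly from English and other Latin-script languages, so results on Chinese documents are hard to predict.
2. The training data also consists entirely of scientific papers (from arXiv, PMC, and IDL), so recognition accuracy is high for scientific papers, while accuracy on other kinds of PDF documents remains to be verified and improved.
3. Because the method is based on deep learning, it needs GPU compute in practice, and it is usually slower than classical approaches such as GROBID.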

2. ScienceBeam Parser

GitHub address: ScienceBeam

2.1 Installation

pip install sciencebeam-parser

2.2 Testing

Python API: starting the server

from sciencebeam_parser.config.config import AppConfig
from sciencebeam_parser.resources.default_config import DEFAULT_CONFIG_FILE
from sciencebeam_parser.service.server import create_app

config = AppConfig.load_yaml(DEFAULT_CONFIG_FILE)
app = create_app(config)
app.run(port=8080, host='127.0.0.1', threaded=True)

Python API: parsing a PDF file

from sciencebeam_parser.resources.default_config import DEFAULT_CONFIG_FILE
from sciencebeam_parser.config.config import AppConfig
from sciencebeam_parser.utils.media_types import MediaTypes
from sciencebeam_parser.app.parser import ScienceBeamParser

config = AppConfig.load_yaml(DEFAULT_CONFIG_FILE)

# the parser contains all of the models
sciencebeam_parser = ScienceBeamParser.from_config(config)

# a session provides a scope and temporary directory for intermediate files
# it is recommended to create a separate session for every document
with sciencebeam_parser.get_new_session() as session:
    session_source = session.get_source(
        'example.pdf',
        MediaTypes.PDF
    )
    converted_file = session_source.get_local_file_for_response_media_type(
        MediaTypes.TEI_XML
    )
    # Note: the converted file will be in the temporary directory of the session
    print('converted file:', converted_file)
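
The converted file is TEI XML, which can be inspected with the standard library. The following is a minimal sketch, assuming the output follows the common TEI layout (title under `titleStmt`, section headings as `head` elements inside `div` elements of `body`). `summarize_tei` and the `SAMPLE` document are made up for illustration and are not part of ScienceBeam Parser.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Union

# Standard TEI namespace; the element layout below is an assumption.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written sample for illustration, not real parser output.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>A Paper</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <div><head>Introduction</head><p>Hello.</p></div>
    <div><head>Methods</head><p>World.</p></div>
  </body></text>
</TEI>"""


def summarize_tei(tei_xml: Union[str, bytes]) -> Dict[str, object]:
    """Return the document title and the list of section headings."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    headings = [
        (head.text or "").strip()
        for head in root.iterfind(".//tei:body//tei:div/tei:head", TEI_NS)
    ]
    return {"title": title.strip(), "headings": headings}


print(summarize_tei(SAMPLE))
```

Because the converted file lives in the session's temporary directory, its contents would need to be read (for example with `pathlib.Path(converted_file).read_bytes()`) before the `with` block exits.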

3. pdfrw

3.1 Installation

pip install pdfrw

3.2 Testing

from pdfrw import PdfReader
def get_pdf_info(path):
    pdf = PdfReader(path)

    print(pdf.keys())
    print(pdf.Info)
    print(pdf.Root.keys())
    print('PDF has {} pages'.format(len(pdf.pages)))

if __name__ == '__main__':
    get_pdf_info('example.pdf')

4. PDFQuery

4.1 Installation

pip install pdfquery

4.2 Testing

from pdfquery import PDFQuery

pdf = PDFQuery('example.pdf')
pdf.load()

# Use CSS-like selectors to locate the elements
text_elements = pdf.pq('LTTextLineHorizontal')

# Extract the text from the elements
text = [t.text for t in text_elements]

print(text)
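
Selectors can also be restricted to a region of the page with PDFQuery's `:in_bbox` pseudo-class. This is a minimal sketch, assuming coordinates are given in PDF points in the order `x0, y0, x1, y1`. The helper names `bbox_selector` and `text_in_bbox` are hypothetical.

```python
def bbox_selector(x0, y0, x1, y1, tag="LTTextLineHorizontal"):
    """Build a selector for `tag` elements that lie inside the given box."""
    return f'{tag}:in_bbox("{x0}, {y0}, {x1}, {y1}")'


def text_in_bbox(pdf_path, x0, y0, x1, y1):
    """Load a PDF and return the text of the lines inside the box."""
    # Imported lazily so bbox_selector works without pdfquery installed.
    from pdfquery import PDFQuery

    pdf = PDFQuery(pdf_path)
    pdf.load()
    return pdf.pq(bbox_selector(x0, y0, x1, y1)).text()


print(bbox_selector(0, 700, 300, 730))
```

This is useful for forms and fixed-layout documents, where a field's value always sits at a known offset from its label.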

5. pdfminer.six

GitHub address: pdfminer.six
Latest release: December 28, 2023

5.1 Installation

pip install pdfminer.six

5.2 Testing

from pdfminer.high_level import extract_text

text = extract_text("example.pdf")
print(text)

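5.3 Features

Supports various font types (Type1, TrueType, Type3, and CID).
Supports image extraction (JPG, JBIG2, bitmaps).
Supports various compression schemes (ASCIIHexDecode, ASCII85Decode, LZWDecode, FlateDecode, RunLengthDecode, CCITTFaxDecode).
Supports RC4 and AES encryption.
Supports extracting AcroForm interactive forms.
Extracts the table of contents.
Extracts tagged content.
Automatic layout analysis.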
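
Layout analysis is exposed through `pdfminer.high_level.extract_pages`, which yields one layout object per page. The sketch below collects the text blocks it finds. `iter_text_blocks` and `text_blocks_from_pdf` are hypothetical names, and the duck-typed `get_text` check is a simplification of the `isinstance(element, LTTextContainer)` test shown in the pdfminer.six documentation.

```python
from typing import Iterable, Iterator, List


def iter_text_blocks(pages: Iterable[Iterable[object]]) -> Iterator[str]:
    """Yield the stripped text of every layout element that has get_text()."""
    for page_layout in pages:
        for element in page_layout:
            get_text = getattr(element, "get_text", None)
            if callable(get_text):
                text = get_text().strip()
                if text:  # skip whitespace-only blocks
                    yield text


def text_blocks_from_pdf(path: str) -> List[str]:
    """Run pdfminer.six layout analysis on a file and return its text blocks."""
    # Imported lazily so iter_text_blocks works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages

    return list(iter_text_blocks(extract_pages(path)))
```

Compared with `extract_text`, iterating over layout elements keeps block boundaries, which is often what downstream paragraph-level processing needs.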

6. SciPDF Parser

Built on GROBID (GeneRation Of BIbliographic Data)

Github address: SciPDF Parser
Latest release:

6.1 Installation

# from pip
pip install scipdf-parser

# or from the GitHub repository
pip install git+https://github.com/titipata/scipdf_parser

6.2 Testing

GROBID must be running before a PDF can be parsed:

bash serve_grobid.sh

The script starts GROBID on its default port, 8070.

The following Python script parses a PDF file.

import scipdf
article_dict = scipdf.parse_pdf_to_dict('example_data/futoma2017improved.pdf') # return dictionary

# option to parse directly from URL to PDF, if as_list is set to True, output 'text' of parsed section will be in a list of paragraphs instead
article_dict = scipdf.parse_pdf_to_dict('https://www.biorxiv.org/content/biorxiv/early/2018/11/20/463760.full.pdf', as_list=False)

# output example
>> {
    'title': 'Proceedings of Machine Learning for Healthcare',
    'abstract': '...',
    'sections': [
        {'heading': '...', 'text': '...'},
        {'heading': '...', 'text': '...'},
        ...
    ],
    'references': [
        {'title': '...', 'year': '...', 'journal': '...', 'author': '...'},
        ...
    ],
    'figures': [
        {'figure_label': '...', 'figure_type': '...', 'figure_id': '...', 'figure_caption': '...', 'figure_data': '...'},
        ...
    ],
    'doi': '...'
}

xml = scipdf.parse_pdf('example.pdf', soup=True) # option to parse full XML from GROBID
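
The dictionary shown above is convenient input for NLP pipelines once it is flattened into per-section records. The following is a small sketch that relies only on the `abstract`, `sections`, `heading`, and `text` keys from the output example. `article_to_chunks` is a hypothetical helper, not part of scipdf.

```python
from typing import Dict, List


def article_to_chunks(article: Dict) -> List[Dict[str, str]]:
    """Flatten a parsed article into a list of {'heading', 'text'} records."""
    chunks = []
    abstract = (article.get("abstract") or "").strip()
    if abstract:
        chunks.append({"heading": "Abstract", "text": abstract})
    for section in article.get("sections", []):
        text = section.get("text") or ""
        if isinstance(text, list):
            # With as_list=True each section's text is a list of paragraphs.
            text = "\n\n".join(text)
        text = text.strip()
        if text:  # drop sections with no body text
            chunks.append({"heading": section.get("heading") or "", "text": text})
    return chunks
```

Calling `article_to_chunks(article_dict)` on the result of `parse_pdf_to_dict` would then give one record per non-empty section, with the abstract first.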

7. pdfplumber

GitHub address: pdfplumber
Latest release: March 7, 2024

7.1 Installation

pip install pdfplumber

7.2 Testing

pdfplumber < example.pdf > background-checks.csv

7.3 Usage

Option	Description
`--format [format]`	`csv` or `json`. The `json` format returns more information; it includes PDF-level and page-level metadata, plus dictionary-nested attributes.
`--pages [list of pages]`	A space-delimited, `1`-indexed list of pages or hyphenated page ranges. E.g., `1, 11-15`, which would return data for pages 1, 11, 12, 13, 14, and 15.
`--types [list of object types to extract]`	Choices are `char`, `rect`, `line`, `curve`, `image`, `annot`, et cetera. Defaults to all available.
`--laparams`	A JSON-formatted string (e.g., `'{"detect_vertical": true}'`) to pass to `pdfplumber.open(..., laparams=...)`.
`--precision [integer]`	The number of decimal places to round floating-point numbers. Defaults to no rounding.

7.4 Python package usage

import pdfplumber

with pdfplumber.open("example.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.chars[0])
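
Each entry in `page.chars` is a dictionary describing one character, including its `text` and its `x0` and `top` positions. To illustrate how such low-level objects relate to readable text, the sketch below groups characters into lines. `chars_to_lines` and its `y_tolerance` parameter are hypothetical and far simpler than pdfplumber's own `page.extract_text()`, which is the better choice in practice.

```python
from typing import Dict, Iterable, List


def chars_to_lines(chars: Iterable[Dict], y_tolerance: float = 3.0) -> List[str]:
    """Group character dicts into lines by 'top', ordering each line by 'x0'."""
    lines, current, current_top = [], [], None
    for ch in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        # Start a new line when the character sits too far below the line's start.
        if current and abs(ch["top"] - current_top) > y_tolerance:
            lines.append(current)
            current = []
        if not current:
            current_top = ch["top"]
        current.append(ch)
    if current:
        lines.append(current)
    return ["".join(c["text"] for c in sorted(line, key=lambda c: c["x0"])) for line in lines]


def first_page_lines(path: str) -> List[str]:
    """Open a PDF with pdfplumber and return the first page's lines."""
    # Imported lazily so chars_to_lines works without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return chars_to_lines(pdf.pages[0].chars)
```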

8. borb

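8.0 Introduction

borb is a pure Python library for reading, writing, and manipulating PDF documents. It represents a PDF document as a JSON-like data structure of nested lists, dictionaries, and primitives (numbers, strings, booleans, etc.).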

Github address: borb
Latest release: May 2024

8.1 Installation

Download: borb · PyPI

# from pip
pip install borb

# reinstall the latest version (bypassing pip's cache)
pip uninstall borb
pip install --no-cache borb

8.2 Testing (creating a PDF)

from pathlib import Path

from borb.pdf import Document
from borb.pdf import Page
from borb.pdf import SingleColumnLayout
from borb.pdf import Paragraph
from borb.pdf import PDF

# create an empty Document
pdf = Document()

# add an empty Page
page = Page()
pdf.add_page(page)

# use a PageLayout (SingleColumnLayout in this case)
layout = SingleColumnLayout(page)

# add a Paragraph object
layout.add(Paragraph("Hello World!"))

# store the PDF
with open(Path("output.pdf"), "wb") as pdf_file_handle:
    PDF.dumps(pdf_file_handle, pdf)

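8.3 Features

Read a PDF and extract its metadata
Modify metadata
Extract text from a PDF
Extract images from a PDF
Change images in a PDF
Add annotations (notes, links, etc.) to a PDF
Add text to a PDF
Add tables to a PDF
Add lists to a PDF
Use a page layout manager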

9. PyPDF4

Github address：PyPDF4
Latest release: August 8, 2018

9.1 Installation

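pip install PyPDF4

9.2 Testing

from PyPDF4 import PdfFileReader

# PyPDF4 keeps the legacy PdfFileReader / getPage / extractText API
with open("example.pdf", "rb") as pdf_file:
    reader = PdfFileReader(pdf_file)
    page = reader.getPage(0)
    print(page.extractText())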