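In the LLM era, PDF remains one of the most widely used document formats, yet parsing PDFs into structured data has long been a problem.
An earlier post explored using unstructured to parse unstructured HTML: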
https://blog.csdn.net/liliang199/article/details/152803902
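This post tries using unstructured to parse PDFs and structure their content.
1 Environment setup
The setup uses a conda environment with Python 3.12 and assumes conda and Python are already installed. Install unstructured with the command below.
1.1 Installing unstructured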
pip install "unstructured[local-inference]" -i https://pypi.tuna.tsinghua.edu.cn/simple
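After the install finishes, it can help to confirm that the package is visible from the active conda environment. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and not part of unstructured.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# "unstructured" is the distribution name used in the pip command above
version = installed_version("unstructured")
print("unstructured:", version if version else "not installed")
```

If this prints "not installed", the pip command most likely ran against a different Python environment than the one currently active.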
1.2 Installing tesseract
unstructured relies on tesseract when parsing PDFs. To keep things simple, tesseract is installed through conda here, pinned to version 5.2.0.
conda install tesseract=5.2.0
1.3 Verifying the version
tesseract --version
Output:
tesseract 5.2.0
leptonica-1.82.0
libgif 5.2.1 : libjpeg 9e : libpng 1.6.39 : libtiff 4.4.0 : zlib 1.2.13 : libwebp 1.3.2 : libopenjp2 2.4.0
Found AVX512BW
Found AVX512F
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.8.1 zlib/1.2.13 liblzma/5.6.4 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.6 libxml2/2.13.8 openssl/3.0.17 libb2/bundled
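The same check can be scripted, for example to fail early when the tesseract on PATH is older than expected. This is a sketch under the assumption that the first line of the banner has the form "tesseract X.Y.Z" as in the output above; both function names are hypothetical.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_tesseract_version(output: str) -> Optional[Tuple[int, ...]]:
    """Extract the version tuple from the first line of `tesseract --version`."""
    lines = output.strip().splitlines()
    first_line = lines[0] if lines else ""
    match = re.match(r"tesseract\s+v?(\d+(?:\.\d+)*)", first_line)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def tesseract_version() -> Optional[Tuple[int, ...]]:
    """Run `tesseract --version` if the binary is on PATH; otherwise return None."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        ["tesseract", "--version"], capture_output=True, text=True, check=False
    )
    # Fall back to stderr in case the banner is not written to stdout
    return parse_tesseract_version(result.stdout or result.stderr)


print(tesseract_version())
```

Because the result is a tuple of integers, it can be compared directly, e.g. `tesseract_version() >= (5, 2, 0)` once the value is known not to be None.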
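1.4 OCR parsing
Use tesseract to recognize the text in case.png and write it to output.txt. The second argument is the output base name; tesseract appends the .txt extension itself.
tesseract case.png output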
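The OCR step can also be driven from Python by calling the tesseract CLI through subprocess. The sketch below assumes the English language pack ("-l eng") and the file names case.png and output from this section; the two helper functions are hypothetical and simply wrap the command line.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_ocr_command(image: str, output_base: str, lang: str = "eng") -> List[str]:
    """Build the tesseract CLI call; tesseract appends .txt to output_base."""
    return ["tesseract", image, output_base, "-l", lang]


def ocr_to_text(image: str, output_base: str = "output") -> Optional[str]:
    """Run OCR and return the recognized text, or None if it cannot run."""
    if shutil.which("tesseract") is None or not Path(image).is_file():
        return None
    subprocess.run(build_ocr_command(image, output_base), check=True)
    return Path(output_base + ".txt").read_text(encoding="utf-8")


text = ocr_to_text("case.png")
print(text if text is not None else "tesseract or case.png not available")
```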
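2 Verifying unstructured parsing
The test input is the NVIDIA A100 whitepaper PDF, stored locally as data/nvidia-ampere-architecture-whitepaper.pdf.
2.1 PDF parsing example
unstructured parses the PDF and outputs each piece of text together with its category.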
from unstructured.partition.auto import partition

# Detect the file type and split the PDF into typed elements
elements = partition("data/nvidia-ampere-architecture-whitepaper.pdf")
for e in elements:
    print("--")
    print(e.category, "=>", e.text)
Sample output follows. unstructured parses the A100 whitepaper fairly accurately and assigns each piece of text a preliminary category.
--
Title => NVIDIA A100 Tensor Core GPU Architecture
--
Title => UNPRECEDENTED ACCELERATION AT EVERY SCALE
--
UncategorizedText => V1.0
--
Title => Table of Contents
--
Title => Introduction
--
UncategorizedText => 7
--
...
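To get a quick overview of how the text was classified, the categories can be tallied. The sketch below works on (category, text) pairs copied from the sample output above, so it runs without the PDF; the helper names are hypothetical.

```python
from collections import Counter

# (category, text) pairs copied from the sample output above
sample = [
    ("Title", "NVIDIA A100 Tensor Core GPU Architecture"),
    ("Title", "UNPRECEDENTED ACCELERATION AT EVERY SCALE"),
    ("UncategorizedText", "V1.0"),
    ("Title", "Table of Contents"),
    ("Title", "Introduction"),
    ("UncategorizedText", "7"),
]


def count_categories(pairs):
    """Count how many elements fall into each category."""
    return Counter(category for category, _ in pairs)


def titles(pairs):
    """Return the text of every element classified as a Title."""
    return [text for category, text in pairs if category == "Title"]


counts = count_categories(sample)
print(counts)
print(titles(sample))
```

With real parser output, the pairs can be built as `[(e.category, e.text) for e in elements]`, using the same two attributes printed in the example above.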
2.2 Structuring the PDF
Parse the PDF first, then convert the resulting elements into JSON-style dictionaries.
from unstructured.partition.auto import partition
from unstructured.staging.base import convert_to_dict

elements = partition("data/nvidia-ampere-architecture-whitepaper.pdf")
# Convert each element into a plain dict with type, element_id, metadata and text
json_data = convert_to_dict(elements)
for d in json_data:
    print(d)
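The output is shown below. unstructured structures the PDF reasonably well: each record carries the element type, an element ID, the text, and metadata such as coordinates, page number, and, where applicable, the ID of a parent element.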
{'type': 'Title', 'element_id': '443ebe1170f17e6155c4e4186b030350', 'metadata': {'coordinates': {'points': ((72.0, 312.2336), (72.0, 381.8192000000001), (521.1721599999998, 381.8192000000001), (521.1721599999998, 312.2336)), 'system': 'PixelSpace', 'layout_width': 612.0, 'layout_height': 792.0}, 'filename': 'nvidia-ampere-architecture-whitepaper.pdf', 'file_directory': 'data', 'last_modified': '2025-10-10T11:13:16', 'filetype': 'application/pdf', 'page_number': 1}, 'text': 'NVIDIA A100 Tensor Core GPU Architecture'}
{'type': 'Title', 'element_id': '385a137a7020946b27cb099e1968267f', 'metadata': {'coordinates': {'points': ((72.0, 397.216), (72.0, 413.216), (494.048, 413.216), (494.048, 397.216)), 'system': 'PixelSpace', 'layout_width': 612.0, 'layout_height': 792.0}, 'filename': 'nvidia-ampere-architecture-whitepaper.pdf', 'file_directory': 'data', 'last_modified': '2025-10-10T11:13:16', 'filetype': 'application/pdf', 'page_number': 1}, 'text': 'UNPRECEDENTED ACCELERATION AT EVERY SCALE'}
{'type': 'UncategorizedText', 'element_id': '35a6fd96d1fdcfa5e7af2835bcbb523b', 'metadata': {'coordinates': {'points': ((516.8048, 746.0416), (516.8048, 757.2416), (543.1136, 757.2416), (543.1136, 746.0416)), 'system': 'PixelSpace', 'layout_width': 612.0, 'layout_height': 792.0}, 'filename': 'nvidia-ampere-architecture-whitepaper.pdf', 'file_directory': 'data', 'last_modified': '2025-10-10T11:13:16', 'filetype': 'application/pdf', 'parent_id': '385a137a7020946b27cb099e1968267f', 'page_number': 1}, 'text': 'V1.0'}
{'type': 'Title', 'element_id': 'a9360e0212a4d173581c91da68494d02', 'metadata': {'coordinates': {'points': ((72.0, 79.0), (72.0, 103.0), (269.064, 103.0), (269.064, 79.0)), 'system': 'PixelSpace', 'layout_width': 612.0, 'layout_height': 792.0}, 'filename': 'nvidia-ampere-architecture-whitepaper.pdf', 'file_directory': 'data', 'last_modified': '2025-10-10T11:13:16', 'filetype': 'application/pdf', 'page_number': 2}, 'text': 'Table of Contents'}
{'type': 'Title', 'element_id': 'b605350bc00209520b7cd8f546322663', 'metadata': {'coordinates': {'points': ((72.0, 130.83680000000015), (72.0, 142.03680000000008), (132.71519999999998, 142.03680000000008), (132.71519999999998, 130.83680000000015)), 'system': 'PixelSpace', 'layout_width': 612.0, 'layout_height': 792.0}, 'filename': 'nvidia-ampere-architecture-whitepaper.pdf', 'file_directory': 'data', 'last_modified': '2025-10-10T11:13:16', 'filetype': 'application/pdf', 'page_number': 2}, 'text': 'Introduction'}
{'type': 'UncategorizedText', 'element_id': '7902699be42c8a8e46fbbb4501726517', 'metadata': {'coordinates': {'points': ((532.8016, 130.83680000000015), (532.8016, 142.03680000000008), (539.0288, 142.03680000000008), (539.0288, 130.83680000000015)), 'system': 'PixelSpace', 'layout_width': 612.0, 'layout_height': 792.0}, 'filename': 'nvidia-ampere-architecture-whitepaper.pdf', 'file_directory': 'data', 'last_modified': '2025-10-10T11:13:16', 'filetype': 'application/pdf', 'parent_id': 'b605350bc00209520b7cd8f546322663', 'page_number': 2}, 'text': '7'}
{'type': 'UncategorizedText', 'element_id': '5ec467182527e70db1988abb924fb943', 'metadata': {'coordinates': {'points': ((83.20000000000005, 150.84000000000015), (83.20000000000005, 176.44320000000005), (527.09072, 176.44320000000005), (527.09072, 150.84000000000015)), 'system': 'PixelSpace', 'layout_width': 612.0, 'layout_height': 792.0}, 'filename': 'nvidia-ampere-architecture-whitepaper.pdf', 'file_directory': 'data', 'last_modified': '2025-10-10T11:13:16', 'filetype': 'application/pdf', 'parent_id': 'b605350bc00209520b7cd8f546322663', 'page_number': 2}, 'text': 'Introducing NVIDIA A100 Tensor Core GPU - our 8th Generation Data Center GPU for the Age of Elastic Computing'}
...
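The parent_id field in the metadata links body text back to the title it belongs to, which makes it possible to rebuild a rough outline. The sketch below uses records abbreviated from the output above (coordinates and file metadata omitted) and a hypothetical helper, children_by_parent, to group child texts under their parent title and print the result as JSON.

```python
import json
from collections import defaultdict

# Records abbreviated from the output above (coordinates and file metadata omitted)
records = [
    {"type": "Title", "element_id": "385a137a7020946b27cb099e1968267f",
     "metadata": {"page_number": 1},
     "text": "UNPRECEDENTED ACCELERATION AT EVERY SCALE"},
    {"type": "UncategorizedText", "element_id": "35a6fd96d1fdcfa5e7af2835bcbb523b",
     "metadata": {"parent_id": "385a137a7020946b27cb099e1968267f", "page_number": 1},
     "text": "V1.0"},
    {"type": "Title", "element_id": "b605350bc00209520b7cd8f546322663",
     "metadata": {"page_number": 2},
     "text": "Introduction"},
    {"type": "UncategorizedText", "element_id": "7902699be42c8a8e46fbbb4501726517",
     "metadata": {"parent_id": "b605350bc00209520b7cd8f546322663", "page_number": 2},
     "text": "7"},
]


def children_by_parent(items):
    """Map each parent element's text to the texts of its child elements."""
    text_by_id = {item["element_id"]: item["text"] for item in items}
    grouped = defaultdict(list)
    for item in items:
        parent_id = item["metadata"].get("parent_id")
        # Elements without a known parent (e.g. titles) are skipped
        if parent_id in text_by_id:
            grouped[text_by_id[parent_id]].append(item["text"])
    return dict(grouped)


outline = children_by_parent(records)
print(json.dumps(outline, indent=2))
```

The same function can be applied to the full list of dictionaries produced in this section, since it only relies on the element_id, text, and metadata keys visible in the output.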
Appendix
Issue 1: TESSDATA_PREFIX environment variable error
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory. Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.
In this case the cause was a tesseract version that was too old; installing a newer version fixed it.
conda install tesseract=5.2.0
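Before reinstalling, it may be worth checking whether the English language data is reachable at all. The sketch below is a diagnostic under an assumption that goes beyond this post: depending on the tesseract version, TESSDATA_PREFIX is taken to point either at the tessdata directory itself or at its parent, so both locations are checked. The helper find_traineddata is hypothetical.

```python
import os
from pathlib import Path
from typing import List, Mapping


def find_traineddata(env: Mapping[str, str], lang: str = "eng") -> List[Path]:
    """Return existing <lang>.traineddata files reachable from TESSDATA_PREFIX."""
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        return []
    base = Path(prefix)
    # Assumption: the variable may name the tessdata directory or its parent,
    # so both candidate locations are checked.
    candidates = [
        base / f"{lang}.traineddata",
        base / "tessdata" / f"{lang}.traineddata",
    ]
    return [path for path in candidates if path.is_file()]


found = find_traineddata(os.environ)
print("TESSDATA_PREFIX =", os.environ.get("TESSDATA_PREFIX"))
print("eng.traineddata found at:", [str(p) for p in found] or "nowhere")
```

An empty result only shows that the variable is unset or does not lead to the language file; tesseract may still locate its data through a compiled-in default path.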
References
Parsing unstructured HTML with unstructured in the LLM era
https://blog.csdn.net/liliang199/article/details/152803902
Complete guide to deploying open-source tesseract-ocr on CentOS 7
https://www.jjblogs.com/post/2022230
Complete guide to deploying open-source tesseract-ocr on CentOS 7 (CSDN)
https://blog.csdn.net/qq_33547169/article/details/132111551
Quick-start guide to the Unstructured open-source library