MinerU: An Open-Source PDF Parsing Tool from China

Preface

Extracting data from PDF files is hard, which is why almost every vendor sells PDF-to-Word conversion as a paid feature.

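PDF's rendering model is based on a subset of PostScript, and PostScript is a Turing-complete language. The rendering that Word needs is essentially a subset of what PDF can express. In the large-model field, the target file format is usually Markdown, which is simpler than Word and is in effect a subset of it.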

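Converting from a subset to its superset is easy, because every feature of the subset also exists in the superset. Converting from a superset to a subset is hard, because the subset lacks many of the superset's features.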

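Parsing a PDF by mapping its elements one by one is therefore impractical. Instead, researchers at the Shanghai Artificial Intelligence Laboratory proposed using several deep learning models to directly analyze and recognize the text, images, formulas, and tables on each PDF page, and then merge the results back into a final Markdown file.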

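In short, PaddleOCR handles text detection and recognition, while TableMaster handles table structure parsing and content merging. Together they provide full recognition and understanding of tables in document images.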
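The "recognize each region, then merge into Markdown" idea described above can be illustrated with a minimal sketch. This is not MinerU's actual implementation: the `Block` structure, its field names, and the top-to-bottom ordering rule are all assumptions made for illustration, and real reading-order recovery is considerably more involved.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One recognized region on a page (hypothetical structure)."""
    kind: str      # "title", "text", "formula", "table", or "image"
    y_top: float   # top edge of the region, used as a crude reading order
    content: str   # recognized text, LaTeX, table markup, or image path


def blocks_to_markdown(blocks: list[Block]) -> str:
    """Render recognized blocks from top to bottom as one Markdown string."""
    parts = []
    for block in sorted(blocks, key=lambda b: b.y_top):
        if block.kind == "title":
            parts.append(f"# {block.content}")
        elif block.kind == "formula":
            parts.append(f"$$\n{block.content}\n$$")
        elif block.kind == "image":
            parts.append(f"![]({block.content})")
        else:  # plain text and tables are emitted unchanged
            parts.append(block.content)
    return "\n\n".join(parts)


# Blocks arrive in detection order, not reading order.
page = [
    Block("text", 300.0, "Body paragraph."),
    Block("title", 100.0, "A Title"),
    Block("formula", 500.0, "E = mc^2"),
]
markdown = blocks_to_markdown(page)
print(markdown)
```

Sorting by the top edge alone only works for single-column pages; it is used here just to keep the sketch short.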

Models Used in MinerU

| Model | Function | Details |
| --- | --- | --- |
| LayoutLMv3 | Layout detection | unilm/layoutlmv3 at master · microsoft/unilm (github.com) |
| UniMERNet | Formula recognition | opendatalab/UniMERNet: UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition (github.com) |
| StructEqTable | Table recognition | Alpha-Innovator/StructEqTable-Deploy: A High-efficiency Open-source Toolkit for Table-to-Latex Task (github.com) |
| YOLO | Formula detection | ultralytics/ultralytics: Ultralytics YOLO11 🚀 (github.com) |
| PaddleOCR | OCR | PaddlePaddle/PaddleOCR: Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices) (github.com) |
| DocLayout-YOLO | Layout detection | opendatalab/DocLayout-YOLO: DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception (github.com) |

Feeding the DeepSeek-V2 paper into MinerU produces the following outputs:

1. images directory

Images extracted from the PDF.
2. DeepSeek-AI 等 - 2024 - DeepSeek-V2 A Strong, Economical, and Efficient M.md

The final Markdown output file.
3. DeepSeek-AI 等 - 2024 - DeepSeek-V2 A Strong, Economical, and Efficient M_content_list.json

Unknown.
4. DeepSeek-AI 等 - 2024 - DeepSeek-V2 A Strong, Economical, and Efficient M_layout.pdf

The layout analysis result.
5. DeepSeek-AI 等 - 2024 - DeepSeek-V2 A Strong, Economical, and Efficient M_middle.json

It contains the following fields:

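| Field | Description |
| --- | --- |
| pdf_info | list; each element is a dict holding the parsing result for one PDF page (see the table below) |
| _parse_type | ocr \| txt; identifies the mode used for the intermediate state of this parse |
| _version_name | string; the version of magic-pdf used for this parse |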
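As a rough illustration of how the three top-level fields listed above might be read, here is a minimal sketch using only the standard `json` module. The helper name `summarize_middle` and the sample values (including the placeholder version string) are hypothetical; in practice the text would come from the `_middle.json` file itself.

```python
import json


def summarize_middle(middle: dict) -> dict:
    """Collect the three top-level fields of a parsed middle.json dict."""
    return {
        "pages": len(middle.get("pdf_info", [])),   # one list entry per page
        "parse_type": middle.get("_parse_type"),    # "ocr" or "txt"
        "version": middle.get("_version_name"),     # magic-pdf version string
    }


# Hypothetical stand-in for the contents of a *_middle.json file.
sample_text = '{"pdf_info": [{}, {}], "_parse_type": "txt", "_version_name": "0.0.0"}'
summary = summarize_middle(json.loads(sample_text))
print(summary)
```

To use it on a real output, replace `sample_text` with the text read from the file, for example via `open(path, encoding="utf-8").read()`.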

6. DeepSeek-AI 等 - 2024 - DeepSeek-V2 A Strong, Economical, and Efficient M_model.json

Detection box coordinates for all elements. An excerpt (truncated):

    [
      {
          "layout_dets": [
              {
                  "category_id": 1,
                  "poly": [193, 793, 1462, 793, 1462, 1354, 193, 1354],
                  "score": 0.983
              },
              {
                  "category_id": 0,
                  "poly": [319, 314, 1340, 314, 1340, 424, 319, 424],
                  "score": 0.968
              },
              {
                  "category_id": 3,
                  "poly": [207, 1410, 1444, 1410, 1444, 1976, 207, 1976],
                  "score": 0.966
              },
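Each `poly` in the excerpt above holds eight numbers. A minimal sketch of turning them into axis-aligned bounding boxes follows, assuming the numbers are four corner points stored as consecutive x, y pairs (which matches the values shown). The helper names and the `min_score` threshold are hypothetical and not part of MinerU.

```python
def poly_to_bbox(poly: list[float]) -> tuple[float, float, float, float]:
    """Convert a flat [x0, y0, x1, y1, ...] polygon to (x_min, y_min, x_max, y_max)."""
    xs, ys = poly[0::2], poly[1::2]  # even indices are x, odd indices are y
    return min(xs), min(ys), max(xs), max(ys)


def confident_boxes(page: dict, min_score: float = 0.5) -> list[dict]:
    """Keep detections scoring at least min_score, with polygons reduced to boxes."""
    return [
        {"category_id": det["category_id"], "bbox": poly_to_bbox(det["poly"])}
        for det in page["layout_dets"]
        if det["score"] >= min_score
    ]


# The first two detections from the excerpt above.
page = {
    "layout_dets": [
        {"category_id": 1, "poly": [193, 793, 1462, 793, 1462, 1354, 193, 1354], "score": 0.983},
        {"category_id": 0, "poly": [319, 314, 1340, 314, 1340, 424, 319, 424], "score": 0.968},
    ]
}
boxes = confident_boxes(page, min_score=0.97)
print(boxes)
```

With the threshold set to 0.97, only the first detection is kept; lowering it keeps both.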

7. DeepSeek-AI 等 - 2024 - DeepSeek-V2 A Strong, Economical, and Efficient M_origin.pdf

The original PDF file.
8. DeepSeek-AI 等 - 2024 - DeepSeek-V2 A Strong, Economical, and Efficient M_spans.pdf

A visualization of the detection boxes for the different element types.