1. Installing the magic-pdf environment
conda create -n MinerU python=3.10
conda activate MinerU
pip install "boto3>=1.28.43" -i https://pypi.tuna.tsinghua.edu.cn/simple/
pip install "magic-pdf[full]==0.7.0b1" --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple/
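After the two pip commands finish, it can be worth confirming that both distributions actually landed in the active conda environment. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `report` are hypothetical and not part of magic-pdf.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def report(names=("magic-pdf", "boto3")):
    """Map each distribution name to its installed version (None if missing)."""
    return {name: installed_version(name) for name in names}
```

Calling `print(report())` inside the MinerU environment should show a version for each package. A value of `None` means the corresponding install step did not take effect.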
2. Downloading the model weights
sudo apt-get install git-lfs
git clone https://github.com/opendatalab/MinerU.git
cd MinerU/
git lfs install
git lfs clone https://huggingface.co/wanderkid/PDF-Extract-Kit
Alternatively, download the weights from ModelScope:
pip install modelscope
# Use the following Python code to download the model using the ModelScope SDK:
from modelscope import snapshot_download
model_dir = snapshot_download('wanderkid/PDF-Extract-Kit')
3. Editing the configuration
In magic-pdf.template.json, set models-dir to the directory where the model weights were downloaded:
{
"bucket_info":{
"bucket-name-1":["ak", "sk", "endpoint"],
"bucket-name-2":["ak", "sk", "endpoint"]
},
"models-dir":"/home/adam/work/MinerU/PDF-Extract-Kit/models",
"device-mode":"cpu",
"table-config": {
"is_table_recog_enable": false,
"max_time": 400
}
}
Rename magic-pdf.template.json to magic-pdf.json and place it in your user home directory. The default location depends on the operating system:
Windows : C:\Users\YourUsername
Linux : /home/YourUsername
macOS : /Users/YourUsername
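The edit-and-rename step can also be scripted. The sketch below is one possible approach, not part of MinerU itself: it loads the template as a dict, fills in models-dir and device-mode, and writes the result wherever the caller asks. The function names `build_config` and `write_config` are hypothetical. `Path.home()` resolves to the per-OS home directories listed above.

```python
import json
from pathlib import Path


def build_config(template, models_dir, device_mode="cpu"):
    """Return a copy of the template dict with models-dir and device-mode set."""
    config = dict(template)  # shallow copy so the caller's template is untouched
    config["models-dir"] = str(Path(models_dir).expanduser())
    config["device-mode"] = device_mode
    return config


def write_config(config, dest):
    """Write the config as JSON to dest and return the destination path."""
    dest = Path(dest)
    dest.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return dest
```

A typical call would read the template with `json.loads(Path("magic-pdf.template.json").read_text())`, pass it to `build_config` together with the models path, and then call `write_config(config, Path.home() / "magic-pdf.json")`.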
4. Command-line usage
magic-pdf --help
Usage: magic-pdf [OPTIONS]

Options:
  -v, --version                display the version and exit
  -p, --path PATH              local pdf filepath or directory  [required]
  -o, --output-dir TEXT        output local directory
  -m, --method [ocr|txt|auto]  the method for parsing pdf.
                               ocr: using ocr technique to extract information from pdf,
                               txt: suitable for the text-based pdf only and outperform ocr,
                               auto: automatically choose the best method for parsing pdf
                                  from ocr and txt.
                               without method specified, auto will be used by default.
  --help                       Show this message and exit.
## show version
magic-pdf -v
## command line example
magic-pdf -p {some_pdf} -o {some_output_dir} -m auto
{some_pdf} can be a single PDF file or a directory containing multiple PDFs. The results are saved in the {some_output_dir} directory. The output files are:
├── some_pdf.md # markdown file
├── images # directory for storing images
├── layout.pdf # layout diagram
├── middle.json # MinerU intermediate processing result
├── model.json # model inference result
├── origin.pdf # original PDF file
└── spans.pdf # smallest granularity bbox position information diagram
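To drive the command from a script and then check that the files listed above were produced, something like the following sketch could be used. It relies only on the `-p`, `-o` and `-m` options shown in the help text. The function names and the `EXPECTED` tuple are hypothetical, and the file names are taken from the listing above, so other magic-pdf versions may differ. In the test run in this post the results end up in a per-document, per-method subdirectory (the last log line shows ./res/GenZ-LLM-Analyzer/auto), so `result_dir` should point at that subdirectory.

```python
import subprocess
from pathlib import Path

# Output names copied from the listing above (assumed fixed for this version).
EXPECTED = ("images", "layout.pdf", "middle.json", "model.json",
            "origin.pdf", "spans.pdf")


def build_command(pdf_path, output_dir, method="auto"):
    """Build the magic-pdf argument list for one PDF file or directory."""
    if method not in ("ocr", "txt", "auto"):
        raise ValueError(f"unsupported method: {method!r}")
    return ["magic-pdf", "-p", str(pdf_path), "-o", str(output_dir), "-m", method]


def run_magic_pdf(pdf_path, output_dir, method="auto"):
    """Run magic-pdf and return its exit code (non-zero means failure)."""
    return subprocess.run(build_command(pdf_path, output_dir, method)).returncode


def missing_outputs(result_dir, pdf_stem):
    """List expected output names that do not exist under result_dir."""
    result_dir = Path(result_dir)
    expected = (f"{pdf_stem}.md",) + EXPECTED
    return [name for name in expected if not (result_dir / name).exists()]
```

An empty list from `missing_outputs` means every file in the listing is present. Anything else names the pieces that were not written.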
5. Test
magic-pdf -p GenZ-LLM.pdf -o ./res/ -m auto
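Result:
The test ran on CPU with 16 GB of RAM. Parsing a 3-page PDF took roughly 2 minutes, and PDFs with many pages crash the process. Some formulas appear to be parsed incorrectly, but the output is usable overall.
Full log: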
2024-08-13 15:53:44.149 | INFO | magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 0, text_len: 14962, cid_chars_radio: 0.0
INFO:datasets:PyTorch version 2.3.1 available.
2024-08-13 15:53:53.048 | INFO | magic_pdf.model.pdf_extract_kit:__init__:111 - DocAnalysis init, this may take some times. apply_layout: True, apply_formula: True, apply_ocr: False, apply_table: False
2024-08-13 15:53:53.048 | INFO | magic_pdf.model.pdf_extract_kit:__init__:119 - using device: cpu
2024-08-13 15:53:53.048 | INFO | magic_pdf.model.pdf_extract_kit:__init__:121 - using models_dir: /home/long/work/MinerU/PDF-Extract-Kit/models
CustomVisionEncoderDecoderModel init
CustomMBartForCausalLM init
CustomMBartDecoder init
[08/13 15:54:06 detectron2]: Rank of current process: 0. World size: 1
[08/13 15:54:07 detectron2]: Environment info:
------------------------------- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sys.platform linux
Python 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0]
numpy 1.26.4
detectron2 0.6 @/home/long/anaconda3/envs/MinerU/lib/python3.10/site-packages/detectron2
detectron2._C not built correctly: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /home/long/anaconda3/envs/MinerU/lib/python3.10/site-packages/detectron2/_C.cpython-310-x86_64-linux-gnu.so)
Compiler ($CXX) c++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
DETECTRON2_ENV_MODULE <not set>
PyTorch 2.3.1+cu121 @/home/long/anaconda3/envs/MinerU/lib/python3.10/site-packages/torch
PyTorch debug build False
torch._C._GLIBCXX_USE_CXX11_ABI False
GPU available No: torch.cuda.is_available() == False
Pillow 10.4.0
torchvision 0.18.1+cu121 @/home/long/anaconda3/envs/MinerU/lib/python3.10/site-packages/torchvision
fvcore 0.1.5.post20221221
iopath 0.1.9
cv2 4.6.0
------------------------------- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.3.6 (Git Hash 86e6af5974177e513fd3fee58425e1063e7f1361)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.3.1, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,
[08/13 15:54:07 detectron2]: Command line arguments: {'config_file': '/home/long/anaconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/resources/model_config/layoutlmv3/layoutlmv3_base_inference.yaml', 'resume': False, 'eval_only': False, 'num_gpus': 1, 'num_machines': 1, 'machine_rank': 0, 'dist_url': 'tcp://127.0.0.1:57823', 'opts': ['MODEL.WEIGHTS', '/home/long/work/MinerU/PDF-Extract-Kit/models/Layout/model_final.pth']}
[08/13 15:54:07 detectron2]: Contents of args.config_file=/home/long/anaconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/resources/model_config/layoutlmv3/layoutlmv3_base_inference.yaml:
AUG:
DETR: true
CACHE_DIR: ~/cache/huggingface
CUDNN_BENCHMARK: false
DATALOADER:
ASPECT_RATIO_GROUPING: true
FILTER_EMPTY_ANNOTATIONS: false
NUM_WORKERS: 4
REPEAT_THRESHOLD: 0.0
SAMPLER_TRAIN: TrainingSampler
DATASETS:
PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
PROPOSAL_FILES_TEST: []
PROPOSAL_FILES_TRAIN: []
TEST:
- scihub_train
TRAIN:
- scihub_train
GLOBAL:
HACK: 1.0
ICDAR_DATA_DIR_TEST: ''
ICDAR_DATA_DIR_TRAIN: ''
INPUT:
CROP:
ENABLED: true
SIZE:
- 384
- 600
TYPE: absolute_range
FORMAT: RGB
MASK_FORMAT: polygon
MAX_SIZE_TEST: 1333
MAX_SIZE_TRAIN: 1333
MIN_SIZE_TEST: 800
MIN_SIZE_TRAIN:
- 480
- 512
- 544
- 576
- 608
- 640
- 672
- 704
- 736
- 768
- 800
MIN_SIZE_TRAIN_SAMPLING: choice
RANDOM_FLIP: horizontal
MODEL:
ANCHOR_GENERATOR:
ANGLES:
- - -90
- 0
- 90
ASPECT_RATIOS:
- - 0.5
- 1.0
- 2.0
NAME: DefaultAnchorGenerator
OFFSET: 0.0
SIZES:
- - 32
- - 64
- - 128
- - 256
- - 512
BACKBONE:
FREEZE_AT: 2
NAME: build_vit_fpn_backbone
CONFIG_PATH: ''
DEVICE: cuda
FPN:
FUSE_TYPE: sum
IN_FEATURES:
- layer3
- layer5
- layer7
- layer11
NORM: ''
OUT_CHANNELS: 256
IMAGE_ONLY: true
KEYPOINT_ON: false
LOAD_PROPOSALS: false
MASK_ON: true
META_ARCHITECTURE: VLGeneralizedRCNN
PANOPTIC_FPN:
COMBINE:
ENABLED: true
INSTANCES_CONFIDENCE_THRESH: 0.5
OVERLAP_THRESH: 0.5
STUFF_AREA_LIMIT: 4096
INSTANCE_LOSS_WEIGHT: 1.0
PIXEL_MEAN:
- 127.5
- 127.5
- 127.5
PIXEL_STD:
- 127.5
- 127.5
- 127.5
PROPOSAL_GENERATOR:
MIN_SIZE: 0
NAME: RPN
RESNETS:
DEFORM_MODULATED: false
DEFORM_NUM_GROUPS: 1
DEFORM_ON_PER_STAGE:
- false
- false
- false
- false
DEPTH: 50
NORM: FrozenBN
NUM_GROUPS: 1
OUT_FEATURES:
- res4
RES2_OUT_CHANNELS: 256
RES5_DILATION: 1
STEM_OUT_CHANNELS: 64
STRIDE_IN_1X1: true
WIDTH_PER_GROUP: 64
RETINANET:
BBOX_REG_LOSS_TYPE: smooth_l1
BBOX_REG_WEIGHTS:
- 1.0
- 1.0
- 1.0
- 1.0
FOCAL_LOSS_ALPHA: 0.25
FOCAL_LOSS_GAMMA: 2.0
IN_FEATURES:
- p3
- p4
- p5
- p6
- p7
IOU_LABELS:
- 0
- -1
- 1
IOU_THRESHOLDS:
- 0.4
- 0.5
NMS_THRESH_TEST: 0.5
NORM: ''
NUM_CLASSES: 10
NUM_CONVS: 4
PRIOR_PROB: 0.01
SCORE_THRESH_TEST: 0.05
SMOOTH_L1_LOSS_BETA: 0.1
TOPK_CANDIDATES_TEST: 1000
ROI_BOX_CASCADE_HEAD:
BBOX_REG_WEIGHTS:
- - 10.0
- 10.0
- 5.0
- 5.0
- - 20.0
- 20.0
- 10.0
- 10.0
- - 30.0
- 30.0
- 15.0
- 15.0
IOUS:
- 0.5
- 0.6
- 0.7
ROI_BOX_HEAD:
BBOX_REG_LOSS_TYPE: smooth_l1
BBOX_REG_LOSS_WEIGHT: 1.0
BBOX_REG_WEIGHTS:
- 10.0
- 10.0
- 5.0
- 5.0
CLS_AGNOSTIC_BBOX_REG: true
CONV_DIM: 256
FC_DIM: 1024
NAME: FastRCNNConvFCHead
NORM: ''
NUM_CONV: 0
NUM_FC: 2
POOLER_RESOLUTION: 7
POOLER_SAMPLING_RATIO: 0
POOLER_TYPE: ROIAlignV2
SMOOTH_L1_BETA: 0.0
TRAIN_ON_PRED_BOXES: false
ROI_HEADS:
BATCH_SIZE_PER_IMAGE: 512
IN_FEATURES:
- p2
- p3
- p4
- p5
IOU_LABELS:
- 0
- 1
IOU_THRESHOLDS:
- 0.5
NAME: CascadeROIHeads
NMS_THRESH_TEST: 0.5
NUM_CLASSES: 10
POSITIVE_FRACTION: 0.25
PROPOSAL_APPEND_GT: true
SCORE_THRESH_TEST: 0.05
ROI_KEYPOINT_HEAD:
CONV_DIMS:
- 512
- 512
- 512
- 512
- 512
- 512
- 512
- 512
LOSS_WEIGHT: 1.0
MIN_KEYPOINTS_PER_IMAGE: 1
NAME: KRCNNConvDeconvUpsampleHead
NORMALIZE_LOSS_BY_VISIBLE_KEYPOINTS: true
NUM_KEYPOINTS: 17
POOLER_RESOLUTION: 14
POOLER_SAMPLING_RATIO: 0
POOLER_TYPE: ROIAlignV2
ROI_MASK_HEAD:
CLS_AGNOSTIC_MASK: false
CONV_DIM: 256
NAME: MaskRCNNConvUpsampleHead
NORM: ''
NUM_CONV: 4
POOLER_RESOLUTION: 14
POOLER_SAMPLING_RATIO: 0
POOLER_TYPE: ROIAlignV2
RPN:
BATCH_SIZE_PER_IMAGE: 256
BBOX_REG_LOSS_TYPE: smooth_l1
BBOX_REG_LOSS_WEIGHT: 1.0
BBOX_REG_WEIGHTS:
- 1.0
- 1.0
- 1.0
- 1.0
BOUNDARY_THRESH: -1
CONV_DIMS:
- -1
HEAD_NAME: StandardRPNHead
IN_FEATURES:
- p2
- p3
- p4
- p5
- p6
IOU_LABELS:
- 0
- -1
- 1
IOU_THRESHOLDS:
- 0.3
- 0.7
LOSS_WEIGHT: 1.0
NMS_THRESH: 0.7
POSITIVE_FRACTION: 0.5
POST_NMS_TOPK_TEST: 1000
POST_NMS_TOPK_TRAIN: 2000
PRE_NMS_TOPK_TEST: 1000
PRE_NMS_TOPK_TRAIN: 2000
SMOOTH_L1_BETA: 0.0
SEM_SEG_HEAD:
COMMON_STRIDE: 4
CONVS_DIM: 128
IGNORE_VALUE: 255
IN_FEATURES:
- p2
- p3
- p4
- p5
LOSS_WEIGHT: 1.0
NAME: SemSegFPNHead
NORM: GN
NUM_CLASSES: 10
VIT:
DROP_PATH: 0.1
IMG_SIZE:
- 224
- 224
NAME: layoutlmv3_base
OUT_FEATURES:
- layer3
- layer5
- layer7
- layer11
POS_TYPE: abs
WEIGHTS:
OUTPUT_DIR:
SCIHUB_DATA_DIR_TRAIN: ~/publaynet/layout_scihub/train
SEED: 42
SOLVER:
AMP:
ENABLED: true
BACKBONE_MULTIPLIER: 1.0
BASE_LR: 0.0002
BIAS_LR_FACTOR: 1.0
CHECKPOINT_PERIOD: 2000
CLIP_GRADIENTS:
CLIP_TYPE: full_model
CLIP_VALUE: 1.0
ENABLED: true
NORM_TYPE: 2.0
GAMMA: 0.1
GRADIENT_ACCUMULATION_STEPS: 1
IMS_PER_BATCH: 32
LR_SCHEDULER_NAME: WarmupCosineLR
MAX_ITER: 20000
MOMENTUM: 0.9
NESTEROV: false
OPTIMIZER: ADAMW
REFERENCE_WORLD_SIZE: 0
STEPS:
- 10000
WARMUP_FACTOR: 0.01
WARMUP_ITERS: 333
WARMUP_METHOD: linear
WEIGHT_DECAY: 0.05
WEIGHT_DECAY_BIAS: null
WEIGHT_DECAY_NORM: 0.0
TEST:
AUG:
ENABLED: false
FLIP: true
MAX_SIZE: 4000
MIN_SIZES:
- 400
- 500
- 600
- 700
- 800
- 900
- 1000
- 1100
- 1200
DETECTIONS_PER_IMAGE: 100
EVAL_PERIOD: 1000
EXPECTED_RESULTS: []
KEYPOINT_OKS_SIGMAS: []
PRECISE_BN:
ENABLED: false
NUM_ITER: 200
VERSION: 2
VIS_PERIOD: 0
[08/13 15:54:08 d2.checkpoint.detection_checkpoint]: [DetectionCheckpointer] Loading from /home/long/work/MinerU/PDF-Extract-Kit/models/Layout/model_final.pth ...
[08/13 15:54:08 fvcore.common.checkpoint]: [Checkpointer] Loading from /home/long/work/MinerU/PDF-Extract-Kit/models/Layout/model_final.pth ...
2024-08-13 15:54:09.334 | INFO | magic_pdf.model.pdf_extract_kit:__init__:148 - DocAnalysis init done!
2024-08-13 15:54:09.336 | INFO | magic_pdf.model.doc_analyze_by_custom_model:custom_model_init:98 - model init cost: 25.18623661994934
2024-08-13 15:54:18.411 | INFO | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 8.96
0: 1888x1472 2 embeddings, 3839.2ms
Speed: 28.6ms preprocess, 3839.2ms inference, 0.9ms postprocess per image at shape (1, 3, 1888, 1472)
2024-08-13 15:54:25.349 | INFO | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 2, mfr time: 1.24
2024-08-13 15:54:34.577 | INFO | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 9.22
0: 1888x1472 25 embeddings, 4120.5ms
Speed: 15.3ms preprocess, 4120.5ms inference, 1.0ms postprocess per image at shape (1, 3, 1888, 1472)
2024-08-13 15:54:49.462 | INFO | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 25, mfr time: 10.67
2024-08-13 15:54:59.903 | INFO | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 10.44
0: 1888x1472 18 embeddings, 4241.8ms
Speed: 20.1ms preprocess, 4241.8ms inference, 0.9ms postprocess per image at shape (1, 3, 1888, 1472)
2024-08-13 15:55:12.180 | INFO | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 18, mfr time: 7.93
2024-08-13 15:55:12.184 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:124 - doc analyze cost: 62.73242211341858
2024-08-13 15:55:12.233 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 0, last_page_cost_time: 0.0
2024-08-13 15:55:12.305 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 1, last_page_cost_time: 0.07
2024-08-13 15:55:12.364 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 2, last_page_cost_time: 0.06
2024-08-13 15:55:12.743 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:140 - 发现了列表,列表行数:[(8, 9)], [[8, 9]]
2024-08-13 15:55:12.744 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:153 - 列表行的第8到第9行是列表
2024-08-13 15:55:12.750 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:140 - 发现了列表,列表行数:[(19, 20)], [[19]]
2024-08-13 15:55:12.750 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:153 - 列表行的第19到第20行是列表
2024-08-13 15:55:12.755 | INFO | magic_pdf.para.para_split_v2:para_split:764 - 连接了第0页和第1页的段落
2024-08-13 15:55:13.239 | INFO | magic_pdf.pipe.UNIPipe:pipe_mk_markdown:48 - uni_pipe mk mm_markdown finished
2024-08-13 15:55:13.278 | INFO | magic_pdf.pipe.UNIPipe:pipe_mk_uni_format:43 - uni_pipe mk content list finished
2024-08-13 15:55:13.278 | INFO | magic_pdf.tools.common:do_parse:119 - local output dir is ./res/GenZ-LLM-Analyzer/auto
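To see where the time goes, the per-page timing lines in the log above can be totaled with a short script. This is a sketch that assumes the "layout detection cost" and "mfr time" values are in seconds (consistent with the timestamps) and that each "layout detection cost" line corresponds to one page. The regular expressions and the `summarize` helper are hypothetical and match only the line formats shown in this log.

```python
import re

# Patterns match the two per-page timing messages shown in the log above.
LAYOUT_RE = re.compile(r"layout detection cost: ([\d.]+)")
MFR_RE = re.compile(r"formula nums: (\d+), mfr time: ([\d.]+)")


def summarize(log_text):
    """Total layout-detection and formula-recognition (mfr) time from a log."""
    layout = [float(x) for x in LAYOUT_RE.findall(log_text)]
    mfr = [(int(n), float(t)) for n, t in MFR_RE.findall(log_text)]
    return {
        "pages": len(layout),
        "layout_seconds": round(sum(layout), 2),
        "formulas": sum(n for n, _ in mfr),
        "mfr_seconds": round(sum(t for _, t in mfr), 2),
    }
```

For the three pages in this log, the result is 28.62 s of layout detection and 19.84 s of formula recognition over 45 formulas, out of the 62.73 s reported as "doc analyze cost".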