pip install -U "magic-pdf[full]" --extra-index-url https://wheels.myhloli.com
pip install huggingface_hub
wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
python download_models_hf.py
This fails with the following error:
File "/usr/local/py311/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/py311/lib/python3.11/site-packages/huggingface_hub/_snapshot_download.py", line 235, in snapshot_download
raise LocalEntryNotFoundError(
huggingface_hub.errors.LocalEntryNotFoundError: An error happened while trying to locate the files on the Hub and we cannot find the appropriate snapshot folder for the specified revision on the local disk. Please check your internet connection and try again.
File "/usr/local/py311/lib/python3.11/site-packages/requests/adapters.py", line 700, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/opendatalab/PDF-Extract-Kit-1.0/revision/main (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fa75fb71110>: Failed to establish a new connection: [Errno 101] Network is unreachable'))"), '(Request ID: f1d9e0a1-3114-4901-95bd-6f460516a120)')
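Cause: huggingface.co is unreachable from mainland China.
Fix: search the huggingface_hub package for '/api/models/' to see how the request URL is built: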
grep -R '/api/models/' /usr/local/py311/lib/python3.11/site-packages/huggingface_hub/
Found: ./hf_api.py: f"{self.endpoint}/api/models/{repo_id}", so the host comes from a configurable endpoint.
Final fix: vi download_models_hf.py and find this line:
model_dir = snapshot_download('opendatalab/PDF-Extract-Kit-1.0', allow_patterns=mineru_patterns)
Add the endpoint parameter:
model_dir = snapshot_download('opendatalab/PDF-Extract-Kit-1.0', allow_patterns=mineru_patterns, endpoint='https://hf-mirror.com')
layoutreader_model_dir = snapshot_download('hantian/layoutreader', allow_patterns=layoutreader_pattern)
Add the endpoint parameter:
layoutreader_model_dir = snapshot_download('hantian/layoutreader', allow_patterns=layoutreader_pattern, endpoint='https://hf-mirror.com')
After that, python download_models_hf.py downloads the models successfully.
"models-dir": "/root/.cache/huggingface/hub/models--opendatalab--PDF-Extract-Kit-1.0/snapshots/60416a2cabad3f7b7284b43ce37a99864484fba2/models",
"layoutreader-model-dir": "/root/.cache/huggingface/hub/models--hantian--layoutreader/snapshots/641226775a0878b1014a96ad01b9642915136853",
3836 MB models--opendatalab--PDF-Extract-Kit-1.0
681 MB models--hantian--layoutreader
That is over 4 GB of data; the download took a little more than 10 minutes.
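The endpoint change made to download_models_hf.py above can also be kept in a small helper instead of being patched into each call. This is a minimal sketch, not the MinerU script itself: the function names are hypothetical, it assumes the installed huggingface_hub accepts the endpoint argument to snapshot_download (as the edit above does), and it reads the HF_ENDPOINT environment variable as an optional override. The allow_patterns value is left to the caller rather than guessed.

```python
import os

MIRROR = "https://hf-mirror.com"  # mirror used in the notes above


def resolve_endpoint(default=MIRROR):
    """Use HF_ENDPOINT if it is set, otherwise fall back to the mirror."""
    return os.environ.get("HF_ENDPOINT", default).rstrip("/")


def model_api_url(endpoint, repo_id):
    """Same URL shape as the f-string that grep found in hf_api.py."""
    return f"{endpoint.rstrip('/')}/api/models/{repo_id}"


def download_from_mirror(repo_id, allow_patterns=None, endpoint=None):
    """Hypothetical wrapper: download a repo snapshot through the mirror."""
    # Imported here so the helpers above work without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id,
        allow_patterns=allow_patterns,
        endpoint=endpoint or resolve_endpoint(),
    )


# Usage (performs a real download, so it is left commented out):
# model_dir = download_from_mirror("opendatalab/PDF-Extract-Kit-1.0")
# layoutreader_dir = download_from_mirror("hantian/layoutreader")
```

Passing the endpoint explicitly keeps the behavior independent of when the environment variable is read by the library.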
=====================
Following instructions found online:
magic-pdf -p example.pdf -o output_dir -m auto
This fails with: Error: No such option: -p. The instructions apparently target a different version from the one installed.
magic-pdf --help
Usage: magic-pdf [OPTIONS] COMMAND [ARGS]...
Options:
-v, --version Show version information
-h, --help Show help information
Commands:
json-command
local-json-command
pdf-command
magic-pdf -v
magic-pdf, version 0.6.1
magic-pdf pdf-command --help
Usage: magic-pdf pdf-command [OPTIONS]
Options:
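--pdf PATH Path to the PDF file [required]
--model PATH Path to the model
--method [ocr|txt|auto] Parsing method. txt: parser for text-based PDFs, ocr: parse the PDF with optical character recognition, auto:
let the program choose the parsing method
--inside_model BOOLEAN Test with the built-in model
--model_mode TEXT Built-in model selection. lite: fast parsing, lower accuracy; full: high-accuracy parsing, slower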
--help Show this message and exit.
The configuration file is /root/magic-pdf.json.
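If the two model paths ever need to be written into that configuration file by hand, a small script can do it without disturbing other keys. This is a sketch under assumptions: the helper name is hypothetical, and it only assumes the file is plain JSON containing the "models-dir" and "layoutreader-model-dir" keys shown earlier in these notes.

```python
import json
from pathlib import Path


def update_model_dirs(config_path, models_dir, layoutreader_dir):
    """Set the two model paths in a magic-pdf.json style file, keeping other keys."""
    path = Path(config_path)
    # Start from the existing config if there is one, otherwise from an empty dict.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    config["models-dir"] = str(models_dir)
    config["layoutreader-model-dir"] = str(layoutreader_dir)
    path.write_text(json.dumps(config, indent=4, ensure_ascii=False), encoding="utf-8")
    return config


# Usage (writes to the real config file, so it is left commented out):
# update_model_dirs(
#     "/root/magic-pdf.json",
#     "/root/.cache/huggingface/hub/models--opendatalab--PDF-Extract-Kit-1.0/snapshots/60416a2cabad3f7b7284b43ce37a99864484fba2/models",
#     "/root/.cache/huggingface/hub/models--hantian--layoutreader/snapshots/641226775a0878b1014a96ad01b9642915136853",
# )
```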
magic-pdf pdf-command --pdf "pdf_path" --inside_model true
This reports that detectron2 is missing.
pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/
Error: No matching distribution found for detectron2
After some research: prebuilt detectron2 wheels are only available for Python 3.10, so I installed Python 3.10.16.
pip310 install -U "magic-pdf[full-cpu]" detectron2 --extra-index-url https://wheels.myhloli.com
Run magic-pdf pdf-command --pdf "11.pdf" --inside_model true
Result: Error: No such option: --pdf
magic-pdf -v
import tensorrt_llm failed, if do not use tensorrt, ignore this message
import lmdeploy failed, if do not use lmdeploy, ignore this message
magic-pdf, version 1.1.0
magic-pdf --help
Options:
-v, --version display the version and exit
-p, --path PATH local filepath or directory. support PDF, PPT,
PPTX, DOC, DOCX, PNG, JPG files [required]
-o, --output-dir PATH output local directory [required]
-m, --method [ocr|txt|auto] the method for parsing pdf. ocr: using ocr
technique to extract information from pdf. txt:
suitable for the text-based pdf only and
outperform ocr. auto: automatically choose the
best method for parsing pdf from ocr and txt.
without method specified, auto will be used by
default.
-l, --lang TEXT Input the languages in the pdf (if known) to
improve OCR accuracy. Optional. You should
input "Abbreviation" with language form url:
https://paddlepaddle.github.io/PaddleOCR/latest/en/ppocr/blog/multi_languages.html#5-support-languages-and-abbreviations
-d, --debug BOOLEAN Enables detailed debugging information during
the execution of the CLI commands.
-s, --start INTEGER The starting page for PDF parsing, beginning
from 0.
-e, --end INTEGER The ending page for PDF parsing, beginning from
0.
--help Show this message and exit.
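On Python 3.11, installing magic-pdf[full] gave version 0.6.1; on Python 3.10, installing magic-pdf[full-cpu] gave version 1.1.0, and the command-line options differ between the two. I can't tell what the developers' plan is here.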
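Since the two installs expose different command lines, a script that has to run on both can decide from the help text which style to use. This is only a sketch based on the two help outputs recorded in these notes: the function names are hypothetical, and the rule "help mentions pdf-command means old style" is an assumption drawn from those two versions, not from the project's documentation.

```python
import subprocess


def build_parse_command(help_text, pdf_path, output_dir, executable="magic-pdf"):
    """Pick the argument style that matches the installed CLI's help output."""
    if "pdf-command" in help_text:
        # Subcommand style seen with version 0.6.1 (its help shows no output option).
        return [executable, "pdf-command", "--pdf", pdf_path, "--inside_model", "true"]
    # Flat option style seen with version 1.1.0.
    return [executable, "-p", pdf_path, "-o", output_dir, "-m", "auto"]


def read_help(executable="magic-pdf"):
    """Return the text printed by `<executable> --help`."""
    result = subprocess.run(
        [executable, "--help"], capture_output=True, text=True, check=False
    )
    return result.stdout


# Usage (requires magic-pdf on PATH, so it is left commented out):
# cmd = build_parse_command(read_help(), "en.pdf", "mkd")
# subprocess.run(cmd, check=True)
```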
Run magic-pdf -p "en.pdf" -o mkd -m auto and install the missing components one at a time as they are reported:
No module named 'paddle'
No module named 'cv2'
No module named 'ultralytics'
No module named 'doclayout_yolo'
No module named 'timm'
No module named 'unimernet'
No module named 'paddleocr'
No module named 'rapid_table'
No module named 'struct_eqtable'
No module named 'rapidocr_onnxruntime'
pip310 install opencv-python timm rapid_table struct_eqtable rapidocr_onnxruntime "ultralytics>=8.2.85" doclayout-yolo==0.0.2b1 unimernet==0.2.1 paddlepaddle==2.5.2 paddleocr==2.7.3
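Rather than rerunning the tool once per missing module, the whole list can be checked in one pass. The sketch below is hypothetical helper code, not part of magic-pdf: the module-to-package mapping is taken from the error messages and the pip command in these notes, and pip310 is simply the alias used here. shlex.join quotes specifiers such as ultralytics>=8.2.85 so that the shell does not treat > as an output redirect.

```python
import importlib.util
import shlex

# Import name -> pip requirement, taken from the notes above.
MODULE_TO_PACKAGE = {
    "paddle": "paddlepaddle==2.5.2",
    "cv2": "opencv-python",
    "ultralytics": "ultralytics>=8.2.85",
    "doclayout_yolo": "doclayout-yolo==0.0.2b1",
    "timm": "timm",
    "unimernet": "unimernet==0.2.1",
    "paddleocr": "paddleocr==2.7.3",
    "rapid_table": "rapid_table",
    "struct_eqtable": "struct_eqtable",
    "rapidocr_onnxruntime": "rapidocr_onnxruntime",
}


def missing_packages(mapping=MODULE_TO_PACKAGE):
    """Return the pip requirements whose import name cannot be found."""
    return [
        package
        for module, package in mapping.items()
        if importlib.util.find_spec(module) is None
    ]


def install_command(packages, pip="pip310"):
    """Build a shell-safe install command for the given requirements."""
    return shlex.join([pip, "install", *packages])


# Usage:
# print(install_command(missing_packages()))
```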
It finally started running, then failed with: Illegal instruction (core dumped)
Trying again with the language option: magic-pdf -p "en.pdf" -o mkd -m auto -l en
import tensorrt_llm failed, if do not use tensorrt, ignore this message
import lmdeploy failed, if do not use lmdeploy, ignore this message
2025-02-16 19:37:48.877 | INFO | magic_pdf.data.dataset:__init__:156 - lang: en
2025-02-16 19:37:51.296 | INFO | magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 0, text_len: 11285, cid_chars_radio: 0.0
2025-02-16 19:37:51.298 | INFO | magic_pdf.model.pdf_extract_kit:__init__:78 - DocAnalysis init, this may take some times, layout_model: doclayout_yolo, apply_formula: True, apply_ocr: False, apply_table: True, table_model: rapid_table, lang: en
2025-02-16 19:37:51.298 | INFO | magic_pdf.model.pdf_extract_kit:__init__:99 - using device: cpu
2025-02-16 19:37:51.298 | INFO | magic_pdf.model.pdf_extract_kit:__init__:103 - using models_dir: /root/.cache/huggingface/hub/models--opendatalab--PDF-Extract-Kit-1.0/snapshots/60416a2cabad3f7b7284b43ce37a99864484fba2/models
CustomVisionEncoderDecoderModel init
VariableUnimerNetModel init
VariableUnimerNetPatchEmbeddings init
VariableUnimerNetModel init
VariableUnimerNetPatchEmbeddings init
CustomMBartForCausalLM init
CustomMBartDecoder init
2025-02-16 19:38:00,906 - DownloadModel - DEBUG: /usr/local/py310/lib/python3.10/site-packages/rapid_table/models/slanet-plus.onnx already exists
[2025-02-16 19:38:00,906] [ DEBUG] download_model.py:34 - /usr/local/py310/lib/python3.10/site-packages/rapid_table/models/slanet-plus.onnx already exists
2025-02-16 19:38:01.416 | INFO | magic_pdf.model.pdf_extract_kit:__init__:181 - DocAnalysis init done!
2025-02-16 19:38:01.416 | INFO | magic_pdf.model.doc_analyze_by_custom_model:custom_model_init:141 - model init cost: 10.12007188796997
2025-02-16 19:38:06.751 | INFO | magic_pdf.model.pdf_extract_kit:__call__:217 - layout detection time: 5.31
2025-02-16 19:38:11.639 | INFO | magic_pdf.model.pdf_extract_kit:__call__:223 - mfd time: 4.89
2025-02-16 19:38:12.624 | INFO | magic_pdf.model.pdf_extract_kit:__call__:230 - formula nums: 1, mfr time: 0.98
2025-02-16 19:38:13.935 | INFO | magic_pdf.model.pdf_extract_kit:__call__:264 - det time: 1.31
2025-02-16 19:38:13.935 | INFO | magic_pdf.model.pdf_extract_kit:__call__:304 - table time: 0.0
2025-02-16 19:38:13.936 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:236 - -----page_id : 0, page total time: 12.5-----
..............
2025-02-16 19:39:54.076 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:236 - -----page_id : 13, page total time: 6.81-----
2025-02-16 19:39:54.291 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:247 - gc time: 0.22
2025-02-16 19:39:54.291 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:251 - doc analyze time: 112.87, speed: 0.12 pages/second
2025-02-16 19:39:54.375 | INFO | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 0, last_page_cost_time: 0.0
..............
2025-02-16 19:39:58.962 | INFO | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 13, last_page_cost_time: 0.28
2025-02-16 19:40:00.850 | INFO | magic_pdf.tools.common:do_parse:242 - local output dir is mkd/en/auto
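The throughput figure in the log can be checked by hand: page ids run from 0 to 13, so 14 pages in 112.87 seconds is about 0.12 pages per second. A small sketch that pulls the same numbers out of a saved log follows; the function name is hypothetical, and the regular expressions assume the exact message wording shown above and page ids that start at 0.

```python
import re

PAGE_RE = re.compile(r"page_id : (\d+), page total time: ([\d.]+)")
TOTAL_RE = re.compile(r"doc analyze time: ([\d.]+)")


def summarize(log_text):
    """Return page count, total analysis seconds, and pages per second."""
    page_ids = [int(match.group(1)) for match in PAGE_RE.finditer(log_text)]
    total = TOTAL_RE.search(log_text)
    if not page_ids or total is None:
        raise ValueError("log does not contain the expected timing lines")
    # Page ids are 0-based, so the highest id plus one is the page count.
    page_count = max(page_ids) + 1
    seconds = float(total.group(1))
    return {
        "pages": page_count,
        "seconds": seconds,
        "pages_per_second": page_count / seconds,
    }
```

Using the highest page id rather than counting matched lines keeps the result correct even when the middle of the log has been cut, as it is in these notes.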
The result leaves something to be desired:
( )11. A.give B.sell C.show D.take ( )12. A.green B.bright C.large D.clean ( )13. A.farmer B.worker C.cleaner D.teacher ( )14.A.says B.points C.looks D.money
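These items were recognized as a single line. Exporting a Chinese exam paper then failed with: Illegal instruction (core dumped)
After the crash, I looked into PDF-Extract-Kit, the core component that MinerU uses.
Reference: pdf-extract-kit paddle paddleocr pdf2markdown.py (poor results), CSDN blog
I changed the versions of some libraries: ultralytics>=8.2.85 doclayout-yolo==0.0.2b1 unimernet==0.2.1 paddlepaddle==2.5.2 paddleocr==2.7.3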
/usr/local/py310/bin/magic-pdf -p "cn.pdf" -o mkd -m auto -l ch -d true
The Chinese PDF was converted successfully too.