小红书开源dots.ocr:单一视觉语言模型中的多语言文档布局解析

简介

dots.ocr 是一款强大的多语言文档解析器,它将版面检测与内容识别统一整合到单一视觉语言模型中,同时保持出色的阅读顺序还原能力。尽管其基础模型仅为17亿参数的轻量级大语言模型(LLM),但性能达到了业界顶尖水平(SOTA)。

  1. 卓越性能 :在OmniDocBench基准测试中,dots.ocr 的文本、表格和阅读顺序解析均达到SOTA水平,公式识别效果更可媲美豆包1.5、gemini2.5-pro等参数量大得多的模型。
  2. 多语言支持:针对低资源语言展现出强大的解析能力,在我们自建的多语言文档测试集上,版面检测与内容识别均取得显著优势。
  3. 统一简洁架构 :依托单一视觉语言模型,dots.ocr 的架构远比依赖复杂多模型流水线的传统方案更精简。仅需调整输入提示词即可切换任务,证明视觉语言模型在检测任务上可比肩DocLayout-YOLO等传统检测模型。
  4. 高效推理:基于17亿参数的轻量LLM构建,推理速度优于许多基于更大规模基础模型的高性能方案。

性能对比:dots.ocr 与竞品模型

备注:

  • EN、ZH 指标为 OmniDocBench 端到端评估结果,Multilingual 指标为 dots.ocr-bench 端到端评估结果。

基准测试结果

  1. OmniDocBench
    不同任务的端到端评估结果

| Model Type | Methods | OverallEdit || TextEdit || FormulaEdit || TableTEDS || TableEdit || Read OrderEdit ||

Model Type Methods EN ZH EN ZH EN ZH EN ZH EN ZH EN ZH
Pipeline Tools MinerU 0.150 0.357 0.061 0.215 0.278 0.577 78.6 62.1 0.180 0.344 0.079 0.292
Pipeline Tools Marker 0.336 0.556 0.080 0.315 0.530 0.883 67.6 49.2 0.619 0.685 0.114 0.340
Pipeline Tools Mathpix 0.191 0.365 0.105 0.384 0.306 0.454 77.0 67.1 0.243 0.320 0.108 0.304
Pipeline Tools Docling 0.589 0.909 0.416 0.987 0.999 1 61.3 25.0 0.627 0.810 0.313 0.837
Pipeline Tools Pix2Text 0.320 0.528 0.138 0.356 0.276 0.611 73.6 66.2 0.584 0.645 0.281 0.499
Pipeline Tools Unstructured 0.586 0.716 0.198 0.481 0.999 1 0 0.06 1 0.998 0.145 0.387
Pipeline Tools OpenParse 0.646 0.814 0.681 0.974 0.996 1 64.8 27.5 0.284 0.639 0.595 0.641
Pipeline Tools PPStruct-V3 0.145 0.206 0.058 0.088 0.295 0.535 - - 0.159 0.109 0.069 0.091
Expert VLMs GOT-OCR 0.287 0.411 0.189 0.315 0.360 0.528 53.2 47.2 0.459 0.520 0.141 0.280
Expert VLMs Nougat 0.452 0.973 0.365 0.998 0.488 0.941 39.9 0 0.572 1.000 0.382 0.954
Expert VLMs Mistral OCR 0.268 0.439 0.072 0.325 0.318 0.495 75.8 63.6 0.600 0.650 0.083 0.284
Expert VLMs OLMOCR-sglang 0.326 0.469 0.097 0.293 0.455 0.655 68.1 61.3 0.608 0.652 0.145 0.277
Expert VLMs SmolDocling-256M 0.493 0.816 0.262 0.838 0.753 0.997 44.9 16.5 0.729 0.907 0.227 0.522
Expert VLMs Dolphin 0.206 0.306 0.107 0.197 0.447 0.580 77.3 67.2 0.180 0.285 0.091 0.162
Expert VLMs MinerU 2 0.139 0.240 0.047 0.109 0.297 0.536 82.5 79.0 0.141 0.195 0.069< 0.118
Expert VLMs OCRFlux 0.195 0.281 0.064 0.183 0.379 0.613 71.6 81.3 0.253 0.139 0.086 0.187
Expert VLMs MonkeyOCR-pro-3B 0.138 0.206 0.067 0.107 0.246 0.421 81.5 87.5 0.139 0.111 0.100 0.185
General VLMs GPT4o 0.233 0.399 0.144 0.409 0.425 0.606 72.0 62.9 0.234 0.329 0.128 0.251
General VLMs Qwen2-VL-72B 0.252 0.327 0.096 0.218 0.404 0.487 76.8 76.4 0.387 0.408 0.119 0.193
General VLMs Qwen2.5-VL-72B 0.214 0.261 0.092 0.18 0.315 0.434 82.9 83.9 0.341 0.262 0.106 0.168
General VLMs Gemini2.5-Pro 0.148 0.212 0.055 0.168 0.356 0.439 85.8 86.4 0.13 0.119 0.049 0.121
General VLMs doubao-1-5-thinking-vision-pro-250428 0.140 0.162 0.043 0.085 0.295 0.384 83.3 89.3 0.165 0.085 0.058 0.094
Expert VLMs dots.ocr 0.125 0.160 0.032 0.066 0.329 0.416 88.6 89.0 0.099 0.092 0.040 0.067

跨9种PDF页面类型的端到端文本识别性能。

Model Type Models Book Slides Financial Report Textbook Exam Paper Magazine Academic Papers Notes Newspaper Overall
Pipeline Tools MinerU 0.055 0.124 ++0.033++ 0.102 0.159 0.072 ++0.025++ 0.984 0.171 0.206
Pipeline Tools Marker 0.074 0.340 0.089 0.319 0.452 0.153 0.059 0.651 0.192 0.274
Pipeline Tools Mathpix 0.131 0.220 0.202 0.216 0.278 0.147 0.091 0.634 0.690 0.300
Expert VLMs GOT-OCR 0.111 0.222 0.067 0.132 0.204 0.198 0.179 0.388 0.771 0.267
Expert VLMs Nougat 0.734 0.958 1.000 0.820 0.930 0.830 0.214 0.991 0.871 0.806
Expert VLMs Dolphin 0.091 0.131 0.057 0.146 0.231 0.121 0.074 0.363 0.307 0.177
Expert VLMs OCRFlux 0.068 0.125 0.092 0.102 0.119 0.083 0.047 0.223 0.536 0.149
Expert VLMs MonkeyOCR-pro-3B 0.084 0.129 0.060 0.090 0.107 0.073 0.050 0.171 0.107 0.100
General VLMs GPT4o 0.157 0.163 0.348 0.187 0.281 0.173 0.146 0.607 0.751 0.316
General VLMs Qwen2.5-VL-7B 0.148 0.053 0.111 0.137 0.189 0.117 0.134 0.204 0.706 0.205
General VLMs InternVL3-8B 0.163 0.056 0.107 0.109 0.129 0.100 0.159 0.150 0.681 0.188
General VLMs doubao-1-5-thinking-vision-pro-250428 0.048 0.048 0.024 0.062 0.085 0.051 0.039 0.096 0.181 0.073
Expert VLMs dots.ocr 0.031 0.047 0.011 0.082 0.079 0.028 0.029 0.109 0.056 0.055

备注:

  • 指标数据来源于MonkeyOCR、OmniDocBench及我们内部的评估结果。
  • 我们在结果Markdown中删除了页眉和页脚单元格。
  • 我们使用tikz_preprocess流程将图像分辨率提升至200dpi。
  1. dots.ocr-bench
    这是一个内部基准,包含100种语言的1493张pdf图像。
    不同任务的端到端评估结果。
Methods OverallEdit TextEdit FormulaEdit TableTEDS TableEdit Read OrderEdit
MonkeyOCR-3B 0.483 0.445 0.627 50.93 0.452 0.409
doubao-1-5-thinking-vision-pro-250428 0.291 0.226 0.440 71.2 0.260 0.238
doubao-1-6 0.299 0.270 0.417 71.0 0.258 0.253
Gemini2.5-Pro 0.251 0.163 0.402 77.1 0.236 0.202
dots.ocr 0.177 0.075 0.297 79.2 0.186 0.152

注:

  • 我们采用了与 OmniDocBench 相同的指标计算流程。
  • 在结果 markdown 中删除了页眉(Page-header)和页脚(Page-footer)单元格。

布局检测

| Method | F1@IoU=.50:.05:.95↑ ||||| F1@IoU=.50↑ |||||

Method Overall Text Formula Table Picture Overall Text Formula Table Picture
DocLayout-YOLO-DocStructBench 0.733 0.694 0.480 0.803 0.619 0.806 0.779 0.620 0.858 0.678
dots.ocr-parse all 0.831 0.801 0.654 0.838 0.748 0.922 0.909 0.770 0.888 0.831
dots.ocr-detection only 0.845 0.816 0.716 0.875 0.765 0.930 0.917 0.832 0.918 0.843

备注:

  • prompt_layout_all_en 用于解析全部,prompt_layout_only_en 仅用于检测,请参阅提示
  1. olmOCR-bench.
Model ArXiv Old Scans Math Tables Old Scans Headers and Footers Multi column Long Tiny Text Base Overall
GOT OCR 52.7 52.0 0.2 22.1 93.6 42.0 29.9 94.0 48.3 ± 1.1
Marker 76.0 57.9 57.6 27.8 84.9 72.9 84.6 99.1 70.1 ± 1.1
MinerU 75.4 47.4 60.9 17.3 96.6 59.0 39.1 96.6 61.5 ± 1.1
Mistral OCR 77.2 67.5 60.6 29.3 93.6 71.3 77.1 99.4 72.0 ± 1.1
Nanonets OCR 67.0 68.6 77.7 39.5 40.7 69.9 53.4 99.3 64.5 ± 1.1
GPT-4o (No Anchor) 51.5 75.5 69.1 40.9 94.2 68.9 54.1 96.7 68.9 ± 1.1
GPT-4o (Anchored) 53.5 74.5 70.0 40.7 93.8 69.3 60.6 96.8 69.9 ± 1.1
Gemini Flash 2 (No Anchor) 32.1 56.3 61.4 27.8 48.0 58.7 84.4 94.0 57.8 ± 1.1
Gemini Flash 2 (Anchored) 54.5 56.1 72.1 34.2 64.7 61.5 71.5 95.6 63.8 ± 1.2
Qwen 2 VL (No Anchor) 19.7 31.7 24.2 17.1 88.9 8.3 6.8 55.5 31.5 ± 0.9
Qwen 2.5 VL (No Anchor) 63.1 65.7 67.3 38.6 73.6 68.3 49.1 98.3 65.5 ± 1.2
olmOCR v0.1.75 (No Anchor) 71.5 71.4 71.4 42.8 94.1 77.7 71.0 97.8 74.7 ± 1.1
olmOCR v0.1.75 (Anchored) 74.9 71.2 71.0 42.2 94.5 78.3 73.3 98.3 75.5 ± 1.0
MonkeyOCR-pro-3B 83.8 68.8 74.6 36.1 91.2 76.6 80.1 95.3 75.8 ± 1.0
dots.ocr 82.1 64.2 88.3 40.9 94.1 82.4 81.2 99.5 79.1 ± 1.0

快速开始

  1. 安装
    安装 dots.ocr
bash 复制代码
conda create -n dots_ocr python=3.12
conda activate dots_ocr

git clone https://github.com/rednote-hilab/dots.ocr.git
cd dots.ocr

# Install pytorch, see https://pytorch.org/get-started/previous-versions/ for your cuda version
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
pip install -e .

如果在安装过程中遇到问题,可以尝试使用我们的Docker镜像以简化配置,并按照以下步骤操作:

bash 复制代码
git clone https://github.com/rednote-hilab/dots.ocr.git
cd dots.ocr
pip install -e .

下载模型权重

💡注意:请为模型保存路径使用不含句点的目录名(例如用DotsOCR而非dots.ocr)。这是我们与Transformers集成前的临时解决方案。

bash 复制代码
python3 tools/download_model.py
  1. 部署
    vLLM推理
    我们强烈建议使用vLLM进行部署和推理。我们所有的评估结果均基于vLLM 0.9.1版本。Docker镜像基于官方vLLM镜像构建,您也可以参考Dockerfile自行搭建部署环境。
bash 复制代码
# You need to register model to vllm at first
python3 tools/download_model.py
export hf_model_path=./weights/DotsOCR  # Path to your downloaded model weights, Please use a directory name without periods (e.g., `DotsOCR` instead of `dots.ocr`) for the model save path. This is a temporary workaround pending our integration with Transformers.
export PYTHONPATH=$(dirname "$hf_model_path"):$PYTHONPATH
sed -i '/^from vllm\.entrypoints\.cli\.main import main$/a\
from DotsOCR import modeling_dots_ocr_vllm' `which vllm`  # If you downloaded model weights by yourself, please replace `DotsOCR` by your model saved directory name, and remember to use a directory name without periods (e.g., `DotsOCR` instead of `dots.ocr`) 

# launch vllm server
CUDA_VISIBLE_DEVICES=0 vllm serve ${hf_model_path} --tensor-parallel-size 1 --gpu-memory-utilization 0.95  --chat-template-content-format string --served-model-name model --trust-remote-code

# If you get a ModuleNotFoundError: No module named 'DotsOCR', please check the note above on the saved model directory name.

# vllm api demo
python3 ./demo/demo_vllm.py --prompt_mode prompt_layout_all_en

Huggingface 推理

bash 复制代码
python3 demo/demo_hf.py

Huggingface 推理细节

python 复制代码
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer
from qwen_vl_utils import process_vision_info
from dots_ocr.utils import dict_promptmode_to_prompt

model_path = "./weights/DotsOCR"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

image_path = "demo/demo_image1.jpg"
prompt = """Please output the layout information from the PDF image, including each layout element's bbox, its category, and the corresponding text content within the bbox.

1. Bbox format: [x1, y1, x2, y2]

2. Layout Categories: The possible categories are ['Caption', 'Footnote', 'Formula', 'List-item', 'Page-footer', 'Page-header', 'Picture', 'Section-header', 'Table', 'Text', 'Title'].

3. Text Extraction & Formatting Rules:
    - Picture: For the 'Picture' category, the text field should be omitted.
    - Formula: Format its text as LaTeX.
    - Table: Format its text as HTML.
    - All Others (Text, Title, etc.): Format their text as Markdown.

4. Constraints:
    - The output text must be the original text from the image, with no translation.
    - All layout elements must be sorted according to human reading order.

5. Final Output: The entire output must be a single JSON object.
"""

messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": image_path
                },
                {"type": "text", "text": prompt}
            ]
        }
    ]

# Preparation for inference
text = processor.apply_chat_template(
    messages, 
    tokenize=False, 
    add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=24000)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

3.文档解析

基于vLLM服务器,您可以使用以下命令解析图像或pdf文件:

bash 复制代码
# Parse all layout info, both detection and recognition
# Parse a single image
python3 dots_ocr/parser.py demo/demo_image1.jpg
# Parse a single PDF
python3 dots_ocr/parser.py demo/demo_pdf1.pdf  --num_threads 64  # try bigger num_threads for pdf with a large number of pages

# Layout detection only
python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_layout_only_en

# Parse text only, except Page-header and Page-footer
python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_ocr

# Parse layout info by bbox
python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_grounding_ocr --bbox 
相关推荐
寅时码42 分钟前
我开源了一款 Canvas “瑞士军刀”,十几种“特效与工具”开箱即用
前端·开源·canvas
小杨勇敢飞2 小时前
大语言模型的解码策略:贪婪解码与波束搜索
人工智能·语言模型·自然语言处理
NullPointerExpection3 小时前
dify + mcp 实现图片 ocr 识别
spring boot·llm·ocr·springai·deepseek·mcp
zkmall5 小时前
电商系统定制开发流程:ZKmall开源商城需求分析到上线全程可控
开源·需求分析
s1ckrain5 小时前
【论文阅读】Editing Large Language Models: Problems, Methods, and Opportunities
论文阅读·人工智能·语言模型·大模型可解释性
IvorySQL10 小时前
PGSQL运维优化:提升vacuum执行时间观测能力
运维·postgresql·开源·开源数据库·ivorysql
呼啸长风13 小时前
记一次未成功的 MMKV Pull Request
android·ios·开源
萑澈15 小时前
大语言模型提示词工程详尽实战指南
人工智能·语言模型·自然语言处理
草梅友仁16 小时前
草梅 Auth 1.2.0 发布与最新动态 | 2025 年第 31 周草梅周报
开源·github·ai编程