SmolDocling-256M:极小参数量的视觉语言模型|端到端文档解析方案的另一种思路

背景问题

传统的一站式文档解析工具,包含布局分析、OCR和表格识别等,往往需要结合多个独立的模型,同时根据处理任务的不同调用不同的模型,增加了处理流程的复杂度,并且难以泛化到不同的文档类型。大型视觉语言模型(LVLMs)虽然提供端到端的解决方案,但是计算成本高,如Qwen2.5VL系列模型,至少7B以上的模型才有不错的效果,这对于文档解析这种轻量型的任务来说计算负担太重了。

SmolDocling

SmolDocling是一种超小型的VLM,由IBM Research和Hugging Face合作开发,能够在使用远少于大型模型计算资源的情况下(仅有传统语言模型的参数量),提供与大型模型相当的性能,支持OCR、布局和定位、代码识别、公式识别、图表识别、表格识别、图像分类、标题对应、列表分组、全页转换等功能。 该模型基于Hugging Face的SmolVLM-256M模型训练得到,使用Vision Encoder处理文档图像,以及轻量级语言模型理解和生成结构化文本,同时引入DocTags标记格式,提高文档结构的识别和处理效率。

DocTags标记格式

DocTags 是模型输出内容的格式,是一种类似XML的语言,DocTags创建了一个清晰且结构化的标签和规则系统,用于将文本与文档的结构分开。通过减少混淆,使得模型的转换变得更加容易和准确。而且,相比直接转换为HTML或 Markdown等格式可能会保留更多有用的细节。

使用方法

1. 在线试用

可访问 www.smoldocling.net/ 进行试用

2. 本地部署

导入必要的库

python 复制代码
from io import BytesIO
from pathlib import Path
from urllib.parse import urlparse

import requests
from PIL import Image
from docling_core.types.doc import ImageRefMode
from docling_core.types.doc.document import DocTagsDocument, DoclingDocument
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config, stream_generate

初始化模型

python 复制代码
model_path = "ds4sd/SmolDocling-256M-preview"
model, processor = load(model_path)
config = load_config(model_path)

数据准备

python 复制代码
# 设置提示词
prompt = "Convert this page to docling."

# 设置图片URL
image = "https://ibm.biz/docling-page-with-list"

if urlparse(image).scheme != "":  # it is a URL
    response = requests.get(image, stream=True, timeout=10)
    response.raise_for_status()
    pil_image = Image.open(BytesIO(response.content))
else:
    pil_image = Image.open(image)

formatted_prompt = apply_chat_template(processor, config, prompt, num_images=1)

执行转换

python 复制代码
print("DocTags: \n\n")

output = ""
for token in stream_generate(
    model, processor, formatted_prompt, [image], max_tokens=4096, verbose=False
):
    output += token.text
    print(token.text, end="")
    if "</doctag>" in token.text:
        break

print("\n\n")

doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([output], [pil_image])


# 转换为docling格式
doc = DoclingDocument(name="SampleDocument")
doc.load_from_doctags(doctags_doc)

转换为Markdown

python 复制代码
print("Markdown: \n\n")
print(doc.export_to_markdown())

实际体验

找了一张英语论文的截图,含有表格和多个段落。 prompt为Convert this page to docling.

转换后的Doctag格式:

xml 复制代码
<doctag><text><loc_10><loc_26><loc_471><loc_58>et al., 2021), ChartQA (Masry et al., 2022), and OCRBench_v2 (Fu et al., 
2024b). Furthermore, we also evaluate the visual grounding capability of our model on the referring expression comprehension benchmarks (Kazemzadeh et al., 2014; Mao et al., 2016), object detection in the wild (Li et al., 2022) and a self-curated point grounding benchmark.</text>
<text><loc_10><loc_66><loc_471><loc_82>Video (w/o Audio) → Text We assess our model on several representative video understanding tasks like Video-MME (Fu et al., 2024a), MVBench (Li et al., 2024a), and EgoSchema (Mangalam et al., 2023).</text>
<text><loc_10><loc_91><loc_471><loc_107>Multimodality → Text We demonstrate the ability of our model for mixed-modality (image, audio and text) prompts on OmniBench (Li et al., 2024b).</text>
<section_header_level_1><loc_10><loc_116><loc_166><loc_124>5.1.1 Performance of Text → Text</section_header_level_1>
<text><loc_10><loc_130><loc_471><loc_168>We compare Qwen2.5-Omni with other leading large language model of similar size (7B). As shown in Table 1, the performance of Qwen2.5-Omni generally falls between Qwen2-7B and Qwen2.5-7B. Our model outperforms Qwen2-7B on most benchmarks, such as MMLU-Pro, MMLU-redux, MATH, GSM8K, MBPP, MultiPL-E and LiveCodeBench, which demonstrates the exceptional capabilities of our model for Text → Text .</text>
<otsl><loc_40><loc_190><loc_438><loc_312><ched>Datasets<ched>Gemmaa2-9  Blama3-1-8B  Qwen2-7B  Qwen2.5-7B  Qwen2.5-Omni-7B<lcel><lcel><lcel><lcel><nl><ecel><ecel><ched>General Tasks<lcel><lcel><lcel><nl><rhed>MMLU-Pro<fcel>52.1<fcel>48.3<fcel>44.1<fcel>56.3<fcel>47.0<nl><rhed>MMLU-redux<fcel>72.8<fcel>67.2<fcel>67.3<fcel>75.4<fcel>71.0<nl><rhed>LiveBench0831<fcel>30.6<fcel>26.7<fcel>29.2<fcel>35.9<fcel>29.6<nl><ecel><ched>Mathematics & Science Tasks<lcel><lcel><lcel><lcel><nl><rhed>GPQA<fcel>32.8<fcel>32.4<fcel>34.3<fcel>36.4<fcel>30.8<nl><rhed>MATH<fcel>44.3<fcel>51.9<fcel>52.9<fcel>75.5<fcel>71.5<nl><rhed>GSM8K<fcel>76.7<fcel>84.5<fcel>85.7<fcel>91.6<fcel>88.7<nl><ecel><ched>Coding Tasks<lcel><lcel><lcel><lcel><nl><rhed>HumanEval<fcel>68.9<fcel>72.6<fcel>79.9<fcel>84.8<fcel>78.7<nl><rhed>MBPP<fcel>74.9<fcel>69.6<fcel>67.2<fcel>79.2<fcel>73.2<nl><rhed>MultiPL-E<fcel>53.4<fcel>50.7<fcel>59.1<fcel>70.4<fcel>65.8<nl><rhed>LiveCodeBench2305-2409<fcel>18.9<fcel>8.3<fcel>23.9<fcel>28.7<fcel>24.6<nl><caption><loc_65><loc_174><loc_420><loc_182>Table 1: Text → Text performance of 7B+ pure text models and Qwen2.5-Omni</caption></otsl>
<section_header_level_1><loc_10><loc_329><loc_173><loc_337>5.1.2 Performance of Audio → Text</section_header_level_1>
<text><loc_10><loc_343><loc_471><loc_420>We compare Qwen2.5-Omni with other leading specialist or generalist models on diverse audio understanding, audio reasoning, and voice-chatting benchmarks. As shown in Table 2 and 3, Qwen2.5-Omni delivers better or comparable performance with other state-of-the-art methods on audio understanding. For instance, it achieves superior ASR and S2T performance on Fleurs_zh, CommonVoice_en, CommonVoice_zh, CoVoSt1_en-de and CoVoSt1_zh-en test sets, surpassing previous state-of-the-art models like Whisper-large-v3, Qwen2Audio, MinMo and other Omni models. Qwen2.5-Omni also achieves state-of-the-art performance on general audio understanding tasks like music and VSC. Additionally, Qwen2.5-Omni achieves state-of-the-art results on audio reasoning with superior performance on sound, music and speech subsets of MMU benchmark. These results demonstrate the powerful capabilities of Qwen2.5-Omni in general audio understanding and reasoning.</text>
<text><loc_10><loc_424><loc_471><loc_494>Additionally, on VoiceBench, Qwen2.5-Omni achieves an impressive average score of 74.12, surpassing other audio language models and omni models of similar size. This showcases our model's strong capabilities in speech interaction. To further explore the performance of diverse speech interaction, we convert text instructions from several pure-text benchmarks into speech and evaluate Qwen2.5-Omni, Qwen2-Audio and Qwen2-7B on the in-house voice-chat benchmark. About 90% of text-instructions are utilized. We use speech instruction for Qwen2.5-Omni and Qwen2-Audio, and text instruction for Qwen2-7B. As shown in Table 4, compared to Qwen2-Audio, Qwen2-5-Omni significantly narrows the gap with Qwen2-7B, which uses text instructions. This reflects our model's substantial progress in diversified end-to-end speech interaction.</text>
</doctag>

要求输出的Markdown格式:

vbnet 复制代码
Markdown: 


et al., 2021), ChartQA (Masry et al., 2022), and OCRBench\_v2 (Fu et al., 2024b). Furthermore, we also evaluate the visual grounding capability of our model on the referring expression comprehension benchmarks (Kazemzadeh et al., 2014; Mao et al., 2016), object detection in the wild (Li et al., 2022) and a self-curated point grounding benchmark.

Video (w/o Audio) → Text We assess our model on several representative video understanding tasks like Video-MME (Fu et al., 2024a), MVBench (Li et al., 2024a), and EgoSchema (Mangalam et al., 2023).

Multimodality → Text We demonstrate the ability of our model for mixed-modality (image, audio and text) prompts on OmniBench (Li et al., 2024b).

## 5.1.1 Performance of Text → Text

We compare Qwen2.5-Omni with other leading large language model of similar size (7B). As shown in Table 1, the performance of Qwen2.5-Omni generally falls between Qwen2-7B and Qwen2.5-7B. Our model outperforms Qwen2-7B on most benchmarks, such as MMLU-Pro, MMLU-redux, MATH, GSM8K, MBPP, MultiPL-E and LiveCodeBench, which demonstrates the exceptional capabilities of our model for Text → Text .

Table 1: Text → Text performance of 7B+ pure text models and Qwen2.5-Omni

| Datasets               | Gemmaa2-9  Blama3-1-8B  Qwen2-7B  Qwen2.5-7B  Qwen2.5-Omni-7B   | Gemmaa2-9  Blama3-1-8B  Qwen2-7B  Qwen2.5-7B  Qwen2.5-Omni-7B   | Gemmaa2-9  Blama3-1-8B  Qwen2-7B  Qwen2.5-7B  Qwen2.5-Omni-7B   | Gemmaa2-9  Blama3-1-8B  Qwen2-7B  Qwen2.5-7B  Qwen2.5-Omni-7B   | Gemmaa2-9  Blama3-1-8B  Qwen2-7B  Qwen2.5-7B  Qwen2.5-Omni-7B   |
|------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------|
|                        |                                                                 | General Tasks                                                   | General Tasks                                                   | General Tasks                                                   | General Tasks                                                   |
| MMLU-Pro               | 52.1                                                            | 48.3                                                            | 44.1                                                            | 56.3                                                            | 47.0                                                            |
| MMLU-redux             | 72.8                                                            | 67.2                                                            | 67.3                                                            | 75.4                                                            | 71.0                                                            |
| LiveBench0831          | 30.6                                                            | 26.7                                                            | 29.2                                                            | 35.9                                                            | 29.6                                                            |
|                        | Mathematics & Science Tasks                                     | Mathematics & Science Tasks                                     | Mathematics & Science Tasks                                     | Mathematics & Science Tasks                                     | Mathematics & Science Tasks                                     |
| GPQA                   | 32.8                                                            | 32.4                                                            | 34.3                                                            | 36.4                                                            | 30.8                                                            |
| MATH                   | 44.3                                                            | 51.9                                                            | 52.9                                                            | 75.5                                                            | 71.5                                                            |
| GSM8K                  | 76.7                                                            | 84.5                                                            | 85.7                                                            | 91.6                                                            | 88.7                                                            |
|                        | Coding Tasks                                                    | Coding Tasks                                                    | Coding Tasks                                                    | Coding Tasks                                                    | Coding Tasks                                                    |
| HumanEval              | 68.9                                                            | 72.6                                                            | 79.9                                                            | 84.8                                                            | 78.7                                                            |
| MBPP                   | 74.9                                                            | 69.6                                                            | 67.2                                                            | 79.2                                                            | 73.2                                                            |
| MultiPL-E              | 53.4                                                            | 50.7                                                            | 59.1                                                            | 70.4                                                            | 65.8                                                            |
| LiveCodeBench2305-2409 | 18.9                                                            | 8.3                                                             | 23.9                                                            | 28.7                                                            | 24.6                                                            |
|                        |                                                                 |                                                                 |                                                                 |                                                                 |                                                                 |

## 5.1.2 Performance of Audio → Text

We compare Qwen2.5-Omni with other leading specialist or generalist models on diverse audio understanding, audio reasoning, and voice-chatting benchmarks. As shown in Table 2 and 3, Qwen2.5-Omni delivers better or comparable performance with other state-of-the-art methods on audio understanding. For instance, it achieves superior ASR and S2T performance on Fleurs\_zh, CommonVoice\_en, CommonVoice\_zh, CoVoSt1\_en-de and CoVoSt1\_zh-en test sets, surpassing previous state-of-the-art models like Whisper-large-v3, Qwen2Audio, MinMo and other Omni models. Qwen2.5-Omni also achieves state-of-the-art performance on general audio understanding tasks like music and VSC. Additionally, Qwen2.5-Omni achieves state-of-the-art results on audio reasoning with superior performance on sound, music and speech subsets of MMU benchmark. These results demonstrate the powerful capabilities of Qwen2.5-Omni in general audio understanding and reasoning.

Additionally, on VoiceBench, Qwen2.5-Omni achieves an impressive average score of 74.12, surpassing other audio language models and omni models of similar size. This showcases our model's strong capabilities in speech interaction. To further explore the performance of diverse speech interaction, we convert text instructions from several pure-text benchmarks into speech and evaluate Qwen2.5-Omni, Qwen2-Audio and Qwen2-7B on the in-house voice-chat benchmark. About 90% of text-instructions are utilized. We use speech instruction for Qwen2.5-Omni and Qwen2-Audio, and text instruction for Qwen2-7B. As shown in Table 4, compared to Qwen2-Audio, Qwen2-5-Omni significantly narrows the gap with Qwen2-7B, which uses text instructions. This reflects our model's substantial progress in diversified end-to-end speech interaction.

初步试用起来与Qwen2.5VL-7B差不多,甚至可能某些数据集上略胜一筹。

但是对于中文的图片识别效果就非常糟糕了,看了下是缺少对应训练的数据集导致的。

对于这样一个2亿参数量的小模型来说,能提供端到端的文档解析方案,同时还有不错的准确率,已经很难得了,期待后续有更多类似的模型推出。

相关推荐
幂律智能3 分钟前
从AI使用风险到合同智能审查重构企业风控能力
人工智能·重构
视***间11 分钟前
端侧大模型落地新标杆:视程空间将GPT-OSS边缘AI深度导入NVIDIA Jetson平台
人工智能·gpt·边缘计算·nvidia·ai算力·gpt-oss·视程空间
1892280486124 分钟前
NY379固态MT29F32T08GSLBHL8-36QA:B
大数据·服务器·人工智能·科技·缓存
Adair_z24 分钟前
[SEO艺术重读] 第9篇 熊猫算法、企鹅算法和惩罚机制
人工智能·熊猫算法·企鹅算法·谷歌算法恢复·网站seo诊断·高质量内容创作·e-e-a-t原则
ZZH_AI项目交付26 分钟前
我把 AI 最容易改坏真实 App 的地方,整理成了 skills
人工智能·ios·app
忆~遂愿27 分钟前
从文字应答到具象共情:Agent 交互的底层革新
人工智能·深度学习·目标检测·microsoft·机器学习·ar·交互
Ai.den29 分钟前
Windows 安装 MinerU 3.x 实现本地批量解析 PDF
人工智能·windows·ai
枫叶林FYL35 分钟前
【强化学习】长上下文可验证奖励强化学习:原理推导与系统架构
人工智能·系统架构
Teable任意门互动35 分钟前
深度解析:AI 赋能开源多维表格,实现企业全场景数据整合与高效应用
数据库·人工智能·低代码·信息可视化·开源·数据库开发
沪漂阿龙37 分钟前
Hermes Agent 安全边界全解析:让 AI Agent 敢执行、可控制、能回滚
人工智能·安全