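Baidu Open-Sources Qianfan-VL: Domain-Enhanced General-Purpose Vision-Language Models

Domain capabilities enhanced through continued pre-training | 3B to 70B parameters | Enhanced document understanding and OCR | Chain-of-thought reasoning support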

🔗 Quick Links

Code: 💻 GitHub
Model downloads: 🤗 Hugging Face | 🤖 ModelScope
Documentation: 📚 Cookbook | 📝 Technical Report
Blog posts: 🇨🇳 Chinese Blog | 🇬🇧 English Blog

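Model Description

Qianfan-VL is a series of general-purpose multimodal large language models optimized for enterprise-grade multimodal applications. The models retain strong general capabilities while being deeply optimized for the high-frequency scenarios found in industrial deployment.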

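Model Variants

Model	Parameters	Context Length	CoT Support	Best For
Qianfan-VL-3B	3B	32k	❌	Edge deployment, real-time OCR
Qianfan-VL-8B	8B	32k	✅	Server-side general scenarios, fine-tuning
Qianfan-VL-70B	70B	32k	✅	Complex reasoning, data synthesis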

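Architecture

Language model:
- Qianfan-VL-3B: based on Qwen2.5-3B
- Qianfan-VL-8B/70B: based on the Llama 3.1 architecture
- Enhanced with a 3T multilingual corpus
Vision encoder: based on InternViT, with dynamic tiling for inputs up to 4K resolution
Cross-modal fusion: an MLP adapter that bridges vision and language efficiently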

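Core Capabilities

🔍 OCR and Document Understanding

Full-scenario OCR: handwriting, formulas, natural scenes, cards and documents
Document intelligence: layout analysis, table parsing, chart understanding, document Q&A
High accuracy: industry-leading results on OCR benchmarks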

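🧮 Chain-of-Thought Reasoning (8B & 70B)

Complex chart analysis and reasoning
Step-by-step mathematical problem solving
Visual reasoning and logical inference
Statistical computation and trend prediction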

📊 Benchmark Performance

General Vision-Language Benchmarks

Benchmark	Qianfan-VL-3B	Qianfan-VL-8B	Qianfan-VL-70B	InternVL-3-8B	InternVL-3-78B	Qwen2.5-VL-7B	Qwen2.5-VL-72B
A-Bench_VAL	75.65	75.72	78.1	75.86	75.86	76.49	79.22
CCBench	66.86	70.39	80.98	77.84	70.78	57.65	73.73
SEEDBench_IMG	76.55	78.02	79.13	77.0	77.52	76.98	78.34
SEEDBench2_Plus	67.59	70.97	73.17	69.52	68.47	70.93	73.25
MMVet	48.17	53.21	67.34	80.28	78.9	70.64	75.69
MMMU_VAL	46.44	47.11	58.33	56.11	60.78	51.0	65.78
ScienceQA_TEST	95.19	97.62	98.76	97.97	97.17	85.47	92.51
ScienceQA_VAL	93.85	97.62	98.81	97.81	95.14	83.59	91.32
MMT-Bench_VAL	62.23	63.22	71.06	65.17	63.67	61.4	69.49
MTVQA_TEST	26.5	30.14	32.18	30.3	27.62	29.08	31.48
BLINK	49.97	56.81	59.44	55.87	51.87	54.55	63.02
MMStar	57.93	64.07	69.47	68.4	66.07	61.53	66.0
RealWorldQA	65.75	70.59	71.63	71.11	74.25	69.28	73.86
Q-Bench1_VAL	73.51	75.25	77.46	75.99	77.99	78.1	79.93
POPE	85.08	86.06	88.97	90.59	88.87	85.97	83.35
RefCOCO (Avg)	85.94	89.37	91.01	89.65	91.40	86.56	90.25

OCR and Document Understanding

Benchmark	Qianfan-VL-3B	Qianfan-VL-8B	Qianfan-VL-70B	InternVL-3-8B	InternVL-3-78B	Qwen2.5-VL-3B	Qwen2.5-VL-7B	Qwen2.5-VL-72B
OCRBench	831	854	873	881	847	810	883	874
AI2D_TEST	81.38	85.07	87.23	85.07	83.55	77.07	80.472	83.84
OCRVQA_TEST	66.15	68.98	74.06	39.03	35.58	69.24	71.02	66.8
TextVQA_VAL	80.11	82.13	84.48	82.15	83.52	79.09	84.962	83.26
DocVQA_VAL	90.85	93.54	94.75	92.04	83.82	92.71	94.91	95.75
ChartQA_TEST	81.79	87.72	89.6	85.76	82.04	83.4	86.68	87.16

Mathematical Reasoning

Benchmark	Qianfan-VL-8B	Qianfan-VL-70B	InternVL-3-8B	InternVL-3-78B	Qwen2.5-VL-7B	Qwen2.5-VL-72B
Mathvista-mini	69.19	78.6	69.5	70.1	67.2	73.9
Mathvision	32.82	50.29	29.61	34.8	25.95	39.34
Mathverse	48.4	61.04	43.68	49.26	44.21	55.18
ChartQA Pro	50.43	52	37.32	44.43	43.73	45.3
HallusionBench	51.72	54.52	49.2	40.2	47.9	49.9
InHouse Dataset A	59.87	71.78	40.64	41.47	45.58	57.2
InHouse Dataset B	61.33	75.6	36.25	42.65	30.62	59.68

Quick Start

Installation

pip install transformers accelerate torch torchvision pillow einops

Using Transformers

import torch
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
from PIL import Image

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    # aspect ratio of the input image
    aspect_ratio = orig_width / orig_height

    # enumerate every (columns, rows) tile grid with min_num..max_num tiles
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # pick the grid whose aspect ratio is closest to the image's
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# Load model
MODEL_PATH = "baidu/Qianfan-VL-8B"  # or Qianfan-VL-3B, Qianfan-VL-70B
model = AutoModel.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

# Load and process image
pixel_values = load_image("./example/scene_ocr.png").to(torch.bfloat16)

# Inference
# The prompt is Chinese for "Please recognize all the text in the image".
prompt = "<image>请识别图中所有文字"
with torch.no_grad():
    response = model.chat(
        tokenizer,
        pixel_values=pixel_values,
        question=prompt,
        generation_config={"max_new_tokens": 512},
        verbose=False
    )
print(response)
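
The number of tiles that `dynamic_preprocess` produces determines how large `pixel_values` becomes, so it can help to estimate it before loading an image. The following is a minimal standalone sketch that re-implements only the grid-selection logic from the snippet above in pure Python. The names `candidate_grids`, `choose_grid`, and `tile_count` are hypothetical helpers introduced for illustration, and the extra tie-break in the sort is an assumption added here to make the ordering deterministic.

```python
def candidate_grids(min_num=1, max_num=12):
    """All (columns, rows) grids whose tile count lies in [min_num, max_num]."""
    grids = {
        (cols, rows)
        for cols in range(1, max_num + 1)
        for rows in range(1, max_num + 1)
        if min_num <= cols * rows <= max_num
    }
    # Sort by tile count, as in the snippet above; the second key is only
    # a deterministic tie-break (an assumption of this sketch).
    return sorted(grids, key=lambda g: (g[0] * g[1], g))


def choose_grid(width, height, image_size=448, min_num=1, max_num=12):
    """Pick the grid whose aspect ratio is closest to the image's."""
    aspect = width / height
    best, best_diff = (1, 1), float("inf")
    for cols, rows in candidate_grids(min_num, max_num):
        diff = abs(aspect - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and width * height > 0.5 * image_size ** 2 * cols * rows:
            # Same aspect ratio: prefer the larger grid when the image
            # has enough pixels to fill it.
            best = (cols, rows)
    return best


def tile_count(width, height, image_size=448, max_num=12, use_thumbnail=True):
    """Number of image_size x image_size tiles, including the optional thumbnail."""
    cols, rows = choose_grid(width, height, image_size, max_num=max_num)
    tiles = cols * rows
    return tiles + 1 if use_thumbnail and tiles != 1 else tiles


for size in [(448, 448), (1920, 1080), (1080, 1920)]:
    print(size, choose_grid(*size), tile_count(*size))
```

With the defaults used by `load_image` (448-pixel tiles, `max_num=12`, thumbnail enabled), a 1920x1080 image maps to a 4x2 grid plus one thumbnail, so the first dimension of `pixel_values` would be 9.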

Using vLLM

You can deploy Qianfan-VL with the official vLLM Docker image for high-performance inference behind an OpenAI-compatible API:

Start the vLLM service

docker run -d --name qianfan-vl \
  --gpus all \
  -v /path/to/Qianfan-VL-8B:/model \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model /model \
  --served-model-name qianfan-vl \
  --trust-remote-code \
  --hf-overrides '{"architectures":["InternVLChatModel"],"model_type":"internvl_chat"}'
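
The container started above runs detached, and loading the weights can take a while, so requests sent immediately may be refused. A minimal readiness check, assuming the server exposes the standard OpenAI-compatible `/v1/models` route on the port mapped above, might look like the sketch below. The function names `served_models` and `wait_until_ready`, and the retry defaults, are hypothetical choices for illustration.

```python
import json
import time
import urllib.request


def served_models(base_url, timeout=5.0):
    """Return the model ids reported by the OpenAI-compatible /v1/models route."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


def wait_until_ready(base_url, model_name, attempts=60, delay=5.0, fetch=served_models):
    """Poll until model_name is listed, or give up after `attempts` tries."""
    for _ in range(attempts):
        try:
            if model_name in fetch(base_url):
                return True
        except (OSError, ValueError):
            # Connection refused, timeout, or a partial response:
            # the server is probably still starting.
            pass
        time.sleep(delay)
    return False
```

Calling `wait_until_ready("http://127.0.0.1:8000", "qianfan-vl")` before sending the first request would block until the name passed to `--served-model-name` shows up, or return `False` after the retry budget is spent.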

Call the API

curl 'http://127.0.0.1:8000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "qianfan-vl",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://qianfan-public-demo.bj.bcebos.com/qianfan-vl/2509/images/scene_ocr.png"
            }
          },
          {
            "type": "text",
            "text": "<image>请识别图中所有文字"
          }
        ]
      }
    ]
  }'

Alternatively, use Python with the OpenAI SDK:

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:8000/v1"
)

response = client.chat.completions.create(
    model="qianfan-vl",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://qianfan-public-demo.bj.bcebos.com/qianfan-vl/2509/images/scene_ocr.png"}
                },
                {
                    "type": "text",
                    "text": "<image>请描述这张图片"  # Chinese for "Please describe this image"
                }
            ]
        }
    ],
    max_tokens=512
)
print(response.choices[0].message.content)
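
Both API examples above point at a publicly hosted image. For a file on local disk, one common approach is to inline the image as a base64 data URL in the `image_url` field. The sketch below assumes the deployed vLLM version accepts `data:` URLs there, which is worth verifying for your setup. `image_to_data_url` and `build_messages` are hypothetical helper names, not part of the OpenAI SDK or vLLM.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path):
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type for {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_messages(image_path, question):
    """Build a single-turn message list in the same shape as the request above."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_to_data_url(image_path)}},
                {"type": "text", "text": question},
            ],
        }
    ]
```

The result can be passed as `messages=build_messages("./example/scene_ocr.png", "<image>Describe this image")` to `client.chat.completions.create`. Large images inflate the request body by roughly a third once base64-encoded, so a hosted URL may still be preferable for big files.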

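Training Details

Four-Stage Progressive Training

Cross-modal alignment (100B tokens): establishes vision-language associations
General knowledge injection (3.5T tokens): builds strong foundational capabilities
Domain enhancement (300B tokens): specialized OCR and reasoning capabilities
Post-training (1B tokens): instruction following and preference alignment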
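
To put the four stages in proportion, the short sketch below adds up the token counts listed above and prints each stage's share of the total. The figures come straight from the list; the variable names are illustrative only.

```python
# Token budget per training stage, taken from the list above.
STAGE_TOKENS = {
    "Cross-modal alignment": 100 * 10**9,
    "General knowledge injection": 3_500 * 10**9,
    "Domain enhancement": 300 * 10**9,
    "Post-training": 1 * 10**9,
}

total_tokens = sum(STAGE_TOKENS.values())
shares = {stage: tokens / total_tokens for stage, tokens in STAGE_TOKENS.items()}

print(f"Total: {total_tokens / 10**12:.3f}T tokens")
for stage, share in shares.items():
    print(f"{stage}: {share:.2%}")
```

The total comes to about 3.9T tokens, of which general knowledge injection accounts for roughly 90% and domain enhancement for a little under 8%.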

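Infrastructure

Trained on 5,000+ Baidu Kunlun chips
Single-task parallel training at a scale of 5,000 chips, setting a new industry record
Scaling efficiency above 90% for large-scale distributed training
Novel communication-computation fusion techniques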

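Model Card

Developed by: Baidu AI Cloud Qianfan Team
Model type: vision-language Transformer
Languages: multilingual
License: [see the model card for specific terms]
Base architecture: see the technical report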

Citation

If you use Qianfan-VL in your research, please cite:

@misc{qianfan-vl-2025,
  title={Qianfan-VL: Domain-Enhanced Universal Vision-Language Models},
  author={Qianfan Team},
  year={2025},
  publisher={Baidu}
}

Contact

Visit the Baidu Qianfan platform for more information and API access.

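Acknowledgments

By combining general capabilities with domain enhancement, this model series aims to advance multimodal AI significantly and to deliver practical value for industry applications.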