【阿里拥抱开源】千问开源Qwen3.6-35B-A3B，并配上调参参考

$!注意$
本仓库包含后训练模型的Hugging Face Transformers格式权重与配置文件。

这些模型资产兼容Hugging Face Transformers、vLLM、SGLang、KTransformers等框架。

继二月发布Qwen3.5系列后，我们很高兴推出Qwen3.6的首个开源权重版本。基于社区直接反馈构建的Qwen3.6优先考虑稳定性和实际效用，为开发者提供更直观、响应迅速且真正高效的编程体验。

Qwen3.6亮点

本次版本带来重大升级，主要体现在：

智能体编程： 模型处理前端工作流和仓库级推理时更流畅精准
思维延续： 新增保留历史消息推理上下文的选项，简化迭代开发并降低开销

更多详情请参阅我们的博客文章 Qwen3.6-35B-A3B。

模型概述

类型：带视觉编码器的因果语言模型
训练阶段：预训练 & 后训练
语言模型
- 参数量：总计350亿，激活30亿
- 隐藏层维度：2048
- 词嵌入维度：248320（填充后）
- 层数：40
- 隐藏层结构：10 × (3 × (门控DeltaNet → 混合专家) → 1 × (门控注意力 → 混合专家))
- 门控DeltaNet：
  - 线性注意力头数：V向量32头，QK向量16头
  - 头维度：128
- 门控注意力：
  - 注意力头数：Q向量16头，KV向量2头
  - 头维度：256
  - 旋转位置编码维度：64
- 混合专家系统：
  - 专家总数：256
  - 激活专家数：8路由专家 + 1共享专家
  - 专家中间层维度：512
- 语言模型输出：248320（填充后）
- 多任务预训练：采用多步训练策略
上下文长度：原生支持262,144 token，可扩展至1,010,000 token

基准测试结果

语言能力

模型能力对比表

	Qwen3.5-27B	Gemma4-31B	Qwen3.5-35BA3B	Gemma4-26BA4B	Qwen3.6-35BA3B
Coding Agent
SWE-bench Verified	75.0	52.0	70.0	17.4	73.4
SWE-bench Multilingual	69.3	51.7	60.3	17.3	67.2
SWE-bench Pro	51.2	35.7	44.6	13.8	49.5
Terminal-Bench 2.0	41.6	42.9	40.5	34.2	51.5
Claw-Eval (Avg)	64.3	48.5	65.4	58.8	68.7
Claw-Eval (Pass^3)	46.2	25.0	51.0	28.0	50.0
SkillsBench (Avg5)	27.2	23.6	4.4	12.3	28.7
QwenClawBench	52.2	41.7	47.7	38.7	52.6
NL2Repo	27.3	15.5	20.5	11.6	29.4
QwenWebBench	1068	1197	978	1178	1397
General Agent
TAU3-Bench	68.4	67.5	68.9	59.0	67.2
VITA-Bench	41.8	43.0	29.1	36.9	35.6
DeepPlanning	22.6	24.0	22.8	16.2	25.9
Tool Decathlon	31.5	21.2	28.7	12.0	26.9
MCPMark	36.3	18.1	27.0	14.2	37.0
MCP-Atlas	68.4	57.2	62.4	50.0	62.8
WideSearch	66.4	35.2	59.1	38.3	60.1
Knowledge
MMLU-Pro	86.1	85.2	85.3	82.6	85.2
MMLU-Redux	93.2	93.7	93.3	92.7	93.3
SuperGPQA	65.6	65.7	63.4	61.4	64.7
C-Eval	90.5	82.6	90.2	82.5	90.0
STEM & Reasoning
GPQA	85.5	84.3	84.2	82.3	86.0
HLE	24.3	19.5	22.4	8.7	21.4
LiveCodeBench v6	80.7	80.0	74.6	77.1	80.4
HMMT Feb 25	92.0	88.7	89.0	91.7	90.7
HMMT Nov 25	89.8	87.5	89.2	87.5	89.1
HMMT Feb 26	84.3	77.2	78.7	79.0	83.6
IMOAnswerBench	79.9	74.5	76.8	74.3	78.9
AIME26	92.6	89.2	91.0	88.3	92.7

SWE-Bench系列：内部智能体脚手架（bash + 文件编辑工具）；温度参数=1.0，top_p=0.95，20万上下文窗口。我们修正了SWE-bench Pro公开集中的部分问题任务，并在优化后的基准上评估所有基线。
Terminal-Bench 2.0：Harbor/Terminus-2测试框架；3小时超时，32核CPU/48GB内存；温度参数=1.0，top_p=0.95，top_k=20，最大标记数8万，25.6万上下文窗口；5次运行平均值。
SkillsBench：通过OpenCode评估78项任务（自包含子集，排除API依赖型任务）；5次运行平均值。
NL2Repo：其他模型通过Claude Code评估（温度参数=1.0，top_p=0.95，最大轮次=900）。
QwenClawBench：内部真实用户分布式Claw智能体基准（即将开源）；温度参数=0.6，25.6万上下文窗口。
QwenWebBench：内部前端代码生成基准；双语（中/英），7大类（网页设计、网页应用、游戏、SVG、数据可视化、动画及3D）；自动渲染+多模态评判（代码/视觉正确性）；BT/Elo评级体系。
TAU3-Bench：采用官方用户模型（gpt-5.2，低推理强度）+默认BM25检索。
VITA-Bench：子领域平均分；使用claude-4-sonnet作为评判器（因官方评判器claude-3.7-sonnet已停用）。
MCPMark：GitHub MCP v0.30.3；Playwright响应截断于3.2万标记处。
MCP-Atlas：公开集得分；gemini-2.5-pro评判器。
AIME 26：采用完整版AIME 2026（I卷&II卷），分数可能与Qwen 3.5记录存在差异。

视觉语言

多模态模型能力对比表

	Qwen3.5-27B	Claude-Sonnet-4.5	Gemma4-31B	Gemma4-26BA4B	Qwen3.5-35B-A3B	Qwen3.6-35B-A3B
STEM and Puzzle
MMMU	82.3	79.6	80.4	78.4	81.4	81.7
MMMU-Pro	75.0	68.4	76.9*	73.8*	75.1	75.3
Mathvista(mini)	87.8	79.8	79.3	79.4	86.2	86.4
ZEROBench_sub	36.2	26.3	26.0	26.3	34.1	34.4
General VQA
RealWorldQA	83.7	70.3	72.3	72.2	84.1	85.3
MMBench_EN-DEV-v1.1	92.6	88.3	90.9	89.0	91.5	92.8
SimpleVQA	56.0	57.6	52.9	52.2	58.3	58.9
HallusionBench	70.0	59.9	67.4	66.1	67.9	69.8
Text Recognition and Document Understanding
OmniDocBench1.5	88.9	85.8	80.1	74.4	89.3	89.9
CharXiv(RQ)	79.5	67.2	67.9	69.0	77.5	78.0
CC-OCR	81.0	68.1	75.7	74.5	80.7	81.9
AI2D_TEST	92.9	87.0	89.0	88.3	92.6	92.7
Spatial Intelligence
RefCOCO(avg)	90.9	--	--	--	89.2	92.0
ODInW13	41.1	--	--	--	42.6	50.8
EmbSpatialBench	84.5	71.8	--	--	83.1	84.3
RefSpatialBench	67.7	--	--	--	63.5	64.3
Video Understanding
VideoMME~(w sub.)~	87.0	81.1	--	--	86.6	86.6
VideoMME~(w/o sub.)~	82.8	75.3	--	--	82.5	82.5
VideoMMMU	82.3	77.6	81.6	76.0	80.4	83.7
MLVU	85.9	72.8	--	--	85.6	86.2
MVBench	74.6	--	--	--	74.8	74.6
LVBench	73.6	--	--	--	71.4	71.4

* 空单元格（--）表示分数不可用或不适用。

快速入门

为简化集成流程，我们推荐通过API使用Qwen3.6模型。以下是通过OpenAI兼容API使用Qwen3.6的指南。

部署Qwen3.6

Qwen3.6可通过主流推理框架的API提供服务。下面展示为Qwen3.6模型启动OpenAI兼容API服务的示例命令。

$!重要$
不同框架的推理效率和吞吐量差异显著。

建议使用最新框架版本以确保最佳性能和兼容性。

针对生产环境或高吞吐场景，强烈推荐使用SGLang、KTransformers或vLLM等专用服务引擎。
$!重要$
该模型默认上下文长度为262,144个token。

若遇到内存不足(OOM)错误，可考虑减小上下文窗口。

但由于Qwen3.6依赖扩展上下文处理复杂任务，建议至少保持128K token的上下文长度以保留思考能力。

SGLang

SGLang是面向大语言模型和视觉语言模型的高性能服务框架。

推荐使用sglang>=0.5.10版本运行Qwen3.6，可通过以下命令在新环境中安装：

shell 复制代码

uv pip install sglang[all]

请参阅文档了解更多详情。

以下命令将在http://localhost:8000/v1创建API端点：

标准版本：以下命令可用于创建最大上下文长度为262,144 tokens的API端点，并在8个GPU上使用张量并行。

shell 复制代码

python -m sglang.launch_server --model-path Qwen/Qwen3.6-35B-A3B --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3

工具使用：为了支持工具使用，您可以使用以下命令。

shell 复制代码

python -m sglang.launch_server --model-path Qwen/Qwen3.6-35B-A3B --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3 --tool-call-parser qwen3_coder

多令牌预测（MTP）：建议使用以下命令进行MTP：

shell 复制代码

python -m sglang.launch_server --model-path Qwen/Qwen3.6-35B-A3B --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3 --speculative-algo NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4

有关详细部署指南，请参阅 SGLang Qwen3.5 使用手册。

vLLM

vLLM 是一个面向大语言模型的高吞吐量、内存高效的推理与服务引擎。

建议为Qwen3.6使用vllm>=0.19.0版本，可通过以下命令在新环境中安装：

shell 复制代码

uv pip install vllm --torch-backend=auto

请参阅其文档获取更多细节。

以下命令将在http://localhost:8000/v1创建API端点：

标准版本：以下命令可用于在8个GPU上使用张量并行创建最大上下文长度为262,144个token的API端点。
shell 复制代码
```
vllm serve Qwen/Qwen3.6-35B-A3B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 
```

工具调用：要支持工具使用，您可以使用以下命令。

shell 复制代码

vllm serve Qwen/Qwen3.6-35B-A3B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder

多令牌预测（MTP）：建议使用以下命令进行 MTP：

shell 复制代码

vllm serve Qwen/Qwen3.6-35B-A3B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

纯文本：以下命令跳过视觉编码器和多模态分析，以释放额外的键值缓存内存：

shell 复制代码

vllm serve Qwen/Qwen3.6-35B-A3B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --language-model-only

详细部署指南请参阅 vLLM Qwen3.5 配方

KTransformers

KTransformers 是一个基于CPU-GPU异构计算的灵活框架，专为前沿大语言模型推理优化而设计。运行Qwen3.6的KTransformers方案请查阅 KTransformers部署指南

Hugging Face Transformers

Hugging Face Transformers 提供轻量级服务器方案，适用于快速测试及中等负载部署场景。运行Qwen3.6需确保安装最新版 transformers 库：

shell 复制代码

pip install "transformers[serving]"

请参阅其文档了解更多详情。同时请确保已安装torchvision和pillow。

然后运行transformers serve命令启动服务器，API端点将位于http://localhost:8000/v1；如果可用，该命令会将模型放置在加速器上：

shell 复制代码

transformers serve Qwen/Qwen3.6-35B-A3B --port 8000 --continuous-batching

通过聊天补全API使用Qwen3.6

该聊天补全API可通过标准HTTP请求或OpenAI SDK访问。

以下我们展示使用OpenAI Python SDK的示例。

开始前，请确保已安装SDK并配置好API密钥及API基础URL，例如：

shell 复制代码

pip install -U openai

# Set the following accordingly
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="EMPTY"

$!提示$
我们推荐使用以下采样参数组合进行生成

常规任务的思考模式：temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

精确编码任务（如Web开发）的思考模式：temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0

常规任务的指令（非思考）模式：temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

推理任务的指令（非思考）模式：temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

请注意，不同推理框架对采样参数的支持程度可能有所不同。
$!重要$
Qwen3.6模型默认以思考模式运行，会在生成最终响应前输出由<think>\n...</think>\n\n标识的思考内容。

如需禁用思考内容直接获取响应，请参考此处示例。

python 复制代码

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {"role": "user", "content": "Type \"I love Qwen3.6\" backwards"},
]

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=messages,
    max_tokens=81920,
    temperature=1.0,
    top_p=0.95,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
    }, 
)
print("Chat response:", chat_response)

图像输入

python 复制代码

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/CI_Demo/mathv-1327.jpg"
                }
            },
            {
                "type": "text",
                "text": "The centres of the four illustrated circles are in the corners of the square. The two big circles touch each other and also the two little circles. With which factor do you have to multiply the radii of the little circles to obtain the radius of the big circles?\nChoices:\n(A) $\\frac{2}{9}$\n(B) $\\sqrt{5}$\n(C) $0.8 \\cdot \\pi$\n(D) 2.5\n(E) $1+\\sqrt{2}$"
            }
        ]
    }
]

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=messages,
    max_tokens=81920,
    temperature=1.0,
    top_p=0.95,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
    }, 
)
print("Chat response:", chat_response)

视频输入

python 复制代码

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video_url",
                "video_url": {
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/video/N1cdUjctpG8.mp4"
                }
            },
            {
                "type": "text",
                "text": "How many porcelain jars were discovered in the niches located in the primary chamber of the tomb?"
            }
        ]
    }
]

# When vLLM is launched with `--media-io-kwargs '{"video": {"num_frames": -1}}'`,
# video frame sampling can be configured via `extra_body` (e.g., by setting `fps`).
# This feature is currently supported only in vLLM.
#
# By default, `fps=2` and `do_sample_frames=True`.
# With `do_sample_frames=True`, you can customize the `fps` value to set your desired video sampling rate.
response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=messages,
    max_tokens=81920,
    temperature=1.0,
    top_p=0.95,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
        "mm_processor_kwargs": {"fps": 2, "do_sample_frames": True},
    }, 
)

print("Chat response:", chat_response)

指令（或无思考）模式

$!重要$
Qwen3.6 官方不支持 Qwen3 的软切换功能，即 /think 和 /nothink 命令。

Qwen3.6 默认会在响应前进行思考。

您可以通过配置 API 参数直接获取模型的无思考响应。

例如，

python 复制代码

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.6/demo/RealWorld/RealWorld-04.png"
                }
            },
            {
                "type": "text",
                "text": "Where is this?"
            }
        ]
    }
]

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=messages,
    max_tokens=32768,
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"enable_thinking": False},
    }, 
)
print("Chat response:", chat_response)

$!注意$
如果您使用的是阿里云模型工作室的API，除了修改model外，请使用"enable_thinking": False而非"chat_template_kwargs": {"enable_thinking": False}。

保留思考过程

默认情况下，仅保留处理最新用户消息时生成的思考块，形成常见的交错式思考模式。

Qwen3.6经过额外训练，能够保留并利用历史消息中的思考痕迹。

您可以通过设置preserve_thinking选项启用此功能：

python 复制代码

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [...]

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=messages,
    max_tokens=32768,
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"preserve_thinking": True},
    }, 
)
print("Chat response:", chat_response)

$!注意$
若使用阿里云模型服务平台API，除修改model外，请使用"preserve_thinking": True替代"chat_template_kwargs": {"preserve_thinking": False}配置

该特性对智能体场景尤为有益，保持完整推理上下文既能提升决策一致性，还能通过减少重复推理降低整体token消耗。同时可提升KV缓存利用率，优化思维模式与非思维模式下的推理效率。

智能体应用

Qwen3.6具备出色的工具调用能力。

Qwen-Agent框架

推荐使用Qwen-Agent框架快速构建基于Qwen3.6的智能体应用。

可通过MCP配置文件定义可用工具，使用Qwen-Agent内置工具，或自行集成其他工具。

python 复制代码

import os
from qwen_agent.agents import Assistant

# Define LLM
# Using Alibaba Cloud Model Studio
llm_cfg = {
    # Use the OpenAI-compatible model service provided by DashScope:
    'model': 'Qwen3.6-35B-A3B',
    'model_type': 'qwenvl_oai',
    'model_server': 'https://dashscope.aliyuncs.com/compatible-mode/v1',
    'api_key': os.getenv('DASHSCOPE_API_KEY'),

    'generate_cfg': {
        'use_raw_api': True,
        # When using Dash Scope OAI API, pass the parameter of whether to enable thinking mode in this way
        'extra_body': {
            'enable_thinking': True,
            'preserve_thinking': True,
        },
    },
}

# Using OpenAI-compatible API endpoint.
# functionality of the deployment frameworks and let Qwen-Agent automate the related operations.
#
# llm_cfg = {
#     # Use your own model service compatible with OpenAI API by vLLM/SGLang:
#     'model': 'Qwen/Qwen3.6-35B-A3B',
#     'model_type': 'qwenvl_oai',
#     'model_server': 'http://localhost:8000/v1',  # api_base
#     'api_key': 'EMPTY',
#
#     'generate_cfg': {
#         'use_raw_api': True,
#         # When using vLLM/SGLang OAI API, pass the parameter of whether to enable thinking mode in this way
#         'extra_body': {
#             'chat_template_kwargs': {'enable_thinking': True, 'preserve_thinking': True}
#         },
#     },
# }

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
            "filesystem": {
                "command": "npx",
                "args": ["-y", "@modelcontextprotocol/server-filesystem", "/Users/xxxx/Desktop"]
            }
        }
    }
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'Help me organize my desktop.'}]
for responses in bot.run(messages=messages):
    pass
print(responses)

# Streaming generation
messages = [{'role': 'user', 'content': 'Develop a dog website and save it on the desktop'}]
for responses in bot.run(messages=messages):
    pass
print(responses)

Qwen Code

Qwen Code 是一个面向终端的开源AI助手，专为Qwen模型优化。它能帮助你理解大型代码库、自动化繁琐工作，从而提升开发效率。

更多信息请查阅 Qwen Code。

超长文本处理

Qwen3.6原生支持最高262,144 tokens的上下文长度。

当任务总长度（含输入输出）超过该限制时，建议使用RoPE缩放技术（如YaRN）高效处理长文本。

当前transformers、vllm、ktransformers及sglang等多个推理框架已支持YaRN技术，

通常可通过两种方式启用：

修改模型配置文件：

在config.json的text_config中调整rope_parameters参数为：

json 复制代码

{
    "mrope_interleaved": true,
    "mrope_section": [
        11,
        11,
        10
    ],
    "rope_type": "yarn",
    "rope_theta": 10000000,
    "partial_rotary_factor": 0.25,
    "factor": 4.0,
    "original_max_position_embeddings": 262144,
}

传递命令行参数：

对于 vllm，你可以使用

shell 复制代码

VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve ... --hf-overrides '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' --max-model-len 1010000

对于 sglang 和 ktransformers，你可以使用

shell 复制代码

SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python -m sglang.launch_server ... --json-model-override-args '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' --context-length 1010000

$!注意$
所有知名的开源框架均采用静态YaRN实现，这意味着缩放因子不随输入长度变化，可能影响短文本性能 。

建议仅在需要处理长上下文时修改rope_parameters配置。

同时推荐根据需求调整factor参数。例如，若应用场景的典型上下文长度为524,288个token，则建议将factor设为2.0。

最佳实践

为获得最优性能，我们推荐以下配置方案：

采样参数配置：
- 根据模式与任务类型选用下列参数组合：
  - 通用任务思考模式 ：
    temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
  - 精确编程任务思考模式（如Web开发） ：
    temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
  - 通用任务指令模式（非思考模式） ：
    temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
  - 推理任务指令模式（非思考模式） ：
    temperature=1.0, top_p=1.0, top_k=40, min_p=0.0, presence_penalty=2.0, repetition_penalty=1.0
- 支持框架中可将presence_penalty参数在0至2之间调节以减少无限重复现象，但过高值可能导致语言混杂与轻微性能下降。
输出长度设定：
- 常规查询建议输出32,768个token。针对数学/编程竞赛等超高复杂度场景的基准测试，推荐将最大输出长度设为81,920 token，为模型提供充分生成空间以提升综合表现。
标准化输出格式：
- 数学问题：在提示词中加入"请逐步推理，并将最终答案置于\boxed{}内"
- 选择题 ：在提示词中追加JSON结构规范："请在answer字段仅显示选项字母，例如"answer": "C""
长视频理解优化：
- 为平衡文本与图像的推理效率，发布的video_preprocessor_config.json中size参数采用保守配置。建议将配置文件中的longest_edge参数设为469,762,048（对应224k视频token），可实现小时级视频的高帧率采样，从而获得更优性能表现。例如，
json 复制代码
```
{"longest_edge": 469762048, "shortest_edge": 4096}
```

或者，通过引擎启动参数覆盖默认值。具体实现细节请参考：vLLM / SGLang。