[Qwen] DataArguments Explained

DataArguments

Holds all configuration options for data loading and preprocessing in Qwen-VL fine-tuning. Passed as data_args after parsing from the command line (e.g. via HfArgumentParser) and used by make_supervised_data_module to build the dataset and collator.


Attributes

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| dataset_use | str | "" | Comma-separated dataset names or paths. Resolved via data_list() to get annotation_path and data_path for LazySupervisedDataset. |
| data_flatten | bool | False | If True, use FlattenedDataCollatorForSupervisedDataset and packed sequences; otherwise use DataCollatorForSupervisedDataset. |
| data_packing | bool | False | If True, enable sequence packing in the dataset (_get_packed_item). |
| base_interval | int | 2 | Base interval used in packing or flattening (exact meaning depends on the data_list / collator implementation). |
| max_pixels | int | 28 * 28 * 576 | Maximum number of pixels (e.g. H * W) for an image. Written to the image processor's size["longest_edge"] / max_pixels. |
| min_pixels | int | 28 * 28 * 16 | Minimum number of pixels for an image. Written to the image processor's size["shortest_edge"] / min_pixels. |
| video_max_frames | int or None | 8 | Maximum number of sampled frames per video (used by the video processor if present). |
| video_min_frames | int or None | 4 | Minimum number of sampled frames per video. |
| video_max_pixels | int | 1024 * 28 * 28 | Maximum total pixels for video frames. Set on the video processor when available. |
| video_min_pixels | int | 256 * 28 * 28 | Minimum total pixels for video frames. |
| video_fps | float | 2 | Frames per second used when sampling video. |
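
The attribute list can also be read as a plain Python dataclass. The following is a minimal sketch rebuilt only from the names, types, and defaults listed above. The real definition in qwenvl/train/argument.py may use dataclasses.field(...) with help metadata or a different attribute order, so treat this as an illustration rather than the repository's code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataArguments:
    """Sketch of the data configuration, rebuilt from the attribute list."""

    dataset_use: str = ""
    data_flatten: bool = False
    data_packing: bool = False
    base_interval: int = 2
    max_pixels: int = 28 * 28 * 576
    min_pixels: int = 28 * 28 * 16
    video_max_frames: Optional[int] = 8
    video_min_frames: Optional[int] = 4
    video_max_pixels: int = 1024 * 28 * 28
    video_min_pixels: int = 256 * 28 * 28
    video_fps: float = 2


# Instantiate with defaults and inspect the evaluated pixel limits.
defaults = DataArguments()
print(defaults.max_pixels, defaults.min_pixels)
```

Writing the defaults as products of 28 * 28 keeps them easy to compare: the default max_pixels evaluates to 451584 and the default min_pixels to 12544.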

Usage

Parsed together with ModelArguments and TrainingArguments in the training script:

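```python
import transformers

# DataArguments is defined in qwenvl/train/argument.py.
parser = transformers.HfArgumentParser(
    (ModelArguments, DataArguments, TrainingArguments)
)
model_args, data_args, training_args = parser.parse_args_into_dataclasses()

# `processor` is assumed to have been loaded earlier in the training script.
data_module = make_supervised_data_module(processor, data_args=data_args)
```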

Command-line example:

```bash
python qwenvl/train/train_qwen.py \
    --dataset_use "path/to/annotations.json" \
    --data_flatten True \
    --max_pixels 50176 \
    --min_pixels 784
```
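
The pixel values in the command above are multiples of 28 * 28 = 784, the same factor that appears in the documented defaults. A small helper makes that arithmetic explicit. The name pixel_budget and the reading of 784 as one 28 x 28 block are assumptions made for illustration; neither is part of the training script.

```python
BLOCK_PIXELS = 28 * 28  # factor that appears in the documented defaults


def pixel_budget(num_blocks: int) -> int:
    """Return the pixel count covered by `num_blocks` blocks of 28 x 28 pixels."""
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    return num_blocks * BLOCK_PIXELS


# Values used for --max_pixels and --min_pixels in the example command.
print(pixel_budget(64))
print(pixel_budget(1))
```

Here pixel_budget(64) gives 50176 and pixel_budget(1) gives 784, matching the two flags in the example.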

Note

  • DataArguments is defined in qwenvl/train/argument.py and is a dataclass. The parsed instance is typically named data_args in the training pipeline.
  • The image processor's pixel limits are updated in update_processor_pixels(processor, data_args) using max_pixels and min_pixels.
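
As a rough illustration of the second point, the sketch below copies the two limits from data_args onto an image processor object. It is a hypothetical stand-in, not the repository's update_processor_pixels: the function name apply_pixel_limits is invented, it takes the image processor directly rather than the full processor, and SimpleNamespace objects replace the real classes so the snippet runs on its own.

```python
from types import SimpleNamespace


def apply_pixel_limits(image_processor, data_args):
    """Copy min/max pixel limits from data_args onto an image processor.

    Hypothetical sketch of the behaviour described in the note, not the
    actual update_processor_pixels implementation.
    """
    image_processor.max_pixels = data_args.max_pixels
    image_processor.min_pixels = data_args.min_pixels
    # The attribute descriptions say the limits are also mirrored in `size`.
    image_processor.size["longest_edge"] = data_args.max_pixels
    image_processor.size["shortest_edge"] = data_args.min_pixels
    return image_processor


# Stand-in objects so the sketch runs without loading a real processor.
image_processor = SimpleNamespace(size={}, max_pixels=None, min_pixels=None)
data_args = SimpleNamespace(max_pixels=50176, min_pixels=784)

apply_pixel_limits(image_processor, data_args)
print(image_processor.size)
```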