[LLM in Practice] Fine-tuning the QwQ-32B Reasoning Model with the ms-swift Framework

1. Background

We previously discussed model fine-tuning in 《大模型训练/微调的一些经验分享》 and 《利用DeepSeek-R1数据微调蒸馏ChatGLM32B让大模型具备思考能力》. As long as base models are not yet strong enough out of the box, fine-tuning remains indispensable for commercial and vertical-domain applications; even DeepSeek-R1 or QwQ-32B alone can rarely guarantee the requirements of a commercial deployment.

In this post we walk through a LoRA fine-tuning experiment on Alibaba's recently open-sourced QwQ-32B reasoning model, using the ms-swift framework.

2. Fine-tuning Approach

ms-swift is ModelScope's fine-tuning and deployment framework for large language models and multimodal models. It supports training (pre-training, fine-tuning, human alignment), inference, evaluation, quantization, and deployment. Supported LLMs include Qwen2.5, GLM4, Llama3.3, DeepSeek-R1, and others; supported multimodal models include Qwen2.5-VL, Qwen2-Audio, Llama3.2-Vision, Llava, GLM4v, Xcomposer2.5, DeepSeek-VL2, and more.

The QwQ-32B weights are roughly 62 GB. Full-parameter fine-tuning would conservatively require about 20x that in GPU memory, i.e. around 1240 GB; even an 8-card A800 node, with roughly 640 GB of total VRAM, is not enough. For practicality we therefore chose LoRA fine-tuning, and completed training on 4 A800 GPUs.
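As a quick back-of-envelope check (my own sketch, not part of ms-swift), the figures above can be reproduced from the parameter count reported later in the training log; the 20x multiplier for full fine-tuning and the 80 GB per A800 card are assumptions carried over from the text:

python
# Rough VRAM estimate for full-parameter fine-tuning vs. what one node offers (sketch only).
BYTES_PER_PARAM_BF16 = 2

num_params = 32.76e9                                        # parameter count from the training log
weights_gb = num_params * BYTES_PER_PARAM_BF16 / 1024**3    # ~61 GB of bf16 weights (~62 GB on disk)

full_ft_gb = weights_gb * 20                                # conservative 20x rule of thumb -> ~1220 GB
node_gb = 8 * 80                                            # 8 x A800-80GB -> 640 GB, still far short

print(f"weights ~{weights_gb:.0f} GB, full fine-tune ~{full_ft_gb:.0f} GB, 8xA800 node {node_gb} GB")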

A side note: ms-swift's documentation is incomplete, which is one of the framework's weaker points.

3. Fine-tuning Experiment

3.1 Fine-tuning Data Format

The format is shown below. Note that the data we actually used for fine-tuning includes the reasoning (think) process; for how to build think-style corpora, see 《利用DeepSeek-R1数据微调蒸馏ChatGLM32B让大模型具备思考能力》. A small validation sketch follows the sample.

{"messages": [{"role": "system", "content": "你是一个乐于助人且知识渊博的AI助手。"},{"role": "user", "content": "请解释AlphaGo的基本原理。"}, {"role": "assistant", "content": "AlphaGo结合深度神经网络与树搜索算法,通过棋局数据进行训练,逐步提升棋力,能够判断围棋局势并选择最佳着法。"}]}{"messages": [{"role": "system", "content": "你是一个乐于助人且知识渊博的AI助手。"},{"role": "user", "content": "列举一些健康饮食的重要原则。"}, {"role": "assistant", "content": "健康饮食的重要原则包括:均衡饮食,合理摄入各类营养素;多吃新鲜蔬果,适量饮水;少吃糖、油、盐,忌吃过量快餐;养成规律生活作息习惯,适量运动。"}]} {"messages": [{"role": "system", "content": "你是一个乐于助人且知识渊博的AI助手。"},{"role": "user", "content": "给出一些改善睡眠质量的建议。"}, {"role": "assistant", "content": "改善睡眠质量的建议包括:保证睡眠时间,避免睡眠不足;睡前放松,避免刺激;养成规律作息时间,不要频繁熬夜;适量运动,但不要运动过度;睡前可以喝一杯热牛奶等温和饮料。"}]}

3.2 Training Script

bash
NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model /data/QwQ-32B \
    --train_type lora \
    --dataset '/data/qwq32b_sft_lora/rl-v0312.jsonl' \
    --torch_dtype bfloat16 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 8 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 5 \
    --logging_steps 5 \
    --max_length 8192 \
    --output_dir /data/qwq32b_sft_lora/output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --model_author swift \
    --model_name swift-robot \
    --deepspeed zero3
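For reference, the effective global batch size and the expected number of optimizer steps follow directly from these arguments. The sketch below (assuming 4 data-parallel ranks and that trailing partial accumulation windows are dropped) reproduces the 405 max steps that appear in the training log:

python
# Effective batch size and optimizer steps implied by the command above (sketch).
per_device_bs = 1        # --per_device_train_batch_size
grad_accum = 8           # --gradient_accumulation_steps
n_gpus = 4               # NPROC_PER_NODE / CUDA_VISIBLE_DEVICES
epochs = 5               # --num_train_epochs
train_rows = 2619        # num_rows of train_dataset reported in the log

global_bs = per_device_bs * grad_accum * n_gpus          # 32 samples per optimizer step
steps_per_epoch = (train_rows // n_gpus) // grad_accum   # 81 if the trailing partial window is dropped
total_steps = steps_per_epoch * epochs                   # 405, matching max_steps in the log

print(global_bs, steps_per_epoch, total_steps)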

3.3 Training Logs

The training logs show the loss converging steadily over the course of fine-tuning, and the framework reports the best model checkpoint at the end (a small log-parsing sketch follows the excerpt below).

[2025-03-11 19:28:37,083] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 4
[2025-03-11 19:28:37,084] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 4
[2025-03-11 19:28:37,092] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 4
[2025-03-11 19:28:37,401] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 771, num_elems = 32.76B
Loading checkpoint shards: 100%|██████████| 14/14 [00:16<00:00, 1.17s/it]

Loading checkpoint shards: 100%|██████████| 14/14 [00:16<00:00, 1.17s/it]

Loading checkpoint shards: 100%|██████████| 14/14 [00:16<00:00, 1.17s/it]

Loading checkpoint shards: 100%|██████████| 14/14 [00:17<00:00, 1.22s/it]

INFO:swift\] model_info: ModelInfo(model_type='qwq', model_dir='/data/QwQ-32B', torch_dtype=torch.bfloat16, max_model_len=131072, quant_method=None, quant_bits=None, rope_scaling=None, config=Qwen2Config { "_name_or_path": "/data/QwQ-32B", "architectures": \[ "Qwen2ForCausalLM" \], "attention_dropout": 0.0, "bos_token_id": 151643, "eos_token_id": 151645, "hidden_act": "silu", "hidden_size": 5120, "initializer_range": 0.02, "intermediate_size": 27648, "max_position_embeddings": 131072, "max_window_layers": 64, "model_type": "qwen2", "num_attention_heads": 40, "num_hidden_layers": 64, "num_key_value_heads": 8, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 1000000.0, "sliding_window": 32768, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.49.0", "use_cache": true, "use_sliding_window": false, "vocab_size": 152064 } , task_type='causal_lm', num_labels=None) \[INFO:swift\] model.generation_config: GenerationConfig { "bos_token_id": 151643, "eos_token_id": \[ 151645, 151643 \], "max_new_tokens": 64, "pad_token_id": 151643 } \[INFO:swift\] default_system: None \[INFO:swift\] The TrainArguments will be saved in: /data/qwq32b_sft_lora/output/v9-20250311-192834/args.json \[INFO:swift\] Start time of running main: 2025-03-11 19:28:54.707260 Map: 100%\|██████████\| 2645/2645 \[00:00\<00:00, 8426.21 examples/s

Map: 100%|██████████| 2645/2645 [00:00<00:00, 7697.28 examples/s]

Map: 100%|██████████| 2645/2645 [00:00<00:00, 6463.52 examples/s]

Map: 0%| | 0/2619 [00:00<?, ? examples/s]
[INFO:swift] create tmp_dir: /.cache/modelscope/hub/tmp/hf_datasets-i15lb3_o

Map: 100%|██████████| 2645/2645 [00:00<00:00, 9980.89 examples/s]

INFO:swift\] train_dataset: Dataset({ features: \['messages'\], num_rows: 2619 }) \[INFO:swift\] val_dataset: Dataset({ features: \['messages'\], num_rows: 26 \[INFO:swift\] \[INPUT\] \<\|im_start\|\>system ## 角色 你是一名AI客服,你很专业、友善、礼貌,擅长为客户提供导购服务。 ## 目标 结合历史对话和用户当前问题判断对话场景(如:商品推荐、商品咨询、拒答范围、闲聊),针对不同场景作亲和的"原始回复",然后根据\<红线\>内容,对"原始回复"作详细的安全审查改写,优化后直接输出。 ## 红线 - 不提供任何无关的信息 - 不提供任何电话号码和网站链接 - 不提供任何知识库中不存在的商品信息 ## 限制 - 避免谈论竞品、具体价格或优惠、商品链接、商品维修、订单、售后、物流。 - 注意区分商品型号,若商品型号指代不明,在回复中需主动向用户确认。 - 涉及商品信息时,你只能使用\<知识库\>中的内容回复。 - 当\<知识库\>为空或无法回答问题时,需要礼貌的婉拒回答。 - 在婉拒回答时,必须带这句话"建议您咨询官方商城的人工客服或访问官网"。 - 不要主动解释专业词汇。 - 不要暴露任何思考过程,直接回答。 - 回答中不要暴露"原始回复",直接输出安全审查改写后的内容。 ## 建议 - 介绍商品时推荐使用markdown格式 - 善用追问去引导客户把诉求表达清楚,但历史对话已经表达清楚的诉求不要在追问。 \<\|im_end\|\> \<\|im_start\|\>user 这款KD-85X85K和85x90L有什么区别?\<\|im_end\|\> \<\|im_start\|\>assistant X85K采用X1芯片,而X90L搭载XR认知芯片,能在画面整体优化、焦点主体强化和音画同步方面更出色。X90L通过多维度数据分析匹配画质数据库,重点优化用户关注区域。如果您追求更好画质建议选X90L,预算有限的话X85K也不错。\<\|im_end\|\> \<\|im_start\|\>user 尺寸一样大吗?\<\|im_end\|\> \<\|im_start\|\>assistant \ 好的,用户问的是"尺寸一样大吗?",我需要先回顾历史对话。之前的对话中,用户询问了KD-85X85K和85X90L的区别,客服提到两者的芯片不同以及画质优化方面的差异。现在用户进一步询问尺寸是否相同。 首先,检查知识库中的商品信息。KD-85X85K的型号中"85"代表85英寸,而XR-85X90L同样有"85"在型号中,所以尺寸应该都是85英寸。确认知识库中两者的描述,确实都是85寸的电视。因此,用户的问题应该回答尺寸相同,都是85英寸。同时,根据红线规则,不能提供无关信息,所以不需要涉及其他方面,比如价格或功能差异,除非用户进一步询问。另外,确保回复中没有链接或电话号码,符合安全审查的要求。因此,直接确认两者尺寸相同,并保持礼貌和简洁。 \ 两款电视的尺寸相同,都是85英寸大屏设计哦\~ 如果您还有其他疑问,欢迎随时告诉我。建议您咨询官方商城的人工客服或访问官网了解更多信息。\<\|im_end\|\> \[INFO:swift\] lora_config: LoraConfig(task_type='CAUSAL_LM', peft_type=\, auto_mapping=None, base_model_name_or_path='/data/QwQ-32B', revision=None, inference_mode=False, r=8, target_modules={'k_proj', 'down_proj', 'q_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj'}, exclude_modules=None, lora_alpha=32, lora_dropout=0.05, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=\[\], init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={}, eva_config=None, use_dora=False, layer_replication=None, runtime_config=LoraRuntimeConfig(ephemeral_gpu_offload=False), lora_bias=False, lora_dtype=None, lorap_lr_ratio=None, lorap_emb_lr=1e-06) \[INFO:swift\] model: PeftModelForCausalLM( (base_model): LoraModel( (model): Qwen2ForCausalLM( (model): Qwen2Model( (embed_tokens): Embedding(152064, 5120) (layers): ModuleList( (0-63): 64 x Qwen2DecoderLayer( (self_attn): Qwen2Attention( (q_proj): lora.Linear( (base_layer): Linear(in_features=5120, out_features=5120, bias=True) (lora_dropout): ModuleDict( (default): Dropout(p=0.05, inplace=False) ) (lora_A): ModuleDict( (default): Linear(in_features=5120, out_features=8, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=8, out_features=5120, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (k_proj): lora.Linear( (base_layer): Linear(in_features=5120, out_features=1024, bias=True) (lora_dropout): ModuleDict( (default): Dropout(p=0.05, inplace=False) ) (lora_A): ModuleDict( (default): Linear(in_features=5120, out_features=8, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=8, out_features=1024, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (v_proj): lora.Linear( (base_layer): Linear(in_features=5120, out_features=1024, bias=True) (lora_dropout): ModuleDict( (default): Dropout(p=0.05, inplace=False) ) (lora_A): ModuleDict( (default): 
Linear(in_features=5120, out_features=8, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=8, out_features=1024, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (o_proj): lora.Linear( (base_layer): Linear(in_features=5120, out_features=5120, bias=False) (lora_dropout): ModuleDict( (default): Dropout(p=0.05, inplace=False) ) (lora_A): ModuleDict( (default): Linear(in_features=5120, out_features=8, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=8, out_features=5120, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) ) (mlp): Qwen2MLP( (gate_proj): lora.Linear( (base_layer): Linear(in_features=5120, out_features=27648, bias=False) (lora_dropout): ModuleDict( (default): Dropout(p=0.05, inplace=False) ) (lora_A): ModuleDict( (default): Linear(in_features=5120, out_features=8, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=8, out_features=27648, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (up_proj): lora.Linear( (base_layer): Linear(in_features=5120, out_features=27648, bias=False) (lora_dropout): ModuleDict( (default): Dropout(p=0.05, inplace=False) ) (lora_A): ModuleDict( (default): Linear(in_features=5120, out_features=8, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=8, out_features=27648, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (down_proj): lora.Linear( (base_layer): Linear(in_features=27648, out_features=5120, bias=False) (lora_dropout): ModuleDict( (default): Dropout(p=0.05, inplace=False) ) (lora_A): ModuleDict( (default): Linear(in_features=27648, out_features=8, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=8, out_features=5120, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (act_fn): SiLU() ) (input_layernorm): Qwen2RMSNorm((0,), eps=1e-05) (post_attention_layernorm): Qwen2RMSNorm((0,), eps=1e-05) ) ) (norm): Qwen2RMSNorm((0,), eps=1e-05) (rotary_emb): Qwen2RotaryEmbedding() ) (lm_head): Linear(in_features=5120, out_features=152064, bias=False) ) ) ) \[INFO:swift\] model_parameter_info: PeftModelForCausalLM: 32830.9852M Params (67.1089M Trainable \[0.2044%\]), 0.0001M Buffers. Parameter Offload: Total persistent parameters: 25760768 in 1025 params {**'loss': 1.32348752**, 'token_acc': 0.70985222, 'grad_norm': 0.80846994, 'learning_rate': 4.76e-06, 'memory(GiB)': 60.01, 'train_speed(iter/s)': 0.01743, 'epoch': 0.01, 'global_step/max_steps': '1/405', 'percentage': '0.25%', 'elapsed_time': '53s', 'remaining_time': '6h 2m 10s'} Train: 0%\| \| 2/405 \[01:46\<5:56:12, 53.03s/it\]\[2025-03-11 19:32:17,225\] \[WARNING\] \[stage3.py:2139:step\] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time {'loss': 1.24938524, 'token_acc': 0.69148486, 'grad_norm': 0.86987531, 'learning_rate': 2.381e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.018535, 'epoch': 0.06, 'global_step/max_steps': '5/405', 'percentage': '1.23%', 'elapsed_time': '4m 26s', 'remaining_time': '5h 54m 54s'} {'loss': 1.22446156, 'token_acc': 0.69278702, 'grad_norm': 0.77689102, 'learning_rate': 4.762e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019271, 'epoch': 0.12, 'global_step/max_steps': '10/405', 'percentage': '2.47%', 'elapsed_time': '8m 35s', 'remaining_time': '5h 39m 15s'} {'loss': 1.13267899, 'token_acc': 0.71570596, 'grad_norm': 0.40197327, 'learning_rate': 7.143e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.01927, 'epoch': 0.18, 'global_step/max_steps': '15/405', 'percentage': '3.70%', 'elapsed_time': '12m 54s', 'remaining_time': '5h 35m 45s'} {'loss': 0.97332687, 'token_acc': 0.72897148, 'grad_norm': 0.34967286, 'learning_rate': 9.524e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019607, 'epoch': 0.24, 'global_step/max_steps': '20/405', 'percentage': '4.94%', 'elapsed_time': '16m 56s', 'remaining_time': '5h 26m 7s'} {'loss': 0.95233335, 'token_acc': 0.71795399, 'grad_norm': 0.32512059, 'learning_rate': 9.997e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.01971, 'epoch': 0.31, 'global_step/max_steps': '25/405', 'percentage': '6.17%', 'elapsed_time': '21m 4s', 'remaining_time': '5h 20m 24s'} {'loss': 0.92778244, 'token_acc': 0.72106543, 'grad_norm': 0.22549374, 'learning_rate': 9.986e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019805, 'epoch': 0.37, 'global_step/max_steps': '30/405', 'percentage': '7.41%', 'elapsed_time': '25m 11s', 'remaining_time': '5h 14m 49s'} {'loss': 0.91093416, 'token_acc': 0.73585944, 'grad_norm': 0.21213417, 'learning_rate': 9.967e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019825, 'epoch': 0.43, 'global_step/max_steps': '35/405', 'percentage': '8.64%', 'elapsed_time': '29m 21s', 'remaining_time': '5h 10m 25s'} {'loss': 0.86407394, 'token_acc': 0.73746765, 'grad_norm': 0.22134356, 'learning_rate': 9.94e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019853, 'epoch': 0.49, 'global_step/max_steps': '40/405', 'percentage': '9.88%', 'elapsed_time': '33m 31s', 'remaining_time': '5h 5m 52s'} {'loss': 0.86335802, 'token_acc': 0.73666894, 'grad_norm': 0.236291, 'learning_rate': 9.904e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019929, 'epoch': 0.55, 'global_step/max_steps': '45/405', 'percentage': '11.11%', 'elapsed_time': '37m 34s', 'remaining_time': '5h 0m 35s'} {'loss': 0.81436214, 'token_acc': 0.76214197, 'grad_norm': 0.19902774, 'learning_rate': 9.86e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019918, 'epoch': 0.61, 'global_step/max_steps': '50/405', 'percentage': '12.35%', 'elapsed_time': '41m 46s', 'remaining_time': '4h 56m 37s'} Train: 12%\|█▏ \| 50/405 \[41:46\<5:01:46, 51.00s/it

{'eval_loss': 0.82470703, 'eval_token_acc': 0.75927635, 'eval_runtime': 15.7907, 'eval_samples_per_second': 1.647, 'eval_steps_per_second': 0.443, 'epoch': 0.61, 'global_step/max_steps': '50/405', 'percentage': '12.35%', 'elapsed_time': '42m 2s', 'remaining_time': '4h 58m 30s'}

Val: 100%|██████████| 7/7 [00:13<00:00, 1.87s/it]

INFO:swift\] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-50 \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\* \[2025-03-11 20:12:35,271\] \[WARNING\] \[stage3.py:2139:step\] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time {'loss': 0.84376278, 'token_acc': 0.74929837, 'grad_norm': 0.29243814, 'learning_rate': 9.808e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019739, 'epoch': 0.67, 'global_step/max_steps': '55/405', 'percentage': '13.58%', 'elapsed_time': '46m 22s', 'remaining_time': '4h 55m 8s'} {'loss': 0.82531147, 'token_acc': 0.75041408, 'grad_norm': 0.29134859, 'learning_rate': 9.748e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019703, 'epoch': 0.73, 'global_step/max_steps': '60/405', 'percentage': '14.81%', 'elapsed_time': '50m 41s', 'remaining_time': '4h 51m 29s'} {'loss': 0.8170001, 'token_acc': 0.75919308, 'grad_norm': 0.24516849, 'learning_rate': 9.68e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019724, 'epoch': 0.8, 'global_step/max_steps': '65/405', 'percentage': '16.05%', 'elapsed_time': '54m 51s', 'remaining_time': '4h 46m 59s'} {'loss': 0.81388254, 'token_acc': 0.75490298, 'grad_norm': 0.28124103, 'learning_rate': 9.604e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019781, 'epoch': 0.86, 'global_step/max_steps': '70/405', 'percentage': '17.28%', 'elapsed_time': '58m 55s', 'remaining_time': '4h 41m 58s'} {'loss': 0.81019135, 'token_acc': 0.74177519, 'grad_norm': 0.28694744, 'learning_rate': 9.52e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019784, 'epoch': 0.92, 'global_step/max_steps': '75/405', 'percentage': '18.52%', 'elapsed_time': '1h 3m 7s', 'remaining_time': '4h 37m 44s'} {'loss': 0.76696019, 'token_acc': 0.7639197, 'grad_norm': 0.311834, 'learning_rate': 9.429e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019813, 'epoch': 0.98, 'global_step/max_steps': '80/405', 'percentage': '19.75%', 'elapsed_time': '1h 7m 14s', 'remaining_time': '4h 33m 8s'} {'loss': 0.76195569, 'token_acc': 0.76973895, 'grad_norm': 0.43021317, 'learning_rate': 9.33e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019856, 'epoch': 1.04, 'global_step/max_steps': '85/405', 'percentage': '20.99%', 'elapsed_time': '1h 11m 17s', 'remaining_time': '4h 28m 22s'} {'loss': 0.7821136, 'token_acc': 0.74735605, 'grad_norm': 0.41759374, 'learning_rate': 9.224e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019888, 'epoch': 1.1, 'global_step/max_steps': '90/405', 'percentage': '22.22%', 'elapsed_time': '1h 15m 21s', 'remaining_time': '4h 23m 45s'} Train: 23%\|██▎ \| 92/405 \[1:17:03\<4:18:27, 49.54s/it\]\[2025-03-11 20:46:52,504\] \[WARNING\] \[stage3.py:2139:step\] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time {'loss': 0.74946299, 'token_acc': 0.76573743, 'grad_norm': 0.31465808, 'learning_rate': 9.111e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019875, 'epoch': 1.16, 'global_step/max_steps': '95/405', 'percentage': '23.46%', 'elapsed_time': '1h 19m 36s', 'remaining_time': '4h 19m 45s'} {'loss': 0.75774355, 'token_acc': 0.76279737, 'grad_norm': 0.34568468, 'learning_rate': 8.992e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019874, 'epoch': 1.22, 'global_step/max_steps': '100/405', 'percentage': '24.69%', 'elapsed_time': '1h 23m 48s', 'remaining_time': '4h 15m 35s'} Train: 25%\|██▍ \| 100/405 \[1:23:48\<4:17:26, 50.64s/it

{'eval_loss': 0.720375, 'eval_token_acc': 0.77822903, 'eval_runtime': 15.6988, 'eval_samples_per_second': 1.656, 'eval_steps_per_second': 0.446, 'epoch': 1.22, 'global_step/max_steps': '100/405', 'percentage': '24.69%', 'elapsed_time': '1h 24m 3s', 'remaining_time': '4h 16m 23s'}

Val: 100%|██████████| 7/7 [00:13<00:00, 1.86s/it]

INFO:swift\] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-100 \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\* {'loss': 0.72672591, 'token_acc': 0.76866752, 'grad_norm': 0.64908534, 'learning_rate': 8.865e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019801, 'epoch': 1.28, 'global_step/max_steps': '105/405', 'percentage': '25.93%', 'elapsed_time': '1h 28m 19s', 'remaining_time': '4h 12m 20s'} {'loss': 0.72024732, 'token_acc': 0.76941662, 'grad_norm': 0.36116413, 'learning_rate': 8.732e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019815, 'epoch': 1.34, 'global_step/max_steps': '110/405', 'percentage': '27.16%', 'elapsed_time': '1h 32m 27s', 'remaining_time': '4h 7m 58s'} {'loss': 0.68267331, 'token_acc': 0.7761134, 'grad_norm': 0.38293342, 'learning_rate': 8.593e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019806, 'epoch': 1.4, 'global_step/max_steps': '115/405', 'percentage': '28.40%', 'elapsed_time': '1h 36m 42s', 'remaining_time': '4h 3m 52s'} {'loss': 0.71170344, 'token_acc': 0.78053525, 'grad_norm': 0.3713337, 'learning_rate': 8.448e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019831, 'epoch': 1.46, 'global_step/max_steps': '120/405', 'percentage': '29.63%', 'elapsed_time': '1h 40m 47s', 'remaining_time': '3h 59m 22s'} {'loss': 0.70673256, 'token_acc': 0.77159011, 'grad_norm': 0.36822507, 'learning_rate': 8.297e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019845, 'epoch': 1.53, 'global_step/max_steps': '125/405', 'percentage': '30.86%', 'elapsed_time': '1h 44m 55s', 'remaining_time': '3h 55m 1s'} {'loss': 0.67356033, 'token_acc': 0.7921583, 'grad_norm': 0.4612934, 'learning_rate': 8.14e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019851, 'epoch': 1.59, 'global_step/max_steps': '130/405', 'percentage': '32.10%', 'elapsed_time': '1h 49m 5s', 'remaining_time': '3h 50m 45s'} Train: 33%\|███▎ \| 132/405 \[1:50:48\<3:50:21, 50.63s/it\]\[2025-03-11 21:20:37,710\] \[WARNING\] \[stage3.py:2139:step\] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time {'loss': 0.68124514, 'token_acc': 0.78771819, 'grad_norm': 0.46047566, 'learning_rate': 7.978e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019853, 'epoch': 1.65, 'global_step/max_steps': '135/405', 'percentage': '33.33%', 'elapsed_time': '1h 53m 16s', 'remaining_time': '3h 46m 32s'} {'loss': 0.67308445, 'token_acc': 0.78043745, 'grad_norm': 0.46205863, 'learning_rate': 7.812e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019881, 'epoch': 1.71, 'global_step/max_steps': '140/405', 'percentage': '34.57%', 'elapsed_time': '1h 57m 18s', 'remaining_time': '3h 42m 2s'} {'loss': 0.65709753, 'token_acc': 0.794716, 'grad_norm': 0.46728156, 'learning_rate': 7.64e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019893, 'epoch': 1.77, 'global_step/max_steps': '145/405', 'percentage': '35.80%', 'elapsed_time': '2h 1m 25s', 'remaining_time': '3h 37m 43s'} {'loss': 0.66156731, 'token_acc': 0.78602904, 'grad_norm': 0.45510392, 'learning_rate': 7.464e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019899, 'epoch': 1.83, 'global_step/max_steps': '150/405', 'percentage': '37.04%', 'elapsed_time': '2h 5m 34s', 'remaining_time': '3h 33m 28s'} Train: 37%\|███▋ \| 150/405 \[2:05:34\<3:31:14, 49.70s/it

{'eval_loss': 0.65251857, 'eval_token_acc': 0.79853547, 'eval_runtime': 15.6574, 'eval_samples_per_second': 1.661, 'eval_steps_per_second': 0.447, 'epoch': 1.83, 'global_step/max_steps': '150/405', 'percentage': '37.04%', 'elapsed_time': '2h 5m 50s', 'remaining_time': '3h 33m 55s'}

Val: 100%|██████████| 7/7 [00:12<00:00, 1.86s/it]

INFO:swift\] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-150 \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\* {'loss': 0.65750132, 'token_acc': 0.78596818, 'grad_norm': 0.47214887, 'learning_rate': 7.285e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019847, 'epoch': 1.89, 'global_step/max_steps': '155/405', 'percentage': '38.27%', 'elapsed_time': '2h 10m 6s', 'remaining_time': '3h 29m 50s'} {'loss': 0.63944697, 'token_acc': 0.80483245, 'grad_norm': 0.49222756, 'learning_rate': 7.101e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019853, 'epoch': 1.95, 'global_step/max_steps': '160/405', 'percentage': '39.51%', 'elapsed_time': '2h 14m 15s', 'remaining_time': '3h 25m 35s'} {'loss': 0.63674178, 'token_acc': 0.80768833, 'grad_norm': 0.59897131, 'learning_rate': 6.913e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019895, 'epoch': 2.01, 'global_step/max_steps': '165/405', 'percentage': '40.74%', 'elapsed_time': '2h 18m 10s', 'remaining_time': '3h 20m 58s'} {'loss': 0.64350748, 'token_acc': 0.80203466, 'grad_norm': 0.51221188, 'learning_rate': 6.723e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019886, 'epoch': 2.07, 'global_step/max_steps': '170/405', 'percentage': '41.98%', 'elapsed_time': '2h 22m 25s', 'remaining_time': '3h 16m 52s'} {'loss': 0.59812784, 'token_acc': 0.80184307, 'grad_norm': 0.52895864, 'learning_rate': 6.53e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019905, 'epoch': 2.13, 'global_step/max_steps': '175/405', 'percentage': '43.21%', 'elapsed_time': '2h 26m 27s', 'remaining_time': '3h 12m 29s'} {'loss': 0.60168495, 'token_acc': 0.80204451, 'grad_norm': 0.54771068, 'learning_rate': 6.334e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019928, 'epoch': 2.2, 'global_step/max_steps': '180/405', 'percentage': '44.44%', 'elapsed_time': '2h 30m 28s', 'remaining_time': '3h 8m 6s'} Train: 44%\|████▍ \| 180/405 \[2:30:28\<2:59:27, 47.85s/it\]\[2025-03-11 22:00:31,985\] \[WARNING\] \[stage3.py:2139:step\] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time {'loss': 0.59545937, 'token_acc': 0.80456827, 'grad_norm': 0.57579227, 'learning_rate': 6.135e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019937, 'epoch': 2.26, 'global_step/max_steps': '185/405', 'percentage': '45.68%', 'elapsed_time': '2h 34m 35s', 'remaining_time': '3h 3m 50s'} {'loss': 0.59948916, 'token_acc': 0.80121831, 'grad_norm': 0.53543298, 'learning_rate': 5.935e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019937, 'epoch': 2.32, 'global_step/max_steps': '190/405', 'percentage': '46.91%', 'elapsed_time': '2h 38m 46s', 'remaining_time': '2h 59m 39s'} {'loss': 0.59326115, 'token_acc': 0.79956183, 'grad_norm': 0.55039623, 'learning_rate': 5.734e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01994, 'epoch': 2.38, 'global_step/max_steps': '195/405', 'percentage': '48.15%', 'elapsed_time': '2h 42m 55s', 'remaining_time': '2h 55m 27s'} {'loss': 0.58592167, 'token_acc': 0.80714245, 'grad_norm': 0.69052059, 'learning_rate': 5.531e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019935, 'epoch': 2.44, 'global_step/max_steps': '200/405', 'percentage': '49.38%', 'elapsed_time': '2h 47m 9s', 'remaining_time': '2h 51m 19s'} Train: 49%\|████▉ \| 200/405 \[2:47:09\<2:49:56, 49.74s/it

{'eval_loss': 0.56886792, 'eval_token_acc': 0.82167251, 'eval_runtime': 15.7067, 'eval_samples_per_second': 1.655, 'eval_steps_per_second': 0.446, 'epoch': 2.44, 'global_step/max_steps': '200/405', 'percentage': '49.38%', 'elapsed_time': '2h 47m 24s', 'remaining_time': '2h 51m 36s'}

Val: 100%|██████████| 7/7 [00:13<00:00, 1.86s/it]

INFO:swift\] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-200 \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\* Train: 50%\|████▉ \| 202/405 \[2:49:15\<3:08:49, 55.81s/it\]\[2025-03-11 22:19:18,274\] \[WARNING\] \[stage3.py:2139:step\] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time {'loss': 0.54969444, 'token_acc': 0.82021995, 'grad_norm': 0.51556883, 'learning_rate': 5.327e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019871, 'epoch': 2.5, 'global_step/max_steps': '205/405', 'percentage': '50.62%', 'elapsed_time': '2h 51m 52s', 'remaining_time': '2h 47m 41s'} {'loss': 0.52501326, 'token_acc': 0.81536282, 'grad_norm': 0.54576287, 'learning_rate': 5.123e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019883, 'epoch': 2.56, 'global_step/max_steps': '210/405', 'percentage': '51.85%', 'elapsed_time': '2h 55m 58s', 'remaining_time': '2h 43m 24s'} {'loss': 0.5639473, 'token_acc': 0.8235682, 'grad_norm': 0.51644597, 'learning_rate': 4.918e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019872, 'epoch': 2.62, 'global_step/max_steps': '215/405', 'percentage': '53.09%', 'elapsed_time': '3h 0m 15s', 'remaining_time': '2h 39m 17s'} Train: 53%\|█████▎ \| 215/405 \[3:00:15\<2:40:55, 50.82s/it\]\[2025-03-11 22:30:17,273\] \[WARNING\] \[stage3.py:2139:step\] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time {'loss': 0.54539089, 'token_acc': 0.82929161, 'grad_norm': 0.5427966, 'learning_rate': 4.714e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019887, 'epoch': 2.69, 'global_step/max_steps': '220/405', 'percentage': '54.32%', 'elapsed_time': '3h 4m 18s', 'remaining_time': '2h 34m 59s'} {'loss': 0.54721932, 'token_acc': 0.82292752, 'grad_norm': 0.58632606, 'learning_rate': 4.51e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01989, 'epoch': 2.75, 'global_step/max_steps': '225/405', 'percentage': '55.56%', 'elapsed_time': '3h 8m 28s', 'remaining_time': '2h 30m 46s'} {'loss': 0.51745701, 'token_acc': 0.82614152, 'grad_norm': 0.51928985, 'learning_rate': 4.307e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019892, 'epoch': 2.81, 'global_step/max_steps': '230/405', 'percentage': '56.79%', 'elapsed_time': '3h 12m 38s', 'remaining_time': '2h 26m 34s'} {'loss': 0.54157047, 'token_acc': 0.81710944, 'grad_norm': 0.71657186, 'learning_rate': 4.105e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019899, 'epoch': 2.87, 'global_step/max_steps': '235/405', 'percentage': '58.02%', 'elapsed_time': '3h 16m 46s', 'remaining_time': '2h 22m 20s'} {'loss': 0.54548702, 'token_acc': 0.81284619, 'grad_norm': 0.50686509, 'learning_rate': 3.904e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019907, 'epoch': 2.93, 'global_step/max_steps': '240/405', 'percentage': '59.26%', 'elapsed_time': '3h 20m 52s', 'remaining_time': '2h 18m 6s'} {'loss': 0.51912632, 'token_acc': 0.83365523, 'grad_norm': 0.68279731, 'learning_rate': 3.706e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019907, 'epoch': 2.99, 'global_step/max_steps': '245/405', 'percentage': '60.49%', 'elapsed_time': '3h 25m 3s', 'remaining_time': '2h 13m 55s'} {'loss': 0.52836185, 'token_acc': 0.83409461, 'grad_norm': 0.55463023, 'learning_rate': 3.509e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019923, 'epoch': 3.05, 'global_step/max_steps': '250/405', 'percentage': '61.73%', 'elapsed_time': '3h 29m 4s', 'remaining_time': '2h 9m 37s'} Train: 62%\|██████▏ \| 250/405 \[3:29:04\<2:07:39, 49.41s/it

{'eval_loss': 0.52870411, 'eval_token_acc': 0.83231801, 'eval_runtime': 15.7131, 'eval_samples_per_second': 1.655, 'eval_steps_per_second': 0.445, 'epoch': 3.05, 'global_step/max_steps': '250/405', 'percentage': '61.73%', 'elapsed_time': '3h 29m 20s', 'remaining_time': '2h 9m 47s'}

Val: 100%|██████████| 7/7 [00:13<00:00, 1.86s/it]

INFO:swift\] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-250 \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\* {'loss': 0.51691947, 'token_acc': 0.82422604, 'grad_norm': 0.53855505, 'learning_rate': 3.316e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019896, 'epoch': 3.11, 'global_step/max_steps': '255/405', 'percentage': '62.96%', 'elapsed_time': '3h 33m 33s', 'remaining_time': '2h 5m 37s'} Train: 63%\|██████▎ \| 257/405 \[3:35:20\<2:08:40, 52.17s/it\]\[2025-03-11 23:05:38,683\] \[WARNING\] \[stage3.py:2139:step\] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time {'loss': 0.50732822, 'token_acc': 0.83722172, 'grad_norm': 0.7386699, 'learning_rate': 3.124e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019885, 'epoch': 3.17, 'global_step/max_steps': '260/405', 'percentage': '64.20%', 'elapsed_time': '3h 37m 51s', 'remaining_time': '2h 1m 30s'} {'loss': 0.50304022, 'token_acc': 0.84518402, 'grad_norm': 0.5332581, 'learning_rate': 2.936e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019889, 'epoch': 3.23, 'global_step/max_steps': '265/405', 'percentage': '65.43%', 'elapsed_time': '3h 42m 0s', 'remaining_time': '1h 57m 17s'} {'loss': 0.5034606, 'token_acc': 0.81697432, 'grad_norm': 0.67853403, 'learning_rate': 2.752e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019886, 'epoch': 3.29, 'global_step/max_steps': '270/405', 'percentage': '66.67%', 'elapsed_time': '3h 46m 13s', 'remaining_time': '1h 53m 6s'} {'loss': 0.5183465, 'token_acc': 0.83563731, 'grad_norm': 0.6259943, 'learning_rate': 2.571e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019885, 'epoch': 3.35, 'global_step/max_steps': '275/405', 'percentage': '67.90%', 'elapsed_time': '3h 50m 25s', 'remaining_time': '1h 48m 55s'} {'loss': 0.51731062, 'token_acc': 0.83534514, 'grad_norm': 0.61180005, 'learning_rate': 2.394e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019881, 'epoch': 3.42, 'global_step/max_steps': '280/405', 'percentage': '69.14%', 'elapsed_time': '3h 54m 40s', 'remaining_time': '1h 44m 45s'} {'loss': 0.48814211, 'token_acc': 0.82283914, 'grad_norm': 0.57190785, 'learning_rate': 2.222e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019883, 'epoch': 3.48, 'global_step/max_steps': '285/405', 'percentage': '70.37%', 'elapsed_time': '3h 58m 50s', 'remaining_time': '1h 40m 33s'} {'loss': 0.4921607, 'token_acc': 0.82588464, 'grad_norm': 0.52349298, 'learning_rate': 2.054e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019888, 'epoch': 3.54, 'global_step/max_steps': '290/405', 'percentage': '71.60%', 'elapsed_time': '4h 2m 58s', 'remaining_time': '1h 36m 20s'} {'loss': 0.46711798, 'token_acc': 0.85013139, 'grad_norm': 0.6346718, 'learning_rate': 1.892e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019889, 'epoch': 3.6, 'global_step/max_steps': '295/405', 'percentage': '72.84%', 'elapsed_time': '4h 7m 8s', 'remaining_time': '1h 32m 9s'} {'loss': 0.48140554, 'token_acc': 0.83738891, 'grad_norm': 0.62962168, 'learning_rate': 1.734e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019894, 'epoch': 3.66, 'global_step/max_steps': '300/405', 
'percentage': '74.07%', 'elapsed_time': '4h 11m 16s', 'remaining_time': '1h 27m 56s'} Train: 74%\|███████▍ \| 300/405 \[4:11:16\<1:25:56, 49.11s/it

{'eval_loss': 0.50453913, 'eval_token_acc': 0.84007138, 'eval_runtime': 15.7353, 'eval_samples_per_second': 1.652, 'eval_steps_per_second': 0.445, 'epoch': 3.66, 'global_step/max_steps': '300/405', 'percentage': '74.07%', 'elapsed_time': '4h 11m 32s', 'remaining_time': '1h 28m 2s'}

Val: 100%|██████████| 7/7 [00:13<00:00, 1.87s/it]

INFO:swift\] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-300 \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\* {'loss': 0.48724985, 'token_acc': 0.83901735, 'grad_norm': 0.59317845, 'learning_rate': 1.582e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019877, 'epoch': 3.72, 'global_step/max_steps': '305/405', 'percentage': '75.31%', 'elapsed_time': '4h 15m 40s', 'remaining_time': '1h 23m 49s'} {'loss': 0.47789598, 'token_acc': 0.85550931, 'grad_norm': 0.67272985, 'learning_rate': 1.436e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019881, 'epoch': 3.78, 'global_step/max_steps': '310/405', 'percentage': '76.54%', 'elapsed_time': '4h 19m 49s', 'remaining_time': '1h 19m 37s'} {'loss': 0.49318271, 'token_acc': 0.81770707, 'grad_norm': 0.64257801, 'learning_rate': 1.295e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01989, 'epoch': 3.84, 'global_step/max_steps': '315/405', 'percentage': '77.78%', 'elapsed_time': '4h 23m 53s', 'remaining_time': '1h 15m 23s'} Train: 78%\|███████▊ \| 316/405 \[4:24:43\<1:14:10, 50.00s/it\]\[2025-03-11 23:54:59,684\] \[WARNING\] \[stage3.py:2139:step\] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time {'loss': 0.46136761, 'token_acc': 0.8458454, 'grad_norm': 0.62953055, 'learning_rate': 1.161e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01989, 'epoch': 3.91, 'global_step/max_steps': '320/405', 'percentage': '79.01%', 'elapsed_time': '4h 28m 5s', 'remaining_time': '1h 11m 12s'} {'loss': 0.4856822, 'token_acc': 0.83825816, 'grad_norm': 0.64470125, 'learning_rate': 1.033e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019893, 'epoch': 3.97, 'global_step/max_steps': '325/405', 'percentage': '80.25%', 'elapsed_time': '4h 32m 13s', 'remaining_time': '1h 7m 0s'} {'loss': 0.46592345, 'token_acc': 0.84528571, 'grad_norm': 0.65905805, 'learning_rate': 9.12e-06, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019917, 'epoch': 4.02, 'global_step/max_steps': '330/405', 'percentage': '81.48%', 'elapsed_time': '4h 36m 5s', 'remaining_time': '1h 2m 44s'} {'loss': 0.48042569, 'token_acc': 0.85237186, 'grad_norm': 0.61635281, 'learning_rate': 7.98e-06, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019923, 'epoch': 4.09, 'global_step/max_steps': '335/405', 'percentage': '82.72%', 'elapsed_time': '4h 40m 11s', 'remaining_time': '58m 32s'} {'loss': 0.45569935, 'token_acc': 0.83371485, 'grad_norm': 0.64527875, 'learning_rate': 6.9e-06, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01993, 'epoch': 4.15, 'global_step/max_steps': '340/405', 'percentage': '83.95%', 'elapsed_time': '4h 44m 16s', 'remaining_time': '54m 20s'} {'loss': 0.46417255, 'token_acc': 0.84960884, 'grad_norm': 0.67313113, 'learning_rate': 5.9e-06, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019931, 'epoch': 4.21, 'global_step/max_steps': '345/405', 'percentage': '85.19%', 'elapsed_time': '4h 48m 26s', 'remaining_time': '50m 9s'} {'loss': 0.47292795, 'token_acc': 0.85013211, 'grad_norm': 0.59537749, 'learning_rate': 4.98e-06, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019936, 'epoch': 4.27, 'global_step/max_steps': '350/405', 'percentage': '86.42%', 
'elapsed_time': '4h 52m 32s', 'remaining_time': '45m 58s'} Train: 86%\|████████▋ \| 350/405 \[4:52:32\<45:19, 49.44s/it

{'eval_loss': 0.490695, 'eval_token_acc': 0.84296351, 'eval_runtime': 15.6909, 'eval_samples_per_second': 1.657, 'eval_steps_per_second': 0.446, 'epoch': 4.27, 'global_step/max_steps': '350/405', 'percentage': '86.42%', 'elapsed_time': '4h 52m 48s', 'remaining_time': '46m 0s'}

Val: 100%|██████████| 7/7 [00:13<00:00, 1.86s/it]

INFO:swift\] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-350 \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\* Train: 87%\|████████▋ \| 352/405 \[4:54:35\<48:02, 54.39s/it\]\[2025-03-12 00:25:04,775\] \[WARNING\] \[stage3.py:2139:step\] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time {'loss': 0.46881456, 'token_acc': 0.83740075, 'grad_norm': 0.59338625, 'learning_rate': 4.13e-06, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01991, 'epoch': 4.33, 'global_step/max_steps': '355/405', 'percentage': '87.65%', 'elapsed_time': '4h 57m 6s', 'remaining_time': '41m 50s'} {'eval_loss': 0.48915866, 'eval_token_acc': 0.84357886, 'eval_runtime': 15.8494, 'eval_samples_per_second': 1.64, 'eval_steps_per_second': 0.442, 'epoch': 4.88, 'global_step/max_steps': '400/405', 'percentage': '98.77%', 'elapsed_time': '5h 34m 38s', 'remaining_time': '4m 10s'} Val: 100%\|██████████\| 7/7 \[00:13\<00:00, 1.88s/it\].31s/it

[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-400
{'loss': 0.49458728, 'token_acc': 0.83115697, 'grad_norm': 0.65133526, 'learning_rate': 0.0, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01992, 'epoch': 4.94, 'global_step/max_steps': '405/405', 'percentage': '100.00%', 'elapsed_time': '5h 38m 48s', 'remaining_time': '0s'}
Train: 100%|██████████| 405/405 [5:38:48<00:00, 50.98s/it]

{'eval_loss': 0.4893617, 'eval_token_acc': 0.84308658, 'eval_runtime': 15.9508, 'eval_samples_per_second': 1.63, 'eval_steps_per_second': 0.439, 'epoch': 4.94, 'global_step/max_steps': '405/405', 'percentage': '100.00%', 'elapsed_time': '5h 39m 4s', 'remaining_time': '0s'}

Val: 100%|██████████| 7/7 [00:13<00:00, 1.90s/it]

[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-405
{'train_runtime': 20349.5179, 'train_samples_per_second': 0.642, 'train_steps_per_second': 0.02, 'train_loss': 0.63218051, 'epoch': 4.94, 'global_step/max_steps': '405/405', 'percentage': '100.00%', 'elapsed_time': '5h 39m 9s', 'remaining_time': '0s'}
Train: 100%|██████████| 405/405 [5:39:09<00:00, 50.25s/it]

[INFO:swift] last_model_checkpoint: /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-405

[INFO:swift] best_model_checkpoint: /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-400
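To check the convergence claim quantitatively, the metric dicts can be extracted from a captured console log with a small parser. This is a convenience sketch only; the metric lines are the dict-style entries shown above, and the file name train.log is hypothetical:

python
import ast
import re

# Pull train/eval loss values out of a captured ms-swift console log (sketch).
METRIC_RE = re.compile(r"\{[^{}]*'(?:eval_)?loss'[^{}]*\}")

def extract_losses(log_text: str):
    train, evals = [], []
    for blob in METRIC_RE.findall(log_text):
        d = ast.literal_eval(blob)            # each metric line is a Python-dict-like blob
        if "eval_loss" in d:
            evals.append((d["epoch"], d["eval_loss"]))
        else:
            train.append((d["epoch"], d["loss"]))
    return train, evals

with open("train.log", encoding="utf-8") as f:   # hypothetical path to the captured log
    train_losses, eval_losses = extract_losses(f.read())

print("first/last train loss:", train_losses[0], train_losses[-1])
print("eval loss by epoch:", eval_losses)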

4. Model Deployment and Inference

We deploy across multiple GPUs and expose the service on a custom port (a quick endpoint sanity check follows the command):

bash
RAY_memory_monitor_refresh_ms=0 \
CUDA_VISIBLE_DEVICES=0,1 swift deploy \
    --ckpt_dir /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-400 \
    --infer_backend vllm \
    --max_new_tokens 2048 \
    --tensor_parallel_size 2 \
    --port 8011
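swift deploy exposes an OpenAI-compatible API, so once the service is up it can be sanity-checked by listing the served models. A minimal sketch (replace the ip placeholder with the server address, as in the inference script below):

python
from openai import OpenAI

# List the models served by the swift deploy endpoint (OpenAI-compatible API).
client = OpenAI(api_key="EMPTY", base_url="http://ip:8011/v1")  # "ip" is a placeholder
for model in client.models.list().data:
    print(model.id)   # should include the deployed QwQ-32B checkpoint's model name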

Inference script:

python
from openai import OpenAI


openai_api_key = "EMPTY"
openai_api_base = "http://ip:8011/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="QwQ-32B",
    messages=[{"role": "system", "content": "你是一款客户机器人,帮助客户解决问题"}, 
              {"role": "user", "content": "问一下这款手机现在附带什么配件"}, 
              {"role": "assistant", "content": "附件内容:锂离子电池组 NP-FW50,电源适配器AC-UUD12 ,Micro USB 连接线,肩带,镜头盖,热靴盖,遮光罩,使用说明书,保修卡"}, 
              {"role": "user", "content": "售后和质保是什么标准"}
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=2048,
    extra_body={
        "repetition_penalty": 1.05,
    },
)
print("Chat response:", chat_response)

5. References

【1】https://github.com/modelscope/ms-swift

【2】Inference and deployment
