【AI】Ubuntu 22.04: Evaluating Qwen3-4B-FP8 with evalscope

Install evalscope

mkdir evalscope
cd evalscope/
python3 -m venv venv
source venv/bin/activate
pip install 'evalscope[app,perf]' -U -i https://mirrors.aliyun.com/pypi/simple --trusted-host mirrors.aliyun.com

pip install tiktoken omegaconf -i https://mirrors.aliyun.com/pypi/simple --trusted-host mirrors.aliyun.com

Download general-purpose benchmark datasets (not used in this run)

pip install -U modelscope

modelscope download --dataset modelscope/mmlu --local_dir /data/ai/evalscope_data/mmlu
modelscope download --dataset modelscope/gsm8k --local_dir /data/ai/evalscope_data/gsm8k
modelscope download --dataset modelscope/human_eval --local_dir /data/ai/evalscope_data/human_eval

Deploy Qwen3-4B-FP8 locally

modelscope download --model Qwen/Qwen3-4B-FP8
vllm serve /home/yeqiang/.cache/modelscope/hub/models/Qwen/Qwen3-4B-FP8 --served-model-name Qwen3-4B-FP8 --port 8000 --dtype auto --gpu-memory-utilization 0.8 --max-model-len 40960 --tensor-parallel-size 1
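Once the server is up, it is worth sanity-checking the OpenAI-compatible endpoint before launching the full evaluation. A minimal sketch, assuming the vLLM server above is listening on localhost:8000 (the payload shape mirrors what evalscope sends to `/v1/chat/completions`; the prompt is arbitrary):

```python
import json
import urllib.request

def build_chat_payload(model: str, prompt: str, **gen_kwargs) -> dict:
    """Build an OpenAI-compatible chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **gen_kwargs,
    }

payload = build_chat_payload(
    "Qwen3-4B-FP8", "Hello", temperature=0.6, top_p=0.95, max_tokens=64
)
print(json.dumps(payload, indent=2))

# Uncomment to hit the local vLLM server started above:
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```

If the request returns a normal chat completion, the `--served-model-name` and port match what the evaluation script expects.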

Write the evaluation script (using the EvalScope-Qwen3-Test dataset)

Note: Python 3.9.9 is not supported.

eval_qwen3_mmlu.py (the name is misleading: an MMLU-only run was originally planned; ignore the mismatch for now)

from evalscope import TaskConfig, run_task
task_cfg = TaskConfig(
    model='Qwen3-4B-FP8',
    api_url='http://localhost:8000/v1/chat/completions',
    eval_type='service',
    datasets=[
        'data_collection',
    ],
    dataset_args={
        'data_collection': {
            'dataset_id': 'modelscope/EvalScope-Qwen3-Test',
            'filters': {'remove_until': '</think>'}  # strip the model's reasoning (think) content
        }
    },
    eval_batch_size=128,
    generation_config={
        'max_tokens': 30000,  # max tokens to generate; set high to avoid truncated output
        'temperature': 0.6,  # sampling temperature (recommended in the Qwen3 report)
        'top_p': 0.95,  # top-p sampling (recommended in the Qwen3 report)
        'top_k': 20,  # top-k sampling (recommended in the Qwen3 report)
        'n': 1,  # number of completions per request
    },
    timeout=60000,  # request timeout
    stream=True,  # use streaming output
    limit=100,  # evaluate only 100 samples as a quick test
)

run_task(task_cfg=task_cfg)

Run the evaluation

(venv) yeqiang@yeqiang-Default-string:/data/ai/evalscope$ python eval_qwen3_mmlu.py 
2025-05-06 22:44:04,363 - evalscope - INFO - Args: Task config is provided with TaskConfig type.
ANTLR runtime and generated code versions disagree: 4.9.3!=4.7.2
ANTLR runtime and generated code versions disagree: 4.9.3!=4.7.2
ANTLR runtime and generated code versions disagree: 4.9.3!=4.7.2
ANTLR runtime and generated code versions disagree: 4.9.3!=4.7.2
2025-05-06 22:44:06,473 - evalscope - INFO - Loading dataset from modelscope: > dataset_name: modelscope/EvalScope-Qwen3-Test
Downloading Dataset to directory: /home/yeqiang/.cache/modelscope/hub/datasets/modelscope/EvalScope-Qwen3-Test
2025-05-06 22:44:08,753 - evalscope - INFO - Dump task config to ./outputs/20250506_224404/configs/task_config_7d0e13.yaml
2025-05-06 22:44:08,755 - evalscope - INFO - {
    "model": "Qwen3-4B-FP8",
    "model_id": "Qwen3-4B-FP8",
    "model_args": {
        "revision": "master",
        "precision": "torch.float16"
    },
    "model_task": "text_generation",
    "template_type": null,
    "chat_template": null,
    "datasets": [
        "data_collection"
    ],
    "dataset_args": {
        "data_collection": {
            "dataset_id": "modelscope/EvalScope-Qwen3-Test",
            "filters": {
                "remove_until": "</think>"
            }
        }
    },
    "dataset_dir": "/home/yeqiang/.cache/modelscope/hub/datasets",
    "dataset_hub": "modelscope",
    "generation_config": {
        "max_tokens": 30000,
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "n": 1
    },
    "eval_type": "service",
    "eval_backend": "Native",
    "eval_config": null,
    "stage": "all",
    "limit": 100,
    "eval_batch_size": 128,
    "mem_cache": false,
    "use_cache": null,
    "work_dir": "./outputs/20250506_224404",
    "outputs": null,
    "debug": false,
    "dry_run": false,
    "seed": 42,
    "api_url": "http://localhost:8000/v1/chat/completions",
    "api_key": "EMPTY",
    "timeout": 60000,
    "stream": true,
    "judge_strategy": "auto",
    "judge_worker_num": 1,
    "judge_model_args": {}
}
   
Getting answers:  33%|█████████████████████████████████████████████████████████████████████| 33/100 [02:28<07:44,  6.93s/it]

nvidia-smi

vLLM service status

Report

2025-05-06 23:13:37,099 - evalscope - INFO - subset_level Report:
+-------------+-------------------------+-----------------+----------------------------+---------------+-------+
|  task_type  |         metric          |  dataset_name   |        subset_name         | average_score | count |
+-------------+-------------------------+-----------------+----------------------------+---------------+-------+
|    exam     |     AverageAccuracy     |    mmlu_pro     |           health           |    0.6667     |   9   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |            math            |      1.0      |   7   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |        engineering         |    0.6667     |   6   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |         chemistry          |      0.5      |   6   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |         psychology         |    0.6667     |   6   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |          biology           |      0.8      |   5   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |            law             |      0.2      |   5   |
| instruction | prompt_level_strict_acc |     ifeval      |          default           |     0.75      |   4   |
| instruction |  inst_level_strict_acc  |     ifeval      |          default           |     0.75      |   4   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |          physics           |     0.75      |   4   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |           other            |      0.5      |   4   |
| instruction | prompt_level_loose_acc  |     ifeval      |          default           |      1.0      |   4   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |      computer science      |      1.0      |   4   |
| instruction |  inst_level_loose_acc   |     ifeval      |          default           |      1.0      |   4   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |          business          |    0.6667     |   3   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |          history           |    0.6667     |   3   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |         philosophy         |    0.6667     |   3   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |         prehistory         |      1.0      |   2   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |         economics          |      0.5      |   2   |
|    exam     |     AverageAccuracy     |      ceval      |     education_science      |      1.0      |   1   |
|    exam     |     AverageAccuracy     |      ceval      |            law             |      0.0      |   1   |
|    exam     |     AverageAccuracy     |      ceval      |       tax_accountant       |      0.0      |   1   |
|    exam     |     AverageAccuracy     |      iquiz      |             EQ             |      1.0      |   1   |
|    exam     |     AverageAccuracy     |      ceval      |    high_school_biology     |      1.0      |   1   |
|    code     |         Pass@1          | live_code_bench |           v5_v6            |      0.0      |   1   |
|    exam     |     AverageAccuracy     |      ceval      |       basic_medicine       |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |          anatomy           |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |  college_computer_science  |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |    college_mathematics     |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |      abstract_algebra      |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |  high_school_mathematics   |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    | high_school_macroeconomics |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |   high_school_chemistry    |      0.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |    high_school_biology     |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |     conceptual_physics     |      0.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    | high_school_world_history  |      0.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |       miscellaneous        |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |      medical_genetics      |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |          virology          |      0.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |      security_studies      |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |   professional_medicine    |      0.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |      moral_scenarios       |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |      world_religions       |      0.0      |   1   |
|  knowledge  |      AveragePass@1      |      gpqa       |        gpqa_diamond        |      1.0      |   1   |
|    math     |      AveragePass@1      |    math_500     |          Level 3           |      1.0      |   1   |
|    math     |      AveragePass@1      |    math_500     |          Level 5           |      1.0      |   1   |
+-------------+-------------------------+-----------------+----------------------------+---------------+-------+
2025-05-06 23:13:37,099 - evalscope - INFO - dataset_level Report:
+-------------+-------------------------+-----------------+---------------+-------+
|  task_type  |         metric          |  dataset_name   | average_score | count |
+-------------+-------------------------+-----------------+---------------+-------+
|    exam     |     AverageAccuracy     |    mmlu_pro     |    0.6716     |  67   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |    0.6842     |  19   |
|    exam     |     AverageAccuracy     |      ceval      |      0.6      |   5   |
| instruction | prompt_level_loose_acc  |     ifeval      |      1.0      |   4   |
| instruction | prompt_level_strict_acc |     ifeval      |     0.75      |   4   |
| instruction |  inst_level_loose_acc   |     ifeval      |      1.0      |   4   |
| instruction |  inst_level_strict_acc  |     ifeval      |     0.75      |   4   |
|    math     |      AveragePass@1      |    math_500     |      1.0      |   2   |
|    code     |         Pass@1          | live_code_bench |      0.0      |   1   |
|    exam     |     AverageAccuracy     |      iquiz      |      1.0      |   1   |
|  knowledge  |      AveragePass@1      |      gpqa       |      1.0      |   1   |
+-------------+-------------------------+-----------------+---------------+-------+
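The dataset-level scores are count-weighted averages of the subset rows. A quick sanity check for mmlu_pro, with the subset scores and counts copied from the subset_level report above (integer correct-counts are recovered by rounding score × count, since the printed scores are rounded to four decimals):

```python
# (average_score, count) per mmlu_pro subset, from the subset_level report
subsets = [
    (0.6667, 9), (1.0, 7), (0.6667, 6), (0.5, 6), (0.6667, 6),
    (0.8, 5), (0.2, 5), (0.75, 4), (0.5, 4), (1.0, 4),
    (0.6667, 3), (0.6667, 3), (0.6667, 3), (0.5, 2),
]
# Recover integer correct-counts, then aggregate.
correct = sum(round(score * count) for score, count in subsets)
total = sum(count for _, count in subsets)
print(correct, total, round(correct / total, 4))  # → 45 67 0.6716
```

The result matches the mmlu_pro row in the dataset_level report (0.6716 over 67 samples); the task-level and tag-level tables are built the same way over larger groupings.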
2025-05-06 23:13:37,099 - evalscope - INFO - task_level Report:
+-------------+-------------------------+---------------+-------+
|  task_type  |         metric          | average_score | count |
+-------------+-------------------------+---------------+-------+
|    exam     |     AverageAccuracy     |    0.6739     |  92   |
| instruction |  inst_level_loose_acc   |      1.0      |   4   |
| instruction |  inst_level_strict_acc  |     0.75      |   4   |
| instruction | prompt_level_loose_acc  |      1.0      |   4   |
| instruction | prompt_level_strict_acc |     0.75      |   4   |
|    math     |      AveragePass@1      |      1.0      |   2   |
|    code     |         Pass@1          |      0.0      |   1   |
|  knowledge  |      AveragePass@1      |      1.0      |   1   |
+-------------+-------------------------+---------------+-------+
2025-05-06 23:13:37,100 - evalscope - INFO - tag_level Report:
+------+-------------------------+---------------+-------+
| tags |         metric          | average_score | count |
+------+-------------------------+---------------+-------+
|  en  |     AverageAccuracy     |    0.6744     |  86   |
|  zh  |     AverageAccuracy     |    0.6667     |   6   |
|  en  |  inst_level_strict_acc  |     0.75      |   4   |
|  en  |  inst_level_loose_acc   |      1.0      |   4   |
|  en  | prompt_level_loose_acc  |      1.0      |   4   |
|  en  | prompt_level_strict_acc |     0.75      |   4   |
|  en  |      AveragePass@1      |      1.0      |   3   |
|  en  |         Pass@1          |      0.0      |   1   |
+------+-------------------------+---------------+-------+
2025-05-06 23:13:37,100 - evalscope - INFO - category_level Report:
+-----------+--------------+-------------------------+---------------+-------+
| category0 |  category1   |         metric          | average_score | count |
+-----------+--------------+-------------------------+---------------+-------+
|   Qwen3   |   English    |     AverageAccuracy     |    0.6744     |  86   |
|   Qwen3   |   Chinese    |     AverageAccuracy     |    0.6667     |   6   |
|   Qwen3   |   English    |  inst_level_loose_acc   |      1.0      |   4   |
|   Qwen3   |   English    |  inst_level_strict_acc  |     0.75      |   4   |
|   Qwen3   |   English    | prompt_level_strict_acc |     0.75      |   4   |
|   Qwen3   |   English    | prompt_level_loose_acc  |      1.0      |   4   |
|   Qwen3   | Math&Science |      AveragePass@1      |      1.0      |   3   |
|   Qwen3   |     Code     |         Pass@1          |      0.0      |   1   |
+-----------+--------------+-------------------------+---------------+-------+

References

https://evalscope.readthedocs.io/zh-cn/latest/best_practice/qwen3.html
