[Training and Fine-Tuning Series 07] Training Monitoring and Model Evaluation: From Experiment Management to Benchmarks in Practice


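In 2026, large-model evaluation has entered an era of "full-dimension standardization": from W&B experiment tracking, to 16 core benchmarks, to Multi-Score multi-dimensional quality scoring, systematic evaluation is a key factor in whether a model succeeds.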

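📑 Contents

I. Why Systematic Evaluation Is Needed

II. Building a Training Monitoring System

III. The 2026 Core Benchmark Landscape

IV. Automated Evaluation Pipeline

V. A Multi-Dimensional Framework for Model Quality

VI. Hands-On: A Complete Evaluation Script

VII. Measured Comparison of Mainstream Models, June 2026

Interview Bonus Points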

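I. Why Systematic Evaluation Is Needed

1.1 "No evaluation, no optimization"

A common mistake in large-model training is to assume the model is improving simply because the training loss is falling. In practice:

Falling loss ≠ better capability: the model may be overfitting the training data, so loss drops while generalization gets worse.

Single-metric bias: watching only MMLU can produce a badly lopsided model.

Training without monitoring is blind: gradient explosions, loss oscillation, and catastrophic forgetting go unnoticed until it is too late to react.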
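The first point can be checked mechanically. The sketch below is a minimal illustration, not part of any library: the function name `first_overfit_step` and the `patience` parameter are assumptions made for this example. It flags the first evaluation point at which training loss keeps falling while held-out loss has risen several times in a row.

```python
from typing import List, Optional


def first_overfit_step(train_loss: List[float],
                       eval_loss: List[float],
                       patience: int = 2) -> Optional[int]:
    """Return the index at which eval loss has risen `patience` times in a
    row while train loss kept falling, or None if that never happens."""
    streak = 0
    for i in range(1, min(len(train_loss), len(eval_loss))):
        train_down = train_loss[i] < train_loss[i - 1]
        eval_up = eval_loss[i] > eval_loss[i - 1]
        # Count consecutive "train improves, eval degrades" points
        streak = streak + 1 if (train_down and eval_up) else 0
        if streak >= patience:
            return i
    return None


# Hypothetical loss values recorded at the same evaluation points
train = [2.0, 1.6, 1.3, 1.1, 0.9, 0.8]
evals = [2.1, 1.8, 1.6, 1.7, 1.8, 1.9]
print(first_overfit_step(train, evals))  # prints 4
```

A check like this is only a coarse signal; a noisy eval set may need a larger `patience` or a smoothed curve.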

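1.2 The three-layer architecture of evaluation in 2026

Layer 1: Training monitoring (real time)
├─ Loss curve / gradient norm / learning rate / throughput
├─ W&B / MLflow / TensorBoard
└─ Hardware monitoring (GPU utilization, GPU memory, temperature)

Layer 2: Periodic evaluation (daily / per epoch)
├─ General benchmarks (MMLU-Pro, GPQA)
├─ Code benchmarks (HumanEval+, LiveCodeBench)
└─ Chinese benchmarks (SuperCLUE, C-Eval)

Layer 3: Final evaluation (before model release)
├─ Full benchmark suite (16+)
├─ Human evaluation / red teaming
└─ Deployment performance (latency, throughput, GPU memory)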

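II. Building a Training Monitoring System

2.1 Weights & Biases (W&B) in practice

W&B is one of the most widely used experiment-management platforms for large-model training in 2026. It tracks loss, hardware metrics, model parameters, and evaluation results in real time: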

import GPUtil
import psutil
import wandb


class ExperimentTracker:
    """Training experiment manager built on W&B."""

    def __init__(self, project_name, config=None, resume=False):
        # System metrics (GPU, CPU, memory) are collected by W&B by default
        self.run = wandb.init(
            project=project_name,
            config=config,
            resume=resume,
        )
        self.config = config or {}
        self.global_step = 0
        self.best_metric = 0.0
        self.best_step = 0
        self._watching = False  # avoid calling wandb.watch twice

    def log_training(self, loss, lr, grad_norm, step=None):
        """Log training metrics together with a hardware snapshot."""
        # Test for None explicitly: `step or ...` would treat step=0 as missing
        self.global_step = self.global_step + 1 if step is None else step

        # GPU stats (skipped silently when no GPU is visible)
        gpu_stats = {}
        try:
            for i, gpu in enumerate(GPUtil.getGPUs()):
                gpu_stats.update({
                    f"gpu_{i}_util": gpu.load * 100,
                    f"gpu_{i}_mem": gpu.memoryUtil * 100,
                    f"gpu_{i}_temp": gpu.temperature,
                })
        except Exception:
            pass

        # CPU / RAM stats
        mem = psutil.virtual_memory()
        cpu_stats = {
            "cpu_percent": psutil.cpu_percent(),
            "memory_percent": mem.percent,
            "memory_used_gb": mem.used / 1e9,
        }

        wandb.log({
            "train/loss": loss,
            "train/lr": lr,
            "train/grad_norm": grad_norm,
            "train/step": self.global_step,
            **gpu_stats,
            **cpu_stats,
        }, step=self.global_step)

    def log_evaluation(self, metrics, prefix="eval"):
        """Log evaluation metrics and track the best score seen so far."""
        log_dict = {f"{prefix}/{name}": value for name, value in metrics.items()}

        # Record the best score and the step at which it occurred
        if "score" in metrics and metrics["score"] > self.best_metric:
            self.best_metric = metrics["score"]
            self.best_step = self.global_step
            self.run.summary["best_score"] = self.best_metric
            self.run.summary["best_step"] = self.best_step
            log_dict["best/score"] = self.best_metric

        wandb.log(log_dict, step=self.global_step)

    def log_model_graph(self, model, dummy_input=None):
        """Watch gradients and parameters, and log the parameter count."""
        if not self._watching:
            wandb.watch(model, log="all", log_freq=100)
            self._watching = True
        wandb.log({"model/params": sum(p.numel() for p in model.parameters())})

    def log_artifact(self, file_path, artifact_type="model"):
        """Log model weights, datasets, or other files as an artifact."""
        artifact = wandb.Artifact(
            name=f"model-{self.run.id}-{self.global_step}",
            type=artifact_type,
        )
        artifact.add_file(file_path)
        self.run.log_artifact(artifact)

    def finish(self):
        """End the run."""
        self.run.finish()

# Usage example
tracker = ExperimentTracker(
    project_name="qwen3-70b-sft",
    config={
        "model": "Qwen3-70B",
        "batch_size": 16,
        "lr": 2e-5,
        "optimizer": "AdamW",
        "warmup_steps": 500,
        "total_steps": 10000,
        "dataset": "dataflow_sft_v3",
        "precision": "bf16",
        "parallelism": "TP=8, PP=4, DP=8",
        "notes": "Stage 2 full parameter fine-tuning",
    },
)

# Inside the training loop. train_loader, train_step, scheduler, model,
# eval_loader, and evaluate are assumed to come from the training script;
# train_step is assumed to return the loss and the gradient norm.
for step, batch in enumerate(train_loader):
    loss, grad_norm = train_step(batch)
    tracker.log_training(
        loss=loss.item(),
        lr=scheduler.get_last_lr()[0],
        grad_norm=grad_norm,
        step=step,
    )

    if step % 1000 == 0:
        metrics = evaluate(model, eval_loader)
        tracker.log_evaluation(metrics)

tracker.finish()

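2.2 MLflow (self-hosted option)

For teams with strict data-security requirements, MLflow is the better choice: it can be self-hosted, it is open source and free, and it works with any ML framework: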

import mlflow
import mlflow.pytorch


class MLflowTracker:
    """MLflow-based experiment tracking, suited to self-hosting."""

    def __init__(self, tracking_uri="http://localhost:5000",
                 experiment_name="llm_training"):
        mlflow.set_tracking_uri(tracking_uri)
        mlflow.set_experiment(experiment_name)
        self.client = mlflow.tracking.MlflowClient()
        self.run = None

    def start_run(self, run_name=None, tags=None):
        """Start a run and return its ID."""
        self.run = mlflow.start_run(run_name=run_name)
        if tags:
            mlflow.set_tags(tags)
        return self.run.info.run_id

    def log_params(self, params_dict):
        """Log hyperparameters."""
        for key, value in params_dict.items():
            mlflow.log_param(key, value)

    def log_metrics(self, metrics_dict, step=None):
        """Log metrics."""
        for key, value in metrics_dict.items():
            mlflow.log_metric(key, value, step=step)

    def log_model(self, model, artifact_path="model"):
        """Log a PyTorch model."""
        mlflow.pytorch.log_model(model, artifact_path)

    def log_artifact(self, local_path):
        """Log a local file."""
        mlflow.log_artifact(local_path)

    def compare_runs(self, run_ids, metric="eval/score"):
        """Compare several runs, best score first."""
        runs = []
        for run_id in run_ids:
            run = self.client.get_run(run_id)
            runs.append({
                "run_id": run_id,
                "score": run.data.metrics.get(metric),
                **run.data.params,
            })
        # Runs that never logged the metric sort last instead of raising TypeError
        return sorted(
            runs,
            key=lambda x: x["score"] if x["score"] is not None else float("-inf"),
            reverse=True,
        )

    def end_run(self):
        mlflow.end_run()

2.3 Training hardware monitoring

import time

import GPUtil
import psutil


class HardwareMonitor:
    """Hardware monitor: GPU / CPU / memory / network."""

    def __init__(self, log_interval=60):  # intended sampling period, in seconds
        self.log_interval = log_interval
        self.start_time = time.time()
        self.history = []

    def snapshot(self):
        """Collect one hardware snapshot."""
        snapshot = {
            "timestamp": time.time() - self.start_time,
        }

        # GPU
        try:
            for i, gpu in enumerate(GPUtil.getGPUs()):
                snapshot.update({
                    f"gpu_{i}_load": f"{gpu.load*100:.1f}%",
                    f"gpu_{i}_mem": f"{gpu.memoryUtil*100:.1f}%",
                    f"gpu_{i}_mem_used": f"{gpu.memoryUsed}MB",
                    f"gpu_{i}_temp": f"{gpu.temperature}°C",
                })
        except Exception:
            snapshot["gpu"] = "N/A"

        # CPU / RAM
        snapshot["cpu"] = f"{psutil.cpu_percent():.1f}%"
        snapshot["ram"] = f"{psutil.virtual_memory().percent:.1f}%"

        # Network (cumulative since boot)
        net = psutil.net_io_counters()
        snapshot["net_sent"] = f"{net.bytes_sent/1e9:.2f}GB"
        snapshot["net_recv"] = f"{net.bytes_recv/1e9:.2f}GB"

        # Disk I/O (psutil returns None on systems without disk counters)
        disk = psutil.disk_io_counters()
        if disk is not None:
            snapshot["disk_read"] = f"{disk.read_bytes/1e9:.2f}GB"
            snapshot["disk_write"] = f"{disk.write_bytes/1e9:.2f}GB"

        self.history.append(snapshot)
        return snapshot

    def check_anomaly(self):
        """Check the latest snapshot for hardware anomalies."""
        last = self.history[-1] if self.history else {}
        warnings = []

        for key, val in last.items():
            if "temp" in key and isinstance(val, str):
                temp = float(val.replace("°C", ""))
                if temp > 85:
                    warnings.append(f"High temperature warning: {key}={temp}°C")

        return warnings

    def get_summary(self):
        """Build a hardware report for the training run."""
        if not self.history:
            return "No data yet"

        summary = []
        # Average, max, and min of each metric on GPU 0
        for metric in ["gpu_0_load", "gpu_0_mem", "gpu_0_temp"]:
            values = []
            for h in self.history:
                if metric in h:
                    try:
                        values.append(
                            float(h[metric].replace("%", "").replace("°C", ""))
                        )
                    except ValueError:
                        pass  # skip values that cannot be parsed
            if values:
                summary.append(f"{metric}: avg={sum(values)/len(values):.1f} "
                               f"max={max(values):.1f} min={min(values):.1f}")

        return "\n".join(summary)

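III. The 2026 Core Benchmark Landscape

3.1 Knowledge and understanding

| Benchmark | Full name | Questions | Coverage | 2026 SOTA | Difficulty |
| --- | --- | --- | --- | --- | --- |
| MMLU-Pro | Massive Multitask Language Understanding (Pro) | 12,032 | 14 disciplines | Qwen3.6-35B 84.9% | ⭐⭐⭐⭐ |
| GPQA Diamond | Graduate-Level Google-Proof Q&A (Diamond) | 198 | Biology / physics / chemistry | Qwen3.5-397B 88.4% | ⭐⭐⭐⭐⭐ |
| SimpleQA | Simple Question Answering | 4,326 | Factual knowledge | GPT-5.5 62.3% | ⭐⭐ |
| HLE | Humanity's Last Exam | 2,500 | Cross-disciplinary, frontier difficulty | Not listed (requires multi-step reasoning) | ⭐⭐⭐⭐⭐ |

Key differences between MMLU-Pro and MMLU:

Each question has 10 answer options instead of 4, which sharply lowers the chance of a lucky guess.

More questions require multi-step reasoning.

Easy common-knowledge questions were removed, raising overall difficulty by roughly 40%.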
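The effect of moving from 4 to 10 options is simple arithmetic, sketched below. The helper names are illustrative, and the chance-corrected rescaling is a common normalization shown here for intuition; it is not part of how MMLU-Pro itself is scored.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing."""
    return 1.0 / num_options


def chance_corrected(accuracy: float, num_options: int) -> float:
    """Rescale accuracy so that random guessing maps to 0 and perfect to 1."""
    chance = chance_accuracy(num_options)
    return (accuracy - chance) / (1.0 - chance)


# Guessing floor: 25% with 4 options, 10% with 10 options
print(chance_accuracy(4), chance_accuracy(10))

# The same raw 70% carries more signal on a 10-option test
print(round(chance_corrected(0.70, 4), 3))   # 0.6
print(round(chance_corrected(0.70, 10), 3))  # 0.667
```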

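3.2 Code and engineering

| Benchmark | What it tests | 2026 SOTA | Notes |
| --- | --- | --- | --- |
| LiveCodeBench | Live algorithm-contest problems (LeetCode style) | DeepSeek V4-Pro 93.5% | Clear first place in algorithmic coding |
| SWE-bench Verified | Fixing real GitHub issues | Claude Opus 4.7 87.6% | Engineering repair ability |
| SWE-bench Multilingual | Multi-language version | DeepSeek V4-Pro 78.8% | Python/Java/TS/Go/Rust |
| HumanEval+ | Function-level code generation | DeepSeek V4-Pro 92.7% | Includes edge-case tests |
| Terminal Bench 2.0 | Terminal commands / script writing | GPT-5.5 71.3% | New in 2026 |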
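Function-level code benchmarks such as HumanEval+ are usually reported as pass@k: the probability that at least one of k sampled solutions passes every test. A minimal sketch of the standard unbiased estimator, computed from n samples of which c passed (the function name `pass_at_k` is chosen for this example):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass
        return 1.0
    # 1 - P(all k drawn samples are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples for one problem, 3 of them correct
print(round(pass_at_k(n=10, c=3, k=1), 4))  # 0.3
print(round(pass_at_k(n=10, c=3, k=5), 4))  # 0.9167
```

The benchmark score is the mean of this value over all problems. Sampling n larger than k and using the estimator gives a lower-variance number than drawing exactly k samples once.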

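Example SWE-bench Verified task:

# Task: fix a URL-matching bug in the Django project
# Issue: when the URL contains non-ASCII characters, resolve() raises UnicodeDecodeError

# Original code (buggy)
def resolve(self, path, urlconf=None):
    path = path.split('/')  # non-ASCII paths (e.g. Chinese) get mis-encoded
    ...

# Code after the model's fix
def resolve(self, path, urlconf=None):
    path = (
        path.encode('utf-8', errors='surrogateescape')
        .decode('utf-8', errors='replace')
        .split('/')
    )
    ...

# SWE-bench verification passed: the patch applies cleanly and all tests pass ✓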

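3.3 Mathematical reasoning

| Benchmark | Questions | Content | 2026 SOTA |
| --- | --- | --- | --- |
| AIME 2024/2025 | 30 per year | American Invitational Mathematics Examination | DeepSeek V4-Pro 86.7% |
| GSM8K | 8,500 | Grade-school math word problems | Qwen3.5-397B 96.8% |
| MATH-500 | 500 | High-school competition math | Claude Opus 4.7 94.2% |
| IMO-Answer-Bench | 50 | International Olympiad level | Gemini 3.0 52.3% |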

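3.4 Chinese-language benchmarks

| Benchmark | Description | June 2026 leaderboard SOTA | Note |
| --- | --- | --- | --- |
| SuperCLUE | Overall Chinese ability | DeepSeek V4-Pro 70.48 | #1 among open models |
| SuperCLUE Math Reasoning | Chinese math | DeepSeek V4-Flash 82.69 | #1 among open models |
| SuperCLUE Code Generation | Chinese code | Kimi K2.6 75.79 | Highest of all models |
| SuperCLUE Agent | Agent ability | Kimi K2.6 80.95 | Highest of all models |
| C-Eval | Chinese multi-discipline | Qwen3.5-397B 92.5% | - |
| CMMLU | Chinese knowledge | DeepSeek V4-Pro 91.3% | - |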

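3.5 Agent and multimodal

| Benchmark | What it tests | 2026 SOTA |
| --- | --- | --- |
| τ²-Bench | Tool calling and agent planning | GPT-5.5 68.4% |
| BrowseComp | Browser interaction and understanding | Gemini 3.0 71.5% |
| MMMU | Multimodal understanding | InternVL3-78B 72.2 |
| MMMU Pro | Harder multimodal variant | Gemini 3.0 Pro 76.1% |
| Arena-Hard | Human-preference ranking | Claude Opus 4.7 Elo 1423 |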

IV. Automated Evaluation Pipeline

4.1 Complete evaluation framework

import json
import os
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BenchmarkConfig:
    """Configuration for a single benchmark."""
    name: str
    script: str
    metrics: List[str]       # the first entry is treated as the primary metric
    timeout: int = 3600      # one-hour timeout
    requires_gpu: bool = True
    batch_size: int = 1
    num_fewshot: int = 5


# The 16 core benchmarks for 2026
BENCHMARKS = [
    BenchmarkConfig("mmlu_pro", "evaluate_mmlu_pro.py",
                    ["accuracy", "macro_avg"], num_fewshot=5),
    BenchmarkConfig("gpqa_diamond", "evaluate_gpqa.py",
                    ["accuracy", "pass_rate"], num_fewshot=0),
    BenchmarkConfig("live_code_bench", "evaluate_lcb.py",
                    ["pass@1", "pass@5"]),
    BenchmarkConfig("swe_bench", "evaluate_swe.py",
                    ["resolve_rate", "apply_rate"],
                    timeout=7200),
    BenchmarkConfig("humaneval_plus", "evaluate_he_plus.py",
                    ["pass@1", "pass@10"]),
    BenchmarkConfig("simple_qa", "evaluate_simpleqa.py",
                    ["accuracy", "precision"]),
    BenchmarkConfig("aime_2025", "evaluate_aime.py",
                    ["accuracy"], num_fewshot=0),
    BenchmarkConfig("gsm8k", "evaluate_gsm8k.py",
                    ["accuracy"], num_fewshot=8),
    BenchmarkConfig("superclue", "evaluate_superclue.py",
                    ["total_score", "math", "code", "agent"]),
    BenchmarkConfig("mmmu", "evaluate_mmmu.py",
                    ["accuracy", "macro_avg"]),
    BenchmarkConfig("tau_bench", "evaluate_tau_bench.py",
                    ["success_rate", "tool_acc"]),
    BenchmarkConfig("browsec_comp", "evaluate_browsec.py",
                    ["accuracy", "f1"]),
    BenchmarkConfig("arena_hard", "evaluate_arena_hard.py",
                    ["elo_score", "win_rate"]),
    BenchmarkConfig("ceval", "evaluate_ceval.py",
                    ["accuracy", "macro_avg"]),
    BenchmarkConfig("terminal_bench", "evaluate_terminal.py",
                    ["pass@1", "pass@5"]),
    BenchmarkConfig("safety_eval", "evaluate_safety.py",
                    ["safe_rate", "refusal_rate"]),
]

# Name -> config lookup used when building the report
BENCHMARKS_CONFIG = {b.name: b for b in BENCHMARKS}

import os  # used for directory creation and result-file lookup


class BenchmarkSuite:
    """Automated benchmark suite with mixed serial and parallel scheduling."""

    def __init__(self, model_path, output_dir="./eval_results",
                 model_type="vllm", tensor_parallel=8,
                 num_workers=4):
        self.model_path = model_path
        self.output_dir = output_dir
        self.model_type = model_type
        self.tensor_parallel = tensor_parallel
        self.num_workers = num_workers
        self.results = {}
        os.makedirs(output_dir, exist_ok=True)

    def run_single_benchmark(self, config: BenchmarkConfig) -> Dict:
        """Run one benchmark script as a subprocess."""
        print(f"[{config.name}] Starting evaluation...")
        start_time = time.time()

        # Build the command as an argument list so no shell quoting is needed
        cmd = [
            "python", config.script,
            "--model", str(self.model_path),
            "--model-type", self.model_type,
            "--tensor-parallel", str(self.tensor_parallel),
            "--batch-size", str(config.batch_size),
            "--num-fewshot", str(config.num_fewshot),
            "--output-dir", self.output_dir,
        ]

        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True,
                timeout=config.timeout,
            )
        except subprocess.TimeoutExpired:
            print(f"[{config.name}] Timeout after {config.timeout}s")
            return {"status": "timeout", "elapsed_seconds": config.timeout}

        elapsed = time.time() - start_time
        if result.returncode != 0:
            print(f"[{config.name}] Failed: {result.stderr[-200:]}")
            return {
                "status": "failed",
                "error": result.stderr[-500:],
                "elapsed_minutes": elapsed / 60,
            }

        # Parse the JSON results written by the benchmark script
        output_file = os.path.join(self.output_dir,
                                   f"{config.name}_results.json")
        if os.path.exists(output_file):
            with open(output_file) as f:
                metrics = json.load(f)
        else:
            metrics = {"raw_output": result.stdout[-1000:]}

        metrics["elapsed_minutes"] = elapsed / 60
        metrics["status"] = "completed"
        print(f"[{config.name}] Completed in {elapsed/60:.1f}m ✓")
        return metrics

    def run_all(self, parallel_benchmarks=None):
        """Run every benchmark and return the final report.

        Args:
            parallel_benchmarks: names of benchmarks that may run in
                parallel; defaults to the lightweight ones.
        """
        if parallel_benchmarks is None:
            # Default: lightweight benchmarks that run as independent model calls
            parallel_benchmarks = [
                "humaneval_plus", "gsm8k", "simple_qa", "ceval"
            ]

        sequential = [b for b in BENCHMARKS
                      if b.name not in parallel_benchmarks]

        # Run the non-parallel benchmarks first, one at a time
        for config in sequential:
            self.results[config.name] = self.run_single_benchmark(config)
            self.save_intermediate()

        # Then run the independent benchmarks in parallel
        parallel_configs = [b for b in BENCHMARKS
                            if b.name in parallel_benchmarks]

        with ThreadPoolExecutor(max_workers=self.num_workers) as executor:
            futures = {
                executor.submit(self.run_single_benchmark, config): config.name
                for config in parallel_configs
            }
            for future in as_completed(futures):
                name = futures[future]
                self.results[name] = future.result()
                self.save_intermediate()

        # Build and return the final report (contains "results" and "summary")
        return self.generate_report()

    def save_intermediate(self):
        """Save intermediate results so a crash does not lose them."""
        temp_file = os.path.join(self.output_dir, ".intermediate_results.json")
        with open(temp_file, "w") as f:
            json.dump(self.results, f, indent=2, ensure_ascii=False)

    def generate_report(self):
        """Generate the final evaluation report."""
        report = {
            "model": self.model_path,
            "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
            "results": {},
            "summary": {},
        }

        # Per-category score buckets
        categories = {
            "knowledge": [], "code": [], "math": [], "chinese": [],
            "agent": [], "multimodal": [], "safety": [],
        }

        category_map = {
            "mmlu_pro": "knowledge", "gpqa_diamond": "knowledge",
            "simple_qa": "knowledge", "hle": "knowledge",
            "live_code_bench": "code", "swe_bench": "code",
            "humaneval_plus": "code", "terminal_bench": "code",
            "aime_2025": "math", "gsm8k": "math", "math_500": "math",
            "superclue": "chinese", "ceval": "chinese",
            "tau_bench": "agent", "browsec_comp": "agent",
            "mmmu": "multimodal",
            "safety_eval": "safety",
        }

        configs = {b.name: b for b in BENCHMARKS}

        for name, result in self.results.items():
            if result.get("status") == "completed":
                # The first configured metric is the primary score
                primary_metric = result.get(configs[name].metrics[0], "N/A")
                report["results"][name] = {
                    "metrics": result,
                    "primary_score": primary_metric,
                }

                # Unmapped benchmarks (e.g. arena_hard) fall into "other"
                cat = category_map.get(name, "other")
                if isinstance(primary_metric, (int, float)):
                    categories.setdefault(cat, []).append(primary_metric)

        # Average score per category (empty categories are omitted)
        report["summary"]["category_scores"] = {
            cat: sum(scores) / len(scores)
            for cat, scores in categories.items() if scores
        }

        # Overall score: plain mean of all primary metrics, which assumes
        # they share one scale (an Elo score in "other" would skew it)
        all_scores = []
        for scores in categories.values():
            all_scores.extend(scores)
        report["summary"]["total_score"] = (
            sum(all_scores) / len(all_scores) if all_scores else 0
        )

        report_path = os.path.join(self.output_dir, "evaluation_report.json")
        with open(report_path, "w") as f:
            json.dump(report, f, indent=2, ensure_ascii=False)

        print(f"\n{'='*50}")
        print(f"Evaluation report written to: {report_path}")
        print(f"Overall score: {report['summary']['total_score']:.2f}")
        print(f"{'='*50}")

        return report

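4.2 Execution strategy per benchmark

| Benchmark | Est. time (7B) | Est. time (70B) | GPUs required | Scheduling |
| --- | --- | --- | --- | --- |
| MMLU-Pro | 30min | 2h | 1×H100 | Serial |
| GPQA Diamond | 10min | 40min | 1×H100 | Serial |
| HumanEval+ | 15min | 1h | 1×H100 | Parallelizable |
| LiveCodeBench | 1h | 4h | 2×H100 | Serial |
| SWE-bench | 2h | 8h | 4×H100 | Serial |
| GSM8K | 10min | 40min | 1×H100 | Parallelizable |
| AIME | 15min | 1h | 1×H100 | Serial |
| SuperCLUE | 1h | 3h | 2×H100 | Serial |
| SimpleQA | 5min | 20min | 1×H100 | Parallelizable |
| Total | ~6h | ~24h | - | - |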
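To see what the scheduling column buys, the sketch below adds up the 7B estimates from the table. It assumes the parallelizable benchmarks all start together with enough GPUs to run at once, so that group costs only as long as its slowest member; the names `ESTIMATES_7B`, `total_minutes`, and `wall_clock_minutes` are made up for this example.

```python
# Estimated minutes for a 7B model, copied from the table above
ESTIMATES_7B = {
    "MMLU-Pro": (30, "serial"),
    "GPQA Diamond": (10, "serial"),
    "HumanEval+": (15, "parallel"),
    "LiveCodeBench": (60, "serial"),
    "SWE-bench": (120, "serial"),
    "GSM8K": (10, "parallel"),
    "AIME": (15, "serial"),
    "SuperCLUE": (60, "serial"),
    "SimpleQA": (5, "parallel"),
}


def total_minutes(estimates):
    """Cost if every benchmark runs one after another."""
    return sum(minutes for minutes, _ in estimates.values())


def wall_clock_minutes(estimates):
    """Serial benchmarks add up; the parallel group costs its slowest member."""
    serial = sum(m for m, mode in estimates.values() if mode == "serial")
    parallel = [m for m, mode in estimates.values() if mode == "parallel"]
    return serial + (max(parallel) if parallel else 0)


print(total_minutes(ESTIMATES_7B))       # 325
print(wall_clock_minutes(ESTIMATES_7B))  # 310
```

With these numbers the parallel group saves only 15 minutes out of 325, because the long-running benchmarks are all serial. Shortening the run meaningfully would mean parallelizing or sharding the heavy ones.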

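4.3 Automatic result reporting

class EvaluationReporter:
    """Push evaluation results to W&B or MLflow."""

    def __init__(self, tracker, model_name, run_id,
                 output_dir="./eval_results"):
        self.tracker = tracker
        self.model_name = model_name
        self.run_id = run_id
        self.output_dir = output_dir  # where evaluation_report.json lives

    def report_results(self, results):
        """Send a report (with "results" and "summary" keys) to the tracker."""
        # Flatten per-benchmark metrics into one dict
        flat_metrics = {}
        for bench_name, bench_result in results.get("results", {}).items():
            metrics = bench_result.get("metrics", {})
            for metric_name, value in metrics.items():
                if isinstance(value, (int, float)):
                    # MLflow rejects "@" in metric names, so pass@1 -> pass_at_1
                    safe_name = metric_name.replace("@", "_at_")
                    flat_metrics[f"{bench_name}/{safe_name}"] = value

        # Add the summary
        summary = results.get("summary", {})
        if "total_score" in summary:
            flat_metrics["overall/total_score"] = summary["total_score"]

        for cat, score in summary.get("category_scores", {}).items():
            flat_metrics[f"overall/{cat}_avg"] = score

        # Upload: MLflowTracker exposes log_metrics, ExperimentTracker
        # (W&B) exposes log_evaluation
        if hasattr(self.tracker, "log_metrics"):
            self.tracker.log_metrics(flat_metrics)
        else:
            self.tracker.log_evaluation(flat_metrics, prefix="benchmark")
        self.tracker.log_artifact(
            f"{self.output_dir}/evaluation_report.json"
        )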

V. A Multi-Dimensional Framework for Model Quality

Beyond standard benchmarks, mainstream teams in 2026 also use a multi-dimensional quality scoring system:

5.1 Multi-Score quality scoring

from typing import Dict


class QualityScoring:
    """Multi-dimensional quality scoring framework.

    The check_*, estimate_uncertainty, and is_harmful_request helpers are
    not defined here; they are hooks to be implemented on top of the judge
    model named in eval_model.
    """

    DIMENSIONS = {
        "helpfulness": {
            "weight": 0.35,
            "tests": ["instruction_following", "completeness", "clarity"]
        },
        "honesty": {
            "weight": 0.30,
            "tests": ["factual_accuracy", "uncertainty_calibration",
                      "hallucination_rate"]
        },
        "harmlessness": {
            "weight": 0.20,
            "tests": ["toxicity", "bias", "safety_refusal"]
        },
        "efficiency": {
            "weight": 0.10,
            "tests": ["response_length", "latency", "token_efficiency"]
        },
        "formatting": {
            "weight": 0.05,
            "tests": ["code_format", "markdown_quality", "structure"]
        }
    }

    def __init__(self, eval_model="gpt-5.5-judge"):
        self.eval_model = eval_model
        self.results = {}

    def evaluate_helpfulness(self, prompts, responses):
        """Score helpfulness."""
        scores = []
        for prompt, response in zip(prompts, responses):
            # How well the response follows the instructions
            follows_instructions = self.check_instruction_following(
                prompt, response
            )
            # How complete the response is
            completeness = self.check_completeness(prompt, response)
            scores.append(0.5 * follows_instructions + 0.5 * completeness)
        return sum(scores) / len(scores) if scores else 0.0

    def evaluate_honesty(self, questions, answers,
                         ground_truths=None):
        """Score honesty (the inverse of hallucination rate)."""
        if not answers:
            return 0.0
        if ground_truths:
            # Exact evaluation when ground truth is available
            correct = sum(
                1 for a, g in zip(answers, ground_truths)
                if self.check_factual_match(a, g)
            )
            return correct / len(answers)

        # Without ground truth, fall back to model self-assessment
        scores = []
        for q, a in zip(questions, answers):
            confidence = self.estimate_uncertainty(a)
            # When uncertain, the model should decline to answer
            if confidence < 0.3:
                refuses = self.check_refusal(a)
                scores.append(1.0 if refuses else 0.5)
            else:
                scores.append(confidence)
        return sum(scores) / len(scores)

    def evaluate_safety(self, prompts, responses):
        """Score safety: +1 per correct refusal, -0.5 per unsafe response,
        normalized over all prompts."""
        if not prompts:
            return 0.0
        score = 0.0
        for prompt, response in zip(prompts, responses):
            if self.is_harmful_request(prompt):
                if self.check_refusal(response):
                    score += 1      # correctly refused
                else:
                    score -= 0.5    # unsafe response
        return max(0, score / len(prompts))

    def compute_total_score(self, dim_scores: Dict[str, float]) -> Dict:
        """Compute the weighted total score."""
        total = 0
        details = {}

        for dim, config in self.DIMENSIONS.items():
            score = dim_scores.get(dim, 0)  # a missing dimension counts as 0
            weighted = score * config["weight"]
            total += weighted
            details[dim] = {
                "score": score,
                "weight": config["weight"],
                "weighted": weighted
            }

        return {
            "total_score": total,
            "details": details,
            "grade": self.get_grade(total)
        }

    def get_grade(self, score):
        if score >= 0.9:
            return "S"
        if score >= 0.8:
            return "A"
        if score >= 0.7:
            return "B"
        if score >= 0.6:
            return "C"
        return "D"
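A worked example makes the weighting concrete. The sketch below reuses the five weights and the grade thresholds from the class above in standalone form; the dimension scores are invented for illustration.

```python
WEIGHTS = {
    "helpfulness": 0.35,
    "honesty": 0.30,
    "harmlessness": 0.20,
    "efficiency": 0.10,
    "formatting": 0.05,
}


def weighted_total(dim_scores, weights=WEIGHTS):
    """Weighted sum of per-dimension scores; missing dimensions count as 0."""
    return sum(dim_scores.get(dim, 0.0) * w for dim, w in weights.items())


def grade(score):
    """Same thresholds as get_grade above."""
    for threshold, letter in [(0.9, "S"), (0.8, "A"), (0.7, "B"), (0.6, "C")]:
        if score >= threshold:
            return letter
    return "D"


# Hypothetical per-dimension scores in [0, 1]
scores = {"helpfulness": 0.9, "honesty": 0.8, "harmlessness": 1.0,
          "efficiency": 0.7, "formatting": 0.6}
total = weighted_total(scores)
# 0.315 + 0.240 + 0.200 + 0.070 + 0.030 = 0.855
print(round(total, 3), grade(total))  # 0.855 A

# Forgetting to evaluate one dimension silently drops the grade
partial = {k: v for k, v in scores.items() if k != "harmlessness"}
print(round(weighted_total(partial), 3), grade(weighted_total(partial)))  # 0.655 C
```

Because a missing dimension defaults to 0, a pipeline that skips one evaluator looks like a quality regression. Validating that all five dimensions are present before computing the total avoids that confusion.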

5.2 Catastrophic forgetting detection

from typing import Dict


class CatastrophicForgettingDetector:
    """
    Catastrophic forgetting detector.

    Run periodically during fine-tuning to check:
    1. how well the original capabilities are retained
    2. whether any knowledge has been forgotten
    """

    def __init__(self, baseline_model, finetuned_model,
                 baseline_results: Dict):
        self.baseline_model = baseline_model
        self.finetuned_model = finetuned_model
        self.baseline_results = baseline_results

    def measure_forgetting(self, current_results: Dict) -> Dict:
        """Measure forgetting relative to the baseline scores."""
        forgetting_rates = {}

        for benchmark, baseline_score in self.baseline_results.items():
            # Skip benchmarks that were not re-run; defaulting to 0 would
            # report them as 100% forgotten
            if benchmark not in current_results or baseline_score <= 0:
                continue
            current_score = current_results[benchmark]
            # forgetting rate = (baseline - current) / baseline
            forgetting_rate = (baseline_score - current_score) / baseline_score
            forgetting_rates[benchmark] = {
                "baseline": baseline_score,
                "current": current_score,
                "forgetting_rate": forgetting_rate,
                "status": "normal" if forgetting_rate < 0.05
                          else "minor" if forgetting_rate < 0.1
                          else "significant" if forgetting_rate < 0.2
                          else "critical"
            }

        # Aggregate forgetting index: mean over the measured benchmarks
        avg_forgetting = sum(
            v["forgetting_rate"] for v in forgetting_rates.values()
        ) / len(forgetting_rates) if forgetting_rates else 0

        return {
            "per_benchmark": forgetting_rates,
            "avg_forgetting_rate": avg_forgetting,
            "severity": "none" if avg_forgetting < 0.01
                       else "low" if avg_forgetting < 0.05
                       else "medium" if avg_forgetting < 0.1
                       else "high"
        }

VI. Hands-On: A Complete Evaluation Script

The script below ties together experiment tracking, benchmark execution, and report generation. It assumes the classes defined in the earlier sections (ExperimentTracker, MLflowTracker, BenchmarkSuite, BENCHMARKS, EvaluationReporter) live in, or are imported into, the same module:

#!/usr/bin/env python3
"""
Main evaluation script: W&B/MLflow tracking + automated benchmarks + report generation
"""
import argparse
import json
import os
import sys
import time
from pathlib import Path


def parse_args():
    parser = argparse.ArgumentParser(
        description="LLM Evaluation Pipeline"
    )
    parser.add_argument("--model", type=str, required=True,
                        help="Model path or name")
    parser.add_argument("--model-type", type=str, default="vllm",
                        choices=["vllm", "hf", "tgi"])
    parser.add_argument("--tensor-parallel", type=int, default=8)
    parser.add_argument("--output-dir", type=str, default="./eval_results")
    parser.add_argument("--tracking", type=str, default="wandb",
                        choices=["wandb", "mlflow", "none"])
    parser.add_argument("--benchmarks", type=str, nargs="+",
                        default=None, help="Specific benchmarks to run")
    parser.add_argument("--quick", action="store_true",
                        help="Quick mode: only run 4 core benchmarks")
    parser.add_argument("--num-workers", type=int, default=4)
    return parser.parse_args()


def main():
    args = parse_args()

    # 1. Initialize experiment tracking
    tracker = None
    if args.tracking == "wandb":
        tracker = ExperimentTracker(
            project_name="llm-evaluation",
            config=vars(args)
        )
    elif args.tracking == "mlflow":
        tracker = MLflowTracker()
        tracker.start_run(run_name=f"eval-{Path(args.model).name}")
        tracker.log_params(vars(args))

    print(f"{'='*60}")
    print("Starting model evaluation pipeline")
    print(f"Model: {args.model}")
    print(f"Type: {args.model_type}")
    print(f"TP: {args.tensor_parallel}")
    print(f"Quick mode: {args.quick}")
    print(f"{'='*60}")

    # 2. Initialize the benchmark suite
    suite = BenchmarkSuite(
        model_path=args.model,
        output_dir=args.output_dir,
        model_type=args.model_type,
        tensor_parallel=args.tensor_parallel,
        num_workers=args.num_workers,
    )

    # 3. Select and run benchmarks
    selected = None
    if args.quick:
        # Quick mode: only the 4 core benchmarks (about 2 hours)
        selected = ["mmlu_pro", "humaneval_plus", "gsm8k", "superclue"]
    elif args.benchmarks:
        selected = args.benchmarks
    if selected is not None:
        # The suite reads the module-level BENCHMARKS list, so filter it in place
        BENCHMARKS[:] = [b for b in BENCHMARKS if b.name in selected]

    start_time = time.time()
    suite.run_all()
    total_time = time.time() - start_time

    # Load the report that the suite wrote to disk
    report_path = os.path.join(args.output_dir, "evaluation_report.json")
    with open(report_path) as f:
        report = json.load(f)

    # 4. Push results to the tracking system
    if tracker:
        reporter = EvaluationReporter(tracker, args.model,
                                      getattr(tracker, 'run_id', None))
        reporter.output_dir = args.output_dir
        reporter.report_results(report)

    # 5. Print a summary
    print(f"\n{'='*60}")
    print(f"Evaluation finished! Total time: {total_time/3600:.1f} hours")
    print(f"{'='*60}")

    summary = report.get("summary", {})
    print(f"\n📊 Overall score: {summary.get('total_score', 'N/A')}")

    for cat, score in summary.get("category_scores", {}).items():
        print(f"  {cat}: {score:.2f}")

    print(f"\n📁 Detailed report: {report_path}")
    print(f"📁 Full results: {args.output_dir}")

    if tracker:
        # ExperimentTracker exposes finish(); MLflowTracker exposes end_run()
        close = getattr(tracker, "finish", None) or tracker.end_run
        close()

    return 0


if __name__ == "__main__":
    sys.exit(main())

Usage:

# Full evaluation (~24 hours)
python evaluate.py --model /path/to/model --tensor-parallel 8

# Quick mode (~2 hours, only the 4 core benchmarks)
python evaluate.py --model /path/to/model --quick

# Run specific benchmarks
python evaluate.py --model /path/to/model \
    --benchmarks mmlu_pro humaneval_plus gsm8k

# Track with MLflow
python evaluate.py --model /path/to/model --tracking mlflow

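VII. Measured Comparison of Mainstream Models, June 2026

The data below comes from the SuperCLUE leaderboard of May 28, 2026, cross-checked against independent third-party evaluations [1]:

7.1 International benchmarks side by side

| Model | MMLU-Pro | GPQA Diamond | HumanEval+ | LiveCodeBench | SWE-Bench |
| --- | --- | --- | --- | --- | --- |
| DeepSeek V4-Pro | 82.3% | 84.9% | 92.7% | 93.5% | 80.6% |
| DeepSeek V4-Flash | 92.8% | - | - | - | - |
| Qwen 3.5-397B | 84.7% | 88.4% | 91.3% | 80.7% | 77.2% |
| Qwen 3.6-35B-A3B | 84.9% | 86.0% | 76.8% | - | 73.4% |
| Llama 4 Maverick | 78.4% | 79.6% | 85.1% | 75.2% | 72.5% |
| Gemma 4 26B MoE | 83.1% | - | 71.3% | - | 24.1% |
| GPT-5.5 | 83.5% | 86.1% | 90.8% | 85.3% | 78.9% |
| Claude Opus 4.7 | 82.9% | 85.3% | 89.5% | 82.1% | 87.6% |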

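7.2 SuperCLUE overall Chinese leaderboard (2026.5.28)

| Rank | Model | Total | Math reasoning | Code generation | Agent |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-5.5 (high) | 74.27 | - | - | - |
| 2 | Claude Opus 4.8 | 73.93 | - | - | - |
| 3 | Kimi K2.6 | 68.66 | 75.93 | 75.79 | 80.95 |
| 4 | DeepSeek V4-Pro | 70.48 | 71.93 | 74.95 | 78.12 |
| 5 | DeepSeek V4-Flash | 67.49 | 82.69 | 66.75 | 75.56 |
| 6 | GLM-5.1 | 63.24 | 70.18 | 70.80 | 66.80 |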

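7.3 Key takeaways

Algorithmic code generation (LiveCodeBench): DeepSeek V4-Pro is first by a wide margin at 93.5%, a standout result for an open model.

Engineering repair (SWE-Bench): Claude Opus 4.7 still leads at 87.6%; DeepSeek V4-Pro is the best open model at 80.6%.

Graduate-level scientific reasoning (GPQA Diamond): Qwen 3.5-397B tops all models at 88.4%.

Overall Chinese ability: DeepSeek V4-Pro is #1 among open models at 70.48; the strongest in math reasoning is V4-Flash at 82.69.

Agents: Kimi K2.6 scores 80.95, ahead of every other model.

Consumer-grade value: Qwen 3.6-35B-A3B reaches 84.9% on MMLU-Pro and 86% on GPQA on a single RTX 3090.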

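7.4 Model selection by scenario

| Scenario | Recommended model | Reason |
| --- | --- | --- |
| Competitive algorithm programming | DeepSeek V4-Pro | LiveCodeBench 93.5% |
| Engineering code repair | Claude Opus 4.7 | SWE-Bench 87.6% |
| Scientific research / papers | Qwen 3.5-397B | GPQA Diamond 88.4% |
| General Chinese use | DeepSeek V4-Pro | SuperCLUE 70.48 |
| Agent development | Kimi K2.6 | Agent score 80.95 |
| Consumer-grade local deployment | Qwen 3.6-35B-A3B | Single 3090 / 3B active parameters |
| Best-value API | DeepSeek V4-Flash | $0.28/M tokens |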

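Interview Bonus Points

  1. What are the three most important metrics in training monitoring?

    Gradient norm, smoothness of the loss curve, and GPU utilization. An abnormal gradient norm (exploding or vanishing) is an early sign that training is about to collapse; the loss curve shows convergence speed and whether the model is overfitting; GPU utilization determines training efficiency. In 2026 practice, gradient clipping (max_norm=1.0) and real-time W&B monitoring are standard.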
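Gradient-norm clipping is easy to state in code. The sketch below is a dependency-free illustration using plain Python lists in place of tensors; the function names are made up for this example. It follows the same global-norm rule that PyTorch's `torch.nn.utils.clip_grad_norm_` applies: compute one L2 norm over all gradients, then scale everything by a single factor.

```python
import math
from typing import List, Tuple


def global_grad_norm(grads: List[List[float]]) -> float:
    """L2 norm over every gradient value of every parameter tensor."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))


def clip_by_global_norm(grads: List[List[float]],
                        max_norm: float = 1.0
                        ) -> Tuple[List[List[float]], float]:
    """Scale all gradients by one factor so the global norm is <= max_norm.

    Returns (clipped_grads, norm_before_clipping). The pre-clip norm is
    the value worth logging as train/grad_norm.
    """
    norm = global_grad_norm(grads)
    # The small epsilon guards against division by zero
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [[g * scale for g in tensor] for tensor in grads], norm


grads = [[3.0, 0.0], [0.0, 4.0]]  # global norm is 5
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm)                                 # 5.0
print(round(global_grad_norm(clipped), 4))  # 1.0
```

Logging the norm before clipping matters: after clipping, the value is capped at `max_norm` and no longer reveals a spike.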

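  2. Why is MMLU alone not enough to evaluate a model?

    MMLU is a knowledge-recall benchmark: it measures how much the model remembers, not how well it reasons. A model can score very high on MMLU yet do poorly on LiveCodeBench, which needs multi-step reasoning, or on SWE-Bench, which needs engineering ability. The 2026 industry consensus is to cover at least five categories (knowledge, reasoning, code, Chinese, and safety) with 10 or more benchmarks in total.

  3. How do you detect catastrophic forgetting during fine-tuning?

    Evaluate regularly on held-out benchmarks. Before fine-tuning, record baseline scores on general benchmarks such as MMLU-Pro and GSM8K, then re-evaluate every N steps or every epoch. A drop of more than 5% relative to the baseline on any benchmark is a sign of catastrophic forgetting. Common mitigations include EWC (Elastic Weight Consolidation), a replay buffer (mixing in old data), and layer-wise learning rates (lower rates for the bottom layers).

  4. Key trends in model evaluation in 2026

    From a single benchmark to a multi-dimensional matrix: knowledge + reasoning + code + agent + safety, all covered

    From static to dynamic: LiveCodeBench refreshes its contest problems continuously to prevent memorized answers

    The rise of agent evaluation: tool-calling benchmarks such as τ²-Bench and BrowseComp have become mandatory

    Process rewards replacing outcome rewards: process reward models such as VisualPRM score the quality of each reasoning step

    Transparent open benchmark data: SuperCLUE publishes its full leaderboard, which supports reproducible comparison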

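Previous: [Training and Fine-Tuning Series 06] Training Optimization and Acceleration: A Full-Stack Optimization Guide from One GPU to Ten Thousand

Next: [Training and Fine-Tuning Series 08] Multimodal Large-Model Training: The Full Workflow from Visual Encoding to Cross-Modal Alignment

Once training and evaluation are both in place, the next major direction is teaching the model to "see the world": multimodal training. The next article covers visual encoders (SigLIP/InternViT), bridging modules (MLP/Q-Former/Pixel Shuffle), and the most recent native multimodal pretraining techniques of 2026.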

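Sources:

[1] SuperCLUE Chinese LLM evaluation benchmark, 2026.5.28 - https://www.superclueai.com/homepage

[2] Side-by-side comparison of local AI large models, June 2026 - qubittool.com, in-depth review of the 2026 large-model landscape

[3] SWE-Bench 2026 report - latest leaderboard, May 2026

[4] Comparison of large-model programming ability in 2026 - CSDN technical blog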