12. 为什么评估（Evals）比训练更重要

很多人认为：

训练决定模型能力。

实际上在工业界，更常见的说法是：

评估决定训练方向。

因为训练只是不断优化模型，

而评估（Evaluation）决定：

模型是否真的变好
下一步应该收集什么数据
应该调整什么奖励机制
应该优化哪些能力

因此：

评估不是训练结束后的验收环节，而是训练过程中的导航系统。

1. 为什么评估如此重要

训练过程本质上是：
#mermaid-svg-bnIYUiyG5mGEV6Bt{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-bnIYUiyG5mGEV6Bt .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-bnIYUiyG5mGEV6Bt .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-bnIYUiyG5mGEV6Bt .error-icon{fill:#552222;}#mermaid-svg-bnIYUiyG5mGEV6Bt .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-bnIYUiyG5mGEV6Bt .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-bnIYUiyG5mGEV6Bt .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-bnIYUiyG5mGEV6Bt .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-bnIYUiyG5mGEV6Bt .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-bnIYUiyG5mGEV6Bt .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-bnIYUiyG5mGEV6Bt .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-bnIYUiyG5mGEV6Bt .marker{fill:#333333;stroke:#333333;}#mermaid-svg-bnIYUiyG5mGEV6Bt .marker.cross{stroke:#333333;}#mermaid-svg-bnIYUiyG5mGEV6Bt svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-bnIYUiyG5mGEV6Bt p{margin:0;}#mermaid-svg-bnIYUiyG5mGEV6Bt .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-bnIYUiyG5mGEV6Bt .cluster-label text{fill:#333;}#mermaid-svg-bnIYUiyG5mGEV6Bt .cluster-label span{color:#333;}#mermaid-svg-bnIYUiyG5mGEV6Bt .cluster-label span p{background-color:transparent;}#mermaid-svg-bnIYUiyG5mGEV6Bt .label text,#mermaid-svg-bnIYUiyG5mGEV6Bt span{fill:#333;color:#333;}#mermaid-svg-bnIYUiyG5mGEV6Bt .node rect,#mermaid-svg-bnIYUiyG5mGEV6Bt .node circle,#mermaid-svg-bnIYUiyG5mGEV6Bt .node ellipse,#mermaid-svg-bnIYUiyG5mGEV6Bt .node polygon,#mermaid-svg-bnIYUiyG5mGEV6Bt .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-bnIYUiyG5mGEV6Bt .rough-node .label text,#mermaid-svg-bnIYUiyG5mGEV6Bt .node .label text,#mermaid-svg-bnIYUiyG5mGEV6Bt .image-shape .label,#mermaid-svg-bnIYUiyG5mGEV6Bt .icon-shape .label{text-anchor:middle;}#mermaid-svg-bnIYUiyG5mGEV6Bt .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-bnIYUiyG5mGEV6Bt .rough-node .label,#mermaid-svg-bnIYUiyG5mGEV6Bt .node .label,#mermaid-svg-bnIYUiyG5mGEV6Bt .image-shape .label,#mermaid-svg-bnIYUiyG5mGEV6Bt .icon-shape .label{text-align:center;}#mermaid-svg-bnIYUiyG5mGEV6Bt .node.clickable{cursor:pointer;}#mermaid-svg-bnIYUiyG5mGEV6Bt .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-bnIYUiyG5mGEV6Bt .arrowheadPath{fill:#333333;}#mermaid-svg-bnIYUiyG5mGEV6Bt .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-bnIYUiyG5mGEV6Bt .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-bnIYUiyG5mGEV6Bt .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-bnIYUiyG5mGEV6Bt .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-bnIYUiyG5mGEV6Bt .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-bnIYUiyG5mGEV6Bt .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-bnIYUiyG5mGEV6Bt .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-bnIYUiyG5mGEV6Bt .cluster text{fill:#333;}#mermaid-svg-bnIYUiyG5mGEV6Bt .cluster span{color:#333;}#mermaid-svg-bnIYUiyG5mGEV6Bt div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-bnIYUiyG5mGEV6Bt .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-bnIYUiyG5mGEV6Bt rect.text{fill:none;stroke-width:0;}#mermaid-svg-bnIYUiyG5mGEV6Bt .icon-shape,#mermaid-svg-bnIYUiyG5mGEV6Bt .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-bnIYUiyG5mGEV6Bt .icon-shape p,#mermaid-svg-bnIYUiyG5mGEV6Bt .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-bnIYUiyG5mGEV6Bt .icon-shape .label rect,#mermaid-svg-bnIYUiyG5mGEV6Bt .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-bnIYUiyG5mGEV6Bt .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-bnIYUiyG5mGEV6Bt .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-bnIYUiyG5mGEV6Bt :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} 评估发现问题
收集数据
训练模型
重新评估

这是一个持续循环。

例如：

评估发现：

text 复制代码

数学能力差

下一步：

text 复制代码

收集数学数据
增加数学RL训练

评估发现：

text 复制代码

代码能力下降

下一步：

text 复制代码

增加代码测试集
增加代码奖励模型

因此：

评估决定了数据收集和训练策略。

2. 强化学习中的评估

对于 RL 来说：

评估甚至更加重要。

因为 RL 的核心问题是：

Reward 是否真的反映了用户想要的能力？

例如：

text 复制代码

热情 +1

模型可能学会：

text 复制代码

Hello!!!
Hello!!!
Hello!!!

获得高奖励。

但用户体验反而变差。

因此需要评估系统不断发现：

Reward Hacking
奖励漏洞
模型作弊行为

3. RL 中的评估闭环

#mermaid-svg-YP2bAqRYWH5YGvOe{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-YP2bAqRYWH5YGvOe .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-YP2bAqRYWH5YGvOe .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-YP2bAqRYWH5YGvOe .error-icon{fill:#552222;}#mermaid-svg-YP2bAqRYWH5YGvOe .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-YP2bAqRYWH5YGvOe .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-YP2bAqRYWH5YGvOe .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-YP2bAqRYWH5YGvOe .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-YP2bAqRYWH5YGvOe .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-YP2bAqRYWH5YGvOe .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-YP2bAqRYWH5YGvOe .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-YP2bAqRYWH5YGvOe .marker{fill:#333333;stroke:#333333;}#mermaid-svg-YP2bAqRYWH5YGvOe .marker.cross{stroke:#333333;}#mermaid-svg-YP2bAqRYWH5YGvOe svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-YP2bAqRYWH5YGvOe p{margin:0;}#mermaid-svg-YP2bAqRYWH5YGvOe .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-YP2bAqRYWH5YGvOe .cluster-label text{fill:#333;}#mermaid-svg-YP2bAqRYWH5YGvOe .cluster-label span{color:#333;}#mermaid-svg-YP2bAqRYWH5YGvOe .cluster-label span p{background-color:transparent;}#mermaid-svg-YP2bAqRYWH5YGvOe .label text,#mermaid-svg-YP2bAqRYWH5YGvOe span{fill:#333;color:#333;}#mermaid-svg-YP2bAqRYWH5YGvOe .node rect,#mermaid-svg-YP2bAqRYWH5YGvOe .node circle,#mermaid-svg-YP2bAqRYWH5YGvOe .node ellipse,#mermaid-svg-YP2bAqRYWH5YGvOe .node polygon,#mermaid-svg-YP2bAqRYWH5YGvOe .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-YP2bAqRYWH5YGvOe .rough-node .label text,#mermaid-svg-YP2bAqRYWH5YGvOe .node .label text,#mermaid-svg-YP2bAqRYWH5YGvOe .image-shape .label,#mermaid-svg-YP2bAqRYWH5YGvOe .icon-shape .label{text-anchor:middle;}#mermaid-svg-YP2bAqRYWH5YGvOe .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-YP2bAqRYWH5YGvOe .rough-node .label,#mermaid-svg-YP2bAqRYWH5YGvOe .node .label,#mermaid-svg-YP2bAqRYWH5YGvOe .image-shape .label,#mermaid-svg-YP2bAqRYWH5YGvOe .icon-shape .label{text-align:center;}#mermaid-svg-YP2bAqRYWH5YGvOe .node.clickable{cursor:pointer;}#mermaid-svg-YP2bAqRYWH5YGvOe .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-YP2bAqRYWH5YGvOe .arrowheadPath{fill:#333333;}#mermaid-svg-YP2bAqRYWH5YGvOe .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-YP2bAqRYWH5YGvOe .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-YP2bAqRYWH5YGvOe .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-YP2bAqRYWH5YGvOe .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-YP2bAqRYWH5YGvOe .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-YP2bAqRYWH5YGvOe .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-YP2bAqRYWH5YGvOe .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-YP2bAqRYWH5YGvOe .cluster text{fill:#333;}#mermaid-svg-YP2bAqRYWH5YGvOe .cluster span{color:#333;}#mermaid-svg-YP2bAqRYWH5YGvOe div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-YP2bAqRYWH5YGvOe .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-YP2bAqRYWH5YGvOe rect.text{fill:none;stroke-width:0;}#mermaid-svg-YP2bAqRYWH5YGvOe .icon-shape,#mermaid-svg-YP2bAqRYWH5YGvOe .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-YP2bAqRYWH5YGvOe .icon-shape p,#mermaid-svg-YP2bAqRYWH5YGvOe .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-YP2bAqRYWH5YGvOe .icon-shape .label rect,#mermaid-svg-YP2bAqRYWH5YGvOe .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-YP2bAqRYWH5YGvOe .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-YP2bAqRYWH5YGvOe .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-YP2bAqRYWH5YGvOe :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} 模型生成答案
Reward评分
RL训练
评估系统
发现问题
修改Reward

4. 预训练中的评估指标

Pre-training 最关注：

Training Loss

训练集损失：

text 复制代码

模型对训练数据预测是否准确

Validation Loss

验证集损失：

text 复制代码

模型在未见数据上的表现

Perplexity（困惑度）

衡量：

模型预测下一个 Token 的难度。

公式可以理解为：

text 复制代码

Perplexity 越低越好

例如：

模型	PPL
模型A	10
模型B	5

说明：

text 复制代码

模型B预测能力更强

5. 为什么 Loss 不适合后训练

在后训练（Post-training）阶段：

Loss 的意义会迅速下降。

原因是：

后训练关注的是用户体验，而不是 Token 预测。

例如：

问题：

text 复制代码

如何学习Python？

回答A：

text 复制代码

学习Python。

回答B：

text 复制代码

建议先学习变量、函数和控制流，
再完成一些实际项目练习。

两者可能：

text 复制代码

Loss接近

但用户显然更喜欢：

text 复制代码

回答B

因此：

Loss 不等于用户满意度。

6. 后训练真正关注什么

后训练更关注：

text 复制代码

Helpful
Harmless
Honest
Reasoning
Tool Use

这些能力无法通过 Loss 完整衡量。

7. 什么是测试集（Test Set）

测试集：

模型从未见过的数据。

作用：

衡量模型真实泛化能力。

8. 为什么测试集必须未见过

如果：

text 复制代码

测试题
出现在训练数据中

模型可能只是：

text 复制代码

背答案

而不是学会能力。

这称为：

Data Leakage（数据泄漏）

9. 模型对比评估

评估不仅可以看：

text 复制代码

模型自己是否进步

还可以比较：

text 复制代码

模型A vs 模型B

例如：

text 复制代码

GPT-4
vs
Claude
vs
DeepSeek

10. ELO评分体系

很多 Chatbot Arena 使用：

ELO Rating

类似国际象棋评分。

流程

用户同时看到：

text 复制代码

回答A
回答B

然后投票：

text 复制代码

A更好

或者：

text 复制代码

B更好

系统不断更新：

text 复制代码

ELO Score

ELO示意图

#mermaid-svg-P8wt9cPZaS7N0YOa{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-P8wt9cPZaS7N0YOa .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-P8wt9cPZaS7N0YOa .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-P8wt9cPZaS7N0YOa .error-icon{fill:#552222;}#mermaid-svg-P8wt9cPZaS7N0YOa .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-P8wt9cPZaS7N0YOa .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-P8wt9cPZaS7N0YOa .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-P8wt9cPZaS7N0YOa .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-P8wt9cPZaS7N0YOa .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-P8wt9cPZaS7N0YOa .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-P8wt9cPZaS7N0YOa .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-P8wt9cPZaS7N0YOa .marker{fill:#333333;stroke:#333333;}#mermaid-svg-P8wt9cPZaS7N0YOa .marker.cross{stroke:#333333;}#mermaid-svg-P8wt9cPZaS7N0YOa svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-P8wt9cPZaS7N0YOa p{margin:0;}#mermaid-svg-P8wt9cPZaS7N0YOa .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-P8wt9cPZaS7N0YOa .cluster-label text{fill:#333;}#mermaid-svg-P8wt9cPZaS7N0YOa .cluster-label span{color:#333;}#mermaid-svg-P8wt9cPZaS7N0YOa .cluster-label span p{background-color:transparent;}#mermaid-svg-P8wt9cPZaS7N0YOa .label text,#mermaid-svg-P8wt9cPZaS7N0YOa span{fill:#333;color:#333;}#mermaid-svg-P8wt9cPZaS7N0YOa .node rect,#mermaid-svg-P8wt9cPZaS7N0YOa .node circle,#mermaid-svg-P8wt9cPZaS7N0YOa .node ellipse,#mermaid-svg-P8wt9cPZaS7N0YOa .node polygon,#mermaid-svg-P8wt9cPZaS7N0YOa .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-P8wt9cPZaS7N0YOa .rough-node .label text,#mermaid-svg-P8wt9cPZaS7N0YOa .node .label text,#mermaid-svg-P8wt9cPZaS7N0YOa .image-shape .label,#mermaid-svg-P8wt9cPZaS7N0YOa .icon-shape .label{text-anchor:middle;}#mermaid-svg-P8wt9cPZaS7N0YOa .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-P8wt9cPZaS7N0YOa .rough-node .label,#mermaid-svg-P8wt9cPZaS7N0YOa .node .label,#mermaid-svg-P8wt9cPZaS7N0YOa .image-shape .label,#mermaid-svg-P8wt9cPZaS7N0YOa .icon-shape .label{text-align:center;}#mermaid-svg-P8wt9cPZaS7N0YOa .node.clickable{cursor:pointer;}#mermaid-svg-P8wt9cPZaS7N0YOa .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-P8wt9cPZaS7N0YOa .arrowheadPath{fill:#333333;}#mermaid-svg-P8wt9cPZaS7N0YOa .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-P8wt9cPZaS7N0YOa .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-P8wt9cPZaS7N0YOa .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-P8wt9cPZaS7N0YOa .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-P8wt9cPZaS7N0YOa .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-P8wt9cPZaS7N0YOa .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-P8wt9cPZaS7N0YOa .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-P8wt9cPZaS7N0YOa .cluster text{fill:#333;}#mermaid-svg-P8wt9cPZaS7N0YOa .cluster span{color:#333;}#mermaid-svg-P8wt9cPZaS7N0YOa div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-P8wt9cPZaS7N0YOa .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-P8wt9cPZaS7N0YOa rect.text{fill:none;stroke-width:0;}#mermaid-svg-P8wt9cPZaS7N0YOa .icon-shape,#mermaid-svg-P8wt9cPZaS7N0YOa .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-P8wt9cPZaS7N0YOa .icon-shape p,#mermaid-svg-P8wt9cPZaS7N0YOa .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-P8wt9cPZaS7N0YOa .icon-shape .label rect,#mermaid-svg-P8wt9cPZaS7N0YOa .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-P8wt9cPZaS7N0YOa .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-P8wt9cPZaS7N0YOa .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-P8wt9cPZaS7N0YOa :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} 模型A
用户投票
模型B
ELO更新

11. RL 中也可以评估评分器

除了评估模型：

还需要评估：

Reward Model

因为：

Reward Model 也会出错。

例如：

模型输出：

text 复制代码

Hello!!!
Hello!!!
Hello!!!

Reward：

text 复制代码

10分

但人类：

text 复制代码

2分

说明：

text 复制代码

Reward Model失效

12. Calibration（校准）

现代模型不仅要正确。

还要：

知道自己什么时候不确定。

这叫：

Calibration（概率校准）

13. 什么是 Calibration

理想情况：

text 复制代码

模型说：
90%概率正确

现实中：

text 复制代码

真的90%正确

如果：

text 复制代码

模型说90%
实际只有50%

则：

过度自信（Overconfidence）

14. 校准的重要性

例如医疗场景：

模型说：

text 复制代码

99%确定

实际上：

text 复制代码

完全错误

风险极大。

15. Calibration 的评估方法

常见方法：

Token Probability Calibration

检查：

text 复制代码

Token概率

是否与：

text 复制代码

真实出现概率

一致。

Reliability Diagram

比较：

text 复制代码

预测概率

和：

text 复制代码

实际准确率

之间的关系。

16. 拒答机制

校准后：

模型可以学会：

text 复制代码

我不知道

例如：

text 复制代码

置信度 < 20%

输出：

text 复制代码

抱歉，我不确定这个答案。

而不是胡编乱造。

17. 效率评估（Efficiency）

除了正确率：

现代模型还关注：

响应速度。

TTFT

Time To First Token

即：

text 复制代码

用户提问
↓
第一个Token出现

所需时间。

TPOT

Time Per Output Token

即：

text 复制代码

平均生成一个Token
需要多少时间

18. 核心评估维度

现代 LLM 通常评估：

指标	作用
Accuracy	准确率
Fairness	公平性
Calibration	概率校准
Robustness	鲁棒性
Transparency	可解释性
Toxicity	有害内容
Efficiency	推理效率

19. Accuracy（准确率）

最基础指标：

text 复制代码

答对了多少题

例如：

数学
代码
QA

20. Fairness（公平性）

检查：

模型是否存在：

性别偏见
种族偏见
地域偏见

21. Robustness（鲁棒性）

测试：

输入稍微变化时，

模型是否稳定。

例如：

text 复制代码

2+2=?

和：

text 复制代码

请问2加2等于多少？

是否都能正确回答。

22. Transparency（可解释性）

检查：

模型是否能够解释：

text 复制代码

为什么这样回答

23. Toxicity（有害内容）

检查：

模型是否生成：

攻击性内容
歧视内容
危险内容

24. Efficiency（效率）

衡量：

text 复制代码

速度
成本
吞吐量

25. 常见公开评测集

MMLU

Massive Multitask Language Understanding

覆盖：

数学
法律
医学
历史
科学

共 50+ 学科。

是目前最经典的综合能力测试集之一。

GPQA

Graduate-Level Google-Proof Q&A

特点：

即使专业人士也很难答对。

主要测试：

物理
生物
化学

高级推理能力。

HumanEval

评估：

text 复制代码

代码生成能力

GSM8K

评估：

text 复制代码

数学推理能力

26. 评估驱动训练（Evaluation-Driven Development）

现代大模型研发越来越强调：
#mermaid-svg-l90dZTOE2LpIMQZs{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-l90dZTOE2LpIMQZs .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-l90dZTOE2LpIMQZs .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-l90dZTOE2LpIMQZs .error-icon{fill:#552222;}#mermaid-svg-l90dZTOE2LpIMQZs .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-l90dZTOE2LpIMQZs .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-l90dZTOE2LpIMQZs .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-l90dZTOE2LpIMQZs .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-l90dZTOE2LpIMQZs .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-l90dZTOE2LpIMQZs .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-l90dZTOE2LpIMQZs .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-l90dZTOE2LpIMQZs .marker{fill:#333333;stroke:#333333;}#mermaid-svg-l90dZTOE2LpIMQZs .marker.cross{stroke:#333333;}#mermaid-svg-l90dZTOE2LpIMQZs svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-l90dZTOE2LpIMQZs p{margin:0;}#mermaid-svg-l90dZTOE2LpIMQZs .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-l90dZTOE2LpIMQZs .cluster-label text{fill:#333;}#mermaid-svg-l90dZTOE2LpIMQZs .cluster-label span{color:#333;}#mermaid-svg-l90dZTOE2LpIMQZs .cluster-label span p{background-color:transparent;}#mermaid-svg-l90dZTOE2LpIMQZs .label text,#mermaid-svg-l90dZTOE2LpIMQZs span{fill:#333;color:#333;}#mermaid-svg-l90dZTOE2LpIMQZs .node rect,#mermaid-svg-l90dZTOE2LpIMQZs .node circle,#mermaid-svg-l90dZTOE2LpIMQZs .node ellipse,#mermaid-svg-l90dZTOE2LpIMQZs .node polygon,#mermaid-svg-l90dZTOE2LpIMQZs .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-l90dZTOE2LpIMQZs .rough-node .label text,#mermaid-svg-l90dZTOE2LpIMQZs .node .label text,#mermaid-svg-l90dZTOE2LpIMQZs .image-shape .label,#mermaid-svg-l90dZTOE2LpIMQZs .icon-shape .label{text-anchor:middle;}#mermaid-svg-l90dZTOE2LpIMQZs .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-l90dZTOE2LpIMQZs .rough-node .label,#mermaid-svg-l90dZTOE2LpIMQZs .node .label,#mermaid-svg-l90dZTOE2LpIMQZs .image-shape .label,#mermaid-svg-l90dZTOE2LpIMQZs .icon-shape .label{text-align:center;}#mermaid-svg-l90dZTOE2LpIMQZs .node.clickable{cursor:pointer;}#mermaid-svg-l90dZTOE2LpIMQZs .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-l90dZTOE2LpIMQZs .arrowheadPath{fill:#333333;}#mermaid-svg-l90dZTOE2LpIMQZs .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-l90dZTOE2LpIMQZs .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-l90dZTOE2LpIMQZs .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-l90dZTOE2LpIMQZs .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-l90dZTOE2LpIMQZs .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-l90dZTOE2LpIMQZs .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-l90dZTOE2LpIMQZs .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-l90dZTOE2LpIMQZs .cluster text{fill:#333;}#mermaid-svg-l90dZTOE2LpIMQZs .cluster span{color:#333;}#mermaid-svg-l90dZTOE2LpIMQZs div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-l90dZTOE2LpIMQZs .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-l90dZTOE2LpIMQZs rect.text{fill:none;stroke-width:0;}#mermaid-svg-l90dZTOE2LpIMQZs .icon-shape,#mermaid-svg-l90dZTOE2LpIMQZs .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-l90dZTOE2LpIMQZs .icon-shape p,#mermaid-svg-l90dZTOE2LpIMQZs .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-l90dZTOE2LpIMQZs .icon-shape .label rect,#mermaid-svg-l90dZTOE2LpIMQZs .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-l90dZTOE2LpIMQZs .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-l90dZTOE2LpIMQZs .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-l90dZTOE2LpIMQZs :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} Evals
发现问题
数据收集
SFT
RL
重新评估

核心思想：

不要先训练再评估，

而是先设计评估，再决定如何训练。

一句话总结

训练决定模型会什么，评估决定模型应该学什么。

在后训练时代，最优秀的团队往往不是训练能力最强，而是评估体系最完善。