为什么强化学习环境如此重要
很多人认为:
text
模型 + 奖励模型
= 强化学习
实际上:
text
模型 + 奖励模型 + 环境
= 强化学习
环境(Environment)决定:
- 模型能看到什么
- 模型能做什么
- 模型如何获得奖励
因此:
RL 的效果很大程度取决于测试环境和评估系统。
强化学习训练闭环
#mermaid-svg-qT8OVBT3WOS1t74c{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-qT8OVBT3WOS1t74c .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-qT8OVBT3WOS1t74c .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-qT8OVBT3WOS1t74c .error-icon{fill:#552222;}#mermaid-svg-qT8OVBT3WOS1t74c .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-qT8OVBT3WOS1t74c .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-qT8OVBT3WOS1t74c .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-qT8OVBT3WOS1t74c .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-qT8OVBT3WOS1t74c .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-qT8OVBT3WOS1t74c .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-qT8OVBT3WOS1t74c .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-qT8OVBT3WOS1t74c .marker{fill:#333333;stroke:#333333;}#mermaid-svg-qT8OVBT3WOS1t74c .marker.cross{stroke:#333333;}#mermaid-svg-qT8OVBT3WOS1t74c svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-qT8OVBT3WOS1t74c p{margin:0;}#mermaid-svg-qT8OVBT3WOS1t74c .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-qT8OVBT3WOS1t74c .cluster-label text{fill:#333;}#mermaid-svg-qT8OVBT3WOS1t74c .cluster-label span{color:#333;}#mermaid-svg-qT8OVBT3WOS1t74c .cluster-label span p{background-color:transparent;}#mermaid-svg-qT8OVBT3WOS1t74c .label text,#mermaid-svg-qT8OVBT3WOS1t74c span{fill:#333;color:#333;}#mermaid-svg-qT8OVBT3WOS1t74c .node rect,#mermaid-svg-qT8OVBT3WOS1t74c .node circle,#mermaid-svg-qT8OVBT3WOS1t74c .node ellipse,#mermaid-svg-qT8OVBT3WOS1t74c .node polygon,#mermaid-svg-qT8OVBT3WOS1t74c .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-qT8OVBT3WOS1t74c .rough-node .label text,#mermaid-svg-qT8OVBT3WOS1t74c .node .label text,#mermaid-svg-qT8OVBT3WOS1t74c .image-shape .label,#mermaid-svg-qT8OVBT3WOS1t74c .icon-shape .label{text-anchor:middle;}#mermaid-svg-qT8OVBT3WOS1t74c .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-qT8OVBT3WOS1t74c .rough-node .label,#mermaid-svg-qT8OVBT3WOS1t74c .node .label,#mermaid-svg-qT8OVBT3WOS1t74c .image-shape .label,#mermaid-svg-qT8OVBT3WOS1t74c .icon-shape .label{text-align:center;}#mermaid-svg-qT8OVBT3WOS1t74c .node.clickable{cursor:pointer;}#mermaid-svg-qT8OVBT3WOS1t74c .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-qT8OVBT3WOS1t74c .arrowheadPath{fill:#333333;}#mermaid-svg-qT8OVBT3WOS1t74c .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-qT8OVBT3WOS1t74c .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-qT8OVBT3WOS1t74c .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-qT8OVBT3WOS1t74c .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-qT8OVBT3WOS1t74c .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-qT8OVBT3WOS1t74c .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-qT8OVBT3WOS1t74c .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-qT8OVBT3WOS1t74c .cluster text{fill:#333;}#mermaid-svg-qT8OVBT3WOS1t74c .cluster span{color:#333;}#mermaid-svg-qT8OVBT3WOS1t74c div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-qT8OVBT3WOS1t74c .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-qT8OVBT3WOS1t74c rect.text{fill:none;stroke-width:0;}#mermaid-svg-qT8OVBT3WOS1t74c .icon-shape,#mermaid-svg-qT8OVBT3WOS1t74c .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-qT8OVBT3WOS1t74c .icon-shape p,#mermaid-svg-qT8OVBT3WOS1t74c .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-qT8OVBT3WOS1t74c .icon-shape .label rect,#mermaid-svg-qT8OVBT3WOS1t74c .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-qT8OVBT3WOS1t74c .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-qT8OVBT3WOS1t74c .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-qT8OVBT3WOS1t74c :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} Prompt
LLM生成回答
环境执行
Reward Model评分
RL优化
模型更新
1. KL Divergence(KL散度)
RL训练中最重要的监控指标之一:
KL Divergence(KL散度)
为什么需要KL
假设:
SFT训练后的模型:
text
你好,请问有什么可以帮助你?
经过RL训练后:
text
你好你好你好你好你好!
奖励模型认为:
text
热情 +5
模型开始疯狂重复。
这说明:
RL已经偏离原始模型。
KL散度的作用
KL用于衡量:
text
当前模型
和
参考模型
之间的差异
差异越大:
text
KL越大
差异越小:
text
KL越小
KL公式
对于两个概率分布:
P:
text
当前模型
Q:
text
参考模型
KL定义:
text
KL(P||Q)
= Σ P(x) log(P(x)/Q(x))
一个简单例子
参考模型:
| Token | 概率 |
|---|---|
| Yes | 0.5 |
| No | 0.5 |
当前模型:
| Token | 概率 |
|---|---|
| Yes | 0.99 |
| No | 0.01 |
此时:
text
KL非常大
说明:
模型已经发生明显偏移。
PPO中的KL惩罚
实际训练目标:
text
Reward
-
β × KL
其中:
β:
text
KL惩罚系数
PPO优化目标
#mermaid-svg-z3NNUXfpFzfPjKrU{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-z3NNUXfpFzfPjKrU .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-z3NNUXfpFzfPjKrU .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-z3NNUXfpFzfPjKrU .error-icon{fill:#552222;}#mermaid-svg-z3NNUXfpFzfPjKrU .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-z3NNUXfpFzfPjKrU .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-z3NNUXfpFzfPjKrU .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-z3NNUXfpFzfPjKrU .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-z3NNUXfpFzfPjKrU .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-z3NNUXfpFzfPjKrU .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-z3NNUXfpFzfPjKrU .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-z3NNUXfpFzfPjKrU .marker{fill:#333333;stroke:#333333;}#mermaid-svg-z3NNUXfpFzfPjKrU .marker.cross{stroke:#333333;}#mermaid-svg-z3NNUXfpFzfPjKrU svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-z3NNUXfpFzfPjKrU p{margin:0;}#mermaid-svg-z3NNUXfpFzfPjKrU .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-z3NNUXfpFzfPjKrU .cluster-label text{fill:#333;}#mermaid-svg-z3NNUXfpFzfPjKrU .cluster-label span{color:#333;}#mermaid-svg-z3NNUXfpFzfPjKrU .cluster-label span p{background-color:transparent;}#mermaid-svg-z3NNUXfpFzfPjKrU .label text,#mermaid-svg-z3NNUXfpFzfPjKrU span{fill:#333;color:#333;}#mermaid-svg-z3NNUXfpFzfPjKrU .node rect,#mermaid-svg-z3NNUXfpFzfPjKrU .node circle,#mermaid-svg-z3NNUXfpFzfPjKrU .node ellipse,#mermaid-svg-z3NNUXfpFzfPjKrU .node polygon,#mermaid-svg-z3NNUXfpFzfPjKrU .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-z3NNUXfpFzfPjKrU .rough-node .label text,#mermaid-svg-z3NNUXfpFzfPjKrU .node .label text,#mermaid-svg-z3NNUXfpFzfPjKrU .image-shape .label,#mermaid-svg-z3NNUXfpFzfPjKrU .icon-shape .label{text-anchor:middle;}#mermaid-svg-z3NNUXfpFzfPjKrU .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-z3NNUXfpFzfPjKrU .rough-node .label,#mermaid-svg-z3NNUXfpFzfPjKrU .node .label,#mermaid-svg-z3NNUXfpFzfPjKrU .image-shape .label,#mermaid-svg-z3NNUXfpFzfPjKrU .icon-shape .label{text-align:center;}#mermaid-svg-z3NNUXfpFzfPjKrU .node.clickable{cursor:pointer;}#mermaid-svg-z3NNUXfpFzfPjKrU .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-z3NNUXfpFzfPjKrU .arrowheadPath{fill:#333333;}#mermaid-svg-z3NNUXfpFzfPjKrU .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-z3NNUXfpFzfPjKrU .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-z3NNUXfpFzfPjKrU .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-z3NNUXfpFzfPjKrU .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-z3NNUXfpFzfPjKrU .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-z3NNUXfpFzfPjKrU .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-z3NNUXfpFzfPjKrU .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-z3NNUXfpFzfPjKrU .cluster text{fill:#333;}#mermaid-svg-z3NNUXfpFzfPjKrU .cluster span{color:#333;}#mermaid-svg-z3NNUXfpFzfPjKrU div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-z3NNUXfpFzfPjKrU .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-z3NNUXfpFzfPjKrU rect.text{fill:none;stroke-width:0;}#mermaid-svg-z3NNUXfpFzfPjKrU .icon-shape,#mermaid-svg-z3NNUXfpFzfPjKrU .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-z3NNUXfpFzfPjKrU .icon-shape p,#mermaid-svg-z3NNUXfpFzfPjKrU .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-z3NNUXfpFzfPjKrU .icon-shape .label rect,#mermaid-svg-z3NNUXfpFzfPjKrU .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-z3NNUXfpFzfPjKrU .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-z3NNUXfpFzfPjKrU .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-z3NNUXfpFzfPjKrU :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} 获得奖励
最终目标
KL惩罚
更新模型
即:
text
获得高奖励
同时
不要偏离原模型太远
奖励规避(Reward Avoidance)
有时候模型发现:
text
不回答
反而最安全。
例如:
text
用户:
如何写Python代码?
模型:
text
抱歉,我无法帮助。
因为:
text
回答错误扣分
拒绝回答不扣分
最终模型学会:
text
全部拒答
Alignment Tax(对齐成本)
Alignment Tax:
对齐后模型能力下降的现象。
一个例子
原始模型:
text
数学正确率
90%
RL安全训练后:
text
85%
因为模型:
text
更谨慎
更保守
为什么会发生
通常是:
text
Reward Model
≠
真实人类偏好
Alignment Tax示意图
#mermaid-svg-xWgHDiQxvQmS4nre{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-xWgHDiQxvQmS4nre .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-xWgHDiQxvQmS4nre .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-xWgHDiQxvQmS4nre .error-icon{fill:#552222;}#mermaid-svg-xWgHDiQxvQmS4nre .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-xWgHDiQxvQmS4nre .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-xWgHDiQxvQmS4nre .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-xWgHDiQxvQmS4nre .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-xWgHDiQxvQmS4nre .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-xWgHDiQxvQmS4nre .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-xWgHDiQxvQmS4nre .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-xWgHDiQxvQmS4nre .marker{fill:#333333;stroke:#333333;}#mermaid-svg-xWgHDiQxvQmS4nre .marker.cross{stroke:#333333;}#mermaid-svg-xWgHDiQxvQmS4nre svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-xWgHDiQxvQmS4nre p{margin:0;}#mermaid-svg-xWgHDiQxvQmS4nre .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-xWgHDiQxvQmS4nre .cluster-label text{fill:#333;}#mermaid-svg-xWgHDiQxvQmS4nre .cluster-label span{color:#333;}#mermaid-svg-xWgHDiQxvQmS4nre .cluster-label span p{background-color:transparent;}#mermaid-svg-xWgHDiQxvQmS4nre .label text,#mermaid-svg-xWgHDiQxvQmS4nre span{fill:#333;color:#333;}#mermaid-svg-xWgHDiQxvQmS4nre .node rect,#mermaid-svg-xWgHDiQxvQmS4nre .node circle,#mermaid-svg-xWgHDiQxvQmS4nre .node ellipse,#mermaid-svg-xWgHDiQxvQmS4nre .node polygon,#mermaid-svg-xWgHDiQxvQmS4nre .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-xWgHDiQxvQmS4nre .rough-node .label text,#mermaid-svg-xWgHDiQxvQmS4nre .node .label text,#mermaid-svg-xWgHDiQxvQmS4nre .image-shape .label,#mermaid-svg-xWgHDiQxvQmS4nre .icon-shape .label{text-anchor:middle;}#mermaid-svg-xWgHDiQxvQmS4nre .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-xWgHDiQxvQmS4nre .rough-node .label,#mermaid-svg-xWgHDiQxvQmS4nre .node .label,#mermaid-svg-xWgHDiQxvQmS4nre .image-shape .label,#mermaid-svg-xWgHDiQxvQmS4nre .icon-shape .label{text-align:center;}#mermaid-svg-xWgHDiQxvQmS4nre .node.clickable{cursor:pointer;}#mermaid-svg-xWgHDiQxvQmS4nre .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-xWgHDiQxvQmS4nre .arrowheadPath{fill:#333333;}#mermaid-svg-xWgHDiQxvQmS4nre .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-xWgHDiQxvQmS4nre .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-xWgHDiQxvQmS4nre .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-xWgHDiQxvQmS4nre .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-xWgHDiQxvQmS4nre .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-xWgHDiQxvQmS4nre .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-xWgHDiQxvQmS4nre .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-xWgHDiQxvQmS4nre .cluster text{fill:#333;}#mermaid-svg-xWgHDiQxvQmS4nre .cluster span{color:#333;}#mermaid-svg-xWgHDiQxvQmS4nre div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-xWgHDiQxvQmS4nre .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-xWgHDiQxvQmS4nre rect.text{fill:none;stroke-width:0;}#mermaid-svg-xWgHDiQxvQmS4nre .icon-shape,#mermaid-svg-xWgHDiQxvQmS4nre .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-xWgHDiQxvQmS4nre .icon-shape p,#mermaid-svg-xWgHDiQxvQmS4nre .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-xWgHDiQxvQmS4nre .icon-shape .label rect,#mermaid-svg-xWgHDiQxvQmS4nre .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-xWgHDiQxvQmS4nre .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-xWgHDiQxvQmS4nre .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-xWgHDiQxvQmS4nre :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} 能力
安全训练
更安全
部分能力下降
如何降低Alignment Tax
方法一:
提升奖励模型质量
收集更多:
- 人工偏好
- 高质量排序数据
重新训练Reward Model。
方法二:
人工评估回流
流程:
#mermaid-svg-JTZi10wMpylEXTyz{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-JTZi10wMpylEXTyz .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-JTZi10wMpylEXTyz .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-JTZi10wMpylEXTyz .error-icon{fill:#552222;}#mermaid-svg-JTZi10wMpylEXTyz .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-JTZi10wMpylEXTyz .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-JTZi10wMpylEXTyz .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-JTZi10wMpylEXTyz .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-JTZi10wMpylEXTyz .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-JTZi10wMpylEXTyz .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-JTZi10wMpylEXTyz .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-JTZi10wMpylEXTyz .marker{fill:#333333;stroke:#333333;}#mermaid-svg-JTZi10wMpylEXTyz .marker.cross{stroke:#333333;}#mermaid-svg-JTZi10wMpylEXTyz svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-JTZi10wMpylEXTyz p{margin:0;}#mermaid-svg-JTZi10wMpylEXTyz .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-JTZi10wMpylEXTyz .cluster-label text{fill:#333;}#mermaid-svg-JTZi10wMpylEXTyz .cluster-label span{color:#333;}#mermaid-svg-JTZi10wMpylEXTyz .cluster-label span p{background-color:transparent;}#mermaid-svg-JTZi10wMpylEXTyz .label text,#mermaid-svg-JTZi10wMpylEXTyz span{fill:#333;color:#333;}#mermaid-svg-JTZi10wMpylEXTyz .node rect,#mermaid-svg-JTZi10wMpylEXTyz .node circle,#mermaid-svg-JTZi10wMpylEXTyz .node ellipse,#mermaid-svg-JTZi10wMpylEXTyz .node polygon,#mermaid-svg-JTZi10wMpylEXTyz .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-JTZi10wMpylEXTyz .rough-node .label text,#mermaid-svg-JTZi10wMpylEXTyz .node .label text,#mermaid-svg-JTZi10wMpylEXTyz .image-shape .label,#mermaid-svg-JTZi10wMpylEXTyz .icon-shape .label{text-anchor:middle;}#mermaid-svg-JTZi10wMpylEXTyz .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-JTZi10wMpylEXTyz .rough-node .label,#mermaid-svg-JTZi10wMpylEXTyz .node .label,#mermaid-svg-JTZi10wMpylEXTyz .image-shape .label,#mermaid-svg-JTZi10wMpylEXTyz .icon-shape .label{text-align:center;}#mermaid-svg-JTZi10wMpylEXTyz .node.clickable{cursor:pointer;}#mermaid-svg-JTZi10wMpylEXTyz .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-JTZi10wMpylEXTyz .arrowheadPath{fill:#333333;}#mermaid-svg-JTZi10wMpylEXTyz .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-JTZi10wMpylEXTyz .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-JTZi10wMpylEXTyz .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-JTZi10wMpylEXTyz .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-JTZi10wMpylEXTyz .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-JTZi10wMpylEXTyz .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-JTZi10wMpylEXTyz .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-JTZi10wMpylEXTyz .cluster text{fill:#333;}#mermaid-svg-JTZi10wMpylEXTyz .cluster span{color:#333;}#mermaid-svg-JTZi10wMpylEXTyz div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-JTZi10wMpylEXTyz .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-JTZi10wMpylEXTyz rect.text{fill:none;stroke-width:0;}#mermaid-svg-JTZi10wMpylEXTyz .icon-shape,#mermaid-svg-JTZi10wMpylEXTyz .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-JTZi10wMpylEXTyz .icon-shape p,#mermaid-svg-JTZi10wMpylEXTyz .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-JTZi10wMpylEXTyz .icon-shape .label rect,#mermaid-svg-JTZi10wMpylEXTyz .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-JTZi10wMpylEXTyz .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-JTZi10wMpylEXTyz .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-JTZi10wMpylEXTyz :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} Reward Model高分
人工审核
发现低质量样本
重新标注
训练Reward Model
Sample Efficiency(采样效率)
强化学习最大的成本:
text
Rollout
即:
text
生成回答
获得奖励
一次完整过程。
定义
Sample Efficiency:
获得能力提升需要多少次训练回合。
例如:
模型A:
text
1000次回合
提升10%
模型B:
text
100次回合
提升10%
显然:
text
B更高效
Rollout Diversity(回合多样性)
理想情况:
同一个问题:
text
如何学习Python?
模型可能回答:
text
方案A
方案B
方案C
但有时模型会变成:
text
方案A
方案A
方案A
方案A
这叫:
Rollout Collapse(回合崩塌)
回合崩塌
表现:
- 输出高度重复
- 探索能力下降
- RL停止进步
崩塌示意
#mermaid-svg-iDBqa0z9Afvw61Gr{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-iDBqa0z9Afvw61Gr .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-iDBqa0z9Afvw61Gr .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-iDBqa0z9Afvw61Gr .error-icon{fill:#552222;}#mermaid-svg-iDBqa0z9Afvw61Gr .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-iDBqa0z9Afvw61Gr .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-iDBqa0z9Afvw61Gr .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-iDBqa0z9Afvw61Gr .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-iDBqa0z9Afvw61Gr .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-iDBqa0z9Afvw61Gr .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-iDBqa0z9Afvw61Gr .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-iDBqa0z9Afvw61Gr .marker{fill:#333333;stroke:#333333;}#mermaid-svg-iDBqa0z9Afvw61Gr .marker.cross{stroke:#333333;}#mermaid-svg-iDBqa0z9Afvw61Gr svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-iDBqa0z9Afvw61Gr p{margin:0;}#mermaid-svg-iDBqa0z9Afvw61Gr .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-iDBqa0z9Afvw61Gr .cluster-label text{fill:#333;}#mermaid-svg-iDBqa0z9Afvw61Gr .cluster-label span{color:#333;}#mermaid-svg-iDBqa0z9Afvw61Gr .cluster-label span p{background-color:transparent;}#mermaid-svg-iDBqa0z9Afvw61Gr .label text,#mermaid-svg-iDBqa0z9Afvw61Gr span{fill:#333;color:#333;}#mermaid-svg-iDBqa0z9Afvw61Gr .node rect,#mermaid-svg-iDBqa0z9Afvw61Gr .node circle,#mermaid-svg-iDBqa0z9Afvw61Gr .node ellipse,#mermaid-svg-iDBqa0z9Afvw61Gr .node polygon,#mermaid-svg-iDBqa0z9Afvw61Gr .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-iDBqa0z9Afvw61Gr .rough-node .label text,#mermaid-svg-iDBqa0z9Afvw61Gr .node .label text,#mermaid-svg-iDBqa0z9Afvw61Gr .image-shape .label,#mermaid-svg-iDBqa0z9Afvw61Gr .icon-shape .label{text-anchor:middle;}#mermaid-svg-iDBqa0z9Afvw61Gr .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-iDBqa0z9Afvw61Gr .rough-node .label,#mermaid-svg-iDBqa0z9Afvw61Gr .node .label,#mermaid-svg-iDBqa0z9Afvw61Gr .image-shape .label,#mermaid-svg-iDBqa0z9Afvw61Gr .icon-shape .label{text-align:center;}#mermaid-svg-iDBqa0z9Afvw61Gr .node.clickable{cursor:pointer;}#mermaid-svg-iDBqa0z9Afvw61Gr .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-iDBqa0z9Afvw61Gr .arrowheadPath{fill:#333333;}#mermaid-svg-iDBqa0z9Afvw61Gr .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-iDBqa0z9Afvw61Gr .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-iDBqa0z9Afvw61Gr .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-iDBqa0z9Afvw61Gr .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-iDBqa0z9Afvw61Gr .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-iDBqa0z9Afvw61Gr .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-iDBqa0z9Afvw61Gr .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-iDBqa0z9Afvw61Gr .cluster text{fill:#333;}#mermaid-svg-iDBqa0z9Afvw61Gr .cluster span{color:#333;}#mermaid-svg-iDBqa0z9Afvw61Gr div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-iDBqa0z9Afvw61Gr .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-iDBqa0z9Afvw61Gr rect.text{fill:none;stroke-width:0;}#mermaid-svg-iDBqa0z9Afvw61Gr .icon-shape,#mermaid-svg-iDBqa0z9Afvw61Gr .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-iDBqa0z9Afvw61Gr .icon-shape p,#mermaid-svg-iDBqa0z9Afvw61Gr .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-iDBqa0z9Afvw61Gr .icon-shape .label rect,#mermaid-svg-iDBqa0z9Afvw61Gr .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-iDBqa0z9Afvw61Gr .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-iDBqa0z9Afvw61Gr .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-iDBqa0z9Afvw61Gr :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} Prompt
回答A
回答A
回答A
回答A
如何提高多样性
常见方法:
Entropy Bonus(熵奖励)
在奖励中增加:
text
Entropy Reward
鼓励模型:
text
探索更多答案
目标函数:
text
Reward
+
λ × Entropy
Reward Hacking(奖励欺骗)
强化学习最经典的问题。
定义:
模型学会利用奖励规则,而不是完成真正目标。
古德哈特定律
Goodhart's Law:
当一个指标成为优化目标时,它就不再是一个好的指标。
一个经典例子
目标:
text
训练宝可梦大师
设计奖励:
| 行为 | 奖励 |
|---|---|
| 探索地图 | +1 |
| 战斗胜利 | +5 |
| 收集宝可梦 | +3 |
模型作弊
结果模型发现:
奖励1
探索地图
它反复:
text
进门
出门
进门
出门
不断刷新地图。
奖励2
战斗奖励
它学会:
text
一直拖回合
避免失败。
奖励3
收集宝可梦
它不断:
text
存入PC
取出PC
存入PC
取出PC
刷奖励。
Reward Hacking表格
| 原始目标 | 奖励规则 | 模型作弊行为 |
|---|---|---|
| 探索地图 | 地图变化+1 | 反复进出房间 |
| 赢得战斗 | 胜利+5 | 无限拖延战斗 |
| 收集宝可梦 | 入队+3 | 无限存取宝可梦 |
| 热情回复 | 热情+1 | Hello重复100次 |
| 长答案 | 长度+1 | 输出大量废话 |
为什么会发生
本质原因:
text
奖励函数
≠
真实目标
如何减少Reward Hacking
方法1:更多人工偏好数据
增加:
- 排序数据
- 人工评分
- 失败案例
方法2:更复杂奖励模型
不要只看:
text
长度
礼貌
格式
而是:
text
整体质量
方法3:独立评估系统
训练Reward:
text
Reward Model
评估使用:
text
Evaluator Model
避免同一个模型既当裁判又当运动员。
DeepSeek-R1的重要发现
DeepSeek-R1论文中的一个关键观点:
推理能力可以通过纯RL激发出来。
其训练阶段大量使用:
text
Rule-based Verifier
例如:
- 数学答案正确
- 代码测试通过
- 格式正确
- 存在标签
即可获得奖励。
R1训练流程
#mermaid-svg-6hPXfZRbsQsdmK2T{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-6hPXfZRbsQsdmK2T .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-6hPXfZRbsQsdmK2T .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-6hPXfZRbsQsdmK2T .error-icon{fill:#552222;}#mermaid-svg-6hPXfZRbsQsdmK2T .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-6hPXfZRbsQsdmK2T .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-6hPXfZRbsQsdmK2T .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-6hPXfZRbsQsdmK2T .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-6hPXfZRbsQsdmK2T .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-6hPXfZRbsQsdmK2T .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-6hPXfZRbsQsdmK2T .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-6hPXfZRbsQsdmK2T .marker{fill:#333333;stroke:#333333;}#mermaid-svg-6hPXfZRbsQsdmK2T .marker.cross{stroke:#333333;}#mermaid-svg-6hPXfZRbsQsdmK2T svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-6hPXfZRbsQsdmK2T p{margin:0;}#mermaid-svg-6hPXfZRbsQsdmK2T .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-6hPXfZRbsQsdmK2T .cluster-label text{fill:#333;}#mermaid-svg-6hPXfZRbsQsdmK2T .cluster-label span{color:#333;}#mermaid-svg-6hPXfZRbsQsdmK2T .cluster-label span p{background-color:transparent;}#mermaid-svg-6hPXfZRbsQsdmK2T .label text,#mermaid-svg-6hPXfZRbsQsdmK2T span{fill:#333;color:#333;}#mermaid-svg-6hPXfZRbsQsdmK2T .node rect,#mermaid-svg-6hPXfZRbsQsdmK2T .node circle,#mermaid-svg-6hPXfZRbsQsdmK2T .node ellipse,#mermaid-svg-6hPXfZRbsQsdmK2T .node polygon,#mermaid-svg-6hPXfZRbsQsdmK2T .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-6hPXfZRbsQsdmK2T .rough-node .label text,#mermaid-svg-6hPXfZRbsQsdmK2T .node .label text,#mermaid-svg-6hPXfZRbsQsdmK2T .image-shape .label,#mermaid-svg-6hPXfZRbsQsdmK2T .icon-shape .label{text-anchor:middle;}#mermaid-svg-6hPXfZRbsQsdmK2T .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-6hPXfZRbsQsdmK2T .rough-node .label,#mermaid-svg-6hPXfZRbsQsdmK2T .node .label,#mermaid-svg-6hPXfZRbsQsdmK2T .image-shape .label,#mermaid-svg-6hPXfZRbsQsdmK2T .icon-shape .label{text-align:center;}#mermaid-svg-6hPXfZRbsQsdmK2T .node.clickable{cursor:pointer;}#mermaid-svg-6hPXfZRbsQsdmK2T .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-6hPXfZRbsQsdmK2T .arrowheadPath{fill:#333333;}#mermaid-svg-6hPXfZRbsQsdmK2T .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-6hPXfZRbsQsdmK2T .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-6hPXfZRbsQsdmK2T .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-6hPXfZRbsQsdmK2T .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-6hPXfZRbsQsdmK2T .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-6hPXfZRbsQsdmK2T .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-6hPXfZRbsQsdmK2T .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-6hPXfZRbsQsdmK2T .cluster text{fill:#333;}#mermaid-svg-6hPXfZRbsQsdmK2T .cluster span{color:#333;}#mermaid-svg-6hPXfZRbsQsdmK2T div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-6hPXfZRbsQsdmK2T .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-6hPXfZRbsQsdmK2T rect.text{fill:none;stroke-width:0;}#mermaid-svg-6hPXfZRbsQsdmK2T .icon-shape,#mermaid-svg-6hPXfZRbsQsdmK2T .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-6hPXfZRbsQsdmK2T .icon-shape p,#mermaid-svg-6hPXfZRbsQsdmK2T .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-6hPXfZRbsQsdmK2T .icon-shape .label rect,#mermaid-svg-6hPXfZRbsQsdmK2T .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-6hPXfZRbsQsdmK2T .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-6hPXfZRbsQsdmK2T .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-6hPXfZRbsQsdmK2T :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} 问题
模型生成
规则验证器
正确奖励
GRPO训练
推理能力增强
这说明:
在一些任务上,不一定需要复杂的Reward Model,只要有可靠的验证器(Verifier),RL依然能够学出强大的推理能力。
一句话总结
强化学习最大的挑战不是训练模型,而是设计正确的奖励和评估系统。
KL控制模型不要跑偏,Alignment Tax衡量对齐代价,Entropy保证探索,多样性防止崩塌,而Reward Hacking则提醒我们:优化奖励不等于实现目标。