Bradley-Terry Reward Model
Meaning: given the hidden states of the chosen and rejected responses, project them to scalar rewards and compute the preference loss.
```python
import torch

def reward_model_loss(chosen_hidden, rejected_hidden, reward_head):
    # Project the chosen/rejected hidden states to scalar rewards via the reward head
    r_chosen = (chosen_hidden @ reward_head).squeeze(-1)      # (B,)
    r_rejected = (rejected_hidden @ reward_head).squeeze(-1)  # (B,)
    margin = r_chosen - r_rejected
    # manual log-sigmoid: log(1/(1+exp(-x))) = -log(1+exp(-x))
    # (torch.nn.functional.logsigmoid(margin) is the numerically stable equivalent)
    loss = -torch.log(1.0 / (1.0 + torch.exp(-margin))).mean()
    return loss
```
- `loss = -torch.log(1.0 / (1.0 + torch.exp(-margin))).mean()`: the pairwise loss is usually written in log-sigmoid form, which is equivalent to a binary cross-entropy loss.
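As a quick sanity check (an illustrative snippet, not part of the original code), the log-sigmoid pairwise loss coincides with binary cross-entropy computed on the margin with an all-ones target:

```python
import torch
import torch.nn.functional as F

margin = torch.randn(8)  # hypothetical chosen-minus-rejected reward margins
pairwise = -F.logsigmoid(margin).mean()  # log-sigmoid pairwise loss
bce = F.binary_cross_entropy_with_logits(margin, torch.ones_like(margin))
assert torch.allclose(pairwise, bce)  # identical up to floating-point error
```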
DPO Loss
Meaning: aligns a language model with human preferences without reinforcement learning, using paired chosen/rejected log-probabilities.
```python
import torch

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: beta-scaled log-ratios between policy and reference model
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    diff = chosen_rewards - rejected_rewards
    # -log sigmoid(diff); torch.nn.functional.logsigmoid(diff) is the stable equivalent
    return -torch.log(torch.sigmoid(diff)).mean()
```
- The reference model keeps the policy from drifting away from its initial language ability and prevents degeneration.
- Arguments like policy_chosen_logps are the log-likelihood of the entire response, obtained by summing the per-token log-probs (a minimal sketch follows below).
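A minimal sketch of how such sequence log-probs could be computed, assuming `logits` of shape (B, T, V) already aligned with `labels` (B, T) and a 0/1 `response_mask` selecting response tokens; the names and shapes here are illustrative assumptions, not part of the original notes:

```python
import torch

def sequence_logps(logits, labels, response_mask):
    # Log-probability assigned to each observed token
    log_probs = torch.log_softmax(logits, dim=-1)                          # (B, T, V)
    token_logps = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)   # (B, T)
    # Sum over the response tokens only -> one log-likelihood per sequence
    return (token_logps * response_mask).sum(dim=-1)                       # (B,)
```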
GRPO Loss
Meaning: normalizes rewards within each prompt's group to compute advantages, then optimizes the policy with these group-relative advantages.
```python
import torch
from torch import Tensor

def grpo_loss(logps: Tensor, rewards: Tensor, group_ids: Tensor,
              eps: float = 1e-5) -> Tensor:
    """Group Relative Policy Optimization (GRPO) loss.
    logps:     (B,) policy log-probs for each sampled response
    rewards:   (B,) scalar rewards for each response
    group_ids: (B,) integers, same id = same prompt/group
    returns:   scalar loss (Tensor)
    """
    # Compute per-group normalized advantages A_i
    unique_ids = group_ids.unique()
    advantages = torch.empty_like(rewards)
    for gid in unique_ids:
        mask = group_ids == gid
        r_g = rewards[mask]
        mean_g = r_g.mean()
        std_g = r_g.std(unbiased=False)
        advantages[mask] = (r_g - mean_g) / (std_g + eps)
    # Stop gradient through advantages
    advantages_detached = advantages.detach()
    # GRPO objective: -E[A_i * log pi_i]
    return -(advantages_detached * logps).mean()
```
- During backpropagation, no gradient flows through the advantages; they are treated as "constants" or "targets" and are only used to weight the policy gradient.
- No critic network is needed. Standard PPO trains a value network (critic) to estimate advantages; GRPO replaces it with within-group statistics, simplifying the architecture.
- Multiple responses to the same prompt are compared against each other, removing the bias introduced by differences in prompt difficulty (see the usage sketch below).
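A small usage sketch with made-up numbers: two prompts with four sampled responses each, so every advantage is computed relative to its own group's mean and standard deviation:

```python
import torch

logps = torch.randn(8, requires_grad=True)           # policy log-probs of 8 responses
rewards = torch.tensor([1.0, 0.0, 0.5, 0.2,          # 4 samples for prompt 0
                        0.9, 0.1, 0.3, 0.7])         # 4 samples for prompt 1
group_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])   # same id = same prompt

loss = grpo_loss(logps, rewards, group_ids)
loss.backward()   # gradients reach logps only; the advantages were detached
```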
PPO Loss
Meaning: constrains policy updates by clipping the importance-sampling ratio, preventing destructively large updates during reinforcement learning.
```python
import torch
from torch import Tensor

def ppo_loss(new_logps: Tensor, old_logps: Tensor, advantages: Tensor,
             clip_ratio: float = 0.2) -> Tensor:
    """PPO clipped surrogate loss.
    new_logps:  (B,) current policy log-probs
    old_logps:  (B,) old policy log-probs (treated as constant)
    advantages: (B,) advantage estimates (treated as constant)
    returns:    scalar loss (Tensor)
    """
    # Detach old_logps and advantages so gradients only flow through new_logps
    old_logps_detached = old_logps.detach()
    adv_detached = advantages.detach()
    # Importance sampling ratio r = pi_new / pi_old, computed in log-space
    ratios = torch.exp(new_logps - old_logps_detached)
    # Unclipped and clipped surrogate objectives
    unclipped = ratios * adv_detached
    clipped = torch.clamp(ratios, 1.0 - clip_ratio, 1.0 + clip_ratio) * adv_detached
    # PPO objective: negative mean of the more conservative (smaller) objective
    return -torch.min(unclipped, clipped).mean()
```
- Clipping the ratio prevents any single update step from being too large and avoids policy collapse.
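A toy check of the clipping behavior (made-up numbers): once the ratio exceeds 1 + clip_ratio with a positive advantage, the clipped term wins the min, so the gradient with respect to new_logps vanishes and the update stops pushing further in that direction:

```python
import torch

old_logps = torch.tensor([0.0])
new_logps = torch.tensor([0.5], requires_grad=True)  # ratio = exp(0.5) ≈ 1.65 > 1.2
advantages = torch.tensor([1.0])

loss = ppo_loss(new_logps, old_logps, advantages, clip_ratio=0.2)
loss.backward()
print(new_logps.grad)  # tensor([0.]) -- clipping removes the incentive to move further
```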