Reinforcement Learning / Alignment (Personal Notes)

Bradley-Terry Reward Model

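What it does: given the hidden states of the chosen and rejected responses, project each one to a scalar reward and compute the pairwise preference loss.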

python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_hidden, rejected_hidden, reward_head):
    # Assumed shapes: hidden states (B, D), reward_head (D, 1)
    r_chosen = (chosen_hidden @ reward_head).squeeze(-1)     # (B,)
    r_rejected = (rejected_hidden @ reward_head).squeeze(-1) # (B,)
    margin = r_chosen - r_rejected
    # Bradley-Terry loss: -log sigmoid(margin). F.logsigmoid is numerically
    # stable, whereas the manual log(1 / (1 + exp(-x))) overflows for very negative x
    return -F.logsigmoid(margin).mean()
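  1. The loss is -log sigmoid(r_chosen - r_rejected), averaged over the batch. Pairwise losses are usually written in this log-sigmoid form, which is equivalent to binary cross-entropy on the margin with the label fixed to 1 (the chosen response wins).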
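The equivalence with binary cross-entropy can be checked numerically. The sketch below is only illustrative: the margin values are made up, and the comparison assumes PyTorch's F.binary_cross_entropy_with_logits with every target set to 1. It also shows why a hand-written log(1 / (1 + exp(-x))) is fragile for very negative margins.

```python
import torch
import torch.nn.functional as F

# Made-up reward margins (r_chosen - r_rejected), for illustration only
margin = torch.tensor([2.0, -1.0, 0.5, -100.0])

# Pairwise Bradley-Terry loss in log-sigmoid form
bt_loss = -F.logsigmoid(margin).mean()

# Binary cross-entropy on the margin with target 1 ("chosen wins")
bce_loss = F.binary_cross_entropy_with_logits(margin, torch.ones_like(margin))

# Hand-written form: exp(100) overflows float32, so the last element is inf
manual = -torch.log(1.0 / (1.0 + torch.exp(-margin)))

same = torch.allclose(bt_loss, bce_loss)
manual_is_finite = torch.isfinite(manual).all().item()
```

Here same should be True while manual_is_finite should be False, which is the practical argument for using F.logsigmoid instead of composing log and sigmoid by hand.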

DPO Loss

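What it does: align a language model with human preferences without a reinforcement-learning loop, using the log-probabilities of paired chosen/rejected responses.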

python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: beta-scaled log-ratio of policy to reference
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    diff = chosen_rewards - rejected_rewards
    # logsigmoid avoids log(0) when sigmoid(diff) underflows
    return -F.logsigmoid(diff).mean()
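  1. The reference model anchors the policy: the loss only sees how far the policy's log-probabilities have moved relative to the reference, which keeps the model from drifting away from its initial language ability and degenerating.
  2. Arguments such as policy_chosen_logps are sequence-level log-likelihoods: sum the per-token log-probs of the whole response (in practice the prompt tokens are usually masked out of the sum).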
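To make the "sum the per-token log-probs" step concrete, here is a minimal sketch. The helper name sequence_logps, the response_mask argument, and the toy shapes are all hypothetical and not from the original notes; the sketch also assumes the logits are already aligned with the labels (position t predicts label t).

```python
import torch
import torch.nn.functional as F

def sequence_logps(logits, labels, response_mask):
    """Hypothetical helper: sum per-token log-probs over response tokens.

    logits: (B, T, V), assumed aligned so logits[:, t] predicts labels[:, t]
    labels: (B, T) target token ids
    response_mask: (B, T), 1.0 for response tokens, 0.0 for prompt/padding
    returns: (B,) sequence-level log-probs
    """
    logp = F.log_softmax(logits, dim=-1)                             # (B, T, V)
    token_logps = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (B, T)
    return (token_logps * response_mask).sum(dim=-1)                 # (B,)

# Toy usage with random logits (made-up shapes: B=2, T=5, V=11)
torch.manual_seed(0)
logits = torch.randn(2, 5, 11)
labels = torch.randint(0, 11, (2, 5))
mask = torch.tensor([[0., 0., 1., 1., 1.],
                     [0., 1., 1., 0., 0.]])
logps = sequence_logps(logits, labels, mask)
```

Running the same helper on the policy and on the frozen reference model, for both the chosen and the rejected response, would yield the four inputs that dpo_loss expects.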

GRPO Loss

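What it does: normalize rewards within each prompt group to compute advantages, then optimize the policy with these group-relative advantages.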

python
import torch
from torch import Tensor

def grpo_loss(logps: Tensor, rewards: Tensor, group_ids: Tensor,
              eps: float = 1e-5) -> Tensor:
    """Group Relative Policy Optimization (GRPO) loss.

    logps: (B,) policy log-probs for each sampled response
    rewards: (B,) scalar rewards for each response
    group_ids: (B,) integers, same id = same prompt/group
    returns: scalar loss (Tensor)
    """
    # Compute per-group normalized advantages A_i
    unique_ids = group_ids.unique()
    advantages = torch.empty_like(rewards)
    for gid in unique_ids:
        mask = group_ids == gid
        r_g = rewards[mask]
        mean_g = r_g.mean()
        std_g = r_g.std(unbiased=False)
        advantages[mask] = (r_g - mean_g) / (std_g + eps)

    # Stop gradient through advantages
    advantages_detached = advantages.detach()

    # Simplified GRPO objective: -E[A_i * log pi_i]
    # (REINFORCE-style; omits the clipped ratio and KL penalty of the full method)
    return -(advantages_detached * logps).mean()
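  1. During backpropagation, no gradient flows through the advantages. They are treated as constants (targets) that only weight the policy gradient.
  2. No critic network is needed. Standard PPO trains a value network (critic) to estimate advantages; GRPO replaces it with within-group statistics, which simplifies the architecture.
  3. Multiple responses to the same prompt are compared against each other, which removes the bias caused by differences in prompt difficulty.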
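The third point can be illustrated with a small worked example. The sketch below repeats the per-group normalization from grpo_loss in a hypothetical helper, group_advantages, and applies it to made-up rewards for an "easy" prompt (high rewards) and a "hard" prompt (low rewards) that have the same spread.

```python
import torch

def group_advantages(rewards, group_ids, eps=1e-5):
    # Hypothetical helper: same per-group normalization as in grpo_loss
    adv = torch.empty_like(rewards)
    for gid in group_ids.unique():
        mask = group_ids == gid
        r_g = rewards[mask]
        adv[mask] = (r_g - r_g.mean()) / (r_g.std(unbiased=False) + eps)
    return adv

# Made-up rewards. Group 0: "easy" prompt; group 1: "hard" prompt
rewards = torch.tensor([0.9, 0.7, 0.8, 0.1, -0.1, 0.0])
group_ids = torch.tensor([0, 0, 0, 1, 1, 1])
adv = group_advantages(rewards, group_ids)

# Both groups end up with (approximately) the same advantages,
# even though their raw reward levels differ by 0.8
same_across_groups = torch.allclose(adv[:3], adv[3:], atol=1e-4)
```

After normalization only the ranking inside each group matters. Note that a group whose rewards are all identical gets advantages of zero, so it contributes no gradient.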

PPO Loss

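What it does: constrain policy updates by clipping the importance-sampling ratio, preventing destructively large updates in reinforcement learning.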

python
import torch
from torch import Tensor

def ppo_loss(new_logps: Tensor, old_logps: Tensor, advantages: Tensor,
             clip_ratio: float = 0.2) -> Tensor:
    """PPO clipped surrogate loss.

    new_logps: (B,) current policy log-probs
    old_logps: (B,) old policy log-probs (treated as constant)
    advantages: (B,) advantage estimates (treated as constant)
    returns: scalar loss (Tensor)
    """
    # Detach old_logps and advantages so gradients only flow through new_logps
    old_logps_detached = old_logps.detach()
    adv_detached = advantages.detach()

    # Importance sampling ratio r = pi_new / pi_old in log-space
    ratios = torch.exp(new_logps - old_logps_detached)

    # Unclipped and clipped objectives
    unclipped = ratios * adv_detached
    clipped = torch.clamp(ratios, 1.0 - clip_ratio, 1.0 + clip_ratio) * adv_detached

    # PPO objective: negative mean of the more conservative objective
    return -torch.min(unclipped, clipped).mean()
  1. Clipping the ratio keeps any single update from moving the policy too far, which avoids policy collapse.
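One way to see the clipping at work is to inspect gradients. The sketch below restates the clipped objective in a hypothetical helper, clipped_surrogate, so that it runs on its own, and uses made-up log-probs: one sample whose ratio is already above 1 + clip_ratio and one whose ratio is still inside the clipping range, both with positive advantage.

```python
import torch

def clipped_surrogate(new_logps, old_logps, advantages, clip_ratio=0.2):
    # Hypothetical helper: same clipped objective as ppo_loss, restated for self-containment
    ratios = torch.exp(new_logps - old_logps.detach())
    adv = advantages.detach()
    unclipped = ratios * adv
    clipped = torch.clamp(ratios, 1.0 - clip_ratio, 1.0 + clip_ratio) * adv
    return -torch.min(unclipped, clipped).mean()

old_logps = torch.tensor([-1.0, -1.0])
# Sample 0: ratio = exp(0.5), about 1.65, above 1 + clip_ratio
# Sample 1: ratio = exp(0.05), about 1.05, inside the clipping range
new_logps = torch.tensor([-0.5, -0.95], requires_grad=True)
advantages = torch.tensor([1.0, 1.0])

loss = clipped_surrogate(new_logps, old_logps, advantages)
loss.backward()
grad = new_logps.grad
```

Sample 0 receives zero gradient because the clipped branch is selected and clamp is flat outside its range, so the loss stops pushing a probability that has already moved far enough. Sample 1 still receives a negative gradient, meaning gradient descent keeps increasing its log-prob.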