Some ideas for modifying GRPO advantages (pass@k)

Optimizing Language Models for Inference Time Objectives using Reinforcement Learning

From https://arxiv.org/pdf/2503.19595. The code below was written directly in VeRL. In essence, a sample gets a non-zero advantage only when there is a gap between the highest and second-highest scores in its group; otherwise every advantage in the group is zero. The paper explains where this max-minus-second-max rule comes from, first defining a leave-one-out advantage estimate:
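One way to write the leave-one-out construction, consistent with the max-minus-second-max rule described above (the notation here is mine, reconstructed from the surrounding discussion): for $k$ sampled responses with rewards $r_1,\dots,r_k$ and a group objective $f$,

$$ A_i = f(r_1,\dots,r_k) - f(r_1,\dots,r_{i-1},\,r_{i+1},\dots,r_k), $$

i.e. each sample's advantage is how much the group objective drops when that sample is removed.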

The paper then works through several group objectives: average reward, pass@k, and majority voting. For pass@k, sorting the group rewards reduces the estimate to the maximum minus the second maximum:
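Concretely, for $f = \max$ (best-of-$k$), removing sample $i$ only changes the objective when $r_i$ is the maximum, so (again in my notation):

$$ A_i = \begin{cases} r_{(1)} - r_{(2)} & \text{if } r_i = r_{(1)}, \\ 0 & \text{otherwise}, \end{cases} $$

where $r_{(1)} \ge r_{(2)}$ are the two largest rewards in the group.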

The authors emphasize in their experiments: "With language model training on reasoning datasets, we showcase the performance trade-off enabled by training with such objectives. When training on code generation tasks, we show that the approach significantly improves pass@k objectives compared to the baseline method." Review details: https://openreview.net/forum?id=ZVWJO5YTz4

```python
# Excerpt adapted from verl; register_adv_est, AdvantageEstimator, and
# AlgoConfig are verl internals.
from collections import defaultdict
from typing import Optional

import numpy as np
import torch


@register_adv_est(AdvantageEstimator.GRPO_PASSK)  # or simply: @register_adv_est("grpo_passk")
def compute_grpo_passk_outcome_advantage(
    token_level_rewards: torch.Tensor,
    response_mask: torch.Tensor,
    index: np.ndarray,
    epsilon: float = 1e-6,
    norm_adv_by_std_in_grpo: bool = True,
    config: Optional[AlgoConfig] = None,
    **kwargs,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Compute advantage for Pass@k using a GRPO-style outcome reward formulation.
    Only the best response per group gets a non-zero advantage: r_max - r_second_max.

    Implemented as described in https://arxiv.org/abs/2503.19595.

    Args:
        token_level_rewards: (bs, response_length)
        response_mask: (bs, response_length)
        index: (bs,) → group ID per sample
        epsilon: float for numerical stability
        config: (AlgoConfig) algorithm settings, which contains "norm_adv_by_std_in_grpo"

    Returns:
        advantages: (bs, response_length)
        returns: (bs, response_length)
    """
    assert config is not None
    # if True, normalize advantage by std within group
    norm_adv_by_std_in_grpo = config.get("norm_adv_by_std_in_grpo", True)
    scores = token_level_rewards.sum(dim=-1)  # (bs,)
    advantages = torch.zeros_like(scores)

    id2scores = defaultdict(list)
    id2indices = defaultdict(list)

    with torch.no_grad():
        bsz = scores.shape[0]
        for i in range(bsz):
            idx = index[i]
            id2scores[idx].append(scores[i])
            id2indices[idx].append(i)

        for idx in id2scores:
            rewards = torch.stack(id2scores[idx])  # (k,)
            if rewards.numel() < 2:
                raise ValueError(
                    f"Pass@k requires at least 2 samples per group. Got {rewards.numel()} for group {idx}."
                )
            topk, topk_idx = torch.topk(rewards, 2)
            r_max, r_second_max = topk[0], topk[1]
            i_max = id2indices[idx][topk_idx[0].item()]
            advantage = r_max - r_second_max
            if norm_adv_by_std_in_grpo:
                std = torch.std(rewards)
                advantage = advantage / (std + epsilon)
            advantages[i_max] = advantage

    advantages = advantages.unsqueeze(-1) * response_mask
    return advantages, advantages
```
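The core rule can be exercised on its own. Below is a minimal self-contained sketch (`passk_advantage` is my helper name, not verl's API) that isolates the per-group logic:

```python
import torch

# Minimal sketch of the pass@k advantage rule: within one group of k
# responses, only the single highest-scoring response receives a non-zero
# advantage, equal to its gap over the runner-up, optionally normalized by
# the group's reward std.
def passk_advantage(rewards: torch.Tensor, norm_by_std: bool = True, eps: float = 1e-6) -> torch.Tensor:
    topk, topk_idx = torch.topk(rewards, 2)
    gap = topk[0] - topk[1]
    if norm_by_std:
        gap = gap / (rewards.std() + eps)
    adv = torch.zeros_like(rewards)
    adv[topk_idx[0]] = gap
    return adv

# Two correct answers tie at the top: no gap, so the whole group gets zero
# advantage and contributes no gradient signal.
print(passk_advantage(torch.tensor([1.0, 0.0, 0.0, 1.0])))
# A unique best answer: only its index gets a positive advantage.
print(passk_advantage(torch.tensor([1.0, 0.0, 0.0, 0.0])))
```

Note the degenerate case: with binary rewards, any group containing two or more correct answers (or zero correct answers) produces an all-zero advantage, so only groups with exactly one success provide signal.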

Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models

From https://openreview.net/forum?id=eslxxopXTF (rejected, mainly for limited novelty). The idea is straightforward; code at https://github.com/RUCAIBox/Passk_Training/blob/main/code/passk_adv.py. The authors' experiments show that pass@k training not only improves pass@k performance but also preserves pass@1 scores, and generalizes better across domains and tasks.
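In formulas (transcribed directly from the repository code below): for a group of $n$ samples with $c$ correct, trained for pass@$k$, the unbiased pass@$k$ estimate and the shaped advantages are

$$ \hat\rho = 1 - \binom{n-c}{k}\Big/\binom{n}{k}, \qquad \sigma = \sqrt{\hat\rho\,(1-\hat\rho)}, $$

$$ A^{+} = \frac{1-\hat\rho}{\sigma + \epsilon}, \qquad A^{-} = \frac{1-\hat\rho - \binom{n-c-1}{k-1}\big/\binom{n-1}{k-1}}{\sigma + \epsilon}, $$

where $A^{+}$ is assigned to correct samples and $A^{-}$ to incorrect ones.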

```python
from collections import defaultdict

import numpy as np
import torch
from scipy.special import comb


def calc_adv(val, k):
    """Replace binary rewards in one group with pass@k-shaped advantages."""
    c = len(np.where(val == 1)[0])  # number of correct samples
    n = len(val)
    # unbiased estimate of pass@k from n samples with c correct
    rho = 1 - comb(n - c, k) / comb(n, k)
    sigma = np.sqrt(rho * (1 - rho))
    adv_p = (1 - rho) / (sigma + 1e-6)  # advantage for correct samples
    adv_n = (1 - rho - comb(n - c - 1, k - 1) / comb(n - 1, k - 1)) / (sigma + 1e-6)
    new_val = np.where(val == 1, adv_p, val)
    new_val = np.where(new_val == 0, adv_n, new_val)
    return new_val


def compute_advantage(token_level_rewards, response_mask, index, K):
    scores = token_level_rewards.sum(dim=-1)

    id2score = defaultdict(list)
    uid2sid = defaultdict(list)

    with torch.no_grad():
        bsz = scores.shape[0]
        for i in range(bsz):
            id2score[index[i]].append(scores[i].detach().item())
            uid2sid[index[i]].append(i)
        for uid in id2score.keys():
            reward = np.array(id2score[uid])
            adv = calc_adv(reward, K)
            for i in range(len(uid2sid[uid])):
                scores[uid2sid[uid][i]] = adv[i]

    scores = scores.unsqueeze(-1) * response_mask

    return scores, scores
```
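A worked numeric example of the shaping (the numbers here are mine, not from the paper), showing that correct samples are pushed up, incorrect ones down, and the group mean advantage stays zero as in standard GRPO:

```python
import numpy as np
from scipy.special import comb

# A group of n = 4 samples, c = 2 correct, trained for pass@k with k = 2.
n, c, k = 4, 2, 2
rho = 1 - comb(n - c, k) / comb(n, k)  # estimated pass@2: 1 - 1/6 = 5/6
sigma = np.sqrt(rho * (1 - rho))
adv_p = (1 - rho) / (sigma + 1e-6)  # advantage for the correct samples
adv_n = (1 - rho - comb(n - c - 1, k - 1) / comb(n - 1, k - 1)) / (sigma + 1e-6)

# adv_p > 0 > adv_n, and c * adv_p + (n - c) * adv_n = 0: the group-weighted
# mean advantage vanishes.
print(adv_p, adv_n, c * adv_p + (n - c) * adv_n)
```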