修改GRPO Advantages的一些思路(pass@k)

Optimizing Language Models for Inference Time Objectives using Reinforcement Learning

来自https://arxiv.org/pdf/2503.19595,在VeRL直接写了下面的code,这个code本质是最高分和次高分之间有gap就有Advantage(Advantage不为0),否则就没有advantages(Advantage为0)。文章对这个最高分和次高分怎么来的做了一些解释,首先定义了leave-one-out advantages estimate:

文章接下来讨论了Average reward、pass@k、Majority voting这几种情况,对于pass@k来说,排个序就是最高减去次高:

作者在实验中强调" With language model training on reasoning datasets, we showcase the performance trade-off enabled by training with such objectives. When training on code generation tasks, we show that the approach significantly improves pass@k objectives compared to the baseline method." 论文Review细节参考 https://openreview.net/forum?id=ZVWJO5YTz4

python 复制代码
@register_adv_est(AdvantageEstimator.GRPO_PASSK)  # or simply: @register_adv_est("grpo_passk")
def compute_grpo_passk_outcome_advantage(
    token_level_rewards: torch.Tensor,
    response_mask: torch.Tensor,
    index: np.ndarray,
    epsilon: float = 1e-6,
    norm_adv_by_std_in_grpo: bool = True,
    config: Optional[AlgoConfig] = None,
    **kwargs,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Compute advantage for Pass@k using a GRPO-style outcome reward formulation.
    Only the best response per group gets a non-zero advantage: r_max - r_second_max.

    Implemented as described in https://arxiv.org/abs/2503.19595.

    Args:
        token_level_rewards: (bs, response_length)
        response_mask: (bs, response_length)
        index: (bs,) → group ID per sample
        epsilon: float for numerical stability
        config: (AlgoConfig) algorithm settings, which contains "norm_adv_by_std_in_grpo"

    Returns:
        advantages: (bs, response_length)
        returns: (bs, response_length)
    """
    assert config is not None
    # if True, normalize advantage by std within group
    norm_adv_by_std_in_grpo = config.get("norm_adv_by_std_in_grpo", True)
    scores = token_level_rewards.sum(dim=-1)  # (bs,)
    advantages = torch.zeros_like(scores)

    id2scores = defaultdict(list)
    id2indices = defaultdict(list)

    with torch.no_grad():
        bsz = scores.shape[0]
        for i in range(bsz):
            idx = index[i]
            id2scores[idx].append(scores[i])
            id2indices[idx].append(i)

        for idx in id2scores:
            rewards = torch.stack(id2scores[idx])  # (k,)
            if rewards.numel() < 2:
                raise ValueError(
                    f"Pass@k requires at least 2 samples per group. Got {rewards.numel()} for group {idx}."
                )
            topk, topk_idx = torch.topk(rewards, 2)
            r_max, r_second_max = topk[0], topk[1]
            i_max = id2indices[idx][topk_idx[0].item()]
            advantage = r_max - r_second_max
            if norm_adv_by_std_in_grpo:
                std = torch.std(rewards)
                advantage = advantage / (std + epsilon)
            advantages[i_max] = advantage

    advantages = advantages.unsqueeze(-1) * response_mask
    return advantages, advantages

Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models

来自https://openreview.net/forum?id=eslxxopXTF,已被拒了,主要是novelty不够,这篇思路比较简单,代码在https://github.com/RUCAIBox/Passk_Training/blob/main/code/passk_adv.py,作者通过实验结果表明,Pass@k训练不仅提高了Pass@k性能,还保持了Pass@1得分,并在不同领域和任务上表现出更强的泛化能力

python 复制代码
from collections import defaultdict

import numpy as np
import torch
import random

from scipy.special import comb


def calc_adv(val, k):
    c = len(np.where(val==1)[0])
    n = len(val)
    rho = 1 - comb(n-c, k) / comb(n, k)
    sigma = np.sqrt(rho * (1 - rho))
    adv_p = (1 - rho) / (sigma + 1e-6)
    adv_n = (1 - rho - comb(n-c-1, k-1)/comb(n-1,k-1)) / (sigma + 1e-6)
    new_val = np.where(val==1, adv_p, val)
    new_val = np.where(new_val==0, adv_n, new_val)
    return new_val

def compute_advantage(token_level_rewards, response_mask, index, K):
    scores = token_level_rewards.sum(dim=-1)
    
    id2score = defaultdict(list)
    uid2sid = defaultdict(list)
    id2mean = {}
    id2std = {}

    with torch.no_grad():
        bsz = scores.shape[0]
        for i in range(bsz):
            id2score[index[i]].append(scores[i].detach().item())
            uid2sid[index[i]].append(i)
        for uid in id2score.keys():
            reward = np.array(id2score[uid])
            adv = calc_adv(reward, K)
            print(uid2sid[uid])
            for i in range(len(uid2sid[uid])):
                scores[uid2sid[uid][i]] = adv[i]

    scores = scores.unsqueeze(-1) * response_mask
    
    return scores, scores
相关推荐
星越华夏7 小时前
计算机视觉:YOLOv12安装环境
人工智能·yolo·计算机视觉
Yolanda948 小时前
【人工智能】《从零搭建AI问答助手项目(九):Prompt优化》
人工智能·prompt
wj3055853788 小时前
课程 9:模型测试记录与 Prompt 策略
linux·人工智能·python·comfyui
小和尚同志8 小时前
深入使用 skill-creator:结合真实生产级实践
人工智能·aigc
DevSecOps选型指南8 小时前
安全419专访悬镜安全 | 穿越周期在 AI 浪潮中定义数字供应链安全新范式
人工智能
沪漂阿龙8 小时前
面试题详解:GraphRAG 全面解析——知识图谱增强 RAG、Local Search、Global Search、社区摘要、工程落地与评估指标一次讲透
人工智能·知识图谱
WangN28 小时前
Unitree RL Lab 学习笔记【通识】
人工智能·机器学习
haina20198 小时前
海纳AI亮相《科创中国》,解码招聘“智”变之路
人工智能·ai面试·ai招聘
阿星AI工作室9 小时前
刘润年中大课笔记:一句话说清AI落地之战的本质
大数据·人工智能·创业创新·商业
qingfeng154159 小时前
企业微信机器人开发:如何实现自动化与智能运营?
人工智能·python·机器人·自动化·企业微信