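How to Estimate pass@k with Bootstrap Simulation

The pass@k metric is mainly used to evaluate the correctness of model-generated functions. Because the number of test problems available in practice is usually limited, computing it directly can give a high-variance result.

The bootstrap method estimates a statistic by repeatedly drawing random samples, with replacement, from the existing data to simulate new sample sets.

This post simulates pass@k with the bootstrap and compares the bootstrap simulation with a Monte Carlo (MC) simulation. The code examples are adapted from online resources.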

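1 pass@k problem statement

1.1 What pass@k is

First, let us look at what pass@k is and what problems arise when simulating its estimation.

pass@k is mainly used to evaluate whether model-generated functions are correct. It is computed as follows.

For each test problem, the model generates k candidate solutions.

Each candidate is tested to determine whether it passes all test cases (that is, whether it is correct).

If at least one of the k candidates passes all test cases, the problem counts as passed.

Finally, the fraction of test problems that passed is computed.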
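The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `empirical_pass_at_k` and the outcome table `results` are hypothetical, and the pass/fail values are made up for the example.

```python
from typing import List


def empirical_pass_at_k(results: List[List[bool]]) -> float:
    """Fraction of problems for which at least one candidate passes.

    results[i][j] is True if candidate j for problem i passed all test cases.
    """
    if not results:
        return 0.0
    solved = sum(1 for candidates in results if any(candidates))
    return solved / len(results)


# Hypothetical outcomes: 4 problems, k = 3 candidates each
results = [
    [False, True, False],   # passed (one candidate is correct)
    [False, False, False],  # failed
    [True, True, False],    # passed
    [False, False, False],  # failed
]
print(empirical_pass_at_k(results))  # 0.5
```

With only four problems, flipping a single problem's outcome moves the score by 0.25, which is the high-variance issue this post is concerned with.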

However, because the number of test problems is usually limited in practice, the directly computed value can have high variance.

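1.2 What the bootstrap is

In practice we often have only a single data sample and cannot know the true state of the whole population.

The bootstrap repeatedly draws from this sample (with replacement) to produce a large number of virtual "resampled" data sets, uses them to simulate what could have been observed, and estimates the statistic of interest from them.

The bootstrap does not depend heavily on the shape of the data distribution; even for heavily skewed or irregular data it can give a reasonably sound estimate.

For example, suppose the exam scores of only 50 students at a school are available as a sample, and we want to estimate the school-wide average score and how reliable that estimate is. The traditional approach would simply use the mean of these 50 scores as the estimate for the whole school. With the bootstrap, we instead draw 50 scores at random from the 50 available ones (the same student's score may be drawn more than once) and compute their mean. Repeating this many times (say 1000 times) yields 1000 means.

Each round forms a new "virtual class". Analysing the distribution of these virtual-class means gives not only an estimate of the school-wide average but also a measure of how reliable it is, for example a 95% confidence interval.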

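1.3 Example: estimating monthly income

Suppose a survey of residents' monthly income in some region collected 100 data points.

We want to estimate the mean monthly income in the region, together with a 95% confidence interval for that mean.

The calculation proceeds as follows:

Step 1: collect the 100 monthly-income data points.

Step 2: bootstrap resampling. Draw 100 values with replacement from the 100 data points to form a new sample set. Repeat this 1000 times to produce 1000 such sample sets.

Step 3: compute the statistic. Compute the mean of each new sample set.

Step 4: estimate the confidence interval. From the distribution of the 1000 means, take the 2.5th and 97.5th percentiles as the bounds of the 95% confidence interval for the mean.

Example code: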

import numpy as np
import matplotlib.pyplot as plt

# Assume the original data are these 100 monthly incomes
data = np.random.normal(5000, 1200, 100)

# Run 1000 bootstrap resamples
bootstrap_means = []
for _ in range(1000):
    sample = np.random.choice(data, size=100, replace=True)
    bootstrap_means.append(np.mean(sample))

# Compute the 95% confidence interval
lower = np.percentile(bootstrap_means, 2.5)
upper = np.percentile(bootstrap_means, 97.5)

print(f"estimated_per_month_income: {np.mean(bootstrap_means):.2f} RMB")
print(f"95% confidence interval: ({lower:.2f}, {upper:.2f}) RMB")

# Plot the result
plt.hist(bootstrap_means, bins=30, alpha=0.7, color='blue')
plt.axvline(x=lower, color='red', linestyle='--', label='2.5%')
plt.axvline(x=upper, color='green', linestyle='--', label='97.5%')
plt.title('Bootstrap avg_income_per_month estimate')
plt.xlabel('avg_income_per_month')
plt.ylabel('freq')
plt.legend()
plt.show()

Sample output (the exact numbers vary from run to run because no random seed is set):

estimated_per_month_income: 5038.48 RMB

95% confidence interval: (4803.00, 5293.71) RMB

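2 Bootstrap simulation

2.1 Simulating pass@k with a binomial distribution

Here the process of using pass@k to evaluate an LLM's code generation ability is abstracted as sampling from a binomial distribution.

Generating n code samples for a problem is then analogous to drawing n samples from a binomial distribution with a single trial (a Bernoulli distribution), each taking the value True or False.

A code sample that passes the tests corresponds to a drawn sample whose value is True.

The whole simulation can therefore be written in pseudocode as follows.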

samples = np.random.binomial(1, true_p, n_samples).astype(bool)

pass_at_k = estimator_func(samples.tolist(), k)

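pass_at_k is the pass@k value estimated from the n drawn samples.

estimator_func is the estimation method, for example the bootstrap simulation or the Monte Carlo (MC) sampling simulation introduced below.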
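The pseudocode can be made runnable as in the sketch below. It assumes a seeded `np.random.default_rng` generator for reproducibility, and it uses a simple stand-in, `plug_in_estimator` (a hypothetical name), in place of `estimator_func`; the helper `simulate_once` is also introduced only for this illustration.

```python
import numpy as np


def plug_in_estimator(samples, k):
    """Illustrative estimator_func: 1 - (1 - c/n)^k, with c passes out of n."""
    n = len(samples)
    if n == 0:
        return 0.0
    p_hat = sum(samples) / n
    return 1.0 - (1.0 - p_hat) ** k


def simulate_once(true_p, n_samples, k, estimator_func, rng):
    """Draw n_samples pass/fail outcomes and estimate pass@k from them."""
    samples = rng.binomial(1, true_p, n_samples).astype(bool)
    return estimator_func(samples.tolist(), k)


rng = np.random.default_rng(0)
estimate = simulate_once(0.3, 50, 5, plug_in_estimator, rng)
true_value = 1 - (1 - 0.3) ** 5  # true pass@k under the binomial model
print(f"estimate={estimate:.3f}, true pass@5={true_value:.3f}")
```

Any function with the signature `(samples, k) -> float` can be passed as `estimator_func`, which is how the estimators later in this post are swapped in.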

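2.2 Bootstrap simulation example

When the bootstrap picks samples, it samples with replacement (replace=True).

bootstrap_sample = np.random.choice(samples, size=min(k, n), replace=True)

Each resample records whether at least one of the drawn values is True (np.any). After many resamples, np.mean(bootstrap_estimates) gives the pass@k estimate.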
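It may help to see what this procedure converges to. When k <= n, each of the k draws with replacement is True with probability c/n (c passes among n samples), so the chance that a resample contains at least one True is 1 - (1 - c/n)^k. The sketch below checks this numerically; the sample data, the seed, and the vectorised resampling are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = np.array([True] * 3 + [False] * 7)  # n = 10 samples, c = 3 passes
k = 5
n_bootstrap = 20000

# Draw all resamples at once: each row holds k values drawn with replacement
resamples = rng.choice(samples, size=(n_bootstrap, k), replace=True)
bootstrap_estimate = resamples.any(axis=1).mean()

# Closed form that the bootstrap average approaches as n_bootstrap grows
closed_form = 1 - (1 - samples.mean()) ** k
print(f"bootstrap={bootstrap_estimate:.4f}, closed form={closed_form:.4f}")
```

The resampling loop therefore only adds simulation noise on top of a value that could be computed directly from c and n.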

The bootstrap simulation code is shown below.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from typing import List, Callable
import seaborn as sns
from scipy.special import comb


def true_pass_at_k(p: float, k: int) -> float:
    """True pass@k value: 1 - (1-p)^k"""
    return 1 - (1 - p) ** k

def bootstrap_estimator(samples: List[bool], k: int, n_bootstrap: int = 1000) -> float:
    """Bootstrap estimator for pass@k"""
    n = len(samples)
    if n == 0:
        return 0.0
    
    bootstrap_estimates = []
    for _ in range(n_bootstrap):
        # Sample with replacement
        bootstrap_sample = np.random.choice(samples, size=min(k, n), replace=True)
        bootstrap_estimates.append(np.any(bootstrap_sample))
    
    return np.mean(bootstrap_estimates)

# Set style for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

true_p = 0.3
max_samples=100
n_trials=200
k_values = [1, 3, 5, 10, 20]

# Set up the plot
fig, ax = plt.subplots(1, 1, figsize=(15, 12))
# Colors for different k values
colors = plt.cm.viridis(np.linspace(0, 1, len(k_values)))

for k_idx, k in enumerate(k_values):
    # True value
    true_value = true_pass_at_k(true_p, k)
            
    # Store estimates for different sample sizes
    sample_sizes = []
    estimates_mean = []
    estimates_std = []
    biases = []
    
    for n_samples in range(1, max_samples + 1, 5):
        trial_estimates = []
        for trial in range(n_trials):
            # Generate samples
            samples = np.random.binomial(1, true_p, n_samples).astype(bool)
            estimate = bootstrap_estimator(samples.tolist(), k, 500)
            trial_estimates.append(estimate)
                
        sample_sizes.append(n_samples)
        estimates_mean.append(np.mean(trial_estimates))
        estimates_std.append(np.std(trial_estimates))
        biases.append(np.mean(trial_estimates) - true_value)
            
    # Plot mean estimates
    ax.plot(sample_sizes, estimates_mean, 
            color=colors[k_idx], label=f'k={k}', linewidth=2)
            
    # Add shaded region for ±1 std
    ax.fill_between(sample_sizes, 
                    np.array(estimates_mean) - np.array(estimates_std),
                    np.array(estimates_mean) + np.array(estimates_std),
                    alpha=0.2, color=colors[k_idx])
            
    # Plot true value as horizontal line
    ax.axhline(y=true_value, color=colors[k_idx], linestyle='--', alpha=0.7)
        
ax.set_xlabel('Number of Samples (n)')
ax.set_ylabel('pass@k Estimate')
ax.set_title(f'bootstrap Estimator\n'
             f'True p={true_p}')
ax.legend()
ax.grid(True, alpha=0.3)
    
plt.tight_layout()
plt.show()

As the resulting plot shows, the variance is still fairly large when the number of samples n is small.

3 Comparing the estimators

This section compares the bootstrap estimate with other simulation-based estimation methods.

3.1 Bootstrap

The bootstrap estimator is shown below.

def bootstrap_estimator(samples: List[bool], k: int, n_bootstrap: int = 1000) -> float:
    """Bootstrap estimator for pass@k"""
    n = len(samples)
    if n == 0:
        return 0.0
    
    bootstrap_estimates = []
    for _ in range(n_bootstrap):
        # Sample with replacement
        bootstrap_sample = np.random.choice(samples, size=min(k, n), replace=True)
        bootstrap_estimates.append(np.any(bootstrap_sample))
    
    return np.mean(bootstrap_estimates)

3.2 Monte Carlo

The Monte Carlo estimator is shown below.

def monte_carlo_estimator(samples: List[bool], k: int, n_simulations: int = 1000) -> float:
    """Monte Carlo estimator for pass@k"""
    n = len(samples)
    if n == 0:
        return 0.0
    
    successes = 0
    for _ in range(n_simulations):
        # Randomly select k samples without replacement
        selected_indices = np.random.choice(n, size=min(k, n), replace=False)
        selected_samples = [samples[i] for i in selected_indices]
        if np.any(selected_samples):
            successes += 1
    
    return successes / n_simulations
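
Because this estimator draws k of the n samples without replacement, its average approaches the exact probability 1 - C(n-c, k) / C(n, k), where c is the number of passing samples. The sketch below checks this under assumed sample data and a fixed seed; the variable names are chosen only for this illustration.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
samples = np.array([True] * 3 + [False] * 7)  # n = 10 samples, c = 3 passes
n, c, k = len(samples), int(samples.sum()), 5
n_simulations = 10000

hits = 0
for _ in range(n_simulations):
    # Pick k distinct indices, i.e. sample without replacement
    idx = rng.choice(n, size=k, replace=False)
    hits += bool(samples[idx].any())
mc_estimate = hits / n_simulations

# Probability that a random k-subset contains at least one passing sample
exact = 1 - comb(n - c, k) / comb(n, k)
print(f"monte carlo={mc_estimate:.4f}, exact={exact:.4f}")
```

The exact value here is the same combinatorial expression used by the unbiased estimator, so the Monte Carlo loop can be read as a noisy approximation of it.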

3.3 Direct simulation

The direct estimator is shown below.

def direct_estimator(samples: List[bool], k: int) -> float:
    """Direct estimator: 1.0 if any of the first k samples passes, else 0.0"""
    n = len(samples)
    if n == 0:
        return 0.0
    
    # Take first k samples (or all if n < k)
    selected_samples = samples[:min(k, n)]
    return float(np.any(selected_samples))
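
For a single problem this estimator can only return 0.0 or 1.0, so under the binomial model with pass probability p it is a Bernoulli variable with success probability q = 1 - (1 - p)^min(k, n). Its variance q(1 - q) stops changing once n >= k, because the extra samples are never used. The helper below, whose name is hypothetical, evaluates that variance exactly.

```python
def direct_estimator_variance(p: float, k: int, n: int) -> float:
    """Exact variance of the 0/1 direct estimator under the binomial model."""
    m = min(k, n)               # only the first min(k, n) samples are used
    q = 1 - (1 - p) ** m        # probability that the estimator returns 1.0
    return q * (1 - q)


for n in (5, 20, 100):
    print(n, round(direct_estimator_variance(0.3, 5, n), 4))
# Prints 0.1398 for every n, since n >= k in all three cases
```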

3.4 Unbiased simulation

The unbiased estimator, the method OpenAI used for HumanEval, is shown below.

def unbiased_estimator(samples: List[bool], k: int) -> float:
    """Unbiased estimator using combinatorial approach"""
    n = len(samples)
    if n < k:
        return float(np.any(samples))  # Fallback if not enough samples
    
    # Count number of successes
    c = np.sum(samples)
    
    if c == 0:
        return 0.0
    
    # Unbiased estimator: 1 - (n-c choose k) / (n choose k)
    n_choose_k = comb(n, k)
    if n - c < k:
        return 1.0
    
    n_minus_c_choose_k = comb(n - c, k)
    return 1 - n_minus_c_choose_k / n_choose_k
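
For large n the binomial coefficients become very large. The paper "Evaluating Large Language Models Trained on Code" in the reference list gives an equivalent product form that avoids them. The sketch below follows that form; the function name `pass_at_k_stable` is an illustrative choice, and the function assumes k <= n.

```python
import numpy as np
from math import comb


def pass_at_k_stable(n: int, c: int, k: int) -> float:
    """1 - C(n-c, k) / C(n, k), computed as a running product.

    n: total samples, c: passing samples, k: the k in pass@k (assumes k <= n).
    """
    if n - c < k:
        return 1.0  # every k-subset must contain a passing sample
    # C(n-c, k) / C(n, k) equals the product of (1 - k/i) for i = n-c+1 .. n
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))


n, c, k = 20, 3, 5
stable = pass_at_k_stable(n, c, k)
direct = 1 - comb(n - c, k) / comb(n, k)
print(f"product form={stable:.6f}, combinatorial form={direct:.6f}")
```

When c is 0 the product is empty and evaluates to 1, so the function returns 0.0 without a special case.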

3.5 Overall comparison

The following example compares the simulation schemes above.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from typing import List, Callable
import seaborn as sns
from scipy.special import comb
 
# Set style for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
 
def true_pass_at_k(p: float, k: int) -> float:
    """True pass@k value: 1 - (1-p)^k"""
    return 1 - (1 - p) ** k
 
def bootstrap_estimator(samples: List[bool], k: int, n_bootstrap: int = 1000) -> float:
    """Bootstrap estimator for pass@k"""
    n = len(samples)
    if n == 0:
        return 0.0
    
    bootstrap_estimates = []
    for _ in range(n_bootstrap):
        # Sample with replacement
        bootstrap_sample = np.random.choice(samples, size=min(k, n), replace=True)
        bootstrap_estimates.append(np.any(bootstrap_sample))
    
    return np.mean(bootstrap_estimates)
 
def direct_estimator(samples: List[bool], k: int) -> float:
    """Direct estimator: 1.0 if any of the first k samples passes, else 0.0"""
    n = len(samples)
    if n == 0:
        return 0.0
    
    # Take first k samples (or all if n < k)
    selected_samples = samples[:min(k, n)]
    return float(np.any(selected_samples))
 
def unbiased_estimator(samples: List[bool], k: int) -> float:
    """Unbiased estimator using combinatorial approach"""
    n = len(samples)
    if n < k:
        return float(np.any(samples))  # Fallback if not enough samples
    
    # Count number of successes
    c = np.sum(samples)
    
    if c == 0:
        return 0.0
    
    # Unbiased estimator: 1 - (n-c choose k) / (n choose k)
    n_choose_k = comb(n, k)
    if n - c < k:
        return 1.0
    
    n_minus_c_choose_k = comb(n - c, k)
    return 1 - n_minus_c_choose_k / n_choose_k
 
def monte_carlo_estimator(samples: List[bool], k: int, n_simulations: int = 1000) -> float:
    """Monte Carlo estimator for pass@k"""
    n = len(samples)
    if n == 0:
        return 0.0
    
    successes = 0
    for _ in range(n_simulations):
        # Randomly select k samples without replacement
        selected_indices = np.random.choice(n, size=min(k, n), replace=False)
        selected_samples = [samples[i] for i in selected_indices]
        if np.any(selected_samples):
            successes += 1
    
    return successes / n_simulations
 
def analyze_estimators(true_p: float = 0.3, max_samples: int = 100, k_values: List[int] = None, n_trials: int = 100):
    """Analyze bias and variance of different pass@k estimators"""
    if k_values is None:
        k_values = [1, 3, 5, 10, 20]
    
    estimators = {
        'Direct': direct_estimator,
        'Bootstrap': lambda samples, k: bootstrap_estimator(samples, k, 500),
        'Unbiased': unbiased_estimator,
        'Monte Carlo': lambda samples, k: monte_carlo_estimator(samples, k, 500)
    }
    
    # Set up the plot
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    axes = axes.flatten()
    
    # Colors for different k values
    colors = plt.cm.viridis(np.linspace(0, 1, len(k_values)))
    
    for idx, (estimator_name, estimator_func) in enumerate(estimators.items()):
        ax = axes[idx]
        
        for k_idx, k in enumerate(k_values):
            # True value
            true_value = true_pass_at_k(true_p, k)
            
            # Store estimates for different sample sizes
            sample_sizes = []
            estimates_mean = []
            estimates_std = []
            biases = []
            
            for n_samples in range(1, max_samples + 1, 5):
                if n_samples < k and estimator_name == 'Unbiased':
                    continue  # Unbiased estimator requires n >= k
                
                trial_estimates = []
                for trial in range(n_trials):
                    # Generate samples
                    samples = np.random.binomial(1, true_p, n_samples).astype(bool)
                    estimate = estimator_func(samples.tolist(), k)
                    trial_estimates.append(estimate)
                
                sample_sizes.append(n_samples)
                estimates_mean.append(np.mean(trial_estimates))
                estimates_std.append(np.std(trial_estimates))
                biases.append(np.mean(trial_estimates) - true_value)
            
            # Plot mean estimates
            ax.plot(sample_sizes, estimates_mean, 
                   color=colors[k_idx], label=f'k={k}', linewidth=2)
            
            # Add shaded region for ±1 std
            ax.fill_between(sample_sizes, 
                          np.array(estimates_mean) - np.array(estimates_std),
                          np.array(estimates_mean) + np.array(estimates_std),
                          alpha=0.2, color=colors[k_idx])
            
            # Plot true value as horizontal line
            ax.axhline(y=true_value, color=colors[k_idx], linestyle='--', alpha=0.7)
        
        ax.set_xlabel('Number of Samples (n)')
        ax.set_ylabel('pass@k Estimate')
        ax.set_title(f'{estimator_name} Estimator\n'
                    f'True p={true_p}')
        ax.legend()
        ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Plot bias and variance comparison
    plot_bias_variance_comparison(estimators, true_p, max_samples, k_values, n_trials)
 
def plot_bias_variance_comparison(estimators, true_p, max_samples, k_values, n_trials):
    """Plot bias and variance comparison across estimators"""
    fig, axes = plt.subplots(2, 5, figsize=(15, 12))
    
    for plot_idx, k in enumerate(k_values[:5]):  # Plot first 5 k values
        print(f"plot_idx: {plot_idx}, k: {k}")
        print()
        ax_bias = axes[0, plot_idx]
        ax_variance = axes[1, plot_idx]
        
        true_value = true_pass_at_k(true_p, k)
        
        for estimator_name, estimator_func in estimators.items():
            biases = []
            variances = []
            sample_sizes = []
            
            for n_samples in range(5, max_samples + 1, 5):
                if n_samples < k and estimator_name == 'Unbiased':
                    continue
                
                trial_estimates = []
                for trial in range(n_trials):
                    samples = np.random.binomial(1, true_p, n_samples).astype(bool)
                    estimate = estimator_func(samples.tolist(), k)
                    trial_estimates.append(estimate)
                
                sample_sizes.append(n_samples)
                bias = np.mean(trial_estimates) - true_value
                variance = np.var(trial_estimates)
                
                biases.append(bias)
                variances.append(variance)
            
            ax_bias.plot(sample_sizes, biases, label=estimator_name, linewidth=2)
            ax_variance.plot(sample_sizes, variances, label=estimator_name, linewidth=2)
        
        ax_bias.set_title(f'Bias for k={k}\nTrue pass@{k}: {true_value:.3f}')
        ax_bias.set_xlabel('Number of Samples')
        ax_bias.set_ylabel('Bias')
        ax_bias.legend()
        ax_bias.grid(True, alpha=0.3)
        ax_bias.axhline(y=0, color='black', linestyle='-', alpha=0.5)
        
        ax_variance.set_title(f'Variance for k={k}')
        ax_variance.set_xlabel('Number of Samples')
        ax_variance.set_ylabel('Variance')
        ax_variance.legend()
        ax_variance.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
 
def plot_pass_at_k_curves(p_values: List[float] = None, k_max: int = 50):
    """Plot theoretical pass@k curves for different p values"""
    if p_values is None:
        p_values = [0.1, 0.3, 0.5, 0.7, 0.9]
    
    plt.figure(figsize=(12, 8))
    
    k_range = np.arange(1, k_max + 1)
    
    for p in p_values:
        pass_values = [true_pass_at_k(p, k) for k in k_range]
        plt.plot(k_range, pass_values, label=f'p={p}', linewidth=3)
    
    plt.xlabel('k (number of samples)')
    plt.ylabel('pass@k')
    plt.title('Theoretical pass@k Curves for Different Success Probabilities')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.ylim(0, 1)
    plt.show()
 
# Run the analysis
if __name__ == "__main__":
    print("Plotting theoretical pass@k curves...")
    # plot_pass_at_k_curves()
    
    print("\nAnalyzing estimator performance...")
    analyze_estimators(true_p=0.3, max_samples=100, n_trials=200)
    
    # print("\nAnalyzing estimator performance with different p...")
    # analyze_estimators(true_p=0.1, max_samples=100, n_trials=200)

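The output is shown below. The direct estimator has a large variance and shows no clear convergence.

The bootstrap and MC estimates have large variance when n is small, but the variance keeps shrinking once the number of generated code samples n becomes large enough.

The unbiased estimator shows the clearest variance convergence, and its estimates are also close to the true value when n is small.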
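The bias can also be checked exactly rather than by simulation. Under the binomial model, the number of passing samples c follows Binomial(n, p), so the expected value of an estimator is a finite sum over c. The sketch below compares the combinatorial estimator with the value 1 - (1 - c/n)^k that sampling with replacement targets; all function names are chosen for this illustration, and p, n, k are arbitrary example values.

```python
from math import comb


def binom_pmf(c: int, n: int, p: float) -> float:
    """Probability of exactly c passes among n independent samples."""
    return comb(n, c) * p**c * (1 - p) ** (n - c)


def expected_value(estimate, n: int, p: float) -> float:
    """Exact expectation of estimate(n, c) with c ~ Binomial(n, p)."""
    return sum(binom_pmf(c, n, p) * estimate(n, c) for c in range(n + 1))


def unbiased(n: int, c: int, k: int) -> float:
    # math.comb returns 0 when k > n - c, so this yields 1.0 in that case
    return 1 - comb(n - c, k) / comb(n, k)


def with_replacement(n: int, c: int, k: int) -> float:
    return 1 - (1 - c / n) ** k


p, n, k = 0.3, 10, 5
truth = 1 - (1 - p) ** k
e_unbiased = expected_value(lambda n_, c_: unbiased(n_, c_, k), n, p)
e_replace = expected_value(lambda n_, c_: with_replacement(n_, c_, k), n, p)
print(f"truth={truth:.4f}, unbiased={e_unbiased:.4f}, with replacement={e_replace:.4f}")
```

The expectation of the combinatorial estimator matches the true value up to floating-point error, while the with-replacement value sits below it, because (1 - x)^k is convex in x.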

Analyzing estimator performance...

plot_idx: 0, k: 1

plot_idx: 1, k: 3

plot_idx: 2, k: 5

plot_idx: 3, k: 10

plot_idx: 4, k: 20

References

Exploring pass@k, an evaluation metric for code generation models (basic edition)

https://blog.csdn.net/liliang199/article/details/155393240

SPoC: Search-based Pseudocode to Code

https://proceedings.neurips.cc/paper/2019/file/7298332f04ac004a0ca44cc69ecf6f6b-Paper.pdf

Evaluating Large Language Models Trained on Code

https://arxiv.org/pdf/2107.03374

What is the bootstrap method, and why do so many people use it?

https://zhuanlan.zhihu.com/p/690904510