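【LLM】On-Policy Distillation Survey

Link: https://arxiv.org/pdf/2604.00626

f-divergence minimization: OPD methods reorganize training so that optimization is carried out over trajectories sampled by the student, with the goal of reducing compounding error so that it grows only linearly. The objective is:

$$
\mathcal{L}_{OPD}(\theta) = E_{y \sim \pi_{\text{mix}}} \left[ \sum_{t=1}^{|y|} D_f\big( p_T(\cdot \mid x, y_{<t}),\; p_\theta(\cdot \mid x, y_{<t}) \big) \right]
$$

Here $p_T$ and $p_\theta$ are the teacher and student next-token distributions, $D_f$ is an f-divergence, and $y$ is drawn from the sampling policy $\pi_{\text{mix}}$.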
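To make the per-token objective above concrete, here is a minimal pure-Python sketch, not the survey's implementation. It assumes that teacher and student logits are already available at every position of one sampled sequence y, over a toy vocabulary. All function names (`softmax`, `forward_kl`, `reverse_kl`, `jsd`, `opd_loss`) are hypothetical, and the three divergences are only common examples of choices for D_f.

```python
import math


def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(p, q):
    """KL(p || q). Assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def reverse_kl(p, q):
    """KL(q || p): the same divergence with the arguments swapped."""
    return forward_kl(q, p)


def jsd(p, q):
    """Jensen-Shannon divergence, symmetric in p and q."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * forward_kl(p, m) + 0.5 * forward_kl(q, m)


def opd_loss(teacher_logits, student_logits, divergence=reverse_kl):
    """Sum over positions t of D_f(p_T(.|x, y_<t), p_theta(.|x, y_<t)).

    teacher_logits and student_logits are lists with one logit vector per
    position of a single sampled sequence y (assumed to come from pi_mix).
    """
    return sum(
        divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    )


# Toy sequence of length 2 over a vocabulary of size 3.
teacher = [[2.0, 0.5, -1.0], [0.0, 1.0, 0.0]]
student = [[1.0, 1.0, -1.0], [0.2, 0.7, 0.1]]
for name, div in [("forward KL", forward_kl), ("reverse KL", reverse_kl), ("JSD", jsd)]:
    print(name, opd_loss(teacher, student, div))
```

The outer expectation in the formula would then be approximated by averaging `opd_loss` over several sequences drawn from the sampling policy. In real training the logits would come from the two models and the loss would be differentiated with respect to the student parameters, which this sketch leaves out.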