【RL】ORPO: Monolithic Preference Optimization without Reference Model

$$
\mathcal{L}_{\text{ORPO}} = \mathbb{E}_{(x, y_w, y_l)} \left[ \mathcal{L}_{\text{SFT}} + \lambda \cdot \mathcal{L}_{\text{OR}} \right]
$$
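To make the objective above concrete, here is a minimal pure-Python sketch for a single preference pair and for a small batch. The text here does not define the two terms, so the sketch assumes the definitions from the ORPO paper: L_SFT is the length-averaged negative log-likelihood of the chosen response y_w, and L_OR is -log σ(log odds(y_w|x) - log odds(y_l|x)), with odds(y|x) = P(y|x) / (1 - P(y|x)). The function names and the value `lam=0.1` are illustrative choices, not taken from the source.

```python
import math


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) for p = exp(avg_logp); assumes avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float, lam: float = 0.1) -> float:
    """L_SFT + lam * L_OR for one (x, y_w, y_l) triple.

    Inputs are length-averaged log-probabilities of the chosen (y_w) and
    rejected (y_l) responses under the model being trained. No reference
    model is involved. lam=0.1 is an arbitrary illustrative default.
    """
    l_sft = -avg_logp_chosen  # assumed: NLL of the chosen response
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    l_or = softplus(-log_odds_ratio)  # equals -log(sigmoid(log_odds_ratio))
    return l_sft + lam * l_or


def orpo_batch_loss(pairs, lam: float = 0.1) -> float:
    """Approximate the expectation over (x, y_w, y_l) by a batch mean."""
    return sum(orpo_loss(c, r, lam) for c, r in pairs) / len(pairs)
```

With `lam=0` the loss reduces to plain SFT on the chosen response. The odds-ratio term shrinks as the model assigns a higher probability to y_w relative to y_l, which is how the single objective combines imitation and preference alignment.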