RIME: use the size of the cross-entropy loss to tell whether a preference label is correct + pretrain the reward model with an intrinsic reward.

Preference-based Reinforcement Learning (PbRL) circumvents the need for reward engineering by harnessing human preferences as the reward signal. However, current PbRL methods excessively depend on high-quality feedback from domain experts, which results in a lack of robustness.
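To make the two ideas in the note title concrete, here is a minimal Python/PyTorch sketch, not RIME's exact procedure: a sample-selection step that trusts preference pairs whose cross-entropy loss under the current reward model is small, and a warm-start step that regresses the reward model onto a precomputed intrinsic reward before preference learning begins. The names `reward_model`, `tau`, the tensor shapes, and the training loop are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def filter_preferences(reward_model, seg0, seg1, labels, tau):
    """Keep only preference pairs whose cross-entropy loss falls below tau.

    Intuition from the note: if the current reward model already agrees
    with a label (low loss), the label is more likely to be correct;
    a large loss flags a potentially corrupted preference.
    Assumed shapes: seg0/seg1 are (batch, seg_len, obs_dim) trajectory
    segments, labels is a (batch,) LongTensor of 0/1 preferences, and
    reward_model maps (..., obs_dim) observations to scalar rewards (...,).
    """
    with torch.no_grad():
        r0 = reward_model(seg0).sum(dim=1)   # segment return, shape (batch,)
        r1 = reward_model(seg1).sum(dim=1)
        # Bradley-Terry preference model: P(segment i wins) = softmax over returns
        logits = torch.stack([r0, r1], dim=1)                   # (batch, 2)
        ce = F.cross_entropy(logits, labels, reduction="none")  # per-pair loss
    keep = ce < tau   # boolean mask of "trusted" pairs
    return seg0[keep], seg1[keep], labels[keep]

def warm_start(reward_model, states, intrinsic_reward, optimizer, epochs=10):
    """Regress the reward model onto a precomputed intrinsic reward
    (e.g. a state-entropy exploration bonus) before any preferences
    arrive, so preference learning starts from a non-degenerate model."""
    for _ in range(epochs):
        pred = reward_model(states)                # (N,) predicted rewards
        loss = F.mse_loss(pred, intrinsic_reward)  # match intrinsic targets
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The design intuition behind the filter: corrupted labels disagree with the model's consensus over many clean pairs, so their per-sample loss stays high, and thresholding at `tau` discards them; the warm start keeps the early reward model from being dominated by those noisy pairs in the first place.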