offline RL · PbRL | LiRE: Constructing an A>B>C RLT list to obtain more preference data

In Reinforcement Learning (RL), designing precise reward functions remains a challenge, particularly when aligning with human intent. Preference-based RL (PbRL) was introduced to address this problem by learning reward models from human feedback. Ho