Tech stack

pbrl

MoonOut
6 months ago
offline rl·pbrl
offline RL · PbRL | LiRE: build an A>B>C ranked list of trajectories (RLT) to get more preference data
In Reinforcement Learning (RL), designing precise reward functions remains a challenge, particularly when aligning with human intent. Preference-based RL (PbRL) was introduced to address this problem by learning reward models from human feedback…
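A minimal sketch (my own illustration, not code from the post) of the idea in the title: a ranked list of trajectories such as A > B > C implies every pairwise preference it contains, so a single ranking of n items yields n*(n-1)/2 preference labels.

```python
# A sketch only: expanding a ranked list of trajectories (RLT), e.g. A > B > C,
# into the pairwise preference labels it implies.
from itertools import combinations

def rlt_to_pairs(ranked_trajectories):
    """ranked_trajectories is ordered from most to least preferred.
    Returns (preferred, dispreferred) pairs implied by the ranking."""
    return list(combinations(ranked_trajectories, 2))

# A ranking of n trajectories implies n*(n-1)/2 pairwise labels:
print(rlt_to_pairs(["A", "B", "C"]))  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```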
MoonOut
6 months ago
pbrl
PbRL | Christiano's seminal 2017 paper, plus Preference PPO / PrefPPO
PrefPPO first appeared (?) in PEBBLE, as one of PEBBLE's baselines; it reproduces the PbRL algorithm of Christiano et al. (2017) with PPO.
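For reference, a minimal sketch (my own illustration; `reward_model` and the segment tensors are assumed names) of the Bradley-Terry-style reward-model loss that Christiano et al. (2017) fit to pairwise preferences: the preference probability is a softmax over the summed predicted rewards of the two segments.

```python
# A sketch only: Bradley-Terry preference loss over two trajectory segments.
# `reward_model` maps a (T, obs_dim) segment to per-step rewards (assumed name).
import torch
import torch.nn.functional as F

def preference_loss(reward_model, seg_a, seg_b, label):
    """label = 1.0 if segment a is preferred, 0.0 if segment b is preferred."""
    ret_a = reward_model(seg_a).sum()              # predicted return of segment a
    ret_b = reward_model(seg_b).sum()              # predicted return of segment b
    logits = torch.stack([ret_a, ret_b])           # shape (2,)
    target = torch.tensor([label, 1.0 - label])    # soft preference label
    # cross-entropy between the softmax preference and the human label
    return -(target * F.log_softmax(logits, dim=0)).sum()
```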
MoonOut
10 months ago
pbrl
RIME: use the size of the cross-entropy loss to tell whether a preference label is correct + pretrain the reward model with an intrinsic reward
Preference-based Reinforcement Learning (PbRL) circumvents the need for reward engineering by harnessing human preferences as the reward signal. However, current PbRL methods excessively depend on high-quality feedback from domain experts, which results in…
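A minimal sketch (the threshold and function names are my own assumption, not RIME's actual code) of the mechanism the title describes: compute each preference sample's cross-entropy loss under the current reward model and treat samples with a large loss as likely-corrupted labels.

```python
# A sketch only: flag preference samples whose cross-entropy loss is large.
# `threshold` is a placeholder; RIME's actual denoising rule is in the post/paper.
import torch
import torch.nn.functional as F

def trusted_sample_mask(ret_a, ret_b, labels, threshold):
    """ret_a, ret_b: predicted returns of both segments, shape (N,).
    labels: 1.0 where segment a is preferred, else 0.0.
    Returns a boolean mask of samples treated as clean."""
    logits = torch.stack([ret_a, ret_b], dim=1)            # (N, 2)
    targets = torch.stack([labels, 1.0 - labels], dim=1)   # (N, 2)
    ce = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1)
    return ce <= threshold   # large loss => suspected corrupted label
```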
MoonOut
1 year ago
pbrl
PbRL | Preference Transformer: anyway, transformers feel powerful
Preference-based reinforcement learning (RL) provides a framework to train agents using human preferences between two behaviors. However, preference-based RL has been challenging to scale since it requires a large amount of human feedback to learn a reward…