Policy Optimization in RLHF: The Impact of Out-of-preference Data

Bibliographic Details
Title: Policy Optimization in RLHF: The Impact of Out-of-preference Data
Authors: Li, Ziniu; Xu, Tian; Yu, Yang
Publication Year: 2023
Collection: Computer Science
Subject Terms: Computer Science - Machine Learning
Description: Aligning intelligent agents with human preferences and values is important. This paper examines two popular alignment methods: Direct Preference Optimization (DPO) and Reward-Model-Based Policy Optimization (RMB-PO). A variant of RMB-PO, referred to as RMB-PO+, is also considered. These methods, either explicitly or implicitly, learn a reward model from preference data and differ in the data used for policy optimization to unlock the generalization ability of the reward model. In particular, compared with DPO, RMB-PO additionally uses policy-generated data, and RMB-PO+ further leverages new, preference-free data. We examine the impact of such out-of-preference data. Our study, conducted through controlled and synthetic experiments, demonstrates that DPO performs poorly, whereas RMB-PO+ performs the best. In particular, even when providing the policy model with a good feature representation, we find that policy optimization with adequate out-of-preference data significantly improves performance by harnessing the reward model's generalization capabilities.
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2312.10584
Accession Number: edsarx.2312.10584
Database: arXiv
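
The contrast described in the abstract can be illustrated with a small toy sketch (not the authors' code): DPO optimizes the policy directly on the preference tuples, while RMB-PO first fits an explicit reward model with a Bradley-Terry loss and then optimizes the policy against it, and RMB-PO+ additionally optimizes on preference-free prompts where the reward model generalizes via shared features. The toy environment, feature dimension, and hyperparameters below are illustrative assumptions, and an exact expectation over responses stands in for sampling policy-generated data.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
S, A, d = 8, 5, 6                                   # prompts (states), responses (actions), feature dim
phi = torch.randn(S, A, d)                          # shared features for each (prompt, response) pair
true_r = phi @ torch.randn(d)                       # hidden ground-truth reward, linear in the features

# Preference data: pairwise comparisons covering only the first half of the prompts.
n = 256
pref_states = torch.randint(0, S // 2, (n,))
a1, a2 = torch.randint(0, A, (2, n))
better = true_r[pref_states, a1] > true_r[pref_states, a2]
a_w = torch.where(better, a1, a2)                   # preferred response
a_l = torch.where(better, a2, a1)                   # rejected response

def new_logits():
    return torch.zeros(S, A, requires_grad=True)    # tabular policy parameters

# --- DPO: optimize the policy directly on the preference tuples -------------
beta = 1.0
ref_logp = torch.log_softmax(torch.zeros(S, A), dim=-1)   # uniform reference policy
dpo_logits = new_logits()
opt = torch.optim.Adam([dpo_logits], lr=0.1)
for _ in range(300):
    logp = torch.log_softmax(dpo_logits, dim=-1)
    margin = (logp[pref_states, a_w] - ref_logp[pref_states, a_w]) \
           - (logp[pref_states, a_l] - ref_logp[pref_states, a_l])
    loss = -F.logsigmoid(beta * margin).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# --- RMB-PO / RMB-PO+: fit an explicit reward model (Bradley-Terry loss) ----
w_hat = torch.zeros(d, requires_grad=True)
opt = torch.optim.Adam([w_hat], lr=0.1)
for _ in range(300):
    scores = phi @ w_hat                            # predicted reward for every (prompt, response)
    loss = -F.logsigmoid(scores[pref_states, a_w] - scores[pref_states, a_l]).mean()
    opt.zero_grad(); loss.backward(); opt.step()
r_hat = (phi @ w_hat).detach()                      # generalizes to unseen prompts via shared features

def policy_opt(prompts, steps=300):
    """Maximize the learned reward on the given prompts (exact expectation
    over responses replaces sampling policy-generated data for simplicity)."""
    logits = new_logits()
    opt = torch.optim.Adam([logits], lr=0.1)
    for _ in range(steps):
        probs = torch.softmax(logits[prompts], dim=-1)
        loss = -(probs * r_hat[prompts]).sum(dim=-1).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return logits

rmb_po   = policy_opt(torch.unique(pref_states))    # only prompts seen in the preference data
rmb_po_p = policy_opt(torch.arange(S))              # plus preference-free prompts (out-of-preference data)

def avg_true_reward(logits):                        # expected ground-truth reward over all prompts
    return (torch.softmax(logits, dim=-1) * true_r).sum(dim=-1).mean().item()

print("DPO     :", avg_true_reward(dpo_logits))
print("RMB-PO  :", avg_true_reward(rmb_po))
print("RMB-PO+ :", avg_true_reward(rmb_po_p))
```

In this sketch, DPO and RMB-PO only update the policy on prompts that appear in the preference data, while RMB-PO+ also exploits the reward model's predictions on the remaining prompts, which is the mechanism the paper studies.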