In 2024, researchers E. Choi, A. Ahmadian, M. Geist, O. Pietquin, and M. G. Azar from Cohere introduced Self-Improving Robust Preference Optimization (SRPO). The method trains a separate large language model (LLM), referred to as the in-context LLM, on a preference dataset so that, given a prompt and the current LLM's response, it generates a better response. SRPO thus learns a self-improvement policy that revises suboptimal samples towards more preferred ones, optimizing alignment. The approach addresses limitations of existing alignment methods, such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), which rely heavily on extensive human annotation and lack transparency, limiting their scalability and adaptability.
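The inference pattern described above, a base policy proposes a response and a separately trained model revises it, can be sketched as follows. This is a toy illustration with hypothetical stub functions (`base_policy`, `improvement_model`, `srpo_inference` are illustrative names, not the Cohere implementation):

```python
# Toy sketch of SRPO-style self-improving inference (hypothetical stubs).
# A base policy proposes an initial response; a separate self-improvement
# model, conditioned on the prompt and the current response, revises it
# for a fixed number of steps.

def base_policy(prompt: str) -> str:
    """Stub for the base LLM: returns an initial (possibly suboptimal) response."""
    return f"draft answer to: {prompt}"

def improvement_model(prompt: str, response: str) -> str:
    """Stub for the in-context LLM: maps (prompt, current response) -> better response."""
    return response + " [revised]"

def srpo_inference(prompt: str, num_steps: int = 2) -> str:
    """Sample once from the base policy, then apply the improvement model num_steps times."""
    response = base_policy(prompt)
    for _ in range(num_steps):
        response = improvement_model(prompt, response)
    return response

print(srpo_inference("What is RLHF?"))
# -> draft answer to: What is RLHF? [revised] [revised]
```

In the paper's framing, the revision step is learned from the preference data itself, which is what removes the dependence on a separately trained reward model.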
Existing alignment approaches, like Reinforcement Learning from Human Feedback or Direct Preference Optimization, rely heavily on extensive human annotation and lack transparency in enforcing behaviors, limiting scalability and adaptability. SelfControl is a gradient-based… https://t.co/m5eoP7XIsd
No-Regret Algorithms for Safe Bayesian Optimization with Monotonicity Constraints. https://t.co/cOJxvJSjsE
Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms. Reinforcement Learning from Human Feedback (RLHF) has been crucial to the recent success of Large Language Models (LLMs); however, it is often a complex and brittle process. https://t.co/1yaole5LYE
Nearly Minimax Optimal Regret for Multinomial Logistic Bandit. https://t.co/EAWCnCtBtK
[LG] Self-Improving Robust Preference Optimization E Choi, A Ahmadian, M Geist, O Pietquin, M G Azar [Cohere] (2024) https://t.co/VJEdBrBWNQ - SRPO learns a self-improvement policy to revise suboptimal samples towards more preferred ones. This allows optimizing alignment… https://t.co/9zItjOku9c
Self-Improving Robust Preference Optimization. https://t.co/77chymJFNr
Self-Improving Robust Preference Optimization abs: https://t.co/MnRBMRdXfj New approach to utilizing preference data: 1. Train a separate LLM on the preference dataset so that given prompt and current LLM response, output better response. This LLM is called the in-context… https://t.co/1w5Fvte2bE