Researchers have introduced Self-Improving Robust Preference Optimization (SRPO), a novel approach aimed at enhancing the performance and robustness of large language models (LLMs) aligned via reinforcement learning from human feedback (RLHF). SRPO, developed by E Choi, A Ahmadian, M Geist, O Pietquin, and M G Azar at Cohere, learns a self-improvement policy that revises suboptimal samples towards more preferred ones, optimizing alignment while reducing reliance on extensive human annotation. The method addresses limitations of existing alignment approaches, such as Direct Preference Optimization (DPO) and RLHF, which often lack transparency and scalability. The posts collected below also cover related work on direct alignment and fine-tuning from Stanford University, UMass Amherst, Princeton University, Stevens Institute of Technology, and the University of Pennsylvania.
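For context on the "direct alignment" family these posts discuss, the sketch below shows a generic DPO-style pairwise preference loss in PyTorch. It is only an illustration of that family under standard assumptions, not the SRPO objective from the Cohere paper (which instead learns a self-improvement/revision policy); the function name, tensor shapes, and the toy usage are all hypothetical.

```python
# Minimal sketch of a generic DPO-style pairwise preference loss.
# NOT the SRPO objective; names, shapes, and the toy inputs are assumptions.
import torch
import torch.nn.functional as F

def pairwise_preference_loss(
    policy_logp_chosen: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
    policy_logp_rejected: torch.Tensor,  # log pi_theta(y_l | x), shape (batch,)
    ref_logp_chosen: torch.Tensor,       # log pi_ref(y_w | x), shape (batch,)
    ref_logp_rejected: torch.Tensor,     # log pi_ref(y_l | x), shape (batch,)
    beta: float = 0.1,                   # strength of the implicit KL regularization
) -> torch.Tensor:
    """Push the policy to prefer y_w over y_l relative to a frozen reference model."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # Standard Bradley-Terry-style objective: maximize log-sigmoid of the margin gap.
    return -F.logsigmoid(logits).mean()

# Toy usage with random log-probabilities standing in for real model outputs.
if __name__ == "__main__":
    torch.manual_seed(0)
    b = 4
    loss = pairwise_preference_loss(
        policy_logp_chosen=torch.randn(b),
        policy_logp_rejected=torch.randn(b),
        ref_logp_chosen=torch.randn(b),
        ref_logp_rejected=torch.randn(b),
    )
    print(loss.item())
```

SRPO and the overoptimization work below both start from this kind of offline, reward-model-free objective; SRPO's distinguishing step is training the model to revise its own suboptimal outputs rather than only ranking fixed pairs.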
Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step. Recently, Direct Preference Optimization (DPO) has extended its success from aligning large language models (LLMs) to aligning text-to-image diffusion models with human preferences. https://t.co/w4RNw1llKv
Need a Robust, Scalable and Self-Improving RLHF pipeline? Check out our new work: Self-Improving Robust Preference Optimization (SRPO), a robust offline paradigm for RLHF. SRPO frames learning from human preferences as a self-improvement process and by doing so makes the… https://t.co/5J7ETJNnjo https://t.co/BTcpQXdiQ6
[LG] Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms R Rafailov, Y Chittepu, R Park, H Sikchi... [Stanford University & UMass Amherst] (2024) https://t.co/iaxaFeMCuh - Direct Alignment Algorithms (DAAs) like Direct Preference Optimization have… https://t.co/Q5j1oJcBso
Adaptive Preference Scaling for Reinforcement Learning with Human Feedback. https://t.co/qIchogczk8
🤔Can we explicitly teach LLMs to self-improve using RLHF? Introducing “Self-Improving Robust Preference Optimization” (SRPO) which trains models that are self-improving and robust to eval tasks! w/ @221eugene Matthieu Geist Olivier Pietquin @Learnius 📜https://t.co/RjkUEUri5I https://t.co/PJny3wa5xv
[LG] Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity W Guo, J Long, Y Zeng, Z Liu… [Princeton University & Stevens Institute of Technology & University of Pennsylvania] (2024) https://t.co/VRFMM4HQCs - Large language models (LLMs) require fine-tuning for optimal… https://t.co/63o3ecERXv
Existing alignment approaches, like Reinforcement Learning from Human Feedback or Direct Preference Optimization, rely heavily on extensive human annotation and lack transparency in enforcing behaviors, limiting scalability and adaptability. SelfControl is a gradient-based… https://t.co/m5eoP7XIsd
Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms. Reinforcement Learning from Human Feedback (RLHF) has been crucial to the recent success of Large Language Models (LLMs); however, it is often a complex and brittle process. https://t.co/1yaole5LYE
[LG] Self-Improving Robust Preference Optimization E Choi, A Ahmadian, M Geist, O Pietquin, M G Azar [Cohere] (2024) https://t.co/VJEdBrBWNQ - SRPO learns a self-improvement policy to revise suboptimal samples towards more preferred ones. This allows optimizing alignment… https://t.co/9zItjOku9c
Self-Improving Robust Preference Optimization. https://t.co/77chymJFNr