Researchers from the University of Virginia and Princeton University have introduced SimPO (Simple Preference Optimization), a new offline preference optimization algorithm. Developed by Yu Meng, Mengzhou Xia, and Danqi Chen in 2024, SimPO is designed for simplicity and training stability in offline preference tuning and significantly outperforms existing methods such as DPO (Direct Preference Optimization) and ORPO. The Llama-3-8B-SimPO model, tuned with SimPO, has achieved notable results, including a 44.7% length-controlled (LC) win rate on AlpacaEval 2 and a 33.8% win rate on Arena-Hard. The algorithm is reference-free: it uses the average log probability of a sequence as the implicit reward, making it a simpler yet effective alternative in the realm of reinforcement learning from human feedback (RLHF). Experts have praised SimPO's effectiveness, with some noting its strength on open-domain queries.
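To make the summary concrete, here is a minimal sketch of the per-pair SimPO objective as described in the paper: the reward is the length-normalized (average) log probability of a response under the policy, scaled by beta, and the loss is a sigmoid loss on the chosen/rejected reward gap minus a target margin gamma. The specific values of `beta` and `gamma` below are illustrative defaults, not the paper's tuned hyperparameters.

```python
import math

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=0.5):
    """SimPO loss for a single preference pair (sketch).

    logp_* are the summed token log-probabilities of the full responses
    under the current policy. SimPO length-normalizes them, so no frozen
    reference model is needed (unlike DPO). beta scales the reward and
    gamma is the target reward margin; both values here are illustrative.
    """
    # Implicit reward: average log probability, scaled by beta
    r_chosen = beta * logp_chosen / len_chosen
    r_rejected = beta * logp_rejected / len_rejected
    margin = r_chosen - r_rejected - gamma
    # Bradley-Terry-style sigmoid loss: -log sigmoid(margin)
    return math.log(1.0 + math.exp(-margin))
```

When the chosen response's average log probability exceeds the rejected one's by more than the margin, the loss approaches zero; identical rewards with `gamma=0` give the chance-level loss of log 2.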
Implementing direct preference optimization (DPO) is so much more convenient than RLHF with PPO for LLM preference tuning. Can we go even simpler? Apparently, yes: "SimPO: Simple Preference Optimization with a Reference-Free Reward (https://t.co/s3UAalP1qX). I.e., remove the… https://t.co/Mygkzed7lw
SimPO (Simple Preference Optimization), a new RLHF method, was released to improve simplicity and training stability for offline preference tuning while outperforming DPO and ORPO. 👀 SimPO is very similar to DPO but is reference-free and uses the average log probability… https://t.co/Sc407klKiv
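The contrast drawn in the tweet above can be sketched in two lines: DPO's implicit reward compares the policy against a frozen reference model, while SimPO's reward is just the policy's length-averaged log probability. The `beta` defaults below are illustrative (0.1 is a common DPO choice; SimPO typically uses a larger value), not prescribed constants.

```python
def dpo_reward(logp_policy, logp_ref, beta=0.1):
    # DPO's implicit reward: log-ratio against a frozen reference model,
    # so a second model must be kept in memory during training.
    return beta * (logp_policy - logp_ref)

def simpo_reward(logp_policy, length, beta=2.0):
    # SimPO's reward: average (length-normalized) log probability under
    # the policy alone; no reference model is required.
    return beta * logp_policy / length
```

Dropping the reference model halves the memory footprint of the reward computation and, per the paper, also aligns the training reward with the length-normalized likelihood used at generation time.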
Fantastic paper from @PrincetonPLI team. Chatting with the model is also incredible. Llama3 8B is the best small model currently; SimPO makes it better. Excels on my list of open-domain queries. https://t.co/tUc1oIvB3c
[CL] SimPO: Simple Preference Optimization with a Reference-Free Reward Y Meng, M Xia, D Chen [University of Virginia & Princeton University] (2024) https://t.co/TlrWr6YS0O - SimPO is a simple yet effective offline preference optimization algorithm that consistently outperforms… https://t.co/Xc9lUxKKAw
Introducing SimPO: Simpler & more effective Preference Optimization!🎉 Significantly outperforms DPO w/o a reference model!📈 Llama-3-8B-SimPO ranked among top on leaderboards!💪 ✅44.7% LC win rate on AlpacaEval 2 ✅33.8% win rate on Arena-Hard https://t.co/4KsS6PRQUH 🧵[1/n] https://t.co/Wtpm1J3awc