SnorkelAI, with contributions from HoangTranDV and Chris M. Glaze, has achieved state-of-the-art performance on the AlpacaEval 2.0 leaderboard with only a 7B model, ranking just below the judge, GPT-4 Turbo, and above GPT-4, Gemini Pro, Claude 2, Llama 2, and Mixtral. A small 0.4B reward model was key to curating the training data. SnorkelAI has also launched programmatic alignment support in Snorkel Flow for steering LLMs without manual preference annotations, and the model is available for download, in a sandbox, or via API calls. Separately, a correction to AlpacaEval 2.0 has been proposed that normalizes win rates by response verbosity, producing a length-penalized leaderboard intended to give a fairer comparison across models.
Get your hands on the new 7B model that put @SnorkelAI SOTA on AlpacaEval 2.0! This work is foundational to programmatic alignment support in Snorkel Flow for steerable LLMs w/out manual preference annotations. Available for download, sandbox, or API calls at:… https://t.co/SSkPvdS47o
Length-penalized AlpacaEval 2.0, with ranking deltas vs the vanilla leaderboard. https://t.co/Kbf9T5ozJ7
Way to go @HoangTranDV @chris_m_glaze !! State of the art on AlpacaEval 2.0, showing again that smaller model + better data wins! More exciting: pointing the way for our new *programmatic alignment* support in @SnorkelAI for steerable LLMs w/out manual preference annotations https://t.co/uraodhR8A5
Apropos of nothing, I've done the dumbest correction to AlpacaEval 2.0 one could think of: "normalized" by response verbosity. Win Rate * (avg length[GPT-4 Turbo, GPT-3.5] / length) I think the new list (right) makes more sense, but we'd better correct for all biases seriously. https://t.co/YBxT9Y05HU
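The verbosity correction described in the post above can be sketched in a few lines: scale a model's win rate by the ratio of the reference models' average response length to the model's own average length, so verbose models are penalized and terse ones rewarded. The numbers below are illustrative, not actual leaderboard values.

```python
# Minimal sketch of the length-penalized AlpacaEval 2.0 correction:
#   penalized = win_rate * (mean length of [GPT-4 Turbo, GPT-3.5] / model length)
# All values here are made up for illustration.

def length_penalized_win_rate(win_rate, model_avg_len, ref_avg_lens):
    """Scale win_rate by (mean reference response length / model response length)."""
    ref_mean = sum(ref_avg_lens) / len(ref_avg_lens)
    return win_rate * (ref_mean / model_avg_len)

# A model that is 33% more verbose than the reference average sees its
# score shrink proportionally.
penalized = length_penalized_win_rate(
    win_rate=30.0,
    model_avg_len=2400,          # candidate model's average response length (chars)
    ref_avg_lens=[2000, 1600],   # e.g. GPT-4 Turbo and GPT-3.5 averages (assumed)
)
```

Note this is exactly the "dumbest correction one could think of" the author describes: a single multiplicative length term, not a serious debiasing of the judge.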
Fresh result out of @SnorkelAI Research: you can top the AlpacaEval 2.0 LLM leaderboard (under the judge, GPT-4 Turbo, and over GPT-4, Gemini Pro, Claude 2, Llama 2, Mixtral, etc.) with only a 7B model if you align it right! We use a small reward model (0.4B) to curate training…
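The posts above describe using a small (0.4B) reward model to curate training data in place of manual preference annotations. The details of SnorkelAI's actual pipeline are not given here, so the following is only a hedged sketch of the general pattern: sample several candidate responses per prompt, score them with a reward model, and keep the best and worst as a (chosen, rejected) preference pair. `reward_score` is a hypothetical stand-in, not a real reward model.

```python
# Hedged sketch (not SnorkelAI's actual method) of reward-model-based
# preference curation: rank sampled responses by reward and emit the
# top/bottom pair as programmatically labeled preference data.

def reward_score(prompt, response):
    # Placeholder scorer; a real pipeline would call a trained 0.4B
    # reward model here. This toy proxy just counts distinct words.
    return len(set(response.split()))

def curate_preference_pair(prompt, candidates):
    """Rank candidate responses by reward; return (chosen, rejected)."""
    ranked = sorted(candidates,
                    key=lambda r: reward_score(prompt, r),
                    reverse=True)
    return ranked[0], ranked[-1]

chosen, rejected = curate_preference_pair(
    "Explain AlpacaEval 2.0.",
    ["eval eval eval", "a benchmark judged by GPT-4 Turbo"],
)
```

Pairs curated this way can then feed a preference-tuning step (e.g. DPO-style training) without any human annotation, which is the "programmatic alignment" idea the posts point to.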