Scale AI has launched the SEAL Leaderboards, a new evaluation platform for large language models (LLMs). The leaderboards use private datasets and expert evaluations to rank LLMs in various domains, including coding, math, instruction following, and Spanish. This initiative aims to address common issues in model evaluations, such as data contamination and rater quality. The SEAL Leaderboards are continuously updated with new data and models, and Scale AI is inviting model developers to participate. The platform also extends GSM1K to various domains and uses ELO-scale rankings via the Bradley-Terry method. The launch has been well-received, with industry experts highlighting its potential to provide more trustworthy and accurate assessments of LLMs.
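The summary above mentions that SEAL uses ELO-scale rankings via the Bradley-Terry method. As a rough illustration of how such rankings can be derived from pairwise preference counts, here is a minimal sketch using the standard MM (minorization-maximization) fit; the win matrix, iteration count, and Elo mapping are invented for this example and do not reflect Scale's actual implementation:

```python
import math

def bradley_terry(wins, n_models, iters=200):
    """Estimate Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] = number of times model i was preferred over model j.
    Uses the classic MM update: p_i = W_i / sum_j n_ij / (p_i + p_j).
    """
    p = [1.0] * n_models
    for _ in range(iters):
        new_p = []
        for i in range(n_models):
            total_wins = sum(wins[i][j] for j in range(n_models) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n_models) if j != i
            )
            new_p.append(total_wins / denom if denom else p[i])
        # Normalize so the geometric mean is 1 (fixes the arbitrary scale).
        g = math.exp(sum(math.log(x) for x in new_p) / n_models)
        p = [x / g for x in new_p]
    return p

def to_elo_scale(strengths, base=1000.0):
    """Map BT strengths onto an Elo-like scale (400 points per 10x strength)."""
    return [base + 400 * math.log10(s) for s in strengths]

# Toy data: 3 models, model 0 preferred most often by raters.
wins = [
    [0, 8, 9],
    [2, 0, 6],
    [1, 4, 0],
]
strengths = bradley_terry(wins, 3)
ratings = to_elo_scale(strengths)
```

Under this model, the probability that model i beats model j is p_i / (p_i + p_j), so sorting by fitted strength (or the equivalent Elo-style rating) gives the leaderboard order.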
Scale AI has launched SEAL Leaderboards for LLMs. Here are the rankings for coding, math, and instruction following; we have a different winner in each category. I have just taken the top 5 from each. More details in the 🧵. 1/4 Coding: https://t.co/0VeXUFe2i9
Did OpenAI GPT-4o game the Lmsys leaderboard to be #1? It seemed suspect that they used ELO-based eval over other benchmarks. Results from @scale_AI's new private SEAL leaderboard show Turbo is better for coding and Opus for math! https://t.co/Fq5DeHzQ8D
OMG did OpenAI game the Lmsys eval? Results from @scale_AI's new private SEAL leaderboard! https://t.co/phuhocqKnE
AI training data provider Scale AI releases SEAL Leaderboards, which uses private datasets to rank LLMs in domains like coding, instruction following, and math (@mike_wheatley / SiliconANGLE) https://t.co/P85rmKZFzX https://t.co/f1l9aty66G
New from Scale: SEAL Leaderboards — a new benchmark arena for frontier LLMs - Private, novel assessments that models can’t train on - ELO-scale rankings (via Bradley-Terry) - Domain leaderboards (today: coding, math, instruct, Spanish — more soon!) (Links in reply) https://t.co/R550vrHJzf
New LLM leaderboards from Scale AI! https://t.co/76aRcSaQmz
We're going to need a lot more investment in high-quality evals and benchmarks to help us understand the actual comparative utility of the various models. This new set of private evals and leaderboard from Scale are great to see. https://t.co/opRWuokcyV
ScaleAI just released LLM leaderboards by extending GSM1K to various domains! This can be a great complement to lmsys eval. https://t.co/U3vnEsF1nz
Excited to be launching the first-of-their-kind SEAL Leaderboards for LLMs. Everyone knows evals are broken, and we want to help fix that 🥇⚖️🔒 https://t.co/fRBBwCX3QA We're also taking the Scale Evaluation platform into GA, and look forward to getting it into more hands! 🚀 https://t.co/H0ucmiA5kd
Glad to see scale going in this direction: Fully private LLM leaderboard. Even has the fine print 🤓: "If you’d like to add your model to this leaderboard or a future version, please contact [email protected]. To ensure leaderboard integrity, we require that models can only be… https://t.co/KYHIBrw8jl
New public leaderboard from Scale! It looks like a solid set of evals. Mitigates two of the biggest problems in evals today: eval sets contaminated in model training, and rater quality for human evaluation. https://t.co/VrnXU0MRR2
1/ We are launching SEAL Leaderboards—private, expert evaluations of leading frontier models. Our design principles: 🔒Private + Unexploitable. No overfitting on evals! 🎓Domain Expert Evals 🏆Continuously Updated w/new Data and Models Read more in 🧵 https://t.co/10nk53awrL https://t.co/MZZQX0KDc9
🚀 Introducing the SEAL Leaderboards! We rank LLMs using private datasets that can’t be gamed. Vetted experts handle the ratings, and we share our methods in detail openly! Check out our leaderboards at https://t.co/bRdTbIMd20! Which evals should we build next? https://t.co/0mCk5hk6kK
Scale is excited to release the SEAL leaderboards which rank frontier LLMs, kicking off the first truly expert-driven, trustworthy LLM contest open to all. https://t.co/KYHLUvS1CC https://t.co/lu7TQl7d0B