Scale AI has launched the SEAL Leaderboards, a new evaluation platform for large language models (LLMs). The leaderboards use private datasets and expert evaluations to rank LLMs in various domains, including coding, math, instruction following, and Spanish. This initiative aims to address common issues in model evaluations, such as data contamination and rater quality. The SEAL Leaderboards are continuously updated with new data and models, and Scale AI is inviting model developers to participate. The platform also extends GSM1K to various domains and uses ELO-scale rankings via the Bradley-Terry method. The launch has been well-received, with industry experts highlighting its potential to provide more trustworthy and accurate assessments of LLMs.
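The summary above mentions that SEAL uses ELO-scale rankings via the Bradley-Terry method. As a rough illustration of how such rankings can be derived from pairwise preference counts, here is a minimal sketch using the standard MM (minorization-maximization) fit; the win matrix, iteration count, and Elo mapping are invented for this example and do not reflect Scale's actual implementation:

```python
import math

def bradley_terry(wins, n_models, iters=200):
    """Estimate Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] = number of times model i was preferred over model j.
    Uses the classic MM update: p_i = W_i / sum_j n_ij / (p_i + p_j).
    """
    p = [1.0] * n_models
    for _ in range(iters):
        new_p = []
        for i in range(n_models):
            total_wins = sum(wins[i][j] for j in range(n_models) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n_models) if j != i
            )
            new_p.append(total_wins / denom if denom else p[i])
        # Normalize so the geometric mean is 1 (fixes the arbitrary scale).
        g = math.exp(sum(math.log(x) for x in new_p) / n_models)
        p = [x / g for x in new_p]
    return p

def to_elo_scale(strengths, base=1000.0):
    """Map BT strengths onto an Elo-like scale (400 points per 10x strength)."""
    return [base + 400 * math.log10(s) for s in strengths]

# Toy data: 3 models, model 0 preferred most often by raters.
wins = [
    [0, 8, 9],
    [2, 0, 6],
    [1, 4, 0],
]
strengths = bradley_terry(wins, 3)
ratings = to_elo_scale(strengths)
```

Under this model, the probability that model i beats model j is p_i / (p_i + p_j), so sorting by fitted strength (or the equivalent Elo-style rating) gives the leaderboard order.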
Scale AI has launched SEAL Leaderboards for LLMs. Here are the rankings for coding, math, and instruction following; we have a different winner in each category. I have just taken the top 5 from each. More details in the 🧵. 1/4 Coding: https://t.co/0VeXUFe2i9
Did OpenAI GPT-4o game the Lmsys leaderboard to be #1? It seemed suspect that they used ELO-based eval over other benchmarks. Results from @scale_AI's new private SEAL leaderboard show Turbo is better for coding and Opus for math! https://t.co/Fq5DeHzQ8D
OMG did OpenAI game the Lmsys eval? Results from @scale_AI's new private SEAL leaderboard! https://t.co/phuhocqKnE
AI training data provider Scale AI releases SEAL Leaderboards, which uses private datasets to rank LLMs in domains like coding, instruction following, and math (@mike_wheatley / SiliconANGLE) https://t.co/P85rmKZFzX https://t.co/f1l9aty66G
New from Scale: SEAL Leaderboards — a new benchmark arena for frontier LLMs - Private, novel assessments that models can’t train on - ELO-scale rankings (via Bradley-Terry) - Domain leaderboards (today: coding, math, instruct, Spanish — more soon!) (Links in reply) https://t.co/R550vrHJzf
New LLM leaderboards from Scale AI! https://t.co/76aRcSaQmz
We're going to need a lot more investment in high-quality evals and benchmarks to help us understand the actual comparative utility of the various models. This new set of private evals and leaderboard from Scale are great to see. https://t.co/opRWuokcyV
ScaleAI just released LLM leaderboards by extending GSM1K to various domains! This can be a great complement to lmsys eval. https://t.co/U3vnEsF1nz
Excited to be launching the first-of-their-kind SEAL Leaderboards for LLMs. Everyone knows evals are broken, and we want to help fix that 🥇⚖️🔒 https://t.co/fRBBwCX3QA We're also taking the Scale Evaluation platform into GA, and look forward to getting it into more hands! 🚀 https://t.co/H0ucmiA5kd
Glad to see scale going in this direction: Fully private LLM leaderboard. Even has the fine print 🤓: "If you’d like to add your model to this leaderboard or a future version, please contact [email protected]. To ensure leaderboard integrity, we require that models can only be… https://t.co/KYHIBrw8jl
New public leaderboard from Scale! It looks like a solid set of evals. Mitigates two of the biggest problems in evals today: eval sets contaminated in model training, and rater quality for human evaluation. https://t.co/VrnXU0MRR2
1/ We are launching SEAL Leaderboards—private, expert evaluations of leading frontier models. Our design principles: 🔒Private + Unexploitable. No overfitting on evals! 🎓Domain Expert Evals 🏆Continuously Updated w/new Data and Models Read more in 🧵 https://t.co/10nk53awrL https://t.co/MZZQX0KDc9
🚀 Introducing the SEAL Leaderboards! We rank LLMs using private datasets that can’t be gamed. Vetted experts handle the ratings, and we share our methods in detail openly! Check out our leaderboards at https://t.co/bRdTbIMd20! Which evals should we build next? https://t.co/0mCk5hk6kK
Scale is excited to release the SEAL leaderboards which rank frontier LLMs, kicking off the first truly expert-driven, trustworthy LLM contest open to all. https://t.co/KYHLUvS1CC https://t.co/lu7TQl7d0B