The Open LLM Leaderboard 2 has been released, introducing new benchmarks and features to evaluate large language models (LLMs) more effectively. This update addresses the plateauing scores of LLMs on the previous benchmark suite by incorporating harder evaluations: IFEval, BBH, MATH Lvl 5, GPQA, MUSR, and MMLU-PRO. The leaderboard now includes high-quality datasets, chat templates, and a community voting system to prioritize model evaluations. Qwen 72B Instruct currently leads the leaderboard. The update also emphasizes fairer, more transparent, and reproducible comparisons of LLMs, with new visualizations and published technical details. The leaderboard is available on the Hugging Face Hub, and 300 H100 GPUs were used to re-run the new evaluations.
Fabulous talk today by @BorisMPower of @OpenAI at @Yale @yaledatascience @YINSedge on “ChatGPT and the Future of LLMs.” The developments are mind-blowing. #HNL https://t.co/tnY4c6USj9
Big news! The open llm leaderboard will be hard to game for a couple weeks! Looking forward to checking out Leaderboard 3 but for now, I'm choosing models based on my use case with MyxMatch Find fitness for free: https://t.co/uu8qp62QBB https://t.co/xVITCi3qq6 https://t.co/BJj2MoHlL2
Pumped to announce the brand new open LLM leaderboard. We burned 300 H100s to re-run new evaluations like MMLU-Pro for all major open LLMs! Some learnings: - Qwen 72B is the king and Chinese open models are dominating overall - Previous evaluations have become too easy for recent…
Very excited to release the new version of the Open LLM Leaderboard, v2 – it's much harder than the previous version, as you can see in some of the v1 <> v2 score comparisons I'm posting below. Update: As open models keep getting better and saturating some of the evaluations, it… https://t.co/zv6dSQCnhJ
Open LLM Leaderboard 2⃣️ (open-source LLM leaderboard) is now available on the @huggingface Hub 🔥🚀🏆 https://t.co/xpcsXRgBi7 ✨ New high-quality datasets for various tests. ✨ Chat templates in @AiEleuther's harness. ✨ Community voting system to prioritize model evaluations.…
Open LLM Leaderboard 2⃣️ is now available on the @huggingface Hub 🔥🚀🏆 https://t.co/xpcsXRgBi7 ✨ New high-quality datasets for various tests. ✨ Chat templates in @AiEleuther's harness. ✨ Community voting system to prioritize model evaluations. ✨"Maintainer's highlight" :…
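Several of the posts above highlight chat-template support in @AiEleuther's harness: instruct models are now prompted in the conversation format they were trained on, rather than with raw concatenated text. As a minimal sketch of the idea, the hypothetical helper below renders messages in a ChatML-style format (the style used by Qwen models, among others); real templates are Jinja strings shipped with each tokenizer and applied via `tokenizer.apply_chat_template` in `transformers`.

```python
def apply_chatml_template(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML-style prompt.

    Illustrative only: actual chat templates vary per model and are
    bundled with the tokenizer, not hard-coded like this.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    ]
    if add_generation_prompt:
        # Cue the model to continue as the assistant.
        parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)


prompt = apply_chatml_template([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
])
print(prompt)
```

Evaluating instruct models with and without their template can shift scores noticeably, which is one reason the v2 leaderboard applies templates where appropriate.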
Qwen 72B Instruct wins the Open LLM Leaderboard 2.0, for now ;) https://t.co/4EEHFCo0Rz
🚀 Very big update on the Open LLM Leaderboard! 🔥 New Evals: 📊 IFEval 📚 BBH 🔢 MATH Lvl 5 🤖 GPQA 🤔 MUSR 🔬 MMLU-PRO https://t.co/yTRTFhydq6
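Since the leaderboard runs on EleutherAI's lm-evaluation-harness, the new benchmark suite can in principle be reproduced locally. A hedged sketch of the invocation (the `leaderboard` task group name and the flags shown are assumed from recent harness versions, and the model name is just an example; running a 72B model requires substantial GPU resources):

```shell
pip install lm-eval

lm_eval --model hf \
    --model_args pretrained=Qwen/Qwen2-72B-Instruct \
    --tasks leaderboard \
    --apply_chat_template \
    --batch_size auto
```

Smaller subsets (e.g. a single task instead of the whole group) are a more practical way to sanity-check a setup before committing to a full run.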
To better understand how the Open LLM Leaderboard v2 works, you should look at @ailozovskaya 's thread showcasing some of the cool front end features she added! https://t.co/7nXKEGozWy
Open LLM Leaderboard is now rebuilt to better monitor the performance of LLMs! Many things have changed, starting from the task composition, technical details of the evaluation, and interface, to cool new visualizations! 🤗🚀 Check out our blog: https://t.co/9U1AJGjX1J https://t.co/NSqDgCeTbb https://t.co/wN8uMXlItD
Look at that 👀 Current benchmarks have become too easy for recent models, much like grading high school students on middle school problems makes little sense. So the team worked on a new version of the Open LLM Leaderboard with new benchmarks. Stellar work from @clefourrier… https://t.co/jdmfp8oSM3 https://t.co/qzzCesnp71
Exciting stuff now coming to the new-and-improved Open LLM Leaderboard! @clefourrier and team very hard at work 👀 https://t.co/RDai6LvgeC
Open LLM Leaderboard 2⃣️ is out!!🚀🏆 And Qwen2 is 🔥🔥🔥 https://t.co/P3fGrARlHp https://t.co/rYYy4hAqeQ
🔥I'm super happy to see the new Open LLM Leaderboard 2 in production! It was immense work of the entire team 💖 🔗The link to Open LLM Leaderboard remains the same https://t.co/ecrYahipwt 🗒️And the blog is very informative, be sure to check it out! https://t.co/6zTEhyXDNo https://t.co/fwUH8aATzP https://t.co/yZXURnCioc
Open LLM Leaderboard 2⃣️ is out!!🚀🏆 https://t.co/rYYy4hAqeQ
Open LLM Leaderboard 2 released! Evaluating LLMs is not easy. Finding new ways to compare LLMs fairly, transparently, and reproducibly is important! Benchmarks are not perfect, but they give us a first understanding of how well models perform and where their strengths are. What's… https://t.co/G5nZVNMAj2
LLM performances have been plateauing... so we decided to make the Open LLM Leaderboard steep again 🏔️ 😈 Introducing the Leaderboard 2️⃣ Expect... - new benchmarks - fairer reporting - cool features (did I hear voting and chat template?) 🧵 https://t.co/6uKKuTSFrX
New video! How do we Evaluate LLMs? 👀 The what, why, when and how! https://t.co/Fx8UsjWU1i https://t.co/S21yKjMVvA
How do we Evaluate LLMs? https://t.co/Fx8UsjWU1i https://t.co/gAmk3kPPqw
Most businesses struggle to evaluate their AI models. Few companies can accurately assess the performance of their AI systems. This skill, known as LLM evaluation, is crucial for better decision-making and for improving the results of your genAI systems. Evaluation…