BigCodeBench is a new benchmark designed to evaluate large language models (LLMs) on practical and challenging programming tasks. Unlike simpler benchmarks such as HumanEval and MBPP, BigCodeBench tests LLMs, including open code and math models, on more realistic and comprehensive coding scenarios. The initiative, led by Terry Yue Zhuo, aims to address the saturation of basic coding benchmarks by state-of-the-art (SOTA) LLMs. Current top models, including GPT-4 and the new models from DeepSeek AI, achieve only around 50% success on these tasks, which involve a wide range of tool/library calls. Approximately 40% of tasks remain unsolved by SOTA models, underscoring the need for more robust evaluation standards in AI and machine learning.
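For context, a BigCodeBench-style task typically requires the model to compose several library calls into one working function, in contrast to the self-contained string or arithmetic puzzles of HumanEval. The sketch below is a hypothetical task and reference solution in that spirit; the function name, chosen libraries, and task description are illustrative assumptions, not an actual benchmark item:

```python
import numpy as np
import pandas as pd

def task_func(csv_path: str) -> pd.DataFrame:
    """Hypothetical BigCodeBench-style task: load a CSV file, drop rows
    with missing values, z-score-normalize the numeric columns, and
    return the cleaned DataFrame. Solving it means chaining pandas and
    numpy calls correctly, not just writing a standalone algorithm."""
    df = pd.read_csv(csv_path).dropna()
    numeric_cols = df.select_dtypes(include=np.number).columns
    df[numeric_cols] = (df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std()
    return df
```

Tasks like this are harder for models precisely because correctness depends on knowing real library APIs and how they interact, which is where the ~40% unsolved gap shows up.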
These benchmarks are impressive - Open Source LLMs are re-taking serious ground! https://t.co/1aUO781DdW
Reasonable evaluation must be the basis for the healthy development of LLMs, and I believe 🌸BigCodeBench will help us trace the realistic performance of Code LLMs very well! Thanks to @terryyuezhuo for leading this excellent work! https://t.co/uuUElLeRZq
LLMs are making progress, and so are the benchmarks! BigCodeBench is a new coding benchmark with comprehensive real-life tasks (e.g., beyond using std libraries). About 40% of tasks are still unsolved by SOTA models. Awesome work led by @terryyuezhuo ♥️ https://t.co/gRWconIhpw
LLMs are evaluated on the same tasks in so many different ways! 🤯 ✨ We introduce OLMES – a standard for reproducible LLM evaluations that is open, practical, completely documented, and can be applied to current leaderboards & eval code bases! ✨ 📜 https://t.co/SmjBV2Szsk 1/ https://t.co/KTvqkRvmtu
This is probably one of the best open code and math LLMs out there! So I spent some time reading through the technical report. Takeaways: > very strong in code and math > competes with GPT-4 on many code tasks > nothing too fancy architecture-wise > high-quality / filtered… https://t.co/qWeoaQX2IP
In the past few months, we’ve seen SOTA LLMs saturating basic coding benchmarks with short and simplified coding tasks. It's time to enter the next stage of coding challenges under comprehensive and realistic scenarios! -- Here comes BigCodeBench, benchmarking LLMs on solving… https://t.co/w3Z6N5wnVk https://t.co/byYL02mEp4
BigCodeBench is a step towards evaluating LLMs on more realistic and harder coding tasks which involve a wide range of tool/library calls. Even models like GPT-4o or the brand new @deepseek_ai coding models only achieve ~50%. Awesome work led by @terryyuezhuo 🔥 https://t.co/9xAkzJGK8H
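The ~50% figures quoted in these threads are typically pass@1 rates. For reference, the sketch below shows the standard unbiased pass@k estimator from the Codex paper (Chen et al., 2021), which code benchmarks in this family commonly use; the sample counts in the example are made up for illustration:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn from n generations passes the tests, given that c
    of the n generations are correct."""
    if n - c < k:
        return 1.0  # every draw of k samples must contain a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 of 10 generated solutions pass the tests -> pass@1 = 0.30
print(pass_at_k(n=10, c=3, k=1))
```

With greedy decoding (one sample per task), pass@1 reduces to the plain fraction of tasks solved, which is how a headline number like "~50%" maps onto "about 40% of tasks remain unsolved."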
Introducing 🌸BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks! BigCodeBench goes beyond simple evals like HumanEval and MBPP and tests LLMs on more realistic and challenging coding tasks. https://t.co/9fGrWJ6BtX
BiGGen Bench: A Benchmark Designed to Evaluate Nine Core Capabilities of Language Models https://t.co/WxETIVFGbf #BiGGenBench #LanguageModel #AI #Benchmark #PracticalAI #ai #news #llm #ml #research #ainews #innovation #artificialintelligence #machinelearning #technology #deep… https://t.co/9e1UI1VAJL
Awesome to see more quality open source LLMs coming out! https://t.co/BdPbmxuZwn
CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery ◼ New research introduces CS-Bench, a pioneering bilingual benchmark for evaluating AI in 26 computer science subfields. This tool reveals how model scale affects AI performance and… https://t.co/5kiTDYmDGh