BigCodeBench is a new benchmark designed to evaluate large language models (LLMs) on practical and challenging programming tasks. Unlike simpler benchmarks such as HumanEval and MBPP, BigCodeBench tests LLMs, including open code and math models, on more realistic and comprehensive coding scenarios. The initiative, led by Terry Yue Zhuo, aims to address the saturation of basic coding benchmarks by state-of-the-art (SOTA) LLMs. Current top models, including GPT-4 and the new models from DeepSeek AI, achieve only around 50% success on these tasks, which involve a wide range of tool/library calls. Approximately 40% of tasks remain unsolved by SOTA models, underscoring the need for more robust evaluation standards in AI and machine learning.
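For context, a BigCodeBench-style task typically requires the model to compose several library calls into one working function, in contrast to the self-contained string or arithmetic puzzles of HumanEval. The sketch below is a hypothetical task and reference solution in that spirit; the function name, chosen libraries, and task description are illustrative assumptions, not an actual benchmark item:

```python
import numpy as np
import pandas as pd

def task_func(csv_path: str) -> pd.DataFrame:
    """Hypothetical BigCodeBench-style task: load a CSV file, drop rows
    with missing values, z-score-normalize the numeric columns, and
    return the cleaned DataFrame. Solving it means chaining pandas and
    numpy calls correctly, not just writing a standalone algorithm."""
    df = pd.read_csv(csv_path).dropna()
    numeric_cols = df.select_dtypes(include=np.number).columns
    df[numeric_cols] = (df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std()
    return df
```

Tasks like this are harder for models precisely because correctness depends on knowing real library APIs and how they interact, which is where the ~40% unsolved gap shows up.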
These benchmarks are impressive - Open Source LLMs are re-taking serious ground! https://t.co/1aUO781DdW
Reasonable evaluation must be the basis for the healthy development of LLMs, and I believe 🌸BigCodeBench will help us trace the realistic performance of Code LLMs very well! Thanks to @terryyuezhuo for leading this excellent work! https://t.co/uuUElLeRZq
LLMs are making progress, and so are the benchmarks! BigCodeBench is a new coding benchmark with comprehensive real-life tasks (e.g., beyond using std libraries). About 40% of tasks are still unsolved by SOTA models. Awesome work led by @terryyuezhuo ♥️ https://t.co/gRWconIhpw
LLMs are evaluated on the same tasks in so many different ways! 🤯 ✨ We introduce OLMES – a standard for reproducible LLM evaluations that is open, practical, completely documented, and can be applied to current leaderboards & eval code bases! ✨ 📜 https://t.co/SmjBV2Szsk 1/ https://t.co/KTvqkRvmtu
This is probably one of the best open code and math LLMs out there! So I spent some time reading through the technical report. Takeaways: > very strong in code and math > competes with GPT-4 on many code tasks > nothing too fancy architecture-wise > high-quality / filtered… https://t.co/qWeoaQX2IP
In the past few months, we’ve seen SOTA LLMs saturating basic coding benchmarks with short and simplified coding tasks. It's time to enter the next stage of coding challenges under comprehensive and realistic scenarios! -- Here comes BigCodeBench, benchmarking LLMs on solving… https://t.co/w3Z6N5wnVk https://t.co/byYL02mEp4
BigCodeBench is a step towards evaluating LLMs on more realistic and harder coding tasks which involve a wide range of tool/library calls. Even models like GPT-4o or the brand new @deepseek_ai coding models only achieve ~50%. Awesome work led by @terryyuezhuo 🔥 https://t.co/9xAkzJGK8H
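The ~50% figures quoted in these threads are typically pass@1 rates. For reference, the sketch below shows the standard unbiased pass@k estimator from the Codex paper (Chen et al., 2021), which code benchmarks in this family commonly use; the sample counts in the example are made up for illustration:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn from n generations passes the tests, given that c
    of the n generations are correct."""
    if n - c < k:
        return 1.0  # every draw of k samples must contain a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 of 10 generated solutions pass the tests -> pass@1 = 0.30
print(pass_at_k(n=10, c=3, k=1))
```

With greedy decoding (one sample per task), pass@1 reduces to the plain fraction of tasks solved, which is how a headline number like "~50%" maps onto "about 40% of tasks remain unsolved."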
Introducing 🌸BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks! BigCodeBench goes beyond simple evals like HumanEval and MBPP and tests LLMs on more realistic and challenging coding tasks. https://t.co/9fGrWJ6BtX
BiGGen Bench: A Benchmark Designed to Evaluate Nine Core Capabilities of Language Models https://t.co/WxETIVFGbf #BiGGenBench #LanguageModel #AI #Benchmark #PracticalAI #ai #news #llm #ml #research #ainews #innovation #artificialintelligence #machinelearning #technology #deep… https://t.co/9e1UI1VAJL
Awesome to see more quality open source LLMs coming out! https://t.co/BdPbmxuZwn
CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery ◼ New research introduces CS-Bench, a pioneering bilingual benchmark for evaluating AI in 26 computer science subfields. This tool reveals how model scale affects AI performance and… https://t.co/5kiTDYmDGh