New research from Scale AI evaluates popular Large Language Models (LLMs) for overfitting using a new test set modeled on GSM8k. Models like Mistral and Phi show evidence of systematic overfitting, with accuracy drops of up to 13%. Other models like GPT, Claude, Gemini, and Llama are not overfitting benchmarks.
On GSM1k (a new Grade School Math benchmark), leading open- and closed-source LLMs show accuracy drops of up to 13%, with several families of models (e.g., Phi and Mistral) showing evidence of systematic overfitting across almost all model sizes. "A… https://t.co/hpNg7DXl6u
This AI Paper by Scale AI Introduces GSM1k for Measuring Reasoning Accuracy in Large Language Models (LLMs) Quick read: https://t.co/VDybXCPgKg Researchers from Scale AI have introduced GSM1k, a new benchmark created to measure overfitting and reasoning capabilities in LLMs. The… https://t.co/tdhfqLHf7m
Interesting to see people try to quantify the overfitting we see on certain LLM problems! Some models appear badly contaminated/overfit, but the flagship models - Google's Gemini, OpenAI's GPT, Anthropic's Claude - seem pretty safe, at least on elementary school level math. https://t.co/gedcLW6uk2 https://t.co/JgFXuD2dvW
How overfit are popular LLMs on public benchmarks? New research from @scale_AI tries to figure this out with a new evaluation benchmark - GSM1k https://t.co/YqN4rVEPU9
Interesting to think that ML models are most performant when overfitted, whereas traditional models are generally underfit…
This is an interesting paper, but the conclusion and tweet are just plain wrong. The "least overfit" models also happen to be the biggest models, which are the most likely to do well on any new benchmark 🤔 The author even says this in the paper. https://t.co/Ag2KGwbK0h
How overfit are popular LLMs on public benchmarks? New research out of @scale_ai's SEAL team aims to answer this: - produced a new eval, GSM1k - evaluated public LLMs for overfitting on GSM8k VERDICT: Mistral & Phi are overfitting benchmarks, while GPT, Claude, Gemini, and Llama are not. https://t.co/hRhcNQWo93
Data contamination is a huge problem for LLM evals right now. At Scale, we created a new test set for GSM8k *from scratch* to measure overfitting and found evidence that some models (most notably Mistral and Phi) do substantially worse on this new test set compared to GSM8k. https://t.co/JgPQUaYsEc
Scale AI presents A Careful Examination of LLM Performance on Grade School Arithmetic - Evaluate existing LLMs on a new test set of GSM8K - Observe accuracy drops of up to 13%, with models like Phi and Mistral showing evidence of systematic overfitting https://t.co/XFPOF35l5X https://t.co/sSSA5ncYxu
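The posts above all describe the same comparison: score each model on the original GSM8k test set and on the newly built GSM1k, then treat a large accuracy drop as a signal of benchmark contamination/overfitting. A minimal sketch of that comparison (all model names and scores below are made-up placeholders, not the paper's numbers, and the 5-point flag threshold is an illustrative assumption):

```python
# Sketch of the GSM8k-vs-GSM1k comparison described in the posts above.
# All scores and model names are hypothetical placeholders for illustration.

def accuracy_drop(gsm8k_acc: float, gsm1k_acc: float) -> float:
    """Drop in percentage points going from GSM8k to the held-out GSM1k."""
    return round((gsm8k_acc - gsm1k_acc) * 100, 1)

# Placeholder fraction-correct scores, not real results.
scores = {
    "model_a": {"gsm8k": 0.80, "gsm1k": 0.67},  # large drop -> suspect
    "model_b": {"gsm8k": 0.92, "gsm1k": 0.91},  # small drop -> looks clean
}

for name, s in scores.items():
    drop = accuracy_drop(s["gsm8k"], s["gsm1k"])
    # Assumed threshold: flag drops of 5+ points as possible overfitting.
    flag = "possible overfitting" if drop >= 5 else "no strong evidence"
    print(f"{name}: drop {drop} pts ({flag})")
```

A model that merely memorized GSM8k-style items would score high on GSM8k but fall sharply on GSM1k, producing the up-to-13-point drops the paper reports for some model families.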