New research from Scale AI evaluates popular Large Language Models (LLMs) for overfitting using a new test set modeled on GSM8k. Models like Mistral and Phi show evidence of systematic overfitting, with accuracy drops of up to 13%. Other models like GPT, Claude, Gemini, and Llama are not overfitting benchmarks.
On GSM1k (a new Grade School Math benchmark), leading open- and closed-source LLMs show accuracy drops of up to 13%, with several families of models (e.g., Phi and Mistral) showing evidence of systematic overfitting across almost all model sizes. "A… https://t.co/hpNg7DXl6u
This AI Paper by Scale AI Introduces GSM1k for Measuring Reasoning Accuracy in Large Language Models (LLMs) Quick read: https://t.co/VDybXCPgKg Researchers from Scale AI have introduced GSM1k, a new benchmark created to measure overfitting and reasoning capabilities in LLMs. The… https://t.co/tdhfqLHf7m
Interesting to see people try to quantify the overfitting we see on certain LLM problems! Some models appear badly contaminated/overfit, but the flagship models - Google's Gemini, OpenAI's GPT, Anthropic's Claude - seem pretty safe, at least on elementary school level math. https://t.co/gedcLW6uk2 https://t.co/JgFXuD2dvW
How overfit are popular LLMs on public benchmarks? New research from @scale_AI tries to figure this out with a new evaluation benchmark - GSM1k https://t.co/YqN4rVEPU9
Interesting to think that ML models are most performant when overfitted, whereas traditional models are generally underfit…
This is an interesting paper, but the conclusion and tweet are just plain wrong. The "least overfit" models also happen to be the biggest models, which are the most likely to do well on any new benchmark 🤔 The author even says this in the paper. https://t.co/Ag2KGwbK0h
How overfit are popular LLMs on public benchmarks? New research out of @scale_ai's SEAL team aims to answer this: - produced a new eval, GSM1k - evaluated public LLMs for overfitting on GSM8k VERDICT: Mistral & Phi are overfitting benchmarks, while GPT, Claude, Gemini, and Llama are not. https://t.co/hRhcNQWo93
Data contamination is a huge problem for LLM evals right now. At Scale, we created a new test set for GSM8k *from scratch* to measure overfitting and found evidence that some models (most notably Mistral and Phi) do substantially worse on this new test set compared to GSM8k. https://t.co/JgPQUaYsEc
Scale AI presents A Careful Examination of LLM Performance on Grade School Arithmetic - Evaluate existing LLMs on a new test set of GSM8K - Observe accuracy drops of up to 13%, with models like Phi and Mistral showing evidence of systematic overfitting https://t.co/XFPOF35l5X https://t.co/sSSA5ncYxu
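The posts above all describe the same comparison: score each model on the original GSM8k test set and on the newly built GSM1k, then treat a large accuracy drop as a signal of benchmark contamination/overfitting. A minimal sketch of that comparison (all model names and scores below are made-up placeholders, not the paper's numbers, and the 5-point flag threshold is an illustrative assumption):

```python
# Sketch of the GSM8k-vs-GSM1k comparison described in the posts above.
# All scores and model names are hypothetical placeholders for illustration.

def accuracy_drop(gsm8k_acc: float, gsm1k_acc: float) -> float:
    """Drop in percentage points going from GSM8k to the held-out GSM1k."""
    return round((gsm8k_acc - gsm1k_acc) * 100, 1)

# Placeholder fraction-correct scores, not real results.
scores = {
    "model_a": {"gsm8k": 0.80, "gsm1k": 0.67},  # large drop -> suspect
    "model_b": {"gsm8k": 0.92, "gsm1k": 0.91},  # small drop -> looks clean
}

for name, s in scores.items():
    drop = accuracy_drop(s["gsm8k"], s["gsm1k"])
    # Assumed threshold: flag drops of 5+ points as possible overfitting.
    flag = "possible overfitting" if drop >= 5 else "no strong evidence"
    print(f"{name}: drop {drop} pts ({flag})")
```

A model that merely memorized GSM8k-style items would score high on GSM8k but fall sharply on GSM1k, producing the up-to-13-point drops the paper reports for some model families.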