Recent research conducted by Scale AI's SEAL team has produced a new evaluation set, GSM1k, aimed at assessing the overfitting of popular large language models (LLMs) on the commonly used GSM8k benchmark. The study revealed that models such as Mistral and Phi exhibit signs of overfitting, whereas others like GPT, Claude, Gemini, and Llama do not. This initiative highlights potential data contamination issues in some model families, suggesting that their performance on public benchmarks may not accurately reflect their real-world capabilities.
Introducing ChatQA-1.5, a family of models (Llama3-ChatQA-1.5-8B and Llama3-ChatQA-1.5-70B) that excel at conversational QA and RAG. We also open source our instruction tuning data, ChatRAG Bench for evaluation, and a multi-turn QA retriever. Link: https://t.co/2uxHQnfzZB https://t.co/gR9AOiHZLJ
Introducing ChatQA-1.5, a family of models that surpasses GPT-4-0613 and Command-R-Plus on RAG and conversational QA. ChatQA-1.5 has two variants: Llama3-ChatQA-1.5-8B, https://t.co/H7JvIFCD48 Llama3-ChatQA-1.5-70B, https://t.co/Ao3Yw8ECxA We also open source our instruction…
Don't worry, we didn't forget about 70b🥳 Take a look at the first @AIatMeta Llama-3 70b model with a context length of 262K - scoring a perfect retrieval for NIAH! We included an extensive proprietary chat dataset to give the model chat ability over long sequences as well. A… https://t.co/toYrUUTRJH
We're excited to see @ArtificialAnlys 's newly launched leaderboard on @Huggingface with @GroqInc continuing to set the bar for throughput (tokens/s). Groq's throughput for Llama 3 70B exceeds what the vast majority of providers can deliver for Llama 3 8B https://t.co/wa5gqnEBfe
Alibaba Cloud's Model Studio now supports Llama3, Meta's latest open-source LLM #70B #8B #AI #AIdevelopmentplatform #AItechnology #AlibabaCloud #artificialintelligence #deployment #generativeAIapplicationdevelopment #inferenceservices https://t.co/ukT3HStjS5 https://t.co/egFsmSfg8i
🚀 The first fine-tuned models to score higher than Llama-3-70B while achieving the best MMLU/GSM8K at the same time! - 3 of the top 10 spots on the Open LLM Leaderboard are now held by these fine-tuned models - Achieved the highest MMLU / GSM8K on @huggingface Leaderboard https://t.co/UeVaHPOe6t
Hermes-2-Pro-Llama-3-8b The first Llama 3 fine tune from @NousResearch is up on the OctoAI Text Gen solution👏 Give it a try at https://t.co/GvHlMIRz3H🌟 https://t.co/nm7dRwMGlg
We're bringing our benchmarking leaderboard of >100 LLM API endpoints to @huggingface! Speed and price are just as important as quality when building applications with LLMs. We bring together all the data you need to consider all three when you need to pick a model and API… https://t.co/Xk2oe8uzeN
With Meta Llama, we're democratizing access to AI models, tools and resources, helping developers shape the next wave of innovation. Read our guide to get started yourself: https://t.co/PgsClvqvx4 https://t.co/xc5vM43KAH
Some awesome new models in the 🤗 MLX community: - Hermes 2 Pro Llama 3 by @NousResearch - Llama 3 OpenBio LLM for medical domain by @aadityaura - Llama3-ChatQA-1.5-8B by @nvidia All here: https://t.co/dUgErUXnM3 h/t @vkash16, @lucataco93, @Prince_Canuma for conversions!
Meta’s Llama3 AI: ChatGPT Intelligence… For Free! https://t.co/P4BTfuNjxP
Actually pretty insane: the base model scores higher than Llama 3 70B Instruct (aka "The Beast") on MMLU. A good finetune of Qwen1.5-110B could become SOTA open-source (or at least on par with L3). https://t.co/5keCMODqBw https://t.co/2HG5aDPkqt
This AI Paper Introduces Llama-3-8B-Instruct-80K-QLoRA: New Horizons in AI Contextual Understanding Quick read: https://t.co/VYNUBypNoj Researchers from the Beijing Academy of Artificial Intelligence and the Renmin University of China have introduced…
Fully local advanced RAG with reranking: Llama 3, bge embeddings, miniLM reranker, @qdrant_client Check it out 👇 https://t.co/RA3BwX7PEu https://t.co/L7Gh1EMVch
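The stack above is a classic two-stage pipeline: a fast vector search (bge embeddings in Qdrant) produces a candidate shortlist, then a slower cross-encoder (the miniLM reranker) rescores it. A minimal, dependency-free sketch of that control flow — with toy term-count "embeddings" and word overlap standing in for the real models — looks like this:

```python
import math

# Toy "embedding": term counts over a fixed vocabulary. A real local stack
# would use bge embeddings here; this stand-in keeps the sketch dependency-free.
VOCAB = ["llama", "vector", "database", "model", "banana", "rerank"]

def embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=2):
    # Stage 1: fast vector search over the whole corpus (Qdrant's job at scale)
    qv = embed(query)
    ranked = sorted(range(len(docs)), key=lambda i: cosine(qv, embed(docs[i])), reverse=True)
    return ranked[:k]

def rerank(query, docs, candidates):
    # Stage 2: a more accurate scorer rescores only the shortlist
    # (a cross-encoder such as a miniLM reranker in the real pipeline);
    # plain word overlap stands in for the cross-encoder score here.
    overlap = lambda d: len(set(query.lower().split()) & set(d.lower().split()))
    return max(candidates, key=lambda i: overlap(docs[i]))

docs = [
    "llama 3 is an open-weight model",
    "qdrant is a vector database",
    "a banana is a yellow fruit",
]
query = "open vector database for search"
best = rerank(query, docs, retrieve(query, docs))
```

The design point is that the expensive scorer only ever sees `k` candidates, so reranking quality improves without paying cross-encoder cost over the whole corpus.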
We benchmarked @Meta Llama3 @databricks DBRX @MistralAI Mixtral-8x22b and @OpenAI GPT4 on our Product Catalog Q&A dataset! 1⃣ Llama-3-70b matches GPT4's pace! 2⃣ Llama-3-8b and Mixtral-8x22b have almost identical performance. 3⃣ DBRX is definitely the chattiest of the bunch 😉 https://t.co/wkM3jEE5bL
How overfit are popular LLMs on public benchmarks? New research from @scale_AI tries to figure this out with a new evaluation benchmark - GSM1K https://t.co/YqN4rVEPU9
Nvidia released a new model, Llama3-ChatQA-1.5, which excels at conversational question answering (QA) and retrieval-augmented generation (RAG). The 8B version outperforms the base Llama3-70B model on the ConvRAG benchmark. https://t.co/Eyu7BtuMio
Want to learn how to build a sophisticated question-answering (Q&A) chatbot using RAG (Retrieval Augmented Generation) with @Ollama, @LangChainAI, @milvusio and Llama 3? You're in luck, this step-by-step tutorial shows you the code each step of the way. https://t.co/Icc4bHy4vt
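At its core, a RAG chatbot like the one in that tutorial is retrieve → build prompt → generate. Below is a sketch of that loop with the real components stubbed out: `search` stands in for a Milvus similarity query and `llm` for an Ollama-served Llama 3 call (the stand-in functions and the `KB` corpus are illustrative, not the tutorial's code):

```python
def build_prompt(question, contexts):
    # Ground the model: retrieved passages go into the prompt as numbered context.
    context_block = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )

def answer(question, search, llm, k=2):
    contexts = search(question, k)                 # retrieval step (Milvus in the tutorial)
    return llm(build_prompt(question, contexts))   # generation step (Ollama + Llama 3)

# Toy stand-ins so the sketch runs end to end.
KB = ["Milvus stores vectors.", "Llama 3 generates text.", "Paris is in France."]
search = lambda q, k: [d for d in KB if any(w in d.lower() for w in q.lower().split())][:k]
llm = lambda prompt: prompt.rsplit("Question:", 1)[0]  # echo stub: returns the grounded context
reply = answer("Where is Paris?", search, llm)
```

Swapping the stubs for a real vector store and a real model changes nothing about the control flow, which is the point of the pattern.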
Hermes 2 Pro on Llama-3 8B just released. 🔥 @NousResearch released their first Llama-3 fine-tune, outperforming Meta's Llama 3 8B Instruct. 🤯 Hermes Pro is optimized for Function Calling and Structured Output capabilities using a dedicated token for tool calls. TL;DR: 🧠 based… https://t.co/bYmGO0YXZx
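The dedicated tool-call token makes the model's output machine-parseable: the application extracts the call, runs the function, and feeds the result back. Assuming a Hermes-style format of a JSON body wrapped in `<tool_call>` tags (an assumption here, not verified against the model card), a minimal parser and dispatcher could look like this — `get_weather` is a hypothetical tool, not a real API:

```python
import json
import re

# Format assumption: tool calls arrive as JSON wrapped in <tool_call> tags.
TOOL_CALL = re.compile(r"<tool_call>\s*(\{.*\})\s*</tool_call>", re.DOTALL)

# Hypothetical tool registry; get_weather is illustrative only.
TOOLS = {"get_weather": lambda city: f"22C and sunny in {city}"}

def dispatch(model_output):
    m = TOOL_CALL.search(model_output)
    if m is None:
        return model_output  # plain answer: the model requested no tool
    call = json.loads(m.group(1))
    return TOOLS[call["name"]](**call["arguments"])

result = dispatch(
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
)
```

Delimiting the call with a dedicated token (rather than hoping the model emits clean JSON mid-sentence) is what makes this extraction reliable enough to automate.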
Great work from @mplappert @dpkingma on Llama3 and what it means for the broader AI/LLM landscape. https://t.co/a4tnqdIjTg
We did a research deep dive into @Meta's recently released Llama 3 model family. We find that these models are very strong, sometimes performing on par with frontier models like GPT-4, Gemini 1.5, and Claude 3, which is an impressive achievement. https://t.co/37h9ZR509e
AI21 just released the new Jamba-Instruct, an instruction-tuned SSM-Transformer model. It's competitive with Llama3 and Mixtral 8x7b, but it has a 256K context window (roughly a 400-page novel). https://t.co/zbBr7XDamw
Nvidia has published a competitive llama3-70b question answering and retrieval-augmented generation (RAG) fine-tune - ChatQA-1.5 Benchmarks look good, but it's missing that "Llama 3" in the name as per Meta's requirements 😃 https://t.co/KiNHtgyCd3 https://t.co/eydmnYf5o8
Llama 3 70B takes the king’s crown 👑 from GPT-4 Turbo - 100% in coverage and 70% in quality of code - 💵 most cost-effective inference - open-weight We tested @OpenAI's GPT 3.5 & 4, @Meta's Llama 3, @Google's Gemini 1.5 Pro, @cohere's Command R+ and 130+ other LLMs https://t.co/2ZGq8OTW4h
🤩Llama-3-8B-Instruct-80K-QLoRA, an extension of Llama-3-8B-Instruct with an increased 𝐜𝐨𝐧𝐭𝐞𝐱𝐭 𝐥𝐞𝐧𝐠𝐭𝐡 𝐨𝐟 𝟖𝟎𝐊 using QLoRA & GPT-4 synthesized training data 📊Beats Llama3-8b-8k on the majority of long-context benchmarks - Needle in a Haystack, LongBench & InfiniteBench https://t.co/xVkf7ljV2q
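The Needle-in-a-Haystack (NIAH) evaluation mentioned above hides one fact (the "needle") at varying depths inside long filler text and asks the model to retrieve it. A minimal harness for that idea — with a trivial string-search stand-in for the model, since the point is the harness shape, not the model — could look like:

```python
def make_haystack(needle, filler, n_sentences, depth):
    # Place the needle at a relative depth (0.0 = start, 1.0 = end) of the filler.
    sentences = [filler] * n_sentences
    sentences.insert(int(depth * n_sentences), needle)
    return " ".join(sentences)

def niah_score(model, needle, question, answer, depths, n_sentences=200):
    # Fraction of depths at which the model's reply contains the expected answer.
    filler = "The sky was a uniform shade of grey that afternoon."
    hits = sum(
        answer in model(make_haystack(needle, filler, n_sentences, d) + "\n\n" + question)
        for d in depths
    )
    return hits / len(depths)

# Trivial stand-in "model" that answers by string search, just to exercise the harness.
mock_model = lambda prompt: "7042" if "7042" in prompt else "not found"
score = niah_score(
    mock_model,
    needle="The magic number is 7042.",
    question="What is the magic number?",
    answer="7042",
    depths=[0.0, 0.25, 0.5, 0.75, 1.0],
)
```

Real NIAH runs sweep both depth and total context length, producing the familiar retrieval heatmap; a "perfect retrieval" claim means a hit at every cell.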
Code generation and safer AI are the highlights of @Meta's Llama 3 AI model, which Meta recently launched. 🦙🌐 Will it become the most suitable AI tool for XR development? #MetaLlama3 #Llama3 #AI #CodeGeneration #AISafety #XRAI #MetaAI @AIatMeta https://t.co/w6yqV3wDLw
Claude 3 Opus, the best model in the market with a 200k context length, and Llama 3, the best open-source model with capabilities close to GPT-4 Turbo at 1/6th the cost, are now available. https://t.co/3rNlPiX3x5
Explore @Meta's #LLaMA3, the newest addition to the world of open-source LLMs, in our blog post. Discover LLaMA 3's unparalleled capabilities, from its state-of-the-art performance to its responsible AI approach, set to revolutionise the AI landscape. This infographic… https://t.co/MpSrJDsa4w
New work from Scale where they created a GSM8k-equivalent-difficulty eval from scratch. The resulting performance gap shows that some model families have data contamination issues and may not be as strong as the public eval would indicate https://t.co/hAb9fdMblZ
How overfit are popular LLMs on public benchmarks? New research out of @scale_ai SEAL to answer this: - produced a new eval GSM1k - evaluated public LLMs for overfitting on GSM8k VERDICT: Mistral & Phi are overfitting benchmarks, while GPT, Claude, Gemini, and Llama are not. https://t.co/hRhcNQWo93
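The overfitting signal in this setup is simply the accuracy drop from the public benchmark (GSM8k) to the freshly written equivalent (GSM1k): a large positive gap suggests the public set leaked into training data, while a gap near zero means the score generalizes. As a sketch (the numbers below are illustrative, not figures from the paper):

```python
def overfit_gap(public_acc, heldout_acc):
    # Percentage-point drop from the public benchmark to the fresh equivalent.
    # Large positive gap => likely contamination; near zero => score generalizes.
    return round((public_acc - heldout_acc) * 100, 1)

# Illustrative accuracies only, not results from the GSM1k paper.
gap = overfit_gap(0.80, 0.68)
```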
Nice public service to evals from Scale! When evaluated on a new grade-school math test set comparable to the commonly benchmarked GSM8k, many models drop in accuracy by a significant margin. https://t.co/l9RMTvdvAo