Recent research conducted by Scale AI's SEAL team has produced a new evaluation set, GSM1k, aimed at assessing the overfitting of popular large language models (LLMs) on the commonly used GSM8k benchmark. The study revealed that models such as Mistral and Phi exhibit signs of overfitting, whereas others like GPT, Claude, Gemini, and Llama do not. This initiative highlights potential data contamination issues in some model families, suggesting that their performance on public benchmarks may not accurately reflect their real-world capabilities.
Introducing ChatQA-1.5, a family of models (Llama3-ChatQA-1.5-8B and Llama3-ChatQA-1.5-70B) that excel at conversational QA and RAG. We also open source our instruction tuning data, ChatRAG Bench for evaluation, and a multi-turn QA retriever. Link: https://t.co/2uxHQnfzZB https://t.co/gR9AOiHZLJ
Introducing ChatQA-1.5, a family of models that surpasses GPT-4-0613 and Command-R-Plus on RAG and conversational QA. ChatQA-1.5 has two variants: Llama3-ChatQA-1.5-8B, https://t.co/H7JvIFCD48 Llama3-ChatQA-1.5-70B, https://t.co/Ao3Yw8ECxA We also open source our instruction…
Don't worry, we didn't forget about 70b🥳 Take a look at the first @AIatMeta Llama-3 70b model with a context length of 262K - scoring a perfect retrieval for NIAH! We included an extensive proprietary chat dataset to give the model chat ability over long sequences as well. A… https://t.co/toYrUUTRJH
We're excited to see @ArtificialAnlys 's newly launched leaderboard on @Huggingface with @GroqInc continuing to set the bar for throughput (tokens/s). Groq's throughput for Llama 3 70B exceeds what the vast majority of providers can deliver for Llama 3 8B https://t.co/wa5gqnEBfe
Alibaba Cloud's Model Studio now supports Llama3, Meta's latest open-source LLM #70B #8B #AI #AIdevelopmentplatform #AItechnology #AlibabaCloud #artificialintelligence #deployment #generativeAIapplicationdevelopment #inferenceservices https://t.co/ukT3HStjS5 https://t.co/egFsmSfg8i
🚀 The first fine-tuned models to score higher than Llama-3-70B while achieving the best MMLU/GSM8K at the same time! - 3 of the top 10 spots on the Open LLM Leaderboard are now held by these fine-tuned models - Achieved the highest MMLU / GSM8K on @huggingface Leaderboard https://t.co/UeVaHPOe6t
Hermes-2-Pro-Llama-3-8b The first Llama 3 fine tune from @NousResearch is up on the OctoAI Text Gen solution👏 Give it a try at https://t.co/GvHlMIRz3H🌟 https://t.co/nm7dRwMGlg
We're bringing our benchmarking leaderboard of >100 LLM API endpoints to @huggingface! Speed and price are just as important as quality when building applications with LLMs. We bring together all the data you need to consider all three when you need to pick a model and API… https://t.co/Xk2oe8uzeN
With Meta Llama, we're democratizing access to AI models, tools and resources, helping developers shape the next wave of innovation. Read our guide to get started yourself: https://t.co/PgsClvqvx4 https://t.co/xc5vM43KAH
Some awesome new models in the 🤗 MLX community: - Hermes 2 Pro Llama 3 by @NousResearch - Llama 3 OpenBio LLM for medical domain by @aadityaura - Llama3-ChatQA-1.5-8B by @nvidia All here: https://t.co/dUgErUXnM3 h/t @vkash16, @lucataco93, @Prince_Canuma for conversions!
Meta’s Llama3 AI: ChatGPT Intelligence… For Free! https://t.co/P4BTfuNjxP
Actually pretty insane: the base model scores higher than Llama 3 70B Instruct (aka "The Beast") on MMLU. A good finetune of Qwen1.5-110B could become SOTA open-source (or at least on par with L3). https://t.co/5keCMODqBw https://t.co/2HG5aDPkqt
This AI Paper Introduces Llama-3-8B-Instruct-80K-QLoRA: New Horizons in AI Contextual Understanding Quick read: https://t.co/VYNUBypNoj Researchers from the Beijing Academy of Artificial Intelligence and the Renmin University of China have introduced…
Fully local advanced RAG with reranking: Llama 3, bge embeddings, miniLM reranker, @qdrant_client Check it out 👇 https://t.co/RA3BwX7PEu https://t.co/L7Gh1EMVch
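The stack above is a classic two-stage pipeline: a fast vector search (bge embeddings in Qdrant) produces a candidate shortlist, then a slower cross-encoder (the miniLM reranker) rescores it. A minimal, dependency-free sketch of that control flow — with toy term-count "embeddings" and word overlap standing in for the real models — looks like this:

```python
import math

# Toy "embedding": term counts over a fixed vocabulary. A real local stack
# would use bge embeddings here; this stand-in keeps the sketch dependency-free.
VOCAB = ["llama", "vector", "database", "model", "banana", "rerank"]

def embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=2):
    # Stage 1: fast vector search over the whole corpus (Qdrant's job at scale)
    qv = embed(query)
    ranked = sorted(range(len(docs)), key=lambda i: cosine(qv, embed(docs[i])), reverse=True)
    return ranked[:k]

def rerank(query, docs, candidates):
    # Stage 2: a more accurate scorer rescores only the shortlist
    # (a cross-encoder such as a miniLM reranker in the real pipeline);
    # plain word overlap stands in for the cross-encoder score here.
    overlap = lambda d: len(set(query.lower().split()) & set(d.lower().split()))
    return max(candidates, key=lambda i: overlap(docs[i]))

docs = [
    "llama 3 is an open-weight model",
    "qdrant is a vector database",
    "a banana is a yellow fruit",
]
query = "open vector database for search"
best = rerank(query, docs, retrieve(query, docs))
```

The design point is that the expensive scorer only ever sees `k` candidates, so reranking quality improves without paying cross-encoder cost over the whole corpus.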
We benchmarked @Meta Llama3 @databricks DBRX @MistralAI Mixtral-8x22b and @OpenAI GPT4 on our Product Catalog Q&A dataset! 1⃣ Llama-3-70b matches GPT4's pace! 2⃣ Llama-3-8b and Mixtral-8x22b have almost identical performance. 3⃣ DBRX is definitely the chattiest of the bunch 😉 https://t.co/wkM3jEE5bL
How overfit are popular LLMs on public benchmarks? New research from @scale_AI tries to figure this out with a new evaluation benchmark - GSM1K https://t.co/YqN4rVEPU9
Nvidia released a new model, Llama3-ChatQA-1.5, which excels at conversational question answering (QA) and retrieval-augmented generation (RAG). The 8B version outperforms the base Llama3-70B model on the ConvRAG benchmark. https://t.co/Eyu7BtuMio
Want to learn how to build a sophisticated question-answering (Q&A) chatbot using RAG (Retrieval Augmented Generation) with @Ollama, @LangChainAI, @milvusio and Llama 3? You're in luck, this step-by-step tutorial shows you the code each step of the way. https://t.co/Icc4bHy4vt
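At its core, a RAG chatbot like the one in that tutorial is retrieve → build prompt → generate. Below is a sketch of that loop with the real components stubbed out: `search` stands in for a Milvus similarity query and `llm` for an Ollama-served Llama 3 call (the stand-in functions and the `KB` corpus are illustrative, not the tutorial's code):

```python
def build_prompt(question, contexts):
    # Ground the model: retrieved passages go into the prompt as numbered context.
    context_block = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )

def answer(question, search, llm, k=2):
    contexts = search(question, k)                 # retrieval step (Milvus in the tutorial)
    return llm(build_prompt(question, contexts))   # generation step (Ollama + Llama 3)

# Toy stand-ins so the sketch runs end to end.
KB = ["Milvus stores vectors.", "Llama 3 generates text.", "Paris is in France."]
search = lambda q, k: [d for d in KB if any(w in d.lower() for w in q.lower().split())][:k]
llm = lambda prompt: prompt.rsplit("Question:", 1)[0]  # echo stub: returns the grounded context
reply = answer("Where is Paris?", search, llm)
```

Swapping the stubs for a real vector store and a real model changes nothing about the control flow, which is the point of the pattern.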
Hermes 2 Pro on Llama-3 8B just released. 🔥 @NousResearch released their first Llama-3 fine-tune, outperforming Meta's Llama 3 8B Instruct. 🤯 Hermes Pro is optimized for Function Calling and Structured Output capabilities using a dedicated token for tool calls. TL;DR: 🧠 based… https://t.co/bYmGO0YXZx
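The dedicated tool-call token makes the model's output machine-parseable: the application extracts the call, runs the function, and feeds the result back. Assuming a Hermes-style format of a JSON body wrapped in `<tool_call>` tags (an assumption here, not verified against the model card), a minimal parser and dispatcher could look like this — `get_weather` is a hypothetical tool, not a real API:

```python
import json
import re

# Format assumption: tool calls arrive as JSON wrapped in <tool_call> tags.
TOOL_CALL = re.compile(r"<tool_call>\s*(\{.*\})\s*</tool_call>", re.DOTALL)

# Hypothetical tool registry; get_weather is illustrative only.
TOOLS = {"get_weather": lambda city: f"22C and sunny in {city}"}

def dispatch(model_output):
    m = TOOL_CALL.search(model_output)
    if m is None:
        return model_output  # plain answer: the model requested no tool
    call = json.loads(m.group(1))
    return TOOLS[call["name"]](**call["arguments"])

result = dispatch(
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
)
```

Delimiting the call with a dedicated token (rather than hoping the model emits clean JSON mid-sentence) is what makes this extraction reliable enough to automate.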
Great work from @mplappert @dpkingma on Llama3 and what it means for the broader AI/LLM landscape. https://t.co/a4tnqdIjTg
We did a research deep dive into @Meta's recently released Llama 3 model family. We find that these models are very strong, sometimes performing on par with frontier models like GPT-4, Gemini 1.5, and Claude 3, which is an impressive achievement. https://t.co/37h9ZR509e
AI21 just released the new Jamba-Instruct, an instruction-tuned SSM-Transformer model. It's competitive with Llama3 and Mixtral 8x7b, but it has a 256K context window (roughly a 400-page novel). https://t.co/zbBr7XDamw
Nvidia has published a competitive llama3-70b question answering and retrieval-augmented generation (RAG) fine-tune - ChatQA-1.5 Benchmarks look good, but it's missing that "Llama 3" in the name as per Meta's requirements 😃 https://t.co/KiNHtgyCd3 https://t.co/eydmnYf5o8
Llama 3 70B takes the king’s crown 👑 from GPT-4 Turbo - 100% in coverage and 70% in quality of code - 💵 most cost-effective inference - open-weight We tested @OpenAI's GPT 3.5 & 4, @Meta's Llama 3, @Google's Gemini 1.5 Pro, @cohere's Command R+ and 130+ other LLMs https://t.co/2ZGq8OTW4h
🤩Llama-3-8B-Instruct-80K-QLoRA, an extension of Llama-3-8B-Instruct with an increased 𝐜𝐨𝐧𝐭𝐞𝐱𝐭 𝐥𝐞𝐧𝐠𝐭𝐡 𝐨𝐟 𝟖𝟎𝐊 using QLoRA & GPT-4 synthesized training data 📊Beats Llama3-8b-8k on the majority of long-context benchmarks - Needle in a Haystack, LongBench & InfiniteBench https://t.co/xVkf7ljV2q
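The Needle-in-a-Haystack (NIAH) evaluation mentioned above hides one fact (the "needle") at varying depths inside long filler text and asks the model to retrieve it. A minimal harness for that idea — with a trivial string-search stand-in for the model, since the point is the harness shape, not the model — could look like:

```python
def make_haystack(needle, filler, n_sentences, depth):
    # Place the needle at a relative depth (0.0 = start, 1.0 = end) of the filler.
    sentences = [filler] * n_sentences
    sentences.insert(int(depth * n_sentences), needle)
    return " ".join(sentences)

def niah_score(model, needle, question, answer, depths, n_sentences=200):
    # Fraction of depths at which the model's reply contains the expected answer.
    filler = "The sky was a uniform shade of grey that afternoon."
    hits = sum(
        answer in model(make_haystack(needle, filler, n_sentences, d) + "\n\n" + question)
        for d in depths
    )
    return hits / len(depths)

# Trivial stand-in "model" that answers by string search, just to exercise the harness.
mock_model = lambda prompt: "7042" if "7042" in prompt else "not found"
score = niah_score(
    mock_model,
    needle="The magic number is 7042.",
    question="What is the magic number?",
    answer="7042",
    depths=[0.0, 0.25, 0.5, 0.75, 1.0],
)
```

Real NIAH runs sweep both depth and total context length, producing the familiar retrieval heatmap; a "perfect retrieval" claim means a hit at every cell.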
Code generation and safer AI are the highlights of @Meta's Llama 3 AI model, which Meta recently launched. 🦙🌐 Will it become the most suitable AI tool for XR development? #MetaLlama3 #Llama3 #AI #CodeGeneration #AISafety #XRAI #MetaAI @AIatMeta https://t.co/w6yqV3wDLw
Claude 3 Opus, the best model in the market with a 200k context length, and Llama 3, the best open-source model with capabilities close to GPT-4 Turbo at 1/6th the cost, are now available. https://t.co/3rNlPiX3x5
Explore @Meta's #LLaMA3, the newest addition to the world of open-source LLMs, in our blog post. Discover LLaMA 3's unparalleled capabilities, from its state-of-the-art performance to its responsible AI approach, set to revolutionise the AI landscape. This infographic… https://t.co/MpSrJDsa4w
New work from Scale where they created a GSM8k-equivalent-difficulty eval from scratch. The resulting performance gap shows that some model families have data contamination issues and may not be as strong as the public eval would indicate https://t.co/hAb9fdMblZ
How overfit are popular LLMs on public benchmarks? New research out of @scale_ai SEAL to answer this: - produced a new eval GSM1k - evaluated public LLMs for overfitting on GSM8k VERDICT: Mistral & Phi are overfitting benchmarks, while GPT, Claude, Gemini, and Llama are not. https://t.co/hRhcNQWo93
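The overfitting signal in this setup is simply the accuracy drop from the public benchmark (GSM8k) to the freshly written equivalent (GSM1k): a large positive gap suggests the public set leaked into training data, while a gap near zero means the score generalizes. As a sketch (the numbers below are illustrative, not figures from the paper):

```python
def overfit_gap(public_acc, heldout_acc):
    # Percentage-point drop from the public benchmark to the fresh equivalent.
    # Large positive gap => likely contamination; near zero => score generalizes.
    return round((public_acc - heldout_acc) * 100, 1)

# Illustrative accuracies only, not results from the GSM1k paper.
gap = overfit_gap(0.80, 0.68)
```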
Nice public service to evals from Scale! When evaluated on a new grade-school math test set comparable to the commonly benchmarked GSM8k, many models drop in accuracy by a significant margin. https://t.co/l9RMTvdvAo