Hugging Face has launched an open-source AI assistant maker to rival OpenAI's custom GPTs. Separately, it has released FineWeb, a large-scale dataset for LLM pretraining (15 trillion tokens, 44TB on disk) built from CommonCrawl snapshots, together with a FineWeb-Edu subset and a technical report that explains the design choices behind the data pipeline.
HuggingFace Releases 🍷 FineWeb: A New Large-Scale (15-Trillion Tokens, 44TB Disk Space) Dataset for LLM Pretraining. Hugging Face has introduced 🍷 FineWeb, a comprehensive dataset designed to enhance the training of large language models (LLMs). Published on May 31, 2024, this…
Is Fineweb-edu the best open text dataset ever released? A big step in empowering all companies to train their own GPT5! https://t.co/fSEngn3Eou https://t.co/Z8YJRQzB7N
🍷Preparing Fineweb - A Finely Cleaned Common Crawl Dataset🍷 Credit to @RealGDT, @HKydlicek, @LoubnaBenAllal1, @anton_lozhkov, @colinraffel, @lvwerra, @Thom_Wolf of @huggingface for the fine dataset and blog. TIMESTAMPS: 0:00 Common Crawl Data Processing Pipeline 0:42 Video… https://t.co/7PRsNm4G6B
Hugging Face just hit a whole new level of "democratization of ML" 👀 https://t.co/uHH2gsDXWm
Excellent work by the amazing @huggingface team on providing the highest quality truly open dataset for pre-training!! https://t.co/ESH9scLyzH
FineWeb Technical Report was released! Here is how the @huggingface team created FineWeb, the best open-source dataset: 1. Collect Raw Data: Use CommonCrawl as the starting point; 96 CommonCrawl snapshots were used for FineWeb. 2. URL Filtering: Apply URL filtering using a… https://t.co/5NqP07ZPnu
This is one of the most incredible papers on large scale data processing & LLM training. Huge thanks + kudos to @GuiPendino & the @huggingface team for sharing so many valuable insights. A must read if you’re in the space. https://t.co/fBf2wpAwyc
"TLDR: This blog covers a discussion on processing and evaluating data quality at scale, the 🍷 FineWeb recipe (listing and explaining all of our design choices), and the process followed to create its 📚 FineWeb-Edu subset." https://t.co/aNfCqox7sz
This article provides a great window into how training data for LLMs are created and curated. FineWeb: decanting the web for the finest text data at scale - a Hugging Face Space by HuggingFaceFW https://t.co/Zh1QzYsvbB
Hugging Face launches open source #AI assistant maker to rival OpenAI’s custom GPTs by @carlfranzen @VentureBeat Learn more: https://t.co/lQiPEvt36U #Chatbots #ML #ArtificialIntelligence #MI cc: @theadamgabriel @miketamir @karpathy https://t.co/90xeH9e1UV