Hugging Face has released a groundbreaking technical report on FineWeb, an open-source dataset for large language model (LLM) training. The report details the creation and curation process, starting with raw data collected from 96 CommonCrawl snapshots. Key insights include the importance of balancing deduplication, which can remove signal from high-quality texts if applied too aggressively, and the effectiveness of BERT-style classifiers trained on synthetic data. The team, including Thom Wolf and Guilherme Penedo, applied a series of filters to refine the dataset, making it a valuable resource for researchers and developers in the AI and machine learning community. The project is hosted under the HuggingFaceFW organization.
The FineWeb Technical Report was released! Here is how the @huggingface team created FineWeb, the best open-source dataset: 1. Collect raw data: use CommonCrawl as the starting point; 96 CommonCrawl snapshots were used for FineWeb. 2. URL filtering: apply URL filtering using a… https://t.co/5NqP07ZPnu
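The URL-filtering step mentioned above can be sketched as a simple blocklist check. This is an illustrative toy, not the actual FineWeb implementation; the domain list, keyword list, and helper names are all hypothetical placeholders.

```python
# Hypothetical sketch of blocklist-based URL filtering, an early step in a
# FineWeb-style pipeline. Blocklist entries here are placeholders.
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"spam.example", "adult.example"}  # placeholder entries
BLOCKED_KEYWORDS = ("casino", "lottery")             # placeholder entries

def keep_url(url: str) -> bool:
    """Return True if the URL passes the blocklist filter."""
    domain = urlparse(url).netloc.lower()
    if domain in BLOCKED_DOMAINS:
        return False
    return not any(kw in url.lower() for kw in BLOCKED_KEYWORDS)

# Keep only documents whose source URL survives the filter.
docs = [
    {"url": "https://blog.example/post", "text": "..."},
    {"url": "https://spam.example/win", "text": "..."},
]
kept = [d for d in docs if keep_url(d["url"])]
```

In a real pipeline this filter runs before any text processing, since dropping a document by URL is far cheaper than parsing and scoring its content.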
This is one of the most incredible papers on large scale data processing & LLM training Huge thanks + kudos to @GuiPendino & the @huggingface team for sharing so many valuable insights A must read if you’re in the space. https://t.co/fBf2wpAwyc
Revolutionary work from Hugging Face. Nothing less. --- FineWeb 🍷 | TL;DR: 0. Pretraining is far less intuitive than instruction finetuning 1. It's unclear which data to include to boost performance 2. HF's team simply tried different filters [3] -> trained small models on the filtered… https://t.co/OOHQyMhDN1 https://t.co/bzLtbi3UIY
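The ablation methodology described in that thread — try a filter, train a small model on the filtered data, compare benchmark scores — can be sketched as a loop. A minimal sketch under stated assumptions: `train_small_model` and `evaluate` are hypothetical stand-ins, not real FineWeb code.

```python
# Illustrative sketch of filter ablation: score each candidate filter by the
# benchmark score of a small model trained on the filtered dataset.
# train_small_model and evaluate are caller-supplied stand-ins.

def ablate(dataset, candidate_filters, train_small_model, evaluate):
    """Rank candidate filters best-first by downstream evaluation score.

    dataset           -- list of documents
    candidate_filters -- dict mapping filter name -> keep(doc) predicate
    train_small_model -- callable: filtered docs -> trained model
    evaluate          -- callable: model -> benchmark score (higher is better)
    """
    results = []
    for name, keep_fn in candidate_filters.items():
        filtered = [doc for doc in dataset if keep_fn(doc)]
        model = train_small_model(filtered)
        results.append((name, evaluate(model)))
    return sorted(results, key=lambda r: r[1], reverse=True)
```

The point of using small models is cost: each candidate filter gets a cheap proxy run, and only the winning configuration is scaled up.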
the fine folks @huggingface have just published their guide to building 🍷FineWeb, a fully open-source training dataset for LLMs. it makes for a fun and educational read. thank you @thom_wolf and team https://t.co/FN2PPgoeQj
This article provides a great window into how training data for LLMs is created and curated. FineWeb: decanting the web for the finest text data at scale - a Hugging Face Space by HuggingFaceFW https://t.co/Zh1QzYsvbB
Amazing release for everyone interested in LLMs and pretraining data. Two major takeaways: 1. Too much deduplication is unhelpful, as you remove signal from higher-quality texts (which are more likely to be reprinted) 2. BERT classifiers trained on synthetic data rule! https://t.co/aXVjQ1hkni
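The fuzzy deduplication behind takeaway 1 is typically done with MinHash signatures. A toy pure-Python sketch of the idea follows; real pipelines use tuned shingle sizes, many more hash functions, and LSH banding to avoid pairwise comparisons, so treat this only as an illustration of how signature agreement estimates Jaccard similarity.

```python
# Toy MinHash sketch for fuzzy deduplication. Each document is reduced to a
# fixed-size signature; the fraction of matching signature slots estimates
# the Jaccard similarity of the documents' shingle sets.
import hashlib

NUM_HASHES = 64  # signature length; larger = more accurate estimate

def shingles(text: str, n: int = 3) -> set[str]:
    """Overlapping n-word shingles of the text."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash(text: str) -> list[int]:
    """MinHash signature: per seed, the minimum hash over all shingles."""
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text)
        ))
    return sig

def similarity(a: str, b: str) -> float:
    """Estimated Jaccard similarity from the two signatures."""
    sa, sb = minhash(a), minhash(b)
    return sum(x == y for x, y in zip(sa, sb)) / NUM_HASHES
```

Documents whose estimated similarity exceeds a threshold are treated as duplicates and all but one copy is dropped; the tweet's caution is that an aggressive threshold also discards widely reprinted high-quality texts.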