Hugging Face has released a groundbreaking technical report on FineWeb, an open-source dataset for large language model (LLM) training. The report details the creation and curation process, starting with raw data collected from 96 CommonCrawl snapshots. Key insights include the importance of balancing deduplication, which can remove signal from high-quality texts if applied too aggressively, and the effectiveness of BERT-style classifiers trained on synthetic data. The team, including Thom Wolf and Guilherme Penedo, applied a series of filters to refine the dataset, making it a valuable resource for researchers and developers in the AI and machine learning community. The project is hosted under the HuggingFaceFW organization.
The FineWeb Technical Report was released! Here is how the @huggingface team created FineWeb, the best open-source dataset: 1. Collect raw data: use CommonCrawl as the starting point; 96 CommonCrawl snapshots were used for FineWeb. 2. URL filtering: apply URL filtering using a… https://t.co/5NqP07ZPnu
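The URL-filtering step mentioned above can be sketched as a simple blocklist check. This is an illustrative toy, not the actual FineWeb implementation; the domain list, keyword list, and helper names are all hypothetical placeholders.

```python
# Hypothetical sketch of blocklist-based URL filtering, an early step in a
# FineWeb-style pipeline. Blocklist entries here are placeholders.
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"spam.example", "adult.example"}  # placeholder entries
BLOCKED_KEYWORDS = ("casino", "lottery")             # placeholder entries

def keep_url(url: str) -> bool:
    """Return True if the URL passes the blocklist filter."""
    domain = urlparse(url).netloc.lower()
    if domain in BLOCKED_DOMAINS:
        return False
    return not any(kw in url.lower() for kw in BLOCKED_KEYWORDS)

# Keep only documents whose source URL survives the filter.
docs = [
    {"url": "https://blog.example/post", "text": "..."},
    {"url": "https://spam.example/win", "text": "..."},
]
kept = [d for d in docs if keep_url(d["url"])]
```

In a real pipeline this filter runs before any text processing, since dropping a document by URL is far cheaper than parsing and scoring its content.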
This is one of the most incredible papers on large scale data processing & LLM training Huge thanks + kudos to @GuiPendino & the @huggingface team for sharing so many valuable insights A must read if you’re in the space. https://t.co/fBf2wpAwyc
Revolutionary work from Hugging Face. Nothing less. --- FineWeb 🍷 | TL;DR: 0. Pretraining is far less intuitive than instruction finetuning 1. It's unclear which data to include to boost performance 2. HF's team simply tried different filters [3] -> trained small models on the filtered… https://t.co/OOHQyMhDN1 https://t.co/bzLtbi3UIY
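The ablation methodology described in that thread — try a filter, train a small model on the filtered data, compare benchmark scores — can be sketched as a loop. A minimal sketch under stated assumptions: `train_small_model` and `evaluate` are hypothetical stand-ins, not real FineWeb code.

```python
# Illustrative sketch of filter ablation: score each candidate filter by the
# benchmark score of a small model trained on the filtered dataset.
# train_small_model and evaluate are caller-supplied stand-ins.

def ablate(dataset, candidate_filters, train_small_model, evaluate):
    """Rank candidate filters best-first by downstream evaluation score.

    dataset           -- list of documents
    candidate_filters -- dict mapping filter name -> keep(doc) predicate
    train_small_model -- callable: filtered docs -> trained model
    evaluate          -- callable: model -> benchmark score (higher is better)
    """
    results = []
    for name, keep_fn in candidate_filters.items():
        filtered = [doc for doc in dataset if keep_fn(doc)]
        model = train_small_model(filtered)
        results.append((name, evaluate(model)))
    return sorted(results, key=lambda r: r[1], reverse=True)
```

The point of using small models is cost: each candidate filter gets a cheap proxy run, and only the winning configuration is scaled up.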
the fine folks @huggingface have just published their guide to building 🍷FineWeb, a fully open-source training dataset for LLMs. it makes for a fun and educational read. thank you @thom_wolf and team https://t.co/FN2PPgoeQj
This article provides a great window into how training data for LLMs is created and curated. FineWeb: decanting the web for the finest text data at scale - a Hugging Face Space by HuggingFaceFW https://t.co/Zh1QzYsvbB
Amazing release for everyone interested in LLMs and pretraining data. Two major takeaways: 1. Too much deduplication is unhelpful, as you remove signal from higher-quality texts (which are more likely to be reprinted) 2. BERT classifiers trained on synthetic data rule! https://t.co/aXVjQ1hkni
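The fuzzy deduplication behind takeaway 1 is typically done with MinHash signatures. A toy pure-Python sketch of the idea follows; real pipelines use tuned shingle sizes, many more hash functions, and LSH banding to avoid pairwise comparisons, so treat this only as an illustration of how signature agreement estimates Jaccard similarity.

```python
# Toy MinHash sketch for fuzzy deduplication. Each document is reduced to a
# fixed-size signature; the fraction of matching signature slots estimates
# the Jaccard similarity of the documents' shingle sets.
import hashlib

NUM_HASHES = 64  # signature length; larger = more accurate estimate

def shingles(text: str, n: int = 3) -> set[str]:
    """Overlapping n-word shingles of the text."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash(text: str) -> list[int]:
    """MinHash signature: per seed, the minimum hash over all shingles."""
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text)
        ))
    return sig

def similarity(a: str, b: str) -> float:
    """Estimated Jaccard similarity from the two signatures."""
    sa, sb = minhash(a), minhash(b)
    return sum(x == y for x, y in zip(sa, sb)) / NUM_HASHES
```

Documents whose estimated similarity exceeds a threshold are treated as duplicates and all but one copy is dropped; the tweet's caution is that an aggressive threshold also discards widely reprinted high-quality texts.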