Hugging Face has released the FineWeb technical report, detailing the creation of FineWeb, a large-scale, open-source English web dataset derived from CommonCrawl. FineWeb consists of 15 trillion tokens and occupies 44TB of disk space. The report also introduces FineWeb-Edu, a high-quality subset of FineWeb available in 1.3-trillion- and 5.4-trillion-token versions. FineWeb-Edu is specifically filtered for high educational content and has shown remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA. The dataset outperforms all other open web datasets and is released under the permissive ODC-By 1.0 license, allowing commercial use without excluding any field of endeavor.
The new FineWeb-Edu dataset from @huggingface shows once more that high-quality data leads to the best results in LLM trainings. https://t.co/xYrufV8Yae
FineWeb Technical Report was released! Here is how the @huggingface team created FineWeb, the best open-source dataset: 1. Collect raw data: use CommonCrawl as the starting point; 96 CommonCrawl snapshots were used for FineWeb. 2. URL filtering: apply URL filtering using a… https://t.co/5NqP07ZPnu
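The URL-filtering step mentioned above can be sketched as a simple blocklist check. This is a minimal illustration, not the actual FineWeb pipeline code: the real pipeline uses curated blocklists and additional heuristics, and the domain names below are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical blocklist; the real pipeline relies on much larger
# curated lists of unwanted domains.
BLOCKED_DOMAINS = {"spam.example", "adult.example"}

def passes_url_filter(url: str) -> bool:
    """Return True if the URL's domain is not on the blocklist."""
    domain = urlparse(url).netloc.lower()
    return domain not in BLOCKED_DOMAINS

print(passes_url_filter("https://en.wikipedia.org/wiki/Data"))  # True
print(passes_url_filter("https://spam.example/page"))           # False
```

In the actual pipeline this check runs before text extraction, so blocked pages are discarded cheaply, without ever being parsed.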
FineWeb-Edu is a high-quality 1.3T and 5.4T token dataset derived from FineWeb 15T, which was itself derived from CommonCrawl and is higher quality than RedPajama2 (which is a massive 30T dataset). Great work :) https://t.co/53JwyI2B61
Awesome and highly useful: FineWeb-Edu 📚👏 High quality LLM dataset filtering the original 15 trillion FineWeb tokens to 1.3 trillion of the highest (educational) quality, as judged by a Llama 3 70B. +A highly detailed paper. Turns out that LLMs learn a lot better and faster… https://t.co/9nXaet5tmG https://t.co/f3wqPbNkJ5
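The quality filtering described above (keeping only documents an LLM judges as educational) boils down to a score threshold. A minimal sketch, assuming per-document educational scores like those produced by FineWeb-Edu's classifier, which was trained on Llama 3 70B annotations; the documents, scores, and exact threshold here are illustrative:

```python
# Each document carries a classifier-assigned educational score.
docs = [
    {"text": "Photosynthesis converts light into chemical energy...", "edu_score": 4},
    {"text": "Buy now!!! Limited time offer!!!", "edu_score": 0},
]

# Keep only documents at or above the threshold (value is illustrative;
# see the technical report for the thresholds actually used).
THRESHOLD = 3
kept = [d for d in docs if d["edu_score"] >= THRESHOLD]

print(len(kept))  # 1
```

Lowering the threshold trades quality for size, which is how the two FineWeb-Edu variants (1.3T and 5.4T tokens) differ.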
Some really nice data research + artifacts coming out from HuggingFace these days. Creating the definitive open-source dataset probably has more value than most open-sourced models at this stage. https://t.co/apW8pj3s1j
FineWeb technical report has been released. 🍷 FineWeb is a large-scale (15-trillion tokens, 44TB disk space) dataset with the permissive ODC-By 1.0 license. The license explicitly includes commercial use and does not exclude any field of endeavour. They also released 📚… https://t.co/lkTPGseNoJ
the fine folks @huggingface have just published their guide to building 🍷FineWeb, a fully open-source training dataset for LLMs. it makes for a fun and educational read. thank you @thom_wolf and team https://t.co/FN2PPgoeQj
FineWeb Technical Report and FineWeb-Edu released! 🍷 FineWeb is a 15T-token open-source English web dataset derived from CommonCrawl! 📚 FineWeb-Edu is a high-quality 1.3T- & 5.4T-token subset. 😍 TL;DR: 🍷 15T tokens in FineWeb outperforming other open datasets 📚 1.3T… https://t.co/rtbNEoWKqZ
The 🍷 FineWeb technical report is out! Read in detail: 1. How 🍷 FineWeb was processed 2. How we extracted the Educational subset of FineWeb 📚, greatly surpassing base FW on MMLU 3. Individual CC crawl performance and synthetic data contamination. Link: https://t.co/mnhI9IgyS3 https://t.co/WPBZIOtcfi
🍷 FineWeb technical report is out and so is 📚 FineWeb-Edu, a 1.3-trillion-token dataset that outperforms all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA. Technical report: https://t.co/lfOZYYJKxq Dataset:… https://t.co/urC5qjmx3v
The FineWeb tech report is out!🍷 Find all the details involved in building a high-quality pretraining dataset for LLMs, including the new, super-strong FineWeb-Edu subset with 1.3T tokens. https://t.co/4zKtcSEKAQ [built with the beautiful @distillpub template by @ch402 et al] https://t.co/KIAHoJ5zxy
We are (finally) releasing the 🍷 FineWeb technical report! In it, we detail and explain every processing decision we took, and we also introduce our newest dataset: 📚 FineWeb-Edu, a (web only) subset of FW filtered for high educational content. Link: https://t.co/MRsc8Q5K9q https://t.co/HVfFnKbeso
🦆 ✖ 🤗 @duckdb 0.10.3 natively supports @huggingface datasets! Why does it matter? Because it unlocks new use cases, possibly the one you need! ⬇️ https://t.co/Z58rtKBs51
New blog post: Access 150k+ Datasets from Hugging Face with DuckDB This blog post, co-authored by the @huggingface and DuckDB teams, describes how you can use the hf:// prefix in DuckDB to access datasets in Hugging Face repositories. Read more at https://t.co/nQPn7RXBMg https://t.co/RtJvRvAtVf