Hugging Face has released the FineWeb technical report, detailing the creation of FineWeb, a large-scale, open-source English web dataset derived from CommonCrawl. FineWeb consists of 15 trillion tokens and occupies 44TB of disk space. The report also introduces FineWeb-Edu, a high-quality subset of FineWeb available in 1.3-trillion- and 5.4-trillion-token versions. FineWeb-Edu is specifically filtered for high educational content and has shown remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA. The dataset outperforms all other open web datasets and is released under the permissive ODC-By 1.0 license, allowing commercial use without excluding any field of endeavor.
The new FineWeb-Edu dataset from @huggingface shows once more that high-quality data leads to the best results in LLM trainings. https://t.co/xYrufV8Yae
FineWeb Technical Report was released! Here is how the @huggingface team created FineWeb, the best open-source dataset: 1. Collect raw data: use CommonCrawl as the starting point; 96 CommonCrawl snapshots were used for FineWeb. 2. URL filtering: apply URL filtering using a… https://t.co/5NqP07ZPnu
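The URL-filtering step mentioned above can be sketched as a simple blocklist check. This is a minimal illustration, not the actual FineWeb pipeline code: the real pipeline uses curated blocklists and additional heuristics, and the domain names below are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical blocklist; the real pipeline relies on much larger
# curated lists of unwanted domains.
BLOCKED_DOMAINS = {"spam.example", "adult.example"}

def passes_url_filter(url: str) -> bool:
    """Return True if the URL's domain is not on the blocklist."""
    domain = urlparse(url).netloc.lower()
    return domain not in BLOCKED_DOMAINS

print(passes_url_filter("https://en.wikipedia.org/wiki/Data"))  # True
print(passes_url_filter("https://spam.example/page"))           # False
```

In the actual pipeline this check runs before text extraction, so blocked pages are discarded cheaply, without ever being parsed.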
FineWeb-Edu is a high-quality 1.3T and 5.4T token dataset derived from FineWeb 15T, which was itself derived from CommonCrawl and is higher quality than RedPajama2 (which is a massive 30T dataset). Great work :) https://t.co/53JwyI2B61
Awesome and highly useful: FineWeb-Edu 📚👏 High quality LLM dataset filtering the original 15 trillion FineWeb tokens to 1.3 trillion of the highest (educational) quality, as judged by a Llama 3 70B. +A highly detailed paper. Turns out that LLMs learn a lot better and faster… https://t.co/9nXaet5tmG https://t.co/f3wqPbNkJ5
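The quality filtering described above (keeping only documents an LLM judges as educational) boils down to a score threshold. A minimal sketch, assuming per-document educational scores like those produced by FineWeb-Edu's classifier, which was trained on Llama 3 70B annotations; the documents, scores, and exact threshold here are illustrative:

```python
# Each document carries a classifier-assigned educational score.
docs = [
    {"text": "Photosynthesis converts light into chemical energy...", "edu_score": 4},
    {"text": "Buy now!!! Limited time offer!!!", "edu_score": 0},
]

# Keep only documents at or above the threshold (value is illustrative;
# see the technical report for the thresholds actually used).
THRESHOLD = 3
kept = [d for d in docs if d["edu_score"] >= THRESHOLD]

print(len(kept))  # 1
```

Lowering the threshold trades quality for size, which is how the two FineWeb-Edu variants (1.3T and 5.4T tokens) differ.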
Some really nice data research + artifacts coming out from HuggingFace these days. Creating the definitive open-source dataset probably has more value than most open-sourced models at this stage. https://t.co/apW8pj3s1j
FineWeb technical report has been released. 🍷 FineWeb is a large-scale (15-trillion tokens, 44TB disk space) dataset with the permissive ODC-By 1.0 license. The license explicitly includes commercial use and does not exclude any field of endeavour. They also released 📚… https://t.co/lkTPGseNoJ
the fine folks @huggingface have just published their guide to building 🍷FineWeb, a fully open-source training dataset for LLMs. it makes for a fun and educational read. thank you @thom_wolf and team https://t.co/FN2PPgoeQj
FineWeb Technical Report and FineWeb-Edu released! 🍷 FineWeb is a 15T-token open-source English web dataset derived from CommonCrawl! 📚 FineWeb-Edu is a high-quality 1.3T- & 5.4T-token subset. 😍 TL;DR: 🍷 15T tokens in FineWeb outperforming other open datasets 📚 1.3T… https://t.co/rtbNEoWKqZ
The 🍷 FineWeb technical report is out! Read in detail: 1. How 🍷 FineWeb was processed 2. How we extracted the Educational subset of FineWeb 📚, greatly surpassing base FW on MMLU 3. Individual CC crawl performance and synthetic data contamination. Link: https://t.co/mnhI9IgyS3 https://t.co/WPBZIOtcfi
🍷 FineWeb technical report is out and so is 📚 FineWeb-Edu, a 1.3-trillion-token dataset that outperforms all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA. Technical report: https://t.co/lfOZYYJKxq Dataset:… https://t.co/urC5qjmx3v
The FineWeb tech report is out!🍷 Find all the details involved in building a high-quality pretraining dataset for LLMs, including the new, super-strong FineWeb-Edu subset with 1.3T tokens. https://t.co/4zKtcSEKAQ [built with the beautiful @distillpub template by @ch402 et al] https://t.co/KIAHoJ5zxy
We are (finally) releasing the 🍷 FineWeb technical report! In it, we detail and explain every processing decision we took, and we also introduce our newest dataset: 📚 FineWeb-Edu, a (web only) subset of FW filtered for high educational content. Link: https://t.co/MRsc8Q5K9q https://t.co/HVfFnKbeso
🦆 ✖ 🤗 @duckdb 0.10.3 natively supports @huggingface datasets! Why does it matter? Because it unlocks new use cases, possibly the one you need! ⬇️ https://t.co/Z58rtKBs51
New blog post: Access 150k+ Datasets from Hugging Face with DuckDB This blog post, co-authored by the @huggingface and DuckDB teams, describes how you can use the hf:// prefix in DuckDB to access datasets in Hugging Face repositories. Read more at https://t.co/nQPn7RXBMg https://t.co/RtJvRvAtVf