MAP-Neo, a fully open-source and transparent bilingual large language model (LLM) series, has been released by researchers from M-A-P, University of Waterloo, and Wuhan AI Research. The series scales up to 7 billion parameters, trained on 4.5 trillion tokens, and is designed to close the gap with closed-source models. The release includes a detailed 49-page paper covering the tokenizer, data preprocessing, model architecture, training, and fine-tuning. MAP-Neo is noted for strong performance, surpassing LLaMA2 while slightly trailing Mistral. The model and its associated resources, including the training script, data, and checkpoints, are available on the Hugging Face hub. The release was published on May 31, 2024.
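For readers who want to try the model, here is a minimal loading sketch with 🤗 transformers. The repo id "m-a-p/neo_7b" is an assumption based on the M-A-P organization name; check the hub for the exact id and available checkpoint revisions.

```python
# Minimal sketch: loading MAP-Neo from the Hugging Face hub with transformers.
# The repo id "m-a-p/neo_7b" is an assumption; check the hub for the exact id
# and the intermediate checkpoint revisions mentioned in the release.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "m-a-p/neo_7b"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Fully open-source language models matter because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```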
As of June 1st, @HuggingFace now boasts over 200,000 public AI demos, known as "Spaces"! 🤗🎉: https://t.co/WZy0Hl2tJu 'Where should I start?' Each week, 8 Spaces are featured as the Spaces of the Week. Here is the full history of the 950+ featured Spaces since October 2021: 🔎 https://t.co/LaEnDn6Q8O https://t.co/tlQENVIEpW
Compare FineWeb-Edu vs. FineWeb using @karpathy's llm.c 👩🏫📚 Another *fun* day has passed; it was fun because the @huggingface team released a detailed tech report and the FineWeb-Edu dataset: a subset of the FineWeb dataset with high educational quality as classified by Llama3-70B.… https://t.co/4ZLynEslJ6 https://t.co/iVgCb9jegp
Extremely cool release from @huggingface! FineWeb - a high-quality LLM pretraining dataset consisting of 15 trillion tokens! (counted using the GPT-2 tokenizer) They show that LLMs trained on this dataset achieve a higher aggregate score across a set of evals (HellaSwag, PIQA, MMLU,… https://t.co/816n3ZOSJX
HuggingFace Releases 🍷 FineWeb: A New Large-Scale (15-Trillion Tokens, 44TB Disk Space) Dataset for LLM Pretraining Hugging Face has introduced 🍷 FineWeb, a comprehensive dataset designed to enhance the training of large language models (LLMs). Published on May 31, 2024, this…
Amazing work by the Fantastic FineWeb team on: 1. explaining in detail how to create a large and high-quality web-scale dataset for LLM pretraining such as FineWeb 2. introducing the FineWeb-Edu subset, which outperforms all openly accessible web datasets on a number of…
This article is a must-read for tech enthusiasts! It unveils the secrets of creating high-quality web-scale datasets, diving into the 15-trillion-token FineWeb release. It also introduces FineWeb-Edu, a 1.3-trillion-token subset with top educational content! https://t.co/zVaBU3paaG
The new FineWeb-Edu dataset from @huggingface shows once more that high-quality data leads to the best results in LLM training. https://t.co/xYrufV8Yae
FineWeb Technical Report was released! Here is how the @huggingface team created FineWeb, the best open-source dataset: 1. Collect raw data: Use CommonCrawl as the starting point; 96 CommonCrawl snapshots were used for FineWeb. 2. URL filtering: Apply URL filtering using a… https://t.co/5NqP07ZPnu
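To make step 2 concrete, here is a toy, self-contained Python sketch of a URL-filtering stage. The blocklist and regex patterns are placeholders for illustration, not FineWeb's actual rules (the real pipeline is implemented in Hugging Face's open-source datatrove library).

```python
# Toy sketch of a URL-filtering stage (step 2 above). The blocklist and regex
# patterns are placeholders for illustration, not FineWeb's actual rules.
import re
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"spam.example.com", "ads.example.net"}  # placeholder list
BLOCKED_PATTERNS = [re.compile(r"\bcasino\b"), re.compile(r"\blottery\b")]

def keep_url(url: str) -> bool:
    """Return True if the document behind this URL should enter the pipeline."""
    if urlparse(url).netloc in BLOCKED_DOMAINS:
        return False
    return not any(p.search(url.lower()) for p in BLOCKED_PATTERNS)

docs = [
    {"url": "https://en.wikipedia.org/wiki/Language_model", "text": "..."},
    {"url": "https://spam.example.com/casino-offer", "text": "..."},
]
filtered = [d for d in docs if keep_url(d["url"])]
print(len(filtered))  # -> 1
```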
FineWeb-Edu is a high-quality 1.3T- and 5.4T-token dataset derived from the 15T-token FineWeb, which was itself derived from CommonCrawl and is higher quality than RedPajama2 (a massive 30T dataset). Great work :) https://t.co/53JwyI2B61
Awesome and highly useful: FineWeb-Edu 📚👏 High-quality LLM dataset filtering the original 15 trillion FineWeb tokens down to the 1.3 trillion of the highest (educational) quality, as judged by Llama 3 70B. +A highly detailed paper. Turns out that LLMs learn a lot better and faster… https://t.co/9nXaet5tmG https://t.co/f3wqPbNkJ5
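As context for how that judging works in practice: alongside FineWeb-Edu, a small educational-quality classifier distilled from the Llama 3 70B annotations was published. A minimal scoring sketch follows; the repo id "HuggingFaceFW/fineweb-edu-classifier" and the score scale (roughly 0-5, with a keep threshold around 3 for the 1.3T subset) are assumptions to verify on the hub.

```python
# Sketch: scoring a document's educational quality with the small classifier
# distilled from Llama 3 70B annotations. The repo id and the score scale
# (roughly 0-5, keep threshold ~3 for the 1.3T subset) are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

clf_id = "HuggingFaceFW/fineweb-edu-classifier"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(clf_id)
model = AutoModelForSequenceClassification.from_pretrained(clf_id)

text = "Photosynthesis converts light energy into chemical energy in plants."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()  # regression head
print(f"edu score: {score:.2f}  keep for FineWeb-Edu: {score >= 3}")
```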
FineWeb technical report has been released. 🍷 FineWeb is a large-scale (15-trillion tokens, 44TB disk space) dataset with the permissive ODC-By 1.0 license. The license explicitly includes commercial use and does not exclude any field of endeavour. They also released 📚… https://t.co/lkTPGseNoJ
The fine folks @huggingface have just published their guide to building 🍷FineWeb, a fully open-source training dataset for LLMs. It makes for a fun and educational read. Thank you @thom_wolf and team https://t.co/FN2PPgoeQj
FineWeb Technical Report and FineWeb-Edu released! 🍷 FineWeb is a 15T-token open-source English web dataset derived from CommonCrawl! 📚 FineWeb-Edu is a pair of 1.3T & 5.4T high-quality subsets. 😍 TL;DR: 🍷 15T tokens in FineWeb outperforming other open datasets 📚 1.3T… https://t.co/rtbNEoWKqZ
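A minimal sketch of how one might peek at either dataset without downloading all 44TB, using 🤗 datasets in streaming mode. The repo id "HuggingFaceFW/fineweb-edu" and the sample config name are assumptions; check the dataset cards for the exact names.

```python
# Sketch: peeking at FineWeb-Edu without downloading 44TB, via streaming mode.
# The repo id and the "sample-10BT" config name are assumptions; check the
# dataset cards on the hub for the exact names.
from datasets import load_dataset

ds = load_dataset(
    "HuggingFaceFW/fineweb-edu",  # assumed repo id (FineWeb: "HuggingFaceFW/fineweb")
    name="sample-10BT",           # assumed small sample config
    split="train",
    streaming=True,               # iterate lazily instead of downloading shards
)
for i, doc in enumerate(ds):
    print(doc["text"][:120].replace("\n", " "))
    if i == 2:
        break
```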
The 🍷 FineWeb technical report is out! Read in detail: 1. How 🍷 FineWeb was processed 2. How we extracted the Educational subset of FineWeb 📚, greatly surpassing base FW on MMLU 3. Individual CC crawls' performance and synthetic data contamination. Link: https://t.co/mnhI9IgyS3 https://t.co/WPBZIOtcfi
🍷 FineWeb technical report is out and so is 📚 FineWeb-Edu, a 1.3 trillion tokens dataset that outperforms all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA. Technical report: https://t.co/lfOZYYJKxq Dataset:… https://t.co/urC5qjmx3v
The FineWeb tech report is out!🍷 Find all the details involved in building a high quality pretraining dataset for LLMs including the new super strong FineWeb-Edu subset with 1.3T tokens. https://t.co/4zKtcSEKAQ [built with the beautiful @distillpub template by @ch402 et al] https://t.co/KIAHoJ5zxy
We are (finally) releasing the 🍷 FineWeb technical report! In it, we detail and explain every processing decision we took, and we also introduce our newest dataset: 📚 FineWeb-Edu, a (web only) subset of FW filtered for high educational content. Link: https://t.co/MRsc8Q5K9q https://t.co/HVfFnKbeso
MAP-NEO: A fully open-sourced Large Language Model GitHub: https://t.co/TEbE7KOa6p https://t.co/8aICiGLdq2
MAP-Neo is definitely the most detailed open-source model 🌟. It comes with a detailed paper plus weights, all checkpoints, tokenizer, data, data preprocessing, model architecture, training, and fine-tuning 📚. 49 pages of extremely detailed technical thought-process explanation… https://t.co/gPZwbDoNkG
MAP-Neo: A Fully Open-Source and Transparent Bilingual LLM Suite that Achieves Superior Performance to Close the Gap with Closed-Source Models Researchers from M-A-P, University of Waterloo, Wuhan AI Research, and https://t.co/F68tZXEYu2 have released MAP-Neo, a highly capable… https://t.co/cGDnmAu0et
Check out the trending @huggingface collections for staying updated on AI developments. This week features: - Models with a non-English focus: @CohereForAI + @DiscoResearchAI - Research-focused collections: @failspy + @ChujieZheng - Sentence Transformers datasets + models https://t.co/LARqzpuj3H
MAP-Neo releases a detailed paper behind its open-source LLM! The paper includes detailed information about Tokenizer, Data preprocessing (Filtering, Deduplication, Quality), Model Architecture, Training, and Fine-Tuning. If you are interested in training LLMs, give it a read! 👀… https://t.co/t0OQRnLWG8
FULLY OPEN SOURCE LLM: MAP-NEO 7B 👉 Another amazing work by MAP on the @huggingface hub 🔥 Model: https://t.co/XqgvKQBO0r GitHub: https://t.co/CH6N5TE8BP Paper: https://t.co/spwafg46G1
Our MAP-Neo paper is finally out. It's a fully open-sourced model with amazing performance, better than LLaMA2, slightly worse than Mistral. We release the full data preprocessing pipeline, pre-training dataset, training framework and all the checkpoints. The paper is very… https://t.co/vdOr0wPwvQ
MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series - Truly open-sourced (e.g. training script, data, checkpoints available) - Up to 7B params trained on 4.5T tokens proj: https://t.co/RdWgOvhJzq abs: https://t.co/qF7xFZCZip https://t.co/znbFznIu9L