MAP-Neo, a fully open-source and transparent bilingual large language model (LLM) series, has been released by researchers from M-A-P, University of Waterloo, and Wuhan AI Research. The series scales up to 7 billion parameters, trained on 4.5 trillion tokens, and is designed to close the gap with closed-source models. The release includes a detailed 49-page paper covering the tokenizer, data preprocessing, model architecture, training, and fine-tuning. MAP-Neo is noted for strong performance, surpassing LLaMA2 while slightly trailing Mistral. The model and its associated resources, including the training script, data, and checkpoints, are available on the Hugging Face hub. The release was published on May 31, 2024.
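For readers who want to try the model, here is a minimal loading sketch with 🤗 transformers. The repo id "m-a-p/neo_7b" is an assumption based on the M-A-P organization name; check the hub for the exact id and available checkpoint revisions.

```python
# Minimal sketch: loading MAP-Neo from the Hugging Face hub with transformers.
# The repo id "m-a-p/neo_7b" is an assumption; check the hub for the exact id
# and the intermediate checkpoint revisions mentioned in the release.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "m-a-p/neo_7b"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Fully open-source language models matter because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```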
As of June 1st, @HuggingFace now boasts over 200,000 public AI demos, known as "Spaces"! 🤗🎉: https://t.co/WZy0Hl2tJu 'Where should I start?' Each week, 8 Spaces are featured as the Spaces of the Week. Here is the full history of the 950+ featured Spaces since October 2021: 🔎 https://t.co/LaEnDn6Q8O https://t.co/tlQENVIEpW
Compare FineWeb-Edu vs. FineWeb using @karpathy's llm.c 👩🏫📚 Another *fun* day has passed; it was fun because the @huggingface team released a detailed tech report and the FineWeb-Edu dataset: a subset of the FineWeb dataset with high educational quality as classified by Llama3-70B.… https://t.co/4ZLynEslJ6 https://t.co/iVgCb9jegp
Extremely cool release from @huggingface! FineWeb - a high-quality LLM pretraining dataset consisting of 15 trillion tokens! (counted using the GPT-2 tokenizer) They show that LLMs trained on this dataset achieve a higher aggregate score across a set of evals (HellaSwag, PIQA, MMLU,… https://t.co/816n3ZOSJX
HuggingFace Releases 🍷 FineWeb: A New Large-Scale (15-Trillion Tokens, 44TB Disk Space) Dataset for LLM Pretraining Hugging Face has introduced 🍷 FineWeb, a comprehensive dataset designed to enhance the training of large language models (LLMs). Published on May 31, 2024, this…
Amazing work by the Fantastic FineWeb team on: 1. explaining in detail how to create a large and high-quality web-scale dataset for LLM pretraining such as FineWeb 2. introducing the FineWeb-Edu subset, which outperforms all openly accessible web datasets on a number of…
This article is a must-read for tech enthusiasts! It unveils the secrets of creating high-quality web-scale datasets, diving into the 15-trillion-token FineWeb release. It also introduces FineWeb-Edu, a 1.3-trillion-token subset with top educational content! https://t.co/zVaBU3paaG
The new FineWeb-Edu dataset from @huggingface shows once more that high-quality data leads to the best results in LLM training. https://t.co/xYrufV8Yae
FineWeb Technical Report was released! Here is how the @huggingface team created FineWeb, the best open-source dataset: 1. Collect raw data: Use CommonCrawl as the starting point; 96 CommonCrawl snapshots were used for FineWeb. 2. URL filtering: Apply URL filtering using a… https://t.co/5NqP07ZPnu
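To make step 2 concrete, here is a toy, self-contained Python sketch of a URL-filtering stage. The blocklist and regex patterns are placeholders for illustration, not FineWeb's actual rules (the real pipeline is implemented in Hugging Face's open-source datatrove library).

```python
# Toy sketch of a URL-filtering stage (step 2 above). The blocklist and regex
# patterns are placeholders for illustration, not FineWeb's actual rules.
import re
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"spam.example.com", "ads.example.net"}  # placeholder list
BLOCKED_PATTERNS = [re.compile(r"\bcasino\b"), re.compile(r"\blottery\b")]

def keep_url(url: str) -> bool:
    """Return True if the document behind this URL should enter the pipeline."""
    if urlparse(url).netloc in BLOCKED_DOMAINS:
        return False
    return not any(p.search(url.lower()) for p in BLOCKED_PATTERNS)

docs = [
    {"url": "https://en.wikipedia.org/wiki/Language_model", "text": "..."},
    {"url": "https://spam.example.com/casino-offer", "text": "..."},
]
filtered = [d for d in docs if keep_url(d["url"])]
print(len(filtered))  # -> 1
```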
FineWeb-Edu is a high-quality 1.3T- and 5.4T-token dataset derived from the 15T-token FineWeb, which was itself derived from CommonCrawl and is higher quality than RedPajama2 (a massive 30T dataset). Great work :) https://t.co/53JwyI2B61
Awesome and highly useful: FineWeb-Edu 📚👏 High-quality LLM dataset filtering the original 15 trillion FineWeb tokens down to the 1.3 trillion of the highest (educational) quality, as judged by Llama 3 70B. +A highly detailed paper. Turns out that LLMs learn a lot better and faster… https://t.co/9nXaet5tmG https://t.co/f3wqPbNkJ5
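As context for how that judging works in practice: alongside FineWeb-Edu, a small educational-quality classifier distilled from the Llama 3 70B annotations was published. A minimal scoring sketch follows; the repo id "HuggingFaceFW/fineweb-edu-classifier" and the score scale (roughly 0-5, with a keep threshold around 3 for the 1.3T subset) are assumptions to verify on the hub.

```python
# Sketch: scoring a document's educational quality with the small classifier
# distilled from Llama 3 70B annotations. The repo id and the score scale
# (roughly 0-5, keep threshold ~3 for the 1.3T subset) are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

clf_id = "HuggingFaceFW/fineweb-edu-classifier"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(clf_id)
model = AutoModelForSequenceClassification.from_pretrained(clf_id)

text = "Photosynthesis converts light energy into chemical energy in plants."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()  # regression head
print(f"edu score: {score:.2f}  keep for FineWeb-Edu: {score >= 3}")
```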
FineWeb technical report has been released. 🍷 FineWeb is a large-scale (15-trillion tokens, 44TB disk space) dataset with the permissive ODC-By 1.0 license. The license explicitly includes commercial use and does not exclude any field of endeavour. They also released 📚… https://t.co/lkTPGseNoJ
The fine folks @huggingface have just published their guide to building 🍷FineWeb, a fully open-source training dataset for LLMs. It makes for a fun and educational read. Thank you @thom_wolf and team https://t.co/FN2PPgoeQj
FineWeb Technical Report and FineWeb-Edu released! 🍷 FineWeb is a 15T-token open-source English web dataset derived from CommonCrawl! 📚 FineWeb-Edu is a pair of 1.3T & 5.4T high-quality subsets. 😍 TL;DR: 🍷 15T tokens in FineWeb outperforming other open datasets 📚 1.3T… https://t.co/rtbNEoWKqZ
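A minimal sketch of how one might peek at either dataset without downloading all 44TB, using 🤗 datasets in streaming mode. The repo id "HuggingFaceFW/fineweb-edu" and the sample config name are assumptions; check the dataset cards for the exact names.

```python
# Sketch: peeking at FineWeb-Edu without downloading 44TB, via streaming mode.
# The repo id and the "sample-10BT" config name are assumptions; check the
# dataset cards on the hub for the exact names.
from datasets import load_dataset

ds = load_dataset(
    "HuggingFaceFW/fineweb-edu",  # assumed repo id (FineWeb: "HuggingFaceFW/fineweb")
    name="sample-10BT",           # assumed small sample config
    split="train",
    streaming=True,               # iterate lazily instead of downloading shards
)
for i, doc in enumerate(ds):
    print(doc["text"][:120].replace("\n", " "))
    if i == 2:
        break
```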
The 🍷 FineWeb technical report is out! Read in detail: 1. How 🍷 FineWeb was processed 2. How we extracted the Educational subset of FineWeb 📚, greatly surpassing base FW on MMLU 3. Individual CC crawls' performance and synthetic data contamination. Link: https://t.co/mnhI9IgyS3 https://t.co/WPBZIOtcfi
🍷 FineWeb technical report is out and so is 📚 FineWeb-Edu, a 1.3 trillion tokens dataset that outperforms all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA. Technical report: https://t.co/lfOZYYJKxq Dataset:… https://t.co/urC5qjmx3v
The FineWeb tech report is out!🍷 Find all the details involved in building a high quality pretraining dataset for LLMs including the new super strong FineWeb-Edu subset with 1.3T tokens. https://t.co/4zKtcSEKAQ [built with the beautiful @distillpub template by @ch402 et al] https://t.co/KIAHoJ5zxy
We are (finally) releasing the 🍷 FineWeb technical report! In it, we detail and explain every processing decision we took, and we also introduce our newest dataset: 📚 FineWeb-Edu, a (web only) subset of FW filtered for high educational content. Link: https://t.co/MRsc8Q5K9q https://t.co/HVfFnKbeso
MAP-NEO: A fully open-sourced Large Language Model GitHub: https://t.co/TEbE7KOa6p https://t.co/8aICiGLdq2
MAP-Neo is definitely the most detailed open-source model 🌟. It comes with a detailed paper plus weights, all checkpoints, tokenizer, data, data preprocessing, model architecture, training, and fine-tuning 📚. 49 pages of extremely detailed technical thought-process explanation… https://t.co/gPZwbDoNkG
MAP-Neo: A Fully Open-Source and Transparent Bilingual LLM Suite that Achieves Superior Performance to Close the Gap with Closed-Source Models Researchers from M-A-P, University of Waterloo, Wuhan AI Research, and https://t.co/F68tZXEYu2 have released MAP-Neo, a highly capable… https://t.co/cGDnmAu0et
Check out the trending @huggingface collections for staying updated on AI developments. This week features: - Models with a non-English focus: @CohereForAI + @DiscoResearchAI - Research-focused collections: @failspy + @ChujieZheng - Sentence Transformers datasets + models https://t.co/LARqzpuj3H
MAP-Neo releases a detailed paper behind its open-source LLM! The paper includes detailed information about Tokenizer, Data preprocessing (Filtering, Deduplication, Quality), Model Architecture, Training, and Fine-Tuning. If you are interested in training LLMs, give it a read! 👀… https://t.co/t0OQRnLWG8
FULLY OPEN SOURCE LLM: MAP-NEO 7B 👉 Another amazing work by MAP on the @huggingface hub 🔥 Model: https://t.co/XqgvKQBO0r GitHub: https://t.co/CH6N5TE8BP Paper: https://t.co/spwafg46G1
Our MAP-Neo paper is finally out. It's a fully open-sourced model with amazing performance, better than LLaMA2, slightly worse than Mistral. We release the full data preprocessing pipeline, pre-training dataset, training framework and all the checkpoints. The paper is very… https://t.co/vdOr0wPwvQ
MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series - Truly open-sourced (e.g. training script, data, checkpoints available) - Up to 7B params trained on 4.5T tokens proj: https://t.co/RdWgOvhJzq abs: https://t.co/qF7xFZCZip https://t.co/znbFznIu9L