Hugging Face Launches Open-Source AI Assistant Maker to Compete with OpenAI's Custom GPTs, Introduces FineWeb Dataset for LLM Pretraining

Marktechpost AI Research News ⚡@Marktechpost

HuggingFace Releases 🍷 FineWeb: A New Large-Scale (15-Trillion Tokens, 44TB Disk Space) Dataset for LLM Pretraining Hugging Face has introduced 🍷 FineWeb, a comprehensive dataset designed to enhance the training of large language models (LLMs). Published on May 31, 2024, this…

clem 🤗@ClementDelangue

29 d

Is Fineweb-edu the best open text dataset ever released? A big step in empowering all companies to train their own GPT5! https://t.co/fSEngn3Eou https://t.co/Z8YJRQzB7N

Trelis Research@TrelisResearch

29 d

🍷Preparing Fineweb - A Finely Cleaned Common Crawl Dataset🍷 Credit to @RealGDT, @HKydlicek, @LoubnaBenAllal1, @anton_lozhkov, @colinraffel, @lvwerra, @Thom_Wolf of @huggingface for the fine dataset and blog. TIMESTAMPS: 0:00 Common Crawl Data Processing Pipeline 0:42 Video… https://t.co/7PRsNm4G6B

Zach Mueller@TheZachMueller

29 d

Hugging Face just hit a whole new level of "democratization of ML" 👀 https://t.co/uHH2gsDXWm

Prime Intellect@PrimeIntellect

29 d

Excellent work by the amazing @huggingface team on providing the highest quality truly open dataset for pre-training!! https://t.co/ESH9scLyzH

Philipp Schmid@_philschmid

30 d

FineWeb Technical Report was released! Here is how the @huggingface team created FineWeb, the best open-source dataset:
1. Collect Raw Data: Use CommonCrawl as the starting point; 96 CommonCrawl snapshots were used for FineWeb.
2. URL Filtering: Apply URL filtering using a… https://t.co/5NqP07ZPnu

TheHeroShep@TheHeroShep

1 mo

This is one of the most incredible papers on large-scale data processing & LLM training. Huge thanks + kudos to @GuiPendino & the @huggingface team for sharing so many valuable insights. A must-read if you’re in the space. https://t.co/fBf2wpAwyc

Julien Dorra@juliendorra

1 mo

"TLDR: This blog covers a discussion on processing and evaluating data quality at scale, the 🍷 FineWeb recipe (listing and explaining all of our design choices), and the process followed to create its 📚 FineWeb-Edu subset." https://t.co/aNfCqox7sz

🤖🇨🇭AI & Machine Learning @ HSLU@hslu_aiml

1 mo

This article provides a great window into how training data for LLMs are created and curated. FineWeb: decanting the web for the finest text data at scale - a Hugging Face Space by HuggingFaceFW https://t.co/Zh1QzYsvbB

Ronald van Loon@Ronald_vanLoon

1 mo

Hugging Face launches open source #AI assistant maker to rival OpenAI’s custom GPTs by @carlfranzen @VentureBeat Learn more: https://t.co/lQiPEvt36U #Chatbots #ML #ArtificialIntelligence #MI cc: @theadamgabriel @miketamir @karpathy https://t.co/90xeH9e1UV

Similar Stories

Hugging Face Launches Open-Source AI Assistant Maker to Compete with OpenAI's Custom GPTs, Introduces FineWeb Dataset for LLM Pretraining
