DeepSeekMoE, a new Mixture-of-Experts (MoE) language model with 16.4B parameters, has been announced by DeepSeek. It achieves performance comparable to DeepSeek 7B and LLaMA2 7B while using only about 40% of their computation. The model employs an innovative MoE architecture built on fine-grained expert segmentation and shared expert isolation, was trained on a 2T-token dataset, and can be deployed on a single GPU without quantization. The design pushes toward ultimate expert specialization, matching Llama-2-7b's performance with 2.5x fewer activated parameters during inference.
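To make the two strategies concrete, here is a minimal PyTorch sketch of a layer with always-on shared experts plus top-k routing over many small, fine-grained experts. All sizes, names, and routing details below are illustrative assumptions, not the actual DeepSeekMoE configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """One small FFN expert (a 'fine-grained' slice of a larger FFN)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_hidden, bias=False)
        self.w_out = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(F.silu(self.w_in(x)))


class SketchMoELayer(nn.Module):
    """Toy layer: n_shared always-on experts + top-k routing over n_routed fine-grained experts."""

    def __init__(self, d_model=64, d_hidden=128, n_routed=8, n_shared=2, top_k=2):
        super().__init__()
        self.routed = nn.ModuleList(Expert(d_model, d_hidden) for _ in range(n_routed))
        self.shared = nn.ModuleList(Expert(d_model, d_hidden) for _ in range(n_shared))
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Shared experts are isolated from routing and
        # applied to every token, so common knowledge does not have to be
        # duplicated across the routed experts.
        out = sum(e(x) for e in self.shared)

        # Fine-grained experts: route each token to its top-k experts.
        scores = self.router(x).softmax(dim=-1)                     # (tokens, n_routed)
        weights, idx = scores.topk(self.top_k, dim=-1)              # (tokens, top_k)
        gates = torch.zeros_like(scores).scatter(-1, idx, weights)  # zero gate for unselected experts

        # For clarity every expert is run densely here; a real kernel dispatches sparsely.
        expert_out = torch.stack([e(x) for e in self.routed], dim=1)  # (tokens, n_routed, d_model)
        out = out + (gates.unsqueeze(-1) * expert_out).sum(dim=1)
        return out + x  # residual connection


tokens = torch.randn(4, 64)
print(SketchMoELayer()(tokens).shape)  # torch.Size([4, 64])
```

The intuition behind the split: the shared experts absorb common knowledge once, so the router only has to distribute specialized knowledge across the many small routed experts.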
The DeepSeek paper marks a significant breakthrough in Mixture-of-Experts (MoE) models. 1/n https://t.co/EKFlztclOb
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models abs: https://t.co/5z3rZZbxlp model: https://t.co/9xBnqUte4o Introduces DeepSeekMoE-16B, which achieves similar perf to Llama-2-7b but with 2.5x fewer activated parameters during inference.… https://t.co/JLGiTMGV2d
I evaluated DeepSeekMoE (16B MoE) chat version. It's a little better than Phi-2 (base model) but not as good as fine-tuned versions of it (similar number of activated parameters). Overall, a very strong model, especially considering that it was probably a lot cheaper to train.… https://t.co/pR6FQuQtc8 https://t.co/xTxIbECJk7
Welcome DeepSeek 16B MoE ✨ > 16.4B parameters > Trained on 2T tokens > 4096 sequence length > Comparable performance to LLaMA / DeepSeek 7B with ~40% of the computation > Employs fine-grained expert segmentation and shared expert isolation strategies for training Requires a ~40GB GPU… https://t.co/j7IEqHJMgo
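On the single-GPU deployment point, here is a hedged loading sketch with Hugging Face transformers; the repo id, the trust_remote_code flag, and the bf16 choice are assumptions about the released checkpoint, not details confirmed in these posts.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-moe-16b-base"  # assumed repo id; adjust if the hosted name differs
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~2 bytes/param -> roughly 33GB of weights for 16.4B params
    device_map="auto",           # should fit on a single ~40GB card without quantization
    trust_remote_code=True,
)

inputs = tok("Mixture-of-Experts models are", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```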
Notes from DeepSeekMoE paper New MoE architecture with segmentation and isolation ideas; highly parameter-efficient, and a 145B MoE is in the works! Here are my notes: Fine-grained expert segmentation: - Same computational cost, more fine-grained expertise - decomposes knowledge more… https://t.co/HdPu4KckQV
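As a reference for the segmentation and isolation notes above, the layer output can be written roughly as follows. This is a paraphrase of the fine-grained plus shared-expert formulation; the symbols ($K_s$ shared experts, $mN$ routed experts, top $(mK - K_s)$ selection) are my notation, not quoted from the paper.

```latex
% Hedged sketch: K_s shared experts always fire; among the mN fine-grained experts,
% only the top (mK - K_s) by router affinity s_{i,t} get a nonzero gate g_{i,t}.
\[
\mathbf{h}^{l}_{t}
  = \sum_{i=1}^{K_s} \mathrm{FFN}_i\!\bigl(\mathbf{u}^{l}_{t}\bigr)
  + \sum_{i=K_s+1}^{mN} g_{i,t}\,\mathrm{FFN}_i\!\bigl(\mathbf{u}^{l}_{t}\bigr)
  + \mathbf{u}^{l}_{t},
\qquad
g_{i,t} =
\begin{cases}
  s_{i,t}, & s_{i,t} \in \operatorname{TopK}\bigl(\{s_{j,t}\},\, mK - K_s\bigr),\\[2pt]
  0, & \text{otherwise,}
\end{cases}
\qquad
s_{i,t} = \operatorname{Softmax}_i\bigl(\mathbf{u}^{l\,\top}_{t}\mathbf{e}^{l}_{i}\bigr).
\]
```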
DeepSeek 7B vs. DeepSeekMoE 16B DeepSeek 7B is a dense model trained on the same corpus as DeepSeekMoE 16B. With only 40.5% of the computation, DeepSeekMoE 16B achieves performance comparable to DeepSeek 7B. https://t.co/BjAaL6SJb8 https://t.co/5qT5oQhLHl
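A quick back-of-the-envelope check on that 40.5% figure: per-token FLOPs scale roughly with activated parameters, and DeepSeekMoE 16B reportedly activates about 2.8B parameters per token (an assumed figure from the paper's headline numbers, not stated in these posts) versus roughly 6.9B for the dense 7B model.

```python
# Hedged sanity check of the compute ratio; both parameter counts are approximate.
activated_moe = 2.8e9  # assumed activated parameters per token for DeepSeekMoE 16B
dense_7b = 6.9e9       # approximate parameter count of DeepSeek 7B / LLaMA2 7B

print(f"{activated_moe / dense_7b:.1%}")   # ~40.6%, in line with the quoted 40.5%
print(f"{dense_7b / activated_moe:.1f}x")  # ~2.5x fewer activated parameters
```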
DeepSeek just announced DeepSeek-MoE DeepSeekMoE 16B is a Mixture-of-Experts (MoE) model. It employs an innovative MoE architecture, which involves two principal strategies: fine-grained expert segmentation and shared expert isolation. Trained from scratch on 2T tokens, and… https://t.co/xcLBiHG6k3
DeepSeekMoE 16B: a new MoE with two innovative strategies, just released by @deepseek_ai 🔥 📊 16.4B parameters 🏋️ Trained on a 2T token dataset ♻️ 40% more efficient than DeepSeek 7B and LLaMA2 7B 💻 Deployed on a single GPU without quantization https://t.co/9Tik4jIkEV
DeepSeek just announced DeepSeek-MoE model chat: https://t.co/Wu2yeKHhlB base: https://t.co/AG6tYWkLYO DeepSeekMoE 16B is a Mixture-of-Experts (MoE) language model with 16.4B parameters. It employs an innovative MoE architecture, which involves two principal strategies:… https://t.co/5Z76s20UCR
🌟 Meet #DeepSeekMoE: The Next Gen of Large Language Models! Performance Highlights: 📈 DeepSeekMoE 2B matches its 2B dense counterpart with 17.5% of the computation. 🚀 DeepSeekMoE 16B rivals LLaMA2 7B with 40% of the computation. 🛠 DeepSeekMoE 145B significantly outperforms GShard,… https://t.co/rNobwc2r54
DeepSeek MoE just came out, with 16.4B parameters, trained on 2T tokens. Comparable performance to DeepSeek 7B and LLaMA2 7B, with only about 40% of the computation. https://t.co/9kiaUb3Nej https://t.co/KFluWrTAMC