DeepSeekMoE, a new Mixture-of-Experts (MoE) language model with 16.4B parameters, has been announced by DeepSeek. It achieves performance comparable to DeepSeek 7B and LLaMA2 7B while using only about 40% of their computation. The model employs an innovative MoE architecture built on fine-grained expert segmentation and shared expert isolation, was trained on a 2T-token dataset, and can be deployed on a single GPU without quantization. The design pushes toward ultimate expert specialization, matching Llama-2-7b's performance with 2.5x fewer activated parameters during inference.
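To make the two strategies concrete, here is a minimal PyTorch sketch of a layer with always-on shared experts plus top-k routing over many small, fine-grained experts. All sizes, names, and routing details below are illustrative assumptions, not the actual DeepSeekMoE configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """One small FFN expert (a 'fine-grained' slice of a larger FFN)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_hidden, bias=False)
        self.w_out = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(F.silu(self.w_in(x)))


class SketchMoELayer(nn.Module):
    """Toy layer: n_shared always-on experts + top-k routing over n_routed fine-grained experts."""

    def __init__(self, d_model=64, d_hidden=128, n_routed=8, n_shared=2, top_k=2):
        super().__init__()
        self.routed = nn.ModuleList(Expert(d_model, d_hidden) for _ in range(n_routed))
        self.shared = nn.ModuleList(Expert(d_model, d_hidden) for _ in range(n_shared))
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Shared experts are isolated from routing and
        # applied to every token, so common knowledge does not have to be
        # duplicated across the routed experts.
        out = sum(e(x) for e in self.shared)

        # Fine-grained experts: route each token to its top-k experts.
        scores = self.router(x).softmax(dim=-1)                     # (tokens, n_routed)
        weights, idx = scores.topk(self.top_k, dim=-1)              # (tokens, top_k)
        gates = torch.zeros_like(scores).scatter(-1, idx, weights)  # zero gate for unselected experts

        # For clarity every expert is run densely here; a real kernel dispatches sparsely.
        expert_out = torch.stack([e(x) for e in self.routed], dim=1)  # (tokens, n_routed, d_model)
        out = out + (gates.unsqueeze(-1) * expert_out).sum(dim=1)
        return out + x  # residual connection


tokens = torch.randn(4, 64)
print(SketchMoELayer()(tokens).shape)  # torch.Size([4, 64])
```

The intuition behind the split: the shared experts absorb common knowledge once, so the router only has to distribute specialized knowledge across the many small routed experts.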
The DeepSeek paper marks a significant breakthrough in Mixture-of-Experts (MoE) models. 1/n https://t.co/EKFlztclOb
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models abs: https://t.co/5z3rZZbxlp model: https://t.co/9xBnqUte4o Introduces DeepSeekMoE-16B, which achieves similar perf to Llama-2-7b but with 2.5x fewer activated parameters during inference.… https://t.co/JLGiTMGV2d
I evaluated DeepSeekMoE (16B MoE) chat version. It's a little better than Phi-2 (base model) but not as good as fine-tuned versions of it (similar number of activated parameters). Overall, a very strong model, especially considering that it was probably a lot cheaper to train.… https://t.co/pR6FQuQtc8 https://t.co/xTxIbECJk7
Welcome DeepSeek 16B MoE ✨ > 16.4B parameters > Trained on 2T tokens > 4096 sequence length > Comparable performance to LLaMA / DeepSeek 7B with ~40% of the computation > Employs fine-grained expert segmentation and shared expert isolation strategies for training Requires a ~40GB GPU… https://t.co/j7IEqHJMgo
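On the single-GPU deployment point, here is a hedged loading sketch with Hugging Face transformers; the repo id, the trust_remote_code flag, and the bf16 choice are assumptions about the released checkpoint, not details confirmed in these posts.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-moe-16b-base"  # assumed repo id; adjust if the hosted name differs
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~2 bytes/param -> roughly 33GB of weights for 16.4B params
    device_map="auto",           # should fit on a single ~40GB card without quantization
    trust_remote_code=True,
)

inputs = tok("Mixture-of-Experts models are", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```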
Notes from DeepSeekMoE paper New MoE architecture with segmentation and isolation ideas; highly parameter-efficient, and a 145B MoE is in the works! Here are my notes: Fine-grained expert segmentation: - Same computational cost, more fine-grained expertise - decomposes knowledge more… https://t.co/HdPu4KckQV
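As a reference for the segmentation and isolation notes above, the layer output can be written roughly as follows. This is a paraphrase of the fine-grained plus shared-expert formulation; the symbols ($K_s$ shared experts, $mN$ routed experts, top $(mK - K_s)$ selection) are my notation, not quoted from the paper.

```latex
% Hedged sketch: K_s shared experts always fire; among the mN fine-grained experts,
% only the top (mK - K_s) by router affinity s_{i,t} get a nonzero gate g_{i,t}.
\[
\mathbf{h}^{l}_{t}
  = \sum_{i=1}^{K_s} \mathrm{FFN}_i\!\bigl(\mathbf{u}^{l}_{t}\bigr)
  + \sum_{i=K_s+1}^{mN} g_{i,t}\,\mathrm{FFN}_i\!\bigl(\mathbf{u}^{l}_{t}\bigr)
  + \mathbf{u}^{l}_{t},
\qquad
g_{i,t} =
\begin{cases}
  s_{i,t}, & s_{i,t} \in \operatorname{TopK}\bigl(\{s_{j,t}\},\, mK - K_s\bigr),\\[2pt]
  0, & \text{otherwise,}
\end{cases}
\qquad
s_{i,t} = \operatorname{Softmax}_i\bigl(\mathbf{u}^{l\,\top}_{t}\mathbf{e}^{l}_{i}\bigr).
\]
```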
DeepSeek 7B vs. DeepSeekMoE 16B DeepSeek 7B is a dense model trained on the same corpus as DeepSeekMoE 16B. With only 40.5% of the computation, DeepSeekMoE 16B achieves performance comparable to DeepSeek 7B. https://t.co/BjAaL6SJb8 https://t.co/5qT5oQhLHl
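A quick back-of-the-envelope check on that 40.5% figure: per-token FLOPs scale roughly with activated parameters, and DeepSeekMoE 16B reportedly activates about 2.8B parameters per token (an assumed figure from the paper's headline numbers, not stated in these posts) versus roughly 6.9B for the dense 7B model.

```python
# Hedged sanity check of the compute ratio; both parameter counts are approximate.
activated_moe = 2.8e9  # assumed activated parameters per token for DeepSeekMoE 16B
dense_7b = 6.9e9       # approximate parameter count of DeepSeek 7B / LLaMA2 7B

print(f"{activated_moe / dense_7b:.1%}")   # ~40.6%, in line with the quoted 40.5%
print(f"{dense_7b / activated_moe:.1f}x")  # ~2.5x fewer activated parameters
```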
DeepSeek just announced DeepSeek-MoE DeepSeekMoE 16B is a Mixture-of-Experts (MoE) model. It employs an innovative MoE architecture, which involves two principal strategies: fine-grained expert segmentation and shared expert isolation. Trained from scratch on 2T tokens, and… https://t.co/xcLBiHG6k3
DeepSeekMoE 16B: a new MoE with two innovative strategies, just released by @deepseek_ai 🔥 📊 16.4B parameters 🏋️ Trained on a 2T token dataset ♻️ 40% more efficient than DeepSeek 7B and LLaMA2 7B 💻 Deployed on a single GPU without quantization https://t.co/9Tik4jIkEV
DeepSeek just announced DeepSeek-MoE model chat: https://t.co/Wu2yeKHhlB base: https://t.co/AG6tYWkLYO DeepSeekMoE 16B is a Mixture-of-Experts (MoE) language model with 16.4B parameters. It employs an innovative MoE architecture, which involves two principal strategies:… https://t.co/5Z76s20UCR
🌟 Meet #DeepSeekMoE: The Next Gen of Large Language Models! Performance Highlights: 📈 DeepSeekMoE 2B matches its 2B dense counterpart with 17.5% of the computation. 🚀 DeepSeekMoE 16B rivals LLaMA2 7B with 40% of the computation. 🛠 DeepSeekMoE 145B significantly outperforms GShard,… https://t.co/rNobwc2r54
DeepSeek MoE just came out, with 16.4B parameters, trained on 2T tokens. Comparable performance to DeepSeek 7B and LLaMA2 7B, with only about 40% of the computation. https://t.co/9kiaUb3Nej https://t.co/KFluWrTAMC