Recent research from NVIDIA provides the most direct and detailed comparison to date of Transformer, Mamba, and hybrid language models: all of the same size (8 billion parameters), trained on the same datasets (1.1T and 3.5T tokens) with the same hyperparameters. The study, by R. Waleffe and colleagues, found that while pure Mamba models can match or surpass Transformers on standard zero-shot evaluations, they lag behind on tasks such as MMLU and copying. Hybrid models, by contrast, outperform the Transformer across the board: the 8B-parameter Mamba-2-Hybrid exceeds the Transformer on all 12 standard tasks evaluated, by 2.65 points on average, and is predicted to be up to 8× faster at generating tokens during inference. A related model, Samba, combines Mamba with Sliding Window Attention for efficient sequence processing and likewise reports strong performance.
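To make the "hybrid" idea concrete, here is a minimal PyTorch sketch of a layer stack that interleaves a few attention layers among SSM-style layers. This illustrates the general design only, not the paper's actual Mamba-2-Hybrid: `ToySSMBlock` is a hypothetical stand-in (a simple gated linear recurrence) rather than a real Mamba layer, and the layer ratio and dimensions are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySSMBlock(nn.Module):
    """Hypothetical stand-in for a Mamba layer: a gated linear recurrence.

    Like an SSM, it carries a fixed-size state across time steps, so
    per-token decode cost does not grow with context length.
    """
    def __init__(self, d_model):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.log_decay = nn.Parameter(torch.zeros(d_model))  # per-channel decay
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (batch, seq, d_model)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        a = torch.sigmoid(self.log_decay)      # decay in (0, 1)
        h = torch.zeros_like(u[:, 0])          # fixed-size recurrent state
        outs = []
        for t in range(u.size(1)):
            h = a * h + (1 - a) * u[:, t]
            outs.append(h)
        y = torch.stack(outs, dim=1) * F.silu(gate)
        return self.out_proj(y)

class AttentionBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        y, _ = self.attn(x, x, x, attn_mask=causal)  # True = position masked out
        return y

class HybridStack(nn.Module):
    """Mostly SSM layers, with an attention layer every `attn_every` blocks."""
    def __init__(self, d_model=256, n_layers=8, n_heads=4, attn_every=4):
        super().__init__()
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_layers))
        self.layers = nn.ModuleList(
            AttentionBlock(d_model, n_heads) if (i + 1) % attn_every == 0
            else ToySSMBlock(d_model)
            for i in range(n_layers)
        )

    def forward(self, x):
        for norm, layer in zip(self.norms, self.layers):
            x = x + layer(norm(x))  # pre-norm residual
        return x

x = torch.randn(2, 16, 256)
print(HybridStack()(x).shape)  # torch.Size([2, 16, 256])
```

The design intuition: keep attention layers sparse so the model retains exact-recall abilities (MMLU, copying) while most of the compute stays in cheap fixed-state SSM layers.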
8B-parameter Mamba-2-Hybrid exceeds the 8B-parameter Transformer on all 12 standard tasks we evaluated (+2.65 points on average) and is predicted to be up to 8× faster when generating tokens at inference time. 🤯 📌 The hybrid model also demonstrates strong long-context… https://t.co/DzDPihcglx
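Why generation can be so much faster: at decode time, attention must attend over the entire cached context, so per-token cost grows with context length, while an SSM layer only updates a fixed-size state. A back-of-envelope sketch of that asymmetry (not the paper's benchmark methodology; all constants here are illustrative):

```python
# Rough per-token decode cost in multiply-adds (illustrative constants only)
def attn_decode_cost(ctx_len, d_model):
    return 2 * ctx_len * d_model        # scans the whole KV cache: grows with context

def ssm_decode_cost(d_model, d_state):
    return 2 * d_model * d_state        # updates a fixed-size state: constant

for ctx in (1_000, 10_000, 100_000):
    ratio = attn_decode_cost(ctx, 4096) / ssm_decode_cost(4096, 128)
    print(f"context {ctx:>7}: attention/SSM per-token cost ratio ~ {ratio:.0f}x")
```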
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling ◼ 🚀 Introducing Samba: a groundbreaking hybrid model combining Mamba (a State Space Model) with Sliding Window Attention for efficient sequence processing. 🧠 Achieves superior performance… https://t.co/C98uOfFW7a
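The Sliding Window Attention half of Samba's recipe is easy to illustrate: each token attends only to a fixed-size window of recent tokens, so attention cost stays linear in sequence length. A minimal sketch of the masking idea (not Samba's actual implementation; dimensions and window size are made up):

```python
import torch
import torch.nn as nn

def sliding_window_causal_mask(seq_len, window):
    """Boolean mask for nn.MultiheadAttention (True = masked out).

    Token i may attend only to tokens j with i - window < j <= i.
    """
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j > i) | (j <= i - window)

mha = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(1, 10, 64)
y, _ = mha(x, x, x, attn_mask=sliding_window_causal_mask(10, window=4))
print(y.shape)  # torch.Size([1, 10, 64])
```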
[LG] An Empirical Study of Mamba-based Language Models. R Waleffe, W Byeon, D Riach, B Norick, V Korthikanti, T Dao, A Gu, A Hatamizadeh, S Singh… [NVIDIA] (2024) https://t.co/KaQR6E6aMb - This paper presents a large-scale comparison between Mamba, Mamba-2, Mamba-2-Hybrid, and… https://t.co/PwN9JiddvN
An Empirical Study of Mamba-based Language Models. abs: https://t.co/CtMXszmo0E code: https://t.co/kBjXNrejGM Presents a comparison of 8B param Mamba, Mamba-2, and Transformer models trained on up to 3.5T tokens and evaluated on 12 tasks, including short-form QA, long-form QA,… https://t.co/ZJu65uliOP
The most direct and detailed comparison of Transformer, Mamba, and hybrid models so far: same size (8B), datasets (1.1T & 3.5T), hparams. Mamba can match / beat Transformer on zero-shot evals, but lags behind on MMLU / copying. Hybrid outperforms Transformer. Some fun results: https://t.co/ipKWljCaJU