Recent research from NVIDIA provides the most direct and detailed comparison to date of Transformer, Mamba, and hybrid language models: all of the same size (8 billion parameters), trained on the same datasets (1.1T and 3.5T tokens) with the same hyperparameters. The study, by R. Waleffe and colleagues, found that while pure Mamba models can match or surpass Transformers on standard zero-shot evaluations, they lag behind on tasks such as MMLU and copying. Hybrid models, by contrast, outperform the Transformer across the board: the 8B-parameter Mamba-2-Hybrid exceeds the Transformer on all 12 standard tasks evaluated, by 2.65 points on average, and is predicted to be up to 8× faster at generating tokens during inference. A related model, Samba, combines Mamba with Sliding Window Attention for efficient sequence processing and likewise reports strong performance.
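To make the "hybrid" idea concrete, here is a minimal PyTorch sketch of a layer stack that interleaves a few attention layers among SSM-style layers. This illustrates the general design only, not the paper's actual Mamba-2-Hybrid: `ToySSMBlock` is a hypothetical stand-in (a simple gated linear recurrence) rather than a real Mamba layer, and the layer ratio and dimensions are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySSMBlock(nn.Module):
    """Hypothetical stand-in for a Mamba layer: a gated linear recurrence.

    Like an SSM, it carries a fixed-size state across time steps, so
    per-token decode cost does not grow with context length.
    """
    def __init__(self, d_model):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.log_decay = nn.Parameter(torch.zeros(d_model))  # per-channel decay
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (batch, seq, d_model)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        a = torch.sigmoid(self.log_decay)      # decay in (0, 1)
        h = torch.zeros_like(u[:, 0])          # fixed-size recurrent state
        outs = []
        for t in range(u.size(1)):
            h = a * h + (1 - a) * u[:, t]
            outs.append(h)
        y = torch.stack(outs, dim=1) * F.silu(gate)
        return self.out_proj(y)

class AttentionBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        y, _ = self.attn(x, x, x, attn_mask=causal)  # True = position masked out
        return y

class HybridStack(nn.Module):
    """Mostly SSM layers, with an attention layer every `attn_every` blocks."""
    def __init__(self, d_model=256, n_layers=8, n_heads=4, attn_every=4):
        super().__init__()
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_layers))
        self.layers = nn.ModuleList(
            AttentionBlock(d_model, n_heads) if (i + 1) % attn_every == 0
            else ToySSMBlock(d_model)
            for i in range(n_layers)
        )

    def forward(self, x):
        for norm, layer in zip(self.norms, self.layers):
            x = x + layer(norm(x))  # pre-norm residual
        return x

x = torch.randn(2, 16, 256)
print(HybridStack()(x).shape)  # torch.Size([2, 16, 256])
```

The design intuition: keep attention layers sparse so the model retains exact-recall abilities (MMLU, copying) while most of the compute stays in cheap fixed-state SSM layers.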
8B-parameter Mamba-2-Hybrid exceeds the 8B-parameter Transformer on all 12 standard tasks we evaluated (+2.65 points on average) and is predicted to be up to 8× faster when generating tokens at inference time. 🤯 📌 The hybrid model also demonstrates strong long-context… https://t.co/DzDPihcglx
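Why generation can be so much faster: at decode time, attention must attend over the entire cached context, so per-token cost grows with context length, while an SSM layer only updates a fixed-size state. A back-of-envelope sketch of that asymmetry (not the paper's benchmark methodology; all constants here are illustrative):

```python
# Rough per-token decode cost in multiply-adds (illustrative constants only)
def attn_decode_cost(ctx_len, d_model):
    return 2 * ctx_len * d_model        # scans the whole KV cache: grows with context

def ssm_decode_cost(d_model, d_state):
    return 2 * d_model * d_state        # updates a fixed-size state: constant

for ctx in (1_000, 10_000, 100_000):
    ratio = attn_decode_cost(ctx, 4096) / ssm_decode_cost(4096, 128)
    print(f"context {ctx:>7}: attention/SSM per-token cost ratio ~ {ratio:.0f}x")
```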
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling ◼ 🚀 Introducing Samba: a groundbreaking hybrid model combining Mamba (a State Space Model) with Sliding Window Attention for efficient sequence processing. 🧠 Achieves superior performance… https://t.co/C98uOfFW7a
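The Sliding Window Attention half of Samba's recipe is easy to illustrate: each token attends only to a fixed-size window of recent tokens, so attention cost stays linear in sequence length. A minimal sketch of the masking idea (not Samba's actual implementation; dimensions and window size are made up):

```python
import torch
import torch.nn as nn

def sliding_window_causal_mask(seq_len, window):
    """Boolean mask for nn.MultiheadAttention (True = masked out).

    Token i may attend only to tokens j with i - window < j <= i.
    """
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j > i) | (j <= i - window)

mha = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(1, 10, 64)
y, _ = mha(x, x, x, attn_mask=sliding_window_causal_mask(10, window=4))
print(y.shape)  # torch.Size([1, 10, 64])
```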
[LG] An Empirical Study of Mamba-based Language Models. R Waleffe, W Byeon, D Riach, B Norick, V Korthikanti, T Dao, A Gu, A Hatamizadeh, S Singh… [NVIDIA] (2024) https://t.co/KaQR6E6aMb - This paper presents a large-scale comparison between Mamba, Mamba-2, Mamba-2-Hybrid, and… https://t.co/PwN9JiddvN
An Empirical Study of Mamba-based Language Models. abs: https://t.co/CtMXszmo0E code: https://t.co/kBjXNrejGM Presents a comparison of 8B param Mamba, Mamba-2, and Transformer models trained on up to 3.5T tokens and evaluated on 12 tasks, including short-form QA, long-form QA,… https://t.co/ZJu65uliOP
The most direct and detailed comparison of Transformer, Mamba, and hybrid models so far: same size (8B), datasets (1.1T & 3.5T), hparams. Mamba can match / beat Transformer on zero-shot evals, but lags behind on MMLU / copying. Hybrid outperforms Transformer. Some fun results: https://t.co/ipKWljCaJU