Recent research in natural language processing has introduced new architectures such as State Space Models (SSMs) and hybrids that combine attention mechanisms with SSM layers like Mamba. Models such as Samba 3.8B and Mamba-2-Hybrid have shown promising results in efficiency and performance compared to traditional Transformers, demonstrating better accuracy, comparable training efficiency, and lower inference cost, and outperforming Transformers on a range of tasks and benchmarks.
The recent "Samba: Simple Hybrid State Space Models" by Microsoft look great! Basically a Mamba/transformer hybrid with MLP layers and sliding window attention. And it performs really well on a small <2B scale. Proud to see that that our LitGPT open-source library powered this! https://t.co/K7E1wygjeM
[CL] Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling L Ren, Y Liu, Y Lu, Y Shen… [Microsoft] (2024) https://t.co/rF595l5f5K - SAMBA combines selective SSM layers (Mamba) with sliding window attention (SWA) to efficiently model sequences… https://t.co/zBJMfNSnnK
Is Samba a production-ready non-transformer 3.8B large language model? Much of the takeoff in AI the past few years has been built on transformer and diffusion models. But one question is: how much of this progress is due to the capabilities of these architectures, relative to… https://t.co/T1OjeYT17n
careful systematic study of Mamba vs Transformer capabilities -- hybrid (~10% attn) wins again, where the attn seems to have specific roles (e.g. adhering to multiple choice format, copying abilities) fun collaboration with Nvidia! https://t.co/NQ9RXrU51Y
An Empirical Study of Mamba-based Language Models ◼ 🚀 New research pits Mamba models against Transformers in a head-to-head! Mamba models, while excelling in efficiency and some language tasks, fall short in tasks needing strong in-context learning. The Mamba2-Hybrid not only… https://t.co/G1ZHwtVsBL
The 8B-parameter Mamba-2-Hybrid exceeds the 8B-parameter Transformer on all 12 standard tasks we evaluated (+2.65 points on average) and is predicted to be up to 8× faster when generating tokens at inference time. 🤯 📌 The hybrid model also demonstrates strong long-context… https://t.co/DzDPihcglx
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling ◼ 🚀 Introducing Samba: a groundbreaking hybrid model combining Mamba (a State Space Model) with Sliding Window Attention for efficient sequence processing. 🧠 Achieves superior performance… https://t.co/C98uOfFW7a
[LG] An Empirical Study of Mamba-based Language Models R Waleffe, W Byeon, D Riach, B Norick, V Korthikanti, T Dao, A Gu, A Hatamizadeh, S Singh… [NVIDIA] (2024) https://t.co/KaQR6E6aMb - This paper presents a large-scale comparison between Mamba, Mamba-2, Mamba-2-Hybrid, and… https://t.co/PwN9JiddvN
The most direct and detailed comparison of Transformer, Mamba, and hybrid models so far: same size (8B), datasets (1.1T & 3.5T), hparams. Mamba can match / beat the Transformer on zero-shot evals, but lags behind on MMLU / copying. The hybrid outperforms the Transformer. Some fun results: https://t.co/ipKWljCaJU
An 8B-3.5T hybrid SSM model gets better accuracy than an 8B-3.5T transformer trained on the same dataset: * 7% attention, the rest is Mamba2 * MMLU jumps from 50 to 53.6% * Training efficiency is the same * Inference cost is much less https://t.co/x62otbC5uN https://t.co/bBfFYEt0a0
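To make the "7% attention, the rest is Mamba2" layout concrete, here is a small sketch that spreads a given fraction of attention layers evenly through an otherwise Mamba-2 stack. The even-spacing rule and the helper name `layer_pattern` are assumptions for illustration, not NVIDIA's reported placement.

```python
# Hypothetical helper that spreads a small fraction of attention layers evenly
# through a Mamba-2 stack (the ~7% figure from the tweet above). The even-spacing
# rule is an illustrative assumption, not the paper's exact layer schedule.
def layer_pattern(n_layers: int, attn_fraction: float) -> list[str]:
    n_attn = max(1, round(n_layers * attn_fraction))
    stride = n_layers / n_attn
    attn_positions = {int(i * stride + stride / 2) for i in range(n_attn)}
    return ["attention" if i in attn_positions else "mamba2" for i in range(n_layers)]

pattern = layer_pattern(n_layers=56, attn_fraction=0.07)
print(pattern.count("attention"), "attention layers out of", len(pattern))
# -> 4 attention layers out of 56
```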
Introducing Samba 3.8B, a simple Mamba+Sliding Window Attention architecture that outperforms Phi3-mini on major benchmarks (e.g., MMLU, GSM8K and HumanEval) by a large margin.😮 And it has an infinite context length with linear complexity.🤯 Paper: https://t.co/6OnfGG71Aj… https://t.co/f4IZdT1wGB
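The linear-complexity claim rests on sliding window attention: each token attends only to a fixed-size window of recent tokens, so per-token attention cost stays constant as the context grows. Below is a minimal sketch of such a mask; the window size is chosen arbitrarily for illustration.

```python
# Minimal sketch of a sliding-window causal mask: each query attends only to the
# previous `window` tokens (inclusive of itself), so attention cost per token is
# O(window) instead of O(sequence length). The window size here is illustrative only.
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)       # causal and within the window

mask = sliding_window_mask(seq_len=8, window=3)
print(mask.int())
```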
A 9-billion-parameter State Space Model (SSM) alternative to attention is out. Recurrent transformers are now on par with attention transformers like Gemma and Mistral, but by maintaining a state vector they are capable of faster inference. https://t.co/eWbyQk58Cy
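The faster-inference point comes from the fixed-size recurrent state: generation updates a constant-size state per token instead of appending to an ever-growing KV cache. The toy linear recurrence below illustrates the idea; it is a simplified stand-in under made-up dimensions, not the Mamba recurrence itself.

```python
# Toy linear recurrence illustrating why a fixed-size state makes generation cheap:
# each new token updates a constant-size state, so per-token cost and memory do not
# grow with context length (unlike an attention KV cache). A simplified stand-in
# for an SSM layer, not Mamba itself.
import torch

d_state, d_model = 16, 8
A = torch.rand(d_state) * 0.9            # per-channel decay of the state
B = torch.randn(d_state, d_model) * 0.1  # input projection
C = torch.randn(d_model, d_state) * 0.1  # output projection

state = torch.zeros(d_state)             # fixed-size state carried across tokens
for t in range(1000):                    # context can grow without extra memory
    x_t = torch.randn(d_model)           # embedding of the newly generated token
    state = A * state + B @ x_t          # constant-time state update
    y_t = C @ state                      # output for this step
print(state.shape, y_t.shape)            # torch.Size([16]) torch.Size([8])
```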