This section covers two recent works on efficient attention. BurstAttention is a distributed attention framework for processing extremely long sequences, reducing communication overhead by 40% and achieving a 2x speedup over competitive baselines such as Ring Attention. Separately, Dynamic Memory Compression (DMC) compresses the KV cache in Transformers to improve inference speed without sacrificing quality: at each step the model adaptively decides whether to append the current token to the KV cache or merge it into an existing slot, striking a better trade-off between RNN-style constant memory and full Transformer attention. DMC maintains the performance of Llama 2 while delivering roughly a 4x throughput increase at inference time.
Cool work on dynamic compression of the KV cache, i.e., they predict whether to grow the KV cache (append) or modify it without changing its size! Maintains the performance of Llama 2 (7/13/70B) with ~4x throughput increase at inference time. https://t.co/j3GFQJgSeg
Great stuff - multilingual performance already better than Mistral, and we know from current SotA long-context models (Claude/Gemini) and various RAG applications that linear context scaling (or affordable long-context scaling) will have huge benefits for many use cases 🔥 https://t.co/WqHtKOvGTK
The KV cache may be the most redundant memory usage, but it's non-trivial to compress it in a lossless way. My takeaways: 1) Adaptively append/merge the current token to the KV cache. This is a simple but smart way to achieve a better trade-off between RNN and Transformer behaviour, 2) Designed… https://t.co/9TJ2hxHdal
when contexts are long, attending to every single token in the past feels wasteful (and not at all how human brains work). feels like a natural setting for compression… DMC seems like a huge improvement in transformer inference speed — congrats to the authors! https://t.co/9wA5cJANuq
The memory in Transformers grows linearly with the sequence length at inference time. In SSMs it is constant, but often at the expense of performance. We introduce Dynamic Memory Compression (DMC) where we retrofit LLMs to compress their KV cache while preserving performance… https://t.co/CzoKlwX9VQ
BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences. Offers significant advantages for processing long sequences compared with competitive baselines such as Ring Attention, reducing communication overhead by 40% and achieving a 2x speedup… https://t.co/E4s5RtkvaX