This section covers two recent works on efficient attention. BurstAttention is a distributed attention framework for processing extremely long sequences, reducing communication overhead by 40% and achieving a 2x speedup over competitive baselines such as Ring Attention. Separately, Dynamic Memory Compression (DMC) compresses the KV cache in Transformers to improve inference speed without sacrificing quality: at each step the model adaptively decides whether to append the current token to the KV cache or merge it into an existing slot, striking a better trade-off between RNN-style constant memory and full Transformer attention. DMC maintains the performance of Llama 2 while delivering roughly a 4x throughput increase at inference time.
Cool work on dynamic compression of the KV cache, i.e., they predict whether to grow the KV cache (append) or modify it without changing its size! Maintains the performance of Llama 2 (7/13/70B) with ~4x throughput increase at inference time. https://t.co/j3GFQJgSeg
Great stuff - multilingual performance already better than Mistral, and we know from current SotA long-context models (Claude/Gemini) and various RAG applications that linear context scaling (or affordable long-context scaling) will have huge benefits for many use cases 🔥 https://t.co/WqHtKOvGTK
The KV cache may be the most redundant memory usage, but it's non-trivial to compress it in a lossless way. My takeaways: 1) Adaptively append/merge the current token to the KV cache. This is a simple but smart way to achieve a better trade-off between RNN and Transformer behaviour, 2) Designed… https://t.co/9TJ2hxHdal
when contexts are long, attending to every single token in the past feels wasteful (and not at all how human brains work). feels like a natural setting for compression… DMC seems like a huge improvement in transformer inference speed — congrats to the authors! https://t.co/9wA5cJANuq
The memory in Transformers grows linearly with the sequence length at inference time. In SSMs it is constant, but often at the expense of performance. We introduce Dynamic Memory Compression (DMC) where we retrofit LLMs to compress their KV cache while preserving performance… https://t.co/CzoKlwX9VQ
BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences. Offers significant advantages for processing long sequences compared with competitive baselines such as Ring Attention, reducing communication overhead by 40% and achieving a 2x speedup… https://t.co/E4s5RtkvaX