Groq Inc. has made significant advances in AI inference speed, most recently reaching an input rate of 40,792 tokens per second on the Llama3 70B model, using FP16 multiply and FP32 accumulate operations at full Llama context length — lossless precision, processing roughly 8,000 tokens in 0.2 seconds. This follows the previous week's milestone of over 30,000 tokens per second input on the Llama3 8B model, on which Groq also delivers 1,200+ tokens per second of output. Separately, the paper "Scalable MatMul-free Language Modeling" shows that MatMul operations can be eliminated from LLMs entirely, replaced with addition and negation, while maintaining strong performance at billion-parameter scales; a GPU-efficient implementation reduces memory usage by up to 61%, and the authors report processing billion-parameter-scale models at 13W, moving LLMs closer to brain-like efficiency.
Put this mind-bending achievement in perspective: @GroqInc runs Llama 70b in lossless precision on ~4 Wikipedia articles in quite literally the blink of an eye. - A 70B model in 16-bit precision with 32-bit accumulation (lossless). - Processing ~8000 tokens in 0.2 seconds (or… https://t.co/N1arORIsf5
Scalable MatMul-free Language Modeling MatMul operations are replaced with addition and negation operations >We processed billion-parameter scale models at 13W beyond human readable throughput, moving LLMs closer to brain-like efficiency https://t.co/6PrSlEPlIY
40k tok/s input on llama3 8b. AI workloads are generally very input heavy. This, combined with our output speed will now make @GroqInc the only way to build performant AI applications. Some realtime feedback from a happy user: https://t.co/vTB4YnJ5Zd https://t.co/ktsGVwGjsB
CompSci Paper of the Day, Issue 33: Scalable MatMul-free Language Modeling 1/4 🧵 https://t.co/aNqSme85J5
Last Week: Groq exceeded 30,000 Tokens / second input rate on Llama3 8B❗️ This Week: Llama3 70B at 40,792 Tokens/s input rate‼️ - FP16 Multiply, FP32 Accumulate - 7989 tokens in - full Llama context length Next Week: ...? 😮 https://t.co/rIijD2Is76
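The figures in this tweet are internally consistent with the "~0.2 seconds" cited elsewhere in the thread; a quick arithmetic check:

```python
# Sanity-check the quoted Groq figures: 7,989 input tokens at 40,792 tok/s.
tokens_in = 7989
rate_tps = 40_792

seconds = tokens_in / rate_tps
print(f"{seconds:.3f} s")  # ≈ 0.196 s, i.e. the "~0.2 seconds" quoted elsewhere
```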
When it comes to precision and accuracy, we have another super power @GroqInc 🎯 https://t.co/OGgmX2vypO
Pace of @GroqInc’s improvement is really impressive 1200+ tps on L3 8B Remember, they still can stack on a ton of software efficiencies https://t.co/zQMx7zkIDj https://t.co/AQ4hxZeYuL
At a glance: inference speed 👇🏻 @GroqInc 🔥🚀 https://t.co/TnU7Unz7Wd
Scalable MatMul-free Language Modeling - Shows that MatMul operations can be completely eliminated from LLMs while maintaining strong performance at billion-parameter scales - Provides a GPU-efficient implementation of this model which reduces memory usage by up to 61% over an… https://t.co/YxeCPLn2xF
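To illustrate the core idea behind replacing MatMul with addition and negation: if weights are constrained to ternary values {-1, 0, +1} (as in the quantization family this paper builds on), every multiply in a matrix product collapses into adding an activation, subtracting it, or skipping it. The sketch below is a minimal NumPy illustration of that accumulation trick, not the paper's actual implementation (the function name and shapes are my own):

```python
import numpy as np

def ternary_matmul_free(x, w_ternary):
    """Compute x @ w_ternary without multiplications.

    x: (batch, in_features) activations.
    w_ternary: (in_features, out_features) with entries in {-1, 0, +1}.
    Each output column is a sum of the activations where the weight
    is +1, minus the sum where it is -1; zeros are skipped entirely.
    """
    out = np.zeros((x.shape[0], w_ternary.shape[1]))
    for j in range(w_ternary.shape[1]):
        col = w_ternary[:, j]
        # Addition for +1 weights, negation (subtraction) for -1 weights.
        out[:, j] = x[:, col == 1].sum(axis=1) - x[:, col == -1].sum(axis=1)
    return out
```

On hardware, this matters because adders are far cheaper in silicon and energy than multipliers, which is the intuition behind the paper's 13W throughput claim.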