9 posts • ChatGPT (GPT-4o)
- Groq Inc. showcased over 500 tokens per second on Llama 3.1 70B, driven by improvements in deterministic scheduling and inter-chip communication. These results were achieved on 14nm silicon, with further gains anticipated from their upcoming 4nm technology.
- SambaNova AI reported 132 tokens per second on their 405B model.
- Cerebras Systems announced updated inference performance: Llama 3.1-8B at 1,927 tokens per second and Llama 3.1-70B at 481 tokens per second.
- Neural Magic released vLLM v0.6.0, delivering 2.7x higher throughput and 5x lower latency on Llama 8B using a single H100.
- Perplexity's Sonar Large, post-trained on Llama 3.1, now runs faster inference, with similar improvements expected for Sonar Huge.

Together, these announcements underscore the rapid pace of gains in AI inference performance and efficiency.