The AI community is seeing rapid progress with Google's Gemma models. The 2B and 7B variants now run on a wide range of devices, including iPhones, Android phones such as the Samsung S23, and even five-year-old Windows laptops without GPU acceleration, thanks to updates and contributions from developers around the world. Performance is notable: Gemma 7B exceeds 2,500 tokens per second on an H100, while Gemma 2B reaches over 475 tokens per second via JAX, Transformers, and TPUs (up to 650 tokens per second depending on prompt and conditions) and 22+ tokens per second on an iPhone. The models have also been integrated into several platforms and APIs, including MLC LLM for Android and web browsers, KerasNLP, and Anyscale Endpoints, where Gemma 7B's MMLU results are reported as comparable to Mistral-7B's. Open-source contributions and optimizations, such as quantized variants from the MLX community and the Ollama v0.1.27 release (described as "100x more usable" on a five-year-old Windows laptop with no GPU), have further improved Gemma's accessibility and efficiency, demonstrating the model's potential for practical machine learning applications.
Besides Android, iOS, and web browsers, Gemma is also supported in MLC LLM on various GPUs! A single model definition does it all -- thanks to the ML compiler infra led by @junrushao and many others! Try it in Google Colab: https://t.co/Xh25ZFAXch https://t.co/mPAzahVb3a
https://t.co/in4MFlYKaW now adds Gemma from @GoogleDeepMind! The 2b model is perfect for building in-browser agents with @WebGPU acceleration -- everything local! Here is a 1x speed demo of 4-bit quantized gemma-2b-it on @GooglePixel_US 7 Pro with @googlechrome. https://t.co/7gKTZYD1FH
thrilled to find that @ollama 0.1.27 is now 100x more usable on a 5-year old Windows laptop with no GPU. https://t.co/RpJpfuDJjU
Ollama v0.1.27 is released!
🐎 Performance improvements (up to 2x) when running Gemma models
🚄 Fixed performance issues on Windows without GPU acceleration. Systems with AVX and AVX2 instruction sets should be 2-4x faster.
💊 Reduced likelihood of false positive Windows…
We’re now serving Gemma 7B Instruct on the Fireworks platform! Try out Google’s latest model on Fireworks to enjoy fast inference speeds, token-based pricing, and an OpenAI-compatible, user-friendly API. Get started on our playground at https://t.co/k2thudqXg0 or through our API… https://t.co/A2OrLrO5aI
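Several of these hosts (Fireworks, Together, Anyscale, Lepton) advertise OpenAI-compatible endpoints, which means the same chat-completions request shape works across them with only the base URL and model name swapped. A minimal sketch of assembling such a request — the base URL is a placeholder and the payload follows the generic OpenAI chat schema, not any one provider's documented example:

```python
import json
import os

# Hypothetical endpoint -- substitute the provider's actual base URL.
BASE_URL = "https://api.example-host.com/v1/chat/completions"

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-compatible chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

body = build_chat_request("gemma-7b-it", "Explain quantization in one sentence.")
headers = {
    # Most OpenAI-compatible hosts use a bearer token for auth.
    "Authorization": f"Bearer {os.environ.get('API_KEY', 'sk-...')}",
    "Content-Type": "application/json",
}
print(json.dumps(body, indent=2))
```

Sending `body` to the provider's URL with any HTTP client (or the official `openai` Python package pointed at a custom `base_url`) is all that changes between hosts.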
Gemma models from @GoogleAI are now available on Together API! Check out the models on our playground 👇
Gemma-7b-it: https://t.co/OaW7rGFiLO
Gemma-2b-it: https://t.co/OaW7rGFiLO
Gemma-7b: https://t.co/zQM20Oc4sa
Gemma-2b: https://t.co/2mQQSXKSbS
💪Gemma-7b from @GoogleAI, one of the best 7b models available, is now supported on Anyscale Endpoints! 📈Results on MMLU are comparable to Mistral-7b and surpass Llama2-13b Try it out: https://t.co/MStW5BG3lD https://t.co/D0Wdh603NO
Wow! Expo is a response to Flutter, in TypeScript. I'm sold! Hmmm, I wonder if I can get a 2B model (Gemma maybe?) running locally on a phone. https://t.co/BBYLroGL8i
Sneak peek, hey, pssst... @ollama 0.1.27 is out! Performance improvements (up to 2x) when running Gemma models. Let's try it! https://t.co/hy050gpYl4
🚀 Excited to introduce the future of machine learning on Lepton AI! We've teamed up with Google's latest marvel, Gemma, to bring you an API that's as powerful as it is user-friendly. 🌟 Start exploring with Gemma today at https://t.co/81I1zjBuny. #MachineLearning #AI #Gemma
The power of open-source AI. Mistral Pro from Tencent is now on par with Gemma on math and coding tasks. It only took 3 days for them to get there. 7B models will become really interesting when they reach GPT 3.5 performance. That's an MMLU of around 71-72. We will get there… https://t.co/Xr5wmOhZFZ
Just found that the @MistralAI 7B model is available in KerasNLP with #Keras 3.0. Now we have two leading open LLMs in the Keras ecosystem: #Gemma and Mistral-7B! This is what we have been waiting for, for a long time :) Personally, I will build an end-to-end MLOps pipeline for them! https://t.co/PpNHECUFvc
This is how easy it is to run a model locally. 2 commands in the terminal to get Gemma running. 👏 @ollama It's awesome to see more open source models being released. Thank you @sundarpichai @demishassabis @JeffDean @ZoubinGhahrama1 @OriolVinyalsML @asoroken @alexanderchen… https://t.co/qozcd6IJ1g
💫 Google Gemma model updated to no longer output undesired text locally! It's really good! To update the models if you have previously downloaded: ollama pull gemma (default 7B) ollama pull gemma:2b (2B model) Run the pulled models with `ollama run` A big ❤️ thank you… https://t.co/j2qEnO7wMY
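Beyond `ollama run` in the terminal, Ollama also exposes a local REST API (by default on port 11434) once the server is up, so a pulled model can be queried from code. A minimal sketch — the prompt is illustrative, and the call is wrapped so the snippet still runs (with a fallback message) when no Ollama server is listening:

```python
import json
import urllib.error
import urllib.request

# Ollama's local generate endpoint; it takes a model tag and a prompt.
OLLAMA_URL = "http://localhost:11434/api/generate"

payload = {
    "model": "gemma:2b",           # tag from `ollama pull gemma:2b`
    "prompt": "Why is the sky blue?",
    "stream": False,               # single JSON object instead of a stream
}

req = urllib.request.Request(
    OLLAMA_URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=5) as resp:
        print(json.load(resp)["response"])
except (urllib.error.URLError, OSError):
    print("Ollama server not reachable -- start it with `ollama serve`.")
```

This is the same endpoint the `ollama run` REPL talks to, which is what makes the two-command local setup praised above so easy to build on.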
Google's Gemma model is now supported on Android using MLC LLM. Here is a demo of 4-bit quantized Gemma-2b model running on Samsung S23. Thanks to @ruihanglai and many others for bringing Gemma support to MLC! https://t.co/KXW7nse8OV https://t.co/8jNvd71cBV https://t.co/xuIEL7eJHW
Gemma 2B at > 475 tok/sec! 🫡 Powered by JAX, Transformers, and TPUs. Up to 4x faster than PyTorch (on A100). Depending on the prompt and conditions, it can go up to 650 tok/sec. ⚡ Kudos to @sanchitgandhi99 for integrating Gemma in JAX transformers! Note: This is on TPU v2;… https://t.co/zB1Csc7da1
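To put those throughput numbers in perspective, wall-clock generation time is just token count divided by rate. A quick back-of-the-envelope calculator using the figures quoted in these posts (rates are as reported; actual conditions vary):

```python
# Wall-clock time to generate N tokens at the reported throughputs.
def generation_seconds(tokens: int, tokens_per_sec: float) -> float:
    return tokens / tokens_per_sec

reported_rates = [
    ("TPU v2 (JAX)", 475),   # "> 475 tok/sec"
    ("TPU v2 peak", 650),    # "up to 650 tok/sec"
    ("iPhone (MLC)", 22),    # "22+ tok/sec"
]
for label, rate in reported_rates:
    secs = generation_seconds(1000, rate)
    print(f"{label:>13}: 1000 tokens in {secs:.1f} s")
```

So a 1,000-token completion takes roughly two seconds on the TPU setup versus about three-quarters of a minute on the phone — slow for batch work, but comfortably interactive for chat.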
Gemma 2B running on iPhone at 22+ tok/sec. 2B really hits a sweet spot for running a local model on a phone. Try it out via https://t.co/emAhXUXZew 👉 https://t.co/BvfBXbtCS9
Run Gemma model locally on iPhone - we get blazing fast 20 tok/s for 2B model. This shows amazing potential ahead for Gemma fine-tunes on phones, made possible by the new MLC SLM compilation flow by @junrushao from @OctoAICloud and many other contributors. https://t.co/1L0mvG1bWq https://t.co/c4kCO9DRos
The 🤗 MLX community is fast. Already quantized and uploaded all the Gemma model variants: Available here: https://t.co/dUgErUXnM3 Thanks @Prince_Canuma and @lazarustda ! https://t.co/fbEyBIy9GC
For my first official contribution to the @modal_labs examples: running Gemma 7B on an H100 at >2500 tok/s 🚀 With very little effort, that's already just ~75¢ per megatoken -- and you have full "tensors-and-a-shell" control over the execution environment https://t.co/D6ls1m8MAE
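The ~75¢-per-megatoken figure follows directly from throughput and hourly GPU price. A sketch of the arithmetic — the $6.75/hr H100 rate below is an assumption chosen to reproduce the quoted figure, since the post does not state an hourly price:

```python
# Dollars per million generated tokens, from throughput and hourly GPU cost.
def usd_per_megatoken(tokens_per_sec: float, usd_per_hour: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return usd_per_hour / (tokens_per_hour / 1e6)

# 2500 tok/s on an H100 -> 9M tokens/hour; the hourly rate is assumed,
# picked only to illustrate how ~75 cents per megatoken falls out.
cost = usd_per_megatoken(2500, 6.75)
print(f"${cost:.2f} per megatoken")
```

At 2,500 tok/s the GPU produces 9 million tokens per hour, so any hourly price in the $6–7 range lands in the quoted neighborhood.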