Apple's MLX library for on-device inference has seen significant updates: MLX Swift gained fused attention and fast quantized kernels, while MLX LM picked up compilation, better data packing, and gradient checkpointing, making (Q)LoRA fine-tuning faster and more memory efficient. Users report higher speeds and better memory usage on Apple silicon, such as a 3B model going from 12 to ~16 tokens/s on an M1 with 16GB. Elsewhere, Levanter added the Sophia optimizer for roughly 2x faster training, alongside Llama and Mistral support, LoRA, and TPU/GPU backends, and the community-driven mlxim project brings clean image model implementations to MLX.
MLX-LM got faster 🚀 My M1 16GB went from 12 tokens/s last week to ~16 tokens/s when running a 3B param model locally. https://t.co/qPhHefi5V7
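A quick way to reproduce this kind of local benchmark is mlx-lm's command-line generator, which reports prompt and generation tokens-per-second after each run. This is a minimal sketch: the model repo and prompt are placeholders, so swap in whichever 3B model you are testing.

```bash
pip install -U mlx-lm

# Generate with a quantized model and read the tokens-per-sec
# stats printed at the end (hypothetical model repo below).
python -m mlx_lm.generate \
  --model mlx-community/some-3b-model-4bit \
  --prompt "Write a haiku about Apple silicon." \
  --max-tokens 256
```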
Do you like Apple's MLX library for on-device inference? There's now mlxim too! It's a project led by community member @r_musmeci that takes inspiration from timm to create clean implementations of image models on top of MLX. Join the org to contribute! https://t.co/0QKe7L8Wcc
Levanter has the Sophia optimizer now, so you can train models ~2x faster. Together with Llama + Mistral support, LoRA, TPU + GPU backends, reproducibility, scalability, legibility, and a clean codebase, why not give Levanter a spin for your next LM training/fine-tuning run? https://t.co/hvm5pb0MgC
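Levanter runs are driven by YAML configs plus a training entry point. The sketch below assumes that design: the config file name is hypothetical, and the optimizer is selected inside the YAML, so check the Levanter docs for the exact Sophia spelling before running.

```bash
# Launch an LM training run with Levanter's trainer entry point.
# The config path is a placeholder; Sophia is enabled in the YAML
# (an optimizer section such as "type: sophia-h" -- verify the
# exact key against the Levanter documentation).
python -m levanter.main.train_lm \
  --config_path config/llama_7b_sophia.yaml
```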
I never thought this would happen: I am using my Linux box with a GPU less and less these days. Inference and training speed using Apple MLX is blazing on my new Mac M3. This is a video of Mistral 7b (4bit) getting 30-40 tok/s @transformerlab @awnihannun #mlx https://t.co/xswEOsS1y4
(Q)LoRA in MLX LM is also faster and more memory efficient thanks to:
- compilation
- better data packing
- gradient checkpointing

pip install -U mlx-lm

Fine-tuning 4-bit Mistral 7B on an 8GB (!) M1 is actually quite doable: https://t.co/WhFBDQLDHi
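For reference, a QLoRA-style run with mlx-lm's LoRA entry point might look like the following sketch. The data path and model repo are placeholders, and the small batch size and reduced number of adapted layers are assumptions aimed at fitting in 8GB of unified memory rather than settings taken from the linked example.

```bash
# Fine-tune a 4-bit Mistral 7B with LoRA adapters via mlx-lm.
# --data points to a placeholder directory of train/valid JSONL
# files; batch size 1 and few LoRA layers keep memory use low.
python -m mlx_lm.lora \
  --model mlx-community/Mistral-7B-v0.2-4bit \
  --train \
  --data ./data \
  --batch-size 1 \
  --lora-layers 4 \
  --iters 600
```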
MLX Swift is updated with fused attention (from @argmaxinc) and fast quantized kernels. LLM example here: https://t.co/Qjo1DWwqfI A 4-bit Mistral 7B runs quite fast for thousands of tokens on my M1: https://t.co/zi3uxJxE3C