Apple's MLX library for on-device inference has seen significant updates: MLX Swift gained fused attention and fast quantized kernels, while MLX LM picked up compilation, better data packing, and gradient checkpointing, making (Q)LoRA fine-tuning faster and more memory efficient. Users report higher speeds and better memory usage on Apple silicon, such as a 3B model going from 12 to ~16 tokens/s on an M1 with 16GB. Elsewhere, Levanter added the Sophia optimizer for roughly 2x faster training, alongside Llama and Mistral support, LoRA, and TPU/GPU backends, and the community-driven mlxim project brings clean image model implementations to MLX.
MLX-LM got faster 🚀 My M1 16GB went from 12 tokens/s last week to ~16 tokens/s when running a 3B param model locally. https://t.co/qPhHefi5V7
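A quick way to reproduce this kind of local benchmark is mlx-lm's command-line generator, which reports prompt and generation tokens-per-second after each run. This is a minimal sketch: the model repo and prompt are placeholders, so swap in whichever 3B model you are testing.

```bash
pip install -U mlx-lm

# Generate with a quantized model and read the tokens-per-sec
# stats printed at the end (hypothetical model repo below).
python -m mlx_lm.generate \
  --model mlx-community/some-3b-model-4bit \
  --prompt "Write a haiku about Apple silicon." \
  --max-tokens 256
```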
Do you like Apple's MLX library for on-device inference? There's now mlxim too! It's a project led by community member @r_musmeci that takes inspiration from timm to create clean implementations of image models on top of MLX. Join the org to contribute! https://t.co/0QKe7L8Wcc
Levanter has the Sophia optimizer now, so you can train models ~2x faster. Together with Llama + Mistral support, LoRA, TPU + GPU backends, reproducibility, scalability, legibility, and a clean codebase, why not give Levanter a spin for your next LM training/fine-tuning run? https://t.co/hvm5pb0MgC
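Levanter runs are driven by YAML configs plus a training entry point. The sketch below assumes that design: the config file name is hypothetical, and the optimizer is selected inside the YAML, so check the Levanter docs for the exact Sophia spelling before running.

```bash
# Launch an LM training run with Levanter's trainer entry point.
# The config path is a placeholder; Sophia is enabled in the YAML
# (an optimizer section such as "type: sophia-h" -- verify the
# exact key against the Levanter documentation).
python -m levanter.main.train_lm \
  --config_path config/llama_7b_sophia.yaml
```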
I never thought this would happen: I am using my Linux box with a GPU less and less these days. Inference and training speed using Apple MLX is blazing on my new Mac M3. This is a video of Mistral 7b (4bit) getting 30-40 tok/s @transformerlab @awnihannun #mlx https://t.co/xswEOsS1y4
(Q)LoRA in MLX LM is also faster and more memory efficient thanks to:
- compilation
- better data packing
- gradient checkpointing

pip install -U mlx-lm

Fine-tuning 4-bit Mistral 7B on an 8GB (!) M1 is actually quite doable: https://t.co/WhFBDQLDHi
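For reference, a QLoRA-style run with mlx-lm's LoRA entry point might look like the following sketch. The data path and model repo are placeholders, and the small batch size and reduced number of adapted layers are assumptions aimed at fitting in 8GB of unified memory rather than settings taken from the linked example.

```bash
# Fine-tune a 4-bit Mistral 7B with LoRA adapters via mlx-lm.
# --data points to a placeholder directory of train/valid JSONL
# files; batch size 1 and few LoRA layers keep memory use low.
python -m mlx_lm.lora \
  --model mlx-community/Mistral-7B-v0.2-4bit \
  --train \
  --data ./data \
  --batch-size 1 \
  --lora-layers 4 \
  --iters 600
```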
MLX Swift is updated with fused attention (from @argmaxinc) and fast quantized kernels. LLM example here: https://t.co/Qjo1DWwqfI A 4-bit Mistral 7B runs quite fast for thousands of tokens on my M1: https://t.co/zi3uxJxE3C