Apple and its collaborators are making significant strides in machine learning with the release of MLXServer and updates to MLX Swift. MLXServer, a new project announced by Mustafa (@maxaljadery) and Siddharth, gives developers an easy way to work with large language models (LLMs) locally, exposing HTTP endpoints for text generation, chat, model conversion, and more. It installs via 'pip install mlxserver' and is optimized for Apple's Metal, signaling a focus on performance. Concurrently, MLX Swift has been updated with fused attention (contributed by @argmaxinc) and fast quantized kernels, while (Q)LoRA support in MLX LM has gained flexibility (tunable layers, rank, scale) and efficiency for fine-tuning. These updates suggest Apple's ambition to position MLX as a serious competitor to TensorFlow and PyTorch, especially given its unified memory model, which runs operations in parallel with automatic dependency insertion. The MLX Swift update also ships an LLM example in which a 4-bit Mistral 7B model runs quickly on an M1 chip, highlighting Apple's commitment to optimizing machine learning on its own hardware. Separately, the Levanter project now supports LoRA for fully reproducible lightweight fine-tuning on GPU or TPU, and in MLX LM, compilation, better data packing, and gradient checkpointing make fine-tuning a 4-bit Mistral 7B on an 8GB M1 quite feasible.
(Q)LoRA in MLX LM is also faster and more memory efficient thanks to:
- compilation
- better data packing
- gradient checkpointing
pip install -U mlx-lm
Fine-tuning 4-bit Mistral 7B on an 8GB (!) M1 is actually quite doable: https://t.co/WhFBDQLDHi
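A rough back-of-envelope check (my own arithmetic, not from the thread) shows why 4-bit quantization is what makes an 8GB M1 workable here:

```python
# Back-of-envelope memory estimate for a 4-bit quantized Mistral 7B.
# The parameter count is approximate; all figures are illustrative,
# not measurements from the thread.

PARAMS = 7.25e9          # Mistral 7B has roughly 7.25 billion parameters
BITS_PER_WEIGHT = 4      # 4-bit quantization
BYTES_PER_GB = 1024**3

weights_gb = PARAMS * BITS_PER_WEIGHT / 8 / BYTES_PER_GB
print(f"4-bit weights: ~{weights_gb:.1f} GB")   # roughly 3.4 GB

# Compare with float16 weights, which alone would overflow an 8GB machine:
fp16_gb = PARAMS * 2 / BYTES_PER_GB
print(f"fp16 weights:  ~{fp16_gb:.1f} GB")      # roughly 13.5 GB
```

With only ~3.4 GB spent on weights, the remaining headroom on an 8GB machine goes to LoRA adapters, activations, and the OS — which is exactly where gradient checkpointing and better data packing pay off.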
(Q)LoRA in MLX LM is a lot more flexible now: tune layers, rank, scale, and more. pip install -U mlx-lm Example config: https://t.co/0SzXyddDdb Thanks to Chimezie https://t.co/3EPLGxAys9 for the addition! https://t.co/XH0wVQgmiN
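The example config is behind a shortened link, so its contents aren't visible in the thread; as a sketch of the kind of knobs the tweet mentions (layers, rank, scale), a fine-tuning config might look like the following. The key names and values are illustrative assumptions, not the actual mlx-lm schema:

```python
# Illustrative (Q)LoRA fine-tuning config as a plain dict.
# Key names and values are assumptions for illustration only;
# consult the mlx-lm docs for the real schema.
lora_config = {
    "model": "mlx-community/Mistral-7B-4bit",  # hypothetical model id
    "train": True,
    "lora_layers": 16,        # how many transformer layers get adapters
    "lora_parameters": {
        "rank": 8,            # adapter rank: capacity vs. memory trade-off
        "scale": 20.0,        # scaling applied to the adapter output
        "dropout": 0.0,
    },
    "batch_size": 1,          # a small batch keeps peak memory low on 8GB
    "grad_checkpoint": True,  # recompute activations to save memory
}
```

A lower rank and fewer adapted layers shrink both the trainable parameter count and peak memory, which is the flexibility the update is advertising.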
MLX Swift is updated with fused attention (from @argmaxinc) and fast quantized kernels. LLM example here: https://t.co/Qjo1DWwqfI A 4-bit Mistral 7B runs quite fast for thousands of tokens on my M1: https://t.co/zi3uxJxE3C
Levanter has LoRA support! Now you can do lightweight fine-tuning in a fully reproducible way on GPU or TPU. https://t.co/Zf4jcQRRZD
You gotta love what @apple’s mlx team cooked: - A unified memory model that literally does compute-magic: parallel operations with automatic dependency insertions. - Supports off-the-shelf use of all the fun stuff in composable func transformations (differentiation,… https://t.co/fxz6CEoi9H
Apple is going all out with MLX. a few days ago they re-released MLX with Swift so you can run LLMs locally. now they’re onto MLXServer so you can build APIs around them more easily. solid TF/PyTorch competitor in the making. https://t.co/8auXfSYvax
Exciting new project: MLXServer An easy way to get started with LLMs locally. HTTP endpoints for text generation, chat, converting models, and more. Setup: pip install mlxserver Docs: https://t.co/mLCWxUdcec Example: https://t.co/DEQLHOSAZp https://t.co/zSMgfoIGz1
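The docs links above are shortened, so the exact routes aren't visible in the thread; the sketch below shows the general shape of calling a local text-generation endpoint over HTTP. The port, the `/generate` route, and the payload/response fields are assumptions for illustration, not MLXServer's documented API:

```python
import json
from urllib import request

# Hypothetical MLXServer-style client. The port, the /generate route,
# and the payload/response fields are illustrative assumptions, not
# the documented MLXServer API.
BASE_URL = "http://localhost:5000"

def build_payload(prompt: str, max_tokens: int = 128) -> dict:
    """Assemble the JSON body for a text-generation request."""
    return {"prompt": prompt, "max_tokens": max_tokens}

def generate(prompt: str, max_tokens: int = 128) -> str:
    body = json.dumps(build_payload(prompt, max_tokens)).encode()
    req = request.Request(
        f"{BASE_URL}/generate",  # assumed route
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:   # requires a running local server
        return json.load(resp)["text"]   # assumed response field
```

The appeal of this pattern is that the model stays resident in the server process, so repeated requests from scripts or apps don't pay the model-loading cost each time.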
Mustafa (@maxaljadery) and I are excited to announce MLXserver: a Python endpoint for downloading and performing inference with open-source models optimized for Apple Metal ⚙️ Docs: https://t.co/69nBje4BJk https://t.co/vnLtMSJYtL