Researchers are exploring the benefits of quantizing large language models (LLMs) to 1-bit and 2-bit precision, finding that a heavily quantized larger model can outperform a smaller full-precision one at a comparable memory footprint. Techniques like HQQ+ and layer pruning are being used to scale down inference and pre-training, offering efficiency breakthroughs in LLMs.
so last month msft published a paper showing a 1-bit parameter LLM with minimal performance loss. someone on huggingface just replicated the results today. this is at least a 10x reduction in memory footprint and opens up a path for even more gains in training / inference speeds https://t.co/ApHeGZDrFA
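For intuition, here is a minimal sketch of what "1-bit" weights mean in practice, assuming the tweet refers to Microsoft's BitNet b1.58 line of work, which actually uses ternary {-1, 0, +1} weights scaled by the mean absolute value (the "absmean" scheme described in that paper). This is an illustration of the quantization step only, not the replication itself.

```python
import torch

def absmean_ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Quantize a weight tensor to ternary values {-1, 0, +1}.

    Absmean scheme: scale by the mean absolute value, round,
    and clip to [-1, 1]. Returns the ternary weights plus the
    scale needed to dequantize.
    """
    scale = w.abs().mean().clamp(min=eps)          # per-tensor absmean scale
    w_ternary = (w / scale).round().clamp_(-1, 1)  # values in {-1, 0, +1}
    return w_ternary, scale

# Toy example with a random weight matrix
w = torch.randn(4096, 4096)
w_q, s = absmean_ternary_quantize(w)
w_hat = w_q * s                    # dequantized approximation used at inference
print(w_q.unique())                # tensor([-1., 0., 1.])
print((w - w_hat).abs().mean())    # mean quantization error
```

Storing each weight in ~1.58 bits instead of 16 is where the roughly 10x memory reduction comes from; the remaining gains come from replacing multiplications with additions in the matmuls.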
Efficiency Breakthroughs in LLMs: Combining Quantization, LoRA, and Pruning for Scaled-down Inference and Pre-training Quick read: https://t.co/SKDFQJZeNo Researchers from Meta FAIR, UMD, Cisco, Zyphra, MIT, and Sequoia Capital examine a layer-pruning approach for popular…
Promising quantization method for 2-bit and 1-bit LLMs. Less useful for models that are already *small*, but doing this on a larger model is very interesting. E.g., the Mixtral model can be brought down to 14GB of VRAM (from 94GB), the equivalent of mistral-7b running at fp16 but a… https://t.co/0ymQvxPpa9 https://t.co/AvEzkwZbyl
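A quick back-of-the-envelope check of those VRAM numbers, assuming Mixtral 8x7B has roughly 47B parameters and ignoring activations, KV cache, and quantization metadata (the figures are approximate):

```python
# Rough sanity check of the fp16 vs 2-bit memory figures quoted above.
params = 47e9  # approximate parameter count for Mixtral 8x7B

fp16_gb = params * 16 / 8 / 1e9    # 2 bytes per weight   -> ~94 GB
two_bit_gb = params * 2 / 8 / 1e9  # 0.25 bytes per weight -> ~12 GB

print(f"fp16:  {fp16_gb:.0f} GB")    # ~94 GB, matching the tweet
print(f"2-bit: {two_bit_gb:.0f} GB") # ~12 GB; per-group scales/zero-points and
                                     # any layers kept at higher precision push
                                     # the total toward the ~14 GB quoted
```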
Very cool result that points towards existing LLMs being “too deep,” paying costs in compute w/o getting much back for performance! Similar to our conclusion in https://t.co/Gbsa80GUSy, but here they focus on pruning an existing model rather than training from scratch! https://t.co/NId0F1nY5j
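The layer-pruning approach discussed in the two tweets above drops a contiguous block of decoder layers from an existing model and then lightly finetunes to recover quality. A minimal sketch, assuming a Llama-style Hugging Face model where the decoder blocks live in `model.model.layers`; the layer range here is hard-coded for illustration, whereas the paper selects it by measuring similarity between layer inputs and outputs:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)

# Hypothetical contiguous block of layers to remove
drop_start, drop_end = 24, 28
kept = [
    layer for i, layer in enumerate(model.model.layers)
    if not (drop_start <= i < drop_end)
]
model.model.layers = torch.nn.ModuleList(kept)
model.config.num_hidden_layers = len(kept)

# The paper then "heals" the pruned model with a small amount of
# parameter-efficient finetuning (e.g. QLoRA) to recover performance.
```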
The new era of 1-bit and 2-bit quantization. Their "findings indicate that heavily quantizing larger models using techniques like HQQ+ can yield superior performance while still maintaining a relatively small memory footprint." https://t.co/maQCvColOf
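For context on what a 2-bit format looks like: the sketch below is plain round-to-nearest, group-wise 2-bit quantization, not HQQ/HQQ+ itself (HQQ refines scales and zero-points with a half-quadratic solver, and HQQ+ additionally trains low-rank adapters on top); `groupwise_2bit_quantize` is a hypothetical helper illustrating the storage format such methods target.

```python
import torch

def groupwise_2bit_quantize(w: torch.Tensor, group_size: int = 64):
    """Naive round-to-nearest 2-bit affine quantization with a
    per-group scale and zero-point (4 levels per weight)."""
    orig_shape = w.shape
    groups = w.reshape(-1, group_size)                  # one scale/zero per group
    w_min = groups.min(dim=1, keepdim=True).values
    w_max = groups.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / 3.0       # 2 bits -> codes in {0..3}
    zero = w_min
    q = ((groups - zero) / scale).round().clamp_(0, 3)  # quantized codes
    w_hat = (q * scale + zero).reshape(orig_shape)      # dequantized approximation
    return q.to(torch.uint8), scale, zero, w_hat

w = torch.randn(4096, 4096)
q, scale, zero, w_hat = groupwise_2bit_quantize(w)
print((w - w_hat).abs().mean())  # reconstruction error of naive 2-bit RTN
```

The quoted finding is essentially that, at a fixed memory budget, spending it on a heavily quantized larger model (with better scales/zero-points than this naive version, plus adapters in HQQ+) can beat a smaller model kept at full precision.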