Recent developments in large language model (LLM) quantization mark significant progress, with Intel's Neural Compressor team leading the charge. Their new state-of-the-art (SOTA) low-bit quantization approach, AutoRound, has been highlighted for enabling full 4-bit matrix multiplications (matmuls), which significantly speed up large-batch inference. This advancement is seen as a major step toward making LLM deployment at scale more efficient and cost-effective. The team has also shared work on FP8 inference and efficient post-training quantization, contributing further to the discussion of LLM optimization across hardware platforms. Community interest extends to a meetup in San Francisco, paired with a virtual session on "The Era of 1-bit LLMs" that includes training a 1.58-bit model named Bessie, indicating growing investment in refining LLM deployment techniques.
🔥Want to use FP8 inference easily? Intel Neural Compressor is your best choice: https://t.co/XklzQFSYdz 🎯Sharing our MLSys'24 camera-ready paper: Efficient Post-Training Quantization with FP8 Formats 🤗https://t.co/CHJyvQZhA2 @_akhaliq @navikm @huggingface #IAmIntel https://t.co/v7HO1bq8Ed
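To make the FP8 discussion concrete, the sketch below simulates per-tensor scaled E4M3 quantization in plain PyTorch. It illustrates the numerics only and is not the Neural Compressor API; the `E4M3_MAX` constant and the round-trip test are our own scaffolding, and the `torch.float8_e4m3fn` dtype assumes PyTorch 2.1 or later.

```python
# Minimal sketch of per-tensor scaled FP8 (E4M3) quantization, as used in
# FP8 post-training quantization schemes. NOT the Neural Compressor API;
# just an illustration of the numerics with PyTorch's float8 dtype.
import torch

E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def fp8_quantize(x: torch.Tensor):
    """Scale a tensor into E4M3 range, cast to FP8, and return (fp8, scale)."""
    scale = x.abs().max().clamp(min=1e-12) / E4M3_MAX
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

def fp8_dequantize(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Map the FP8 tensor back to float32 for reference computation."""
    return x_fp8.to(torch.float32) * scale

x = torch.randn(4, 8)
x_fp8, s = fp8_quantize(x)
err = (x - fp8_dequantize(x_fp8, s)).abs().max()
print(f"max abs error after FP8 round-trip: {err.item():.5f}")
```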
AutoRound: Nice work from the Intel® Neural Compressor team. ✨ 📌 SOTA Weight-Only Quantization Algorithm for LLMs Across Hardware Platforms 📌 Designed specifically for low-bit LLM inference, approaching near-lossless compression for a range of popular models 📌 Only tuning… https://t.co/OhcK7uve3y
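As a rough intuition for what "tuning the rounding" means here, the toy sketch below learns a small bounded per-weight rounding offset that minimizes a layer's output reconstruction error on calibration data. This is not Intel's implementation (the AutoRound paper uses signed gradient descent and also tunes clipping ranges); the function name, hyperparameters, and per-tensor scaling are illustrative choices.

```python
# Toy sketch of AutoRound-style weight-only quantization: instead of always
# rounding-to-nearest, learn a per-weight rounding offset (here via plain
# SGD; the paper uses signed gradient descent) that minimizes the layer's
# output reconstruction error. Illustration only, not Intel's implementation.
import torch

def autoround_sketch(w, x, bits=4, steps=200, lr=5e-3):
    """Quantize weight `w` to signed ints with a tuned rounding offset."""
    qmax = 2 ** (bits - 1) - 1                   # e.g. 7 for 4-bit signed
    scale = w.abs().max() / qmax                 # per-tensor scale (toy choice)
    v = torch.zeros_like(w, requires_grad=True)  # learnable rounding offset
    opt = torch.optim.SGD([v], lr=lr)
    y_ref = x @ w.T                              # full-precision layer output
    for _ in range(steps):
        q = w / scale + 0.5 * torch.tanh(v)      # offset bounded in (-0.5, 0.5)
        q = (q.round() - q).detach() + q         # straight-through estimator
        w_q = q.clamp(-qmax - 1, qmax) * scale
        loss = ((x @ w_q.T) - y_ref).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        q = (w / scale + 0.5 * torch.tanh(v)).round().clamp(-qmax - 1, qmax)
    return q * scale  # dequantized weight, for evaluating quantization error

w = torch.randn(64, 64)
x = torch.randn(256, 64)  # small calibration batch
w_q = autoround_sketch(w, x)
print("reconstruction MSE:", ((x @ w_q.T) - (x @ w.T)).pow(2).mean().item())
```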
This week our Arxiv Dive is both a MEETUP in SF and virtual. Paper is on "The Era of 1-bit LLMs"... both high-performing AND cost-effective🤯. We're training our own 1.58-bit model, Bessie, too! We welcome the @Microsoft team to nerd out with us. @ma_shuming @realHongyu_Wang @donglixp https://t.co/t4Cd4AmmQ4
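For context on the "1.58-bit" figure: BitNet b1.58 constrains each weight to the ternary set {-1, 0, +1}, which carries log2(3) ≈ 1.58 bits of information per weight. A minimal sketch of the paper's absmean quantization scheme:

```python
# Minimal sketch of the 1.58-bit (ternary) weight quantization from
# "The Era of 1-bit LLMs" (BitNet b1.58): scale weights by their mean
# absolute value, then round and clip to {-1, 0, +1}. Illustration only,
# assuming the paper's absmean scheme.
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Return ternary weights in {-1, 0, 1} plus the absmean scale."""
    gamma = w.abs().mean()                          # absmean scaling factor
    w_t = (w / (gamma + eps)).round().clamp(-1, 1)  # RoundClip to {-1, 0, 1}
    return w_t, gamma

w = torch.randn(4, 8)
w_t, gamma = ternary_quantize(w)
print(w_t)                        # entries are -1.0, 0.0, or 1.0
print("dequantized:", w_t * gamma)
```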
⚡️AutoRound, a new SOTA low-bit LLM quantization approach developed by the Intel Neural Compressor team (https://t.co/XklzQFSYdz) 🎯Lots of interesting comparisons with GPTQ, AWQ, HQQ, etc. Check out the blog for more details: https://t.co/1fdyEs8Khx @huggingface #IAmIntel
This is excellent work, a big step forward in quantization! It enables full 4-bit matmuls, which can substantially speed up large-batch inference. Anyone deploying LLMs at scale will soon use this or similar techniques. https://t.co/3Q0RCbFES2
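Why do full 4-bit matmuls matter for large batches? With weight-only quantization, weights are dequantized to fp16 before the matmul, so the compute itself stays at high precision; when both activations and weights are int4, the multiply-accumulate can run on low-precision integer units. The sketch below simulates the numerics only (real speedups require int4 hardware kernels; the function names and per-tensor scheme are illustrative assumptions):

```python
# Numerics-only simulation of a W4A4 matmul: quantize BOTH activations and
# weights to int4 range, do the matmul with int32 accumulation, and rescale.
# On real hardware this maps to low-precision integer units; here PyTorch
# just emulates the arithmetic.
import torch

def quant_int4(x):
    """Symmetric per-tensor int4 quantization: values in [-8, 7]."""
    scale = x.abs().max().clamp(min=1e-12) / 7.0
    q = (x / scale).round().clamp(-8, 7).to(torch.int32)
    return q, scale

a = torch.randn(32, 64)        # activations (e.g. a large batch of tokens)
w = torch.randn(128, 64)       # weight matrix

qa, sa = quant_int4(a)
qw, sw = quant_int4(w)
y_int = qa @ qw.T              # integer matmul, int32 accumulation
y = y_int.to(torch.float32) * (sa * sw)   # rescale back to float

ref = a @ w.T
print("relative error:", ((y - ref).norm() / ref.norm()).item())
```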