The field of Large Language Models (LLMs) is advancing on two fronts: compression and inference. A new paper, 'Extreme Compression of Large Language Models via Additive Quantization', explores compressing LLMs to 2 to 3 bits per parameter to ease GPU memory constraints. Meanwhile, UnstructuredIO has announced a private beta of its no-code Enterprise Platform, designed to supply Retrieval-Augmented Generation (RAG) applications with enterprise-grade data. In a separate development, a team has introduced 'Hydragen', an exact attention implementation that improves LLM inference throughput by up to 32x for batches of sequences that share a prefix, such as a system prompt or few-shot examples. The technique requires no custom CUDA and is particularly relevant to applications like ChatGPT, which reportedly uses a roughly 1700-token system prompt. Finally, the new quantization technique AQLM is being discussed as a genuine compression method rather than just a quantization approach, underscoring the importance of compression in AI development.
ChatGPT's 1700-token system prompt got you down? Led by @jordanjuravsky and @brad19brown: introducing Hydragen, a simple technique for Transformer LLM inference with shared prefixes! Up to 30x improvement in throughput with no custom CUDA! A few things I love in this project: 1/ https://t.co/D8DJIz0Bq9 https://t.co/Rq6XjCZlCp
Excited to share Hydragen, an exact implementation of attention that improves LLM inference throughput by up to 32x for shared prefix sequences (e.g., when we have a system prompt / use few-shot examples / generate many samples for the same prompt), with speedup growing with the… https://t.co/cdUEElj8HP
This new awesome LLM Quantization technique AQLM is more of a compression method rather than just a quant method. Compression is so important for good AI of any kind! Surprised there has not been much effort to take advantage of the similarities between experts in MoE models… https://t.co/axMbQf9J9K
Excited to share my first PhD project! TLDR: Hydragen is an exact, simple (no custom CUDA) implementation of attention for large batches with shared prefixes. We can improve LLM throughput by over 30x for CodeLlama-13b. Also, adding lots more shared context becomes cheap:… https://t.co/Bs23Y51xYs
Hydragen High-Throughput LLM Inference with Shared Prefixes paper page: https://t.co/GCnYzqv5Dz Transformer-based large language models (LLMs) are now deployed to hundreds of millions of users. LLM inference is commonly performed on batches of sequences that share a prefix,… https://t.co/aaDVPFbTL3
BiLLM Pushing the Limit of Post-Training Quantization for LLMs paper page: https://t.co/aDPZ5SLXo0 Pretrained large language models (LLMs) exhibit exceptional general language processing capabilities but come with significant demands on memory and computational resources. As a… https://t.co/kyojFF5lok
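BiLLM's actual method is considerably more elaborate (it structurally selects salient weights and uses residual approximation), but the baseline that post-training binarization builds on is 1-bit sign codes with one floating-point scale per row. The sketch below is a minimal, hypothetical illustration of that baseline, not BiLLM itself; the function names are my own. The scale `mean(|w|)` is the L2-optimal choice for fixed sign codes.

```python
import numpy as np

def binarize(w):
    """1-bit weight quantization: per-row sign matrix plus a single
    fp scale per row. For fixed sign codes s = sign(w), the scale
    alpha = mean(|w|) minimizes ||w - alpha * s||^2."""
    scale = np.abs(w).mean(axis=1, keepdims=True)
    signs = np.where(w >= 0, 1.0, -1.0)   # avoid sign(0) == 0
    return signs, scale

def debinarize(signs, scale):
    """Reconstruct the (lossy) weight matrix from codes and scales."""
    return signs * scale
```

Storage drops from 16 bits per weight to roughly 1 bit plus a small per-row overhead; the reconstruction is lossy, which is why methods like BiLLM spend extra bits on the few salient weights that matter most.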
Hydragen: High-Throughput LLM Inference with Shared Prefixes Improves end-to-end LLM throughput by up to 32x against competitive baselines https://t.co/1wpnafRPN7 https://t.co/IgpIdAQTyK
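The core idea behind decomposed attention over a shared prefix can be sketched exactly: attention over the prefix and over each sequence's suffix are computed separately, then merged using their log-sum-exp normalizers, which reproduces full attention with no approximation. The sketch below (a minimal NumPy illustration under my own naming, not the authors' implementation, and omitting the batching across sequences that yields the actual speedup) shows the merge:

```python
import numpy as np

def partial_attn(q, k, v):
    """Softmax attention over one KV chunk, returning the normalized
    output plus the log-sum-exp of the scores (needed to merge chunks)."""
    s = q @ k.T / np.sqrt(q.shape[-1])          # (n_q, n_k) scores
    m = s.max(axis=-1, keepdims=True)
    p = np.exp(s - m)
    denom = p.sum(axis=-1, keepdims=True)
    return (p @ v) / denom, m + np.log(denom)   # output, log-sum-exp

def decomposed_attn(q, k_pre, v_pre, k_suf, v_suf):
    """Attend to prefix and suffix separately, then merge exactly:
    each chunk's softmax mass is proportional to exp(its log-sum-exp)."""
    o_pre, lse_pre = partial_attn(q, k_pre, v_pre)
    o_suf, lse_suf = partial_attn(q, k_suf, v_suf)
    w_pre = 1.0 / (1.0 + np.exp(lse_suf - lse_pre))
    return w_pre * o_pre + (1.0 - w_pre) * o_suf

def full_attn(q, k, v):
    """Reference: ordinary softmax attention over the concatenated KV."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v
```

Because the prefix KV is identical across sequences, the prefix half of this computation can be done once for the whole batch as a dense matrix-matrix product, which is where the throughput gain comes from.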
Comparing Large Language Models Against Lawyers - Marginal REVOLUTION “Our empirical analysis benchmarks LLMs against a ground truth set by Senior Lawyers, uncovering that advanced models match or exceed human accuracy in determining legal issues. In speed, LLMs complete reviews…
Check out our blog post to learn more about our no-code, enterprise-ready, SOC2 compliant, connector rich, file type rich, chunk ready, embedding ready, #RAGready solution for continuous automatic hydration of unstructured data to your LLM applications.💥 The Unstructured… https://t.co/MZ8lDElWDG
Excited about our latest work on using LLMs to assist humans in answering questions! https://t.co/aU9VnHiFlf
🚀 Exciting news for #LLM enthusiasts! We've unveiled the fastest method to curate clean training data for LLMs during fine-tuning for Q/A tasks. Say goodbye to the hurdles that prevent your LLM from moving from demo to production. More details 👇 https://t.co/p7U8VNWIKB
⚡️We are excited to announce that our new no-code Enterprise Platform is NOW available in private beta! As RAG apps advance from prototype to production we’ve been overwhelmed by requests for an enterprise grade solution to provide these applications with the data they need.… https://t.co/Ew6ShDQsGb
Great paper for the GPU-constrained LLM setup - "Extreme Compression of Large Language Models via Additive Quantization" 💡 📌 Revisits the problem of compressing LLMs down to 2 to 3 bits per parameter. ------ ❓ Fundamental problem with LLM Quantization that this Paper aims to… https://t.co/tFzpmbA4ER
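Additive quantization represents each small group of weights as a sum of codewords, one drawn from each of several codebooks: with M codebooks of K entries over groups of g weights, storage is about M·log2(K)/g bits per weight (e.g. 2 codebooks of 256 codewords over 8-weight groups ≈ 2 bits per parameter). The sketch below is a hypothetical, greatly simplified illustration of that representation only: it uses fixed codebooks and greedy residual encoding, whereas AQLM learns the codebooks and fine-tunes them; the names `aq_encode`/`aq_decode` are my own.

```python
import numpy as np

def aq_encode(w, codebooks):
    """Greedy residual additive quantization: each g-dim weight group
    is approximated by a sum of one codeword per codebook."""
    residual = w.copy()
    codes = []
    for cb in codebooks:                               # cb: (K, g)
        d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(axis=1)                         # nearest codeword
        codes.append(idx)
        residual = residual - cb[idx]                  # encode the rest
    return np.stack(codes, axis=1)                     # (n_groups, M)

def aq_decode(codes, codebooks):
    """Reconstruct each weight group by summing its selected codewords."""
    out = np.zeros((codes.shape[0], codebooks[0].shape[1]))
    for m, cb in enumerate(codebooks):
        out += cb[codes[:, m]]
    return out
```

Only the small integer codes and the shared codebooks are stored, which is why the result behaves more like a learned compression scheme than per-weight rounding.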
How do we supervise models to answer truthfully when human labellers aren’t domain experts? 🤔 Our paper shows that non-expert humans can better judge answers after observing debates between expert LLMs. I’ll share tips for automated debates and getting LLM judges to work. 🧵 https://t.co/eKFXZRuCiC