In the era of Large Language Models (LLMs), researchers are turning toward less compute-intensive research directions, inspired by excellent reviews from several leading groups. PowerInfer, a high-speed inference engine for deploying LLMs locally on consumer GPUs, comes close to NVIDIA A100 performance levels: running Falcon(ReLU)-40B-FP16 on a single RTX 4090, it outperforms llama.cpp by up to 11.69x, attaining an average token generation rate of 13.20 tokens/s with a peak of 29.08 tokens/s. Apple has joined the LLM fray with 'LLM in a Flash', an approach to efficient inference with limited memory that can run models up to twice the size of the available DRAM. Other papers introduce a general-purpose coarse-to-fine vision-language model and 'Cascade Speculative Drafting', a technique for further accelerating LLM inference. DeepMind's latest paper shows that LLMs, often criticized for producing plausible but incorrect 'hallucinations', can nonetheless solve the formidable 'cap set' problem in mathematics. BAAI's Emu2, an open-source multimodal model, and MIT's Mini-GPTs, which use contextual pruning, round out the innovations pushing the boundaries of generative multimodal models and efficient LLMs.
[CL] Mini-GPTs: Efficient Large Language Models through Contextual Pruning T Valicenti, J Vidal, R Patnaik [MIT] (2023) https://t.co/0RAjbO1UAj - The paper introduces a novel approach to develop efficient, domain-specific large language models (LLMs) called Mini-GPTs using… https://t.co/Rrdvr3OHUR
Meet Emu2: BAAI's latest open-source multimodal AI model, advancing open and responsible AI research. Explore its capabilities in text and visual tasks with minimal guidance. 🔍 Project: https://t.co/ekJKO6owcT 📄 Paper: https://t.co/cN1ApnAQwR
High-Speed AI in Consumer-Grade Computers 👩‍💻 PowerInfer, a new tool for running advanced language models, brings high-speed AI processing to everyday computers with standard GPUs. This tool cleverly combines CPU and GPU capabilities to handle complex language tasks more… https://t.co/eeLCVaTrot
Mini-GPTs: Efficient Large Language Models through Contextual Pruning paper page: https://t.co/OYrcWAKqjX In AI research, the optimization of Large Language Models (LLMs) remains a significant challenge, crucial for advancing the field's practical applications and… https://t.co/dJDwzN14PU
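For intuition, here is a minimal sketch of what activation-based contextual pruning could look like: neurons that stay quiet on domain-specific calibration data get zeroed out. The `keep_ratio`, the mean-magnitude importance statistic, and the row-wise masking are illustrative assumptions, not the paper's exact procedure.

```python
import torch

@torch.no_grad()
def contextual_prune(linear: torch.nn.Linear, calib_acts: torch.Tensor, keep_ratio: float = 0.5):
    """Prune output neurons whose mean activation magnitude on domain-specific
    calibration data is low. calib_acts: (num_samples, out_features)."""
    importance = calib_acts.abs().mean(dim=0)         # per-neuron relevance to the domain
    k = max(1, int(importance.numel() * keep_ratio))
    threshold = importance.topk(k).values.min()
    mask = (importance >= threshold).to(linear.weight.dtype)
    linear.weight.mul_(mask.unsqueeze(1))             # zero whole rows (pruned neurons)
    if linear.bias is not None:
        linear.bias.mul_(mask)
    return mask
```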
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU paper page: https://t.co/GfwfNHOidp This paper introduces PowerInfer, a high-speed Large Language Model (LLM) inference engine on a personal computer (PC) equipped with a single consumer-grade GPU. The key… https://t.co/zIbJytkeAP
BAAI announces Generative Multimodal Models are In-Context Learners paper page: https://t.co/1sGkD35gjG The human ability to easily solve multimodal tasks in context (i.e., with only a few demonstrations or simple instructions) is what current multimodal systems have largely… https://t.co/vafFjgq5N7
Generative Multimodal Models are In-Context Learners abs: https://t.co/ZiqQ0DOVb3 project page: https://t.co/VLiUtKMgv3 demo: https://t.co/7rx5Q6kAJH Trains a 37b multimodal model called Emu2 with a unified autoregressive objective (predict next text token or visual embedding)… https://t.co/JNcV79gW0A
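The unified objective is easy to picture: one autoregressive loss where text positions are scored with next-token cross-entropy and visual positions with regression on the next embedding. The sketch below assumes the model already emits both heads; the tensor names and the plain sum of the two losses are my own illustrative choices.

```python
import torch
import torch.nn.functional as F

def unified_ar_loss(text_logits, text_targets, vis_preds, vis_targets, is_text, is_vis):
    # text_logits: (B, T, vocab); text_targets: (B, T) next-token ids
    # vis_preds / vis_targets: (B, T, D) next visual embeddings
    # is_text / is_vis: (B, T) boolean masks marking each position's modality
    ce = F.cross_entropy(text_logits[is_text], text_targets[is_text])   # classification on text tokens
    reg = F.mse_loss(vis_preds[is_vis], vis_targets[is_vis])            # regression on visual embeddings
    return ce + reg
```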
[CL] Cascade Speculative Drafting for Even Faster LLM Inference https://t.co/vZIrDssXv3 This paper introduces a new algorithm called "Cascade Speculative Drafting" for improving the inference speed of large language models (LLMs). By employing vertical and horizontal… https://t.co/zDTvNLGhu7
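For context, the base mechanism being cascaded is ordinary speculative decoding, sketched below in greedy form: a small drafter proposes k tokens and the target model verifies them all in one forward pass. Cascade Speculative Drafting then drafts the drafter with an even smaller model (the vertical cascade) and spends fewer draft tokens on later, less-likely positions (the horizontal cascade). The HF-style model calls and batch size 1 are assumptions for clarity, not the authors' code.

```python
import torch

@torch.no_grad()
def speculative_step(target, drafter, ids, k=4):
    draft = ids
    for _ in range(k):                                      # drafter proposes k greedy tokens
        nxt = drafter(draft).logits[:, -1].argmax(-1, keepdim=True)
        draft = torch.cat([draft, nxt], dim=-1)
    verify = target(draft).logits[:, -k-1:-1].argmax(-1)    # one target pass checks all k
    proposed = draft[:, -k:]
    n_ok = int((verify == proposed).long().cumprod(-1).sum())  # accept until first mismatch (B=1)
    accepted = proposed[:, :n_ok]
    fix = verify[:, n_ok:n_ok+1] if n_ok < k else None      # target's correction token
    return accepted, fix
```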
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU PowerInfer vs. llama.cpp on a single RTX 4090 (24G) running Falcon(ReLU)-40B-FP16 with an 11x speedup! Evaluation shows that PowerInfer attains an average token generation rate of 13.20 tokens/s, with a peak… https://t.co/rFhSYVXLnS
In a groundbreaking revelation, DeepMind's latest paper unveils a momentous leap in AI capabilities. Large Language Models (LLMs), often criticized for producing plausible but incorrect 'hallucinations', have now transcended this limitation to solve the formidable 'cap set' math… https://t.co/Xkrkw3ErZv
Big news! Get ready for even lower LLM API expenses "PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU" https://t.co/CnYRmThESc https://t.co/cLuZecQW3G
📌 Great paper for Local Inference of LLMs - "LLM in a Flash: Efficient Large Language Model Inference with Limited Memory" 🔥 By selectively loading only the necessary parameters, the authors demonstrate the ability to run models up to twice the size of the available DRAM.… https://t.co/YfizAFSrtt
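A toy sketch of the core idea, under my own assumptions: FFN weight rows live in a memory-mapped file standing in for flash, a sparsity predictor (not shown) guesses which ReLU neurons will fire, and only those rows are pulled into a DRAM cache on a miss. The file name, shapes, and caching policy are illustrative placeholders, not Apple's design.

```python
import numpy as np

D_MODEL, D_FF = 4096, 16384
# "flash": weights stay on disk, memory-mapped rather than loaded wholesale
flash_w = np.memmap("ffn_up.bin", dtype=np.float16, mode="r", shape=(D_FF, D_MODEL))
dram_cache = {}  # neuron id -> row; the paper manages this as a window of recently used neurons

def sparse_ffn_up(x, predicted_active):
    """Compute only the rows a sparsity predictor says will survive ReLU."""
    out = np.zeros(D_FF, dtype=np.float32)
    for i in predicted_active:
        if i not in dram_cache:                        # read from flash only on a miss
            dram_cache[i] = flash_w[i].astype(np.float32)
        out[i] = max(0.0, float(dram_cache[i] @ x))    # ReLU(W_up[i] · x)
    return out
```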
Apple is back?! LLM in a flash: Efficient Large Language Model Inference with Limited Memory https://t.co/gCgD1wV5gh
Apple joins the LLM fray with LLM in a flash: Efficient Large Language Model Inference with Limited Memory Presumably embeddable, inherently private, small-footprint LLMs on iPhones and iPads tied to actions are part of the strategic plan https://t.co/ce7wP49bXY
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model paper page: https://t.co/o3retncra2 The ability of large language models (LLMs) to process visual inputs has given rise to general-purpose vision systems, unifying various… https://t.co/Xed2GQ8DLw
PowerInfer can massively speed up inference on consumer GPUs. Almost reaching A100 levels. It outperforms llama.cpp by up to 11.69x while retaining model accuracy. PowerInfer reached an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across… https://t.co/euMttCeEmB
PowerInfer - a high-speed inference engine for deploying LLMs locally. Just came across this super interesting project on speeding up inference. It's not MoE but it's a simple approach that exploits the high locality in LLM inference to design a GPU-CPU hybrid inference engine.… https://t.co/hPD03oug3Y
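Schematically, the locality argument is that neuron activations follow a power law, so a small "hot" set fires for most tokens. Below is a sketch of the offline split as I read it from the abstract, with the hot fraction, the profiling statistic, and the devices as my own placeholders; CUDA availability is assumed.

```python
import torch

def split_hot_cold(w_ffn: torch.Tensor, fire_rate: torch.Tensor, hot_frac: float = 0.2):
    """w_ffn: (d_ff, d_model) FFN weights; fire_rate: per-neuron activation
    frequency from offline profiling. The power-law head gets pinned on the GPU."""
    n_hot = int(w_ffn.shape[0] * hot_frac)
    hot = fire_rate.topk(n_hot).indices
    cold = torch.ones(w_ffn.shape[0], dtype=torch.bool)
    cold[hot] = False
    w_hot = w_ffn[hot].to("cuda")   # preloaded once, serves most activations
    w_cold = w_ffn[cold]            # stays in CPU RAM, computed on demand
    return hot, w_hot, cold, w_cold
```

At decode time, a small predictor chooses which neurons to evaluate; hot ones run on the GPU and cold ones on the CPU, which avoids shuttling weights over PCIe on every token.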
NLP Research in the Era of LLMs Here is my take on research directions in the era of LLMs that are less compute-intensive. This is inspired by excellent reviews by @nsaphra, @togelius, @radamihalcea and @elgreco_winter's groups. https://t.co/SL80EIp8yM https://t.co/1mOdepFlJM