Recent research highlights significant challenges that large language models (LLMs) face when reasoning over long contexts. Despite strong retrieval capabilities, LLMs struggle with compositional reasoning tasks. For instance, the NoCha benchmark, which asks models to verify claims about new fictional books, shows that none of the 11 tested LLMs reaches human performance: the best, GPT-4o, scores only 55.8% against a human baseline of 97%, even though the same models solve needle-in-the-haystack retrieval nearly perfectly. Related work traces part of this weakness to a U-shaped positional attention bias that dominates generation behavior, pushing models to rely on the leading and ending portions of the context. Other evaluations reinforce the gap: on complex questions written by human readers of recent novels, LLMs perform poorly, and on a written version of the 2014 Comprehension Challenge, no open-weight model performs above random chance, despite strong performance on synthetic benchmarks.
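To make the retrieval-vs-reasoning contrast concrete, a needle-in-the-haystack probe buries one fact at varying depths in filler text and checks whether the model can recall it. The sketch below is a minimal, hypothetical version: `query_model`, the filler sentence, and the passphrase are placeholders, not any benchmark's actual setup.

```python
# Minimal needle-in-a-haystack probe: bury one fact at varying depths
# and check recall. `query_model` is a hypothetical stand-in for an LLM API.
NEEDLE = "The secret passphrase is 'cobalt heron'."
FILLER = "This sentence is neutral filler carrying no useful information."

def query_model(prompt: str) -> str:
    raise NotImplementedError("wire up your LLM client here")

def build_haystack(n_sentences: int, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    sentences = [FILLER] * n_sentences
    sentences.insert(int(depth * n_sentences), NEEDLE)
    return " ".join(sentences)

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_haystack(2000, depth) + "\n\nWhat is the secret passphrase?"
    answer = query_model(prompt)
    print(f"depth={depth:.2f} correct={'cobalt heron' in answer.lower()}")
```

Models that ace this kind of retrieval probe can still fail NoCha-style claims, which require composing information across the whole book rather than copying a single fact out.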
Wondering how LLMs do on the Comprehension Challenge I proposed in 2014? New results from an easier (written not visual) version of that test: “no open-weight model performs above random chance (despite their strong performance on synthetic benchmarks)” https://t.co/0YXkbIRWpt
🤔 Do you think LLM reasoning is solved? 🏆 High leaderboard numbers may not tell the whole story! Check out our new paper investigating the robustness of LLMs in reasoning! 🧠 https://t.co/sspCqWtf6c
New paper with students @BarnardCollege on testing orthogonal thinking / abstract reasoning capabilities of Large Language Models using the fascinating yet frustratingly difficult @nytimes Connections game. #NLProc #LLMs #GPT4o #Claude3opus 🧵(1/n) https://t.co/jDfCbpPi2Z
New research shows long-context language models (LCLMs) can match top retrieval systems in real tasks without training but falter in compositional reasoning, underscoring the need for effective prompting: https://t.co/83IyXMTPBN https://t.co/5t955ZmkBQ
New paper w/ @benwu_ml and @NeelNanda5! LLMs don’t just output the next token, they also output confidence. How is this computed? We find two key neuron families: entropy neurons exploit final LN scale to change entropy, and token freq neurons boost logits proportional to freq 🧵 https://t.co/seayNY6d25
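The confidence signal referenced here is typically measured as the entropy of the next-token distribution, H(p) = −Σ p_i log p_i. A minimal sketch of computing it from the logits, using gpt2 purely as an illustrative checkpoint:

```python
# Entropy of the next-token distribution as a confidence signal
# (minimal sketch; the checkpoint and prompt are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0, -1]           # logits for the next token

log_probs = torch.log_softmax(logits, dim=-1)
entropy = -(log_probs.exp() * log_probs).sum()  # low entropy = high confidence
print(f"next-token entropy: {entropy.item():.3f} nats")
```

The paper's finding concerns how networks internally regulate this quantity (via the final LayerNorm scale and token-frequency directions), not merely that it can be read off the logits.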
LLM + Memory + Planning + Tools = Agents 🤖 Last month, Job and I discussed how generative AI is shifting how companies offer customer support. How can we add more layers to our RAG apps to make them more agentic? LLM: Large language model alone Memory: Short-term and long-term… https://t.co/hpUOAmFz9d
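A minimal sketch of that layered recipe, assuming a chat-style completion API: `call_llm` and the single stub tool are hypothetical placeholders, and a production agent would add long-term memory and a real planner on top.

```python
# Minimal agentic loop: LLM + short-term memory + tool use (sketch only).
import json

def call_llm(messages: list[dict]) -> str:
    # Hypothetical placeholder for a chat-completions API call.
    raise NotImplementedError("wire up your LLM client here")

TOOLS = {
    "search_docs": lambda q: f"(stub) top support passages for: {q}",
}

def run_agent(user_query: str, max_steps: int = 5) -> str:
    # Short-term memory: the running conversation transcript.
    memory = [
        {"role": "system", "content": (
            "Answer the user. To call a tool, reply with JSON "
            '{"tool": "<name>", "input": "<string>"}; '
            "otherwise reply with the final answer.")},
        {"role": "user", "content": user_query},
    ]
    reply = ""
    for _ in range(max_steps):       # planning loop: act, observe, repeat
        reply = call_llm(memory)
        memory.append({"role": "assistant", "content": reply})
        try:
            call = json.loads(reply)
            result = TOOLS[call["tool"]](call["input"])
            memory.append({"role": "user", "content": f"Tool result: {result}"})
        except (json.JSONDecodeError, KeyError, TypeError):
            return reply             # plain text is treated as the final answer
    return reply
```

Each extra layer (persistent memory, explicit planning, a richer tool registry) slots into this same loop, which is what makes the recipe compositional.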
I'm really excited to have supervised this paper! Attention Output Sparse Autoencoders Just Work (TM), which is a great contribution But more importantly, we provide some of the best evidence yet that SAEs are a helpful tool for interp research, shedding light on past mysteries https://t.co/mzX5qQJvFb
Sparse Autoencoders help us understand the MLPs of LLMs, but what's up with attention? In our new paper with @NeelNanda5, we introduce Attention Output SAEs to uncover what concepts attention layers learn. Further, we use them to find novel insights previously out-of-reach!🧵 https://t.co/4bBRVY1FCR
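As a rough illustration of what an SAE over attention outputs involves, and not the paper's actual architecture or hyperparameters, the sketch below trains a ReLU autoencoder with an L1 sparsity penalty; the width, L1 coefficient, and random stand-in activations are all assumptions.

```python
# Minimal sparse autoencoder over attention-layer outputs (illustrative).
import torch
import torch.nn as nn

class AttnOutputSAE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))   # sparse feature activations
        return self.dec(feats), feats

d_model, d_hidden, l1_coef = 768, 768 * 8, 1e-3
sae = AttnOutputSAE(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

# Real training would use cached attention outputs from an LLM;
# random data stands in here so the sketch runs end to end.
attn_out = torch.randn(4096, d_model)
for batch in attn_out.split(256):
    recon, feats = sae(batch)
    loss = (recon - batch).pow(2).mean() + l1_coef * feats.abs().sum(-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The learned columns of `dec` then serve as a dictionary of candidate concept directions that the attention layer writes into the residual stream.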
🧠🤖 How do LLMs think? What kind of thought processes can emerge from artificial intelligence? Our latest paper about multi-hop reasoning tasks reveals some new interesting insights. Check out this thread for more details! https://t.co/oiws4IW27e @GoldsteinYAriel @amir_feder https://t.co/WptqtsDS8R
Long context windows seem to work quite well for AIs retrieving information, but what about reasoning? This clever paper has real human readers of recent novels generate complex questions that require reading and understanding the entire book to answer. LLMs do not do very well. https://t.co/9gCPiNBybr
Why do LLMs get lost in the middle❓ 💡LLMs exhibit U-shape positional attention bias that dominates their generation behavior (often using leading/ending contexts in the response) 🚀By modeling and removing such bias, we hugely improve LLMs' RAG performance! 📜: https://t.co/oZAoRz191p https://t.co/6vmdmMPCQd
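One way to eyeball such a positional bias is to aggregate, for each context position, the attention mass that position receives. The sketch below does this for a small HuggingFace causal LM, normalizing for the causal mask so early tokens aren't trivially favored; the checkpoint and input text are illustrative, and this is not the paper's measurement procedure.

```python
# Sketch: per-position attention mass in a causal LM (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained(
    "gpt2", output_attentions=True).eval()

text = "One neutral filler sentence after another. " * 40
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids)

layers = torch.stack(out.attentions)           # (layers, 1, heads, Q, K)
received = layers.mean(dim=(0, 2)).squeeze(0)  # (Q, K), avg over layers/heads
k = received.shape[-1]
# Under the causal mask, position j is visible to (k - j) queries;
# divide by that count so early positions aren't trivially favored.
per_position = received.sum(dim=0) / torch.arange(k, 0, -1)
print(per_position)  # look for high mass at the edges and a dip in the middle
```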
Can #LLMs truly reason over loooong context? 🤔 NoCha asks LLMs to verify claims about *NEW* fictional books 🪄 📚 ⛔ LLMs that solve needle-in-the-haystack (~100%) struggle on NoCha! ⛔ None of 11 tested LLMs reach human performance → 97%. The best, #GPT-4o, gets only 55.8%. https://t.co/beuo7q9KIj