Recent research highlights significant challenges that large language models (LLMs) face when reasoning over long contexts. Despite strong retrieval capabilities, LLMs struggle with compositional reasoning tasks. For instance, the NoCha benchmark, which asks models to verify claims about new fictional books, shows that none of the 11 tested LLMs reaches human performance: the best, GPT-4o, scores only 55.8% against a human baseline of 97%, even though the same models solve needle-in-the-haystack retrieval nearly perfectly. Related work traces part of this weakness to a U-shaped positional attention bias that dominates generation behavior, pushing models to rely on the leading and ending portions of the context. Other evaluations reinforce the gap: on complex questions written by human readers of recent novels, LLMs perform poorly, and on a written version of the 2014 Comprehension Challenge, no open-weight model performs above random chance, despite strong performance on synthetic benchmarks.
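To make the retrieval-vs-reasoning contrast concrete, a needle-in-the-haystack probe buries one fact at varying depths in filler text and checks whether the model can recall it. The sketch below is a minimal, hypothetical version: `query_model`, the filler sentence, and the passphrase are placeholders, not any benchmark's actual setup.

```python
# Minimal needle-in-a-haystack probe: bury one fact at varying depths
# and check recall. `query_model` is a hypothetical stand-in for an LLM API.
NEEDLE = "The secret passphrase is 'cobalt heron'."
FILLER = "This sentence is neutral filler carrying no useful information."

def query_model(prompt: str) -> str:
    raise NotImplementedError("wire up your LLM client here")

def build_haystack(n_sentences: int, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    sentences = [FILLER] * n_sentences
    sentences.insert(int(depth * n_sentences), NEEDLE)
    return " ".join(sentences)

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_haystack(2000, depth) + "\n\nWhat is the secret passphrase?"
    answer = query_model(prompt)
    print(f"depth={depth:.2f} correct={'cobalt heron' in answer.lower()}")
```

Models that ace this kind of retrieval probe can still fail NoCha-style claims, which require composing information across the whole book rather than copying a single fact out.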
Wondering how LLMs do on the Comprehension Challenge I proposed in 2014? New results from an easier (written not visual) version of that test: “no open-weight model performs above random chance (despite their strong performance on synthetic benchmarks)” https://t.co/0YXkbIRWpt
🤔 Do you think LLM reasoning is solved? 🏆 High leaderboard numbers may not tell the whole story! Check out our new paper investigating the robustness of LLMs in reasoning! 🧠 https://t.co/sspCqWtf6c
New paper with students @BarnardCollege on testing orthogonal thinking / abstract reasoning capabilities of Large Language Models using the fascinating yet frustratingly difficult @nytimes Connections game. #NLProc #LLMs #GPT4o #Claude3opus 🧵(1/n) https://t.co/jDfCbpPi2Z
New research shows long-context language models (LCLMs) can match top retrieval systems in real tasks without training but falter in compositional reasoning, underscoring the need for effective prompting: https://t.co/83IyXMTPBN https://t.co/5t955ZmkBQ
New paper w/ @benwu_ml and @NeelNanda5! LLMs don’t just output the next token, they also output confidence. How is this computed? We find two key neuron families: entropy neurons exploit final LN scale to change entropy, and token freq neurons boost logits proportional to freq 🧵 https://t.co/seayNY6d25
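The confidence signal referenced here is typically measured as the entropy of the next-token distribution, H(p) = −Σ p_i log p_i. A minimal sketch of computing it from the logits, using gpt2 purely as an illustrative checkpoint:

```python
# Entropy of the next-token distribution as a confidence signal
# (minimal sketch; the checkpoint and prompt are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0, -1]           # logits for the next token

log_probs = torch.log_softmax(logits, dim=-1)
entropy = -(log_probs.exp() * log_probs).sum()  # low entropy = high confidence
print(f"next-token entropy: {entropy.item():.3f} nats")
```

The paper's finding concerns how networks internally regulate this quantity (via the final LayerNorm scale and token-frequency directions), not merely that it can be read off the logits.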
LLM + Memory + Planning + Tools = Agents 🤖 Last month, Job and I discussed how generative AI is shifting how companies offer customer support. How can we add more layers to our RAG apps to make them more agentic? LLM: Large language model alone Memory: Short-term and long-term… https://t.co/hpUOAmFz9d
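A minimal sketch of that layered recipe, assuming a chat-style completion API: `call_llm` and the single stub tool are hypothetical placeholders, and a production agent would add long-term memory and a real planner on top.

```python
# Minimal agentic loop: LLM + short-term memory + tool use (sketch only).
import json

def call_llm(messages: list[dict]) -> str:
    # Hypothetical placeholder for a chat-completions API call.
    raise NotImplementedError("wire up your LLM client here")

TOOLS = {
    "search_docs": lambda q: f"(stub) top support passages for: {q}",
}

def run_agent(user_query: str, max_steps: int = 5) -> str:
    # Short-term memory: the running conversation transcript.
    memory = [
        {"role": "system", "content": (
            "Answer the user. To call a tool, reply with JSON "
            '{"tool": "<name>", "input": "<string>"}; '
            "otherwise reply with the final answer.")},
        {"role": "user", "content": user_query},
    ]
    reply = ""
    for _ in range(max_steps):       # planning loop: act, observe, repeat
        reply = call_llm(memory)
        memory.append({"role": "assistant", "content": reply})
        try:
            call = json.loads(reply)
            result = TOOLS[call["tool"]](call["input"])
            memory.append({"role": "user", "content": f"Tool result: {result}"})
        except (json.JSONDecodeError, KeyError, TypeError):
            return reply             # plain text is treated as the final answer
    return reply
```

Each extra layer (persistent memory, explicit planning, a richer tool registry) slots into this same loop, which is what makes the recipe compositional.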
I'm really excited to have supervised this paper! Attention Output Sparse Autoencoders Just Work (TM), which is a great contribution But more importantly, we provide some of the best evidence yet that SAEs are a helpful tool for interp research, shedding light on past mysteries https://t.co/mzX5qQJvFb
Sparse Autoencoders help us understand the MLPs of LLMs, but what's up with attention? In our new paper with @NeelNanda5, we introduce Attention Output SAEs to uncover what concepts attention layers learn. Further, we use them to find novel insights previously out-of-reach!🧵 https://t.co/4bBRVY1FCR
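As a rough illustration of what an SAE over attention outputs involves, and not the paper's actual architecture or hyperparameters, the sketch below trains a ReLU autoencoder with an L1 sparsity penalty; the width, L1 coefficient, and random stand-in activations are all assumptions.

```python
# Minimal sparse autoencoder over attention-layer outputs (illustrative).
import torch
import torch.nn as nn

class AttnOutputSAE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))   # sparse feature activations
        return self.dec(feats), feats

d_model, d_hidden, l1_coef = 768, 768 * 8, 1e-3
sae = AttnOutputSAE(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

# Real training would use cached attention outputs from an LLM;
# random data stands in here so the sketch runs end to end.
attn_out = torch.randn(4096, d_model)
for batch in attn_out.split(256):
    recon, feats = sae(batch)
    loss = (recon - batch).pow(2).mean() + l1_coef * feats.abs().sum(-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The learned columns of `dec` then serve as a dictionary of candidate concept directions that the attention layer writes into the residual stream.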
🧠🤖 How do LLMs think? What kind of thought processes can emerge from artificial intelligence? Our latest paper about multi-hop reasoning tasks reveals some new interesting insights. Check out this thread for more details! https://t.co/oiws4IW27e @GoldsteinYAriel @amir_feder https://t.co/WptqtsDS8R
Long context windows seem to work quite well for AIs retrieving information, but what about reasoning? This clever paper has real human readers of recent novels generate complex questions that require reading and understanding the entire book to answer. LLMs do not do very well. https://t.co/9gCPiNBybr
Why do LLMs get lost in the middle❓ 💡LLMs exhibit U-shape positional attention bias that dominates their generation behavior (often using leading/ending contexts in the response) 🚀By modeling and removing such bias, we hugely improve LLMs' RAG performance! 📜: https://t.co/oZAoRz191p https://t.co/6vmdmMPCQd
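One way to eyeball such a positional bias is to aggregate, for each context position, the attention mass that position receives. The sketch below does this for a small HuggingFace causal LM, normalizing for the causal mask so early tokens aren't trivially favored; the checkpoint and input text are illustrative, and this is not the paper's measurement procedure.

```python
# Sketch: per-position attention mass in a causal LM (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained(
    "gpt2", output_attentions=True).eval()

text = "One neutral filler sentence after another. " * 40
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids)

layers = torch.stack(out.attentions)           # (layers, 1, heads, Q, K)
received = layers.mean(dim=(0, 2)).squeeze(0)  # (Q, K), avg over layers/heads
k = received.shape[-1]
# Under the causal mask, position j is visible to (k - j) queries;
# divide by that count so early positions aren't trivially favored.
per_position = received.sum(dim=0) / torch.arange(k, 0, -1)
print(per_position)  # look for high mass at the edges and a dip in the middle
```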
Can #LLMs truly reason over loooong context? 🤔 NoCha asks LLMs to verify claims about *NEW* fictional books 🪄 📚 ⛔ LLMs that solve needle-in-the-haystack (~100%) struggle on NoCha! ⛔ None of 11 tested LLMs reach human performance → 97%. The best, #GPT-4o, gets only 55.8%. https://t.co/beuo7q9KIj