Recent advances in large language models (LLMs) have shown mixed results on mathematical reasoning tasks. A new benchmark reveals that while LLMs excel at straightforward math problems, they struggle with creative and multi-step questions, even with chain-of-thought (CoT) prompting. Google's paper 'Improve Mathematical Reasoning in Language Models by Automated Process Supervision' employs Monte Carlo Tree Search (MCTS) to efficiently collect high-quality process supervision data, boosting performance from 51% to 69.4% on the MATH benchmark without human intervention. Additionally, researchers from Fudan University and The Hong Kong Polytechnic University introduced a method for reaching GPT-4 level Mathematical Olympiad solutions using Monte Carlo Tree Self-refine with LLaMa-3 8B, significantly improving success rates on MATH and Olympiad-level benchmarks.
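The automated process-supervision idea can be illustrated with a toy sketch: for each prefix of a chain-of-thought solution, sample rollouts to completion and use the empirical success rate as a per-step label, so no human annotation is needed. This is a simplified Monte Carlo estimate, not the exact tree-search procedure from the Google paper; `complete` is a hypothetical stub standing in for an LLM sampler.

```python
import random

def complete(prefix_steps, rng):
    """Toy stand-in for sampling the rest of a solution from an LLM.
    Rollouts from a prefix that already contains a wrong step are
    likely to end with a wrong final answer."""
    if any(s.endswith("(wrong)") for s in prefix_steps):
        return "42" if rng.random() < 0.1 else "0"
    return "42" if rng.random() < 0.9 else "0"

def step_scores(solution_steps, gold_answer, n_rollouts=200, seed=0):
    """For each prefix of the solution, estimate P(final answer correct)
    by Monte Carlo rollouts; a sharp drop in the score flags the first
    bad step, yielding an automatic process-supervision label."""
    rng = random.Random(seed)
    scores = []
    for k in range(1, len(solution_steps) + 1):
        prefix = solution_steps[:k]
        hits = sum(complete(prefix, rng) == gold_answer
                   for _ in range(n_rollouts))
        scores.append(hits / n_rollouts)
    return scores

steps = ["expand the product", "collect terms (wrong)", "solve for x"]
scores = step_scores(steps, gold_answer="42")
```

In this sketch the score drops sharply at the prefix containing the wrong step, which is the signal a process reward model would be trained on.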
[AI] Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B D Zhang, J Li, X Huang, D Zhou, Y Li, W Ouyang [Fudan University & The Hong Kong Polytechnic University] (2024) https://t.co/G9W0Tc1sKf - Introduces MCT Self-Refine… https://t.co/BAnKfj2a8P
Improve Mathematical Reasoning in Language Models by Automated Process Supervision Complex multi-step reasoning tasks, such as solving mathematical problems or generating code, remain a significant hurdle for even the most advanced large language models (LLMs). https://t.co/jM32yAtrdB
Google presents Improve Mathematical Reasoning in Language Models by Automated Process Supervision - MCTS for the efficient collection of high-quality process supervision data - 51% -> 69.4% on MATH - No human intervention https://t.co/1Kh8rVyTat https://t.co/NCFbUiLrli
Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B Significantly improves success rates across MATH and Olympiad-level benchmarks https://t.co/lV6J9Cz2rb https://t.co/QvpUhnDAOg
LLMs excel in math. Introducing a new benchmark, we observe: They struggle with creative and many-step questions (even with CoT), their performance varies widely even on similar topics, and they engage in genuine reasoning only in about half of cases. 1/n https://t.co/nC1BiBQTLZ https://t.co/nMtn5CRXBe