The ARC-AGI benchmark, which carries a $1 million prize, has drawn significant attention recently as a problem that large language models (LLMs) struggle to solve. Ryan Greenblatt, using GPT-4o, achieved 71% accuracy on a set of examples where humans typically achieve 85%, a state-of-the-art (SOTA) result. His approach uses a carefully-crafted few-shot prompt to sample many candidate Python programs (~5k guesses) that implement each transformation, selects the best candidates by running them against the provided examples, and applies a debugging step. The widely shared "50% on ARC-AGI with GPT-4o" result illustrates the same point: ARC-AGI yields to a bunch of very clever tricks around existing models plus more search compute. Some commentators argue that solving ARC-AGI would not amount to artificial general intelligence (AGI), but recognize it as a valuable challenge that targets an area where LLMs tend to be weak: cell-based rules like Conway's Game of Life.
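The generate-and-select loop described above can be sketched roughly as follows. This is a hedged illustration, not Greenblatt's actual code: the hand-written `candidate_sources` stand in for the ~5k LLM-sampled programs, and the toy grids stand in for a real ARC task.

```python
from collections import Counter

# A toy ARC-style task: grids are lists of lists of ints (colors).
train_examples = [
    ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),   # input -> expected output
    ([[0, 2], [0, 0]], [[2, 0], [0, 0]]),
]
test_input = [[0, 0], [3, 0]]

# Stand-ins for the many LLM-sampled candidate programs.
candidate_sources = [
    "def transform(g):\n    return [row[::-1] for row in g]",     # mirror each row
    "def transform(g):\n    return [row[:] for row in g[::-1]]",  # flip vertically
    "def transform(g):\n    return [row[:] for row in g]",        # identity
]

def run_candidate(src, grid):
    """Compile one candidate program and apply it; None on any failure."""
    ns = {}
    try:
        exec(src, ns)
        return ns["transform"](grid)
    except Exception:
        return None

# Selection: keep only candidates that reproduce every training example.
survivors = [
    src for src in candidate_sources
    if all(run_candidate(src, x) == y for x, y in train_examples)
]

# Vote among the surviving programs' outputs on the test input.
votes = Counter(str(run_candidate(src, test_input)) for src in survivors)
best_output = votes.most_common(1)[0][0]
print(best_output)  # -> [[0, 0], [0, 3]] (only the row-mirror survives)
```

The real system adds the pieces this sketch omits: a rich few-shot prompt to make the samples plausible, and a debugging round in which failing programs are fed back to the model for revision.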
Progress on $1M ARC-AGI benchmark that is very hard for LLMs by carefully-crafted few-shot prompt to generate many possible Python programs to implement the transformations, generating ~5k guesses, selecting the best ones using the examples, and a debugging step. https://t.co/jCfuY1fsps
50% on ARC-AGI with GPT-4o This wonderful blog post brings out another point that I didn't explicitly mention in my blog -- ARC-AGI gets solved with a bunch of very clever tricks around existing models, and more search compute. https://t.co/YvoT4PC3yz https://t.co/CeXqixsbSF
The solution to ARC-AGI will not be considered remotely close to AGI. Going thru samples, this strikes me as a very narrow intelligence problem. But it's a very cool challenge and uses an area LLMs in particular tend to be weak at: cell-based rules (like Game of Life).
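For readers unfamiliar with the cell-based rules the tweet refers to, here is a minimal Conway's Game of Life step. This is an illustrative sketch of the rule class, not anything from the thread: each cell lives or dies based purely on its neighbor count, the kind of local grid rule LLMs reportedly handle poorly.

```python
from collections import Counter

def life_step(live):
    """One Game of Life generation; `live` is a set of (row, col) cells."""
    # Count how many live neighbors each grid cell has.
    neighbor_counts = Counter(
        (r + dr, c + dc)
        for r, c in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # A cell is alive next step if it has 3 neighbors,
    # or 2 neighbors and was already alive.
    return {
        cell for cell, n in neighbor_counts.items()
        if n == 3 or (n == 2 and cell in live)
    }

# A "blinker" oscillates between a horizontal and a vertical bar.
blinker = {(1, 0), (1, 1), (1, 2)}
print(sorted(life_step(blinker)))  # -> [(0, 1), (1, 1), (2, 1)]
```

Rules like this are trivial to state yet force precise, cell-by-cell spatial reasoning, which is why grid-transformation benchmarks probe a genuine weak spot.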
ARC-AGI’s been hyped over the last week as a benchmark that LLMs can’t solve. This claim triggered my dear coworker Ryan Greenblatt so he spent the last week trying to solve it with LLMs. Ryan gets 71% accuracy on a set of examples where humans get 85%; this is SOTA. https://t.co/tqrzcMz9qD