ARC-AGI-2: The Next Challenge for AI Reasoning

What is ARC-AGI?

Imagine giving a computer a puzzle that any 8-year-old could solve, but watching it struggle despite having access to the entire internet’s worth of knowledge. That’s the essence of ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) – a benchmark designed to test whether AI can truly think, not just memorize.

Created by François Chollet in 2019, ARC-AGI presents AI systems with simple visual puzzles: grids with colored cells that transform according to hidden rules. The AI must figure out the pattern from just a few examples and apply it to new cases. No language needed, no specialized knowledge required – just pure reasoning ability.
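
To make the task format concrete, here is a minimal sketch in Python. The toy task and the mirror rule are invented for illustration, but the structure – a list of “train” demonstration pairs plus held-out “test” inputs, with grids encoded as lists of lists of color codes 0-9 – follows the layout of the public ARC task files.

```python
# A toy ARC-style task: demonstration ("train") pairs plus a "test" input
# whose output the solver must produce. Grids are lists of lists of
# integers 0-9, each integer encoding a color.
toy_task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]},
    ],
    "test": [{"input": [[3, 0], [0, 0]]}],
}

def mirror_horizontally(grid):
    """Candidate rule: flip each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# Check the candidate rule against every demonstration pair...
assert all(
    mirror_horizontally(pair["input"]) == pair["output"]
    for pair in toy_task["train"]
)

# ...then apply it to the held-out test input.
print(mirror_horizontally(toy_task["test"][0]["input"]))  # [[0, 3], [0, 0]]
```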

Why ARC-AGI-2 Was Needed

After five years and multiple competitions, ARC-AGI-1 was approaching its limits. In late 2024, OpenAI’s o3 model achieved scores between 76% and 88% (depending on compute usage), marking the first time an AI system approached human-level performance on the benchmark. But this success came with a catch – the highest score required an estimated $20,000 worth of compute per task!

The original benchmark also had some weaknesses:

  • Nearly half the tasks could be “brute-forced” by trying millions of possible solutions (see the sketch after this list)
  • The same 100 hidden test tasks had been reused across all competitions, risking overfitting
  • It couldn’t differentiate well between varying levels of human intelligence
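
To see why this was possible, here is a rough sketch of the brute-force idea. The four primitives and the search loop below are simplifications invented for illustration; actual competition solvers search hand-built DSLs with far more operations, but the principle – enumerate candidate programs until one reproduces every demonstration pair – is the same.

```python
from itertools import product

# Hypothetical primitive grid operations (a real DSL would have many more).
def identity(g):        return g
def flip_horizontal(g): return [row[::-1] for row in g]
def flip_vertical(g):   return g[::-1]
def rotate_90(g):       return [list(row) for row in zip(*g[::-1])]

PRIMITIVES = [identity, flip_horizontal, flip_vertical, rotate_90]

def brute_force(train_pairs, max_depth=3):
    """Try every composition of primitives up to max_depth and return the
    first program that reproduces all demonstration outputs."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            def run(grid, program=program):
                for op in program:
                    grid = op(grid)
                return grid
            if all(run(p["input"]) == p["output"] for p in train_pairs):
                return run
    return None  # nothing found within the search budget

# Example: the hidden rule is a 180-degree rotation.
train = [{"input": [[1, 2], [3, 4]], "output": [[4, 3], [2, 1]]}]
solver = brute_force(train)
print(solver([[5, 6], [7, 8]]))  # [[8, 7], [6, 5]]
```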

What Makes ARC-AGI-2 Different

ARC-AGI-2, released in May 2025, preserves what made the original special while addressing its limitations. Every task is entirely novel and designed to resist brute-force approaches. The new benchmark introduces four key challenges:

Multi-rule reasoning: Tasks require combining multiple transformation rules simultaneously, like cropping, rescaling, and pattern-matching all at once.

Sequential dependencies: Solutions must be built step-by-step, where each step depends on the previous one – you can’t skip ahead.

Contextual clues: The same basic operation might work differently based on subtle contextual hints in the grid.

Symbol interpretation: Some tasks use abstract symbols whose meaning must be deduced from the examples.
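
As a rough illustration of what “combining rules step by step” can look like, here is a small Python sketch. The specific pipeline – crop to the shape, scale it up, then recolor it using a clue read from the input – is invented for this post rather than taken from an actual ARC-AGI-2 task, but it shows how each step consumes the previous step’s output and how a contextual hint changes the final result.

```python
def crop_to_content(grid):
    """Step 1: crop to the bounding box of the non-zero cells."""
    rows = [i for i, row in enumerate(grid) if any(row)]
    cols = [j for j in range(len(grid[0])) if any(row[j] for row in grid)]
    return [row[cols[0]:cols[-1] + 1] for row in grid[rows[0]:rows[-1] + 1]]

def rescale(grid, factor):
    """Step 2: scale the grid up by an integer factor."""
    return [[cell for cell in row for _ in range(factor)]
            for row in grid for _ in range(factor)]

def recolor(grid, color):
    """Step 3: repaint every filled cell with the given color."""
    return [[color if cell else 0 for cell in row] for row in grid]

puzzle = [
    [7, 0, 0, 0],   # corner cell is a contextual clue, not part of the shape
    [0, 0, 3, 3],
    [0, 0, 3, 0],
    [0, 0, 0, 0],
]
context_color = puzzle[0][0]           # read the clue: color 7
body = [row[:] for row in puzzle]
body[0][0] = 0                         # remove the clue before cropping

step1 = crop_to_content(body)          # [[3, 3], [3, 0]]
step2 = rescale(step1, 2)              # blown up to 4x4
step3 = recolor(step2, context_color)  # repainted with the clue color
print(step3)
```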

Example of contextual rule application: ARC-AGI-2 Public Eval Task #b5ca7ac4. Solve this task!

The Human Baseline

Before releasing ARC-AGI-2, researchers conducted extensive testing with over 400 human participants from diverse backgrounds. The results were encouraging: every single task was solved by at least two humans within two attempts. On average, participants solved 66% of the tasks they attempted, spending about 2.7 minutes per task.

Interestingly, performance didn’t correlate with technical expertise – programmers and mathematicians performed no better than teachers or healthcare workers. This confirms that ARC-AGI-2 tests general intelligence, not specialized knowledge.

The Current State of AI

The contrast between human and AI performance on ARC-AGI-2 is stark. While humans average 66% success, the best AI systems achieve only single-digit percentages. Pure language models score 0%, and even advanced reasoning systems like o3 and Claude struggle to reach 3-4%.

The Leaderboard Today

Since the paper’s release, the ARC-AGI leaderboard has seen notable progress. While the paper reported top scores of only 3%, we now have the first double-digit result: Grok 4 (Thinking) scores 16.2% on ARC-AGI-2 at $2.17 per task. Other systems such as Claude Opus 4 (8.6%) and various OpenAI models (6-6.5%) follow behind, still far from the 100% human baseline at $17 per task.

The leaderboard now emphasizes the efficiency-performance trade-off, recognizing that true intelligence means solving problems economically, not just accurately. The $1 million ARC Prize 2025 continues to drive progress, with $700,000 awaiting any team reaching 85% – a goal that still seems distant but increasingly achievable as the field advances.
