ARC-AGI-2: The Next Challenge for AI Reasoning

What is ARC-AGI?

Imagine giving a computer a puzzle that any 8-year-old could solve, but watching it struggle despite having access to the entire internet’s worth of knowledge. That’s the essence of ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) – a benchmark designed to test whether AI can truly think, not just memorize.

Created by François Chollet in 2019, ARC-AGI presents AI systems with simple visual puzzles: grids with colored cells that transform according to hidden rules. The AI must figure out the pattern from just a few examples and apply it to new cases. No language needed, no specialized knowledge required – just pure reasoning ability.
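
To make the task format concrete, here is a minimal sketch in Python. The toy task and the mirror rule are invented for illustration, but the structure – a list of “train” demonstration pairs plus held-out “test” inputs, with grids encoded as lists of lists of color codes 0-9 – follows the layout of the public ARC task files.

```python
# A toy ARC-style task: demonstration ("train") pairs plus a "test" input
# whose output the solver must produce. Grids are lists of lists of
# integers 0-9, each integer encoding a color.
toy_task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]},
    ],
    "test": [{"input": [[3, 0], [0, 0]]}],
}

def mirror_horizontally(grid):
    """Candidate rule: flip each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# Check the candidate rule against every demonstration pair...
assert all(
    mirror_horizontally(pair["input"]) == pair["output"]
    for pair in toy_task["train"]
)

# ...then apply it to the held-out test input.
print(mirror_horizontally(toy_task["test"][0]["input"]))  # [[0, 3], [0, 0]]
```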

Why ARC-AGI-2 Was Needed

After five years and multiple competitions, ARC-AGI-1 was approaching its limits. In late 2024, OpenAI’s o3 model achieved scores between 76% and 88% (depending on compute usage), marking the first time an AI system approached human-level performance on the benchmark. But this success came with a catch – the highest score required an estimated $20,000 worth of compute per task!

The original benchmark also had some weaknesses:

  • Nearly half the tasks could be “brute-forced” by trying millions of possible solutions (see the sketch after this list)
  • The same 100 hidden test tasks had been reused across all competitions, risking overfitting
  • It couldn’t differentiate well between varying levels of human intelligence
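
To see why this was possible, here is a rough sketch of the brute-force idea. The four primitives and the search loop below are simplifications invented for illustration; actual competition solvers search hand-built DSLs with far more operations, but the principle – enumerate candidate programs until one reproduces every demonstration pair – is the same.

```python
from itertools import product

# Hypothetical primitive grid operations (a real DSL would have many more).
def identity(g):        return g
def flip_horizontal(g): return [row[::-1] for row in g]
def flip_vertical(g):   return g[::-1]
def rotate_90(g):       return [list(row) for row in zip(*g[::-1])]

PRIMITIVES = [identity, flip_horizontal, flip_vertical, rotate_90]

def brute_force(train_pairs, max_depth=3):
    """Try every composition of primitives up to max_depth and return the
    first program that reproduces all demonstration outputs."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            def run(grid, program=program):
                for op in program:
                    grid = op(grid)
                return grid
            if all(run(p["input"]) == p["output"] for p in train_pairs):
                return run
    return None  # nothing found within the search budget

# Example: the hidden rule is a 180-degree rotation.
train = [{"input": [[1, 2], [3, 4]], "output": [[4, 3], [2, 1]]}]
solver = brute_force(train)
print(solver([[5, 6], [7, 8]]))  # [[8, 7], [6, 5]]
```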

What Makes ARC-AGI-2 Different

ARC-AGI-2, released in May 2025, preserves what made the original special while addressing its limitations. Every task is entirely novel and designed to resist brute-force approaches. The new benchmark introduces four key challenges:

Multi-rule reasoning: Tasks require combining multiple transformation rules simultaneously, like cropping, rescaling, and pattern-matching all at once.

Sequential dependencies: Solutions must be built step-by-step, where each step depends on the previous one – you can’t skip ahead.

Contextual clues: The same basic operation might work differently based on subtle contextual hints in the grid.

Symbol interpretation: Some tasks use abstract symbols whose meaning must be deduced from the examples.
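
As a rough illustration of what “combining rules step by step” can look like, here is a small Python sketch. The specific pipeline – crop to the shape, scale it up, then recolor it using a clue read from the input – is invented for this post rather than taken from an actual ARC-AGI-2 task, but it shows how each step consumes the previous step’s output and how a contextual hint changes the final result.

```python
def crop_to_content(grid):
    """Step 1: crop to the bounding box of the non-zero cells."""
    rows = [i for i, row in enumerate(grid) if any(row)]
    cols = [j for j in range(len(grid[0])) if any(row[j] for row in grid)]
    return [row[cols[0]:cols[-1] + 1] for row in grid[rows[0]:rows[-1] + 1]]

def rescale(grid, factor):
    """Step 2: scale the grid up by an integer factor."""
    return [[cell for cell in row for _ in range(factor)]
            for row in grid for _ in range(factor)]

def recolor(grid, color):
    """Step 3: repaint every filled cell with the given color."""
    return [[color if cell else 0 for cell in row] for row in grid]

puzzle = [
    [7, 0, 0, 0],   # corner cell is a contextual clue, not part of the shape
    [0, 0, 3, 3],
    [0, 0, 3, 0],
    [0, 0, 0, 0],
]
context_color = puzzle[0][0]           # read the clue: color 7
body = [row[:] for row in puzzle]
body[0][0] = 0                         # remove the clue before cropping

step1 = crop_to_content(body)          # [[3, 3], [3, 0]]
step2 = rescale(step1, 2)              # blown up to 4x4
step3 = recolor(step2, context_color)  # repainted with the clue color
print(step3)
```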

Example of contextual rule application: ARC-AGI-2 Public Eval Task #b5ca7ac4. Solve this task!

The Human Baseline

Before releasing ARC-AGI-2, researchers conducted extensive testing with over 400 human participants from diverse backgrounds. The results were encouraging: every single task was solved by at least two humans within two attempts. On average, participants solved 66% of the tasks they attempted, spending about 2.7 minutes per task.

Interestingly, performance didn’t correlate with technical expertise – programmers and mathematicians performed no better than teachers or healthcare workers. This confirms that ARC-AGI-2 tests general intelligence, not specialized knowledge.

The Current State of AI

The contrast between human and AI performance on ARC-AGI-2 is stark. While humans average 66% success, the best AI systems achieve only single-digit percentages. Pure language models score 0%, and even advanced reasoning systems like o3 and Claude struggle to reach 3-4%.

The Leaderboard Today

Since the paper’s release, the ARC-AGI leaderboard has seen notable progress. While the paper reported top scores of only 3%, we now have the first double-digit result: Grok 4 (Thinking) scores 16.2% on ARC-AGI-2 at $2.17 per task. Other systems such as Claude Opus 4 (8.6%) and various OpenAI models (6-6.5%) follow behind, still far from the 100% human baseline at $17 per task.

The leaderboard now emphasizes the efficiency-performance trade-off, recognizing that true intelligence means solving problems economically, not just accurately. The $1 million ARC Prize 2025 continues to drive progress, with $700,000 awaiting any team reaching 85% – a goal that still seems distant but increasingly achievable as the field advances.
