What is ARC-AGI?
Imagine giving a computer a puzzle that any 8-year-old could solve, but watching it struggle despite having access to the entire internet’s worth of knowledge. That’s the essence of ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) – a benchmark designed to test whether AI can truly think, not just memorize.
Created by François Chollet in 2019, ARC-AGI presents AI systems with simple visual puzzles: grids with colored cells that transform according to hidden rules. The AI must figure out the pattern from just a few examples and apply it to new cases. No language needed, no specialized knowledge required – just pure reasoning ability.
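To make the format concrete, here is a toy task in the same shape as the public ARC JSON files: a dict with "train" and "test" lists of input/output pairs, where each grid is a list of rows of integers 0–9 (one integer per colored cell). The hidden rule below is deliberately trivial, purely for illustration:

```python
# A toy ARC-style task. The hidden rule here: flip the grid vertically.
task = {
    "train": [
        {"input": [[1, 0], [0, 2]], "output": [[0, 2], [1, 0]]},
        {"input": [[3, 3], [0, 4]], "output": [[0, 4], [3, 3]]},
    ],
    "test": [
        {"input": [[5, 0], [0, 6]]},
    ],
}

def flip_vertical(grid):
    """Candidate rule: reverse the order of the rows."""
    return grid[::-1]

# A solver must infer the rule from the train pairs alone...
assert all(flip_vertical(p["input"]) == p["output"] for p in task["train"])

# ...and then apply it to the unseen test input.
prediction = flip_vertical(task["test"][0]["input"])
print(prediction)  # [[0, 6], [5, 0]]
```

Real tasks hide far subtler rules, but the interface is exactly this spare: a few demonstration pairs, then one or more test inputs to complete.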
Why ARC-AGI-2 Was Needed
After five years and multiple competitions, ARC-AGI-1 was approaching its limits. In late 2024, OpenAI's o3 model scored between 76% and 88% (depending on compute budget), marking the first time an AI system approached human-level performance on the benchmark. But this success came with a catch: the highest score required an estimated $20,000 worth of compute per task!
The original benchmark also had some weaknesses:
- Nearly half the tasks could be “brute-forced” by trying millions of possible solutions
- The same 100 hidden test tasks had been reused across all competitions, risking overfitting
- It couldn’t reliably differentiate between varying levels of human ability
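To see why brute force worked on ARC-AGI-1, consider this hypothetical miniature: define a small vocabulary of grid primitives and exhaustively try every composition until one fits all the training pairs. Real brute-force solvers used hundreds of hand-crafted operations and searched much deeper, but the principle is the same:

```python
from itertools import product

# Hypothetical mini-DSL of grid primitives (a tiny stand-in for the
# large operation libraries real brute-force solvers used).
def identity(g):   return g
def flip_rows(g):  return g[::-1]
def flip_cols(g):  return [row[::-1] for row in g]
def transpose(g):  return [list(r) for r in zip(*g)]

PRIMITIVES = [identity, flip_rows, flip_cols, transpose]

def brute_force(train_pairs, depth=2):
    """Try every composition of up to `depth` primitives; return the
    first program that maps every training input to its output."""
    for n in range(1, depth + 1):
        for ops in product(PRIMITIVES, repeat=n):
            def program(g, ops=ops):
                for op in ops:
                    g = op(g)
                return g
            if all(program(p["input"]) == p["output"] for p in train_pairs):
                return program
    return None

# The hidden rule here is "rotate 180°", reachable as flip_cols ∘ flip_rows.
train = [{"input": [[1, 2], [3, 0]], "output": [[0, 3], [2, 1]]}]
solver = brute_force(train)
print(solver([[5, 6], [7, 8]]))  # [[8, 7], [6, 5]]
```

No understanding is involved: the search simply stumbles onto a matching program. ARC-AGI-2 tasks are designed so that the space of candidate programs is too large and too context-dependent for this shortcut to pay off.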
What Makes ARC-AGI-2 Different
ARC-AGI-2, released in May 2025, preserves what made the original special while addressing its limitations. Every task is entirely novel and designed to resist brute-force approaches. The new benchmark introduces four key challenges:
Multi-rule reasoning: Tasks require combining multiple transformation rules simultaneously, like cropping, rescaling, and pattern-matching all at once.
Sequential dependencies: Solutions must be built step-by-step, where each step depends on the previous one – you can’t skip ahead.
Contextual clues: The same basic operation might work differently based on subtle contextual hints in the grid.
Symbol interpretation: Some tasks use abstract symbols whose meaning must be deduced from the examples.

Example of contextual rule application, ARC-AGI-2 Public Eval Task #b5ca7ac4. Solve this task!
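The first two challenges above can be sketched in miniature. In this hypothetical composition, a grid is first cropped to the bounding box of its non-zero cells, then upscaled; the second step operates on the first step's result, a (vastly simplified) stand-in for the sequential dependencies real ARC-AGI-2 tasks demand:

```python
# Toy two-rule pipeline: crop to content, then upscale the crop.
# The upscale step depends on the crop's output -- a sequential
# dependency in miniature (real tasks are far more involved).

def crop_to_content(grid):
    """Crop the grid to the bounding box of its non-zero cells."""
    rows = [i for i, row in enumerate(grid) if any(row)]
    cols = [j for j in range(len(grid[0])) if any(row[j] for row in grid)]
    return [row[cols[0]:cols[-1] + 1] for row in grid[rows[0]:rows[-1] + 1]]

def upscale(grid, k):
    """Repeat each cell k times horizontally and each row k times vertically."""
    return [[c for c in row for _ in range(k)] for row in grid for _ in range(k)]

grid = [
    [0, 0, 0, 0],
    [0, 7, 7, 0],
    [0, 0, 7, 0],
    [0, 0, 0, 0],
]
cropped = crop_to_content(grid)  # [[7, 7], [0, 7]]
print(upscale(cropped, 2))
# [[7, 7, 7, 7], [7, 7, 7, 7], [0, 0, 7, 7], [0, 0, 7, 7]]
```

An actual ARC-AGI-2 task would additionally make the choice of rules, their order, and their parameters depend on contextual clues in the grid itself.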
The Human Baseline
Before releasing ARC-AGI-2, researchers conducted extensive testing with over 400 human participants from diverse backgrounds. The results were encouraging: every single task was solved by at least two humans within two attempts, with participants solving an average of 66% of tasks they attempted in about 2.7 minutes per task.
Interestingly, performance didn’t correlate with technical expertise – programmers and mathematicians performed no better than teachers or healthcare workers. This confirms that ARC-AGI-2 tests general intelligence, not specialized knowledge.
The Current State of AI
The contrast between human and AI performance on ARC-AGI-2 is stark. While humans average 66% success, the best AI systems achieve only single-digit percentages. Pure language models score 0%, and even advanced reasoning systems like o3 and Claude struggle to reach 3-4%.
The Leaderboard Today
Since the paper’s release, the ARC-AGI leaderboard has seen notable progress. While the paper reported top scores of only 3%, we now see the first double-digit result: Grok 4 (Thinking) achieving 16.2% on ARC-AGI-2 at $2.17 per task. Other systems like Claude Opus 4 (8.6%) and various OpenAI models (6–6.5%) follow behind, still far from the human panel baseline of 100% at $17 per task.
The leaderboard now emphasizes the efficiency-performance trade-off, recognizing that true intelligence means solving problems economically, not just accurately. The $1 million ARC Prize 2025 continues to drive progress, with $700,000 awaiting any team reaching 85% – a goal that still seems distant but increasingly achievable as the field advances.

Nicolas Jamet has been a Senior Fund Manager on RAM AI’s Systematic Equity team since 2016. He co-manages the systematic equity funds, working on the development of low-frequency systematic investment strategies. He is involved in enhancing AI infrastructure, incorporates sustainability into the fund range, and contributes as a member of the Responsible Investment Committee.