ARC-AGI-3: Every Top AI Model Scored Under 1%
By Kit 小克 | AI Tool Observer | 2026-03-30
🇹🇼 ARC-AGI-3 Launches: Every Top AI Model Scored Under 1%
ARC-AGI-3 officially launched on March 25, 2026, and the brand-new interactive benchmark immediately stunned the AI community: top models including Gemini 3.1 Pro, GPT-5.4, and Claude Opus 4.6 all scored under 1%. That is not just a number; it is the most direct public test yet of the limits of current AI capability.
What Is ARC-AGI-3?
The ARC-AGI (Abstraction and Reasoning Corpus) series was created by Francois Chollet to test whether machines possess genuine general intelligence, rather than rote memorization or overfitting to particular problem types.
The biggest difference between the third generation and its predecessors is that it is fully interactive. The AI is not solving fixed puzzles; it must explore a game environment with no instructions, learn the rules, set its own goals, and carry out long-horizon tasks. The four core capabilities under test are listed below; a minimal agent-loop sketch follows the list:
- Exploration: actively probing an unknown environment
- World Modeling: understanding the environment's rules and forming a mental model of them
- Goal-setting: autonomously defining and pursuing meaningful objectives
- Long-horizon Planning: maintaining coherent action across many steps
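To make those four capabilities concrete, here is a minimal sketch, in Python, of the explore → model → plan loop an interactive benchmark demands. Everything here is invented for illustration: the `ToyEnvironment`, its `observe`/`act` interface, and the hidden counter rule are hypothetical stand-ins, not the real ARC-AGI-3 API.

```python
# Minimal sketch of an explore -> model -> plan loop.
# ToyEnvironment and its interface are hypothetical, NOT the ARC-AGI-3 API.

class ToyEnvironment:
    """A hidden counter the agent must drive to an unknown target value."""

    def __init__(self) -> None:
        self._state = 0
        self._target = 5  # never revealed to the agent

    def observe(self) -> int:
        # The agent only sees a distance-like signal, not the target itself.
        return abs(self._target - self._state)

    def act(self, action: str) -> int:
        # Hidden rule to discover: "a" increments the counter, "b" decrements.
        self._state += 1 if action == "a" else -1
        return self.observe()

    def solved(self) -> bool:
        return self._state == self._target


def run_agent(env: ToyEnvironment, budget: int = 50) -> bool:
    """Explore, model the effect of each action, keep whichever one helps."""
    action = "a"                # exploration: start with an arbitrary probe
    signal = env.observe()
    for _ in range(budget):     # long-horizon planning: stay coherent over steps
        if env.solved():
            return True
        new_signal = env.act(action)
        if new_signal > signal:                     # world modeling: this action
            action = "b" if action == "a" else "a"  # moves us away, so switch
        signal = new_signal
    return env.solved()


if __name__ == "__main__":
    print("solved:", run_agent(ToyEnvironment()))  # prints: solved: True
```

In the real benchmark the goal itself must be inferred rather than read off a `solved()` flag, which is exactly why goal-setting is listed as a separate capability.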
Model Scores: Sobering Numbers
Here are the ARC-AGI-3 scores known so far:
- Google Gemini 3.1 Pro Preview: 0.37% (current top score)
- OpenAI GPT-5.4 (High): 0.26%
- Anthropic Claude Opus 4.6 (Max): 0.25%
- xAI Grok-4.20 (Reasoning Beta): 0.00%
- Human baseline: 100%
The gap is not a narrow one; the scores are effectively zero. The result left more than a few developers in the Hacker News discussion lamenting: "What we have been measuring all along was never intelligence."
Why Is This Time Different?
Past AI benchmarks have typically been "solved" within a few years: model training data creeps ever closer to the test set, and scores soar accordingly. ARC-AGI-3 tries to break that cycle at the root:
- No goals or rules are given in advance; the AI must discover them itself
- Environments are dynamic, and what is learned at each difficulty level must carry over to the next
- Scores cannot be earned by memorizing training data
This is exactly what Chollet has long emphasized as "true generalization": not memorization, but deriving new rules from a small amount of experience.
Prizes and Competition Timeline
ARC Prize 2026 offers more than $2 million in total prizes, and the ARC-AGI-3 track carries a $700,000 grand prize for the first entrant to reach 100%. Milestone deadlines fall on June 30 and September 30, 2026, with final results announced on December 4.
What Does This Mean for Developers?
There is no need to over-interpret the ARC-AGI-3 results and conclude that AI has no future. But they are a genuine reminder that today's large language models are, at heart, pattern matchers, and remain severely lacking on tasks that demand real exploration and autonomous reasoning. If you are building AI agents, this benchmark is worth watching: it defines the next threshold for what "genuinely autonomous" means.
You won't know until you try it yourself.
🇺🇸 ARC-AGI-3: Every Top AI Model Scored Under 1%
ARC-AGI-3 launched on March 25, 2026, and the AI world hasn't stopped talking about it. The reason? Every major frontier model — Gemini 3.1 Pro, GPT-5.4, Claude Opus 4.6 — scored under 1%. Not just behind humans. Effectively zero.
What Is ARC-AGI-3?
The Abstraction and Reasoning Corpus (ARC-AGI) benchmark was created by Francois Chollet to test whether AI systems have genuine general intelligence — not just the ability to memorize patterns from training data. The third version marks a fundamental shift: it's fully interactive.
Instead of solving static visual puzzles, AI agents are dropped into game environments with no instructions, no stated goals, and no rule book. They must figure everything out on their own. The four core capabilities being tested are below, with a world-modeling sketch after the list:
- Exploration: Probing an unknown environment to gather information
- World Modeling: Building an internal model of how the environment works
- Goal-setting: Autonomously defining and pursuing meaningful objectives
- Long-horizon Planning: Maintaining coherent action across many steps
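Of the four, world modeling is the easiest to make concrete in code. Below is a minimal, hypothetical sketch: an agent records every transition it observes in a small grid world and uses that table to predict outcomes it was never told about. The grid, the move rules, and the `WorldModel` class are all invented for illustration and are not ARC-AGI-3's actual environments.

```python
# Minimal sketch of world modeling: learn action effects from observed
# transitions. The grid world and WorldModel are hypothetical examples,
# not part of ARC-AGI-3.
import random
from collections import defaultdict

State = tuple[int, int]  # (row, col) on a 5x5 grid


def step(state: State, action: str) -> State:
    """Hidden environment dynamics the agent must infer, never read."""
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    dr, dc = moves[action]
    # Walls clamp movement to the grid -- another rule to be discovered.
    return (min(max(state[0] + dr, 0), 4), min(max(state[1] + dc, 0), 4))


class WorldModel:
    """Learns each action's effect as (d_row, d_col) votes."""

    def __init__(self) -> None:
        self.effects: dict[str, defaultdict] = {}

    def record(self, s: State, a: str, s2: State) -> None:
        delta = (s2[0] - s[0], s2[1] - s[1])
        self.effects.setdefault(a, defaultdict(int))[delta] += 1

    def predict(self, s: State, a: str) -> State:
        votes = self.effects.get(a)
        if not votes:                  # unseen action: assume no movement
            return s
        (dr, dc), _ = max(votes.items(), key=lambda kv: kv[1])
        return (s[0] + dr, s[1] + dc)


if __name__ == "__main__":
    model, s = WorldModel(), (2, 2)
    for _ in range(200):               # exploration phase: act randomly, observe
        a = random.choice(["up", "down", "left", "right"])
        s2 = step(s, a)
        model.record(s, a, s2)
        s = s2
    print(model.predict((2, 2), "right"))  # most likely prediction: (2, 3)
```

The point of the exercise is the direction of information flow: the rules live only inside `step`, and the agent recovers them purely from interaction, which is the regime ARC-AGI-3 forces.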
The Benchmark Scores That Humbled Everyone
Here are the official ARC-AGI-3 scores from the launch:
- Google Gemini 3.1 Pro Preview: 0.37% (top scorer)
- OpenAI GPT-5.4 (High): 0.26%
- Anthropic Claude Opus 4.6 (Max): 0.25%
- xAI Grok-4.20 (Reasoning Beta): 0.00%
- Human baseline: 100%
The gap isn't close. It's a chasm. The Hacker News thread (497 points, 365 comments) captured the mood well: developers noted that the benchmarks we've been celebrating don't measure what we thought they did.
Why This Benchmark Is Harder to Game
Previous benchmarks got "solved" within years as model training data inched closer to the test sets. ARC-AGI-3 is designed to break that cycle:
- No goals or rules are given — the agent must discover them
- Environments are dynamic; knowledge must carry forward across difficulty levels
- Memorizing training data doesn't help
This tests what Chollet calls "true generalization" — deriving new rules from minimal experience, not pattern-matching against a memorized corpus.
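As a toy illustration of that distinction, the sketch below induces a transformation rule from two demonstration pairs and applies it to an unseen input. This is closer in spirit to the original static ARC puzzles than to ARC-AGI-3's interactive games, and the tiny candidate-rule space is invented for illustration; real tasks require searching a far richer program space.

```python
# Toy rule induction: find a transformation consistent with a few demos,
# then apply it to a new input. The candidate set is a deliberately tiny
# illustration, not a real ARC solver.
import numpy as np

CANDIDATE_RULES = {
    "identity":  lambda g: g,
    "flip_lr":   np.fliplr,
    "flip_ud":   np.flipud,
    "transpose": np.transpose,
    "rot90":     lambda g: np.rot90(g, 1),
}


def induce_rule(demos):
    """Return the name of the first rule consistent with every demo pair."""
    for name, fn in CANDIDATE_RULES.items():
        if all(np.array_equal(fn(x), y) for x, y in demos):
            return name
    return None


demos = [
    (np.array([[1, 2], [3, 4]]), np.array([[2, 1], [4, 3]])),
    (np.array([[5, 0], [0, 6]]), np.array([[0, 5], [6, 0]])),
]
rule = induce_rule(demos)                 # -> "flip_lr"
test = np.array([[7, 8], [9, 0]])
print(rule)
print(CANDIDATE_RULES[rule](test))        # generalizes to the unseen grid
```

A memorizing system can only answer for grids it has already seen; a rule-inducing one answers for any grid, which is the property the sub-1% scores suggest current models still lack in interactive settings.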
The Prize and Timeline
ARC Prize 2026 offers over $2 million in total prizes. The ARC-AGI-3 track has a $700,000 grand prize for the first agent to hit 100%, with milestone checkpoints on June 30 and September 30, 2026. Final results will be announced on December 4.
What This Means for Developers
Don't overcorrect and conclude AI is useless — LLMs remain extraordinarily capable for specific tasks. But ARC-AGI-3 is a clear-eyed reminder that current models are sophisticated pattern matchers, not autonomous reasoning systems. If you're building AI agents that need to explore unknown environments, adapt on the fly, and self-direct over long horizons, you're working at the frontier of what's genuinely hard.
Worth watching as the competition progresses — the first team to crack even 10% will have built something genuinely new.
You won't know until you try it yourself.
Sources
- Announcing ARC-AGI-3 | ARC Prize Official Blog
- ARC-AGI-3 Technical Report (PDF)
- ARC-AGI-3 Released: Gemini 3.1 Pro Top Scores With Just 0.37% | OfficeChai
- Hacker News Discussion: ARC-AGI-3
Related Articles
- OpenAI Spud Training Done: Is GPT-6 Weeks Away?
- Mistral Small 4: Open-Source 119B MoE That Does It All
- MCP Security Crisis: The Hidden Attack Surface in Your AI Stack
AI Tool Observer — Daily curated AI Agent & tool trends