
ARC-AGI-3 Exposes the Truth: Every Major AI Model Scores Under 1% While Untrained Humans Score 100%

By Kit 小克 | AI Tool Observer | 2026-03-27

🇹🇼 ARC-AGI-3 Exposes the Truth: GPT, Claude, and Gemini All Score Under 1% While Ordinary People Score a Perfect 100%

If you have been hearing a lot of "AI is about to surpass humans" and "AGI is almost here" lately, the numbers in this article may cool you down.

What Is ARC-AGI-3?

The ARC Prize Foundation released ARC-AGI-3 at the end of March 2026, the latest version of the ARC (Abstraction and Reasoning Corpus) benchmark series. Unlike earlier versions, ARC-AGI-3 drops AI agents into a brand-new, interactive, turn-based game environment with no manual and no pre-training advantage. It is like throwing a player who does not know the rules into a brand-new board game and seeing whether they can figure out how to play and win.

The core idea behind this design: if an AI truly has reasoning ability, it should be able to learn the rules in a completely new situation rather than rely on memorized data.

The Results: Brutal

Here are the currently known scores of the leading frontier models:

  • Gemini 3.1 Pro: 0.37% (the current "top score")
  • Leading GPT model: 0.26%
  • Claude Opus 4.6: 0.25%
  • Latest Grok: near 0%
  • Untrained ordinary humans: 100%

You read that right. Humans who had never seen this test and had no preparation at all found the patterns and solved 100% of the tasks, while the most powerful AI models on Earth topped out at 0.37%.

Why This Deserves Serious Attention

ARC-AGI-3 is carefully engineered as an environment where AI cannot "cheat": there are no answers to memorize and no patterns to reuse, only pure dynamic exploration and reasoning.

This exposes a core limitation of today's large language models: they are fundamentally powerful pattern-recognition and memory machines, not agents that can reason flexibly in genuinely novel situations. When a task requires understanding the rules on the spot and acting, as humans do, without relying on prior knowledge, the models lose their way completely.

The $2 Million Prize Is Still Waiting for a Winner

The ARC Prize Foundation is offering a $2 million prize to any AI system that reaches human-level performance on ARC-AGI-3. No one has met the bar so far, and the prize remains up for grabs.

This is not to say AI is useless: current models already perform impressively on many practical tasks. But this benchmark reminds us that "AI can talk well" and "AI can truly reason" are two very different things.

Takeaways for Developers and Users

  • Do not over-rely on AI for reasoning tasks in novel situations; it is error-prone under unfamiliar rules
  • RAG, tool calling, and structured prompting remain important ways to compensate for the models' reasoning gap
  • Tracking the ARC-AGI series is a good way to judge real AI progress, more trustworthy than marketing copy

You never know until you try. This time, though, the numbers speak more honestly than any marketing slogan.


🇺🇸 ARC-AGI-3 Exposes the Truth: Every Major AI Model Scores Under 1% While Untrained Humans Score 100%

If you have been hearing a lot of "AI is about to surpass humans" and "AGI is right around the corner" lately, these benchmark numbers might be the cold water you need.

What Is ARC-AGI-3?

The ARC Prize Foundation released ARC-AGI-3 in late March 2026, the latest iteration of the Abstraction and Reasoning Corpus benchmark series. Unlike previous versions, ARC-AGI-3 places AI agents inside a completely novel, turn-based game environment — no instructions, no pre-training advantage, no patterns to memorize. Think of it as dropping someone into a brand-new board game with zero rulebook and asking them to figure out the rules and win.

The design principle is direct: if an AI truly has reasoning ability, it should be able to learn rules in a new context from scratch — not retrieve memorized training data.
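To make the setup concrete, here is a minimal sketch of what a "no rulebook" agent loop could look like. The real ARC-AGI-3 interface is not described in this article, so the `Env` class, its observation format, and the action names below are all hypothetical stand-ins: the point is only that the agent must discover the rules by acting and observing, not by retrieving them.

```python
import random

# Hypothetical stand-in for a turn-based game environment.
# The agent is told nothing about what the actions do.
ACTIONS = ["up", "down", "left", "right", "interact"]

class Env:
    """Toy environment: the hidden rule is 'reach cell 5 on a 1-D track'."""
    def __init__(self):
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos  # observation

    def step(self, action):
        # Unknown to the agent: only "right" advances; everything else is a no-op.
        if action == "right":
            self.pos += 1
        done = self.pos >= 5
        return self.pos, done

def explore(env, max_turns=200):
    """Agent with no rulebook: act, observe, and remember which
    actions actually changed the observation."""
    obs = env.reset()
    useful = set()
    for turn in range(max_turns):
        action = random.choice(ACTIONS)
        new_obs, done = env.step(action)
        if new_obs != obs:  # this action did something
            useful.add(action)
        obs = new_obs
        if done:
            return turn + 1, useful
    return None, useful

turns, useful = explore(Env())
```

Even this toy agent illustrates the benchmark's premise: success requires building a model of the rules from interaction alone, which is exactly the in-context learning that the scores above suggest current LLM agents struggle with.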

The Results: Brutal

Here are the scores from the leading frontier models:

  • Gemini 3.1 Pro: 0.37% (currently the "top performer")
  • Leading GPT model: 0.26%
  • Claude Opus 4.6: 0.25%
  • Latest Grok: near 0%
  • Untrained humans: 100%

That is not a typo. Humans who had never seen this benchmark, with zero preparation, solved 100% of the tasks. The most powerful AI systems in the world topped out at 0.37%.

Why This Matters

ARC-AGI-3 is specifically engineered to prevent AI from cheating — there are no memorizable answers, no transferable patterns, only pure in-context exploration and reasoning.

This exposes a core limitation of current large language models: they are fundamentally powerful pattern-matching and memory retrieval systems, not reasoning agents that can flexibly operate in genuinely novel environments. When a task demands the kind of "figure it out on the spot" intelligence that humans demonstrate naturally, frontier models fall apart completely.

This is not a knock on AI's practical usefulness. Current models are genuinely excellent at many real-world tasks. But this benchmark draws a clear line between "articulate" and "intelligent" — and right now, the gap is enormous.

The $2 Million Prize Is Still Unclaimed

The ARC Prize Foundation is offering $2 million USD to any AI system that can match human-level performance on ARC-AGI-3. As of today, the prize remains uncollected. No current architecture — not GPT, not Claude, not Gemini — comes close.

What Developers and Users Should Take Away

  • Do not rely on AI for novel-context reasoning tasks — models struggle badly when the rules are not in their training data
  • Techniques like RAG, structured prompting, and tool use remain critical for bridging the reasoning gap in practice
  • Following the ARC-AGI benchmark series is one of the most reliable ways to track real AI progress — far more honest than vendor announcements
  • The next time someone claims AGI is "6 months away," ask them about ARC-AGI-3 scores
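The second bullet above can be made concrete with a small sketch. Instead of asking a model to infer unknown rules, you put the rules and retrieved context directly into a structured prompt. The retrieval step here is a toy keyword match, and all function and field names are illustrative, not any particular library's API:

```python
# Toy knowledge base a real system would replace with a vector store.
DOCS = [
    "Refunds are allowed within 30 days of purchase.",
    "Shipping is free for orders over $50.",
    "Support is available Monday through Friday.",
]

def retrieve(query, docs, k=2):
    """Toy RAG retrieval: rank docs by count of shared lowercase words."""
    q_words = set(query.lower().split())
    scored = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question, docs):
    """Structured prompt: explicit rules, explicit context, explicit task."""
    context = "\n".join(f"- {d}" for d in retrieve(question, docs))
    return (
        "Answer ONLY from the context below. "
        "If the context does not contain the answer, say 'unknown'.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
    )

prompt = build_prompt("When are refunds allowed?", DOCS)
```

The design choice is the point: the model is never asked to reason from nothing. The rules it needs are placed in context, which is precisely the crutch that ARC-AGI-3 removes.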

You won't know until you try. This time, though, the numbers did the trying for us, and they told the truth.

AI Tool Observer — Daily curated AI Agent & tool trends
