ARC-AGI-3 Exposes the Truth: Every Major AI Model Scores Under 1% While Untrained Humans Score 100%
By Kit 小克 | AI Tool Observer | 2026-03-27
If you have been hearing a lot of "AI is about to surpass humans" and "AGI is right around the corner" lately, these benchmark numbers might be the cold water you need.
What Is ARC-AGI-3?
The ARC Prize Foundation released ARC-AGI-3 in late March 2026, the latest iteration of the Abstraction and Reasoning Corpus benchmark series. Unlike previous versions, ARC-AGI-3 places AI agents inside a completely novel, turn-based game environment — no instructions, no pre-training advantage, no patterns to memorize. Think of it as dropping someone into a brand-new board game with zero rulebook and asking them to figure out the rules and win.
The design principle is direct: if an AI truly has reasoning ability, it should be able to learn rules in a new context from scratch — not retrieve memorized training data.
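The interaction loop described above can be sketched in a few lines. Everything here is a hypothetical illustration, not the real ARC-AGI-3 API: `GameEnv`, its hidden rule, and the naive trial-and-error agent are all stand-ins for "an environment whose rules must be discovered by acting in it."

```python
class GameEnv:
    """Hypothetical stand-in for a turn-based game environment.

    The agent sees only feedback from its actions; the rule itself
    is never revealed, mirroring the no-rulebook setup described above.
    """
    def __init__(self):
        self.target = 3           # hidden rule: action 3 wins
        self.turns_left = 10

    def step(self, action):
        self.turns_left -= 1
        won = (action == self.target)
        done = won or self.turns_left == 0
        reward = 1 if won else 0
        return {"turns_left": self.turns_left}, reward, done


def explore(env, actions=range(5)):
    """Naive trial-and-error agent: try actions until one is rewarded."""
    for action in actions:
        _, reward, done = env.step(action)
        if reward:
            return action         # rule discovered through interaction
        if done:
            break
    return None


print(explore(GameEnv()))  # → 3, found purely by exploration
```

Humans run a far smarter version of this loop (forming and testing hypotheses rather than enumerating actions), which is exactly the capability the benchmark probes.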
The Results: Brutal
Here are the scores from the leading frontier models:
- Gemini 3.1 Pro: 0.37% (currently the "top performer")
- Leading GPT model: 0.26%
- Claude Opus 4.6: 0.25%
- Latest Grok: near 0%
- Untrained humans: 100%
That is not a typo. Humans who had never seen this benchmark, with zero preparation, solved 100% of the tasks. The most powerful AI systems in the world topped out at 0.37%.
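Taken at face value, the gap is easy to quantify. A quick sketch using the scores listed above (the dictionary keys are just labels for this calculation, not an official leaderboard format):

```python
# Scores as reported above (percent of tasks solved)
scores = {
    "Gemini 3.1 Pro": 0.37,
    "Leading GPT model": 0.26,
    "Claude Opus 4.6": 0.25,
    "Latest Grok": 0.0,
    "Untrained humans": 100.0,
}

# Best non-human score vs. the human baseline
best_model = max(v for k, v in scores.items() if k != "Untrained humans")
gap = scores["Untrained humans"] / best_model
print(f"Best model: {best_model}%  ->  humans solve ~{gap:.0f}x more tasks")
# → humans solve roughly 270x more tasks than the best model
```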
Why This Matters
ARC-AGI-3 is specifically engineered to prevent AI from cheating — there are no memorizable answers, no transferable patterns, only pure in-context exploration and reasoning.
This exposes a core limitation of current large language models: they are fundamentally powerful pattern-matching and memory retrieval systems, not reasoning agents that can flexibly operate in genuinely novel environments. When a task demands the kind of "figure it out on the spot" intelligence that humans demonstrate naturally, frontier models fall apart completely.
This is not a knock on AI's practical usefulness. Current models are genuinely excellent at many real-world tasks. But this benchmark draws a clear line between "articulate" and "intelligent" — and right now, the gap is enormous.
The $2 Million Prize Is Still Unclaimed
The ARC Prize Foundation is offering $2 million USD to any AI system that can match human-level performance on ARC-AGI-3. As of today, the prize remains uncollected. No current architecture — not GPT, not Claude, not Gemini — comes close.
What Developers and Users Should Take Away
- Do not rely on AI for novel-context reasoning tasks — models struggle badly when the rules are not in their training data
- Techniques like RAG, structured prompting, and tool use remain critical for bridging the reasoning gap in practice
- Following the ARC-AGI benchmark series is one of the most reliable ways to track real AI progress — far more honest than vendor announcements
- The next time someone claims AGI is "6 months away," ask them about ARC-AGI-3 scores
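One practical mitigation from the list above is to force structure onto the model's input rather than trusting free-form reasoning: make every known rule and the exact current state explicit instead of hoping the model infers them. The sketch below is one minimal way to do that; the function name, field names, and prompt layout are all illustrative choices, not a standard API.

```python
import json

def build_structured_prompt(rules: list, state: str, question: str) -> str:
    """Pack verified rules and the exact current state into an explicit
    prompt, instead of relying on the model to reconstruct context on
    its own. The resulting JSON string is sent as the model input."""
    payload = {
        "known_rules": rules,       # only rules confirmed so far
        "current_state": state,     # the exact situation, not a summary
        "question": question,
        "answer_format": "a single JSON object with key 'action'",
    }
    return json.dumps(payload, indent=2)

prompt = build_structured_prompt(
    rules=["moving onto a red tile ends the turn"],
    state="player at (2, 3), red tile at (2, 4)",
    question="which move avoids ending the turn?",
)
print(prompt)
```

The design choice here is to keep the model's job narrow: it only has to map explicit rules to an action, rather than discover the rules, which is precisely where ARC-AGI-3 shows current models failing.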
You never know until you try. But this time, the numbers have already spoken, and they are more honest than any marketing copy.
Sources / 資料來源
- ARC-AGI-3 Official — ARC Prize Foundation
- ARC-AGI-3 Offers $2M — Every Frontier Model Scores Below 1% (The Decoder)
- ARC-AGI-3 Benchmark Reveals Major Gap Between Frontier Models and Human Reasoning (MLQ.ai)
AI Tool Observer — Daily curated AI Agent & tool trends