
ARC-AGI-3 Exposes the Truth: Every Major AI Model Scores Under 1% While Untrained Humans Score 100%

By Kit 小克 | AI Tool Observer | 2026-03-27

🇹🇼 ARC-AGI-3 Exposes the Truth: GPT, Claude, and Gemini All Score Under 1% While Ordinary People Score a Perfect 100%

If you have been hearing a lot of "AI is about to surpass humans" and "AGI is almost here" lately, the numbers in this article may cool you down.

What Is ARC-AGI-3?

The ARC Prize Foundation released ARC-AGI-3 at the end of March 2026, the latest version of the ARC (Abstraction and Reasoning Corpus) benchmark series. Unlike earlier versions, ARC-AGI-3 drops AI agents into a brand-new, interactive, turn-based game environment with no manual and no pre-training advantage. It is like throwing a player who does not know the rules into a brand-new board game and seeing whether they can figure out how to play and win.

The core idea behind this design: if an AI truly has reasoning ability, it should be able to learn the rules in a completely new situation rather than rely on memorized data.

The Results: Brutal

Here are the currently known scores of the leading frontier models:

  • Gemini 3.1 Pro: 0.37% (the current "top score")
  • Leading GPT model: 0.26%
  • Claude Opus 4.6: 0.25%
  • Latest Grok: near 0%
  • Untrained ordinary humans: 100%

You read that right. Humans who had never seen this test and had no preparation at all found the patterns and solved 100% of the tasks, while the most powerful AI models on Earth topped out at 0.37%.

Why This Deserves Serious Attention

ARC-AGI-3 is carefully engineered as an environment where AI cannot "cheat": there are no answers to memorize and no patterns to reuse, only pure dynamic exploration and reasoning.

This exposes a core limitation of today's large language models: they are fundamentally powerful pattern-recognition and memory machines, not agents that can reason flexibly in genuinely novel situations. When a task requires understanding the rules on the spot and acting, as humans do, without relying on prior knowledge, the models lose their way completely.

The $2 Million Prize Is Still Waiting for a Winner

The ARC Prize Foundation is offering a $2 million prize to any AI system that reaches human-level performance on ARC-AGI-3. No one has met the bar so far, and the prize remains up for grabs.

This is not to say AI is useless: current models already perform impressively on many practical tasks. But this benchmark reminds us that "AI can talk well" and "AI can truly reason" are two very different things.

Takeaways for Developers and Users

  • Do not over-rely on AI for reasoning tasks in novel situations; it is error-prone under unfamiliar rules
  • RAG, tool calling, and structured prompting remain important ways to compensate for the models' reasoning gap
  • Tracking the ARC-AGI series is a good way to judge real AI progress, more trustworthy than marketing copy

You never know until you try. This time, though, the numbers speak more honestly than any marketing slogan.


🇺🇸 ARC-AGI-3 Exposes the Truth: Every Major AI Model Scores Under 1% While Untrained Humans Score 100%

If you have been hearing a lot of "AI is about to surpass humans" and "AGI is right around the corner" lately, these benchmark numbers might be the cold water you need.

What Is ARC-AGI-3?

The ARC Prize Foundation released ARC-AGI-3 in late March 2026, the latest iteration of the Abstraction and Reasoning Corpus benchmark series. Unlike previous versions, ARC-AGI-3 places AI agents inside a completely novel, turn-based game environment — no instructions, no pre-training advantage, no patterns to memorize. Think of it as dropping someone into a brand-new board game with zero rulebook and asking them to figure out the rules and win.

The design principle is direct: if an AI truly has reasoning ability, it should be able to learn rules in a new context from scratch — not retrieve memorized training data.
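To make the setup concrete, here is a minimal sketch of what a "no rulebook" agent loop could look like. The real ARC-AGI-3 interface is not described in this article, so the `Env` class, its observation format, and the action names below are all hypothetical stand-ins: the point is only that the agent must discover the rules by acting and observing, not by retrieving them.

```python
import random

# Hypothetical stand-in for a turn-based game environment.
# The agent is told nothing about what the actions do.
ACTIONS = ["up", "down", "left", "right", "interact"]

class Env:
    """Toy environment: the hidden rule is 'reach cell 5 on a 1-D track'."""
    def __init__(self):
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos  # observation

    def step(self, action):
        # Unknown to the agent: only "right" advances; everything else is a no-op.
        if action == "right":
            self.pos += 1
        done = self.pos >= 5
        return self.pos, done

def explore(env, max_turns=200):
    """Agent with no rulebook: act, observe, and remember which
    actions actually changed the observation."""
    obs = env.reset()
    useful = set()
    for turn in range(max_turns):
        action = random.choice(ACTIONS)
        new_obs, done = env.step(action)
        if new_obs != obs:  # this action did something
            useful.add(action)
        obs = new_obs
        if done:
            return turn + 1, useful
    return None, useful

turns, useful = explore(Env())
```

Even this toy agent illustrates the benchmark's premise: success requires building a model of the rules from interaction alone, which is exactly the in-context learning that the scores above suggest current LLM agents struggle with.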

The Results: Brutal

Here are the scores from the leading frontier models:

  • Gemini 3.1 Pro: 0.37% (currently the "top performer")
  • Leading GPT model: 0.26%
  • Claude Opus 4.6: 0.25%
  • Latest Grok: near 0%
  • Untrained humans: 100%

That is not a typo. Humans who had never seen this benchmark, with zero preparation, solved 100% of the tasks. The most powerful AI systems in the world topped out at 0.37%.

Why This Matters

ARC-AGI-3 is specifically engineered to prevent AI from cheating — there are no memorizable answers, no transferable patterns, only pure in-context exploration and reasoning.

This exposes a core limitation of current large language models: they are fundamentally powerful pattern-matching and memory retrieval systems, not reasoning agents that can flexibly operate in genuinely novel environments. When a task demands the kind of "figure it out on the spot" intelligence that humans demonstrate naturally, frontier models fall apart completely.

This is not a knock on AI's practical usefulness. Current models are genuinely excellent at many real-world tasks. But this benchmark draws a clear line between "articulate" and "intelligent" — and right now, the gap is enormous.

The $2 Million Prize Is Still Unclaimed

The ARC Prize Foundation is offering $2 million USD to any AI system that can match human-level performance on ARC-AGI-3. As of today, the prize remains uncollected. No current architecture — not GPT, not Claude, not Gemini — comes close.

What Developers and Users Should Take Away

  • Do not rely on AI for novel-context reasoning tasks — models struggle badly when the rules are not in their training data
  • Techniques like RAG, structured prompting, and tool use remain critical for bridging the reasoning gap in practice
  • Following the ARC-AGI benchmark series is one of the most reliable ways to track real AI progress — far more honest than vendor announcements
  • The next time someone claims AGI is "6 months away," ask them about ARC-AGI-3 scores
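The second bullet above can be made concrete with a small sketch. Instead of asking a model to infer unknown rules, you put the rules and retrieved context directly into a structured prompt. The retrieval step here is a toy keyword match, and all function and field names are illustrative, not any particular library's API:

```python
# Toy knowledge base a real system would replace with a vector store.
DOCS = [
    "Refunds are allowed within 30 days of purchase.",
    "Shipping is free for orders over $50.",
    "Support is available Monday through Friday.",
]

def retrieve(query, docs, k=2):
    """Toy RAG retrieval: rank docs by count of shared lowercase words."""
    q_words = set(query.lower().split())
    scored = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question, docs):
    """Structured prompt: explicit rules, explicit context, explicit task."""
    context = "\n".join(f"- {d}" for d in retrieve(question, docs))
    return (
        "Answer ONLY from the context below. "
        "If the context does not contain the answer, say 'unknown'.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
    )

prompt = build_prompt("When are refunds allowed?", DOCS)
```

The design choice is the point: the model is never asked to reason from nothing. The rules it needs are placed in context, which is precisely the crutch that ARC-AGI-3 removes.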

You won't know until you try. This time, though, the numbers did the trying for us, and they told the truth.

AI Tool Observer — Daily curated AI Agent & tool trends
