GPT-5.4 電腦操作能力超越人類：OSWorld 75% 勝過人類 72.4%，百萬 Token 上下文首度開放 | GPT-5.4 Beats Humans at Computer Use: 75% on OSWorld vs 72.4% Human Baseline

April 13, 2026

By Kit 小克 | AI Tool Observer | 2026-04-14

🇹🇼 GPT-5.4 電腦操作能力超越人類：OSWorld 75% 勝過人類 72.4%，百萬 Token 上下文首度開放

GPT-5.4 是 OpenAI 在 2026 年 3 月 5 日發布的最新旗艦模型，最大亮點是原生電腦操作能力——在 OSWorld 基準測試中拿下 75% 成功率，首次超越人類的 72.4%。搭配 100 萬 Token 上下文視窗，這是 OpenAI 第一個能真正「用電腦工作」的通用模型。

GPT-5.4 的電腦操作能力有多強？

GPT-5.4 是 OpenAI 首款內建原生電腦操作（Computer Use）能力的通用模型。在 OSWorld-Verified 基準測試中，它能透過截圖判斷畫面、操作滑鼠和鍵盤來完成桌面任務，成功率達 75%，大幅超越前代 GPT-5.2 的 47.3%，也超過人類測試者的 72.4%。以這項基準測試來看，這是 AI 第一次在「操作電腦」這件事上比人類測試者做得更好。

GPT-5.4 有哪些版本和定價？

OpenAI 一口氣推出五個版本，覆蓋從企業到邊緣裝置的所有場景：

GPT-5.4 Standard：$2.50 / $15 per MTok，100 萬 Token 上下文
GPT-5.4 Pro：$30 / $180 per MTok，適合法律、醫療、金融等高精度場景
GPT-5.4 Thinking：互動式推理模式，適合複雜多步驟任務
GPT-5.4 Mini：$0.75 / $4.50 per MTok，40 萬 Token 上下文，免費用戶也能用
GPT-5.4 Nano：$0.20 / $1.25 per MTok，僅限 API，適合高流量低延遲場景

GPT-5.4 在程式碼和知識工作表現如何？

GPT-5.4 不只是電腦操作強，在其他基準測試也全面領先：SWE-bench Pro（程式碼）57.7%、OSWorld（電腦操作）75%、GDPval（知識工作）83%。OpenAI 表示它比 GPT-5.2 減少 33% 的事實錯誤率，是第一個在三個不同領域都達到前沿水準的模型。

GPT-5.4 對開發者有什麼影響？

對開發者來說，最大的變化是自動化工作流程的可能性。以前 AI 只能給建議，現在 GPT-5.4 可以直接操作瀏覽器、填表單、跨應用程式完成任務。搭配 100 萬 Token 的超長上下文，AI Agent 可以處理需要長時間規劃和執行的複雜任務鏈。

不過要注意，75% 的 OSWorld 成功率雖然超越人類基準，但代表大約每四個測試任務還是有一個會失敗。在生產環境中使用電腦操作功能，錯誤處理和人工監督機制還是不能少。好不好用，試了才知道。

🇺🇸 GPT-5.4 Beats Humans at Computer Use: 75% on OSWorld vs 72.4% Human Baseline

GPT-5.4, released by OpenAI on March 5, 2026, makes native computer use its headline feature. On the OSWorld-Verified benchmark it reached a 75% success rate, surpassing the 72.4% human baseline for the first time. Together with a 1-million-token context window, that makes it OpenAI's first general-purpose model that can truly "work on a computer."

How Good Is GPT-5.4 at Computer Use?

GPT-5.4 is OpenAI's first general-purpose model with native computer use built in. OSWorld-Verified measures a model's ability to complete desktop tasks by reading screenshots and issuing keyboard and mouse actions. GPT-5.4 scores 75% on it, far above GPT-5.2's 47.3% and ahead of the 72.4% human baseline. By this measure, it is the first time an AI model has outperformed human testers at operating a computer.

What Are the GPT-5.4 Variants and Pricing?

OpenAI launched five variants covering everything from enterprise to edge:

GPT-5.4 Standard: $2.50 / $15 per MTok, 1M-token context
GPT-5.4 Pro: $30 / $180 per MTok, for high-precision legal, medical, and financial work
GPT-5.4 Thinking: interactive reasoning mode for complex multi-step tasks
GPT-5.4 Mini: $0.75 / $4.50 per MTok, 400K-token context, available to free-tier users
GPT-5.4 Nano: $0.20 / $1.25 per MTok, API-only, for high-volume, low-latency workloads
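
To make the price gaps concrete, here is a minimal Python sketch that estimates the cost of a single request for each priced variant. It assumes the two figures in each pair are the input and output prices per million tokens (MTok); the article does not label them, so treat that reading as an assumption. The names `Price`, `PRICES`, and `estimate_cost` and the token counts are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_mtok: float   # USD per million input tokens (assumed meaning of the first figure)
    output_per_mtok: float  # USD per million output tokens (assumed meaning of the second figure)


# Prices as listed above. Thinking is left out because no price is given for it.
PRICES = {
    "standard": Price(2.50, 15.00),
    "pro": Price(30.00, 180.00),
    "mini": Price(0.75, 4.50),
    "nano": Price(0.20, 1.25),
}


def estimate_cost(variant: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request for the given variant."""
    price = PRICES[variant]
    total = input_tokens * price.input_per_mtok + output_tokens * price.output_per_mtok
    return total / 1_000_000


# Hypothetical request: 200K tokens in, 10K tokens out.
for name in PRICES:
    print(f"{name}: ${estimate_cost(name, 200_000, 10_000):.4f}")
```

Under these assumptions the same request costs twelve times more on Pro than on Standard, which is the ratio between their listed prices.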

How Does GPT-5.4 Perform on Code and Knowledge Work?

GPT-5.4 also posts strong results beyond computer use: 57.7% on SWE-bench Pro (coding), 75% on OSWorld (computer use), and 83% on GDPval (knowledge work). OpenAI reports 33% fewer factual errors than GPT-5.2, making it the first model to reach frontier-level performance in three distinct domains at once.

What Does GPT-5.4 Mean for Developers?

The biggest shift for developers is the automation potential. Previously, AI could only suggest actions; now GPT-5.4 can directly operate browsers, fill in forms, and complete tasks across applications. With the 1-million-token context window, AI agents can take on complex task chains that require long-horizon planning and execution.
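
As a rough illustration of what the context window means for a long agent run, the sketch below counts how many steps fit before the window is full. It assumes the whole history stays in context with no summarization or truncation, and that every step adds the same number of tokens; both are simplifications. The context sizes are the ones this article gives for Standard, Pro, and Mini, while `steps_that_fit` and the per-step token count are hypothetical.

```python
# Context sizes as stated in this article; other variants are not specified.
CONTEXT_LIMITS = {
    "standard": 1_000_000,
    "pro": 1_000_000,
    "mini": 400_000,
}


def steps_that_fit(variant: str, tokens_per_step: int, reserved_for_output: int = 0) -> int:
    """Return how many equally sized steps fit in the variant's context window."""
    if tokens_per_step <= 0:
        raise ValueError("tokens_per_step must be positive")
    budget = CONTEXT_LIMITS[variant] - reserved_for_output
    # Floor division can go negative when the reservation exceeds the window.
    return max(budget // tokens_per_step, 0)


# Hypothetical agent that adds about 5,000 tokens of screenshots and actions per step.
for name in CONTEXT_LIMITS:
    print(name, steps_that_fit(name, 5_000, reserved_for_output=50_000))
```

A real agent would need an actual tokenizer and its own per-step measurements; the point is only that the larger window raises the ceiling on uninterrupted task length.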

That said, a 75% OSWorld success rate still means roughly one in four benchmark tasks fails. For production use of computer use, error handling and human oversight remain essential.
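
A short Python sketch shows why that failure rate matters once tasks are chained, and what a minimal oversight wrapper might look like. It treats 75% as an independent per-task success probability, which is an assumption: a benchmark average says nothing definite about any particular production task. The helpers `chain_success`, `success_with_retries`, and `run_with_oversight` are hypothetical names, and the wrapper calls no real API; the action and the approval step are plain callables you would supply.

```python
from typing import Callable


def chain_success(per_task: float, steps: int) -> float:
    """Probability that every task in a chain succeeds, assuming independence."""
    return per_task ** steps


def success_with_retries(per_task: float, attempts: int) -> float:
    """Probability that at least one of several independent attempts succeeds."""
    return 1 - (1 - per_task) ** attempts


def run_with_oversight(
    description: str,
    action: Callable[[], bool],
    approve: Callable[[str], bool],
    max_attempts: int = 3,
) -> bool:
    """Ask a human for approval, then try the action up to max_attempts times."""
    if not approve(description):
        return False  # the human declined, so nothing is executed
    for attempt in range(1, max_attempts + 1):
        try:
            if action():
                return True
        except Exception as exc:  # one failed attempt should not crash the run
            print(f"attempt {attempt} raised {exc!r}")
    return False  # retries exhausted; the caller should escalate to a human


print(chain_success(0.75, 4))         # four chained tasks
print(success_with_retries(0.75, 3))  # one task, up to three attempts
```

Under the independence assumption, a chain of four tasks completes only about 32% of the time, while three attempts at a single task succeed about 98% of the time. Retries only help when the action is safe to repeat, which is one reason the approval step comes first.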

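Frequently Asked Questions (FAQ)

When was GPT-5.4 released?

GPT-5.4 was released on March 5, 2026, in five variants: Standard, Pro, Thinking, Mini, and Nano.

How good is GPT-5.4 at computer use?

On the OSWorld-Verified benchmark, GPT-5.4 reached a 75% success rate, surpassing the 72.4% human-tester baseline for the first time and far exceeding GPT-5.2's 47.3%.

How much does the GPT-5.4 API cost?

Standard is $2.50 / $15 per MTok, Mini is $0.75 / $4.50, and Nano is $0.20 / $1.25. Pro is the most expensive at $30 / $180 per MTok.

What is GPT-5.4's context length?

GPT-5.4 Standard and Pro support a 1-million-token context; Mini supports 400K tokens.

How does GPT-5.4 compare with Gemini 3.1?

GPT-5.4 leads on computer use (75% on OSWorld), while Gemini 3.1 has the edge in native multimodal processing and a 2-million-token context. Each has its strengths.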
