GPT-5.4 Computer Use Beats Humans on OSWorld: Is the 75% Score Actually Usable, or Just a Good-Looking Benchmark?

By Kit 小克 | AI Tool Observer | 2026-04-10

🇺🇸 GPT-5.4 Computer Use Beats Humans on OSWorld — But Is It Practical?

How Good Is GPT-5.4 at Controlling Computers?

OpenAI released GPT-5.4 on March 5, 2026, and it became the first AI model to surpass human expert performance on the OSWorld desktop automation benchmark, scoring 75% against the 72.4% human baseline. This means AI can now complete real computer tasks — clicking, typing, navigating — more reliably than human testers.

What Is OSWorld and Why Does This Benchmark Matter?

OSWorld simulates real desktop environments and tests practical computer operations such as clicking buttons, filling in forms, managing files, and browsing the web. Unlike text-based Q&A benchmarks, it measures whether an AI can actually operate a computer. GPT-5.2 scored just 47.3%, so GPT-5.4's 75% is a jump of 27.7 percentage points in a single version.
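The gaps between these scores can be checked directly from the three figures quoted in this article. This is a minimal sketch; the constant and function names are illustrative only.

```python
# OSWorld success rates quoted in this article, in percent.
GPT_5_4 = 75.0
GPT_5_2 = 47.3
HUMAN_BASELINE = 72.4


def point_gap(a: float, b: float) -> float:
    """Difference between two percentages, in percentage points."""
    return round(a - b, 1)


gain_over_predecessor = point_gap(GPT_5_4, GPT_5_2)
margin_over_humans = point_gap(GPT_5_4, HUMAN_BASELINE)

print(f"Gain over GPT-5.2: {gain_over_predecessor} points")
print(f"Margin over human baseline: {margin_over_humans} points")
```

The margin over the human baseline is far smaller than the gain over the previous model, which is worth keeping in mind when reading the headline.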

What Can GPT-5.4 Computer Use Actually Do?

GPT-5.4 is the first mainline model with built-in computer use — no plugins or third-party integrations needed. It operates computers through both Playwright code execution and direct mouse/keyboard commands from screenshots, enabling a full build-run-verify-fix automation loop.
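One way to picture the build-run-verify-fix loop is as a retry loop that feeds each failure back into the next attempt. The sketch below is a generic illustration, not OpenAI's implementation: `build`, `run`, and `verify` are hypothetical callables standing in for the model call, the script execution, and the result check, and the stub functions at the bottom only simulate a typo being corrected.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    ok: bool   # did the script execute without error?
    log: str   # error output, used as feedback for the next build


def build_run_verify_fix(
    build: Callable[[str, Optional[str]], str],
    run: Callable[[str], RunResult],
    verify: Callable[[RunResult], bool],
    task: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first script that passes verification, or None."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        script = build(task, feedback)      # build, or rebuild using feedback
        result = run(script)                # run
        if result.ok and verify(result):    # verify
            return script
        feedback = result.log               # fix: pass the failure log forward
    return None


# Hypothetical stubs: the first build has a typo, the rebuild corrects it.
def fake_build(task: str, feedback: Optional[str]) -> str:
    return "click('#submit')" if feedback else "click('#sumbit')"


def fake_run(script: str) -> RunResult:
    ok = "#submit" in script
    return RunResult(ok=ok, log="" if ok else "selector not found")


def fake_verify(result: RunResult) -> bool:
    return result.ok


final_script = build_run_verify_fix(fake_build, fake_run, fake_verify, "submit the form")
print(final_script)
```

Capping the loop with `max_attempts` is a design choice for the sketch: an agent that never gives up can burn tokens indefinitely on a task it cannot complete.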

Data entry automation: Auto-fill forms and organize spreadsheets
Report generation: Gather data across multiple applications and compile it into reports
Email management: Auto-classify and respond to routine messages
Multi-app workflows: Switch between different software to complete complex tasks

How Much Does It Cost to Use?

Standard pricing is $[…].50 per million input tokens and $[…] per million output tokens. A typical automation session with 10-20 screenshots runs about $0.10-0.50. The Pro variant costs $[…]/$[…] per MTok and is aimed at high-stakes tasks. Input costs double beyond 272K tokens.
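A session's cost can be estimated from its token counts once you have the current per-million-token prices. The sketch below takes those prices as parameters rather than hard-coding them. It also assumes that when the input exceeds 272K tokens the whole input is billed at double rate; the article does not say whether the doubling applies to the whole input or only to the excess, so check the official price list. The function name and the prices in the usage lines are placeholders, not real rates.

```python
LONG_CONTEXT_THRESHOLD = 272_000  # input tokens; threshold quoted in the article


def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimate the cost of one session in US dollars.

    Assumption: above the threshold, the entire input is billed at 2x.
    """
    multiplier = 2.0 if input_tokens > LONG_CONTEXT_THRESHOLD else 1.0
    input_cost = input_tokens / 1_000_000 * input_price_per_mtok * multiplier
    output_cost = output_tokens / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost


# Placeholder prices for illustration only (NOT the real GPT-5.4 rates).
PLACEHOLDER_INPUT_PRICE = 1.00
PLACEHOLDER_OUTPUT_PRICE = 10.00

short_session = estimate_cost_usd(100_000, 5_000, PLACEHOLDER_INPUT_PRICE, PLACEHOLDER_OUTPUT_PRICE)
long_session = estimate_cost_usd(300_000, 5_000, PLACEHOLDER_INPUT_PRICE, PLACEHOLDER_OUTPUT_PRICE)

print(f"Short session: ${short_session:.2f}")
print(f"Long session:  ${long_session:.2f}")
```

Because screenshots are sent as input, input tokens tend to dominate the bill in a computer-use session, so the long-context multiplier matters more than it would for plain chat.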

How Does It Compare to Claude Computer Use?

Anthropic pioneered computer use with Claude back in 2024. GPT-5.4's advantage is that computer use is built directly into the mainline model, and its 1-million-token context window helps with complex multi-step tasks. In practice, the better choice depends on your specific use case.

What Does This Mean for Developers?

GPT-5.4 also scored 57.7% on SWE-bench Pro (coding) and 83% on GDPval (knowledge work), making it the first general-purpose model at frontier level across desktop control, software engineering, and knowledge work. For developers building AI agents, this is a milestone — one model can now handle diverse automation needs.

What Are the Best Use Cases?

Best suited for repetitive, process-driven desktop tasks: daily data extraction and report compilation, batch form processing, automated testing workflows. For creative work requiring high judgment, human-AI collaboration is still recommended.

FAQ

Q: What does GPT-5.4's 75% OSWorld score actually mean?

It means GPT-5.4 successfully completed 75% of simulated real desktop tasks (clicking, form filling, web browsing), surpassing the 72.4% human expert baseline — the only AI model to do so.

Q: Is GPT-5.4 computer use free?

No, it is billed via API tokens. The standard rate is $[…].50 per million input tokens. A typical automation session costs roughly $0.10-0.50.

Q: Can GPT-5.4 fully replace humans at computer tasks?

Not yet. While 75% exceeds the human baseline, a 25% failure rate remains. It is best for automating repetitive tasks, with human oversight for critical operations.
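To see why oversight still matters, it helps to work through what a 25% failure rate implies for a batch of tasks. The sketch below makes two assumptions: that the 75% benchmark rate carries over to your own workload, and that tasks succeed or fail independently of each other. Neither is guaranteed in practice, and the function names are illustrative.

```python
def p_all_succeed(p_task: float, n_tasks: int) -> float:
    """Probability that every one of n independent tasks succeeds."""
    return p_task ** n_tasks


def p_success_with_retries(p_task: float, attempts: int) -> float:
    """Probability that at least one of several independent attempts succeeds."""
    return 1 - (1 - p_task) ** attempts


def expected_failures(p_task: float, n_tasks: int) -> float:
    """Expected number of failed tasks in a batch of n."""
    return (1 - p_task) * n_tasks


P = 0.75  # OSWorld success rate quoted in the article

print(f"All 5 of 5 tasks succeed: {p_all_succeed(P, 5):.1%}")
print(f"Success within 2 attempts: {p_success_with_retries(P, 2):.1%}")
print(f"Expected failures per 100 tasks: {expected_failures(P, 100):.0f}")
```

Retries only help when a failure can be detected, which is where a verification step or a human reviewer comes in.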

Q: How much better is GPT-5.4 than GPT-5.2?

GPT-5.2 scored 47.3% on OSWorld; GPT-5.4 jumped to 75% — a nearly 28-point improvement in a single version update.

You never know until you try.
