GLM-5.1 Open-Source Model Tested: 754B Parameters, 8 Hours Nonstop, and the Truth Behind Its SWE-Bench Pro Win | GLM-5.1 Hands-On: The 754B Open-Source Model That Codes for 8 Hours Straight

By Kit 小克 | AI Tool Observer | 2026-04-12

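🇹🇼 GLM-5.1 Open-Source Model Tested: 754B Parameters, 8 Hours Nonstop, and the Truth Behind Its SWE-Bench Pro Win

Z.ai (formerly Zhipu AI) dropped a bombshell on April 7: GLM-5.1, a 754B-parameter open-source MoE model released on Hugging Face under the MIT license. The most striking part is not the parameter count but the claim that it can work autonomously for 8 hours straight and complete an agent task chain of 1,700 steps. Agents at the end of last year managed roughly 20 steps, so this figure is an 85-fold increase.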

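What exactly is GLM-5.1, and why are developers talking about it?

GLM-5.1 is Z.ai's flagship open-source model, designed specifically for agentic engineering. Core specs:

754B parameters, Mixture-of-Experts architecture
200K context window, 128K max output tokens
MIT license: commercial use, modification, and fine-tuning are all allowed, with no restrictions beyond keeping the copyright and license notice
Supports mainstream agent tools such as Claude Code and OpenClaw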

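First on SWE-Bench Pro, but is the lead really that big?

GLM-5.1 scores 58.4 on SWE-Bench Pro, ahead of GPT-5.4 (57.7) and Claude Opus 4.6 (57.3). To be honest, the margin is small: 0.7 points over GPT-5.4 and 1.1 points over Claude Opus 4.6. More notably, on the composite coding evaluation that includes Terminal-Bench 2.0 and NL2Repo, Claude Opus 4.6 still leads 57.5 to 54.9.

So the conclusion is this: GLM-5.1 tops one specific benchmark, but overall it trades wins with the top closed-source models. The real significance is that an open-source model achieved this.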

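8 hours of autonomous work: from "Vibe Coding" to "Agentic Engineering"

Z.ai ran an impressive demo: GLM-5.1 was asked to build a web app with a Linux-style desktop environment from scratch. No starter code, no design mockups, no intervention along the way.

The result? It completed 655 iterations within 8 hours and autonomously built a file browser, terminal, text editor, system monitor, and even a playable mini-game. Earlier models typically produced a taskbar and a placeholder window and then declared the job done.

This is GLM-5.1's biggest breakthrough: not smarter single responses, but the ability to keep iterating, correct itself, and improve step by step.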

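A new benchmark for open source? Three reasons to pay attention

The MIT license is a real draw: unlike some models that are "open source but restricted for commercial use," GLM-5.1 carries no such limits. Enterprises can deploy it directly without worrying about licensing.
The ceiling for agent capability has been raised: a 1,700-step task chain means you can hand it more complex, longer-running engineering work, not just "write me a function."
A milestone for China's open-source AI ecosystem: after DeepSeek, another open-source model from China has broken into the global top tier.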

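What should you watch out for in practice?

A 754B-parameter MoE model needs substantial hardware to run locally. Most developers will use it through an API (DeepInfra or Z.ai's official service). The 8-hour autonomous mode has to be paired with a suitable agent framework; it does not work out of the box. Also, benchmark scores are not the same as performance in real production environments, so run it on your own use cases before deciding.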

FAQ

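Is GLM-5.1 really stronger than GPT-5.4 and Claude Opus 4.6?

On SWE-Bench Pro, yes, but the margin is small (0.7 points over GPT-5.4 and 1.1 points over Claude Opus 4.6). On the composite coding evaluation, Claude Opus 4.6 still leads. It depends on your specific use case.

Can I run GLM-5.1 locally?

In theory yes, but a 754B MoE model needs a large amount of GPU memory. Most people will use it through an API service; both DeepInfra and Z.ai offer one.

What does GLM-5.1's MIT license mean?

It means you can download it for free, use it commercially, modify it, and fine-tune it, with no usage restrictions beyond keeping the copyright and license notice. It is one of the most permissive open-source licenses.

How do I use the 8-hour autonomous work mode?

You need to pair the model with an agent framework (such as Claude Code or OpenClaw) that lets it execute, check its output, and fix problems automatically in a loop. It is not a plain API call.

What scenarios is GLM-5.1 suited for?

It is best suited to long-running software engineering tasks: fixing bugs, refactoring code, and building new feature modules. Short conversations and simple Q&A are not its strength.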

🇺🇸 GLM-5.1 Hands-On: The 754B Open-Source Model That Codes for 8 Hours Straight

Z.ai (formerly Zhipu AI) released GLM-5.1 on April 7: a 754-billion-parameter open-source MoE model under the MIT license, freely available on Hugging Face. The headline number is not the parameter count but the claim that it can work autonomously for 8 hours straight, chaining 1,700 agent steps. Late last year, agents managed about 20 steps. That is an 85x jump.

What Is GLM-5.1 and Why Should Developers Care?

GLM-5.1 is Z.ai's flagship open model built specifically for agentic engineering — long-horizon, multi-step coding tasks. Key specs:

754B parameters, Mixture-of-Experts architecture
200K context window, 128K max output tokens
MIT license: commercial use, modification, and fine-tuning are all allowed, provided the copyright and license notice is retained
Compatible with Claude Code, OpenClaw, and other mainstream agent frameworks

SWE-Bench Pro #1 — But How Big Is the Lead?

GLM-5.1 scores 58.4 on SWE-Bench Pro, topping GPT-5.4 (57.7) and Claude Opus 4.6 (57.3). Honest take: that is a lead of 0.7 points over GPT-5.4 and 1.1 points over Claude Opus 4.6. On the broader coding composite including Terminal-Bench 2.0 and NL2Repo, Claude Opus 4.6 still leads 57.5 to 54.9, a 2.6-point gap in the other direction.
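To make the size of those margins concrete, here is a small sketch that recomputes them from the scores quoted in this article. The helper name `leads_over` is made up for illustration; the numbers are the only part that comes from the text.

```python
# SWE-Bench Pro scores as quoted in this article.
swe_bench_pro = {
    "GLM-5.1": 58.4,
    "GPT-5.4": 57.7,
    "Claude Opus 4.6": 57.3,
}


def leads_over(leader: str, scores: dict[str, float]) -> dict[str, float]:
    """Return the leader's margin over every other model, rounded to 0.1."""
    top = scores[leader]
    return {
        name: round(top - score, 1)
        for name, score in scores.items()
        if name != leader
    }


# GLM-5.1's lead on SWE-Bench Pro.
margins = leads_over("GLM-5.1", swe_bench_pro)

# Claude Opus 4.6's lead on the broader coding composite (57.5 vs 54.9).
composite_gap = round(57.5 - 54.9, 1)

print(margins)
print(composite_gap)
```

Both leads on SWE-Bench Pro are smaller than the 2.6-point deficit on the composite, which is why "trades blows" is a fairer summary than "wins."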

The real story is not that GLM-5.1 crushes proprietary models — it is that an MIT-licensed open model now trades blows with the best closed-source ones.

8-Hour Autonomous Work: From Vibe Coding to Agentic Engineering

Z.ai ran an impressive demo: GLM-5.1 was told to build a Linux-style desktop environment as a web app from scratch. No starter code, no mockups, no human intervention.

Result: 655 iterations over 8 hours, producing a file browser, terminal, text editor, system monitor, and even playable games. Previous models typically rendered a taskbar and a placeholder window before declaring the job done.

This is the real breakthrough: not smarter single responses, but sustained iteration, self-correction, and incremental improvement over hours.

Three Reasons GLM-5.1 Matters for the Open-Source Ecosystem

MIT license is the real deal: Unlike models with commercial-use restrictions, GLM-5.1 is fully free. Enterprises can deploy without licensing headaches.
Agent ceiling just got higher: 1,700-step task chains mean you can hand off complex, long-running engineering work — not just “write me a function.”
China's open-source AI milestone: Following DeepSeek, another Chinese open model has entered the global top tier.

What to Watch Out For in Practice

A 754B MoE model demands serious hardware to run locally. Most developers will use it via API (DeepInfra, Z.ai official). The 8-hour autonomous mode requires a proper agent harness — it is not plug-and-play. And benchmarks are not production: test on your own workloads before committing.
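For a rough sense of "serious hardware," the sketch below estimates the memory needed just to hold the weights at a few common precisions. The bit widths are assumptions for illustration (the article does not say which formats are published), and the figures ignore the KV cache, activations, and runtime overhead. A MoE model activates only some experts per token, but all weights still have to be stored somewhere.

```python
PARAMS = 754e9  # total parameter count quoted in the article


def weight_memory_gb(params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


# Assumed precisions: 16-bit, 8-bit, and 4-bit weights.
estimates = {bits: weight_memory_gb(PARAMS, bits) for bits in (16, 8, 4)}

for bits, gb in estimates.items():
    print(f"{bits:>2}-bit weights: ~{gb:,.0f} GB")
```

Even the 4-bit estimate is several hundred gigabytes, which is consistent with the advice that most developers will reach for an API.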

FAQ

Is GLM-5.1 really better than GPT-5.4 and Claude Opus 4.6?

On SWE-Bench Pro, yes, but narrowly: by 0.7 points over GPT-5.4 and 1.1 points over Claude Opus 4.6. On composite coding benchmarks, Claude Opus 4.6 still leads. It depends on your use case.

Can I run GLM-5.1 locally?

Technically yes, but the 754B MoE model requires substantial GPU memory. Most users access it through API providers like DeepInfra or Z.ai.

What does MIT license mean for GLM-5.1?

You can download, use commercially, modify, and fine-tune it; the only condition is keeping the copyright and license notice with copies. It is one of the most permissive open-source licenses available.

How does the 8-hour autonomous mode work?

You need an agent framework (Claude Code, OpenClaw, etc.) that runs the model in a loop — execute, review output, self-correct, repeat. It is not a simple API call.
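The execute, review, self-correct, repeat cycle can be sketched in a few lines. This is a minimal illustration, not how Claude Code, OpenClaw, or Z.ai's own harness is implemented: `run_agent`, `propose`, `execute`, and `is_done` are all hypothetical names, and the toy stand-ins at the bottom replace the real model and tool calls so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentRun:
    steps: int = 0
    history: list[str] = field(default_factory=list)
    done: bool = False


def run_agent(
    propose: Callable[[list[str]], str],  # hypothetical: ask the model for the next action
    execute: Callable[[str], str],        # hypothetical: run the action, return its output
    is_done: Callable[[str], bool],       # hypothetical: decide whether the task is finished
    max_steps: int = 1700,                # step budget; 1,700 is the figure from the article
) -> AgentRun:
    """Loop: propose an action, execute it, record the result, stop when done."""
    run = AgentRun()
    while run.steps < max_steps and not run.done:
        action = propose(run.history)      # the model sees everything tried so far
        output = execute(action)           # e.g. run tests, a shell command, an edit
        run.history.append(f"{action} -> {output}")
        run.steps += 1
        run.done = is_done(output)         # review step: did that fix it?
    return run


# Toy stand-ins so the sketch runs without any model or API.
failing = {"count": 3}


def fake_propose(history: list[str]) -> str:
    return f"fix attempt {len(history) + 1}"


def fake_execute(action: str) -> str:
    failing["count"] -= 1
    if failing["count"] == 0:
        return "all tests pass"
    return f"{failing['count']} tests failing"


result = run_agent(fake_propose, fake_execute, lambda out: out == "all tests pass")
print(result.steps, result.done)
```

The step budget matters in a real harness: without `max_steps`, a model that never satisfies `is_done` would loop indefinitely.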

What is GLM-5.1 best suited for?

Long-running software engineering tasks: bug fixes, code refactoring, building feature modules. It is not optimized for short conversations or simple Q&A.
