Microsoft MAI In-House Models in Review: Suleyman's Team Ships Three Speech and Image Models and Officially Takes On OpenAI | Microsoft MAI Models Explained: In-House AI Takes on OpenAI With Speech, Voice and Image

By Kit 小克 | AI Tool Observer | 2026-04-13

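🇹🇼 Microsoft MAI In-House Models in Review: Suleyman's Team Ships Three Speech and Image Models and Officially Takes On OpenAI

Microsoft MAI models are three in-house AI models that Microsoft launched on April 2, 2026, covering speech-to-text, text-to-speech, and image generation. They mark Microsoft's formal move onto an in-house path that diverges from OpenAI. This article walks through the three models' capabilities and pricing, and what comes next in Microsoft's AI strategy.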

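What Are Microsoft MAI Models, and Why Do They Matter?

MAI stands for Microsoft AI. The models are the first results from the "MAI Superintelligence team," which is led by Microsoft AI CEO Mustafa Suleyman and was formed in November 2025. All three went live on Microsoft Foundry and are open to developers, which means Microsoft no longer relies solely on OpenAI's GPT series.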

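MAI-Transcribe-1: The World's Most Accurate Speech-to-Text Model?

MAI-Transcribe-1 is a speech-to-text model that supports 25 languages. It posted the lowest word error rate (WER) on the FLEURS benchmark, beating Whisper-large-V3, GPT-Transcribe, and Gemini 3.1 Flash-Lite.

Speed: batch transcription is 2.5x faster than Azure's Fast tier
Cost: GPU cost is nearly half that of competitors
Pricing: from $0.36 per hour of audio

For developers building podcast transcription, meeting notes, or subtitle generation, that price combined with that accuracy is quite attractive.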

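MAI-Voice-1: One Minute of Speech in One Second

MAI-Voice-1 is a text-to-speech model whose main selling point is speed: a single GPU can generate 60 seconds of audio in under one second, and the output preserves the speaker's emotion and intonation.

Pricing: $22 per million characters
Features: keeps voice characteristics consistent across long-form content

Teams building audiobooks, voice assistants, or customer service bots should take note.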

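MAI-Image-2: The Image Model Ranked Third on Arena.ai

MAI-Image-2 climbed to third place on the Arena.ai image model leaderboard as soon as it launched, and it generates images at least 2x faster than its predecessor.

Pricing: $5 per million tokens for text input, $33 per million tokens for image output
Highlights: balances speed and quality, suited to workloads that generate images at high volume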

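Why Is Microsoft Building Its Own Models? What Happened With OpenAI?

In 2025 Microsoft renegotiated the terms of its OpenAI partnership, removing the restriction that barred Microsoft from developing general-purpose AI models on its own. In February 2026, Suleyman said publicly that Microsoft would pursue "AI self-sufficiency" and stop depending entirely on OpenAI.

Note, however, that all three Microsoft MAI models are specialized multimedia models (speech and image). Microsoft's general-purpose large language model (LLM) is expected to be another 12 to 18 months away. In other words, Microsoft will keep using GPT-5.4 for text reasoning tasks in the short term while laying the groundwork to move away from it over the medium to long term.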

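What Should Developers Choose Now?

If your use case is speech transcription, speech generation, or image generation, the MAI models are worth testing, especially for teams already on Azure, where integration cost is lowest. If you need general-purpose reasoning, GPT-5.4, Claude Opus 4.6, or Gemini 3.1 Pro are still the more mature choices.

You won't know how well they work until you try them.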

🇺🇸 Microsoft MAI Models Explained: In-House AI Takes on OpenAI With Speech, Voice and Image

Microsoft MAI models are three in-house AI models launched on April 2, 2026: MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2. They mark Microsoft's clearest step toward AI self-sufficiency and away from its $13 billion OpenAI partnership. Here's what they do, how they perform, and what it means for developers.

What Are Microsoft MAI Models?

MAI stands for Microsoft AI. These models are the first major output from the MAI Superintelligence team formed in November 2025 under Mustafa Suleyman, CEO of Microsoft AI. All three are available on Microsoft Foundry.

MAI-Transcribe-1: Best-in-Class Speech-to-Text

MAI-Transcribe-1 achieves the lowest Word Error Rate across 25 languages on the FLEURS benchmark, beating Whisper-large-V3, GPT-Transcribe, and Gemini 3.1 Flash-Lite.

Speed: 2.5x faster batch transcription than Azure's existing Fast tier
Cost: Nearly half the GPU cost of competitors
Pricing: Starting at $0.36 per audio hour
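For readers unfamiliar with the metric behind the FLEURS claim, word error rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal illustration of that definition, not the benchmark's actual scoring code: the function name is ours, and it assumes whitespace-separated words with no text normalization, which real evaluations usually apply first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: assumes whitespace tokenization and no normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                 # reference word was dropped
                curr[j - 1] + 1,             # extra word was inserted
                prev[j - 1] + substitution,  # words match or were swapped
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One wrong word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A lower value is better, and a score can exceed 1.0 when the hypothesis contains many inserted words.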

MAI-Voice-1: 60 Seconds of Audio in Under 1 Second

MAI-Voice-1 generates natural, emotionally nuanced speech while preserving speaker identity across long-form content. A single GPU produces 60 seconds of audio in under one second.

Pricing: $22 per million characters
Use cases: Audiobooks, voice assistants, customer service bots
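The headline figure implies a generation speed of at least 60x real time on one GPU. As a rough sketch of that arithmetic (the function names are illustrative, and the assumption that the same rate holds for long inputs is ours, not something the announcement states):

```python
def realtime_speedup(audio_seconds: float, generation_seconds: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


def generation_minutes(audio_hours: float, speedup: float) -> float:
    """Compute time in minutes for a given audio length.

    Assumes the speedup stays constant for long inputs (unverified).
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_hours * 60 / speedup


if __name__ == "__main__":
    # 60 s of audio in 1 s is a lower bound, since the claim is "under one second".
    speedup = realtime_speedup(60, 1)
    print(speedup)
    print(generation_minutes(10, speedup))
```

Under those assumptions, a 10-hour audiobook would need roughly 10 minutes of single-GPU time at most.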

MAI-Image-2: Top-3 on Arena.ai Leaderboard

MAI-Image-2 debuted at #3 on the Arena.ai image model leaderboard with at least 2x faster generation than its predecessor.

Pricing: $5 per million tokens (text input), $33 per million tokens (image output)
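With all three price points on the table, a quick estimator helps compare them against a real workload. This is a sketch using only the list prices quoted in this article; the function names are ours, the transcription price is a "starting at" figure, and the article does not say how many output tokens one image consumes, so that number has to come from your own measurements.

```python
# List prices as quoted in this article (USD).
TRANSCRIBE_USD_PER_AUDIO_HOUR = 0.36          # "starting at" price
VOICE_USD_PER_MILLION_CHARS = 22.0
IMAGE_USD_PER_MILLION_INPUT_TOKENS = 5.0      # text input
IMAGE_USD_PER_MILLION_OUTPUT_TOKENS = 33.0    # image output


def transcribe_cost(audio_hours: float) -> float:
    """Estimated MAI-Transcribe-1 cost for a number of audio hours."""
    return audio_hours * TRANSCRIBE_USD_PER_AUDIO_HOUR


def voice_cost(characters: int) -> float:
    """Estimated MAI-Voice-1 cost for a number of input characters."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated MAI-Image-2 cost from text-input and image-output tokens."""
    return (input_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_INPUT_TOKENS
            + output_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_OUTPUT_TOKENS)


if __name__ == "__main__":
    print(f"100 h of audio:      ${transcribe_cost(100):.2f}")
    print(f"500k characters:     ${voice_cost(500_000):.2f}")
    print(f"1M in / 2M out tok.: ${image_cost(1_000_000, 2_000_000):.2f}")
```

At these rates, 100 hours of transcription comes to about $36 and 500,000 characters of speech to about $11, before any volume tiers or regional differences that the article does not cover.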

Why Is Microsoft Building Its Own Models?

The renegotiated 2025 OpenAI partnership removed contractual restrictions preventing Microsoft from building broadly capable models. Suleyman publicly stated in February 2026 that Microsoft is pursuing "AI self-sufficiency." However, these three MAI models are multimedia-focused — Microsoft's frontier LLM is still 12-18 months away.

Should Developers Switch Now?

For speech transcription, voice generation, or image creation — especially if you're already on Azure — the MAI models are worth testing. For general-purpose reasoning, GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro remain stronger choices today.

You won't know whether they fit your workload until you try them.


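Frequently Asked Questions (FAQ)

Which models are in the Microsoft MAI lineup?

Microsoft MAI currently has three models: MAI-Transcribe-1 (speech-to-text), MAI-Voice-1 (text-to-speech), and MAI-Image-2 (image generation). All three are live on Microsoft Foundry.

Is MAI-Transcribe-1 more accurate than Whisper?

Yes. MAI-Transcribe-1 has a lower word error rate than Whisper-large-V3 on the 25-language FLEURS benchmark. Its batch transcription is also 2.5x faster than Azure's Fast tier, and its GPU cost is nearly half that of competitors.

Will Microsoft release its own large language model?

Microsoft plans to launch an in-house general-purpose large language model (LLM) by 2027. The current MAI models focus on multimedia (speech and image), and text reasoning still runs on GPT-5.4.

How much do the Microsoft MAI models cost?

MAI-Transcribe-1 costs $0.36 per hour of audio, MAI-Voice-1 costs $22 per million characters, and MAI-Image-2 costs $5 per million tokens for text input and $33 per million tokens for image output.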
