Mistral Small 4 開源:三合一 MoE 模型,免費商用 | Mistral Small 4: Open-Source 119B MoE That Does It All
By Kit 小克 | AI Tool Observer | 2026-03-30
🇹🇼 Mistral Small 4 開源:三合一 MoE 模型,免費商用
Mistral Small 4 是 Mistral AI 於 2026 年 3 月 16 日發布的 119B 參數開源模型,採用 Apache 2.0 授權,可免費商用。這個版本最大的意義不是參數量,而是「合三為一」:它把原本三個獨立模型——Magistral(推理)、Pixtral(視覺)、Devstral(程式碼)——整合成單一權重,讓企業只需維護一個模型就能搞定多種任務。
什麼是 Mistral Small 4?三合一的實際意義
過去使用 Mistral 系列,你需要根據任務類型選擇不同模型:需要推理用 Magistral、需要看圖用 Pixtral、需要寫程式用 Devstral。現在 Mistral Small 4 把這三者合一,同一個 API endpoint、同一份部署,處理所有任務。
對於需要建立多功能 AI Agent 的開發者來說,這不是小事——少一個模型就少一筆推理費用、少一套維護成本、少一個延遲節點。
核心規格一覽
- 架構:混合專家模型(MoE),128 個專家,每個 token 啟動 4 個
- 總參數:119B;每次推理實際使用:~6B
- 上下文視窗:256,000 tokens(原生支援長文件)
- 多模態:原生支援文字 + 圖片輸入
- 動態推理:reasoning_effort 參數可即時切換「快速模式」與「深度推理模式」,無需換模型
- 授權:Apache 2.0,商業使用無限制
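下面是一個假設性的最小示意,示範「同一模型、同一端點,只改 reasoning_effort」的用法。請求格式沿用 OpenAI 相容的 chat completions 慣例;reasoning_effort 的實際欄位位置與可接受值(此處假設為 "low" / "high")請以 Mistral 官方 API 文件為準。

```python
# 假設性示意:切換推理深度只需改一個參數,無需換模型。
# 欄位取值 "low" / "high" 為假設,非官方文件確認的值。

def build_request(prompt: str, effort: str) -> dict:
    """組出 OpenAI 相容的 chat completions 請求內容。"""
    return {
        "model": "mistralai/Mistral-Small-4-119B-2603",  # Hugging Face model card 上的 ID
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,  # 即時切換快速/深度模式
    }

fast = build_request("總結這份合約的重點", effort="low")     # 快速模式
deep = build_request("證明這個群是阿貝爾群", effort="high")  # 深度推理模式

assert fast["model"] == deep["model"]  # 兩種模式共用同一份權重、同一個端點
```

把 effort 做成呼叫端參數的好處是:路由邏輯留在應用層,部署端完全不用動。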
與上一代 Mistral Small 3 相比:端對端延遲降低 40%,吞吐量提升 3 倍(官方最佳化配置下)。
如何使用 Mistral Small 4?
最簡單的方式是透過 API:Mistral 官方 API 與 NVIDIA NIM 均已提供存取,按量計費。
本地部署(需要企業級硬體:4x NVIDIA H100 或 2x H200 起跳):
vllm serve mistralai/Mistral-Small-4-119B-2603 --max-model-len 262144 --tensor-parallel-size 2
消費者 GPU 用量化版:Unsloth 提供 GGUF 格式,可在 llama.cpp 上執行。速度有限,但技術上可行。
效能測試:Mistral 怎麼說
Mistral 的官方測試重點放在「輸出效率」而非純粹準確率。在抽象代數(AA LCR)任務上,Mistral Small 4 只用 1,600 字元就達成同等結果,而同級的 Qwen 模型需要 5,800–6,100 字元,約 3.5 至 4 倍的輸出量。在 LiveCodeBench 程式碼測試中,比 GPT-OSS 120B 少用 20% 的輸出 token,且分數相當。
結論:API 費用更省,這對高呼叫量的生產環境非常重要。
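輸出效率對費用的影響,用一個簡單試算就能看出來。以下數字(呼叫量、單價)皆為假設值,僅供示意,非 Mistral 官方定價:

```python
# 假設性試算:輸出 token 減少 20% 對月費的影響。
# price_per_m(每百萬輸出 token 的美元單價)為假設值。

def monthly_output_cost(calls: int, tokens_per_call: int, price_per_m: float) -> float:
    """計算每月輸出 token 費用(美元)。"""
    return calls * tokens_per_call * price_per_m / 1_000_000

calls = 1_000_000  # 假設每月 100 萬次呼叫
baseline = monthly_output_cost(calls, 500, price_per_m=0.60)   # 對照組
efficient = monthly_output_cost(calls, 400, price_per_m=0.60)  # 少 20% 輸出

print(baseline - efficient)  # → 60.0(每月省下的美元)
```

省的比例與輸出 token 的減幅一致;呼叫量越大,絕對金額差越明顯。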
Kit 小克的務實評估
Mistral Small 4 的核心價值是「務實整合」:Apache 2.0 讓商業部署沒有法律疑慮,MoE 架構讓推理成本遠低於同規模 Dense 模型,三合一設計讓開發者少維護兩個模型。
缺點也很直接:119B 本地部署門檻極高,大多數開發者只能走 API 或量化版。如果你正在評估「不被雲端廠商綁死的多功能開源模型」,這是目前最值得認真測試的選項之一。
好不好用,試了才知道。
🇺🇸 Mistral Small 4: Open-Source 119B MoE That Does It All
On March 16, 2026, Mistral AI released Mistral Small 4 — a 119B-parameter open-source model under the Apache 2.0 license, free for commercial use. The headline feature isn't the parameter count; it's consolidation: three previously separate models (Magistral for reasoning, Pixtral for vision, Devstral for coding) are now a single set of weights. One deployment handles everything.
What Is Mistral Small 4 and Why Does Three-in-One Matter?
Previously, building with Mistral meant routing tasks to different models depending on what you needed: reasoning to Magistral, image understanding to Pixtral, code generation to Devstral. Mistral Small 4 collapses that into a single API endpoint and a single deployment.
For developers building multi-capability AI agents, this is a meaningful operational win — fewer models to deploy, fewer inference endpoints to maintain, fewer latency hops.
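A hypothetical sketch of what "one endpoint, every task" looks like in practice: three request payloads (reasoning, vision, coding) all pointing at the same weights. The message shapes follow the OpenAI-compatible chat convention; treat the exact field names and the use of the Hugging Face model ID as assumptions.

```python
# Hypothetical sketch: three task types, one model, one deployment.
# Payload field names follow OpenAI-compatible chat conventions (assumed).

MODEL = "mistralai/Mistral-Small-4-119B-2603"

reasoning_task = {"model": MODEL, "messages": [
    {"role": "user", "content": "Prove that the sum of two even numbers is even."}]}

vision_task = {"model": MODEL, "messages": [
    {"role": "user", "content": [
        {"type": "text", "text": "What chart type is this?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}}]}]}

coding_task = {"model": MODEL, "messages": [
    {"role": "user", "content": "Write a function that reverses a linked list."}]}

# Previously these would route to Magistral, Pixtral, and Devstral respectively;
# now every payload targets the same weights.
assert {t["model"] for t in (reasoning_task, vision_task, coding_task)} == {MODEL}
```

The routing layer that used to pick between three model IDs simply disappears.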
Key Specifications
- Architecture: Mixture of Experts (MoE) — 128 experts, 4 active per token
- Total parameters: 119B; active per inference: ~6B
- Context window: 256,000 tokens (native long-document support)
- Modalities: Text + image input natively
- Dynamic reasoning: reasoning_effort parameter switches between fast mode and deep reasoning at runtime, no model swap needed
- License: Apache 2.0, unrestricted commercial use
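A minimal sketch of the runtime switch, assuming an OpenAI-compatible request body. The reasoning_effort parameter name comes from the announcement; its accepted values ("low"/"high" below) are assumptions, so check Mistral's API docs before relying on them.

```python
# Hypothetical sketch: per-request reasoning depth, no model swap.
# The "low"/"high" values for reasoning_effort are assumed, not confirmed.

def chat_request(prompt: str, effort: str = "low") -> dict:
    """Build an OpenAI-compatible chat payload for Mistral Small 4."""
    return {
        "model": "mistralai/Mistral-Small-4-119B-2603",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,  # switch depth per request
    }

quick = chat_request("Summarize this changelog.")                    # fast mode
slow = chat_request("Find the bug in this race condition.", "high")  # deep reasoning

assert quick["model"] == slow["model"]  # same weights, same deployment
```

Because the knob lives in the request rather than the deployment, you can tune latency vs. depth per call without touching infrastructure.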
Versus Mistral Small 3: 40% lower end-to-end latency, 3x higher throughput in optimized configurations.
How to Access and Run Mistral Small 4
Easiest route: the API. Available through Mistral's official API and NVIDIA NIM, pay-per-token.
Local deployment (requires enterprise hardware: 4x NVIDIA H100 or 2x H200 minimum):
vllm serve mistralai/Mistral-Small-4-119B-2603 --max-model-len 262144 --tensor-parallel-size 2
Consumer hardware: Unsloth provides quantized GGUF builds for llama.cpp. Technically runnable, though performance will be limited.
Performance: What Mistral Actually Benchmarks
Mistral's published benchmarks emphasize output efficiency over raw leaderboard accuracy. On Abstract Algebra (AA LCR), Mistral Small 4 reaches comparable results with 1,600 characters of output, while Qwen models in the same class need 5,800–6,100 characters, roughly 3.5–4x the output. On LiveCodeBench, it matches GPT-OSS 120B scores while generating 20% fewer output tokens.
The practical implication: lower API costs at scale, which matters for high-volume production workloads.
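To make that concrete, here's a back-of-envelope estimate. The call volume and per-token price below are hypothetical placeholders, not Mistral's actual pricing:

```python
# Hypothetical estimate: what 20% fewer output tokens saves per month.
# CALLS and PRICE are placeholder values, not Mistral's pricing.

def output_cost(calls: int, avg_output_tokens: int, usd_per_m_tokens: float) -> float:
    """Monthly output-token spend in USD."""
    return calls * avg_output_tokens * usd_per_m_tokens / 1_000_000

CALLS = 2_000_000   # hypothetical monthly call volume
PRICE = 0.60        # hypothetical USD per 1M output tokens

base = output_cost(CALLS, 500, PRICE)    # a less token-efficient model
small4 = output_cost(CALLS, 400, PRICE)  # 20% fewer output tokens

print(f"Saved per month: ${base - small4:.2f}")  # → Saved per month: $120.00
```

The savings scale linearly with call volume, which is why output efficiency matters more for production workloads than for one-off evaluation.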
Honest Assessment
Mistral Small 4's value proposition is pragmatic consolidation: Apache 2.0 removes commercial licensing friction, MoE architecture keeps inference costs well below a comparable dense model, and the three-in-one design reduces operational overhead. If you're evaluating open-source multimodal + reasoning + coding models that don't lock you into a single cloud vendor, this is one of the strongest options available right now.
The caveat is real: 119B local deployment requires datacenter-grade hardware. Most developers will consume it via API or quantized builds. But for teams serious about building on open weights, the tradeoff is worth evaluating.
You won't know until you try it.
Sources / 資料來源
- Introducing Mistral Small 4 — Mistral AI Official Blog
- mistralai/Mistral-Small-4-119B-2603 — Hugging Face Model Card
- Mistral Small 4: A 119B MoE Model Unifying Reasoning, Vision & Coding — MarkTechPost
延伸閱讀 / Related Articles
- MCP 安全漏洞大爆發:AI Agent 最危險的攻擊入口 | MCP Security Crisis: The Hidden Attack Surface in Your AI Stack
- Gemini 3.1 Flash Live:Google 即時語音 Agent API,開發者搶先試 | Gemini 3.1 Flash Live: Build Real-Time Voice AI Agents with Google
- OpenCode 被迫移除 Claude:Anthropic 畫下開源 AI 工具的訂閱紅線 | OpenCode Forced to Drop Claude: Anthropic Draws the Line on OAuth Token Abuse
AI 工具觀察站 — 每日精選 AI Agent 與工具趨勢
AI Tool Observer — Daily curated AI Agent & tool trends