AutoKernel 開源實測：AI Agent 自動優化 GPU Kernel，睡一覺醒來快 5 倍 | AutoKernel Hands-On: AI Agent Auto-Optimizes GPU Kernels While You Sleep

By Kit 小克 | AI Tool Observer | 2026-04-11

🇹🇼 AutoKernel 開源實測：AI Agent 自動優化 GPU Kernel，睡一覺醒來快 5 倍

如果你曾經花過整個週末手動調 CUDA kernel，或者在 Triton 程式碼裡反覆試參數，那 AutoKernel 這個專案可能會讓你認真考慮「讓 AI 代勞」。RightNow AI 在 2026 年 4 月初開源了 AutoKernel，一個用 AI Agent 自動迭代優化 GPU kernel 的框架，GitHub 上線不到一週就拿到 1.2k stars。

AutoKernel 是什麼？AI 怎麼自動優化 GPU Kernel？

AutoKernel 的核心概念很直覺：把 Andrej Karpathy 提過的「autoresearch」模式套用到 GPU kernel 優化上。它用一個 AI Agent 迴圈——修改程式碼、跑 benchmark、保留或回滾、不斷重複——來自動產生更快的 Triton 或 CUDA C++ kernel。每次迭代大約 90 秒，跑一晚上就能累積 300-400 次實驗。

四階段工作流程

Profile：用 torch.profiler 找出模型中最耗 GPU 時間的瓶頸 kernel
Extract：自動提取瓶頸 kernel，產生獨立的 Triton 或 CUDA C++ 實作
Optimize：AI Agent 進入迭代迴圈，每次修改→測試→決定保留或回滾
Verify：五階段正確性驗證（煙霧測試、形狀掃描、數值穩定性、確定性、邊界情況）

實際效能有多誇張？H100 上的 Benchmark 數據

在 NVIDIA H100 上，AutoKernel 產出的 Triton kernel 對比 PyTorch eager 和 torch.compile（max-autotune）都有顯著提升：

RMSNorm：比 eager 快 5.29 倍，比 torch.compile 快 2.83 倍
Softmax：比 eager 快 2.82 倍，比 torch.compile 快 3.44 倍
Cross-Entropy：比 eager 快 2.21 倍，比 torch.compile 快 2.94 倍

這些數字不是跑分灌水，因為 AutoKernel 的排程器用 Amdahl 定律來決定先優化哪個 kernel，確保優化的是真正影響端到端效能的瓶頸。

誰適合用 AutoKernel？需要什麼硬體？

你需要 NVIDIA GPU（H100、A100 或 RTX 4090）、Python 3.10+ 和 uv 套件管理器。框架內建了 GPT-2 Small、LLaMA 7B、BERT-base 等模型定義，不需要 HuggingFace transformers 就能直接跑。支援 9 種 kernel 類型，包含矩陣乘法、Flash Attention、Fused MLP、RoPE 等常見運算。

雙後端架構有什麼好處？

Triton 後端用 Python 風格語法快速迭代，CUDA C++ 後端則能存取 tensor core 做更深層優化。兩者共用相同的 benchmark 介面，讓你可以先用 Triton 快速探索，再切 CUDA C++ 榨出最後一滴效能。

試了才知道：我的實際體驗

身為 AI 工具觀察站的 Kit 小克，我最在意的是「好不好用」。AutoKernel 的安裝流程算順暢，用 uv 管理依賴省了很多環境衝突的麻煩。真正讓我驚豔的是它的「正確性優先」設計——五階段驗證確保 AI 產出的 kernel 不只是快，而是數值上正確的。這解決了我過去手動調 kernel 最頭痛的問題：改快了但結果不對。

優化後的 kernel 還能匯出到 HuggingFace Kernels Hub，方便團隊共享和復用，這對企業級應用很有價值。

FAQ：常見問題

AutoKernel 可以用在自訂模型嗎？

可以。雖然內建了 GPT-2、LLaMA、BERT 等模型定義，但 AutoKernel 設計上支援任意 PyTorch 模型。你只需提供模型的前向傳播定義即可。

一次優化需要跑多久？

每次迭代約 90 秒，一整晚的運行可以累積 300-400 次實驗，涵蓋多個 kernel 的優化。對大多數模型來說，隔夜跑就能得到顯著加速。

AutoKernel 和 torch.compile 有什麼不同？

torch.compile 是通用型編譯器優化，而 AutoKernel 用 AI Agent 針對特定瓶頸做迭代式深度優化。從 benchmark 來看，AutoKernel 在多數情境下能超越 torch.compile 的 max-autotune 模式 2-3 倍。

需要 GPU 程式設計經驗嗎？

不需要。AutoKernel 的賣點就是自動化——你只需指定模型，框架會自動 profiling、提取、優化和驗證。不過如果你有 CUDA/Triton 經驗，可以更好地理解和調整優化結果。

MIT 授權可以商用嗎？

可以。AutoKernel 採用 MIT 授權，完全可以用在商業專案中，也可以自由修改和分發。

🇺🇸 AutoKernel Hands-On: AI Agent Auto-Optimizes GPU Kernels While You Sleep

If you have ever spent a weekend hand-tuning CUDA kernels or tweaking Triton parameters by trial and error, AutoKernel might make you seriously consider letting an AI do the work. Open-sourced by RightNow AI in early April 2026, the framework uses an AI Agent loop to automatically optimize GPU kernels for any PyTorch model, and it collected 1.2k GitHub stars in under a week.

What Is AutoKernel and How Does AI Auto-Optimize GPU Kernels?

AutoKernel applies the "autoresearch" pattern described by Andrej Karpathy to GPU kernel optimization. An AI Agent loop modifies the code, runs a benchmark, keeps or reverts the change, and repeats, producing faster Triton or CUDA C++ kernels along the way. Each iteration takes roughly 90 seconds, so an overnight run yields 300-400 experiments.

Four-Stage Pipeline

Profile: torch.profiler identifies the most time-consuming GPU kernels
Extract: Bottleneck kernels are auto-extracted into standalone Triton or CUDA C++ implementations
Optimize: The AI Agent enters an edit-benchmark-keep/revert loop
Verify: Five-stage correctness validation (smoke tests, shape sweeps, numerical stability, determinism, edge cases)
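
To make the Optimize step concrete, here is a minimal sketch of a keep-or-revert loop. It is not AutoKernel's code: `toy_benchmark`, `propose_edit`, and the two tuning parameters are hypothetical stand-ins, with a simulated latency in place of a real GPU benchmark and a random tweak in place of the agent's code edit.

```python
import random


def toy_benchmark(config):
    """Simulated latency in ms (hypothetical cost model, not a real GPU run).

    The made-up optimum is block_size=256 and num_warps=8.
    """
    return (1.0
            + abs(config["block_size"] - 256) / 256
            + abs(config["num_warps"] - 8) / 8)


def propose_edit(config, rng):
    """Stand-in for the agent's edit: double or halve one parameter."""
    candidate = dict(config)
    key = rng.choice(sorted(candidate))
    value = candidate[key]
    candidate[key] = rng.choice([value * 2, max(1, value // 2)])
    return candidate


def optimize(config, iterations, seed=0):
    """Edit, benchmark, then keep the change if faster or revert if not."""
    rng = random.Random(seed)
    best, best_ms = dict(config), toy_benchmark(config)
    history = [best_ms]
    for _ in range(iterations):
        candidate = propose_edit(best, rng)
        candidate_ms = toy_benchmark(candidate)
        if candidate_ms < best_ms:
            best, best_ms = candidate, candidate_ms  # keep the edit
        # otherwise the candidate is dropped, which is the "revert"
        history.append(best_ms)
    return best, best_ms, history


if __name__ == "__main__":
    start = {"block_size": 32, "num_warps": 1}
    best, best_ms, history = optimize(start, iterations=50)
    print(f"start: {history[0]:.3f} ms, best: {best_ms:.3f} ms, config: {best}")
```

Because a change is only kept when it measures faster, the best latency never gets worse from one iteration to the next. That property is what makes it safe to let a loop like this run unattended.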

How Fast Is It Really? H100 Benchmark Numbers

On an NVIDIA H100, AutoKernel-generated Triton kernels significantly outperform both PyTorch eager and torch.compile (max-autotune):

RMSNorm: 5.29x over eager, 2.83x over torch.compile
Softmax: 2.82x over eager, 3.44x over torch.compile
Cross-Entropy: 2.21x over eager, 2.94x over torch.compile
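
One thing you can derive from these rows: both speedups divide by the same AutoKernel latency, so their ratio gives the implied speed of torch.compile relative to eager. The sketch below does only that arithmetic. It assumes both figures in a row were measured on the same shapes and settings, which the published numbers do not spell out.

```python
# (speedup over eager, speedup over torch.compile), as quoted above
reported = {
    "RMSNorm": (5.29, 2.83),
    "Softmax": (2.82, 3.44),
    "Cross-Entropy": (2.21, 2.94),
}


def implied_compile_vs_eager(over_eager, over_compile):
    """t_eager / t_compile = (t_eager / t_ak) / (t_compile / t_ak)."""
    return over_eager / over_compile


implied = {
    name: implied_compile_vs_eager(*pair) for name, pair in reported.items()
}

if __name__ == "__main__":
    for name, ratio in implied.items():
        # A ratio below 1 means torch.compile was slower than eager here.
        print(f"{name}: torch.compile is {ratio:.2f}x eager speed (implied)")
```

Read this way, the Softmax and Cross-Entropy rows would imply that torch.compile was slower than eager on those two kernels in this benchmark, again assuming an identical setup for both columns.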

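Keep in mind that these are per-kernel speedups, and a kernel that runs 5x faster only matters if the model spends real time in it. That is why AutoKernel's scheduler uses Amdahl's Law to prioritize kernels by their end-to-end impact, so the optimization effort goes to the actual bottlenecks.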
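
Here is a small sketch of how Amdahl's Law can rank kernels by end-to-end impact. The formula is standard, but the rest is assumed: the runtime fractions and the 1.2x matmul speedup are invented for illustration, and `rank_by_impact` is a hypothetical helper, not AutoKernel's scheduler.

```python
def amdahl_speedup(fraction, kernel_speedup):
    """End-to-end speedup from Amdahl's Law: S = 1 / ((1 - p) + p / s).

    fraction (p): share of total runtime spent in the kernel, in [0, 1].
    kernel_speedup (s): how many times faster the kernel itself becomes.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


# name -> (assumed runtime fraction, expected kernel speedup)
# Fractions and the matmul speedup are made up; the other speedups are the
# over-eager numbers quoted in this article.
candidates = {
    "matmul": (0.60, 1.20),
    "softmax": (0.10, 2.82),
    "rmsnorm": (0.05, 5.29),
    "cross_entropy": (0.03, 2.21),
}


def rank_by_impact(kernels):
    """Sort kernel names by end-to-end speedup, largest first."""
    return sorted(
        kernels,
        key=lambda name: amdahl_speedup(*kernels[name]),
        reverse=True,
    )


if __name__ == "__main__":
    for name in rank_by_impact(candidates):
        p, s = candidates[name]
        print(f"{name}: {amdahl_speedup(p, s):.3f}x end to end")
```

With these made-up fractions, a modest 1.2x gain on a kernel that takes 60% of the runtime (about 1.11x end to end) beats a 5.29x gain on a kernel that takes only 5% (about 1.04x).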

Who Should Use AutoKernel? What Hardware Do You Need?

Requirements: an NVIDIA GPU (H100, A100, or RTX 4090), Python 3.10+, and the uv package manager. Built-in model definitions include GPT-2 Small, LLaMA 7B, and BERT-base, so you can run it without HuggingFace transformers. Nine kernel types are supported: matmul, Flash Attention, Fused MLP, RoPE, softmax, layernorm, RMSNorm, cross-entropy, and reductions.

Why the Dual Backend Architecture Matters

The Triton backend offers fast iteration with Python-like syntax, while the CUDA C++ backend gives lower-level control, including direct access to tensor cores, for deeper optimization. Both share the same benchmark interface, so you can explore with Triton first and then switch to CUDA C++ to squeeze out the remaining performance.
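
The value of a shared benchmark interface is that two implementations get timed by the same harness and checked against each other. The sketch below shows the idea in plain Python. `benchmark` and the two `sum_squares_*` functions are hypothetical stand-ins for the two backends and are not AutoKernel APIs.

```python
import statistics
import time


def benchmark(kernel, data, warmup=3, iters=20):
    """Median wall-clock seconds per call, after a few warmup runs.

    CPU-only sketch: timing real GPU kernels also needs a device
    synchronization before each clock read, which is omitted here.
    """
    for _ in range(warmup):
        kernel(data)
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        kernel(data)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def sum_squares_loop(xs):
    """Stand-in for the first backend's implementation."""
    total = 0
    for x in xs:
        total += x * x
    return total


def sum_squares_builtin(xs):
    """Stand-in for the second backend's implementation."""
    return sum(x * x for x in xs)


if __name__ == "__main__":
    data = list(range(10_000))
    # Same inputs and same harness, so the two timings are comparable.
    assert sum_squares_loop(data) == sum_squares_builtin(data)
    for fn in (sum_squares_loop, sum_squares_builtin):
        print(f"{fn.__name__}: {benchmark(fn, data) * 1e6:.1f} us")
```

Taking the median after warmup keeps one-off slow runs from skewing the comparison, whichever implementation is being measured.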

Hands-On Impression: Does It Actually Work?

As Kit from AI Tool Observer, what I care about most is practical usability. AutoKernel's setup is smooth, with uv handling dependencies and sparing me the usual environment conflicts, but the real standout is its correctness-first design. The five-stage validation checks that AI-generated kernels are numerically correct as well as faster, which addresses my biggest headache with manual kernel tuning: a kernel that got faster but produced wrong results.
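
To show what "fast but wrong" looks like, here is a toy sketch that borrows the five stage names and applies them to a pure-Python softmax. The checks, tolerances, and function names are assumptions for illustration. What AutoKernel actually runs in each stage may look quite different.

```python
import math


def softmax_reference(xs):
    """Reference: subtract the max before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_candidate(xs):
    """Stand-in for an optimized kernel (log-sum-exp formulation)."""
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [math.exp(x - lse) for x in xs]


def softmax_naive(xs):
    """A "fast but wrong" variant: overflows on large inputs."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def allclose(a, b, rtol=1e-9, atol=1e-12):
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rtol, abs_tol=atol) for x, y in zip(a, b)
    )


def passes(check):
    """Run one check; a numeric exception counts as a failure."""
    try:
        return bool(check())
    except (OverflowError, ZeroDivisionError, ValueError):
        return False


def verify(candidate, reference=softmax_reference):
    big = [1000.0, 1000.5, 999.0]
    sweep = [[math.sin(i) for i in range(n)] for n in (1, 2, 7, 64, 1000)]
    ramp = [0.1 * i for i in range(50)]
    return {
        "smoke": passes(lambda: allclose(
            candidate([1.0, 2.0, 3.0]), reference([1.0, 2.0, 3.0]))),
        "shape_sweep": passes(lambda: all(
            allclose(candidate(xs), reference(xs)) for xs in sweep)),
        "numerical_stability": passes(lambda: allclose(
            candidate(big), reference(big))),
        "determinism": passes(lambda: candidate(ramp) == candidate(ramp)),
        "edge_cases": passes(lambda: allclose(candidate([0.0]), [1.0])
                             and allclose(candidate([-1e9, 0.0]), [0.0, 1.0])),
    }


if __name__ == "__main__":
    print("candidate:", verify(softmax_candidate))
    print("naive:    ", verify(softmax_naive))
```

The naive version passes the smoke test and the shape sweep but fails the stability check, because `math.exp(1000.0)` overflows. That is the kind of result a quick benchmark alone would never flag.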

Optimized kernels can also be exported to HuggingFace Kernels Hub, which makes them easy to share and reuse across a team and is especially useful in enterprise settings.

FAQ: Common Questions About AutoKernel

Can AutoKernel optimize custom models?

Yes. While it ships with GPT-2, LLaMA, and BERT definitions, AutoKernel supports any PyTorch model. You only need to provide the forward pass definition.

How long does optimization take?

Each iteration takes about 90 seconds, so an overnight run accumulates 300-400 experiments spread across multiple kernels. For most models, that is enough to see a significant speedup.
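
The arithmetic behind "overnight" is easy to check. The sketch below uses only the 90-second figure quoted above. The helper names and the 8-hour budget are my own.

```python
SECONDS_PER_ITERATION = 90  # approximate figure quoted above


def hours_for(iterations, seconds_per_iteration=SECONDS_PER_ITERATION):
    """Wall-clock hours needed for a given number of iterations."""
    return iterations * seconds_per_iteration / 3600


def iterations_in(hours, seconds_per_iteration=SECONDS_PER_ITERATION):
    """Whole iterations that fit into a time budget given in hours."""
    return int(hours * 3600 // seconds_per_iteration)


if __name__ == "__main__":
    for n in (300, 400):
        print(f"{n} iterations take about {hours_for(n):.1f} hours")
    print(f"an 8-hour night fits {iterations_in(8)} iterations")
```

So 300-400 experiments correspond to roughly 7.5 to 10 hours of GPU time, which matches the "run it while you sleep" framing.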

How is AutoKernel different from torch.compile?

torch.compile is a general-purpose compiler optimization, while AutoKernel uses an AI Agent to iteratively optimize specific bottleneck kernels in depth. On the three H100 kernels reported above, AutoKernel beat torch.compile's max-autotune mode by roughly 2.8x to 3.4x.

Do I need GPU programming experience?

No. AutoKernel automates profiling, extraction, optimization, and verification. However, CUDA/Triton experience helps you better understand and fine-tune results.

Is the MIT license commercial-friendly?

Yes. AutoKernel is released under the MIT license, which permits commercial use, modification, and distribution.
