Kernel Opt Agent

This repository provides four Claude Code / Codex skills for GPU kernel work:

kernel-KBS: a read/query knowledge base for CUDA, Triton, CuTe DSL, CUTLASS, and Ampere/Hopper/Blackwell kernel research.
kernel-benchmark: a standalone benchmark workflow for comparing CUDA-C++, CUTLASS, CuTe DSL, or Triton kernels against PyTorch eager, torch.compile, and FlashInfer references.
kernel-profile: a local profiling workflow for environment checks, correctness validation, Nsight Compute metrics, and bottleneck diagnosis.
kernel-loop: an iterative optimization orchestrator that chains profiling, KBS-guided hypotheses, one-change kernel iterations, final benchmarking, and reports.

Skills

| Skill | Purpose | Main entry points |
| --- | --- | --- |
| `kernel-KBS` | Search evidence-backed kernel knowledge from PRs, docs, blogs, contests, curated wiki pages, code artifacts, and provenance records. | `skills/kernel-KBS/scripts/kbs.py` |
| `kernel-benchmark` | Compare custom CUDA-C++, CUTLASS, CuTe DSL, or Triton kernels against PyTorch eager, `torch.compile`, and FlashInfer baselines for correctness and latency. | `skills/kernel-benchmark/scripts/benchmark.py` |
| `kernel-profile` | Validate and profile concrete CUDA-C++, CUTLASS, CuTe DSL, or Triton kernels, then classify bottlenecks from NCU evidence. | `skills/kernel-profile/env/scripts/env_check.py`, `skills/kernel-profile/env/scripts/enc_config.py`, `skills/kernel-profile/scripts/correctness_check.py`, `skills/kernel-profile/scripts/ncu_profile.py` |
| `kernel-loop` | Run a fixed-iteration optimization loop with environment checks, correctness, NCU profiling, bottleneck classification, KBS evidence, one-variable hypotheses, final benchmark, and report. | `skills/kernel-loop/SKILL.md`, `skills/kernel-loop/references/hypothesis.md`, `skills/kernel-loop/references/report_template.md` |

kernel-KBS is read-only by default and should be used for retrieval and source-backed implementation ideas. It does not run kernels or collect performance data.

kernel-benchmark runs standalone benchmarks and writes benchmark.md, including correctness results, timing statistics, and speedups versus selected baselines. It uses KernelBench-style CUDA event timing by default.
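As a rough illustration of the timing statistics and speedups mentioned above, the sketch below summarizes per-iteration latencies and computes speedup against each baseline. It assumes the samples are in milliseconds, as CUDA event timing would produce. The function names (`summarize`, `speedup`, `report_rows`) and the sample numbers are hypothetical and are not taken from `benchmark.py`.

```python
import statistics


def summarize(samples_ms):
    """Return mean, median, and sample standard deviation of latencies (ms)."""
    return {
        "mean": statistics.fmean(samples_ms),
        "median": statistics.median(samples_ms),
        # stdev needs at least two samples; report 0.0 for a single sample.
        "std": statistics.stdev(samples_ms) if len(samples_ms) > 1 else 0.0,
    }


def speedup(baseline_ms, custom_ms):
    """Speedup of the custom kernel over a baseline, based on mean latency."""
    return summarize(baseline_ms)["mean"] / summarize(custom_ms)["mean"]


def report_rows(custom_ms, baselines):
    """Build (baseline name, baseline mean ms, speedup) rows for a report."""
    return [
        (name, summarize(samples)["mean"], speedup(samples, custom_ms))
        for name, samples in baselines.items()
    ]


# Hypothetical latency samples, for illustration only.
custom = [0.50, 0.52, 0.48]
baselines = {
    "eager": [1.00, 1.02, 0.98],
    "torch.compile": [0.75, 0.76, 0.74],
}
for name, mean_ms, ratio in report_rows(custom, baselines):
    print(f"{name}: {mean_ms:.3f} ms, speedup {ratio:.2f}x")
```

A speedup above 1.0 means the custom kernel is faster than that baseline. The actual layout of benchmark.md may differ from these rows.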

kernel-profile runs local checks and profiling. It produces artifacts such as env_check.md, correctness.md, ncu_summary.md, and ncu_details.md.
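To make the bottleneck diagnosis step more concrete, here is a minimal sketch of a classifier over two Nsight Compute throughput percentages. The 60% threshold, the label strings, and the function name `classify_bottleneck` are assumptions chosen for illustration; the skill's own classification rules may use different metrics and cutoffs.

```python
SM_METRIC = "sm__throughput.avg.pct_of_peak_sustained_elapsed"
DRAM_METRIC = "dram__throughput.avg.pct_of_peak_sustained_elapsed"


def classify_bottleneck(metrics, threshold=60.0):
    """Classify a kernel from SM and DRAM throughput percentages.

    `metrics` maps NCU metric names to percent-of-peak values.
    The threshold is an illustrative assumption, not a tuned value.
    """
    sm = metrics.get(SM_METRIC, 0.0)
    dram = metrics.get(DRAM_METRIC, 0.0)
    if sm < threshold and dram < threshold:
        # Neither unit is near its peak: likely latency or occupancy limits.
        return "latency-or-occupancy-bound"
    return "compute-bound" if sm >= dram else "memory-bound"


# Hypothetical values, as might be read from an NCU summary.
print(classify_bottleneck({SM_METRIC: 35.0, DRAM_METRIC: 88.0}))
```

A two-metric rule like this is only a first pass; the detailed NCU sections are still needed to confirm the diagnosis.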

kernel-loop orchestrates the other skills for an end-to-end optimization loop. It preserves every version, requires one hypothesis before each code change, keeps dimensions and measurement settings fixed, and writes final_report.md from measured artifacts.
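The loop's rules (every version kept, one hypothesis per change, fixed measurement settings) can be sketched as a small record structure. This is an illustrative model only; the class names `Settings`, `Iteration`, and `LoopLog` and their fields are hypothetical and do not describe how kernel-loop stores its artifacts.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Settings:
    """Dimensions and measurement settings, frozen so they cannot drift."""
    shape: tuple
    warmup: int
    iters: int


@dataclass
class Iteration:
    version: str
    hypothesis: str
    latency_ms: float
    correct: bool


@dataclass
class LoopLog:
    settings: Settings
    iterations: list = field(default_factory=list)

    def record(self, version, hypothesis, latency_ms, correct):
        """Append a version; refuse changes made without a hypothesis."""
        if not hypothesis.strip():
            raise ValueError("each code change needs a stated hypothesis")
        self.iterations.append(Iteration(version, hypothesis, latency_ms, correct))

    def best(self):
        """Fastest version that passed correctness, or None if none did."""
        valid = [it for it in self.iterations if it.correct]
        return min(valid, key=lambda it: it.latency_ms) if valid else None


# Hypothetical run: every version is kept, including the incorrect one.
log = LoopLog(Settings(shape=(1, 16, 128), warmup=10, iters=100))
log.record("v0", "baseline measurement", 1.00, True)
log.record("v1", "vectorized loads reduce memory transactions", 0.80, True)
log.record("v2", "more unrolling hides latency", 0.50, False)
print(log.best().version)
```

Keeping failed versions in the log, rather than discarding them, is what lets a final report explain why a faster but incorrect variant was rejected.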

Layout

skills/
├── kernel-KBS/
│   ├── SKILL.md
│   ├── requirements.txt
│   ├── references/
│   ├── scripts/
│   └── store/
├── kernel-benchmark/
│   ├── SKILL.md
│   ├── requirements.txt
│   └── scripts/
├── kernel-loop/
│   ├── SKILL.md
│   └── references/
└── kernel-profile/
    ├── SKILL.md
    ├── requirements.txt
    ├── env/
    ├── reference/
    └── scripts/

Install

Claude Code plugin:

/plugin marketplace add fmh66/kernel-opt-agent
/plugin install kernel-opt-agent@fmh66

Codex plugin marketplace:

/plugin marketplace add fmh66/kernel-opt-agent
/plugin install kernel-opt-agent@fmh66

This repository includes both .claude-plugin/plugin.json and .codex-plugin/plugin.json, so the same GitHub marketplace URL works for both tools.

Directly from this repository:

python3 install_skills.py --target all        # Claude Code + Codex
python3 install_skills.py --target claude     # Claude Code only
python3 install_skills.py --target codex      # Codex only

By default, the installer installs kernel-KBS, kernel-benchmark, kernel-profile, and kernel-loop into user-level skill directories. Use --scope project to install into this repository's .claude/skills or .codex/skills directories instead.
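The difference between the two scopes can be pictured with a short sketch that resolves destination directories. The project-level paths follow the description above; the user-level paths (`~/.claude/skills`, `~/.codex/skills`) are an assumption, and `skill_destinations` is a hypothetical helper, not the logic in `install_skills.py`.

```python
from pathlib import Path

# Per-tool configuration directory names, as named in the project-scope text.
TOOL_DIRS = {"claude": ".claude", "codex": ".codex"}


def skill_destinations(target, scope, skills, repo_root=".", home=None):
    """Return the directory each skill would be installed into.

    target: "claude", "codex", or "all"
    scope:  "user" (assumed to live under the home directory) or "project"
    """
    tools = list(TOOL_DIRS) if target == "all" else [target]
    if scope == "project":
        base = Path(repo_root)
    else:
        base = Path(home) if home is not None else Path.home()
    return [base / TOOL_DIRS[tool] / "skills" / skill
            for tool in tools for skill in skills]


for dest in skill_destinations("all", "project", ["kernel-loop"]):
    print(dest)
```

Running the real installer with `--dry-run` is the reliable way to see where files would actually be written.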

Useful installer options:

python3 install_skills.py --dry-run
python3 install_skills.py --force
python3 install_skills.py --mode symlink
python3 install_skills.py --target codex --skill kernel-benchmark
python3 install_skills.py --target all --all-skills

Dependencies

Each skill owns its Python dependencies in its own requirements.txt:

python3 -m pip install -r skills/kernel-KBS/requirements.txt
python3 -m pip install -r skills/kernel-benchmark/requirements.txt
python3 -m pip install -r skills/kernel-profile/requirements.txt

GPU runtime dependencies are environment-specific. Benchmarking and profiling may require PyTorch with CUDA support, NVIDIA drivers/runtime, nvcc, Nsight Compute / ncu, nsight-python, Triton, CuTe DSL, or CUTLASS headers depending on the implementation being measured. Run kernel-profile's environment check before profiling.
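Before running the skill's own environment check, a quick availability probe can be sketched with the standard library alone. The tool and module lists below are examples drawn from the dependencies named above, and `check_environment` is a hypothetical helper; it does not reproduce what `env_check.py` verifies.

```python
import importlib.util
import shutil


def check_environment(tools=("nvcc", "ncu"), modules=("torch", "triton")):
    """Report which command-line tools and Python modules can be found.

    Only presence is checked: a tool on PATH or an importable top-level
    module. Versions and GPU visibility are not verified here.
    """
    report = {}
    for tool in tools:
        report[f"tool:{tool}"] = shutil.which(tool) is not None
    for module in modules:
        report[f"module:{module}"] = importlib.util.find_spec(module) is not None
    return report


for name, found in check_environment().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

A missing entry here does not always block work, since the required tools depend on which kernel implementation is being measured.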

kernel-loop has no separate Python dependency file; it uses the profiling, KBS, and benchmark skills.

Typical Use

1. Use kernel-KBS to find relevant implementation patterns and source evidence.
2. Implement or revise the kernel.
3. Use kernel-benchmark to compare the custom kernel against selected PyTorch eager, torch.compile, or FlashInfer baselines, confirming correctness and speedup.
4. Use kernel-profile to collect NCU metrics and classify bottlenecks for important versions.
5. Compare measured bottlenecks with KBS guidance and iterate.

For a fully managed loop, use kernel-loop with a fixed iteration budget. It runs the same profile, evidence, hypothesis, one-change iteration, and final benchmark/report sequence as a single workflow.

Example prompt:

[$kernel-loop] help me optimize <kernel.cu>, run 5 iterations, benchmark with flashinfer, and write outputs to <output>

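See each skill's SKILL.md for workflow and command details:

skills/kernel-KBS/SKILL.md
skills/kernel-benchmark/SKILL.md
skills/kernel-profile/SKILL.md
skills/kernel-loop/SKILL.md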
