This repository provides four Claude Code / Codex skills for GPU kernel work:
- kernel-KBS: a read/query knowledge base for CUDA, Triton, CuTe DSL, CUTLASS, and Ampere/Hopper/Blackwell kernel research.
- kernel-benchmark: a standalone benchmark workflow for comparing CUDA-C++, CUTLASS, CuTe DSL, or Triton kernels against PyTorch eager, torch.compile, and FlashInfer references.
- kernel-profile: a local profiling workflow for environment checks, correctness validation, Nsight Compute metrics, and bottleneck diagnosis.
- kernel-loop: an iterative optimization orchestrator that chains profiling, KBS-guided hypotheses, one-change kernel iterations, final benchmarking, and reports.

| Skill | Purpose | Main entry points |
|---|---|---|
| kernel-KBS | Search evidence-backed kernel knowledge from PRs, docs, blogs, contests, curated wiki pages, code artifacts, and provenance records. | skills/kernel-KBS/scripts/kbs.py |
| kernel-benchmark | Compare custom CUDA-C++, CUTLASS, CuTe DSL, or Triton kernels against PyTorch eager, torch.compile, and FlashInfer baselines for correctness and latency. | skills/kernel-benchmark/scripts/benchmark.py |
| kernel-profile | Validate and profile concrete CUDA-C++, CUTLASS, CuTe DSL, or Triton kernels, then classify bottlenecks from NCU evidence. | skills/kernel-profile/env/scripts/env_check.py, skills/kernel-profile/env/scripts/enc_config.py, skills/kernel-profile/scripts/correctness_check.py, skills/kernel-profile/scripts/ncu_profile.py |
| kernel-loop | Run a fixed-iteration optimization loop with environment checks, correctness, NCU profiling, bottleneck classification, KBS evidence, one-variable hypotheses, final benchmark, and report. | skills/kernel-loop/SKILL.md, skills/kernel-loop/references/hypothesis.md, skills/kernel-loop/references/report_template.md |

kernel-KBS is read-only by default and should be used for retrieval and source-backed implementation ideas. It does not run kernels or collect performance data.
kernel-benchmark runs standalone benchmarks and writes benchmark.md, including correctness results, timing statistics, and speedups versus selected baselines. It uses KernelBench-style CUDA event timing by default.
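
To make the timing approach concrete, here is a minimal sketch of CUDA event timing with PyTorch, plus the kind of statistics and speedup a benchmark report might contain. It is not the code in benchmark.py: the function names, the warmup and iteration counts, and the chosen statistics are all assumptions made for illustration. Only `torch.cuda.Event`, `record`, `elapsed_time`, and `torch.cuda.synchronize` are real PyTorch APIs.

```python
import statistics


def time_with_cuda_events(fn, warmup=10, iters=100):
    """Return per-call latencies in milliseconds measured with CUDA events.

    Hypothetical helper: requires PyTorch with CUDA support and a GPU.
    """
    import torch  # imported lazily so the helpers below work without a GPU

    for _ in range(warmup):
        fn()  # warm up caches, JIT compilation, and clocks
    torch.cuda.synchronize()

    times_ms = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()  # wait until both events have completed
        times_ms.append(start.elapsed_time(end))
    return times_ms


def summarize(times_ms):
    """Reduce raw latencies to a few summary statistics (assumed set)."""
    return {
        "mean_ms": statistics.fmean(times_ms),
        "median_ms": statistics.median(times_ms),
        "min_ms": min(times_ms),
        "max_ms": max(times_ms),
    }


def speedup(baseline_ms, custom_ms):
    """Speedup of the custom kernel over a baseline (> 1.0 means faster)."""
    return baseline_ms / custom_ms
```

On a CUDA machine, `summarize(time_with_cuda_events(lambda: my_kernel(x)))` would produce one row of such a report, where `my_kernel` and `x` stand for whatever kernel and input are being measured.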
kernel-profile runs local checks and profiling. It produces artifacts such as env_check.md, correctness.md, ncu_summary.md, and ncu_details.md.
kernel-loop orchestrates the other skills for an end-to-end optimization loop. It preserves every version, requires one hypothesis before each code change, keeps dimensions and measurement settings fixed, and writes final_report.md from measured artifacts.
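
The loop rules described here (every version preserved, one hypothesis per change, a fixed iteration budget) can be sketched as a small bookkeeping structure. This is only an illustration of the constraints, not how kernel-loop is implemented; the class names, fields, and example latencies are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Version:
    index: int
    hypothesis: str    # the single change this version tests
    latency_ms: float  # measured under fixed dimensions and settings


@dataclass
class OptimizationLoop:
    max_iterations: int
    baseline_ms: float
    versions: list = field(default_factory=list)

    def __post_init__(self):
        # Version 0 is the unmodified kernel; it is never overwritten.
        self.versions.append(Version(0, "baseline (no change)", self.baseline_ms))

    def record(self, hypothesis, latency_ms):
        """Store a new version; refuse changes without a stated hypothesis."""
        if not hypothesis.strip():
            raise ValueError("each code change needs one stated hypothesis")
        if len(self.versions) > self.max_iterations:
            raise RuntimeError("iteration budget exhausted")
        version = Version(len(self.versions), hypothesis, latency_ms)
        self.versions.append(version)  # append only: history is preserved
        return version

    def best(self):
        """Return the fastest recorded version, which may be the baseline."""
        return min(self.versions, key=lambda v: v.latency_ms)


# Made-up latencies, purely to exercise the bookkeeping.
loop = OptimizationLoop(max_iterations=2, baseline_ms=1.00)
loop.record("vectorize global loads", 0.80)
loop.record("increase tile size", 0.90)
```

Keeping regressions such as the second iteration in the history lets the final report show which hypotheses failed as well as which one won.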
skills/
├── kernel-KBS/
│   ├── SKILL.md
│   ├── requirements.txt
│   ├── references/
│   ├── scripts/
│   └── store/
├── kernel-benchmark/
│   ├── SKILL.md
│   ├── requirements.txt
│   └── scripts/
├── kernel-loop/
│   ├── SKILL.md
│   └── references/
└── kernel-profile/
    ├── SKILL.md
    ├── requirements.txt
    ├── env/
    ├── reference/
    └── scripts/
Claude Code plugin:
/plugin marketplace add fmh66/kernel-opt-agent
/plugin install kernel-opt-agent@fmh66
Codex plugin marketplace:
/plugin marketplace add fmh66/kernel-opt-agent
/plugin install kernel-opt-agent@fmh66
This repository includes both .claude-plugin/plugin.json and .codex-plugin/plugin.json, so the same GitHub marketplace URL works for both tools.
Directly from this repository:
python3 install_skills.py --target all # Claude Code + Codex
python3 install_skills.py --target claude # Claude Code only
python3 install_skills.py --target codex # Codex only

By default, the installer installs kernel-KBS, kernel-benchmark, kernel-profile, and kernel-loop into user-level skill directories. Use --scope project to install into this repository's .claude/skills or .codex/skills directories instead.
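
As a rough illustration of how the target and scope options combine, the sketch below resolves the destination directories. It is not taken from install_skills.py. The project-level paths come from the text above, while the user-level locations (`~/.claude/skills` and `~/.codex/skills`) and the function name are assumptions.

```python
from pathlib import Path

# Per-tool configuration directory names (project-level names are from the
# README; reusing them under the home directory is an assumption).
TOOL_DIRS = {"claude": ".claude", "codex": ".codex"}


def skill_dirs(target="all", scope="user", project_root="."):
    """Return the skill directories an install would write to (hypothetical)."""
    if target != "all" and target not in TOOL_DIRS:
        raise ValueError(f"unknown target: {target!r}")
    if scope not in ("user", "project"):
        raise ValueError(f"unknown scope: {scope!r}")

    tools = list(TOOL_DIRS) if target == "all" else [target]
    base = Path.home() if scope == "user" else Path(project_root)
    return [base / TOOL_DIRS[tool] / "skills" for tool in tools]
```

For example, `skill_dirs("codex", "project", "/repo")` yields a single path ending in `.codex/skills` under `/repo`, and `skill_dirs("all")` yields one user-level directory per tool.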
Useful installer options:
python3 install_skills.py --dry-run
python3 install_skills.py --force
python3 install_skills.py --mode symlink
python3 install_skills.py --target codex --skill kernel-benchmark
python3 install_skills.py --target all --all-skills

Each skill owns its Python dependencies in its own requirements.txt:
python3 -m pip install -r skills/kernel-KBS/requirements.txt
python3 -m pip install -r skills/kernel-benchmark/requirements.txt
python3 -m pip install -r skills/kernel-profile/requirements.txt

GPU runtime dependencies are environment-specific. Benchmarking and profiling may require PyTorch with CUDA support, NVIDIA drivers/runtime, nvcc, Nsight Compute / ncu, nsight-python, Triton, CuTe DSL, or CUTLASS headers depending on the implementation being measured. Run kernel-profile's environment check before profiling.
kernel-loop has no separate Python dependency file; it uses the profiling, KBS, and benchmark skills.
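
For a quick sanity check before running the real environment check, a minimal stand-alone sketch like the following reports which command-line tools and Python modules are visible. It is not kernel-profile's env_check.py and checks far less. The function name and the default tool and module lists are assumptions; only `shutil.which` and `importlib.util.find_spec` from the standard library are used.

```python
import importlib.util
import shutil


def check_environment(tools=("nvcc", "ncu"), modules=("torch", "triton")):
    """Report which CLI tools are on PATH and which modules are importable.

    Hypothetical helper: presence only, no version or GPU capability checks.
    Module names must be top-level (no dots).
    """
    return {
        "tools": {tool: shutil.which(tool) is not None for tool in tools},
        "modules": {
            name: importlib.util.find_spec(name) is not None for name in modules
        },
    }


if __name__ == "__main__":
    report = check_environment()
    for kind, entries in report.items():
        for name, found in entries.items():
            print(f"{kind[:-1]:6s} {name:8s} {'found' if found else 'MISSING'}")
```

A missing entry does not always matter: as noted above, which dependencies are needed depends on the implementation being measured.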
- Use kernel-KBS to find relevant implementation patterns and source evidence.
- Implement or revise the kernel.
- Use kernel-benchmark to compare the custom kernel against selected PyTorch eager, torch.compile, or FlashInfer baselines, confirming correctness and speedup.
- Use kernel-profile to collect NCU metrics and classify bottlenecks for important versions.
- Compare measured bottlenecks with KBS guidance and iterate.

For a fully managed loop, use kernel-loop with a fixed iteration budget. It runs the same profile, evidence, hypothesis, one-change iteration, and final benchmark/report sequence as a single workflow.
Example prompt:
[$kernel-loop] help me optimize <kernel.cu>, run 5 iterations, benchmark with flashinfer, and write outputs to <output>
See each skill's SKILL.md for workflow and command details: