docs: refresh CLAUDE.md for 2 months of changes + expr-neutral / helper-placement conventions#659

Open
jhinpan wants to merge 4 commits into
ROCm:mainfrom
jhinpan:docs/claude-md-refresh
@jhinpan jhinpan commented Jun 5, 2026

What

CLAUDE.md (agent guidance) was last refreshed on 2026-04-30 (#463); ~187 PRs have merged since. This updates the file against the current tree while keeping it scoped as agent instructions, not a repo index that needs constant kernel-by-kernel synchronization.

The refresh keeps durable guidance in CLAUDE.md: commands, maintenance guardrails, architecture footguns, and routing hints that prevent likely agent mistakes. Detailed kernel catalogs stay in the docs and the current source tree.

Two maintainer conventions added (Kernel Authoring Conventions)

  • expr/ is target-neutral. The direct child modules of python/flydsl/expr/ must stay backend-agnostic (no ROCDL/HIP imports) so that import flydsl.expr works without the FlyROCDL bindings; CI enforces this via tests/unit/test_expr_optional_rocdl.py. New target-specific code goes in the expr/rocdl/ package and is lazy-loaded from expr/__init__.py (_LAZY_MODULES), never in a new top-level expr/*.py. (PR #521, "fix(python): lazily load ROCDL expr modules")
  • Helper placement. Reuse first; don't scatter or duplicate small helpers. Shared kernel helpers → kernels/kernels_common.py; domain-shared helpers → the topical module (moe_common.py, layout_utils.py, fp8_gemm_utils.py, …); DSL numeric/type helpers → expr/utils/ and expr/numeric.py; compiler/runtime-wide helpers → flydsl/utils/. (PRs #388, "[Perf] Port mixed_moe kernel optimizations for stage1/stage2", and #448, "Reduce redundant FlyDSL numeric wrappers")
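
For readers unfamiliar with the lazy-loading pattern the first convention relies on, here is a minimal sketch using a module-level `__getattr__` (PEP 562). It is an illustration under stated assumptions, not the contents of expr/__init__.py: `make_lazy_package`, the `demo_expr` name, and the mapping values are hypothetical, and a standard-library module stands in for the target-specific package.

```python
import importlib
import types

# Hypothetical mapping: attribute name -> module to import on first access.
# A stdlib module stands in for the real target-specific package.
_LAZY_MODULES = {"rocdl": "colorsys"}


def make_lazy_package(name, lazy_modules):
    """Build a module object that imports selected submodules on first access."""
    pkg = types.ModuleType(name)

    def __getattr__(attr):
        # PEP 562: called only when normal attribute lookup on the module fails.
        target = lazy_modules.get(attr)
        if target is None:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}")
        module = importlib.import_module(target)
        setattr(pkg, attr, module)  # cache so later lookups skip __getattr__
        return module

    pkg.__getattr__ = __getattr__
    return pkg


expr = make_lazy_package("demo_expr", _LAZY_MODULES)
loaded_before = "rocdl" in vars(expr)  # nothing has been imported yet
backend = expr.rocdl                   # first access triggers the import
loaded_after = "rocdl" in vars(expr)   # the module is now cached on the package
```

Because the import happens only on attribute access, creating or importing the package itself never touches the backend module, which is the property the neutrality convention depends on.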

Staleness fixed

  • Repository Layout: expr/ direct-children-are-neutral note + expr/rocdl/ package (shadows legacy expr/rocdl.py); lib/Dialect/FlyROCDL/{CDNA3,CDNA4,GFX11,GFX1250}/ per-subtarget split; .claude/skills/; SmemAllocator → SharedAllocator note; corrected tests/python/examples/ description.
  • Kernel Entry Points: cleaned this section into routing guidance instead of a complete kernel inventory. It now tells agents where to start for paged decode, how to choose GEMM/MoE families by architecture and dtype, when to update docs/prebuilt_kernels_guide.md, and when a rule belongs in CLAUDE.md at all.
  • GPU Architecture Support: new gfx11* (RDNA3/3.5) row; documented the gfx1250 footgun — is_rdna_arch() returns False and get_warp_size() returns 64, so those kernels hardcode WAVE_SIZE = 32.
  • Build & Test / Code Style: run_benchmark.sh subset/--list/--output_csv + compare_benchmark.py; check_python_style.sh; the pre-checks.yaml Python (black+ruff) + C++ (clang-format-18) CI gate; RUN_TESTS_FULL=1 is the CI invocation.
  • Environment Variables: FLYDSL_RUNTIME_RUN_ONLY (AOT-cache-only) and FLYDSL_COMPILE_LLVM_DIR (external LLVM codegen).
  • Testing Notes: pytest.ini markers (multi_gpu, benchmark) + the label-gated 8-GPU CI lane; clarified the 2-GPU shmem, 4-GPU allreduce accuracy, and 8-GPU allreduce accuracy/benchmark split; removed the deleted torch_mha_extend2() reference.
  • Documentation Map: external bitcode integration guide.
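
The gfx1250 footgun listed under GPU Architecture Support can be illustrated with a small sketch. This is not FlyDSL code: `kernel_wave_size` is a hypothetical helper, and the reported warp size is passed in as a plain argument rather than read from get_warp_size(), whose signature is not shown in this PR. The gfx942 name is used only as an example of a wave64 target.

```python
def kernel_wave_size(arch: str, reported_warp_size: int) -> int:
    """Pick the wave size a kernel should assume for `arch` (illustrative only)."""
    if arch == "gfx1250":
        # Per the PR description, the generic helper reports 64 for this target,
        # while the kernels written for it hardcode WAVE_SIZE = 32.
        return 32
    # For other targets this sketch simply trusts the reported value.
    return reported_warp_size


wave_gfx1250 = kernel_wave_size("gfx1250", reported_warp_size=64)
wave_gfx942 = kernel_wave_size("gfx942", reported_warp_size=64)
```

The point of the sketch is only that the generic query cannot be trusted on gfx1250, so the override has to be explicit in the kernel.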

CLAUDE.md grows from 181 to 216 lines after the kernel file list is replaced with routing guidance.

🤖 Generated with Claude Code

Agent guidance was last updated 2026-04-30 (ROCm#463); ~187 PRs have landed since.
Refresh every section against the current tree and add two maintainer conventions.
Every cited path was verified file-by-file against upstream/main HEAD.

- Repository Layout: expr/ direct children are target-neutral; add expr/rocdl/
  package (shadows legacy expr/rocdl.py), lib/Dialect/FlyROCDL/ per-subtarget
  split, .claude/skills/, SmemAllocator->SharedAllocator note, fix
  tests/python/examples comment.
- Kernel Entry Points: add FP8 GEMM (4wave/8wave + utils), RDNA GEMM
  (gfx11*/gfx120*), AITER HGEMM (splitk/small_m), MoE routing
  (sorting/topk_gating + moe_common), fused-quant (qk_norm_rope_quant,
  silu_and_mul_fq), pa_decode_swa, communication shim.
- GPU Architecture Support: add gfx11* row; document the gfx1250
  is_rdna_arch()==False / get_warp_size()==64 wave32 footgun.
- Build & Test / Code Style: run_benchmark subset+CSV+compare_benchmark,
  check_python_style.sh, the pre-checks.yaml Python+C++ CI style gate.
- Environment Variables: FLYDSL_RUNTIME_RUN_ONLY, FLYDSL_COMPILE_LLVM_DIR.
- Testing Notes: pytest.ini markers (multi_gpu/benchmark) + 8-GPU CI lane;
  drop the removed torch_mha_extend2().
- Documentation Map: external bitcode integration guide.

Maintainer conventions (Kernel Authoring Conventions):
- expr/ is target-neutral: direct children must not import ROCDL/HIP bindings
  (enforced by tests/unit/test_expr_optional_rocdl.py); new target code goes in
  expr/rocdl/ and is lazy-loaded (ROCm#521).
- Helper placement: reuse-first; kernels_common.py vs topical *_common/_utils
  vs expr/utils vs flydsl/utils -- don't scatter or duplicate helpers (ROCm#388, ROCm#448).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 5, 2026 01:58

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates CLAUDE.md to better document FlyDSL’s current repo layout, kernel/arch conventions, CI/style workflows, environment variables, and benchmarking/testing guidance.

Changes:

  • Expanded repo tree and kernel authoring conventions (SharedAllocator, helper placement, ROCDL lazy-loading rules).
  • Added documentation pointers for extern bitcode integration, benchmarking workflows, CI style gate reproduction, and new env vars.
  • Clarified arch-specific behavior (RDNA vs CDNA, gfx1250 caveats) and enriched testing notes (markers, multi-GPU gating).

Comment thread CLAUDE.md
  ├── python/
  │ ├── flydsl/ # Python DSL core
- │ │ ├── expr/ # DSL expression API: primitive, typing, arith, vector, gpu, math, rocdl, buffer_ops
+ │ │ ├── expr/ # DSL expression API; direct children are TARGET-NEUTRAL (typing, primitive, gpu, derived, struct, numeric, math, vector, arith, meta, extern; + utils/)
Comment thread CLAUDE.md
bash scripts/run_tests.sh # Pytest + examples + MLIR FileCheck
RUN_TESTS_FULL=1 bash scripts/run_tests.sh # Include large_shape tests
bash scripts/run_benchmark.sh # Performance benchmarks
RUN_TESTS_FULL=1 bash scripts/run_tests.sh # Include large_shape tests; this is the CI invocation (flydsl.yaml, test-whl.yaml)
sjfeng1999 previously approved these changes Jun 5, 2026