- Overview
- Architecture
- Instance Generation Workflow
- Configuration Files
- Python Generation Script
- Generated Artifacts
- Integration with CK Profiler
- Directory Structure
- Usage
The CK Tile instance generation system provides an automated way to create optimized convolution kernel instances using the CK Builder pattern. These instances are:
- Generated from configuration files containing instance parameter strings
- Integrated with the CK Profiler for benchmarking and validation
- CK Builder (/projects/composablekernel/experimental/builder)
  - High-level C++20 interface for constructing composable kernel operations
  - Provides compile-time dispatch from builder descriptors to specialized kernel implementations
- Instance Generator (/projects/composablekernel/experimental/grouped_convolution_tile_instances)
  - Python-based code generation system
  - Parses configuration files with instance strings
  - Generates C++ wrapper files using templates
- CK Profiler Integration (projects/composablekernel/profiler)
  - Benchmarks generated instances
  - Validates correctness against reference implementations
  - Selects best-performing kernels
The CK Builder uses a builder pattern that separates:
- Signature - Defines the operation (data type, layout, direction)
- Algorithm - Specifies tile parameters and optimizations
- Instance - The compiled kernel that the Builder produces from a Signature and an Algorithm
// Example: Building a convolution instance
using Builder = ckb::ConvBuilder<SIGNATURE, ALGORITHM>;
using Instance = Builder::Instance;
auto conv = Instance{};
ckt::RunResult result = ckt::run(conv, args, inputs, outputs, stream_config);
Signatures are compile-time constants that define the operation:
constexpr auto SIGNATURE_NHWGC_FP16_FWD = ckt::ConvSignature{
    .spatial_dim = 2, // 2D convolution
    .direction = ckb::ConvDirection::FORWARD,
    .data_type = ckb::DataType::FP16,
    .accumulation_data_type = ckb::DataType::FP32,
    .input = {.config = {.layout = ckb::TensorLayout::NHWGC}},
    .weight = {.config = {.layout = ckb::TensorLayout::GKYXC}},
    .output = {.config = {.layout = ckb::TensorLayout::NHWGK}}
};
Algorithms specify tile sizes, GEMM parameters, and optimizations:
constexpr auto ALGORITHM = cku::ConvAlgorithm_Tile_GroupedConvolutionKernel{}
    .with_tile_specializations(ckb::TileConvSpecialization::DEFAULT)
    .with_tile_thread_block(ckt::TileThreadBlock{
        .tile_size = {.m = 128, .n = 128, .k = 32}
    })
    .with_tile_block_gemm(ckt::TileBlockGemm{
        .warps = {.m = 2, .n = 2, .k = 1},
        .warp_tile = {.m = 32, .n = 32, .k = 16},
        .double_smem_buffer = false,
        .num_wave_groups = 1,
        .pipeline_version = ckb::PipelineVersion::V1,
        .scheduler = ckb::PipelineScheduler::INTRAWAVE
    })
    .with_tile_transfer(ckt::TileTransfer{
        .a_scalar_per_vector = 8,
        .b_scalar_per_vector = 8,
        .c_scalar_per_vector = 8
    })
    .with_tile_optimizations(ckt::TileOptimizations{
        .num_groups_to_merge = 1,
        .split_image = false,
        .explicit_gemm = false
    });
Instance strings are defined in configuration files organized by:
- Direction: forward, backward_weight, backward_data
- Purpose: profiler (all instances), tests (limited set), compilation (empty)
- Layout & Data Type: e.g., nhwgc_fp16.conf, ndhwgc_bf16.conf
Location: configs/{direction}/{purpose}/{layout_dtype}.conf
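The location pattern above can be resolved with a small helper. This is a minimal sketch, not code taken from generate_instances.py: the function name config_path and the validation tuples are assumptions made for illustration. Only the directory pattern and the direction and purpose names come from the text above.

```python
from pathlib import PurePosixPath

# Directions and purposes as listed in this document.
DIRECTIONS = ("forward", "backward_weight", "backward_data")
PURPOSES = ("profiler", "tests", "compilation")


def config_path(direction, purpose, layout_dtype):
    """Build configs/{direction}/{purpose}/{layout_dtype}.conf (hypothetical helper)."""
    if direction not in DIRECTIONS:
        raise ValueError(f"unknown direction: {direction!r}")
    if purpose not in PURPOSES:
        raise ValueError(f"unknown purpose: {purpose!r}")
    return PurePosixPath("configs") / direction / purpose / f"{layout_dtype}.conf"


print(config_path("forward", "profiler", "nhwgc_fp16"))
# prints: configs/forward/profiler/nhwgc_fp16.conf
```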
Run generate_instances.py to parse configs and generate C++ files:
python generate_instances.py \
--mode profiler \
--direction all \
--filter_pattern convolution
For each instance, the script generates:
- Individual C++ files (one per instance)
  - Location: instances/{direction}/{config}/{instance_name}.cpp
  - Contains instance-specific kernel wrapper
- Include files (.inc headers)
  - {problem_name}.inc - Function declarations
  - {problem_name}_calls.inc - Function call invocations
- CMake integration (via CMakeLists.txt)
  - Compiles all generated instances
  - Links with profiler
CMake compiles the generated instances with:
- GPU-specific optimizations
- Target architecture (e.g., gfx942)
- C++20 standard required
Generated instances are integrated via include files in profiler headers.
Configuration files contain instance strings that define kernel parameters. The format varies by device operation type.
DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle<64, 64, 64, 32, Default, 32, 32, 2, 2, 1, 1, 1, 1, 1, 1>
Parameters (parsed by parse_fwd_instances):
1. block_size = 64 (total threads per block)
2. m_per_block = 64 (M dimension of tile)
3. n_per_block = 64 (N dimension of tile)
4. k_per_block = 32 (K dimension of tile)
5. spec = Default (specialization: Default, Filter1x1Pad0, Filter1x1Stride1Pad0, OddC, Filter3x3)
6. m_per_xdl = 32 (M dimension per XDL instruction)
7. n_per_xdl = 32 (N dimension per XDL instruction)
8. m_xdl_per_wave = 2 (XDL tiles in M per wave)
9. n_xdl_per_wave = 2 (XDL tiles in N per wave)
10. a_scalar_per_vector = 1 (vectorization for input)
11. b_scalar_per_vector = 1 (vectorization for weight)
12. c_scalar_per_vector = 1 (vectorization for output)
13-14. Optional pipeline parameters
15. Optional num_groups_to_merge
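The positional layout above can be illustrated with a short parser. This is a sketch under stated assumptions, not the body of parse_fwd_instances: the function name parse_instance_string, the returned dictionary shape, and the "extra" key are hypothetical. Only the field names and their order come from the parameter list above.

```python
import re

# Positional field names in the order listed above.
# Fields 13 to 15 are optional, so they are kept as raw strings under "extra".
FWD_FIELDS = (
    "block_size", "m_per_block", "n_per_block", "k_per_block", "spec",
    "m_per_xdl", "n_per_xdl", "m_xdl_per_wave", "n_xdl_per_wave",
    "a_scalar_per_vector", "b_scalar_per_vector", "c_scalar_per_vector",
)


def parse_instance_string(line):
    """Split 'Name<a, b, ...>' into the operation name and named fields."""
    match = re.fullmatch(r"\s*(\w+)<(.*)>\s*", line)
    if match is None:
        raise ValueError(f"not an instance string: {line!r}")
    name, body = match.groups()
    values = [value.strip() for value in body.split(",")]
    if len(values) < len(FWD_FIELDS):
        raise ValueError("too few parameters")
    params = {"device_op": name}
    for field, value in zip(FWD_FIELDS, values):
        # Every listed field except the specialization is an integer.
        params[field] = value if field == "spec" else int(value)
    params["extra"] = values[len(FWD_FIELDS):]
    return params


EXAMPLE = ("DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle"
           "<64, 64, 64, 32, Default, 32, 32, 2, 2, 1, 1, 1, 1, 1, 1>")
params = parse_instance_string(EXAMPLE)
print(params["m_per_block"], params["spec"], params["extra"])
# prints: 64 Default ['1', '1', '1']
```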
DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<256, 128, 128, 64, Default, 32, 32, 2, 2, 8, 8, 8, 1, 1, BlkGemmPipelineScheduler: Intrawave, BlkGemmPipelineVersion: v4>
Additional Parameters (V3 instances):
- BlkGemmPipelineScheduler - Intrawave or Interwave
- BlkGemmPipelineVersion - v1, v2, v3, v4, or v5
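The V3 string above mixes bare values with "Key: Value" entries. One way to separate the two kinds is sketched here. The helper name split_named_params is hypothetical, and the real parse_bwd_weight_instances may handle this differently.

```python
def split_named_params(values):
    """Separate bare values from 'Key: Value' entries (hypothetical helper)."""
    positional, named = [], {}
    for value in values:
        key, sep, val = value.partition(":")
        if sep:
            named[key.strip()] = val.strip()
        else:
            positional.append(value.strip())
    return positional, named


line = ("DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<256, 128, 128, 64, Default, "
        "32, 32, 2, 2, 8, 8, 8, 1, 1, BlkGemmPipelineScheduler: Intrawave, "
        "BlkGemmPipelineVersion: v4>")
# Take the text between the outermost angle brackets and split on commas.
body = line[line.index("<") + 1:line.rindex(">")]
positional, named = split_named_params(body.split(","))
print(named)
# prints: {'BlkGemmPipelineScheduler': 'Intrawave', 'BlkGemmPipelineVersion': 'v4'}
```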
- DEFAULT - General purpose convolution
- FILTER_1X1_PAD0 - Optimized for 1x1 filters with no padding
- FILTER_1X1_STRIDE1_PAD0 - Optimized for 1x1 filters, stride 1, no padding
- FILTER_3x3 - Optimized for 3x3 filters
- OddC - Optimized for odd channel counts
- v1 - Basic pipeline
- v2 - Enhanced pipeline with better scheduling
- v3 - Advanced pipeline optimizations
- v4 - Double shared memory buffering
- v5 - Two wave groups (2x parallelism)
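The pipeline descriptions above suggest how a version string might translate into the double_smem_buffer and num_wave_groups fields of TileBlockGemm shown earlier. The table below is one plausible reading of those descriptions and is not confirmed by the source. The names PIPELINE_TRAITS and block_gemm_fields are hypothetical, and the generator's actual mapping may differ.

```python
# Assumed mapping: only v4 enables double shared memory buffering and only v5
# uses two wave groups, following the one-line descriptions above.
PIPELINE_TRAITS = {
    "v1": {"double_smem_buffer": False, "num_wave_groups": 1},
    "v2": {"double_smem_buffer": False, "num_wave_groups": 1},
    "v3": {"double_smem_buffer": False, "num_wave_groups": 1},
    "v4": {"double_smem_buffer": True, "num_wave_groups": 1},
    "v5": {"double_smem_buffer": False, "num_wave_groups": 2},
}


def block_gemm_fields(version):
    """Return the assumed TileBlockGemm settings for a pipeline version string."""
    key = version.strip().lower()
    if key not in PIPELINE_TRAITS:
        raise ValueError(f"unknown pipeline version: {version!r}")
    return {"pipeline_version": key.upper(), **PIPELINE_TRAITS[key]}


print(block_gemm_fields("v4"))
```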
- parse_fwd_instances(instances, problem_name)
  - Parses forward convolution instance strings
  - Extracts tile sizes, GEMM parameters, specializations
  - Returns list of ConvInstanceTemplateParams objects
- parse_bwd_weight_instances(instances, problem_name)
  - Parses backward weight convolution instance strings
  - Handles V1, V3, and TwoStage variants
  - Extracts pipeline scheduler and version parameters
- parse_bwd_data_instances(instances, problem_name)
  - Placeholder for backward data parsing (not yet implemented)
- generate_conv_cpp(instances, problem_name, config, direction, signature_name, filter_pattern)
  - Generates individual C++ wrapper files from template
  - One file per instance
- generate_defs_inc(instances, problem_name, signature, direction, filter_pattern)
  - Generates function declarations (.inc file)
  - Used by profiler to call instances
- generate_calls_inc(instances, problem_name, direction, filter_pattern)
  - Generates function call statements (.inc file)
  - Invokes each instance in profiler benchmark loop
Template: instances/grouped_convolution_tile.cpp.in
Placeholders:
- gen_signature → Signature constant name
- gen_instance_name → Unique instance function name
- gen_specialization → Tile specialization enum
- gen_thread_block → Thread block configuration
- gen_block_gemm_desc → Block GEMM descriptor
- gen_block_transfer → Transfer parameters
- gen_optimizations → Optimization settings
Generated Output: instances/{direction}/{config}/{instance_name}.cpp
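Placeholder substitution of this kind can be sketched in a few lines. The template text below is invented for illustration and is not the content of grouped_convolution_tile.cpp.in. The sketch also assumes that placeholders appear as bare names in the template, which the source does not state. The helper name render_template is hypothetical.

```python
def render_template(template, values):
    """Replace each placeholder name with its generated text."""
    rendered = template
    for placeholder, text in values.items():
        if placeholder not in rendered:
            raise KeyError(f"placeholder not found in template: {placeholder}")
        rendered = rendered.replace(placeholder, text)
    return rendered


# Invented two-line template that uses two of the placeholders listed above.
TEMPLATE = (
    "constexpr auto SIGNATURE = gen_signature;\n"
    "void gen_instance_name();\n"
)

print(render_template(TEMPLATE, {
    "gen_signature": "SIGNATURE_NHWGC_FP16_FWD",
    "gen_instance_name": "run_grouped_convolution_forward_tile_nhwgc_fp16_0",
}))
```

Plain string replacement is safe here only because none of the listed placeholder names is a prefix of another. A template with overlapping names would need explicit delimiters.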
python generate_instances.py \
--mode {compilation|tests|profiler} \
--direction {forward|backward_weight|backward_data|all} \
--filter_pattern {pattern}
Modes:
- compilation - Empty instance list (compile-time check only)
- tests - Limited instances for testing
- profiler - All instances for benchmarking
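The command-line interface above maps naturally onto argparse. The following is a minimal sketch of such a parser, not the script's actual argument handling. The flag names and choices come from the usage line above. Treating --mode and --direction as required and defaulting --filter_pattern to None are assumptions.

```python
import argparse


def build_arg_parser():
    """Mirror the documented flags; required/default settings are assumptions."""
    parser = argparse.ArgumentParser(description="Generate CK Tile convolution instances")
    parser.add_argument("--mode", required=True,
                        choices=["compilation", "tests", "profiler"])
    parser.add_argument("--direction", required=True,
                        choices=["forward", "backward_weight", "backward_data", "all"])
    parser.add_argument("--filter_pattern", default=None)
    return parser


args = build_arg_parser().parse_args(["--mode", "tests", "--direction", "forward"])
print(args.mode, args.direction, args.filter_pattern)
# prints: tests forward None
```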
instances/
├── forward/
│ ├── nhwgc_fp16/
│ │ ├── grouped_convolution_forward_tile_nhwgc_fp16_0.cpp
│ │ ├── grouped_convolution_forward_tile_nhwgc_fp16_1.cpp
│ │ └── ...
│ ├── grouped_convolution_forward_tile_nhwgc_fp16.inc
│ └── grouped_convolution_forward_tile_nhwgc_fp16_calls.inc
├── backward_weight/
│ ├── nhwgc_bf16/
│ │ └── ...
│ └── ...
├── instance_includes.inc # Shared headers and signatures
└── instance_run.inc # Shared instance execution logic
- Instance Implementation (.cpp)
    // grouped_convolution_forward_tile_nhwgc_fp16_0.cpp
    #include "../../instance_includes.inc"
    namespace ck_tile::builder::profiling {
    constexpr auto SIGNATURE = SIGNATURE_NHWGC_FP16_FWD;
    std::tuple<bool, float, std::string> run_grouped_convolution_forward_tile_nhwgc_fp16_0(
        const ckt::Args<SIGNATURE>& args,
        const ckt::Inputs<SIGNATURE>& inputs,
        const ckt::Outputs<SIGNATURE>& outputs,
        const ck_tile::stream_config& s_conf)
    {
        constexpr auto ALGORITHM = /* ... */;
        #include "../../instance_run.inc"
    }
    }
- Function Declarations (.inc)
    // grouped_convolution_forward_tile_nhwgc_fp16.inc
    std::tuple<bool, float, std::string> run_grouped_convolution_forward_tile_nhwgc_fp16_0(...);
    std::tuple<bool, float, std::string> run_grouped_convolution_forward_tile_nhwgc_fp16_1(...);
    // ...
- Function Calls (_calls.inc)
    // grouped_convolution_forward_tile_nhwgc_fp16_calls.inc
    run_alg(run_grouped_convolution_forward_tile_nhwgc_fp16_0);
    run_alg(run_grouped_convolution_forward_tile_nhwgc_fp16_1);
    // ...
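The two include files follow directly from the list of instance function names. The sketch below shows how such text could be produced. It is not the body of generate_defs_inc or generate_calls_inc, and the helper names are hypothetical. The parameter list is passed in as text and defaults to the "..." shown in the excerpt above, because the full declaration is elided there.

```python
RETURN_TYPE = "std::tuple<bool, float, std::string>"


def function_names(problem_name, count):
    """Name instances run_{problem_name}_{index}, as in the files above."""
    return [f"run_{problem_name}_{index}" for index in range(count)]


def declarations_inc(names, params="..."):
    """One declaration per instance, for {problem_name}.inc."""
    return "".join(f"{RETURN_TYPE} {name}({params});\n" for name in names)


def calls_inc(names):
    """One run_alg(...) call per instance, for {problem_name}_calls.inc."""
    return "".join(f"run_alg({name});\n" for name in names)


names = function_names("grouped_convolution_forward_tile_nhwgc_fp16", 2)
print(declarations_inc(names), end="")
print(calls_inc(names), end="")
```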
The profiler header grouped_convolution_forward_tile_algs.hpp orchestrates the benchmarking of all CK Tile instances.
- Include Generated Instances
    #include "../../experimental/grouped_convolution_tile_instances/grouped_convolution_forward_tile_nhwgc_fp32.inc"
    #include "../../experimental/grouped_convolution_tile_instances/grouped_convolution_forward_tile_nhwgc_bf16.inc"
    #include "../../experimental/grouped_convolution_tile_instances/grouped_convolution_forward_tile_nhwgc_fp16.inc"
    // ... more includes
- Benchmark Loop (run_grouped_conv_forward_tile_algs)
    template <auto SIGNATURE>
    std::tuple<bool, float, std::string> run_grouped_conv_forward_tile_algs(
        const ckt::Args<SIGNATURE>& args,
        const ckt::Inputs<SIGNATURE>& inputs,
        const ckt::Outputs<SIGNATURE>& outputs,
        const ck_tile::stream_config& s_conf)
    {
        float best_avg_time = std::numeric_limits<float>::max();
        std::string best_op_name;
        bool valid = true;

        // Generate reference output
        auto reference = ckt::alloc_outputs(args);
        using ReferenceInstance = /* ... */;
        auto ref_conv = ReferenceInstance{};
        auto ref_result = ckt::run(ref_conv, args, inputs, reference.get());

        // Lambda to run and validate each instance
        auto run_alg = [&](auto&& run_alg_func) {
            auto [is_supported, avg_time, op_name] = run_alg_func(args, inputs, outputs, s_conf);
            if(is_supported)
            {
                best_avg_time = std::min(best_avg_time, avg_time);
                best_op_name = (best_avg_time < avg_time) ? best_op_name : op_name;
                // Validate correctness
                valid = ck_tile::check_err(outputs, reference, rtol, atol);
                std::cout << "Perf: " << avg_time << " ms, " << op_name << std::endl;
            }
        };

        // Run all instances based on signature
        if constexpr(SIGNATURE == SIGNATURE_NHWGC_FP16_FWD)
        {
            #include "../../experimental/grouped_convolution_tile_instances/grouped_convolution_forward_tile_nhwgc_fp16_calls.inc"
        }
        // ... more signature branches

        return std::make_tuple(valid, best_avg_time, best_op_name);
    }
projects/composablekernel/
├── experimental/
│ ├── builder/ # CK Builder framework
│ │ ├── include/ck_tile/builder/ # Builder API
│ │ │ ├── conv_builder.hpp # Main builder interface
│ │ │ ├── factory/ # Dispatch to kernel implementations
│ │ │ └── reflect/ # Instance traits and reflection
│ │ ├── test/ # Builder tests and utilities
│ │ └── README.md # Builder documentation
│ │
│ └── grouped_convolution_tile_instances/ # Instance generation system
│ ├── generate_instances.py # Main generation script
│ ├── CMakeLists.txt # Build configuration
│ ├── README.md # Brief overview
│ │
│ ├── configs/ # Configuration files
│ │ ├── forward/
│ │ │ ├── profiler/ # All instances for profiling
│ │ │ │ ├── nhwgc_fp16.conf
│ │ │ │ ├── nhwgc_fp32.conf
│ │ │ │ ├── nhwgc_bf16.conf
│ │ │ │ ├── ndhwgc_fp16.conf
│ │ │ │ ├── ndhwgc_fp32.conf
│ │ │ │ └── ndhwgc_bf16.conf
│ │ │ └── tests/ # Limited instances for testing
│ │ ├── backward_weight/
│ │ │ └── profiler/
│ │ └── backward_data/
│ │ └── profiler/
│ │
│ └── instances/ # Generated C++ files
│ ├── instance_includes.inc # Shared headers and signatures
│ ├── instance_run.inc # Shared execution logic
│ ├── grouped_convolution_tile.cpp.in # Template file
│ │
│ ├── forward/ # Forward instances
│ │ ├── nhwgc_fp16/
│ │ │ ├── grouped_convolution_forward_tile_nhwgc_fp16_0.cpp
│ │ │ ├── grouped_convolution_forward_tile_nhwgc_fp16_1.cpp
│ │ │ └── ...
│ │ ├── grouped_convolution_forward_tile_nhwgc_fp16.inc
│ │ ├── grouped_convolution_forward_tile_nhwgc_fp16_calls.inc
│ │ └── ...
│ │
│ └── backward_weight/ # Backward weight instances
│ └── ...
│
└── profiler/
├── include/profiler/
│ ├── grouped_convolution_forward_tile_algs.hpp # Profiler integration
│ └── ...
└── src/
└── profile_grouped_conv_fwd.cpp # Main profiler entry point
cd projects/composablekernel/experimental/grouped_convolution_tile_instances
# Generate all forward, backward_weight, and backward_data instances
python generate_instances.py --mode profiler --direction all
Output:
- Generates .cpp files for all instances
- Creates .inc declaration and call files
- Ready to compile with CMake
# Generate limited forward instances from test configs
python generate_instances.py --mode tests --direction forward
# Only generate instances matching "fp16"
python generate_instances.py \
--mode profiler \
--direction forward \
--filter_pattern fp16
cd build
cmake -D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-D CMAKE_BUILD_TYPE=Release \
-D GPU_TARGETS="gfx942" \
-D CK_EXPERIMENTAL_BUILDER=ON \
-D CMAKE_CXX_STANDARD=20 \
-G Ninja \
..
ninja device_grouped_conv_fwd_tile_instances
ninja device_grouped_conv_bwd_weight_tile_instances
# Profile 2D convolution (NHWGC layout, FP16 data type)
# Arguments, in order:
#   1 0 2          data_type=FP16, layout=NHWGC, spatial_dim=2
#   1 128 128 64   G=1, N=128, K=128, C=64
#   3 3            filter: 3x3
#   28 28          input: 28x28
#   1 1            stride: 1x1
#   1 1            dilation: 1x1
#   1 1 1 1        padding: 1,1,1,1
#   1 0 1          verification, initialization, profiling
# (Comments cannot follow a trailing backslash, so they are listed here.)
./bin/ckProfiler conv fwd \
1 0 2 \
1 128 128 64 \
3 3 \
28 28 \
1 1 \
1 1 \
1 1 1 1 \
1 0 1
# Output shows performance of each instance and selects best kernel
- Edit config file: configs/forward/profiler/nhwgc_fp16.conf
    DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle_V3<256, 256, 256, 64, Default, 32, 32, 4, 4, 8, 8, 8, 1, 1, BlkGemmPipelineScheduler: Intrawave, BlkGemmPipelineVersion: v5>
- Regenerate instances:
    python generate_instances.py --mode profiler --direction forward --filter_pattern fp16
- Rebuild:
    ninja grouped_convolution_tile_instances
- Profile:
    ./bin/ckProfiler conv fwd ...
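Before adding a new instance string, it can help to check that its leading fields are mutually consistent. The sketch below assumes the usual XDL convention that each wave covers m_per_xdl * m_xdl_per_wave rows and n_per_xdl * n_xdl_per_wave columns of the block tile, and that a wavefront has 64 lanes, as on gfx942. Neither assumption is stated in this document, although every example string in it satisfies the check. The function names are hypothetical.

```python
WAVE_SIZE = 64  # assumption: 64-lane wavefronts


def waves_per_block(m_per_block, n_per_block, m_per_xdl, n_per_xdl,
                    m_xdl_per_wave, n_xdl_per_wave):
    """Return (waves along M, waves along N) needed to cover the block tile."""
    m_span = m_per_xdl * m_xdl_per_wave
    n_span = n_per_xdl * n_xdl_per_wave
    if m_per_block % m_span or n_per_block % n_span:
        raise ValueError("block tile is not a multiple of the per-wave tile")
    return m_per_block // m_span, n_per_block // n_span


def block_size_is_consistent(block_size, *tile_fields):
    """Check block_size == waves_m * waves_n * WAVE_SIZE."""
    m_waves, n_waves = waves_per_block(*tile_fields)
    return block_size == m_waves * n_waves * WAVE_SIZE


# Fields from the new string above: <256, 256, 256, 64, Default, 32, 32, 4, 4, ...>
print(waves_per_block(256, 256, 32, 32, 4, 4))
# prints: (2, 2)
print(block_size_is_consistent(256, 256, 256, 32, 32, 4, 4))
# prints: True
```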