Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.
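As a quick orientation for the WMMA API these repositories build on, here is a minimal sketch of a tensor-core HGEMM tile. It is illustrative only, not code from any listed repo, and it assumes M, N, K are multiples of 16, a row-major A, and a column-major B:

```cuda
// Minimal WMMA HGEMM sketch: each warp computes one 16x16 tile of C.
// Assumes M, N, K % 16 == 0, A row-major (M x K), B col-major (K x N).
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_hgemm(const half* A, const half* B, half* C,
                           int M, int N, int K) {
    // Map each warp to one 16x16 output tile.
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;
    wmma::fill_fragment(c_frag, __float2half(0.0f));

    // Walk the K dimension 16 elements at a time, accumulating on tensor cores.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + warpM * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, B + warpN * 16 * K + k, K);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, c_frag,
                            N, wmma::mem_row_major);
}
```

The MMA PTX route the same repositories explore replaces these intrinsics with hand-written `mma.sync` instructions for finer control over fragment layout and scheduling.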
Updated Sep 8, 2024 - CUDA
A CUDA programming practice project: hands-on CUDA kernels and performance optimization, covering GEMM, FlashAttention, Tensor Cores, CUTLASS, quantization, KV cache, NCCL, and profiling.
Benchmarks the performance of the C++ interfaces of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
Multiple GEMM operators are constructed with cutlass to support LLM inference.
FastCuda is a handwritten CUDA operator library featuring progressive GEMM and Reduce kernels, cuBLAS benchmarking, and C/C++/Python interfaces for learning, profiling, and performance optimization.
Use tensor core to calculate back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instruction.
Codes for DTC-SpMM (ASPLOS'24)
CUDA kernel optimization lab for GEMM, FlashAttention, quantization, and GPU performance learning.
CUDA kernels for LLM inference: FlashAttention forward, Tensor Core GEMM, PyTorch bindings, and benchmarkable reference implementations.
The lab assignments from CS4302 Parallel and Distributed Programming (2022 Fall) with my solutions
Systematic CUDA kernel engineering from SGEMM fundamentals to reusable kernels, advanced optimization experiments, and lightweight inference components. https://lessup.github.io/cuda-kernel-academy/
A reproducible GPU benchmarking lab that compares FP16 vs FP32 training on MNIST using PyTorch, CuPy, and Nsight profiling tools. The project blends performance engineering with cinematic storytelling: NVTX-tagged training loops, fused CuPy kernels, and a profiler-driven README that narrates the GPU's inner workings frame by frame.
Header-only C++/CUDA AI kernel library with OpenSpec-driven workflow and readable implementations of GEMM, attention, normalization, convolution, sparse ops, and Python bindings.
🎬 Explore GPU training efficiency with FP32 vs FP16 in this modular lab, utilizing Tensor Core acceleration for deep learning insights.
CUDA-native C++ Transformer inference engine with W8A16 quantization, KV cache management, and hand-tuned kernels.