Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.
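As a quick orientation for the WMMA API these repositories build on, here is a minimal sketch of a tensor-core HGEMM tile. It is illustrative only, not code from any listed repo, and it assumes M, N, K are multiples of 16, a row-major A, and a column-major B:

```cuda
// Minimal WMMA HGEMM sketch: each warp computes one 16x16 tile of C.
// Assumes M, N, K % 16 == 0, A row-major (M x K), B col-major (K x N).
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_hgemm(const half* A, const half* B, half* C,
                           int M, int N, int K) {
    // Map each warp to one 16x16 output tile.
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;
    wmma::fill_fragment(c_frag, __float2half(0.0f));

    // Walk the K dimension 16 elements at a time, accumulating on tensor cores.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + warpM * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, B + warpN * 16 * K + k, K);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, c_frag,
                            N, wmma::mem_row_major);
}
```

The MMA PTX route the same repositories explore replaces these intrinsics with hand-written `mma.sync` instructions for finer control over fragment layout and scheduling.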
Updated Sep 8, 2024 - CUDA
A CUDA programming practice project: hands-on CUDA kernels and performance optimization, covering GEMM, FlashAttention, Tensor Cores, CUTLASS, quantization, KV cache, NCCL, and profiling.
Benchmarks the performance of the C++ interfaces of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
Multiple GEMM operators are constructed with cutlass to support LLM inference.
FastCuda is a handwritten CUDA operator library featuring progressive GEMM and Reduce kernels, cuBLAS benchmarking, and C/C++/Python interfaces for learning, profiling, and performance optimization.
Use tensor core to calculate back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instruction.
Codes for DTC-SpMM (ASPLOS'24)
CUDA kernel optimization lab for GEMM, FlashAttention, quantization, and GPU performance learning.
CUDA kernels for LLM inference: FlashAttention forward, Tensor Core GEMM, PyTorch bindings, and benchmarkable reference implementations.
The lab assignments from CS4302 Parallel and Distributed Programming (2022 Fall) with my solutions
Systematic CUDA kernel engineering from SGEMM fundamentals to reusable kernels, advanced optimization experiments, and lightweight inference components. https://lessup.github.io/cuda-kernel-academy/
A reproducible GPU benchmarking lab that compares FP16 vs FP32 training on MNIST using PyTorch, CuPy, and Nsight profiling tools. The project blends performance engineering with cinematic storytelling: NVTX-tagged training loops, fused CuPy kernels, and a profiler-driven README that narrates the GPU's inner workings frame by frame.
Header-only C++/CUDA AI kernel library with OpenSpec-driven workflow and readable implementations of GEMM, attention, normalization, convolution, sparse ops, and Python bindings.
🎬 Explore GPU training efficiency with FP32 vs FP16 in this modular lab, utilizing Tensor Core acceleration for deep learning insights.
CUDA-native C++ Transformer inference engine with W8A16 quantization, KV cache management, and hand-tuned kernels.