Description
CPU impl
1. TODO
SHA256-SIMD version (Lei Hao)
Benchmark:
Correctness (compared with the standard library)
Performance: SHA256 vs SHA256-SIMD vs BLAKE3 vs BLAKE3-threading vs BLAKE3-threading-SIMD
2. WIP
BLAKE3 SIMD version (AVX2 instruction)
Threading & SIMD
Tip: the workload is compute-bound, so use as many threads as there are CPU cores
3. Done
SHA256 basic impl
BLAKE3 basic impl
BLAKE3 multithreading
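The correctness benchmark listed above (compare against the standard library) could be sketched roughly as follows. This is a minimal illustration in Python, not the project's code: `sha256_basic` is a hypothetical stand-in for the basic SHA256 implementation, and `check_against_stdlib` is an assumed helper name. The round constants are derived from primes rather than typed in by hand, to avoid transcription errors.

```python
import hashlib
import math

MASK = 0xFFFFFFFF  # keep all arithmetic in 32 bits


def _primes(n):
    """Return the first n primes by trial division."""
    found, candidate = [], 2
    while len(found) < n:
        if all(candidate % p for p in found):
            found.append(candidate)
        candidate += 1
    return found


def _icbrt(x):
    """Floor of the integer cube root, by binary search."""
    lo, hi = 0, 1 << ((x.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= x:
            lo = mid
        else:
            hi = mid - 1
    return lo


# Fractional bits of cube roots (K) and square roots (H0) of the first primes.
K = [_icbrt(p << 96) & MASK for p in _primes(64)]
H0 = [math.isqrt(p << 64) & MASK for p in _primes(8)]


def _rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def sha256_basic(data: bytes) -> str:
    """Hypothetical stand-in for the project's basic SHA256 implementation."""
    msg = data + b"\x80"
    msg += b"\x00" * ((56 - len(msg) % 64) % 64)
    msg += (len(data) * 8).to_bytes(8, "big")
    h = list(H0)
    for off in range(0, len(msg), 64):
        # Message schedule: 16 input words expanded to 64.
        w = [int.from_bytes(msg[off + 4 * i: off + 4 * i + 4], "big") for i in range(16)]
        for i in range(16, 64):
            s0 = _rotr(w[i - 15], 7) ^ _rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
            s1 = _rotr(w[i - 2], 17) ^ _rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
            w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
        a, b, c, d, e, f, g, hh = h
        for i in range(64):
            big_s1 = _rotr(e, 6) ^ _rotr(e, 11) ^ _rotr(e, 25)
            ch = (e & f) ^ (~e & g)
            t1 = (hh + big_s1 + ch + K[i] + w[i]) & MASK
            big_s0 = _rotr(a, 2) ^ _rotr(a, 13) ^ _rotr(a, 22)
            maj = (a & b) ^ (a & c) ^ (b & c)
            t2 = (big_s0 + maj) & MASK
            a, b, c, d, e, f, g, hh = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
        h = [(x + y) & MASK for x, y in zip(h, (a, b, c, d, e, f, g, hh))]
    return "".join(f"{x:08x}" for x in h)


def check_against_stdlib(impl, inputs):
    """Return the inputs on which impl disagrees with hashlib.sha256."""
    return [m for m in inputs if impl(m) != hashlib.sha256(m).hexdigest()]


# Lengths around the 55/56/64-byte padding boundaries are the usual trouble spots.
EDGE_CASES = [b"x" * n for n in (0, 1, 55, 56, 63, 64, 65, 1000)]
mismatches = check_against_stdlib(sha256_basic, EDGE_CASES)
```

An empty `mismatches` list means the implementation agrees with the standard library on every tested input. The same harness shape would apply to the SIMD variant once it exists.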
GPU impl
1. TODO
SM80's cp.async to reduce pipeline bubbles
Support SM90 arch
Performance benchmarking
Different kernel versions across architectures: (SM70, SM80, SM90) x (v1, v2, v3)
Latest kernel performance across architectures: (SM70, SM80, SM90) x (latest_version)
2. WIP
SM80's cp.async to reduce pipeline bubbles
3. Done
Basic kernel impl
Coalesced memory access + staging pipeline
Stage 1: Coalesced loading from gmem
Stage 2: Compress chunks to roots, then merge into one warp-level cv
Stage 3: Block reduce, yielding one block-level cv
Parallel computing logic: use a 16-lane sub-warp for chunk compression instead of leaving multiple lanes inactive
Improved computation throughput from 67% to 70%
Use CuTe layouts for gmem and smem to simplify data loading (Stage 1)
Debugged the basic GPU computing logic to make sure the output is correct
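The three-stage reduction in the Done list above can be modeled on the host to reason about its shape. The sketch below is a Python illustration only, under loudly stated assumptions: BLAKE2s from the standard library stands in for the BLAKE3 compression function (so the digests are not BLAKE3 digests), `chunks_per_warp` is a hypothetical parameter, and the chunk count is assumed to be a power of two so that level-by-level pairwise merging forms a balanced tree.

```python
import hashlib

CHUNK_LEN = 1024  # BLAKE3 chunk size in bytes


def chunk_cv(chunk: bytes) -> bytes:
    """Stand-in for chunk compression (assumption: BLAKE2s, not real BLAKE3)."""
    return hashlib.blake2s(chunk, digest_size=32).digest()


def parent_cv(left: bytes, right: bytes) -> bytes:
    """Stand-in for merging two chaining values into their parent."""
    return hashlib.blake2s(left + right, digest_size=32).digest()


def tree_reduce(cvs):
    """Pairwise-merge a power-of-two list of chaining values down to one."""
    assert cvs and len(cvs) & (len(cvs) - 1) == 0, "power-of-two count assumed"
    while len(cvs) > 1:
        cvs = [parent_cv(cvs[i], cvs[i + 1]) for i in range(0, len(cvs), 2)]
    return cvs[0]


def staged_hash(data: bytes, chunks_per_warp: int = 4) -> bytes:
    # Stage 1: split the input into chunks (models the coalesced loads from gmem).
    chunks = [data[i:i + CHUNK_LEN] for i in range(0, len(data), CHUNK_LEN)]
    # Stage 2: each "warp" compresses its chunks and merges them into one warp-level cv.
    warp_cvs = [
        tree_reduce([chunk_cv(c) for c in chunks[i:i + chunks_per_warp]])
        for i in range(0, len(chunks), chunks_per_warp)
    ]
    # Stage 3: block reduce over the warp-level cvs, yielding one block-level cv.
    return tree_reduce(warp_cvs)


def flat_hash(data: bytes) -> bytes:
    """Reference: a single pairwise reduction over all chunk cvs."""
    chunks = [data[i:i + CHUNK_LEN] for i in range(0, len(data), CHUNK_LEN)]
    return tree_reduce([chunk_cv(c) for c in chunks])


data = bytes(range(256)) * 64  # 16 KiB, i.e. 16 chunks
block_cv = staged_hash(data, chunks_per_warp=4)
```

Because each warp owns a contiguous, power-of-two run of chunks, the staged result equals the flat reduction. A host-side model like this may be handy as a reference when checking kernel output, once the stand-in functions are replaced by the real compression routine.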
Other TODO
Demo-video
Report (Overleaf via email)