Roadmap

# CPU impl
## 1. TODO
1. `SHA256-SIMD version` (Lei Hao)
2. Benchmark:
    1. Correctness (compared with the standard library)
    2. Performance: SHA256 vs SHA256-SIMD vs BLAKE3 vs BLAKE3-threading vs BLAKE3-threading-SIMD
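
The correctness item above could be checked with a small harness along these lines. This is a minimal sketch, not the project's actual test code: it assumes the implementation under test can be called as `candidate(data: bytes) -> bytes`, and it uses Python's `hashlib.sha256` as the standard-library reference. The names `find_mismatches`, `reference_sha256`, and `broken_sha256` are hypothetical. The default sizes include the lengths around SHA-256's 64-byte block and padding boundaries (55, 56, 63, 64, 65).

```python
import hashlib
import os


def find_mismatches(candidate, sizes=(0, 1, 55, 56, 63, 64, 65, 1000, 4096)):
    """Return the input lengths for which candidate(data) != hashlib.sha256(data)."""
    mismatches = []
    for size in sizes:
        data = os.urandom(size)  # random input of the given length
        expected = hashlib.sha256(data).digest()
        if candidate(data) != expected:
            mismatches.append(size)
    return mismatches


def reference_sha256(data):
    # Placeholder for the implementation under test (hypothetical wiring).
    return hashlib.sha256(data).digest()


def broken_sha256(data):
    # Deliberately wrong variant, used only to show that the harness reports failures.
    return hashlib.sha256(data + b"\x00").digest()
```

An empty list from `find_mismatches` means every tested length matched the reference; swapping `reference_sha256` for a binding to the real implementation is left as an assumption about how the project exposes it.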

## 2. WIP
1. BLAKE3 SIMD version (AVX2 instructions)
    1. Threading & SIMD
        1. Tip: the workload is compute-bound, so use as many threads as there are CPU cores
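
The thread-count tip above can be illustrated with a short sketch. It is an assumption-laden example rather than the project's code: `hash_chunks_parallel` is a hypothetical helper, the 1 MiB chunk size is arbitrary, and `hashlib.sha256` stands in for the per-chunk hash because BLAKE3 is not part of Python's standard library. Note that `os.cpu_count()` reports logical cores, which may differ from physical cores.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def hash_chunks_parallel(data, chunk_size=1 << 20, workers=None):
    """Hash fixed-size chunks of data in a pool sized to the CPU core count."""
    # For compute-bound work, extra threads beyond the core count add no throughput.
    workers = workers or os.cpu_count() or 1
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so digests line up with chunks.
        return list(pool.map(lambda c: hashlib.sha256(c).digest(), chunks))
```

Threads are a reasonable choice here because `hashlib` releases the GIL while hashing large buffers; a native implementation would size its own thread pool the same way.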

## 3. Done
1. SHA256 basic impl
2. BLAKE3 basic impl
3. BLAKE3 multithreading

# GPU impl
## 1. TODO
1. SM80's `cp.async` to reduce pipeline bubbles
2. Support SM90 arch
3. Performance benchmarking
    1. Different kernel versions across architectures: (SM70, SM80, SM90) x (v1, v2, v3)
    2. Latest kernel version across architectures: (SM70, SM80, SM90) x (latest_version)

## 2. WIP
1. SM80's `cp.async` to reduce pipeline bubbles

## 3. Done
1. Basic kernel impl
2. Coalesced memory access + staging pipeline
    1. Stage 1: Coalesced loading from `gmem`
    2. Stage 2: Compress chunks to roots and merge them into one warp-level `cv`
    3. Stage 3: Block reduce, yielding one block-level `cv`
4. Parallel compute logic: `16-lane sub-warp for chunk compressing` instead of leaving multiple lanes inactive
    1. Improved computation throughput from 67% to 70%
5. Adopted `CuTe` layouts for `gmem` and `smem` to simplify data loading (Stage 1)
6. Debugged the basic GPU compute logic and verified that the output is correct
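
The two reduction stages listed above (group-level merge, then a final reduce) can be modeled on the host with a few lines of Python. This is only an illustrative sketch under loud assumptions: `merge` is a stand-in built on `hashlib.sha256`, not BLAKE3's parent compression; the number of chaining values and the group size are assumed to be powers of two; and `tree_reduce` / `staged_reduce` are hypothetical names, not the kernel's functions. It shows why reducing in stages gives the same root as one flat pairwise reduction.

```python
import hashlib


def merge(left, right):
    # Stand-in parent function (assumption): NOT BLAKE3's parent compression.
    return hashlib.sha256(left + right).digest()


def tree_reduce(cvs):
    """Pairwise-reduce a power-of-two number of chaining values to one."""
    if not cvs or len(cvs) & (len(cvs) - 1):
        raise ValueError("expected a power-of-two number of chaining values")
    level = list(cvs)
    while len(level) > 1:
        # Merge neighbours: (0,1), (2,3), ... halving the level each pass.
        level = [merge(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def staged_reduce(cvs, group_size):
    """Reduce each group to one cv (Stage 2 analogue), then reduce the results (Stage 3 analogue)."""
    groups = [cvs[i:i + group_size] for i in range(0, len(cvs), group_size)]
    return tree_reduce([tree_reduce(group) for group in groups])
```

Because each group is an aligned subtree of the full binary tree, `staged_reduce(cvs, g)` and `tree_reduce(cvs)` agree for power-of-two sizes, which is a handy property to assert when validating a staged GPU reduction against a simple reference.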



# Other TODO
1. Demo video
2. Report (Overleaf via email)
