
Commit 56fe97c

Merge pull request #103 from raphlinus/fix_prefix_link
[prefix] Fix broken link to code
2 parents: 5cab99c + 7eade0f

1 file changed

Lines changed: 3 additions & 4 deletions

File tree

_posts/2020-04-21-prefix-sum.md

@@ -42,7 +42,7 @@ The state of the art is [decoupled look-back]. I'm not going to try to summarize

 That work is a refinement of [Parallel Prefix Sum (Scan) with CUDA](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda) from Nvidia's GPU Gems 3 book. A production-quality, open source implementation is [CUB]. Another implementation, designed to be more accessible but not as optimized, is [ModernGPU scan].

-My own [implementation] is very much a research-quality proof of concept. It exists as the [prefix] branch of the [piet-gpu] repository. Basically, I wanted to determine whether it was possible to come within a stone's throw of memcpy performance using Vulkan compute kernels. It's a fairly straightforward implementation of the decoupled look-back paper, and doesn't implement all the tricks. For example, the look-back is entirely sequential; I didn't parallelize the look-back as suggested in section 4.4 of the paper. This is probably the easiest performance win to be gotten. But it's not too horrible, as the partition size is quite big; each workgroup processes 16ki elements. Rough measurements indicate that look-back is on the order of 10-15% of the total time.
+My own [implementation] is very much a research-quality proof of concept. It exists in an archived branch of the [Vello] repository (edit from the future: formerly the prefix branch, from when it was still called piet-gpu). Basically, I wanted to determine whether it was possible to come within a stone's throw of memcpy performance using Vulkan compute kernels. It's a fairly straightforward implementation of the decoupled look-back paper, and doesn't implement all the tricks. For example, the look-back is entirely sequential; I didn't parallelize the look-back as suggested in section 4.4 of the paper. This is probably the easiest performance win to be gotten. But it's not too horrible, as the partition size is quite big; each workgroup processes 16ki elements. Rough measurements indicate that look-back is on the order of 10-15% of the total time.

 The implementation is enough of a rough prototype I don't yet want to do careful performance evaluation, but initial results are encouraging: it takes 2.05ms of GPU time to compute the prefix sum of 64Mi 32-bit unsigned integers on a GTX 1080, a rate of 31.2 billion elements/second. Since each element involves reading and writing 4 bytes, that corresponds to a raw memory bandwidth of around 262GiB/s. The theoretical memory bandwidth is listed as 320GB/s, so clearly the code is able to consume a large fraction of available memory bandwidth.

@@ -195,9 +195,8 @@ There is some interesting [HN discussion](https://news.ycombinator.com/item?id=2
 [WebGPU]: https://www.w3.org/community/gpu/
 [OpenCL 3.0]: https://www.khronos.org/news/press/khronos-group-releases-opencl-3.0
 [monoid]: https://en.wikipedia.org/wiki/Monoid
-[implementation]: https://github.com/linebender/piet-gpu/blob/prefix/piet-gpu-hal/examples/shader/prefix.comp
-[prefix]: https://github.com/linebender/piet-gpu/tree/prefix
-[piet-gpu]: https://github.com/linebender/piet-gpu
+[implementation]: https://github.com/linebender/vello/blob/custom-hal-archive-with-shaders/tests/shader/prefix.comp
+[Vello]: https://github.com/linebender/vello
 [specialization constants]: https://blogs.igalia.com/itoral/2018/03/20/improving-shader-performance-with-vulkans-specialization-constants/
 [limited forward progress guarantee]: https://www.khronos.org/blog/comparing-the-vulkan-spir-v-memory-model-to-cs#_limited_forward_progress_guarantees
 [GPU schedulers: how fair is fair enough?]: https://www.cs.princeton.edu/~ts20/files/concur2018.pdf
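The paragraph touched by this change describes a sequential look-back over predecessor partitions, per the decoupled look-back paper. As a rough illustration of that protocol, here is a single-threaded Python sketch; the flag names, partition size, and function names are illustrative assumptions, not taken from the actual implementation, which is a Vulkan compute shader:

```python
# Single-threaded sketch of decoupled look-back for an inclusive prefix sum.
# Hypothetical names; the real shader uses 16ki elements per workgroup and
# publishes flags with atomics.

PARTITION = 4  # illustrative partition size

def scan(data):
    n_parts = (len(data) + PARTITION - 1) // PARTITION
    # Per-partition state: (flag, value).
    # 'X' = nothing published, 'A' = aggregate published,
    # 'P' = inclusive prefix published.
    state = [('X', 0)] * n_parts
    out = [0] * len(data)
    for p in range(n_parts):
        chunk = data[p * PARTITION:(p + 1) * PARTITION]
        aggregate = sum(chunk)
        state[p] = ('A', aggregate)
        # Sequential look-back, as in the prototype: walk predecessors,
        # accumulating aggregates, until one has published its inclusive
        # prefix. (A concurrent version would spin on flag 'X'.)
        exclusive = 0
        q = p - 1
        while q >= 0:
            flag, value = state[q]
            exclusive += value
            if flag == 'P':
                break
            q -= 1
        # Publish this partition's inclusive prefix for later partitions.
        state[p] = ('P', exclusive + aggregate)
        # Local scan plus the exclusive prefix gives the final output.
        running = exclusive
        for i, x in enumerate(chunk):
            running += x
            out[p * PARTITION + i] = running
    return out
```

In the real concurrent setting, the point of the two-level 'A'/'P' protocol is that a partition can make progress using a predecessor's aggregate without waiting for that predecessor's own look-back to finish.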
