|
| 1 | +## GUPS Benchmark |
| 2 | + |
| 3 | +### How to build the benchmark |
| 4 | +Build with the provided Makefile using the following options: |
| 5 | + |
| 6 | +`GPU_ARCH=xx`, where `xx` is the Compute Capability (CC) of the device(s) being tested (default: 80 90). You can look up the CC of a specific GPU in the tables [here](https://developer.nvidia.com/cuda-gpus#compute). The generated executable (called `gups`) supports both global memory GUPS and shared memory GUPS modes. Global memory mode is the default. See the next section for the runtime option that switches between modes. |
| 7 | + |
| 8 | +Notes on shared memory GUPS: |
| 9 | +1. For shared memory GUPS, unless dynamic allocation is forced (see below), only CC 80 and CC 90 are supported; for any other CC, the shared memory GUPS code falls back to dynamic allocation mode. |
| 10 | +2. To force dynamic shared memory allocation, build with `DYNAMIC_SHMEM=`. This is NOT recommended and will result in incorrect shared memory GUPS numbers because the kernel becomes instruction bound. |
| 11 | + |
| 12 | +For example, `make GPU_ARCH="70 80" DYNAMIC_SHMEM=` builds the executable `gups`, which supports global memory GUPS and shared memory GUPS with dynamic shared memory allocation, for both CC 70 (e.g., NVIDIA V100 GPU) and CC 80 (e.g., NVIDIA A100 GPU). |
| 13 | + |
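When builds are scripted, the two Makefile options described above can be assembled programmatically. The following is a minimal sketch, not part of the repository: the helper name `make_command` is hypothetical, and it assumes `make` is invoked from the directory containing the Makefile.

```python
import shlex


def make_command(gpu_archs=(80, 90), dynamic_shmem=False):
    """Return the argv list for building `gups` (hypothetical helper)."""
    # GPU_ARCH takes a space-separated list of compute capabilities.
    argv = ["make", "GPU_ARCH=" + " ".join(str(cc) for cc in gpu_archs)]
    if dynamic_shmem:
        # The README forces dynamic shared memory by passing `DYNAMIC_SHMEM=`.
        argv.append("DYNAMIC_SHMEM=")
    return argv


if __name__ == "__main__":
    # Print a shell-quoted command line; pass the list to subprocess.run to build.
    print(shlex.join(make_command((70, 80), dynamic_shmem=True)))
```

Passing the returned list to `subprocess.run(..., check=True)` would run the build without any shell quoting concerns, since the space inside `GPU_ARCH` stays within one argument.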
| 14 | +### How to run the benchmark |
| 15 | +Besides GUPS (updates (loop)), the benchmark code supports other random access tests: reads, writes, reads+writes, and updates (no loop). |
| 16 | +Choose the benchmark type with the `-t` runtime option. You may need to fine-tune the accesses-per-element option (`-a`) to achieve the best performance. |
| 17 | +Correctness verification is available only for the default updates (loop) test. |
| 18 | + |
| 19 | +Use `./gups -h` to list the runtime arguments. |
| 20 | +``` |
| 21 | +Usage: |
| 22 | + -n <int> input data size = 2^n [default: 29] |
| 23 | + -o <int> occupancy percentage, 100/occupancy how much larger the working set is compared to the requested bytes [default: 100] |
| 24 | + -r <int> number of kernel repetitions [default: 1] |
| 25 | + -a <int> number of random accesses per input element [default: 32 (r, w) or 8 (u, unl, rw) for gmem, 65536 for shmem] |
| 26 | + -t <int> test type (0 - update (u), 1 - read (r), 2 - write (w), 3 - read write (rw), 4 - update no loop (unl)) [default: 0] |
| 27 | + -d <int> device ID to use [default: 0] |
| 28 | + -s <int> enable input in shared memory instead of global memory for shared memory GUPS benchmark if s>=0. The benchmark will use max available shared memory if s=0 (for ideal GUPS conditions this must be done at compile time, check README.md for build options). This tool does allow setting the shmem data size with = 2^s (for s>0), however this will also result in an instruction bound kernel that fails to reach hardware limitations of GUPS. [default: -1 (disabled)] |
| 29 | +``` |
| 30 | + |
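The runtime flags listed in the help output above can be combined into a command line from a script. This is a minimal sketch under stated assumptions: the helper `gups_command` and the `TEST_TYPES` table are hypothetical names, the binary is assumed to live at `./gups`, and only flags shown in the `-h` output are used. When `-a` or `-s` is omitted, the benchmark's own defaults apply.

```python
# Test-type codes as listed in the `./gups -h` output.
TEST_TYPES = {"u": 0, "r": 1, "w": 2, "rw": 3, "unl": 4}


def gups_command(test="u", log2_size=29, accesses=None, device=0,
                 repeats=1, shmem=None, binary="./gups"):
    """Return the argv list for one benchmark run (hypothetical helper)."""
    argv = [binary,
            "-n", str(log2_size),          # input data size = 2^n
            "-t", str(TEST_TYPES[test]),   # test type code
            "-d", str(device),             # device ID
            "-r", str(repeats)]            # kernel repetitions
    if accesses is not None:
        # Random accesses per input element; omit to keep the default.
        argv += ["-a", str(accesses)]
    if shmem is not None:
        # s >= 0 enables shared memory GUPS; s = 0 uses max available shared memory.
        argv += ["-s", str(shmem)]
    return argv


if __name__ == "__main__":
    # Print one command per test type; use subprocess.run(cmd, check=True) to execute.
    for name in TEST_TYPES:
        print(" ".join(gups_command(name)))
```

Keeping the mapping in one table makes it straightforward to sweep `-a` values for a given test type when tuning performance.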
| 31 | +You can also use the provided Python script to run multiple tests with a single command and produce a CSV report. By default, the script runs all the random access tests. Run `python run.py --help` for usage options. |
| 32 | +``` |
| 33 | +usage: run.py [-h] [--device-id DEVICE_ID] |
| 34 | + [--input-size-begin INPUT_SIZE_BEGIN] |
| 35 | + [--input-size-end INPUT_SIZE_END] [--occupancy OCCUPANCY] |
| 36 | + [--repeats REPEATS] |
| 37 | + [--test {reads,writes,reads_writes,updates,updates_no_loop,all}] |
| 38 | + [--memory-loc {global,shared}] |
| 39 | + |
| 40 | +Benchmark GUPS. Store results in results.csv file. |
| 41 | + |
| 42 | +optional arguments: |
| 43 | + -h, --help show this help message and exit |
| 44 | + --device-id DEVICE_ID |
| 45 | + GPU ID to run the test |
| 46 | + --input-size-begin INPUT_SIZE_BEGIN |
| 47 | + exponent of the input data size begin range, base is 2 |
| 48 | + (input size = 2^n). [Default: 29 for global GUPS, |
| 49 | + max_shmem for shared GUPS. Global/shared is controlled |
| 50 | + by --memory-loc |
| 51 | + --input-size-end INPUT_SIZE_END |
| 52 | + exponent of the input data size end range, base is 2 |
| 53 | + (input size = 2^n). [Default: 29 for global GUPS, |
| 54 | + max_shmem for shared GUPS. Global/shared is controlled |
| 55 | + by --memory-loc |
| 56 | + --occupancy OCCUPANCY |
| 57 | + 100/occupancy is how much larger the working set is |
| 58 | + compared to the requested bytes |
| 59 | + --repeats REPEATS number of kernel repetitions |
| 60 | + --test {reads,writes,reads_writes,updates,updates_no_loop,all} |
| 61 | + test to run |
| 62 | + --memory-loc {global,shared} |
| 63 | + memory buffer in global memory or shared memory |
| 64 | +``` |
| 65 | + |
| 66 | +### LICENSE |
| 67 | + |
| 68 | +`gups.cu` is a modified version of the `randomaccess.cu` file from the [cuda_randomaccess GitHub repository](https://github.com/nattoheaven/cuda_randomaccess). The LICENSE file of that repository is preserved as `LICENSE.gups.cu`. |
| 69 | + |
| 70 | +`run.py` and `Makefile` are implemented from scratch by NVIDIA. For the license information of these two files, please refer to the `LICENSE` file. |