# ConfigFile Format:
# ==================
# A Transfer is defined as a single operation where an Executor reads and adds together
# values from source (SRC) memory locations, then writes the sum to destination (DST) memory locations.
# When there is a single SRC and a single DST, this reduces to a simple copy.
#
#   SRC 0                DST 0
#   SRC 1 -> Executor -> DST 1
#   SRC X                DST Y
# Five Executors are supported by TransferBench:
#   Executor:       SubExecutor:
#   1) CPU          CPU thread
#   2) GPU          GPU threadblock / Compute Unit (CU)
#   3) DMA          N/A (Must have single SRC, at least one DST)
#   4) NIC          Queue Pair
#   5) Batched-DMA  Batch item (Must have single SRC, at least one DST)
# Each line in the configuration file defines a set of Transfers (a Test) to run in parallel
# There are two ways to specify a Test:
# 1) Basic
#    The basic specification assumes the same number of SubExecutors (SEs) is used for each Transfer
#    A positive number of Transfers is specified, followed by that number of triplets describing each Transfer
#      #Transfers #SEs (srcMem1->Executor1->dstMem1) ... (srcMemL->ExecutorL->dstMemL)
# 2) Advanced
#    A negative number of Transfers is specified, followed by quintuplets describing each Transfer
#    A non-zero Bytes value overrides the command-line specified size for that Transfer
#      -#Transfers (srcMem1->Executor1->dstMem1 #SEs1 Bytes1) ... (srcMemL->ExecutorL->dstMemL #SEsL BytesL)
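#
# For illustration, the same GPU-to-GPU Transfer written both ways (the 64M byte count below is a
# placeholder value, not taken from the stock examples in this file):
#   1 4 (G0->G0->G1)          Basic:    one Transfer, 4 SEs, size taken from the command line
#   -1 (G0->G0->G1 4 64M)     Advanced: one Transfer, 4 SEs, 64MB regardless of the command-line size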
# Argument Details:
#   #Transfers: Number of Transfers to be run in parallel
#   #SEs      : Number of SubExecutors to use (CPU threads / GPU threadblocks)
#   srcMemL   : Source memory locations (where the data is read from)
#   Executor  : Executor is specified by a character indicating type, followed by device index (0-indexed)
#               - C   : CPU-executed          (indexed from 0 to # NUMA nodes - 1)
#               - G   : GPU-executed          (indexed from 0 to # GPUs - 1)
#               - D   : DMA-executor          (indexed from 0 to # GPUs - 1)
#               - B   : Batched-DMA-executor  (indexed from 0 to # GPUs - 1)
#               - I#.#: NIC executor          (source and destination NIC indices, each from 0 to # NICs - 1)
#               - N#.#: Nearest NIC executor  (source and destination GPU indices, each from 0 to # GPUs - 1)
#   dstMemL   : Destination memory locations (where the data is written to)
#   bytesL    : Number of bytes to copy (0 means use the command-line specified size)
#               Must be a multiple of 4 and may be suffixed with 'K', 'M', or 'G'
#
# Memory locations are specified by one or more (memory type character / device index) pairs,
# where the character indicates the memory type and the device index is 0-indexed
# Supported memory locations are:
#   - C: Pinned host memory              (on NUMA node, indexed from 0 to [# NUMA nodes - 1])
#   - P: Pinned host memory              (on the NUMA node closest to the indexed GPU, from 0 to [# GPUs - 1])
#   - B: Coherent pinned host memory     (on NUMA node, indexed from 0 to [# NUMA nodes - 1])
#   - D: Non-coherent pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes - 1])
#   - K: Uncached pinned host memory     (on NUMA node, indexed from 0 to [# NUMA nodes - 1])
#   - H: Unpinned host memory            (on NUMA node, indexed from 0 to [# NUMA nodes - 1])
#   - G: Global device memory            (on GPU device, indexed from 0 to [# GPUs - 1])
#   - F: Fine-grain device memory        (on GPU device, indexed from 0 to [# GPUs - 1])
#   - U: Uncached device memory          (on GPU device, indexed from 0 to [# GPUs - 1])
#   - N: Null memory                     (index ignored)
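#
# Memory locations may also be concatenated: for example, "G0G1" is two locations (global device
# memory on GPU 0 and on GPU 1), as used by the broadcast Test at the end of this file.
# A hypothetical reduction-style Transfer (not one of the stock examples below) that sums two
# pinned host buffers into GPU 0 memory would be:
#   1 4 (C0C1->G0->G0)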
# Examples:
#   1 4 (G0->G0->G1)                                Uses 4 CUs on GPU0 to copy from GPU0 to GPU1
#   1 4 (C1->G2->G0)                                Uses 4 CUs on GPU2 to copy from CPU1 to GPU0
#   2 4 G0->G0->G1 G1->G1->G0                       Copies from GPU0 to GPU1, and GPU1 to GPU0, each with 4 SEs
#   -2 (G0 G0 G1 4 1M) (G1 G1 G0 2 2M)              Copies 1MB from GPU0 to GPU1 with 4 SEs, and 2MB from GPU1 to GPU0 with 2 SEs
#   1 2 (F0->I0.2->F1)                              Uses 2 QPs to transfer data from GPU0 via NIC0 to GPU1 via NIC2
#   1 1 (F0->N0.1->F1)                              Uses 1 QP to transfer data from GPU0 via GPU0's closest NIC to GPU1 via GPU1's closest NIC
#   -2 (G0->N0.1->G1 2 128M) (G1->N1.0->G0 1 256M)  Uses the Nearest NIC executor to copy 128MB from GPU0 to GPU1 with 2 QPs,
#                                                   and 256MB from GPU1 to GPU0 with 1 QP
# Round brackets and arrows '->' may be included for human clarity, but are ignored and unnecessary
# Lines starting with # will be ignored. Lines starting with ## will be echoed to output
## Single GPU-executed Transfer between GPUs 0 and 1 using 4 CUs
1 4 (G0->G0->G1)
## Single DMA-executed Transfer between GPUs 0 and 1
1 1 (G0->D0->G1)
## Copy 1MB from GPU0 to GPU1 with 4 CUs, and 2MB from GPU1 to GPU0 with 8 CUs
-2 (G0->G0->G1 4 1M) (G1->G1->G0 8 2M)
## "Memset" by GPU 0 to GPU 0 memory
1 32 (N0->G0->G0)
## "Read-only" by CPU 0
1 4 (C0->C0->N0)
## Broadcast from GPU 0 to GPU 0 and GPU 1
1 16 (G0->G0->G0G1)
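# Illustrative sketch (commented out, not one of the stock tests above): a Batched-DMA executed
# Transfer from GPU 0 to GPU 1, assuming the 'B' executor follows the same triplet form with SEs
# used as batch items
# 1 1 (G0->B0->G1)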