This assignment introduces the basics of extending the GPU microarchitecture to accelerate a kernel in hardware. You will add a new RISC-V custom instruction for computing the integer dot product: VX_DOT8. You will also implement this instruction in the SimX cycle-level simulator.
VX_DOT8 calculates the dot product of two 4-element vectors of int8 integers.
Dot Product = (A1*B1 + A2*B2 + A3*B3 + A4*B4)
The instruction format is as follows:
VX_DOT8 rd, rs1, rs2
where each of the source registers rs1 and rs2 holds four int8 elements.
rs1 := {A1, A2, A3, A4}
rs2 := {B1, B2, B3, B4}
rd := destination int32 result
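The semantics above can be sketched as a plain C++ reference model, which is handy for checking results later. This is only an illustration: the name dot8_reference is hypothetical, and it assumes the first element sits in the least significant byte of each register (the result is the same for any byte order as long as both operands use the same one).

```cpp
#include <cstdint>

// Hypothetical reference model of VX_DOT8 (not part of Vortex):
// treats each 32-bit operand as four signed 8-bit lanes and
// returns the sum of the lane-wise products as an int32.
inline int32_t dot8_reference(uint32_t rs1, uint32_t rs2) {
  int32_t acc = 0;
  for (int lane = 0; lane < 4; ++lane) {
    // Extract one byte and reinterpret it as a signed 8-bit value.
    int8_t a = static_cast<int8_t>((rs1 >> (8 * lane)) & 0xFF);
    int8_t b = static_cast<int8_t>((rs2 >> (8 * lane)) & 0xFF);
    // Widen to 32 bits before multiplying so the product cannot overflow.
    acc += static_cast<int32_t>(a) * static_cast<int32_t>(b);
  }
  return acc;
}
```

The largest magnitude result is 4 * (-128) * (-128) = 65536, so the int32 destination never overflows for a single instruction.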
Use the R-Type RISC-V instruction format.
| funct7 | rs2 | rs1 | funct3 | rd | opcode |
| 7 bits | 5 bits | 5 bits | 3 bits | 5 bits | 7 bits |
where:
opcode: opcode reserved for custom instructions.
funct3 and funct7: opcode modifiers.
Use the custom extension opcode=0x0B with funct7=9 and funct3=0.
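To see how these fields combine into a 32-bit instruction word, here is a small sketch of the standard R-type bit layout (opcode in bits 6:0, rd in 11:7, funct3 in 14:12, rs1 in 19:15, rs2 in 24:20, funct7 in 31:25). The helper name encode_rtype and the register numbers in the example are illustrative assumptions, not part of the Vortex sources.

```cpp
#include <cstdint>

// Hypothetical helper: assemble a 32-bit R-type instruction word.
// Each field is masked to its width and shifted to its bit position.
constexpr uint32_t encode_rtype(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                                uint32_t funct3, uint32_t rd, uint32_t opcode) {
  return ((funct7 & 0x7F) << 25)
       | ((rs2    & 0x1F) << 20)
       | ((rs1    & 0x1F) << 15)
       | ((funct3 & 0x07) << 12)
       | ((rd     & 0x1F) << 7)
       |  (opcode & 0x7F);
}

// Example: VX_DOT8 with rd=x10, rs1=x11, rs2=x12 (registers chosen arbitrarily).
static_assert(encode_rtype(9, 12, 11, 0, 10, 0x0B) == 0x12C5850B,
              "unexpected VX_DOT8 encoding");
```

Comparing a word like this against the disassembly of your compiled kernel is one way to confirm that the intrinsic emits the encoding you intended.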
You will need to modify vx_intrinsics.h to add your new VX_DOT8 instruction.
// DOT8
inline int vx_dot8(int a, int b) {
  size_t ret;
  asm volatile (".insn r ?, ?, ?, ?, ?, ?" : "=r"(?) : "i"(?), "r"(?), "r"(?));
  return ret;
}
Read the following document to understand the .insn pseudo-instruction format: https://sourceware.org/binutils/docs/as/RISC_002dV_002dFormats.html
Implement a simple matrix multiplication GPU kernel that uses your new H/W extension.
Here is a basic C++ implementation of the kernel that uses our new VX_DOT8 instruction:
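void MatrixMultiply(int8_t A[][N], int8_t B[][N], int32_t C[][N], int N) {
  // Assumes N is a multiple of 4 (the inner loop reads elements k..k+3).
  for (int i = 0; i < N; ++i) {
    for (int j = 0; j < N; ++j) {
      C[i][j] = 0;
      for (int k = 0; k < N; k += 4) {
        // Pack 4 int8_t elements from A and B into 32-bit integers.
        // Row elements of A are contiguous, so they are loaded as one word.
        uint32_t packedA = *((int*)(A[i] + k));
        // Column elements of B are strided, so they are packed byte by byte.
        // Widen to uint32_t before shifting to avoid shifting into the sign bit.
        uint32_t packedB = (uint32_t)(uint8_t)B[k][j]
                         | ((uint32_t)(uint8_t)B[k+1][j] << 8)
                         | ((uint32_t)(uint8_t)B[k+2][j] << 16)
                         | ((uint32_t)(uint8_t)B[k+3][j] << 24);
        // Accumulate the dot product result into C[i][j]
        C[i][j] += vx_dot8(packedA, packedB);
      }
    }
  }
}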
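The byte-by-byte packing used for the column of B can be factored into a small helper. The sketch below is an assumption for illustration (pack4_int8 is a hypothetical name, not a Vortex function); it places element 0 in the least significant byte, matching the shifts in the kernel above.

```cpp
#include <cstdint>

// Hypothetical helper: pack four int8 values into one 32-bit word,
// with e0 in bits 7:0 and e3 in bits 31:24.
inline uint32_t pack4_int8(int8_t e0, int8_t e1, int8_t e2, int8_t e3) {
  // Cast through uint8_t to drop sign extension, then widen to uint32_t
  // so the shift by 24 never touches the sign bit of a signed int.
  return  static_cast<uint32_t>(static_cast<uint8_t>(e0))
       | (static_cast<uint32_t>(static_cast<uint8_t>(e1)) << 8)
       | (static_cast<uint32_t>(static_cast<uint8_t>(e2)) << 16)
       | (static_cast<uint32_t>(static_cast<uint8_t>(e3)) << 24);
}
```

With such a helper, the packedB expression would become pack4_int8(B[k][j], B[k+1][j], B[k+2][j], B[k+3][j]).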
- Clone the sgemmx test under https://github.com/vortexgpgpu/vortex/blob/master/tests/regression/sgemmx into a new folder tests/regressions/dot8.
- Set PROJECT name to dot8 in tests/regressions/dot8/Makefile.
- Update matmul_cpu in main.cpp to operate on int8_t matrices.
- Update kernel_body in tests/regressions/dot8/kernel.cpp to use vx_dot8.
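For the host-side check, an int8 reference multiply might look like the sketch below. The function name matmul_cpu_int8, its argument order, and the flat row-major layout are assumptions made for this example; the actual matmul_cpu signature in the sgemmx main.cpp may differ, so adapt it rather than copying it verbatim.

```cpp
#include <cstdint>

// Hypothetical host reference: C = A * B for N x N row-major matrices,
// with int8 inputs and int32 accumulation.
void matmul_cpu_int8(int32_t* C, const int8_t* A, const int8_t* B, uint32_t N) {
  for (uint32_t i = 0; i < N; ++i) {
    for (uint32_t j = 0; j < N; ++j) {
      int32_t acc = 0;
      for (uint32_t k = 0; k < N; ++k) {
        // Widen both operands to int32 before multiplying.
        acc += static_cast<int32_t>(A[i * N + k]) *
               static_cast<int32_t>(B[k * N + j]);
      }
      C[i * N + j] = acc;
    }
  }
}
```

Because the accumulator is int32, the device output can be compared against this reference with exact equality instead of the floating-point tolerance a float sgemm test needs.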
Modify the cycle-level simulator to implement the custom ISA extension. We recommend checking how the VX_SPLIT and VX_PRED instructions are decoded in SimX as a reference.
- Update op_string() in decode.cpp to print out the new instruction.
- Update Emulator::decode() in decode.cpp to decode the new instruction format.
case 9: {
  switch (funct3) {
  case 0: { // DOT8
    auto instr = std::allocate_shared<Instr>(instr_pool_, uuid, FUType::ALU);
    instr->setDestReg(rd, RegType::Integer);
    instr->setSrcReg(0, rs1, RegType::Integer);
    instr->setSrcReg(1, rs2, RegType::Integer);
    instr->setOpType(AluType::DOT8);
    ibuffer.push_back(instr);
  } break;
  default:
    std::abort();
  }
} break;
- Update the AluType enum in types.h to add the DOT8 type.
- Update Emulator::execute() in execute.cpp to implement the actual VX_DOT8 emulation. You will execute the new instruction on the ALU functional unit.
case AluType::DOT8: {
  for (uint32_t t = thread_start; t < num_threads; ++t) {
    if (!warp.tmask.test(t))
      continue;
    // TODO:
  }
  rd_write = true;
} break;
- Update AluUnit::tick() in func_unit.cpp to implement the timing of VX_DOT8. Assume a latency of 2 cycles for the dot-product execution.
case AluType::DOT8:
  // TODO:
  break;

Compare your new accelerated dot8 program with the existing sgemmx kernel in the regression codebase. Use N=128 with two configurations, (warps=4, threads=4) and (warps=16, threads=16), each on 1 and 4 cores. Plot the total execution cycles to observe the performance improvement.