Q8 Matmul MMA+ Kernels Implementation #23
Open
shalinib-ibm wants to merge 5 commits into
Add -mcpu=future under the P9 condition in CMakeLists.txt. Add 16x16 kernel with P12 MMA+ builtins. Modify 8x8 kernel to use P12 MMA+ builtins. Modify 8x4 kernel to use P12 MMA+ builtins. TODO: Create a separate class for Q8, as P12 MMA+ only supports INT8 and not INT4. Signed-off-by: root <root@trout-lp1.rch.stglabs.ibm.com>
Also, print input matrices for the 8x4 kernel. Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
Add 8x8, 8x16, 16x8, and 16x16 kernels. Debug: print accumulator content. Testcase: make m, n, k CLI arguments. Signed-off-by: root <root@trout-lp1.rch.stglabs.ibm.com>
Overview
This branch contains matmul kernels using P12 MMA+ for the Q8 data type.
Additional information
Currently, the kernels below have been implemented and tested for functional correctness.
8x4
8x8
8x16
16x8
16x16
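For reference, the computation each of these kernels performs can be sketched as a tiled int8 matmul that accumulates an MxN output tile in int32, which is the same shape of work the MMA+ builtins do per accumulator. This is an illustrative scalar sketch for checking functional correctness, not the PR's actual kernel code; the function and parameter names are assumptions.

```cpp
#include <cstdint>

// Scalar reference for a tiled Q8 (int8) matmul, C = A * B.
// A is m x k (row-major int8), B is k x n (row-major int8),
// C is m x n (row-major int32). TM x TN is the output tile size,
// analogous to one set of MMA+ accumulators (e.g. 8x8 or 16x16).
template <int TM, int TN>
void tiled_q8_matmul_ref(const int8_t *A, const int8_t *B, int32_t *C,
                         int m, int n, int k) {
    for (int i0 = 0; i0 < m; i0 += TM) {
        for (int j0 = 0; j0 < n; j0 += TN) {
            // One TM x TN accumulator tile, kept in int32 to avoid overflow.
            int32_t acc[TM][TN] = {};
            for (int p = 0; p < k; ++p) {
                for (int i = 0; i < TM && i0 + i < m; ++i) {
                    for (int j = 0; j < TN && j0 + j < n; ++j) {
                        acc[i][j] += (int32_t)A[(i0 + i) * k + p] *
                                     (int32_t)B[p * n + (j0 + j)];
                    }
                }
            }
            // Write the finished tile back to C.
            for (int i = 0; i < TM && i0 + i < m; ++i)
                for (int j = 0; j < TN && j0 + j < n; ++j)
                    C[(i0 + i) * n + (j0 + j)] = acc[i][j];
        }
    }
}
```

Comparing an MMA+ kernel's output tile-by-tile against a reference like this is a simple way to verify functional correctness for each of the tile shapes listed above.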
Steps to run and verify the kernels
Building llama.cpp with the MMA+ GCC compiler.
GCC needs to be built from the ibm/mmaplus branch, and llama.cpp needs to be built with -O0 (since -O3 with the MMA+ compiler gives functionally incorrect matrix multiplication results).
export PATH=/home/shalini/gcc-mma+/gcc/install/mmaplus-ppc64le/bin:$PATH
cd /home/shalini/llama_5_3_26/llama.cpp/
cmake -B build-gcc-mma+ -DCMAKE_C_COMPILER=/home/shalini/gcc-mma+/gcc/install/mmaplus-ppc64le/bin/gcc -DCMAKE_CXX_COMPILER=/home/shalini/gcc-mma+/gcc/install/mmaplus-ppc64le/bin/g++ -DCMAKE_C_FLAGS="-O0" -DCMAKE_CXX_FLAGS="-O0"
vim ggml/src/ggml-cpu/CMakeLists.txt -> change -mcpu=power10 to -mcpu=future
export LD_LIBRARY_PATH=/home/shalini/gcc-mma+/gcc/install/mmaplus-ppc64le/bin/:$LD_LIBRARY_PATH
cmake --build build-gcc-mma+/ -j10
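The CMakeLists.txt edit in the steps above amounts to swapping the target CPU flag, roughly as follows. This is an illustrative fragment only: the actual condition and variable names in ggml/src/ggml-cpu/CMakeLists.txt vary between llama.cpp versions.

```cmake
# Illustrative fragment -- adapt to the actual ppc64 branch of the file.
if (CMAKE_SYSTEM_PROCESSOR MATCHES "ppc64")
    # was: list(APPEND ARCH_FLAGS -mcpu=power10)
    list(APPEND ARCH_FLAGS -mcpu=future)   # target the MMA+ (P12) ISA
endif()
```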
Follow the above steps on the trout-lp1 machine, copy the libraries to the bohr machine, and run the binary from the Tcl script to simulate in the Mambo P12 environment.