Q8 Matmul MMA+ Kernels Implementation #23
Open
shalinib-ibm wants to merge 5 commits into
Add -mcpu=future under the P9 condition in CMakeLists.txt. Add 16x16 kernel with P12 MMA+ builtins. Modify 8x8 kernel to use P12 MMA+ builtins. Modify 8x4 kernel to use P12 MMA+ builtins. TODO: Create a separate class for Q8, as P12 MMA+ only supports INT8 and not INT4. Signed-off-by: root <root@trout-lp1.rch.stglabs.ibm.com>
Also, print input matrices for the 8x4 kernel. Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
Add 8x8, 8x16, 16x8, and 16x16 kernels. Debug: print accumulator content. Testcase: make m, n, k CLI arguments. Signed-off-by: root <root@trout-lp1.rch.stglabs.ibm.com>
Overview
This branch contains matmul kernels using P12 MMA+ for the Q8 data type.
Additional information
Currently, the kernels below have been implemented and tested for functional correctness.
8x4
8x8
8x16
16x8
16x16
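For reference, the computation each of these kernels performs can be sketched as a tiled int8 matmul that accumulates an MxN output tile in int32, which is the same shape of work the MMA+ builtins do per accumulator. This is an illustrative scalar sketch for checking functional correctness, not the PR's actual kernel code; the function and parameter names are assumptions.

```cpp
#include <cstdint>

// Scalar reference for a tiled Q8 (int8) matmul, C = A * B.
// A is m x k (row-major int8), B is k x n (row-major int8),
// C is m x n (row-major int32). TM x TN is the output tile size,
// analogous to one set of MMA+ accumulators (e.g. 8x8 or 16x16).
template <int TM, int TN>
void tiled_q8_matmul_ref(const int8_t *A, const int8_t *B, int32_t *C,
                         int m, int n, int k) {
    for (int i0 = 0; i0 < m; i0 += TM) {
        for (int j0 = 0; j0 < n; j0 += TN) {
            // One TM x TN accumulator tile, kept in int32 to avoid overflow.
            int32_t acc[TM][TN] = {};
            for (int p = 0; p < k; ++p) {
                for (int i = 0; i < TM && i0 + i < m; ++i) {
                    for (int j = 0; j < TN && j0 + j < n; ++j) {
                        acc[i][j] += (int32_t)A[(i0 + i) * k + p] *
                                     (int32_t)B[p * n + (j0 + j)];
                    }
                }
            }
            // Write the finished tile back to C.
            for (int i = 0; i < TM && i0 + i < m; ++i)
                for (int j = 0; j < TN && j0 + j < n; ++j)
                    C[(i0 + i) * n + (j0 + j)] = acc[i][j];
        }
    }
}
```

Comparing an MMA+ kernel's output tile-by-tile against a reference like this is a simple way to verify functional correctness for each of the tile shapes listed above.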
Steps to run and verify the kernels
Building llama.cpp with the MMA+ GCC compiler.
GCC needs to be built from the ibm/mmaplus branch, and llama.cpp needs to be built with -O0 (since -O3 with the MMA+ compiler gives functionally incorrect matrix multiplication results).
export PATH=/home/shalini/gcc-mma+/gcc/install/mmaplus-ppc64le/bin:$PATH
cd /home/shalini/llama_5_3_26/llama.cpp/
cmake -B build-gcc-mma+ -DCMAKE_C_COMPILER=/home/shalini/gcc-mma+/gcc/install/mmaplus-ppc64le/bin/gcc -DCMAKE_CXX_COMPILER=/home/shalini/gcc-mma+/gcc/install/mmaplus-ppc64le/bin/g++ -DCMAKE_C_FLAGS="-O0" -DCMAKE_CXX_FLAGS="-O0"
vim ggml/src/ggml-cpu/CMakeLists.txt -> change -mcpu=power10 to -mcpu=future
export LD_LIBRARY_PATH=/home/shalini/gcc-mma+/gcc/install/mmaplus-ppc64le/bin/:$LD_LIBRARY_PATH
cmake --build build-gcc-mma+/ -j10
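The CMakeLists.txt edit in the steps above amounts to swapping the target CPU flag, roughly as follows. This is an illustrative fragment only: the actual condition and variable names in ggml/src/ggml-cpu/CMakeLists.txt vary between llama.cpp versions.

```cmake
# Illustrative fragment -- adapt to the actual ppc64 branch of the file.
if (CMAKE_SYSTEM_PROCESSOR MATCHES "ppc64")
    # was: list(APPEND ARCH_FLAGS -mcpu=power10)
    list(APPEND ARCH_FLAGS -mcpu=future)   # target the MMA+ (P12) ISA
endif()
```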
Follow the above steps on the trout-lp1 machine, copy the libraries to the bohr machine, and run the binary from the Tcl script to simulate in the Mambo P12 environment.