Note: This is a prototype implementation created in the course of a university project.
This project implements a decision tree learner with SIMD instructions.
The learner is based on a regression model and employs variance reduction to find the best data split (see also: https://en.wikipedia.org/wiki/Decision_tree_learning).
diabetes_prediction_dataset.csv: This dataset contains 100,000 samples and is used for training.
diabetes_prediction_dataset.pred.csv: This dataset contains 50 samples (a subset of the former dataset) and is used for value prediction.
The dataset describes medical data and health conditions and was initially intended for predicting whether a person has diabetes. Since this project uses a regression model, the dataset is instead used to predict a person's age from the remaining attributes.
The model demonstrably learns from the data, but prediction accuracy is modest. This is likely because the dataset was not designed for this task: predicting a person's age from relatively sparse medical data is considerably harder than predicting whether a person has diabetes. However, the primary goal was not model performance but solid parallelization using SIMD instructions, which the project achieves.
The initial implementation of this algorithm used a naive approach to calculating variance reduction, which was computationally heavy; its timings can still be found in the performance analysis below. The improved version reuses computations from earlier iterations of the algorithm. The mathematical justification for this is given at the end of this document.
C definitions
There are several preprocessor definitions that can be set at compile time to configure the project:
MAX_TREE_DEPTH: Defines the maximum tree depth of the decision tree (to avoid overfitting).
MAX_TRAIN_ITEMS: The maximum number of training samples to use (to limit computation time).
WITH_DEBUG: Outputs debug information to STDOUT.
WITH_VARIANCE_PRINT: Print intermediate information about variance calculation.
WITH_SIMD: Use SIMD instructions (if not defined, sequential versions of functions are used).
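A minimal sketch of how these switches might gate the code (illustrative only; the actual source may differ, and the default of 10 for MAX_TREE_DEPTH is an assumption):

```c
/* Illustrative sketch of the compile-time configuration switches.
   The default value 10 is assumed here, not taken from the project. */

#ifndef MAX_TREE_DEPTH
#define MAX_TREE_DEPTH 10 /* assumed default to avoid overfitting */
#endif

/* Report the configured maximum tree depth. */
static inline int configured_max_depth(void) { return MAX_TREE_DEPTH; }

/* Select the split-search implementation at compile time. */
static inline const char *split_mode(void) {
#ifdef WITH_SIMD
    return "simd"; /* vectorized split search */
#else
    return "seq";  /* sequential fallback */
#endif
}
```

Passing e.g. -DWITH_SIMD -DMAX_TREE_DEPTH=8 to the compiler would then select the SIMD path with a shallower tree.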
Build
Please use the attached build.sh file and run it in bash like:
./build.sh main
main is the C file to compile. The script will output a binary main and intermediate files (e.g. main.s containing the Intel assembly code).
Run
Start the application from shell:
./dist/main
The application does not support any command line arguments and will print its result to STDOUT.
Performance analysis
System information
Architecture: x86_64
Processor: Intel 10th Gen (Intel Core i5-10310U)
Vector extensions: up to AVX2
g++: 14.2.0 on Ubuntu 25.04
Library used: Google Highway (libhwy-dev@1.2.0)
5,000 training samples
naive implementation
SIMD: 9.31 sec
SEQ: 52.92 sec
SIMD improves on SEQ by a factor of 5.68.
25,000 training samples
naive implementation
SIMD: 496.74 sec
SEQ: n/a (aborted due to long execution time)
with analytical improvements
SIMD: 6.15 sec
SEQ: 32.75 sec
SIMD improves on SEQ by a factor of 5.33.
The analytical improvements speed up the naive SIMD implementation by a factor of 80.77 (= 496.74/6.15).
100,000 training samples
naive implementation
n/a (not attempted, as training on 25,000 samples was already challenging)
with analytical improvements
SIMD: 112.58 sec
SEQ: 555.77 sec
SIMD improves on SEQ by a factor of 4.94.
Variance reduction justification
Note: A fully rigorous version of this justification would require a proof by induction.
The goal of this justification is to reuse as many computations as possible across iterations of the algorithm. Since each step of the algorithm shifts only one data point from "below" to "above" the split, the sums computed in the previous step can be updated rather than recomputed.
Definitions
$$
\begin{aligned}
S &:= \text{Set of data samples} \\
n &:= |S| \\
S^-_1 &:= S \setminus \{a\} \text{ (set without data sample } a\text{)} \\
S^+_1 &:= S \cup \{a\} \text{ (set with data sample } a\text{)} \\
A_S &:= \sum _{i\in S} y_{i}^2 \\
B_S &:= \sum _{i\in S} y_{i} \\
\end{aligned}
$$
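From these definitions, the variance of a partition and the constant-time updates when a sample $a$ enters or leaves it follow directly (a sketch consistent with the definitions above, not the full inductive proof):

$$
\begin{aligned}
\operatorname{Var}(S) &= \frac{A_S}{n} - \left(\frac{B_S}{n}\right)^2 \\
A_{S^+_1} = A_S + y_a^2, &\quad B_{S^+_1} = B_S + y_a \\
A_{S^-_1} = A_S - y_a^2, &\quad B_{S^-_1} = B_S - y_a
\end{aligned}
$$

Thus, when one sample crosses the split boundary, both partition variances can be updated in $O(1)$ instead of being recomputed from scratch.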
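Putting the definitions together, the improved split search can be sketched as follows (illustrative C, not the project's actual code; the targets ys are assumed to be already sorted by the candidate feature, and the SIMD path is omitted for clarity):

```c
#include <stddef.h>

/* Variance of a partition from its running sums: Var = A/n - (B/n)^2,
   where A is the sum of squared targets and B the sum of targets. */
static double partition_variance(double A, double B, size_t n) {
    if (n == 0) return 0.0;
    double mean = B / (double)n;
    return A / (double)n - mean * mean;
}

/* Returns the best variance reduction over all split positions.
   The sums A/B of both partitions are updated in O(1) per candidate
   split instead of being recomputed, which is the analytical
   improvement described above. */
static double best_variance_reduction(const double *ys, size_t n) {
    double A_total = 0.0, B_total = 0.0;
    for (size_t i = 0; i < n; ++i) {
        A_total += ys[i] * ys[i];
        B_total += ys[i];
    }
    double var_total = partition_variance(A_total, B_total, n);

    double A_below = 0.0, B_below = 0.0;
    double A_above = A_total, B_above = B_total;
    double best = 0.0;
    for (size_t i = 0; i + 1 < n; ++i) {
        /* shift sample i across the split boundary in O(1) */
        A_below += ys[i] * ys[i];  B_below += ys[i];
        A_above -= ys[i] * ys[i];  B_above -= ys[i];
        size_t n_below = i + 1, n_above = n - n_below;
        double vr = var_total
            - ((double)n_below / (double)n)
                * partition_variance(A_below, B_below, n_below)
            - ((double)n_above / (double)n)
                * partition_variance(A_above, B_above, n_above);
        if (vr > best) best = vr;
    }
    return best;
}
```

For example, for targets {1, 1, 1, 5, 5, 5} the best split separates the two groups exactly, reducing the total variance of 4 to 0 in both partitions, so the function returns 4.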