Note: This is a prototype implementation created in the course of a university project.
This project implements a decision tree learner with SIMD instructions.
The learner is based on a regression model and employs variance reduction to find the best data split (see also: https://en.wikipedia.org/wiki/Decision_tree_learning).
diabetes_prediction_dataset.csv: This dataset contains 100,000 samples and is used for training.
diabetes_prediction_dataset.pred.csv: This dataset contains 50 samples (a subset of the former dataset) and is used for value prediction.
The dataset describes medical data and health conditions and was initially intended for predicting whether a person has diabetes. Since this project uses a regression model, the dataset is instead used to predict a person's age from the remaining attributes.
The model demonstrably learns from the data, but prediction accuracy is modest. This is likely because the dataset was not designed for this task: predicting a person's age from relatively sparse medical data is considerably harder than predicting whether a person has diabetes. However, the primary goal was not model performance but solid parallelization using SIMD instructions, which the project achieves.
The initial implementation of this algorithm used a naive approach to calculating variance reduction, which was computationally heavy; its timings can still be found in the performance analysis below. The improved version reuses computations from earlier iterations of the algorithm. The mathematical justification for this is given at the end of this document.
C definitions
There are several preprocessor definitions that can be set at compile time to configure the project:
MAX_TREE_DEPTH: Defines the maximum tree depth of the decision tree (to avoid overfitting).
MAX_TRAIN_ITEMS: The maximum number of training samples to use (to limit computation time).
WITH_DEBUG: Outputs debug information to STDOUT.
WITH_VARIANCE_PRINT: Print intermediate information about variance calculation.
WITH_SIMD: Use SIMD instructions (if not defined, sequential versions of functions are used).
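A minimal sketch of how these switches might gate the code (illustrative only; the actual source may differ, and the default of 10 for MAX_TREE_DEPTH is an assumption):

```c
/* Illustrative sketch of the compile-time configuration switches.
   The default value 10 is assumed here, not taken from the project. */

#ifndef MAX_TREE_DEPTH
#define MAX_TREE_DEPTH 10 /* assumed default to avoid overfitting */
#endif

/* Report the configured maximum tree depth. */
static inline int configured_max_depth(void) { return MAX_TREE_DEPTH; }

/* Select the split-search implementation at compile time. */
static inline const char *split_mode(void) {
#ifdef WITH_SIMD
    return "simd"; /* vectorized split search */
#else
    return "seq";  /* sequential fallback */
#endif
}
```

Passing e.g. -DWITH_SIMD -DMAX_TREE_DEPTH=8 to the compiler would then select the SIMD path with a shallower tree.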
Build
Please use the attached build.sh file and run it in bash like:
./build.sh main
main is the C file to compile. The script will output a binary main and intermediate files (e.g. main.s containing the Intel assembly code).
Run
Start the application from shell:
./dist/main
The application does not support any command line arguments and will print its result to STDOUT.
Performance analysis
System information
Architecture: x86_64
Processor: Intel 10th Gen (Intel Core i5-10310U)
Vector extensions: up to AVX2
g++: 14.2.0 on Ubuntu 25.04
Library used: Google Highway (libhwy-dev@1.2.0)
5,000 training samples
naive implementation
SIMD: 9.31 sec
SEQ: 52.92 sec
SIMD improves on SEQ by a factor of 5.68.
25,000 training samples
naive implementation
SIMD: 496.74 sec
SEQ: n/a (aborted due to long execution time)
with analytical improvements
SIMD: 6.15 sec
SEQ: 32.75 sec
SIMD improves on SEQ by a factor of 5.33.
The analytical improvements speed up the naive SIMD implementation by a factor of 80.77 (= 496.74/6.15).
100,000 training samples
naive implementation
n/a (not attempted, as training on 25,000 samples was already challenging)
with analytical improvements
SIMD: 112.58 sec
SEQ: 555.77 sec
SIMD improves on SEQ by a factor of 4.94.
Variance reduction justification
Note: A fully rigorous version of this justification would require a proof by induction.
The goal of this justification is to reuse as many computations as possible across iterations of the algorithm. Since each step of the algorithm shifts only one data point from "below" to "above" the split, the sums computed in the previous step can be updated rather than recomputed.
Definitions
$$
\begin{aligned}
S &:= \text{Set of data samples} \\
n &:= |S| \\
S^-_1 &:= S \setminus \{a\} \text{ (set without data sample } a\text{)} \\
S^+_1 &:= S \cup \{a\} \text{ (set with data sample } a\text{)} \\
A_S &:= \sum _{i\in S} y_{i}^2 \\
B_S &:= \sum _{i\in S} y_{i} \\
\end{aligned}
$$
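From these definitions, the variance of a partition and the constant-time updates when a sample $a$ enters or leaves it follow directly (a sketch consistent with the definitions above, not the full inductive proof):

$$
\begin{aligned}
\operatorname{Var}(S) &= \frac{A_S}{n} - \left(\frac{B_S}{n}\right)^2 \\
A_{S^+_1} = A_S + y_a^2, &\quad B_{S^+_1} = B_S + y_a \\
A_{S^-_1} = A_S - y_a^2, &\quad B_{S^-_1} = B_S - y_a
\end{aligned}
$$

Thus, when one sample crosses the split boundary, both partition variances can be updated in $O(1)$ instead of being recomputed from scratch.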
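Putting the definitions together, the improved split search can be sketched as follows (illustrative C, not the project's actual code; the targets ys are assumed to be already sorted by the candidate feature, and the SIMD path is omitted for clarity):

```c
#include <stddef.h>

/* Variance of a partition from its running sums: Var = A/n - (B/n)^2,
   where A is the sum of squared targets and B the sum of targets. */
static double partition_variance(double A, double B, size_t n) {
    if (n == 0) return 0.0;
    double mean = B / (double)n;
    return A / (double)n - mean * mean;
}

/* Returns the best variance reduction over all split positions.
   The sums A/B of both partitions are updated in O(1) per candidate
   split instead of being recomputed, which is the analytical
   improvement described above. */
static double best_variance_reduction(const double *ys, size_t n) {
    double A_total = 0.0, B_total = 0.0;
    for (size_t i = 0; i < n; ++i) {
        A_total += ys[i] * ys[i];
        B_total += ys[i];
    }
    double var_total = partition_variance(A_total, B_total, n);

    double A_below = 0.0, B_below = 0.0;
    double A_above = A_total, B_above = B_total;
    double best = 0.0;
    for (size_t i = 0; i + 1 < n; ++i) {
        /* shift sample i across the split boundary in O(1) */
        A_below += ys[i] * ys[i];  B_below += ys[i];
        A_above -= ys[i] * ys[i];  B_above -= ys[i];
        size_t n_below = i + 1, n_above = n - n_below;
        double vr = var_total
            - ((double)n_below / (double)n)
                * partition_variance(A_below, B_below, n_below)
            - ((double)n_above / (double)n)
                * partition_variance(A_above, B_above, n_above);
        if (vr > best) best = vr;
    }
    return best;
}
```

For example, for targets {1, 1, 1, 5, 5, 5} the best split separates the two groups exactly, reducing the total variance of 4 to 0 in both partitions, so the function returns 4.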