
SIMD Decision Tree

Note: This is a prototype implementation created as part of a university project.

This project implements a decision tree learner with SIMD instructions. The learner is based on a regression model and employs variance reduction to find the best data split (see also: https://en.wikipedia.org/wiki/Decision_tree_learning).

The following dataset was used for train/test: https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset

The directory contains two CSV files:

  • diabetes_prediction_dataset.csv: Contains 100,000 samples and is used for training.
  • diabetes_prediction_dataset.pred.csv: Contains 50 samples (a subset of the former) and is used for value prediction.

The dataset describes medical data and health conditions and was originally intended for predicting whether a person has diabetes. In this project, since we use a regression model, we instead predict the person's age from the remaining data points.

The model demonstrably learns from the data, but prediction accuracy is modest. This is likely because the dataset was not designed for this use case: predicting a person's age from relatively sparse medical data is far more challenging than predicting whether a person has diabetes. However, the primary goal was not model performance but solid parallelization using SIMD instructions, which the project achieves.

Note: Due to time constraints, the model has not been validated against an established library such as sklearn [https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html].

Variance reduction

The initial implementation of this algorithm used a naive approach to calculate variance reduction, which was computationally expensive; its timings can still be found in the performance analysis below. The approach was then improved by reusing computations from earlier iterations of the algorithm. The mathematical justification is given at the end of this document.
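The reuse idea can be sketched in plain C++ (names are illustrative, not taken from the project's source): keep running sums A = Σ y² and B = Σ y for the "below" and "above" partitions and move one sample at a time, so each candidate split costs O(1) instead of a full pass. The sum of squared deviations is recovered as A − B²/n, which is algebraically the same as the A − 2ȳB + nȳ² form derived at the end of the document.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum of squared deviations from the mean, derived from the cached sums:
// Σ (y_i - ȳ)² = A - B²/n   (with A = Σ y_i², B = Σ y_i)
static double sse(double A, double B, std::size_t n) {
    return n == 0 ? 0.0 : A - B * B / static_cast<double>(n);
}

// Returns the split index (first element of the "above" side) minimizing
// the combined SSE of both partitions, for y sorted by the split feature.
std::size_t best_split(const std::vector<double>& y) {
    double A_below = 0.0, B_below = 0.0;
    double A_above = 0.0, B_above = 0.0;
    for (double v : y) { A_above += v * v; B_above += v; }

    double best = sse(A_above, B_above, y.size());
    std::size_t best_idx = 0;  // 0 means "no split"
    for (std::size_t i = 0; i + 1 < y.size(); ++i) {
        // Shift one sample from "above" to "below"; only the sums change.
        A_below += y[i] * y[i];  B_below += y[i];
        A_above -= y[i] * y[i];  B_above -= y[i];
        double total = sse(A_below, B_below, i + 1)
                     + sse(A_above, B_above, y.size() - i - 1);
        if (total < best) { best = total; best_idx = i + 1; }
    }
    return best_idx;
}
```

This sequential sketch omits the SIMD dimension entirely; it only shows why the analytical improvement removes the quadratic recomputation cost of the naive variant.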

C definitions

There are a couple of definitions that can be used at compile time to configure the project:

  • MAX_TREE_DEPTH: Defines the maximum tree depth of the decision tree (to avoid overfitting).
  • MAX_TRAIN_ITEMS: The maximum number of training items to be used (to limit computation time).
  • WITH_DEBUG: Outputs debug information to STDOUT.
  • WITH_VARIANCE_PRINT: Prints intermediate information about the variance calculation.
  • WITH_SIMD: Use SIMD instructions (if not defined, sequential versions of functions are used).
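As an illustrative sketch (not the project's actual source), such compile-time switches are typically passed to the compiler via -D flags (e.g. -DMAX_TREE_DEPTH=12 -DWITH_SIMD), with fallback defaults in the code. The default values below are assumptions, not the project's.

```cpp
#include <cstdio>

#ifndef MAX_TREE_DEPTH
#define MAX_TREE_DEPTH 8        // assumed default; limits depth to avoid overfitting
#endif

#ifndef MAX_TRAIN_ITEMS
#define MAX_TRAIN_ITEMS 100000  // assumed default; caps computation time
#endif

int tree_depth_limit() {
#ifdef WITH_SIMD
    std::puts("using SIMD code path");
#else
    // Sequential fallback is selected when WITH_SIMD is undefined.
#endif
    return MAX_TREE_DEPTH;
}
```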

Build

Please use the attached build.sh file and run it in bash:

./build.sh main

main is the C file to compile. The script outputs a binary main as well as intermediate files (e.g. main.s, containing Intel-syntax assembly).

Run

Start the application from shell:

./dist/main

The application does not support any command line arguments and will print its result to STDOUT.


Performance analysis

System information

  • Architecture: x86_64
  • Processor: Intel 10th Gen (Intel Core i5-10310U)
  • Vector extensions: up to AVX2
  • g++: 14.2.0 on Ubuntu 25.04
  • Library used: Google Highway (libhwy-dev@1.2.0)

5,000 training samples

naive implementation

  • SIMD: 9.31 sec
  • SEQ: 52.92 sec

SIMD outperforms SEQ by a factor of 5.68.

25,000 training samples

naive implementation

  • SIMD: 496.74 sec
  • SEQ: n/a (aborted due to long execution time)

with analytical improvements

  • SIMD: 6.15 sec
  • SEQ: 32.75 sec

SIMD outperforms SEQ by a factor of 5.33.

The analytical improvements speed up the naive SIMD implementation by a factor of 80.77 (= 496.74 / 6.15).

100,000 training samples

naive implementation

  • n/a (not tried, as training on 25,000 samples was already challenging)

with analytical improvements

  • SIMD: 112.58 sec
  • SEQ: 555.77 sec

SIMD outperforms SEQ by a factor of 4.94.

Variance reduction justification

Note: To make this justification formally correct, a proof by induction would be required.

The goal of this justification is to reuse as many computations as possible across iterations of the algorithm. Since each step of the algorithm shifts only one data point from "below" to "above", previous computations can be reused easily.

Definitions

$$ \begin{aligned} S &:= \text{set of data samples} \\ n &:= |S| \\ S^-_1 &:= S \setminus \{a_1\} \text{ (set without data sample } a_1 \text{)} \\ S^+_1 &:= S \cup \{a_1\} \text{ (set with data sample } a_1 \text{)} \\ A_S &:= \sum _{i\in S} y_{i}^2 \\ B_S &:= \sum _{i\in S} y_{i} \\ \end{aligned} $$

$$ \begin{aligned} A_{S^-_1} &= \sum_{i \in S} y^2_i - a^2_1 \\ A_{S^-_2} &= \sum_{i \in S} y^2_i - a^2_1 - a^2_2 \\ &= \sum_{i \in S^-_1} y^2_i - a^2_2 \\ & ... \end{aligned} $$

$$ \begin{aligned} B_{S^-_1} &= \sum_{i \in S} y_i - a_1 \\ B_{S^-_2} &= \sum_{i \in S} y_i - a_1 - a_2 \\ &= \sum_{i \in S^-_1} y_i - a_2 \\ & ... \end{aligned} $$

$$ \text{Mean: } \bar{y}_S = \frac{1}{n} \sum _{i\in S} y_{i} = \frac{B_S}{n} $$

$$ \begin{aligned} \bar{y}_{S^-_1} &= \frac{1}{n - 1} \sum _{i \in S^-_1} y_{i} \\ &= \frac{1}{n - 1} \sum _{i \in S} y_{i} - \frac{a_1}{n - 1} \\ &= \frac{n \cdot \bar{y}_{S}}{n - 1} - \frac{a_1}{n - 1} \\ &= \frac{n \cdot \frac{B_S}{n}}{n - 1} - \frac{a_1}{n - 1} \\ &= \frac{B_S - a_1}{n - 1} \\ \end{aligned} $$

$$ \begin{aligned} \bar{y}_{S^-_2} &= \frac{1}{n - 2} \sum _{i \in S^-_2} y_{i} \\ &= \frac{1}{n - 2} \sum _{i \in S^-_1} y_{i} - \frac{a_2}{n - 2} \\ &= \frac{(n - 1) \cdot \bar{y}_{S^-_1}}{n - 2} - \frac{a_2}{n - 2} \\ &= \frac{(n - 1) \cdot \frac{B_{S^-_1}}{n - 1}}{n - 2} - \frac{a_2}{n - 2} \\ &= \frac{B_{S^-_1} - a_2}{n - 2} \\ \end{aligned} $$

$$ \begin{aligned} \bar{y}_{S^-_3} &= \text{...} \\ &= \frac{B_{S^-_2} - a_3}{n - 3} \\ \end{aligned} $$

$$ \begin{aligned} \bar{y}_{S^+_1} &= \frac{1}{n + 1} \sum _{i \in S^+_1} y_{i} \\ &= \frac{1}{n + 1} \sum _{i \in S} y_{i} + \frac{a_1}{n + 1} \\ &= \frac{n \cdot \bar{y}_{S}}{n + 1} + \frac{a_1}{n + 1} \\ &= \frac{n \cdot \frac{B_S}{n}}{n + 1} + \frac{a_1}{n + 1} \\ &= \frac{B_S + a_1}{n + 1} \\ \end{aligned} $$

$$ \begin{aligned} \bar{y}_{S^+_2} &= \frac{1}{n + 2} \sum _{i \in S^+_2} y_{i} \\ &= \frac{1}{n + 2} \sum _{i \in S^+_1} y_{i} + \frac{a_2}{n + 2} \\ &= \frac{(n + 1) \cdot \bar{y}_{S^+_1}}{n + 2} + \frac{a_2}{n + 2} \\ &= \frac{(n + 1) \cdot \frac{B_{S^+_1}}{n + 1}}{n + 2} + \frac{a_2}{n + 2} \\ &= \frac{B_{S^+_1} + a_2}{n + 2} \\ \end{aligned} $$

$$ \begin{aligned} \bar{y}_{S^+_3} &= \text{...} \\ &= \frac{B_{S^+_2} + a_3}{n + 3} \\ \end{aligned} $$

Mean squared error (MSE)

$$ \begin{aligned} \sum _{i\in S}(y_{i} - \bar{y}_S)^{2} &= \sum _{i\in S} y_{i}^2 - 2y_{i}\bar{y}_S + \bar{y}_S^2 \\ &= \sum _{i\in S} y_{i}^2 - \sum _{i\in S} 2y_{i}\bar{y}_S + \sum _{i\in S} \bar{y}_S^2 \\ &= A_S - \left[ \sum _{i\in S} 2y_{i}\bar{y}_S \right] + n \cdot \bar{y}_S^2 \\ &= A_S - 2\bar{y}_S \left[ \sum _{i\in S} y_{i} \right] + n \cdot \bar{y}_S^2 \\ &= A_S - 2\bar{y}_S B_S + n \cdot \bar{y}_S^2 \end{aligned} $$

$$ \begin{aligned} \sum _{i\in S^-_1}(y_{i} - \bar{y}_{S^-_1})^{2} &= \left[ \sum _{i\in S}(y_{i} - \bar{y}_{S^-_1})^{2} \right] - (a_1 - \bar{y}_{S^-_1})^{2} \\ &= \left[ \sum _{i\in S}y_{i}^2 - 2 y_{i} \bar{y}_{S^-_1} + \bar{y}_{S^-_1}^{2} \right] - (a_1 - \bar{y}_{S^-_1})^{2} \\ &= \left[ \sum _{i\in S}y_{i}^2 \right] - \left[ \sum _{i\in S} 2 y_{i} \bar{y}_{S^-_1} \right] + \left[ \sum _{i\in S}\bar{y}_{S^-_1}^{2} \right] - (a_1 - \bar{y}_{S^-_1})^{2} \\ &= A_S - 2\bar{y}_{S^-_1} \left[ \sum _{i\in S} y_{i} \right] + n \cdot \bar{y}_{S^-_1}^{2} - (a_1 - \bar{y}_{S^-_1})^{2} \\ &= A_S - 2\bar{y}_{S^-_1} B_S + n \cdot \bar{y}_{S^-_1}^{2} - (a_1 - \bar{y}_{S^-_1})^{2} \\ \end{aligned} $$

$$ \begin{aligned} \sum _{i\in S^-_2}(y_{i} - \bar{y}_{S^-_2})^{2} &= \left[ \sum _{i\in S^-_1}(y_{i} - \bar{y}_{S^-_2})^{2} \right] - (a_2 - \bar{y}_{S^-_2})^{2} \\ &= \left[ \sum _{i\in S^-_1}y_{i}^2 - 2 y_{i} \bar{y}_{S^-_2} + \bar{y}_{S^-_2}^{2} \right] - (a_2 - \bar{y}_{S^-_2})^{2} \\ &= \left[ \sum _{i\in S^-_1}y_{i}^2 \right] - \left[ \sum _{i\in S^-_1} 2 y_{i} \bar{y}_{S^-_2} \right] + \left[ \sum _{i\in S^-_1}\bar{y}_{S^-_2}^{2} \right] - (a_2 - \bar{y}_{S^-_2})^{2} \\ &= A_{S^-_1} - 2\bar{y}_{S^-_2} \left[ \sum _{i\in S^-_1} y_{i} \right] + (n - 1) \cdot \bar{y}_{S^-_2}^{2} - (a_2 - \bar{y}_{S^-_2})^{2} \\ &= A_{S^-_1} - 2\bar{y}_{S^-_2} B_{S^-_1} + (n - 1) \cdot \bar{y}_{S^-_2}^{2} - (a_2 - \bar{y}_{S^-_2})^{2} \\ \end{aligned} $$

$$ \begin{aligned} \sum _{i\in S^-_3}(y_{i} - \bar{y}_{S^-_3})^{2} &= \text{...} \\ &= A_{S^-_2} - 2\bar{y}_{S^-_3} B_{S^-_2} + (n - 2) \cdot \bar{y}_{S^-_3}^{2} - (a_3 - \bar{y}_{S^-_3})^{2} \\ \end{aligned} $$

$$ \begin{aligned} \sum _{i\in S^+_1}(y_{i} - \bar{y}_{S^+_1})^{2} &= \left[ \sum _{i\in S}(y_{i} - \bar{y}_{S^+_1})^{2} \right] + (a_1 - \bar{y}_{S^+_1})^{2} \\ &= \left[ \sum _{i\in S}y_{i}^2 - 2 y_{i} \bar{y}_{S^+_1} + \bar{y}_{S^+_1}^{2} \right] + (a_1 - \bar{y}_{S^+_1})^{2} \\ &= \left[ \sum _{i\in S}y_{i}^2 \right] - \left[ \sum _{i\in S} 2 y_{i} \bar{y}_{S^+_1} \right] + \left[ \sum _{i\in S}\bar{y}_{S^+_1}^{2} \right] + (a_1 - \bar{y}_{S^+_1})^{2} \\ &= A_S - 2\bar{y}_{S^+_1} \left[ \sum _{i\in S} y_{i} \right] + n \cdot \bar{y}_{S^+_1}^{2} + (a_1 - \bar{y}_{S^+_1})^{2} \\ &= A_S - 2\bar{y}_{S^+_1} B_S + n \cdot \bar{y}_{S^+_1}^{2} + (a_1 - \bar{y}_{S^+_1})^{2} \\ \end{aligned} $$

$$ \begin{aligned} \sum _{i\in S^+_2}(y_{i} - \bar{y}_{S^+_2})^{2} &= \text{...} \\ &= A_{S^+_1} - 2\bar{y}_{S^+_2} B_{S^+_1} + (n + 1) \cdot \bar{y}_{S^+_2}^{2} + (a_2 - \bar{y}_{S^+_2})^{2} \\ \end{aligned} $$
