This repository is the code basis for the paper intitled "Exploring the Intricacies of Neural Network Optimization"
Install requirements.txt by using the command pip install -r requirements.txt
-
Write various
.jsonfiles with the experiments you want to perform. -
Run the experiments using the comand
python code/run.py --hyper path_to_the_folder
In the hyperparameters folder there is one folder for each of the tested datasets.
If the user desires to run every experiment at the same time use the all_runs folder.
Otherwise it can run the experiments by folder individually achieving the same results as the ones presented in the paper.
Keep in mind the experiments with the binary_crossentropy and sparse_categorical_crossentropy are kept in a seperate folder as they require Y array to be created differently.
You can run them seperatly and then join the csv results.
With the experiments performed the results should be presented in results/raw folder.
To preprocess them run python code/results_preprocess.py which should create the results/final folder with the preprocessed results.
After that to obtain the importance of the hyperparameters run python code/results_analysis.py which should present the importance by dataset and the average of the six datasets.
Here we present the results that are available in the paper and an additional analysis of the obtained results.
If there is any analysis missing that the reader might desire to perform, the complete data obtained from the runs is available in the results folder, or the reader might run the experiments him self.
These are the results of the fANOVA analysis.
| All Datasets | |||
|---|---|---|---|
| Hyperparameter | Performance | Training Time | Inference Time |
| activation_functions | 18.42 | 3.2 | 6.99 |
| batch_size | 0.95 | 55.94 | 37.67 |
| loss | 12.23 | 0.33 | 2.1 |
| optimizer | 14.88 | 5.17 | 2.16 |
| learning_rate | 17.65 | 3.38 | 1.34 |
| hidden_layer_dim | 3.94 | 3.85 | 16.62 |
| hidden_layer_size | 3.94 | 3.61 | 6.29 |
| Classification | |||
|---|---|---|---|
| Hyperparameter | Performance | Training Time | Inference Time |
| activation_functions | 17.59 | 2.91 | 2.28 |
| batch_size | 1.31 | 57.3 | 37.43 |
| loss | 9.16 | 0.01 | 3.76 |
| optimizer | 17.11 | 3.78 | 4.53 |
| learning_rate | 21.4 | 4.69 | 0.01 |
| hidden_layer_dim | 6.13 | 0.67 | 19.37 |
| hidden_layer_size | 3.04 | 5.2 | 8.54 |
| Regression | |||
|---|---|---|---|
| Hyperparameter | Performance | Training Time | Inference Time |
| activation_functions | 23.66 | 2.87 | 15.98 |
| batch_size | 4.49 | 64.87 | 37.22 |
| loss | 19.51 | 0.12 | 0.01 |
| optimizer | 7.4 | 8.33 | 0.12 |
| learning_rate | 18.09 | 1.38 | 3.26 |
| hidden_layer_dim | 2.1 | 2.2 | 12.22 |
| hidden_layer_size | 3.32 | 1.48 | 4.18 |
| Abalone | |||
|---|---|---|---|
| Hyperparameter | Performance | Training Time | Inference Time |
| activation_functions | 14.77 | 1.39 | 4.39 |
| batch_size | 0.55 | 56.72 | 21.61 |
| loss | 0.0 | 1.62 | 0.0 |
| optimizer | 2.96 | 7.99 | 3.5 |
| learning_rate | 30.02 | 6.9 | 0.07 |
| hidden_layer_dim | 7.16 | 0.12 | 15.69 |
| hidden_layer_size | 11.55 | 4.35 | 11.04 |
| Bike Sharing | |||
|---|---|---|---|
| Hyperparameter | Performance | Training Time | Inference Time |
| activation_functions | 51.26 | 0.59 | 24.54 |
| batch_size | 0.74 | 72.21 | 29.71 |
| loss | 0.06 | 0.0 | 0.0 |
| optimizer | 17.86 | 6.28 | 0.02 |
| learning_rate | 11.6 | 5.17 | 7.14 |
| hidden_layer_dim | 0.0 | 1.98 | 14.41 |
| hidden_layer_size | 2.62 | 1.16 | 0.82 |
| Compas | |||
|---|---|---|---|
| Hyperparameter | Performance | Training Time | Inference Time |
| activation_functions | 3.4 | 0.4 | 0.08 |
| batch_size | 1.16 | 43.0 | 6.23 |
| loss | 33.98 | 0.19 | 0.0 |
| optimizer | 21.68 | 4.02 | 4.16 |
| learning_rate | 9.59 | 6.06 | 0.02 |
| hidden_layer_dim | 0.76 | 2.92 | 49.31 |
| hidden_layer_size | 3.61 | 7.49 | 20.06 |
| Covertype | |||
|---|---|---|---|
| Hyperparameter | Performance | Training Time | Inference Time |
| activation_functions | 29.22 | 12.77 | 4.01 |
| batch_size | 0.77 | 56.92 | 41.6 |
| loss | 0.06 | 0.0 | 10.34 |
| optimizer | 8.29 | 1.65 | 4.67 |
| learning_rate | 23.64 | 0.32 | 0.17 |
| hidden_layer_dim | 13.27 | 0.2 | 3.32 |
| hidden_layer_size | 1.84 | 4.79 | 0.62 |
| Delays Zurich | |||
|---|---|---|---|
| Hyperparameter | Performance | Training Time | Inference Time |
| activation_functions | 0.37 | 3.57 | 5.2 |
| batch_size | 0.0 | 58.2 | 57.82 |
| loss | 39.27 | 0.0 | 0.01 |
| optimizer | 14.39 | 2.42 | 0.0 |
| learning_rate | 0.18 | 0.58 | 0.5 |
| hidden_layer_dim | 2.37 | 10.18 | 12.22 |
| hidden_layer_size | 3.81 | 0.48 | 3.92 |
| Higgs | |||
|---|---|---|---|
| Hyperparameter | Performance | Training Time | Inference Time |
| activation_functions | 11.51 | 0.49 | 3.73 |
| batch_size | 2.46 | 48.6 | 69.07 |
| loss | 0.01 | 0.14 | 2.25 |
| optimizer | 24.08 | 8.67 | 0.63 |
| learning_rate | 30.84 | 1.22 | 0.15 |
| hidden_layer_dim | 0.09 | 7.68 | 4.75 |
| hidden_layer_size | 0.18 | 3.39 | 1.25 |
| Activation function | Batch size | Hidden layer dimension | Loss function | Optimizer | Learning Rate | MSE/MCC | Training time | Prediction Time |
|---|---|---|---|---|---|---|---|---|
| Regression | ||||||||
| Abalone | ||||||||
| relu | 256 | [224, 192, 608, 768, 800] | mean_squared_error | adam | 0.001 | 2.158 | 1.928 | 0.107 |
| Bike Sharing | ||||||||
| selu | 1024 | [352, 32, 288, 32, 544, 704, 96] | mean_squared_error | adam | 0.001 | 59.748 | 3.621 | 0.128 |
| Delays Zurich | ||||||||
| relu | 128 | [640, 416, 576, 192, 288, 32, 32] | mean_squared_error | adam | 0.001 | 3.101 | 73.694 | 0.286 |
| Classification | ||||||||
| Compass | ||||||||
| relu | 512 | [512, 512, 512, 512] | categorical_crossentropy | adam | 0.001 | 0.041 | 1.567 | 0.118 |
| Covertype | ||||||||
| relu | 512 | [1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024] | categorical_crossentropy | adam | 0.001 | 0.828 | 74.544 | 0.199 |
| Higgs | ||||||||
| softsign | 512 | [224, 480, 64, 96, 768, 32, 928] | categorical_crossentropy | adam | 0.001 | 0.415 | 50.935 | 0.239 |
The best and worst models were picked based on the performance metric
| Dataset | Baseline | Best model | Worst model |
|---|---|---|---|
| Performance | (MCC/MSE) | ||
| Regression | |||
| Abalone | 2.289 | 2.158 | 9.295 |
| Bike Sharing | 84.045 | 59.748 | 100.139 |
| Delays Zurich | 3.107 | 3.101 | 154.627 |
| Classification | |||
| Compass | 0.022 | 0.041 | 0 |
| Covertype | 0.812 | 0.828 | -0.001 |
| Higgs | 0.256 | 0.415 | 0 |
| Training Time | |||
| Abalone | 1.465 | 1.928 | 2.554 |
| Bike Sharing | 4.67 | 3.621 | 3.014 |
| Delays Zurich | 12.74 | 73.694 | 7.25 |
| Compass | 1.088 | 2.342 | 1.121 |
| Covertype | 37.381 | 74.544 | 4.987 |
| Higgs | 21.161 | 50.935 | 4.329 |
| Inference Time | |||
| Abalone | 0.11 | 0.107 | 0.101 |
| Bike Sharing | 0.132 | 0.128 | 0.122 |
| Delays Zurich | 0.136 | 0.286 | 0.149 |
| Compass | 1.088 | 0.11 | 1.121 |
| Covertype | 0.173 | 0.199 | 0.172 |
| Higgs | 0.173 | 0.239 | 0.182 |
- Rafael Teixeira - rgtzths
This project is licensed under the MIT License - see the LICENSE file for details
If you use this code, please cite our work: Teixeira, Rafael & Antunes, Mário & Sobral, Rúben & Martins, João & Gomes, Diogo & Aguiar, Rui. (2023). Exploring the Intricacies of Neural Network Optimization. 10.1007/978-3-031-45275-8_2.