🎉 CODES was accepted to the ML4PS workshop @ NeurIPS2024 🎉
8
+
🎉 Accepted to the ML4PS workshop @ NeurIPS 2024
9
9
10
-
## Benchmarking Coupled ODE Surrogates
10
+
Benchmark coupled ODE surrogate models on curated datasets with reproducible training, evaluation, and visualization pipelines. CODES helps you answer: _Which surrogate architecture fits my data, accuracy target, and runtime budget?_
11
11
12
-
CODES is a benchmark for coupled ODE surrogate models.
<img width="14" alt="CODES API Docs" src="docs/_static/book-solid.svg">
35
-
</picture> The technical API documentation is hosted on this <a href="https://robin-janssen.github.io/CODES-Benchmark/">GitHub Page</a>.
21
+
**uv (recommended)**
36
22
37
-
## Motivation
38
-
39
-
There are many efforts to use machine learning models ("surrogates") to replace the costly numerics involved in solving coupled ODEs. But for the end user, it is not obvious how to choose the right surrogate for a given task. Usually, the best choice depends on both the dataset and the target application.
40
-
41
-
Dataset specifics - how "complex" is the dataset?
42
-
43
-
- How many samples are there?
44
-
- Are the trajectories highly dynamic, or do they evolve slowly?
45
-
- How dense is the distribution of initial conditions?
46
-
- Is the data domain of interest well-covered by the domain of the training set?
47
-
48
-
Task requirements:
49
-
50
-
- What is the required accuracy?
51
-
- How important is inference time? Is the training time limited?
52
-
- Are there computational constraints (memory or processing power)?
53
-
- Is uncertainty estimation required (e.g. to replace uncertain predictions by numerics)?
54
-
- How much predictive flexibility is required? Do we need to interpolate or extrapolate across time?
55
-
56
-
Besides these practical considerations, one overarching question is always: Does the model only learn the data, or does it "understand" something about the underlying dynamics?
57
-
58
-
## Goals
59
-
60
-
This benchmark aims to aid in choosing the best surrogate model for the task at hand and additionally to shed some light on the above questions.
61
-
62
-
To achieve this, a selection of surrogate models is implemented in this repository. They can be trained on one of the included datasets or on a custom dataset and then benchmarked on the corresponding test dataset.
63
-
64
-
Some **metrics** included in the benchmark (there are many more!):
65
-
66
-
- Absolute and relative error of the models.
67
-
- Inference time.
68
-
- Number of trainable parameters.
69
-
- Memory requirements (**WIP**).
70
-
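As a rough illustration of the first metric in the list above, the sketch below computes mean absolute and mean relative error with the standard library only. The function names, the `eps` guard against division by zero, and the sample values are assumptions made for this example; they are not taken from the CODES code base.

```python
def absolute_errors(predictions, targets):
    """Element-wise absolute error |prediction - target|."""
    return [abs(p - t) for p, t in zip(predictions, targets)]


def relative_errors(predictions, targets, eps=1e-12):
    """Element-wise relative error; eps (an assumption) avoids division by zero."""
    return [abs(p - t) / (abs(t) + eps) for p, t in zip(predictions, targets)]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Hypothetical surrogate predictions and reference values.
predictions = [1.0, 2.5, 4.0]
targets = [1.0, 2.0, 5.0]

mae = mean(absolute_errors(predictions, targets))
mre = mean(relative_errors(predictions, targets))
print(f"mean absolute error: {mae:.3f}, mean relative error: {mre:.3f}")
```

Relative error is often the more informative of the two when quantities span several orders of magnitude, since large values would otherwise dominate the absolute error.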
71
-
Besides this, there are plenty of **plots and visualisations** providing insights into the models' behaviour:
72
-
73
-
- Error distributions - per model, across time or per quantity.
74
-
- Insights into interpolation and extrapolation across time.
75
-
- Behaviour when training with sparse data or varying batch size.
76
-
- Predictions with uncertainty and predictive uncertainty across time.
77
-
- Correlations between the prediction error and either the predictive uncertainty or the dynamics (gradients) of the data.
78
-
79
-
Some prime **use-cases** of the benchmark are:
80
-
81
-
- Finding the best-performing surrogate on a dataset. Here, best-performing could mean high accuracy, low inference times or any other metric of interest (e.g. most accurate uncertainty estimates, ...).
82
-
- Comparing performance of a novel surrogate architecture against the implemented baseline models.
83
-
- Gaining insights into a dataset or comparing datasets using the built-in dataset insights.
84
-
85
-
## Key Features
86
-
87
-
<details>
88
-
<summary><b>Baseline Surrogates</b></summary>
89
-
90
-
The following surrogate models are currently implemented to be benchmarked:
91
-
92
-
- Fully Connected Neural Network:
93
-
The vanilla neural network a.k.a. multilayer perceptron.
94
-
- DeepONet:
95
-
Two fully connected networks whose outputs are combined using a scalar product. In the current implementation, the surrogate consists of only one DeepONet with multiple outputs (hence the name MultiONet).
96
-
- Latent NeuralODE:
97
-
NeuralODE combined with an autoencoder that reduces the dimensionality of the dataset before solving the dynamics in the resulting latent space.
98
-
- Latent Polynomial:
99
-
Uses an autoencoder similar to Latent NeuralODE, but fits a polynomial to the trajectories in the resulting latent space.
100
-
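To make the DeepONet description above more concrete, here is a minimal sketch of combining two network outputs with a scalar product. How a single DeepONet produces multiple outputs is not specified here; splitting both output vectors into equally sized chunks, one per output quantity, is an assumption for illustration only. All names (`dot`, `split_chunks`, `multi_output`) and the example vectors are hypothetical.

```python
def dot(a, b):
    """Scalar product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def split_chunks(vector, n_chunks):
    """Split a vector into n_chunks contiguous chunks of equal size."""
    size = len(vector) // n_chunks
    return [vector[i * size:(i + 1) * size] for i in range(n_chunks)]


def multi_output(branch_out, trunk_out, n_outputs):
    """One scalar product per output quantity (assumed splitting scheme)."""
    branch_chunks = split_chunks(branch_out, n_outputs)
    trunk_chunks = split_chunks(trunk_out, n_outputs)
    return [dot(b, t) for b, t in zip(branch_chunks, trunk_chunks)]


# Stand-ins for the outputs of the two fully connected networks.
branch_out = [1.0, 2.0, 3.0, 4.0]
trunk_out = [0.5, 0.5, 1.0, -1.0]

outputs = multi_output(branch_out, trunk_out, n_outputs=2)
print(outputs)
```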
101
-
</details>
102
-
103
-
<details>
104
-
<summary><b>Baseline Datasets</b></summary>
105
-
106
-
The following datasets are currently included in the benchmark:
To give an uncertainty estimate that does not rely too much on the specifics of the surrogate architecture, we use DeepEnsemble for UQ.
114
-
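The idea behind ensemble-based UQ can be sketched as follows: several independently trained models predict the same input, the mean serves as the prediction, and the spread serves as the uncertainty estimate. The three lambda "models" below are placeholders for trained surrogates, and `ensemble_predict` is a hypothetical helper, not part of the benchmark's API.

```python
import statistics


def ensemble_predict(models, x):
    """Return (mean, standard deviation) of the ensemble members' predictions."""
    predictions = [model(x) for model in models]
    return statistics.mean(predictions), statistics.pstdev(predictions)


# Placeholder ensemble: in practice each member would be a surrogate
# trained with a different random initialisation.
models = [
    lambda x: 2.0 * x,
    lambda x: 2.0 * x + 0.2,
    lambda x: 2.0 * x - 0.2,
]

mean_prediction, uncertainty = ensemble_predict(models, 1.0)
print(f"prediction: {mean_prediction:.3f} +/- {uncertainty:.3f}")
```

Because the estimate only needs each member's prediction, the same procedure works for any surrogate architecture.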
115
-
</details>
116
-
117
-
<details>
118
-
<summary><b>Parallel Training</b></summary>
119
-
120
-
To gain insights into the surrogates' behaviour, many models must be trained on varying subsets of the training data. This task is trivially parallelisable. In addition to utilising all specified devices, the benchmark displays progress bars that show the current status of the training.
121
-
122
-
</details>
123
-
124
-
<details>
125
-
<summary><b>Plots, Plots, Plots</b></summary>
126
-
127
-
While hard metrics are crucial to compare the surrogates, performance cannot always be broken down to a set of numbers. Running the benchmark creates many plots that serve to compare performance of surrogates or provide insights into the performance of each surrogate.
128
-
129
-
</details>
130
-
131
-
<details>
132
-
<summary><b>Dataset Insights (WIP)</b></summary>
133
-
134
-
"Know your data" is one of the most important rules in machine learning. To aid in this, the benchmark provides plots and visualisations that should help to understand the dataset better.
At the end of the benchmark, the most important metrics are displayed in a table. Additionally, all metrics generated during the benchmark are provided as a CSV file.
uv run python run_training.py --config configs/train_eval/config_minimal.yaml
29
+
uv run python run_eval.py --config configs/train_eval/config_minimal.yaml
30
+
```
147
31
148
-
Randomness is an important part of machine learning and even required in the context of UQ with DeepEnsemble, but reproducibility is key in benchmarking enterprises. The benchmark uses a custom seed that can be set by the user to ensure full reproducibility.
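A minimal sketch of how a single user-set seed can keep runs reproducible while still giving each ensemble member different random numbers. Deriving member seeds as `seed + member` is an assumption made for this example and may differ from what the benchmark does internally; `make_rng` and `draw` are hypothetical names.

```python
import random


def make_rng(seed, member=0):
    """Create an independent generator; the member offset is an assumed scheme."""
    return random.Random(seed + member)


def draw(seed, member=0, n=3):
    """Draw n pseudo-random numbers for a given seed and ensemble member."""
    rng = make_rng(seed, member)
    return [rng.random() for _ in range(n)]


# Same seed and member: identical numbers on every run.
print(draw(42) == draw(42))
# Same seed, different member: different numbers, still reproducible.
print(draw(42, member=0) != draw(42, member=1))
```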
<summary><b>Custom Datasets and Own Models</b></summary>
44
+
Outputs land in `trained/<training_id>`, `results/<training_id>`, and `plots/<training_id>`. The `configs/` folder contains ready-to-use templates (`train_eval/config_minimal.yaml`, `config_full.yaml`, etc.). Copy a file there and adjust datasets/surrogates/modalities before running the CLIs.
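The output layout described above can be sketched as a small helper that builds the three paths for a given training ID. The function name `output_dirs` and the ID `my_run` are hypothetical; only the folder names `trained`, `results`, and `plots` come from the text.

```python
from pathlib import PurePosixPath


def output_dirs(training_id):
    """Map each output folder name to its path for the given training ID."""
    return {
        name: PurePosixPath(name) / training_id
        for name in ("trained", "results", "plots")
    }


dirs = output_dirs("my_run")  # "my_run" is a placeholder training ID
for name, path in dirs.items():
    print(f"{name}: {path}")
```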
154
45
155
-
To cover a wide variety of use-cases, the benchmark explicitly supports adding your own datasets and models.
Optionally, you can set up a [virtual environment](https://docs.python.org/3/library/venv.html) (recommended).
65
+
## Contributing
168
66
169
-
Then, install the required packages with
67
+
Pull requests are welcome! Please include documentation updates, add or update tests when you touch executable code, and run:
170
68
69
+
```bash
70
+
uv pip install --group dev
71
+
pytest
72
+
sphinx-build -b html docs/source/ docs/_build/html
171
73
```
172
-
pip install -r requirements.txt
173
-
```
174
-
175
-
The installation is now complete. To run and evaluate the benchmark, you first need to set up a configuration YAML file. A template is provided, but it should be adjusted to your setup. For more information, check the [configuration page](https://robin-janssen.github.io/CODES-Benchmark/documentation.html#config). There, we also offer an interactive Config-Generator tool with some explanations to help you set up your benchmark.
176
-
177
-
You can also add your own datasets and models to the benchmark to evaluate them against each other or some of our baseline models. For more information on how to do this, please refer to the [documentation](https://robin-janssen.github.io/CODES-Benchmark/documentation.html).
178
74
75
+
If you publish a new surrogate or dataset, document it under `docs/guides` / `docs/reference` so users can adopt it quickly. For questions, open an issue on GitHub.
179
76
180
77
## Contributors
181
78
@@ -186,4 +83,4 @@ You can also add your own datasets and models to the benchmark to evaluate them