
Commit cc8ddf1

Update report
1 parent 04fad7a commit cc8ddf1

3 files changed

Lines changed: 12 additions & 12 deletions


report/sections/3_experiment.tex

Lines changed: 8 additions & 7 deletions
@@ -22,9 +22,9 @@ \subsubsection{Models}
\paragraph{Decision Tree (DT)} DT is the simplest of our models. It is tree-based, making predictions by chaining binary predicates. Its architecture is inexpensive to train and highly explainable to non-technical audiences, and it is widely used in real-world production-grade systems such as autonomous vehicles \cite{autonomous-vehicle-appl}. Given its simplicity and popularity, we start our analysis by exploring which parameters minimally have to be tuned for the simplest model, and how it behaves when tuned with metaheuristics. For simplicity, a prebuilt implementation from \cite{dt-scikit} is used.
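For concreteness, a minimal sketch of wiring such a prebuilt tree into a tuning loop; the parameter names here are illustrative assumptions, not the report's exact search space:

```python
# Hypothetical sketch: a scikit-learn decision tree built from a
# candidate drawn by a metaheuristic; parameter names are assumptions.
from sklearn.tree import DecisionTreeClassifier

def build_dt(params):
    return DecisionTreeClassifier(
        max_depth=params["max_depth"],                  # depth of the binary-predicate tree
        min_samples_split=params["min_samples_split"],  # minimum samples to split a node
        criterion=params["criterion"],                  # e.g., "gini" or "entropy"
    )
```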

- \paragraph{K-Nearest Neighbors (KNN)} KNN is another simple model which predicts based on the class of a nearest neighbor among existing data instances. From the perspective of explainability, a prior work \cite{mygithub-drugconsumpML} suggests that, depending on datasets, sometimes a KNN classifier could be linear-based, but in the case of an image dataset, KNN in our experiment are predicting from highly dimensional image arrays, whose predictions need to be generalized by a kernel-based explainer. The model's architecture itself is not complex but the dataset involved could be a bit heavier training task. We used it as another type of model in our experiment. For simplicity, a prebuilt structure from \cite{knn-scikit} is used.
+ \paragraph{K-Nearest Neighbors (KNN)} We utilize \textit{KNeighborsClassifier} from \cite{knn-scikit}, which predicts based on the classes of the nearest neighbors among existing data instances. From the perspective of explainability, prior work \cite{mygithub-drugconsumpML} suggests that, depending on the dataset, a KNN classifier can sometimes be explained with a linear explainer; in our experiment, however, KNN predicts from high-dimensional image arrays, so its predictions need to be generalized by a kernel-based explainer. The model's architecture itself is not complex, but the dataset involved makes training heavier. We use it as an alternative model type in our experiment.
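A similarly minimal sketch of the \textit{KNeighborsClassifier} setup, assuming images are flattened to vectors before fitting; the tuned parameter names are again illustrative:

```python
# Hypothetical sketch: KNN over flattened image arrays; the tuned
# parameters shown are assumptions, not the report's exact space.
from sklearn.neighbors import KNeighborsClassifier

def build_knn(params):
    return KNeighborsClassifier(
        n_neighbors=params["n_neighbors"],  # k nearest training instances
        weights=params["weights"],          # "uniform" or "distance"
        p=params["p"],                      # 1 = Manhattan, 2 = Euclidean
    )

# High-dimensional inputs: each image is reshaped to a 1-D vector, e.g.
# X = images.reshape(len(images), -1)
```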

- \paragraph{Convolutional Neural Network (CNN)} While the above models might exhibit a simpler architecture, CNN, on the other hand, is a common deep learning architecture in practical works, which helps learning image recognition tasks more efficiently. It is a fully-connected neural network architecture which leverages operations known as "convolutions". Each of which utilizes a subset of pixels, known as a "kernel", or a "filter", iteratively learn those patterns. Custom neural networks generally, in the real-world, involve far more hyper-parameters during their training, and thus, tuning them is computationally much heavier than those prebuilt surrogate models. However, bad hyper-parameters could also increase the resulting error rate of the model \cite{metaheuristics-cookbook}. Therefore, a suitable metaheuristic search here comes with an important role to help determine the best set of hyper-parameters with fewer resources than an exhaustive search. Figure~\ref{fig:cnn_arch} summarizes the CNN backbone used in our experiments.
+ \paragraph{Convolutional Neural Network (CNN)} While the above models have simpler architectures, CNN is a common deep learning architecture in practical work that learns image recognition tasks more efficiently. Ours is a three-layer convolutional network with batch normalization, built on operations known as ``convolutions'': each convolution slides a small window of weights, known as a ``kernel'' or ``filter'', over subsets of pixels to iteratively learn local patterns. Custom neural networks in the real world generally involve far more hyper-parameters during training, and thus tuning them is computationally much heavier than tuning the prebuilt models above. Moreover, bad hyper-parameters can also increase the resulting error rate of the model \cite{metaheuristics-cookbook}. Therefore, a suitable metaheuristic search plays an important role here, helping determine the best set of hyper-parameters with fewer resources than an exhaustive search. Figure~\ref{fig:cnn_arch} summarizes the CNN backbone used in our experiments.
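A hedged sketch of a three-block CNN with batch normalization of the kind described; channel widths, pooling, and the classifier head are assumptions, and Figure~\ref{fig:cnn_arch} in the report remains authoritative:

```python
# Hypothetical sketch in PyTorch: three conv blocks with batch norm for
# grayscale 32x32 inputs; sizes are assumptions, not the exact backbone.
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),    # block 1
            nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),   # block 2
            nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),  # block 3
            nn.BatchNorm2d(128), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(128 * 4 * 4, num_classes)  # 32 -> 4 after 3 pools

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))
```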

\begin{figure}[ht]
\centering
@@ -143,7 +143,8 @@ \subsubsection{Justification}

We identified an order of importance for each metric by its relative weight in the fitness function.

- \paragraph{Macro F1} The recall measures the fraction of a class's true positives over all samples that are in reality positive, i.e., \( \frac{TP}{TP + FN} \). The precision measures the fraction of true positives over all samples that are labeled positive in the dataset, i.e., \( \frac{TP}{TP + FP} \). By balancing the pros and cons of the precision-recall dilemma, a macro F1 score, denoted by \( \frac{\sum_{i=1}^{N} F1_i}{N} \) from N classes, is believed to be the best and the most important metric. The formula suggests it as an unweighted mean each class's F1 score. It helps penalizing poor performances on any class and avoid over-focusing on well performing ones. This is how it combines both precision and recall for each class, by giving equal importance to each class's performance.
+ \paragraph{Macro F1}
+ Macro F1 is selected as the dominant term of the composite fitness because it enforces class-wise fairness during hyperparameter optimization. Unlike accuracy or micro-averaged metrics that can be inflated by majority-class performance, Macro F1 computes the unweighted mean of each class's F1 score, ensuring that every class contributes equally to the objective. This prevents the search process from converging to solutions that perform well only on frequent or easy classes while neglecting minority or harder classes. Since F1 already balances precision and recall through a harmonic mean, its macro-averaged form provides the strongest and most stable signal for steering the optimizer toward models that maintain consistent predictive quality across the entire label distribution.
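In scikit-learn terms, the quantity described reduces to a one-liner; the labels below are toy data for illustration only:

```python
# Macro F1 = unweighted mean of per-class F1 scores.
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2]  # toy labels for illustration
y_pred = [0, 0, 1, 0, 2, 1]
macro_f1 = f1_score(y_true, y_pred, average="macro")
# average="macro" weights every class equally, so poor performance on
# any single class drags the score down regardless of class frequency.
```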

\paragraph{Precision-Recall} Precision and recall rank right after the macro F1 score in importance. While precision tells us how many of the samples labeled positive are correct, it is more important to know how many of the truly positive samples are recovered, and we therefore assign a slightly higher weight to recall than to precision.
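A sketch of a composite fitness respecting this ordering; the concrete weights below are assumptions, since the text fixes only the relative order macro F1 > recall > precision:

```python
# Hypothetical composite fitness: macro F1 dominant, recall weighted
# slightly above precision; the 0.5/0.3/0.2 split is an assumption.
from sklearn.metrics import f1_score, precision_score, recall_score

def fitness(y_true, y_pred):
    f1 = f1_score(y_true, y_pred, average="macro")
    rec = recall_score(y_true, y_pred, average="macro")
    prec = precision_score(y_true, y_pred, average="macro")
    return 0.5 * f1 + 0.3 * rec + 0.2 * prec
```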

@@ -189,14 +190,14 @@ \subsubsection{Search Budget} Each HPO run is allocated a fixed budget of 50 fit

\subsubsection{Stochasticity} To account for stochasticity, we perform $N = 10$ independent runs for each optimizer-model pair.

- \subsubsection{Metrics}
+ \subsubsection{Evaluation Metrics}

We will collect the following; a small aggregation sketch follows the list:

\begin{itemize}
- \item{Effectiveness:} The distribution (mean, median, best, worst) of the final fitness score achieved across 10 runs.
- \item{Convergence:} The convergence trace (improvement over evaluations) for each run.
- \item{Stability:} The variance of the final fitness scores across the 10 runs.
+ \item \textbf{Effectiveness:} The distribution (mean, median, best, worst) of the final fitness score achieved across 10 runs.
+ \item \textbf{Convergence:} The convergence trace (improvement over evaluations) for each run.
+ \item \textbf{Stability:} The variance of the final fitness scores across the 10 runs.
\end{itemize}
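As referenced above, a minimal sketch of aggregating these quantities across the 10 runs; the scores are illustrative placeholders, not reported data:

```python
# Hypothetical per-optimizer aggregation across N = 10 runs.
import statistics

final_scores = [0.61, 0.63, 0.59, 0.62, 0.60, 0.64, 0.58, 0.62, 0.61, 0.63]

effectiveness = {
    "mean": statistics.mean(final_scores),
    "median": statistics.median(final_scores),
    "best": max(final_scores),
    "worst": min(final_scores),
}
stability = statistics.variance(final_scores)  # variance across the 10 runs
# Convergence: a best-so-far trace per run over the evaluation budget,
# e.g. trace[i] = max(scores[: i + 1]) for each evaluation index i.
```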

\subsubsection{Analysis}

report/sections/4_results.tex

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -35,7 +35,7 @@ \subsection{RQ2: Stability}

\subsection{Statistical Significance}

- Wilcoxon signed-rank tests showed no significant performance differences between any optimizer pairs across all models (p > 0.05; Table \ref{tab:wilcoxon}). The nearest to significance was PSO versus RS on KNN (p = 0.094). The memetic GA variant also showed no improvement over the standard GA.
+ Wilcoxon signed-rank tests showed no significant performance differences between any optimizer pairs across all models ($p > 0.05$; Table \ref{tab:wilcoxon}). The closest to significance was PSO versus RS on KNN ($p = 0.094$). The memetic GA variant also showed no statistically significant improvement over the standard GA.
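For reference, a sketch of one such pairwise test with SciPy; the paired scores below are illustrative placeholders, not the reported data:

```python
# Hypothetical Wilcoxon signed-rank test on paired final fitness
# scores from 10 runs; values are made up for illustration only.
from scipy.stats import wilcoxon

scores_pso = [0.62, 0.61, 0.64, 0.60, 0.63, 0.62, 0.59, 0.64, 0.61, 0.63]
scores_rs  = [0.61, 0.62, 0.62, 0.61, 0.62, 0.63, 0.60, 0.62, 0.62, 0.61]

stat, p_value = wilcoxon(scores_pso, scores_rs)  # paired, non-parametric
print(f"p = {p_value:.3f}")  # significant only if p < 0.05
```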

\begin{table}[H]
\centering

report/sections/5_limitations.tex

Lines changed: 3 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,7 @@
\section{Limitations}

- \paragraph{Insufficient Evolutionary Iterations.} The fixed evaluation budget ($B=50$) relative to the population size ($P=30$) restricted the Genetic Algorithm to fewer than two full generations. Consequently, 60\% of the budget was consumed by initial random sampling, leaving insufficient evaluations for crossover, mutation, and selection to drive convergence.
+ \paragraph{Insufficient Evolutionary Iterations.} The fixed evaluation budget ($50$ evaluations) relative to the population size ($30$ individuals) restricted the Genetic Algorithm to fewer than two full generations. Consequently, 60\% of the budget was consumed by initial random sampling, leaving insufficient evaluations for crossover, mutation, and selection to drive convergence.
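The arithmetic behind the 60% figure, as a worked sketch:

```python
# Budget arithmetic from the paragraph above.
budget, population = 50, 30

initial_fraction = population / budget   # 30/50 = 0.6 -> 60% spent on init
remaining = budget - population          # 20 evaluations left for evolution
full_generations = budget // population  # only 1 full generation fits
print(initial_fraction, remaining, full_generations)
```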

- \paragraph{Statistical Power.} With $N=10$ independent runs per configuration, the Wilcoxon signed-rank test had limited power to detect small but consistent performance differences between algorithms.
-
- \paragraph{Scope Validity.} Experiments were conducted only on grayscale CIFAR-10 with defined hyperparameter spaces. Results may not generalize to higher-dimensional spaces (e.g., RGB images or deeper architectures) where exploration-exploitation trade-offs could differ.
+ \paragraph{Statistical Power.} With $10$ independent runs per configuration, the Wilcoxon signed-rank test had limited power to detect small but consistent performance differences between algorithms.

+ \paragraph{Scope Validity.} Experiments were conducted only on grayscale CIFAR-10 with defined hyperparameter spaces. Results may not generalize to higher-dimensional spaces (e.g., RGB images or deeper architectures) where exploration-exploitation trade-offs could differ.
