report/sections/3_experiment.tex (1 addition, 5 deletions)
@@ -201,8 +201,4 @@ \subsubsection{Metrics}
\subsubsection{Analysis}
- To evaluate the performance differences between the optimizers, we will employ both qualitative and quantitative analysis methods.
- Qualitatively, we inspect the convergence plots to understand the search trajectory and efficiency of each algorithm over the 50 generations. We will also utilize box plots to visualize the spread of the final fitness values, allowing us to assess the stability and variance of the solutions found across the 10 independent runs.
- Quantitatively, we will perform statistical hypothesis testing to determine if observed differences in performance are significant or merely due to random chance. Since the distribution of fitness scores is not guaranteed to be normal, we will opt for the \textbf{Wilcoxon signed-rank test}, a non-parametric pairwise comparison test. We will define our significance level at $\alpha = 0.05$. The null hypothesis ($H_0$) states that the median difference between pairs of observations from two optimizers is zero (i.e., they perform equally).
+ Performance differences are assessed qualitatively via convergence plots (search trajectory) and box plots (solution stability across runs). For quantitative comparison, we apply the Wilcoxon signed-rank test ($\alpha=0.05$, non-parametric) to determine if observed differences are significant. The null hypothesis is that the median performance difference between any two optimizers is zero.
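As a rough illustration of this analysis plan (not the project's own code), such a pairwise comparison could be run with SciPy's wilcoxon function; the two arrays below stand in for the ten paired final fitness scores of two optimizers and are purely hypothetical.

```python
# Minimal sketch of the planned Wilcoxon signed-rank comparison.
# The fitness values are hypothetical placeholders, not measured results.
import numpy as np
from scipy.stats import wilcoxon

ga_fitness = np.array([0.340, 0.339, 0.343, 0.342, 0.337,
                       0.343, 0.343, 0.349, 0.344, 0.350])
rs_fitness = np.array([0.339, 0.341, 0.340, 0.338, 0.342,
                       0.337, 0.336, 0.341, 0.335, 0.339])

alpha = 0.05
stat, p_value = wilcoxon(ga_fitness, rs_fitness)  # H0: median paired difference is zero
print(f"W = {stat:.1f}, p = {p_value:.3f}")
if p_value < alpha:
    print("Reject H0: the optimizers differ significantly.")
else:
    print("Fail to reject H0: no significant difference.")
```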
report/sections/4_results.tex (5 additions, 28 deletions)
@@ -4,7 +4,7 @@ \section{Results}
\subsection{RQ1: Effectiveness and Convergence}
- To answer RQ1, we analyzed the final fitness scores and the convergence behavior of each optimizer. As illustrated in Figure \ref{fig:convergence}, all three optimizers demonstrated the ability to improve solutions over the generations, though with varying efficiency.
+ All optimizers improved solutions over 50 evaluations, but convergence efficiency varied by model (Figure \ref{fig:convergence}). For DT and KNN, which have relatively small search spaces, GA, PSO, and RS converged rapidly to near-identical optimal configurations. Their final fitness scores were approximately 0.3384 and 0.4308, respectively (Figure \ref{fig:test_perf}).
\begin{figure}[H]
\centering
@@ -13,18 +13,7 @@ \subsection{RQ1: Effectiveness and Convergence}
\label{fig:convergence}
\end{figure}
- For the \textbf{Decision Tree} and \textbf{KNN}, the search space was relatively small. Consequently, all three optimizers rapidly converged to near-identical optimal configurations. As shown in the final test performance (Figure \ref{fig:test_perf}), the DT achieved a fitness of $\approx0.3384$, while the KNN achieved $\approx0.4308$.
- We have also plotted the mean and standard deviation of the current fitness for each $n$ evaluations for each optimizer. The curves are smoothed for readability (with a smoothing factor of 0.5), while the shaded regions reflect the underlying variation. As shown in Figure \ref{fig:evaluations}, the current fitness is generally improving over the evaluations, though with varying efficiency.
- \caption{Current fitness behavior of GA, PSO, and RS across 50 evaluations for all three models.}
- \label{fig:evaluations}
- \end{figure}
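As an aside, the kind of smoothed mean-and-spread plot described in the paragraph above could be produced roughly as follows; this is only a sketch with synthetic fitness histories, and the smoothing function, colors, and file name are illustrative assumptions rather than the report's plotting code.

```python
# Sketch of a mean +/- std convergence plot with display-only smoothing
# (factor 0.5). The per-run fitness histories are synthetic stand-ins.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_runs, n_evals, smooth_factor = 10, 50, 0.5

def smooth(y, factor):
    """Exponential moving average, used only to make the curves readable."""
    out = [y[0]]
    for v in y[1:]:
        out.append(factor * out[-1] + (1 - factor) * v)
    return np.array(out)

x = np.arange(1, n_evals + 1)
for name in ["GA", "PSO", "RS"]:
    # Synthetic best-so-far curves: random values made monotonically non-decreasing.
    runs = np.maximum.accumulate(rng.random((n_runs, n_evals)) * 0.8, axis=1)
    mean, std = runs.mean(axis=0), runs.std(axis=0)
    plt.plot(x, smooth(mean, smooth_factor), label=name)    # smoothed mean
    plt.fill_between(x, mean - std, mean + std, alpha=0.2)  # underlying variation

plt.xlabel("Evaluations")
plt.ylabel("Current fitness")
plt.legend()
plt.savefig("evaluations_sketch.png", dpi=150)
```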
- For the \textbf{CNN}, which possesses the largest and most complex search space, we observed distinct behaviors. While GA (orange line in Figure \ref{fig:convergence}) started with lower fitness, it showed steady improvement. Random Search (RS), surprisingly, started with high fitness in several runs, likely due to the efficacy of random sampling in high-dimensional spaces where few parameters dominate performance. Ultimately, all algorithms converged to a test performance of approximately $0.77$ (Figure \ref{fig:test_perf}).
+ For the more complex CNN, GA showed steady improvement from a lower initial fitness, while RS occasionally found high-fitness configurations early (Figure \ref{fig:convergence}). All optimizers converged to a similar final performance of approximately 0.77 (Figure \ref{fig:test_perf}).
\begin{figure}[H]
\centering
@@ -35,7 +24,7 @@ \subsection{RQ1: Effectiveness and Convergence}
\subsection{RQ2: Stability}
- Stability was measured by the standard deviation of the final fitness scores across the 10 runs, visualized in the box plots in Figure \ref{fig:boxplot}.
+ The stability of final solutions, shown by box plots of fitness across 10 runs (Figure \ref{fig:boxplot}), was high for all models. Variance was minimal for DT and KNN. For CNN, all optimizers produced similar interquartile ranges (approximately 0.79 to 0.83), with minor differences in outlier counts (GA: 1, PSO: 2, RS: 1).
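For reference, a stability box plot of this kind could be assembled along the following lines; the per-run final fitness values are random placeholders, not the reported numbers, and the output file name is an assumption.

```python
# Sketch of the stability box plot: distribution of final fitness over 10 runs
# per optimizer. Values are placeholders drawn at random, not measured results.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
final_fitness = {
    "GA":  0.81 + 0.01 * rng.standard_normal(10),
    "PSO": 0.81 + 0.01 * rng.standard_normal(10),
    "RS":  0.80 + 0.01 * rng.standard_normal(10),
}

plt.boxplot(list(final_fitness.values()), labels=list(final_fitness.keys()))
plt.ylabel("Final fitness (CNN)")
plt.savefig("boxplot_sketch.png", dpi=150)
```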
\begin{figure}[H]
\centering
@@ -44,11 +33,9 @@ \subsection{RQ2: Stability}
\label{fig:boxplot}
\end{figure}
- The box plots provide a nuanced view of the stability for each model. For the \textbf{CNN}, the distributions are quite similar across optimizers, with whiskers generally falling within the $0.79-0.83$ range. In terms of outliers, GA and RS each exhibit one, while PSO displays two. For the \textbf{Decision Tree (DT)}, Random Search (RS) shows the largest interquartile range and a significant lower whisker; however, the overall variance remains small given the scale ($0.3415-0.3440$). The metaheuristics are more packed at the upper end, though GA includes two outliers, and PSO has a lower whisker extending to approximately $0.34225$. Finally, for the \textbf{KNN}, no outliers are observed for any of the optimizers.
\subsection{Statistical Significance}
- We performed the \textbf{Wilcoxon signed-rank test} to determine if there were statistically significant differences between the optimizers. The results are summarized in Table \ref{tab:wilcoxon}.
+ Wilcoxon signed-rank tests showed no significant performance differences between any optimizer pairs across all models (p > 0.05; Table \ref{tab:wilcoxon}). The nearest to significance was PSO versus RS on KNN (p = 0.094). The memetic GA variant also showed no improvement over the standard GA.
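A table of pairwise p-values like Table \ref{tab:wilcoxon} could be assembled by looping over all optimizer pairs, as in this sketch; the results dictionary, optimizer names, and its randomly drawn contents are stand-ins for the real per-run scores.

```python
# Sketch of assembling pairwise Wilcoxon p-values for one model.
# The per-optimizer score arrays are random stand-ins, not real results.
from itertools import combinations
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(2)
results = {name: 0.80 + 0.02 * rng.standard_normal(10)
           for name in ["GA-Standard", "GA-Memetic", "PSO", "RS"]}

for a, b in combinations(results, 2):
    _, p = wilcoxon(results[a], results[b])
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{a} vs {b}: p = {p:.3f} ({verdict})")
```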
- \item \textbf{No Significant Differences:} All pairwise comparisons yield $p > 0.05$, so we fail to reject the null hypothesis across models.
- \item \textbf{Identical Performance (DT):} GA-Standard vs PSO remains $p=1.000$, indicating indistinguishable outcomes on Decision Trees.
- \item \textbf{Near Significance (KNN):} PSO vs RS on KNN is $p=0.094$, hinting PSO may modestly outperform RS, but it does not reach $\alpha=0.05$.
- \item \textbf{Memetic Variants:} GA-Memetic comparisons (vs GA-Standard, PSO, RS) are all non-significant ($p > 0.05$), showing no measurable improvement over the standard GA under our budget.
- \end{itemize}
- Given the GA population size of $30$ and the strict budget of $50$ evaluations, the algorithm is structurally capped at fewer than three full generations: $30$ evaluations are spent on initialization, leaving only $20$ offspring evaluations (about $0.67$ of a generation). Roughly $60\%$ of the budget is therefore consumed by warm-up sampling, so GA behaves similarly to Random Search under this constraint. PSO faces the same tight budget and likewise does not clear a statistically meaningful advantage. Under such sub-$2\times$-population budgets, Random Search is effectively superior because it pays no population-initialization overhead. The value of GA (and possibly PSO) would likely emerge only with larger budgets that permit 10+ generations, where exploitation phases can amortize their startup cost.
- \paragraph{Insufficient Evolutionary Iterations.} The primary threat to validity is the interaction between the population size ($P=30$) and the fixed evaluation budget ($B=50$). This ratio mathematically restricted the Genetic Algorithm to fewer than two full generations ($N_{gen} \approx1.67$). Consequently, 60\% of the computational budget was consumed by the random initialization phase (Generation 0), leaving insufficient iterations for the metaheuristic mechanisms (crossover, mutation, and selection) to drive convergence away from the initial random distribution.
+ \paragraph{Insufficient Evolutionary Iterations.} The fixed evaluation budget ($B=50$) relative to the population size ($P=30$) restricted the Genetic Algorithm to fewer than two full generations. Consequently, 60\% of the budget was consumed by initial random sampling, leaving insufficient evaluations for crossover, mutation, and selection to drive convergence.
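The budget accounting behind this point is simple enough to spell out; the sketch below only restates the numbers given in the text ($P=30$, $B=50$).

```python
# Budget accounting for GA: population size vs. total evaluation budget.
population_size = 30     # P: evaluations spent on the initial population (generation 0)
evaluation_budget = 50   # B: total fitness evaluations allowed

init_share = population_size / evaluation_budget        # 30/50 = 60% spent on warm-up
offspring_evals = evaluation_budget - population_size   # 20 evaluations remain
extra_generations = offspring_evals / population_size   # ~0.67 of one further generation

print(f"Initialization consumes {init_share:.0%} of the budget")
print(f"Offspring evaluations remaining: {offspring_evals}")
print(f"Generations completed after initialization: {extra_generations:.2f}")
# Including generation 0, GA fits about 1.67 generations, i.e. fewer than two full generations.
```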
- \paragraph{Statistical Power.} The experiment utilized $N=10$ independent runs per optimizer-model pair. While standard for exploratory HPO studies, this sample size limits the statistical power of the Wilcoxon signed-rank test, potentially masking small but consistent effect sizes between the algorithms.
+ \paragraph{Statistical Power.} With $N=10$ independent runs per configuration, the Wilcoxon signed-rank test had limited power to detect small but consistent performance differences between algorithms.
- \paragraph{Scope Validity.} The evaluation was restricted to grayscale CIFAR-10 and fixed hyperparameter search spaces. These results may not generalize to higher-dimensional search spaces (e.g., full RGB imagery or deeper architectures) where the "curse of dimensionality" might differentiate global search strategies (GA) from local ones (PSO) more effectively.
+ \paragraph{Scope Validity.} Experiments were conducted only on grayscale CIFAR-10 with defined hyperparameter spaces. Results may not generalize to higher-dimensional spaces (e.g., RGB images or deeper architectures) where exploration-exploitation trade-offs could differ.
- This study compared metaheuristic optimizers (GA, PSO) against a Randomized Search (RS) baseline under a strict computational budget. Statistical testing confirmed that GA and PSO failed to significantly outperform RS for any of the tested Decision Tree, KNN, or CNN models (p > 0.05). All optimizers converged to statistically indistinguishable fitness plateaus.
+ This study evaluated metaheuristic optimizers (GA, PSO) against a Randomized Search baseline under a strict budget of 50 evaluations. No significant performance difference was found between methods across Decision Tree, KNN, or CNN models ($p > 0.05$), with all reaching similar fitness plateaus.
- This null result is attributed to the initialization overhead of population-based methods within a micro-budget regime. The genetic algorithms, for example, expended 60\% of their budget on the initial random population, leaving insufficient evaluations for evolutionary operators to amortize this cost and drive meaningful improvement.
+ The result is explained by initialization overhead: with a population of 30, GA used 60\% of its budget on initial sampling, leaving too few evaluations for evolutionary operators to yield improvement. In such micro-budget regimes (budget $< 2\times$ population), population-based methods behave similarly to Random Search.
- First, \textbf{metaheuristics failed to outperform the baseline.} Statistical testing confirmed no significant difference in final solution quality between GA, PSO, and RS across Decision Tree, KNN, or CNN architectures ($p > 0.05$). All methods converged to statistically indistinguishable fitness plateaus (e.g., $\approx0.77$ for CNN).
- We conclude that for micro-budget HPO tasks, where the total evaluation budget is not substantially larger than the population size, the added complexity of metaheuristics is unjustified. In such scenarios, Randomized Search can potentially be a more effective strategy.
+ Therefore, for hyperparameter optimization with very limited evaluations, the added complexity of metaheuristics like GA and PSO is not justified. Random Search proved to be an equally effective baseline under these constraints.
Future work should:
\begin{itemize}
- \item Extend evaluation budgets to observe whether GA/PSO advantages emerge once initialization overhead is amortized.
- \item Test budget-aware variants in order to shrink startup cost under tight budgets.
- \item Replicate on different datasets and larger hyperparameter spaces to assess generality.
- \item Increase run counts to raise test power.
+ \item Increase evaluation budgets to allow amortization of initialization costs.
+ \item Explore adaptive or budget-aware variants of GA and PSO.
+ \item Extend experiments to broader datasets and hyperparameter spaces.
+ \item Use larger run counts to improve statistical power.