
Commit 959a8ae

Update report
1 parent 81b628c commit 959a8ae

5 files changed

Lines changed: 31 additions & 5 deletions


report/main.tex

Lines changed: 2 additions & 0 deletions
@@ -66,6 +66,8 @@
\input{sections/2_problem_formulation}
\input{sections/3_experiment}
\input{sections/4_results}
+\input{sections/5_limitations}
+\input{sections/6_conclusion}

% BEFORE END
\clearpage

report/sections/1_introduction.tex

Lines changed: 1 addition & 3 deletions
@@ -1,8 +1,6 @@
\section{Introduction}

-The performance of machine learning models often depends on their hyperparameters—high-level configuration variables like learning rate or batch size that control the training process. Finding the optimal set of these configurations, or Hyperparameter Optimization (HPO), is a significant bottleneck in model development.
-
-HPO can be framed as a software verification problem. In this context, the model is the software under test and a "defect" is a suboptimal hyperparameter configuration that causes the model to fail its performance specifications, such as by exhibiting high loss, poor generalization, or unstable training. HPO thus functions as an automated test driver, searching the configuration space to find a set of hyperparameters that verifies the model's performance against a pre-defined quality specification.
+The performance of machine learning models relies heavily on hyperparameters—configuration variables like learning rate and batch size that control the training process. Identifying the optimal configuration is a significant bottleneck in model development due to the high computational cost of evaluation. To address this complexity, we frame Hyperparameter Optimization (HPO) as a software verification problem. In this context, the model functions as the “software under test,” where a suboptimal configuration is treated as a “defect” that causes the system to violate its performance specifications (e.g., high loss or instability). HPO therefore acts as an automated test driver, searching the configuration space to verify model performance against defined quality criteria.

\subsection{Evaluation Criteria}
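As a rough sketch of this framing (not code from the report), the optimizer can be viewed as a test driver that samples configurations and checks the best one found against a quality specification; train_and_score and the loss threshold below are hypothetical stand-ins.

import random

SPEC_MAX_LOSS = 0.5  # hypothetical performance specification the tuned model must meet

def train_and_score(config):
    # Stand-in for training the model under test with `config` and returning its
    # validation loss; a random value keeps the sketch self-contained and runnable.
    return random.random()

def hpo_as_test_driver(search_space, budget=50):
    # The optimizer acts as an automated test driver: it samples configurations and
    # checks the best one found against the performance specification.
    best_config, best_loss = None, float("inf")
    for _ in range(budget):
        config = {name: random.choice(values) for name, values in search_space.items()}
        loss = train_and_score(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    passes_spec = best_loss <= SPEC_MAX_LOSS  # "verified" only if the spec is met
    return best_config, best_loss, passes_spec

space = {"learning_rate": [1e-3, 1e-2, 1e-1], "batch_size": [16, 32, 64]}
print(hpo_as_test_driver(space))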

report/sections/4_results.tex

Lines changed: 2 additions & 2 deletions
@@ -15,7 +15,7 @@ \subsection{RQ1: Effectiveness and Convergence}

For the \textbf{Decision Tree} and \textbf{KNN}, the search space was relatively small. Consequently, all three optimizers rapidly converged to near-identical optimal configurations. As shown in the final test performance (Figure \ref{fig:test_perf}), the DT achieved a fitness of $\approx 0.3384$, while the KNN achieved $\approx 0.4308$.

-We have also plotted the mean and standard deviation of the current fitness for each $n$ evaluations for each optimizer. As shown in Figure \ref{fig:evaluations}, the current fitness is generally improving over the evaluations, though with varying efficiency.
+We also plot, for each optimizer, the mean and standard deviation of the current fitness at each of the $n$ evaluations. The curves are smoothed for readability (with a smoothing factor of 0.5), while the shaded regions reflect the underlying variation. As shown in Figure \ref{fig:evaluations}, the current fitness generally improves over the evaluations, though with varying efficiency.

\begin{figure}[H]
\centering
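A minimal sketch of how such smoothed mean-and-spread curves could be generated, assuming a hypothetical fitness_runs array of shape (runs, evaluations); the report's actual plotting code is not shown, so the names here are illustrative.

import numpy as np
import matplotlib.pyplot as plt

def smooth(values, factor=0.5):
    # Exponential moving average using the smoothing factor of 0.5 stated above.
    out, prev = [], values[0]
    for v in values:
        prev = factor * prev + (1.0 - factor) * v
        out.append(prev)
    return np.asarray(out)

def plot_convergence(fitness_runs, label):
    # fitness_runs: hypothetical (n_runs, n_evaluations) array of current fitness per run.
    mean = fitness_runs.mean(axis=0)
    std = fitness_runs.std(axis=0)
    x = np.arange(1, fitness_runs.shape[1] + 1)
    plt.plot(x, smooth(mean), label=label)                  # smoothed mean curve
    plt.fill_between(x, mean - std, mean + std, alpha=0.2)  # unsmoothed run-to-run spread
    plt.xlabel("Evaluations")
    plt.ylabel("Current fitness")
    plt.legend()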
@@ -76,4 +76,4 @@ \subsection{Statistical Significance}
\item \textbf{Memetic Variants:} GA-Memetic comparisons (vs GA-Standard, PSO, RS) are all non-significant ($p > 0.05$), showing no measurable improvement over the standard GA under our budget.
\end{itemize}

-Given the GA population size of $30$ and the strict budget of $50$ evaluations, the algorithm is structurally capped at fewer than three full generations: $30$ evaluations are spent on initialization, leaving only $20$ offspring evaluations (about $0.67$ of a generation). Roughly $60\%$ of the budget is therefore consumed by warm-up sampling, so the GA behaves similarly to Random Search under this constraint. This limited evolutionary pressure helps explain the non-significant differences in Table \ref{tab:wilcoxon}, as crossover and mutation had too few iterations to drive convergence.
+Given the GA population size of $30$ and the strict budget of $50$ evaluations, the algorithm is structurally capped at fewer than two full generations: $30$ evaluations are spent on initialization, leaving only $20$ offspring evaluations (about $0.67$ of a generation). Roughly $60\%$ of the budget is therefore consumed by warm-up sampling, so the GA behaves similarly to Random Search under this constraint. PSO faces the same tight budget and likewise shows no statistically significant advantage. Under such sub-$2\times$-population budgets, Random Search is at least as effective in practice because it pays no population-initialization overhead. The value of GA (and possibly PSO) would likely emerge only with larger budgets that permit 10+ generations, where exploitation phases can amortize their startup cost.
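In symbols, with population size $P = 30$ and evaluation budget $B = 50$, the generation count and initialization share work out as follows (a worked restatement of the numbers above):

\[
N_{\text{gen}} = 1 + \frac{B - P}{P} = 1 + \frac{50 - 30}{30} \approx 1.67,
\qquad
\frac{P}{B} = \frac{30}{50} = 60\%.
\]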

report/sections/5_limitations.tex

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+\section{Limitations}
+
+\paragraph{Insufficient Evolutionary Iterations.} The primary threat to validity is the interaction between the population size ($P=30$) and the fixed evaluation budget ($B=50$). This ratio mathematically restricted the Genetic Algorithm to fewer than two full generations ($N_{gen} \approx 1.67$). Consequently, 60\% of the computational budget was consumed by the random initialization phase (Generation 0), leaving insufficient iterations for the metaheuristic mechanisms (crossover, mutation, and selection) to drive convergence away from the initial random distribution.
+
+\paragraph{Statistical Power.} The experiment utilized $N=10$ independent runs per optimizer-model pair. While standard for exploratory HPO studies, this sample size limits the statistical power of the Wilcoxon signed-rank test, potentially masking small but consistent effect sizes between the algorithms.
+
+\paragraph{Scope Validity.} The evaluation was restricted to grayscale CIFAR-10 and fixed hyperparameter search spaces. These results may not generalize to higher-dimensional search spaces (e.g., full RGB imagery or deeper architectures), where the ``curse of dimensionality'' might differentiate global search strategies (GA) from local ones (PSO) more effectively.
+
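A minimal sketch of the paired comparison this paragraph refers to, assuming ten final fitness values per optimizer; the arrays are random placeholders rather than the study's data.

import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# Hypothetical final fitness per run (N = 10) for two optimizers; placeholders only.
ga_final = rng.normal(loc=0.77, scale=0.01, size=10)
rs_final = rng.normal(loc=0.77, scale=0.01, size=10)

# Paired Wilcoxon signed-rank test on the 10 run pairs; with so few pairs the test
# has limited power, so small but consistent differences may stay non-significant.
stat, p_value = wilcoxon(ga_final, rs_final)
print(f"W = {stat:.1f}, p = {p_value:.3f}")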

report/sections/6_conclusion.tex

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+\section{Conclusion}
+
+This study compared metaheuristic optimizers (GA, PSO) against a Randomized Search (RS) baseline under a strict computational budget. Statistical testing confirmed that neither GA nor PSO significantly outperformed RS for any of the tested Decision Tree, KNN, or CNN models ($p > 0.05$); all optimizers converged to statistically indistinguishable fitness plateaus (e.g., $\approx 0.77$ for the CNN).
+
+This null result is attributed to the initialization overhead of population-based methods within a micro-budget regime. The Genetic Algorithm, for example, expended 60\% of its budget on the initial random population, leaving too few evaluations for the evolutionary operators to amortize this cost and drive meaningful improvement.
+
+We conclude that for micro-budget HPO tasks, where the total evaluation budget is not substantially larger than the population size, the added complexity of metaheuristics is unjustified. In such scenarios, Randomized Search may be the more effective strategy.
+
+Future work should:
+
+\begin{itemize}
+\item Extend evaluation budgets to observe whether GA/PSO advantages emerge once initialization overhead is amortized.
+\item Test budget-aware variants that shrink startup cost under tight budgets.
+\item Replicate on different datasets and larger hyperparameter spaces to assess generality.
+\item Increase the number of runs per optimizer to raise the power of the statistical tests.
+\end{itemize}
