Commit 8b39dd9

committed
Update 04-Model-Complexity.qmd
1 parent 624a425 commit 8b39dd9

1 file changed: 04-Model-Complexity.qmd (23 additions & 3 deletions)
````diff
@@ -210,8 +210,6 @@ for n_predictors in predictors_to_iterate:
     linear_reg = linear_model.LinearRegression().fit(X_train, y_train)
     y_train_predicted = linear_reg.predict(X_train)
     y_test_predicted = linear_reg.predict(X_test)
-    print(n_predictors, mean_squared_error(y_train_predicted, y_train))
-    print(n_predictors, mean_squared_error(y_test_predicted, y_test))
     train_err.append(mean_squared_error(y_train_predicted, y_train))
     test_err.append(mean_squared_error(y_test_predicted, y_test))

@@ -224,6 +222,8 @@ plt.legend()
 plt.show()
 ```

+What is the optimal number of predictors we should use for our final model?
+
 ## Bias-Variance Trade-off

 Another way to describe the underfitting/overfitting phenomena is via the **Bias-Variance Trade-off**. It breaks down the testing error of a single model into the following:
@@ -333,7 +333,9 @@ This is a Piecewise Cubic Regression, an example can be seen in the top panel of

 Here, we end up using 8 predictors for our model. We immediately see something that looks off: our model is not continuous at the cutoff point! To fix the problem, we can constrain our model to be continuous: we require the first and second derivatives of the piecewise polynomials to be continuous at the cutoff point. This fix is shown in the bottom panel, which is called **Cubic Spline Regression**. We can increase the number of cutoff points as we like in a piecewise or spline model. This cubic spline model uses $K + 4$ predictors, where $K$ is the number of cutoff points used.

-To pick the number of cutoff points, we can also perform cross validation:
+To pick the number of cutoff points, we can also perform cross validation.
+
+For 10 cutoff points, here is the cross validation result:

 ```{python}
 y, X = model_matrix("MeanBloodPressure ~ BMI + cs(BMI, df=10)", nhanes_tiny)
@@ -347,6 +349,22 @@ scores = cross_val_score(linear_reg, X_train, y_train, cv=5, scoring="neg_mean_squared_error")
 -np.mean(scores)
 ```

+For 5 cutoff points, here is the cross validation result:
+
+```{python}
+y, X = model_matrix("MeanBloodPressure ~ BMI + cs(BMI, df=5)", nhanes_tiny)
+
+X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
+
+linear_reg = linear_model.LinearRegression()
+scores = cross_val_score(linear_reg, X_train, y_train, cv=5, scoring="neg_mean_squared_error")
+
+-scores
+-np.mean(scores)
+```
+
+Looks like 5 cutoff points is better. We then use this model to visualize with the training data:
+
 ```{python}
 y_train_predicted = cross_val_predict(linear_reg, X_train, y_train, cv=5)
@@ -361,6 +379,8 @@ plt.ylim(np.min(nhanes_tiny.MeanBloodPressure), np.max(nhanes_tiny.MeanBloodPressure))
 plt.show()
 ```

+Finally, how does it do on the test set?
+
 ```{python}
 y_test_predicted = linear_reg.fit(X_train, y_train).predict(X_test)
 mean_squared_error(y_test_predicted, y_test)
````
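For reference, the decomposition that the "Bias-Variance Trade-off" section in this diff alludes to is the standard one for squared-error loss at a point $x_0$ (a textbook identity, not taken from the diff itself):

$$
\mathbb{E}\!\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
= \operatorname{Bias}\!\left[\hat{f}(x_0)\right]^2
+ \operatorname{Var}\!\left[\hat{f}(x_0)\right]
+ \operatorname{Var}(\varepsilon),
$$

where $\operatorname{Var}(\varepsilon)$ is the irreducible noise: no choice of model complexity can reduce the testing error below it.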
