Commit 8b39dd9

committed
Update 04-Model-Complexity.qmd
1 parent 624a425 commit 8b39dd9

1 file changed: 04-Model-Complexity.qmd (23 additions & 3 deletions)
````diff
@@ -210,8 +210,6 @@ for n_predictors in predictors_to_iterate:
     linear_reg = linear_model.LinearRegression().fit(X_train, y_train)
     y_train_predicted = linear_reg.predict(X_train)
     y_test_predicted = linear_reg.predict(X_test)
-    print(n_predictors, mean_squared_error(y_train_predicted, y_train))
-    print(n_predictors, mean_squared_error(y_test_predicted, y_test))
     train_err.append(mean_squared_error(y_train_predicted, y_train))
     test_err.append(mean_squared_error(y_test_predicted, y_test))

@@ -224,6 +222,8 @@ plt.legend()
 plt.show()
 ```

+What is the optimal number of predictors we should use for our final model?
+
 ## Bias-Variance Trade-off

 Another way to describe the underfitting/overfitting phenomena is via the **Bias-Variance Trade-off**. It breaks down the testing error of a single model into the following:
@@ -333,7 +333,9 @@ This is a Piecewise Cubic Regression, an example can be seen in the top panel of

 Here, we end up using 8 predictors for our model. We immediately see something that looks off: our model is not continuous at the cutoff point! To fix the problem, we can constrain our model to be continuous: we require the first and second derivatives of the piecewise polynomials to be continuous at the cutoff point. This fix is shown in the bottom panel, which is called **Cubic Spline Regression**. We can increase the number of cutoff points as we like in a piecewise or spline model. This cubic spline model uses $K + 4$ predictors, where $K$ is the number of cutoff points used.

-To pick the number of cutoff points, we can also perform cross validation:
+To pick the number of cutoff points, we can also perform cross validation.
+
+For 10 cutoff points, here is the cross validation result:

 ```{python}
 y, X = model_matrix("MeanBloodPressure ~ BMI + cs(BMI, df=10)", nhanes_tiny)
@@ -347,6 +349,22 @@ scores = cross_val_score(linear_reg, X_train, y_train, cv=5, scoring="neg_mean_squared_error")
 -np.mean(scores)
 ```

+For 5 cutoff points, here is the cross validation result:
+
+```{python}
+y, X = model_matrix("MeanBloodPressure ~ BMI + cs(BMI, df=5)", nhanes_tiny)
+
+X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
+
+linear_reg = linear_model.LinearRegression()
+scores = cross_val_score(linear_reg, X_train, y_train, cv=5, scoring="neg_mean_squared_error")
+
+-scores
+-np.mean(scores)
+```
+
+Looks like 5 cutoff points is better. We then use this model to visualize with the training data:
+
 ```{python}
 y_train_predicted = cross_val_predict(linear_reg, X_train, y_train, cv=5)
@@ -361,6 +379,8 @@ plt.ylim(np.min(nhanes_tiny.MeanBloodPressure), np.max(nhanes_tiny.MeanBloodPressure))
 plt.show()
 ```

+Finally, how does it do on the test set?
+
 ```{python}
 y_test_predicted = linear_reg.fit(X_train, y_train).predict(X_test)
 mean_squared_error(y_test_predicted, y_test)
````
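For reference, the decomposition that the "Bias-Variance Trade-off" section in this diff alludes to is the standard one for squared-error loss at a point $x_0$ (a textbook identity, not taken from the diff itself):

$$
\mathbb{E}\!\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
= \operatorname{Bias}\!\left[\hat{f}(x_0)\right]^2
+ \operatorname{Var}\!\left[\hat{f}(x_0)\right]
+ \operatorname{Var}(\varepsilon),
$$

where $\operatorname{Var}(\varepsilon)$ is the irreducible noise: no choice of model complexity can reduce the testing error below it.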
