@@ -294,7 +294,7 @@ <h3 data-number="3.1.1" class="anchored" data-anchor-id="one-predictor"><span cl
 MeanBloodPressure = \beta_0 + \beta_1 \cdot Age
 \]</span></p>
 <p>Our model would look like the red line fit to our Training data below:</p>
297- < div id ="b6bc60ac " class ="cell " data-execution_count ="1 ">
297+ < div id ="3bb74aa1 " class ="cell " data-execution_count ="1 ">
298298< div class ="sourceCode cell-code " id ="cb1 "> < pre class ="sourceCode python code-with-copy "> < code class ="sourceCode python "> < span id ="cb1-1 "> < a href ="#cb1-1 " aria-hidden ="true " tabindex ="-1 "> </ a > < span class ="im "> import</ span > pandas < span class ="im "> as</ span > pd</ span >
299299< span id ="cb1-2 "> < a href ="#cb1-2 " aria-hidden ="true " tabindex ="-1 "> </ a > < span class ="im "> import</ span > seaborn < span class ="im "> as</ span > sns</ span >
300300< span id ="cb1-3 "> < a href ="#cb1-3 " aria-hidden ="true " tabindex ="-1 "> </ a > < span class ="im "> import</ span > numpy < span class ="im "> as</ span > np</ span >
@@ -350,7 +350,7 @@ <h3 data-number="3.2.1" class="anchored" data-anchor-id="linearity-of-responder-
 <p>The linear regression model assumes that there is a straight-line (linear) relationship between the predictors and the response. It doesn’t require the relationship to be perfectly straight, but rather that, on average, the cloud of points has a linear shape. If that is not true, our predictions will be less accurate.</p>
 <p>We could check this assumption by seeing whether each predictor has a linear relationship with the response, but this is cumbersome with multiple predictors. Rather, we typically calculate the <strong>residual</strong>, which is the difference between the response value and the predicted response value (similar to the model performance metrics we examined last week). Then, we can make a <strong>residual plot</strong> of the predicted response vs. the residual. Ideally, this residual plot should have no pattern - some residuals above 0, some below 0, but no strong trend.</p>
 <p>If there’s a trend in the residual plot, that means there are non-linear associations between some of the predictors and the response.</p>
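 <p>In symbols, for each observation <span class="math inline">\(i\)</span>:</p>
 <p><span class="math display">\[
 residual_i = y_i - \hat{y}_i
 \]</span></p>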
353- < div id ="3d93965a " class ="cell " data-execution_count ="2 ">
353+ < div id ="b94b04d8 " class ="cell " data-execution_count ="2 ">
354354< div class ="sourceCode cell-code " id ="cb2 "> < pre class ="sourceCode python code-with-copy "> < code class ="sourceCode python "> < span id ="cb2-1 "> < a href ="#cb2-1 " aria-hidden ="true " tabindex ="-1 "> </ a > residual < span class ="op "> =</ span > y_train < span class ="op "> -</ span > y_train_predicted</ span >
355355< span id ="cb2-2 "> < a href ="#cb2-2 " aria-hidden ="true " tabindex ="-1 "> </ a > </ span >
356356< span id ="cb2-3 "> < a href ="#cb2-3 " aria-hidden ="true " tabindex ="-1 "> </ a > plt.clf()</ span >
@@ -387,7 +387,7 @@ <h3 data-number="3.2.3" class="anchored" data-anchor-id="predictors-are-not-coli
 <li><p>When there is a collinear relationship among three or more predictors, pairwise methods will fail. We may consider the Variance Inflation Factor (VIF) to detect such relationships, but it doesn’t necessarily tell us which variables to remove (see the sketch after this list).</p></li>
 </ul>
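 <p>Here is a minimal sketch of how VIFs could be computed with <code>statsmodels</code>; the predictor list below is an assumption for illustration, not taken from the lesson:</p>
 <pre class="sourceCode python"><code># A sketch, not from the original notebook: compute the Variance Inflation
 # Factor (VIF) for each numeric predictor. The predictor names are assumed;
 # substitute the numeric columns of your own training set.
 import pandas as pd
 from statsmodels.stats.outliers_influence import variance_inflation_factor
 
 predictors = ["Age", "BMI"]  # assumed numeric columns of nhanes_train
 X = nhanes_train[predictors].dropna().assign(Intercept=1.0)
 
 vif = pd.Series(
     [variance_inflation_factor(X.values, i) for i in range(len(predictors))],
     index=predictors,
 )
 print(vif)  # values above roughly 5-10 are a common flag for collinearity
 </code></pre>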
 <p>Suppose that we consider the predictors of our training set:</p>
390- < div id ="ca468224 " class ="cell " data-execution_count ="3 ">
390+ < div id ="2100a5a9 " class ="cell " data-execution_count ="3 ">
391391< div class ="sourceCode cell-code " id ="cb3 "> < pre class ="sourceCode python code-with-copy "> < code class ="sourceCode python "> < span id ="cb3-1 "> < a href ="#cb3-1 " aria-hidden ="true " tabindex ="-1 "> </ a > < span class ="co "> #some cleanup</ span > </ span >
392392< span id ="cb3-2 "> < a href ="#cb3-2 " aria-hidden ="true " tabindex ="-1 "> </ a > obj_columns < span class ="op "> =</ span > nhanes_train.select_dtypes([< span class ="st "> 'object'</ span > ]).columns</ span >
393393< span id ="cb3-3 "> < a href ="#cb3-3 " aria-hidden ="true " tabindex ="-1 "> </ a > nhanes_train[obj_columns] < span class ="op "> =</ span > nhanes_train[obj_columns].< span class ="bu "> apply</ span > (< span class ="kw "> lambda</ span > x: x.astype(< span class ="st "> 'category'</ span > ))</ span >
@@ -410,7 +410,7 @@ <h3 data-number="3.2.3" class="anchored" data-anchor-id="predictors-are-not-coli
 </div>
 </div>
 <p>Let’s look at a pair of predictors up close:</p>
-<div id="26d02af8" class="cell" data-execution_count="4">
+<div id="44a51e4d" class="cell" data-execution_count="4">
 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a>plt.clf()</span>
 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>ax <span class="op">=</span> sns.regplot(y<span class="op">=</span><span class="st">"Age"</span>, x<span class="op">=</span><span class="st">"BMI"</span>, data<span class="op">=</span>nhanes_train, lowess<span class="op">=</span><span class="va">True</span>, scatter_kws<span class="op">=</span>{<span class="st">'alpha'</span>:<span class="fl">0.1</span>}, line_kws<span class="op">=</span>{<span class="st">'color'</span>:<span class="st">"r"</span>})</span>
 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a>ax.set_xlim([<span class="dv">10</span>, <span class="dv">50</span>])</span>
@@ -450,7 +450,7 @@ <h3 data-number="3.3.1" class="anchored" data-anchor-id="polynomial-regression">
 MeanBloodPressure = \beta_0 + \beta_1 \cdot Age + \beta_2 \cdot Age^2
 \]</span></p>
 <p>This is <em>still</em> a linear model – it is linear in its coefficients – we have simply added a new predictor that gives us a quadratic shape. We use the <a href="https://matthewwardrop.github.io/formulaic/latest/guides/splines/#poly"><code>poly()</code> function</a> to generate our polynomial predictor.</p>
453- < div id ="e15d8dbb " class ="cell " data-execution_count ="5 ">
453+ < div id ="3884ea98 " class ="cell " data-execution_count ="5 ">
454454< div class ="sourceCode cell-code " id ="cb5 "> < pre class ="sourceCode python code-with-copy "> < code class ="sourceCode python "> < span id ="cb5-1 "> < a href ="#cb5-1 " aria-hidden ="true " tabindex ="-1 "> </ a > y_train, X_train < span class ="op "> =</ span > model_matrix(< span class ="st "> "MeanBloodPressure ~ poly(Age, degree=2, raw=True)"</ span > , nhanes_train)</ span >
455455< span id ="cb5-2 "> < a href ="#cb5-2 " aria-hidden ="true " tabindex ="-1 "> </ a > </ span >
456456< span id ="cb5-3 "> < a href ="#cb5-3 " aria-hidden ="true " tabindex ="-1 "> </ a > linear_reg < span class ="op "> =</ span > linear_model.LinearRegression()</ span >
@@ -472,7 +472,7 @@ <h3 data-number="3.3.1" class="anchored" data-anchor-id="polynomial-regression">
 </div>
 </div>
 <p>Let’s look at our Residual Plot:</p>
-<div id="aec8df64" class="cell" data-execution_count="6">
+<div id="017acfef" class="cell" data-execution_count="6">
 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a>residual <span class="op">=</span> y_train <span class="op">-</span> y_train_predicted</span>
 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a></span>
 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a>plt.clf()</span>
@@ -500,7 +500,7 @@ <h2 data-number="3.4" class="anchored" data-anchor-id="overfitting"><span class=
 <p><span class="math display">\[
 MeanBloodPressure = \beta_0 + \beta_1 \cdot BMI
 \]</span></p>
503- < div id ="db4abb69 " class ="cell " data-execution_count ="7 ">
503+ < div id ="273ee9db " class ="cell " data-execution_count ="7 ">
504504< div class ="sourceCode cell-code " id ="cb7 "> < pre class ="sourceCode python code-with-copy "> < code class ="sourceCode python "> < span id ="cb7-1 "> < a href ="#cb7-1 " aria-hidden ="true " tabindex ="-1 "> </ a > < span class ="co "> #Use a small part of the data to illlustrate overfitting.</ span > </ span >
505505< span id ="cb7-2 "> < a href ="#cb7-2 " aria-hidden ="true " tabindex ="-1 "> </ a > nhanes_tiny < span class ="op "> =</ span > nhanes.sample(n< span class ="op "> =</ span > < span class ="dv "> 300</ span > , random_state< span class ="op "> =</ span > < span class ="dv "> 2</ span > )</ span >
506506< span id ="cb7-3 "> < a href ="#cb7-3 " aria-hidden ="true " tabindex ="-1 "> </ a > </ span >
@@ -545,7 +545,7 @@ <h2 data-number="3.4" class="anchored" data-anchor-id="overfitting"><span class=
 </div>
 <p>We see that Training Error &lt; Testing Error.</p>
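 <p>As a reminder, these two errors can be computed along the following lines. This is a sketch: the <code>X_test</code>/<code>y_test</code> names and the use of mean squared error as the metric are assumptions.</p>
 <pre class="sourceCode python"><code># Sketch: compare the fitted model's error on the data it was trained on
 # with its error on held-out data (the names of the splits are assumed).
 from sklearn.metrics import mean_squared_error
 
 train_error = mean_squared_error(y_train, linear_reg.predict(X_train))
 test_error = mean_squared_error(y_test, linear_reg.predict(X_test))
 print(train_error, test_error)  # overfitting shows up as test error far above train error
 </code></pre>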
 <p>Let’s look at what happens if we increase the flexibility of the model by fitting it with a degree-2 polynomial:</p>
548- < div id ="8d442cd2 " class ="cell " data-execution_count ="8 ">
548+ < div id ="5986dd1a " class ="cell " data-execution_count ="8 ">
549549< div class ="sourceCode cell-code " id ="cb9 "> < pre class ="sourceCode python code-with-copy "> < code class ="sourceCode python "> < span id ="cb9-1 "> < a href ="#cb9-1 " aria-hidden ="true " tabindex ="-1 "> </ a > p_degree < span class ="op "> =</ span > < span class ="dv "> 2</ span > </ span >
550550< span id ="cb9-2 "> < a href ="#cb9-2 " aria-hidden ="true " tabindex ="-1 "> </ a > y, X < span class ="op "> =</ span > model_matrix(< span class ="st "> "MeanBloodPressure ~ BMI + poly(BMI, degree="</ span > < span class ="op "> +</ span > < span class ="bu "> str</ span > (p_degree) < span class ="op "> +</ span > < span class ="st "> ")"</ span > , nhanes_tiny)</ span >
551551< span id ="cb9-3 "> < a href ="#cb9-3 " aria-hidden ="true " tabindex ="-1 "> </ a > </ span >
@@ -589,7 +589,7 @@ <h2 data-number="3.4" class="anchored" data-anchor-id="overfitting"><span class=
 </div>
 <p>We see that both Training and Testing error decreased slightly!</p>
 <p>What happens if we keep increasing the model complexity?</p>
592- < div id ="65882b57 " class ="cell " data-execution_count ="9 ">
592+ < div id ="bc977416 " class ="cell " data-execution_count ="9 ">
593593< div class ="sourceCode cell-code " id ="cb11 "> < pre class ="sourceCode python code-with-copy "> < code class ="sourceCode python "> < span id ="cb11-1 "> < a href ="#cb11-1 " aria-hidden ="true " tabindex ="-1 "> </ a > < span class ="cf "> for</ span > p_degree < span class ="kw "> in</ span > [< span class ="dv "> 4</ span > , < span class ="dv "> 10</ span > ]:</ span >
594594< span id ="cb11-2 "> < a href ="#cb11-2 " aria-hidden ="true " tabindex ="-1 "> </ a > y, X < span class ="op "> =</ span > model_matrix(< span class ="st "> "MeanBloodPressure ~ BMI + poly(BMI, degree="</ span > < span class ="op "> +</ span > < span class ="bu "> str</ span > (p_degree) < span class ="op "> +</ span > < span class ="st "> ")"</ span > , nhanes_tiny)</ span >
595595< span id ="cb11-3 "> < a href ="#cb11-3 " aria-hidden ="true " tabindex ="-1 "> </ a > </ span >
@@ -642,7 +642,7 @@ <h2 data-number="3.4" class="anchored" data-anchor-id="overfitting"><span class=
 </div>
 </div>
 <p>Let’s summarize it:</p>
-<div id="f87573b4" class="cell" data-execution_count="10">
+<div id="f63ac4ce" class="cell" data-execution_count="10">
 <div class="sourceCode cell-code" id="cb14"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a>train_err <span class="op">=</span> []</span>
 <span id="cb14-2"><a href="#cb14-2" aria-hidden="true" tabindex="-1"></a>test_err <span class="op">=</span> []</span>
 <span id="cb14-3"><a href="#cb14-3" aria-hidden="true" tabindex="-1"></a>polynomials <span class="op">=</span> <span class="bu">list</span>(<span class="bu">range</span>(<span class="dv">1</span>, <span class="dv">10</span>))</span>
@@ -718,7 +718,7 @@ <h3 data-number="3.5.2" class="anchored" data-anchor-id="parameter-inference"><s
 <p><span class="math inline">\(\beta_0\)</span> is a parameter describing the intercept of the line, and <span class="math inline">\(\beta_1\)</span> is a parameter describing the slope of the line.</p>
 <p>Suppose that from fitting the model on the Training Set, <span class="math inline">\(\beta_1=2\)</span>. That means increasing <span class="math inline">\(BMI\)</span> by 1 is associated with an increase in <span class="math inline">\(MeanBloodPressure\)</span> of 2, on average. This measures the strength of association between a predictor and the outcome.</p>
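 <p>To see why, subtract the model’s prediction at <span class="math inline">\(BMI\)</span> from its prediction at <span class="math inline">\(BMI + 1\)</span>; every term but the slope cancels:</p>
 <p><span class="math display">\[
 (\beta_0 + \beta_1 \cdot (BMI + 1)) - (\beta_0 + \beta_1 \cdot BMI) = \beta_1 = 2
 \]</span></p>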
 <p>Let’s see this in practice:</p>
-<div id="3c6da906" class="cell" data-execution_count="11">
+<div id="bdb4dd18" class="cell" data-execution_count="11">
 <div class="sourceCode cell-code" id="cb15"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> statsmodels.api <span class="im">as</span> sm</span>
 <span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a></span>
 <span id="cb15-3"><a href="#cb15-3" aria-hidden="true" tabindex="-1"></a>y, X <span class="op">=</span> model_matrix(<span class="st">"MeanBloodPressure ~ BMI"</span>, nhanes_tiny)</span>
@@ -757,7 +757,7 @@ <h3 data-number="3.5.2" class="anchored" data-anchor-id="parameter-inference"><s
 </tr>
 <tr class="odd">
 <td data-quarto-table-cell-role="th">Time:</td>
-<td>21:19:36</td>
+<td>22:00:55</td>
 <td data-quarto-table-cell-role="th">Log-Likelihood:</td>
 <td>-502.67</td>
 </tr>
@@ -861,7 +861,7 @@ <h2 data-number="3.6" class="anchored" data-anchor-id="appendix-interactions"><s
 <p>Here is another way to extend the Linear Model:</p>
 <p>Suppose we think that <span class="math inline">\(BMI\)</span> and <span class="math inline">\(Gender\)</span> may be good predictors of <span class="math inline">\(MeanBloodPressure\)</span>:</p>
 <p>Let’s explore the relationship between <span class="math inline">\(MeanBloodPressure\)</span> and <span class="math inline">\(BMI\)</span> separately for each value of <span class="math inline">\(Gender\)</span>.</p>
864- < div id ="eb9ab765 " class ="cell " data-execution_count ="12 ">
864+ < div id ="06d0b30f " class ="cell " data-execution_count ="12 ">
865865< div class ="sourceCode cell-code " id ="cb16 "> < pre class ="sourceCode python code-with-copy "> < code class ="sourceCode python "> < span id ="cb16-1 "> < a href ="#cb16-1 " aria-hidden ="true " tabindex ="-1 "> </ a > plt.clf()</ span >
866866< span id ="cb16-2 "> < a href ="#cb16-2 " aria-hidden ="true " tabindex ="-1 "> </ a > ax < span class ="op "> =</ span > sns.lmplot(y< span class ="op "> =</ span > < span class ="st "> "MeanBloodPressure"</ span > , x< span class ="op "> =</ span > < span class ="st "> "BMI"</ span > , hue< span class ="op "> =</ span > < span class ="st "> "Gender"</ span > , data< span class ="op "> =</ span > nhanes_train, lowess< span class ="op "> =</ span > < span class ="va "> False</ span > , scatter_kws< span class ="op "> =</ span > {< span class ="st "> 'alpha'</ span > :< span class ="fl "> 0.1</ span > })</ span >
867867< span id ="cb16-3 "> < a href ="#cb16-3 " aria-hidden ="true " tabindex ="-1 "> </ a > ax.< span class ="bu "> set</ span > (xlim< span class ="op "> =</ span > (< span class ="dv "> 10</ span > , < span class ="dv "> 50</ span > )) </ span >
@@ -887,7 +887,7 @@ <h2 data-number="3.6" class="anchored" data-anchor-id="appendix-interactions"><s
 MeanBloodPressure = \beta_0 + \beta_1 \cdot BMI + \beta_2 \cdot Gender + \beta_3 \cdot BMI \cdot Gender
 \]</span></p>
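 <p>One way to read this model, assuming <span class="math inline">\(Gender\)</span> is dummy-coded as 0/1: the slope on <span class="math inline">\(BMI\)</span> now depends on <span class="math inline">\(Gender\)</span>. Grouping terms by each value of <span class="math inline">\(Gender\)</span>:</p>
 <p><span class="math display">\[
 Gender = 0: \quad MeanBloodPressure = \beta_0 + \beta_1 \cdot BMI
 \]</span></p>
 <p><span class="math display">\[
 Gender = 1: \quad MeanBloodPressure = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) \cdot BMI
 \]</span></p>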
 <p>Let’s see what happens:</p>
-<div id="1a6d315e" class="cell" data-execution_count="13">
+<div id="359383f2" class="cell" data-execution_count="13">
 <div class="sourceCode cell-code" id="cb18"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true" tabindex="-1"></a>y_train, X_train <span class="op">=</span> model_matrix(<span class="st">"MeanBloodPressure ~ BMI + Gender + BMI*Gender"</span>, nhanes_train)</span>
 <span id="cb18-2"><a href="#cb18-2" aria-hidden="true" tabindex="-1"></a>linear_reg <span class="op">=</span> linear_model.LinearRegression()</span>
 <span id="cb18-3"><a href="#cb18-3" aria-hidden="true" tabindex="-1"></a>linear_reg <span class="op">=</span> linear_reg.fit(X_train, y_train)</span>