
Commit 2524d6c

2 parents 8b39dd9 + 5a4916a commit 2524d6c

10 files changed

Lines changed: 279 additions & 187 deletions

File tree

docs/01-Problem-Setup.html

Lines changed: 14 additions & 14 deletions
Large diffs are not rendered by default.

docs/02-Regression.html

Lines changed: 12 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -302,7 +302,7 @@ <h3 data-number="3.1.1" class="anchored" data-anchor-id="one-predictor"><span cl
302302
MeanBloodPressure= \beta_0 + \beta_1 \cdot Age
303303
\]</span></p>
304304
<p>Our model would look like the red line fitted to our Training data below:</p>
305 - <div id="7d61d74b" class="cell" data-execution_count="1">
305 + <div id="048c853d" class="cell" data-execution_count="1">
306306
<div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> pandas <span class="im">as</span> pd</span>
307307
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> seaborn <span class="im">as</span> sns</span>
308308
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span>
@@ -358,7 +358,7 @@ <h3 data-number="3.2.1" class="anchored" data-anchor-id="linearity-of-responder-
358358
<p>The linear regression model assumes a straight-line (linear) relationship between the predictors and the response. The relationship does not need to be perfect; rather, the cloud of points should have a linear shape on average. If that is not true, our predictions will be less accurate.</p>
359359
<p>We can check this relationship by seeing whether each predictor is linearly related to the response, but this is cumbersome with multiple predictors. Instead, we typically calculate the <strong>residual</strong>, which is the difference between the response value and the predicted response value (similar to the model performance metrics we examined last week). Then, we can make a <strong>residual plot</strong> of the predicted response vs.&nbsp;residual. Ideally, this residual plot should have no pattern - some residuals above 0, some below 0, but no strong trend.</p>
360360
<p>If there’s a trend in the data, that means there are non-linear associations between some of the predictors and the response.</p>
361 - <div id="5058ef05" class="cell" data-execution_count="2">
361 + <div id="c7c202b4" class="cell" data-execution_count="2">
362362
<div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>residual <span class="op">=</span> y_train <span class="op">-</span> y_train_predicted</span>
363363
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>plot_df <span class="op">=</span> pd.DataFrame({<span class="st">'Age'</span>: X_train.Age, <span class="st">'Predicted_Response'</span>: np.ravel(y_train_predicted), <span class="st">'Residual'</span>: np.ravel(residual)})</span>
364364
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a></span>
@@ -375,7 +375,7 @@ <h3 data-number="3.2.1" class="anchored" data-anchor-id="linearity-of-responder-
375375
</div>
376376
<p>We see there’s a slight curve in our residual plot. We will look at ways to deal with this later in this lecture.</p>
377377
<p>In a model with more predictors, we can dig into more detail by making a residual plot of a single predictor vs.&nbsp;residual. This is often used to figure out which predictor is contributing to the shape of the predicted response vs.&nbsp;residual plot.</p>
378 - <div id="58d10fc7" class="cell" data-execution_count="3">
378 + <div id="9b7266e6" class="cell" data-execution_count="3">
379379
<div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a>plt.clf()</span>
380380
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>ax <span class="op">=</span> sns.regplot(x<span class="op">=</span><span class="st">"Age"</span>, y<span class="op">=</span><span class="st">"Residual"</span>, data<span class="op">=</span>plot_df, lowess<span class="op">=</span><span class="va">True</span>, scatter_kws<span class="op">=</span>{<span class="st">'alpha'</span>:<span class="fl">0.2</span>}, line_kws<span class="op">=</span>{<span class="st">'color'</span>:<span class="st">"r"</span>})</span>
381381
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a>plt.show()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -407,7 +407,7 @@ <h3 data-number="3.2.3" class="anchored" data-anchor-id="predictors-are-not-coli
407407
<li><p>When there is a collinear relationship among three or more predictors, pairwise methods will fail. We may use the Variance Inflation Factor to detect such relationships, but it doesn’t necessarily tell us which variables to remove.</p></li>
408408
</ul>
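The Variance Inflation Factor mentioned above can be computed with nothing beyond NumPy and pandas. A minimal sketch, not part of the lecture's own code: the `vif` helper and the demo columns below are made up for illustration, with `c` built as a near-linear combination of `a` and `b` so that no single pairwise plot would flag it.

```python
import numpy as np
import pandas as pd

def vif(df):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on all the other columns with ordinary least squares."""
    X = df.to_numpy(dtype=float)
    scores = {}
    for j, name in enumerate(df.columns):
        y = X[:, j]
        # Design matrix: intercept plus every column except j.
        Z = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2 = 1 - resid.var() / y.var()
        scores[name] = 1.0 / (1.0 - r2)
    return pd.Series(scores)

# Hypothetical data: "c" is (almost) a + b, so its VIF is large even
# though no single pairwise relationship is perfectly collinear.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
demo["c"] = demo["a"] + demo["b"] + rng.normal(scale=0.05, size=200)
print(vif(demo).round(1))
```

If you prefer a library routine, statsmodels ships `variance_inflation_factor` in `statsmodels.stats.outliers_influence`. A common rule of thumb treats VIF above roughly 5–10 as a sign of problematic collinearity.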
409409
<p>Suppose that we consider the predictors of our training set:</p>
410 - <div id="e4b64717" class="cell" data-execution_count="4">
410 + <div id="f4c42f5c" class="cell" data-execution_count="4">
411411
<div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co">#some cleanup</span></span>
412412
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>obj_columns <span class="op">=</span> nhanes_train.select_dtypes([<span class="st">'object'</span>]).columns</span>
413413
<span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a>nhanes_train[obj_columns] <span class="op">=</span> nhanes_train[obj_columns].<span class="bu">apply</span>(<span class="kw">lambda</span> x: x.astype(<span class="st">'category'</span>))</span>
@@ -430,7 +430,7 @@ <h3 data-number="3.2.3" class="anchored" data-anchor-id="predictors-are-not-coli
430430
</div>
431431
</div>
432432
<p>Let’s look at a pair of predictors up close:</p>
433 - <div id="41349cec" class="cell" data-execution_count="5">
433 + <div id="1b420b29" class="cell" data-execution_count="5">
434434
<div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>plt.clf()</span>
435435
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a>ax <span class="op">=</span> sns.regplot(y<span class="op">=</span><span class="st">"Age"</span>, x<span class="op">=</span><span class="st">"Poverty"</span>, data<span class="op">=</span>nhanes_train, lowess<span class="op">=</span><span class="va">True</span>, scatter_kws<span class="op">=</span>{<span class="st">'alpha'</span>:<span class="fl">0.1</span>}, line_kws<span class="op">=</span>{<span class="st">'color'</span>:<span class="st">"r"</span>})</span>
436436
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a>plt.show()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -469,7 +469,7 @@ <h3 data-number="3.3.1" class="anchored" data-anchor-id="polynomial-regression">
469469
MeanBloodPressure= \beta_0 + \beta_1 \cdot Age + \beta_2 \cdot Age^2
470470
\]</span></p>
471471
<p>This is <em>still</em> a linear model – we have added a new predictor that gives us a quadratic shape. We use the <a href="https://matthewwardrop.github.io/formulaic/latest/guides/splines/#poly"><code>poly()</code> function</a> to generate our polynomial predictor.</p>
472 - <div id="d55a18b9" class="cell" data-execution_count="6">
472 + <div id="0030da4e" class="cell" data-execution_count="6">
473473
<div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a>y_train, X_train <span class="op">=</span> model_matrix(<span class="st">"MeanBloodPressure ~ poly(Age, degree=2, raw=True)"</span>, nhanes_train)</span>
474474
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a></span>
475475
<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a>linear_reg <span class="op">=</span> linear_model.LinearRegression()</span>
@@ -491,7 +491,7 @@ <h3 data-number="3.3.1" class="anchored" data-anchor-id="polynomial-regression">
491491
</div>
492492
</div>
493493
<p>Let’s look at our Residual Plot:</p>
494 - <div id="c6bccb5b" class="cell" data-execution_count="7">
494 + <div id="7b95ca35" class="cell" data-execution_count="7">
495495
<div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>residual <span class="op">=</span> y_train <span class="op">-</span> y_train_predicted</span>
496496
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a></span>
497497
<span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a>plot_df <span class="op">=</span> pd.DataFrame({<span class="st">'y_train_predicted'</span>: np.ravel(y_train_predicted), <span class="st">'residual'</span>: np.ravel(residual)})</span>
@@ -515,7 +515,7 @@ <h2 data-number="3.4" class="anchored" data-anchor-id="interactions"><span class
515515
<p>Here is another way to extend the Linear Model:</p>
516516
<p>Suppose we think that <span class="math inline">\(BMI\)</span> and <span class="math inline">\(Gender\)</span> may be good predictors of <span class="math inline">\(MeanBloodPressure\)</span>:</p>
517517
<p>Let’s explore the relationship between <span class="math inline">\(MeanBloodPressure\)</span> and <span class="math inline">\(BMI\)</span> separately for values of <span class="math inline">\(Gender\)</span>.</p>
518 - <div id="1178162d" class="cell" data-execution_count="8">
518 + <div id="8622e250" class="cell" data-execution_count="8">
519519
<div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a>plt.clf()</span>
520520
<span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>ax <span class="op">=</span> sns.lmplot(y<span class="op">=</span><span class="st">"MeanBloodPressure"</span>, x<span class="op">=</span><span class="st">"BMI"</span>, hue<span class="op">=</span><span class="st">"Gender"</span>, data<span class="op">=</span>nhanes_train, lowess<span class="op">=</span><span class="va">False</span>, scatter_kws<span class="op">=</span>{<span class="st">'alpha'</span>:<span class="fl">0.1</span>})</span>
521521
<span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a>ax.<span class="bu">set</span>(xlim<span class="op">=</span>(<span class="dv">10</span>, <span class="dv">50</span>)) </span>
@@ -541,7 +541,7 @@ <h2 data-number="3.4" class="anchored" data-anchor-id="interactions"><span class
541541
MeanBloodPressure= \beta_0 + \beta_1 \cdot BMI + \beta_2 \cdot Gender + \beta_3 \cdot BMI \cdot Gender
542542
\]</span></p>
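To see what the interaction term does algebraically: with Gender dummy-coded as 0/1, the BMI slope is β1 for the baseline group and β1 + β3 for the other group, so each group gets its own line. A tiny sketch with hypothetical coefficient values (not the fitted ones from this lecture):

```python
# Hypothetical coefficients, chosen only to show the interaction algebra.
b0, b1, b2, b3 = 60.0, 0.5, 4.0, 0.2

def mean_bp(bmi, gender):
    # MeanBloodPressure = b0 + b1*BMI + b2*Gender + b3*BMI*Gender
    return b0 + b1 * bmi + b2 * gender + b3 * bmi * gender

# Per-group slope = change in prediction for a 1-unit change in BMI.
slope_baseline = mean_bp(31, 0) - mean_bp(30, 0)  # = b1      -> 0.5
slope_other = mean_bp(31, 1) - mean_bp(30, 1)     # = b1 + b3 -> 0.7
print(slope_baseline, slope_other)
```

The interaction coefficient β3 is therefore the *difference in slopes* between the two groups; if it is near zero, the two lines are parallel and the interaction buys us nothing.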
543543
<p>Let’s see what happens:</p>
544 - <div id="8dfd399f" class="cell" data-execution_count="9">
544 + <div id="4b1c6bce" class="cell" data-execution_count="9">
545545
<div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a>y_train, X_train <span class="op">=</span> model_matrix(<span class="st">"MeanBloodPressure ~ BMI + Gender + BMI*Gender"</span>, nhanes_train)</span>
546546
<span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a>linear_reg <span class="op">=</span> linear_model.LinearRegression()</span>
547547
<span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a>linear_reg <span class="op">=</span> linear_reg.fit(X_train, y_train)</span>
@@ -584,7 +584,7 @@ <h3 data-number="3.5.2" class="anchored" data-anchor-id="parameter-inference"><s
584584
<p><span class="math inline">\(\beta_0\)</span> is a parameter describing the intercept of the line, and <span class="math inline">\(\beta_1\)</span> is a parameter describing the slope of the line.</p>
585585
<p>Suppose that from fitting the model on the Training Set, <span class="math inline">\(\beta_1=2\)</span>. That means increasing <span class="math inline">\(BMI\)</span> by 1 is associated with an increase in <span class="math inline">\(MeanBloodPressure\)</span> of 2. This measures the strength of association between a variable and the outcome.</p>
586586
<p>Let’s see this in practice:</p>
587 - <div id="a1b51bfd" class="cell" data-execution_count="10">
587 + <div id="f9377a4a" class="cell" data-execution_count="10">
588588
<div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> statsmodels.api <span class="im">as</span> sm</span>
589589
<span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a></span>
590590
<span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a>y, X <span class="op">=</span> model_matrix(<span class="st">"MeanBloodPressure ~ BMI"</span>, nhanes_train)</span>
@@ -617,13 +617,13 @@ <h3 data-number="3.5.2" class="anchored" data-anchor-id="parameter-inference"><s
617617
</tr>
618618
<tr class="even">
619619
<td data-quarto-table-cell-role="th">Date:</td>
620 - <td>Thu, 02 Apr 2026</td>
620 + <td>Mon, 06 Apr 2026</td>
621621
<td data-quarto-table-cell-role="th">Prob (F-statistic):</td>
622622
<td>4.11e-48</td>
623623
</tr>
624624
<tr class="odd">
625625
<td data-quarto-table-cell-role="th">Time:</td>
626 - <td>21:00:32</td>
626 + <td>22:10:18</td>
627627
<td data-quarto-table-cell-role="th">Log-Likelihood:</td>
628628
<td>-10325.</td>
629629
</tr>
