
Commit 3185d5c

2 parents dbd070e + 769f14d commit 3185d5c

8 files changed

Lines changed: 67 additions & 66 deletions

File tree

docs/01-Problem-Setup.html

Lines changed: 14 additions & 14 deletions
Large diffs are not rendered by default.

docs/02-Regression.html

Lines changed: 14 additions & 14 deletions
@@ -294,7 +294,7 @@ <h3 data-number="3.1.1" class="anchored" data-anchor-id="one-predictor"><span cl
294294
MeanBloodPressure= \beta_0 + \beta_1 \cdot Age
295295
\]</span></p>
296296
<p>Our model would look like the red line below, fit to our Training data:</p>
297-
<div id="b6bc60ac" class="cell" data-execution_count="1">
297+
<div id="3bb74aa1" class="cell" data-execution_count="1">
298298
<div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> pandas <span class="im">as</span> pd</span>
299299
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> seaborn <span class="im">as</span> sns</span>
300300
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span>
@@ -350,7 +350,7 @@ <h3 data-number="3.2.1" class="anchored" data-anchor-id="linearity-of-responder-
350350
<p>The linear regression model assumes that there is a straight-line (linear) relationship between the predictors and the response. It doesn’t require the relationship to be perfectly straight, but rather that, on average, the cloud of points has a linear shape. If that is not true, then our predictions will be less accurate.</p>
351351
<p>We could check this relationship by plotting each predictor against the response, but this is cumbersome with multiple predictors. Instead, we typically calculate the <strong>residual</strong>, which is the difference between the observed response value and the predicted response value (similar to the model performance metrics we examined last week). Then, we can make a <strong>residual plot</strong> of the predicted response vs.&nbsp;residual. Ideally, this residual plot should have no pattern - some residuals above 0, some below 0, but no strong trend.</p>
352352
<p>If there’s a trend in the data, that means there are non-linear associations between some of the predictors and the response.</p>
353-
<div id="3d93965a" class="cell" data-execution_count="2">
353+
<div id="b94b04d8" class="cell" data-execution_count="2">
354354
<div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>residual <span class="op">=</span> y_train <span class="op">-</span> y_train_predicted</span>
355355
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a></span>
356356
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>plt.clf()</span>
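The hunk above shows only the first lines of the residual-plot cell. As a self-contained sketch of the same idea, with hypothetical toy data standing in for the NHANES training set (`age`, `y_train`, and the fitted line here are all made up for illustration):

```python
import numpy as np

# Hypothetical toy data: blood pressure rising roughly linearly with age
rng = np.random.default_rng(0)
age = rng.uniform(20, 70, 200)
y_train = 80 + 0.5 * age + rng.normal(0, 5, 200)

# Fit a simple linear model and compute the residuals
beta1, beta0 = np.polyfit(age, y_train, deg=1)
y_train_predicted = beta0 + beta1 * age
residual = y_train - y_train_predicted

# A healthy residual plot scatters around zero with no trend;
# with an intercept in the model, the residuals average to ~0 by construction.
print(round(residual.mean(), 2))
```

Plotting `y_train_predicted` against `residual` (as the cell in the lesson does) should then show a patternless band around zero.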
@@ -387,7 +387,7 @@ <h3 data-number="3.2.3" class="anchored" data-anchor-id="predictors-are-not-coli
387387
<li><p>When there is a collinear relationship between three or more predictors, pairwise methods will fail. We may consider the Variance Inflation Factor to detect them, but it doesn’t tell us which variables to remove.</p></li>
388388
</ul>
389389
<p>Suppose that we consider the predictors of our training set:</p>
390-
<div id="ca468224" class="cell" data-execution_count="3">
390+
<div id="2100a5a9" class="cell" data-execution_count="3">
391391
<div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="co">#some cleanup</span></span>
392392
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>obj_columns <span class="op">=</span> nhanes_train.select_dtypes([<span class="st">'object'</span>]).columns</span>
393393
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a>nhanes_train[obj_columns] <span class="op">=</span> nhanes_train[obj_columns].<span class="bu">apply</span>(<span class="kw">lambda</span> x: x.astype(<span class="st">'category'</span>))</span>
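To make the Variance Inflation Factor idea concrete, here is a minimal sketch with synthetic predictors (`x1`, `x2`, `x3` are invented; the VIF is computed by hand rather than via a library, but the definition is the standard one, VIF_j = 1/(1 - R²_j)):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X (no intercept column)."""
    n, p = X.shape
    out = []
    for j in range(p):
        # Regress column j on all the other columns (plus an intercept)
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        r2 = 1 - resid.var() / X[:, j].var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Hypothetical predictors: x3 is almost exactly x1 + x2, a three-way
# collinearity that pairwise scatterplots would not reveal.
rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
x3 = x1 + x2 + rng.normal(scale=0.05, size=300)
X = np.column_stack([x1, x2, x3])
print(np.round(vif(X), 1))  # all three far above the rule-of-thumb cutoff of 5
```

Note that the VIF flags all three columns as inflated without saying which one to drop, which is exactly the limitation the bullet above describes.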
@@ -410,7 +410,7 @@ <h3 data-number="3.2.3" class="anchored" data-anchor-id="predictors-are-not-coli
410410
</div>
411411
</div>
412412
<p>Let’s look at a pair of predictors up close:</p>
413-
<div id="26d02af8" class="cell" data-execution_count="4">
413+
<div id="44a51e4d" class="cell" data-execution_count="4">
414414
<div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a>plt.clf()</span>
415415
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>ax <span class="op">=</span> sns.regplot(y<span class="op">=</span><span class="st">"Age"</span>, x<span class="op">=</span><span class="st">"BMI"</span>, data<span class="op">=</span>nhanes_train, lowess<span class="op">=</span><span class="va">True</span>, scatter_kws<span class="op">=</span>{<span class="st">'alpha'</span>:<span class="fl">0.1</span>}, line_kws<span class="op">=</span>{<span class="st">'color'</span>:<span class="st">"r"</span>})</span>
416416
<span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a>ax.set_xlim([<span class="dv">10</span>, <span class="dv">50</span>])</span>
@@ -450,7 +450,7 @@ <h3 data-number="3.3.1" class="anchored" data-anchor-id="polynomial-regression">
450450
MeanBloodPressure= \beta_0 + \beta_1 \cdot Age + \beta_2 \cdot Age^2
451451
\]</span></p>
452452
<p>This is <em>still</em> a linear model – we have added a new predictor that gives us a quadratic shape. We use the <a href="https://matthewwardrop.github.io/formulaic/latest/guides/splines/#poly"><code>poly()</code> function</a> to generate our polynomial predictor.</p>
453-
<div id="e15d8dbb" class="cell" data-execution_count="5">
453+
<div id="3884ea98" class="cell" data-execution_count="5">
454454
<div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>y_train, X_train <span class="op">=</span> model_matrix(<span class="st">"MeanBloodPressure ~ poly(Age, degree=2, raw=True)"</span>, nhanes_train)</span>
455455
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a></span>
456456
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a>linear_reg <span class="op">=</span> linear_model.LinearRegression()</span>
@@ -472,7 +472,7 @@ <h3 data-number="3.3.1" class="anchored" data-anchor-id="polynomial-regression">
472472
</div>
473473
</div>
474474
<p>Let’s look at our Residual Plot:</p>
475-
<div id="aec8df64" class="cell" data-execution_count="6">
475+
<div id="017acfef" class="cell" data-execution_count="6">
476476
<div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a>residual <span class="op">=</span> y_train <span class="op">-</span> y_train_predicted</span>
477477
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a></span>
478478
<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a>plt.clf()</span>
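The point that polynomial regression is <em>still</em> linear regression can be seen directly in the design matrix: the model stays linear in the coefficients even though it is quadratic in the predictor. A minimal sketch with hypothetical data (the true coefficients 70, 1.5, and -0.01 are invented for illustration):

```python
import numpy as np

# Hypothetical data with a quadratic age trend
rng = np.random.default_rng(2)
age = rng.uniform(20, 70, 200)
y = 70 + 1.5 * age - 0.01 * age**2 + rng.normal(0, 3, 200)

# "Polynomial regression" just adds an Age**2 column to the design matrix;
# we then solve the same ordinary least-squares problem as before.
X = np.column_stack([np.ones_like(age), age, age**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 2))
```

This is the same construction `poly(Age, degree=2, raw=True)` performs inside `model_matrix`: it expands one predictor into polynomial columns and leaves the fitting step unchanged.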
@@ -500,7 +500,7 @@ <h2 data-number="3.4" class="anchored" data-anchor-id="overfitting"><span class=
500500
<p><span class="math display">\[
501501
MeanBloodPressure= \beta_0 + \beta_1 \cdot Age
502502
\]</span></p>
503-
<div id="db4abb69" class="cell" data-execution_count="7">
503+
<div id="273ee9db" class="cell" data-execution_count="7">
504504
<div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="co">#Use a small part of the data to illustrate overfitting.</span></span>
505505
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a>nhanes_tiny <span class="op">=</span> nhanes.sample(n<span class="op">=</span><span class="dv">300</span>, random_state<span class="op">=</span><span class="dv">2</span>)</span>
506506
<span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a></span>
@@ -545,7 +545,7 @@ <h2 data-number="3.4" class="anchored" data-anchor-id="overfitting"><span class=
545545
</div>
546546
<p>We see that Training Error &lt; Testing Error.</p>
547547
<p>Let’s look at what happens if we increase the flexibility of the model by fitting a degree-2 polynomial:</p>
548-
<div id="8d442cd2" class="cell" data-execution_count="8">
548+
<div id="5986dd1a" class="cell" data-execution_count="8">
549549
<div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a>p_degree <span class="op">=</span> <span class="dv">2</span></span>
550550
<span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a>y, X <span class="op">=</span> model_matrix(<span class="st">"MeanBloodPressure ~ BMI + poly(BMI, degree="</span> <span class="op">+</span> <span class="bu">str</span>(p_degree) <span class="op">+</span> <span class="st">")"</span>, nhanes_tiny)</span>
551551
<span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a></span>
@@ -589,7 +589,7 @@ <h2 data-number="3.4" class="anchored" data-anchor-id="overfitting"><span class=
589589
</div>
590590
<p>We see that both Training and Testing error decreased slightly!</p>
591591
<p>What happens if we keep increasing the model complexity?</p>
592-
<div id="65882b57" class="cell" data-execution_count="9">
592+
<div id="bc977416" class="cell" data-execution_count="9">
593593
<div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> p_degree <span class="kw">in</span> [<span class="dv">4</span>, <span class="dv">10</span>]:</span>
594594
<span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a> y, X <span class="op">=</span> model_matrix(<span class="st">"MeanBloodPressure ~ BMI + poly(BMI, degree="</span> <span class="op">+</span> <span class="bu">str</span>(p_degree) <span class="op">+</span> <span class="st">")"</span>, nhanes_tiny)</span>
595595
<span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a> </span>
@@ -642,7 +642,7 @@ <h2 data-number="3.4" class="anchored" data-anchor-id="overfitting"><span class=
642642
</div>
643643
</div>
644644
<p>Let’s summarize it:</p>
645-
<div id="f87573b4" class="cell" data-execution_count="10">
645+
<div id="f63ac4ce" class="cell" data-execution_count="10">
646646
<div class="sourceCode cell-code" id="cb14"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a>train_err <span class="op">=</span> []</span>
647647
<span id="cb14-2"><a href="#cb14-2" aria-hidden="true" tabindex="-1"></a>test_err <span class="op">=</span> []</span>
648648
<span id="cb14-3"><a href="#cb14-3" aria-hidden="true" tabindex="-1"></a>polynomials <span class="op">=</span> <span class="bu">list</span>(<span class="bu">range</span>(<span class="dv">1</span>, <span class="dv">10</span>))</span>
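The summary cell above is truncated in the diff. The same train/test sweep can be sketched end-to-end on hypothetical data (a noisy linear truth with a BMI-like predictor; all names and constants here are invented, and `np.polyfit` stands in for the `model_matrix` + `LinearRegression` pipeline):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical small sample, standing in for nhanes_tiny
x = rng.uniform(15, 45, 300)                 # BMI-like predictor
y = 90 + 0.4 * x + rng.normal(0, 6, 300)     # noisy linear truth
x_tr, y_tr, x_te, y_te = x[:200], y[:200], x[200:], y[200:]

train_err, test_err = [], []
for degree in range(1, 10):
    coef = np.polyfit(x_tr, y_tr, deg=degree)
    train_err.append(np.mean((y_tr - np.polyval(coef, x_tr)) ** 2))
    test_err.append(np.mean((y_te - np.polyval(coef, x_te)) ** 2))

# Training error can only decrease as flexibility grows (higher-degree models
# nest lower-degree ones); test error eventually flattens or rises instead.
print(round(train_err[0], 1), round(train_err[-1], 1))
```

Plotting `train_err` and `test_err` against degree reproduces the classic U-shaped test-error curve the section is building toward.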
@@ -718,7 +718,7 @@ <h3 data-number="3.5.2" class="anchored" data-anchor-id="parameter-inference"><s
718718
<p><span class="math inline">\(\beta_0\)</span> is a parameter describing the intercept of the line, and <span class="math inline">\(\beta_1\)</span> is a parameter describing the slope of the line.</p>
719719
<p>Suppose that from fitting the model on the Training Set, <span class="math inline">\(\beta_1=2\)</span>. That means an increase in <span class="math inline">\(BMI\)</span> of 1 is associated with an increase in <span class="math inline">\(MeanBloodPressure\)</span> of 2. This measures the strength of association between a variable and the outcome.</p>
720720
<p>Let’s see this in practice:</p>
721-
<div id="3c6da906" class="cell" data-execution_count="11">
721+
<div id="bdb4dd18" class="cell" data-execution_count="11">
722722
<div class="sourceCode cell-code" id="cb15"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> statsmodels.api <span class="im">as</span> sm</span>
723723
<span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a></span>
724724
<span id="cb15-3"><a href="#cb15-3" aria-hidden="true" tabindex="-1"></a>y, X <span class="op">=</span> model_matrix(<span class="st">"MeanBloodPressure ~ BMI"</span>, nhanes_tiny)</span>
@@ -757,7 +757,7 @@ <h3 data-number="3.5.2" class="anchored" data-anchor-id="parameter-inference"><s
757757
</tr>
758758
<tr class="odd">
759759
<td data-quarto-table-cell-role="th">Time:</td>
760-
<td>21:19:36</td>
760+
<td>22:00:55</td>
761761
<td data-quarto-table-cell-role="th">Log-Likelihood:</td>
762762
<td>-502.67</td>
763763
</tr>
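The `sm.OLS` summary table is only partially visible in the diff. As a hedged sketch of where its slope estimate and standard error come from, computed by hand on hypothetical data (the true slope of 2 mirrors the example in the text; everything else is invented):

```python
import numpy as np

# Hypothetical data: mean blood pressure rises by ~2 per unit BMI
rng = np.random.default_rng(4)
bmi = rng.uniform(18, 40, 250)
bp = 70 + 2.0 * bmi + rng.normal(0, 8, 250)

# Simple-regression slope and its standard error, the two numbers that
# drive the t-statistic and p-value statsmodels reports for BMI
n = bmi.size
xbar, ybar = bmi.mean(), bp.mean()
sxx = np.sum((bmi - xbar) ** 2)
beta1 = np.sum((bmi - xbar) * (bp - ybar)) / sxx
beta0 = ybar - beta1 * xbar
resid = bp - (beta0 + beta1 * bmi)
se_beta1 = np.sqrt(resid @ resid / (n - 2) / sxx)
print(round(beta1, 2), round(se_beta1, 3))
```

The fitted slope lands near the true value of 2, and `beta1 / se_beta1` is the t-statistic shown in the coefficient table of the summary.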
@@ -861,7 +861,7 @@ <h2 data-number="3.6" class="anchored" data-anchor-id="appendix-interactions"><s
861861
<p>Here is another way to extend the Linear Model:</p>
862862
<p>Suppose we think that <span class="math inline">\(BMI\)</span> and <span class="math inline">\(Gender\)</span> may be good predictors of <span class="math inline">\(MeanBloodPressure\)</span>:</p>
863863
<p>Let’s explore the relationship between <span class="math inline">\(MeanBloodPressure\)</span> and <span class="math inline">\(BMI\)</span> separately for each value of <span class="math inline">\(Gender\)</span>.</p>
864-
<div id="eb9ab765" class="cell" data-execution_count="12">
864+
<div id="06d0b30f" class="cell" data-execution_count="12">
865865
<div class="sourceCode cell-code" id="cb16"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true" tabindex="-1"></a>plt.clf()</span>
866866
<span id="cb16-2"><a href="#cb16-2" aria-hidden="true" tabindex="-1"></a>ax <span class="op">=</span> sns.lmplot(y<span class="op">=</span><span class="st">"MeanBloodPressure"</span>, x<span class="op">=</span><span class="st">"BMI"</span>, hue<span class="op">=</span><span class="st">"Gender"</span>, data<span class="op">=</span>nhanes_train, lowess<span class="op">=</span><span class="va">False</span>, scatter_kws<span class="op">=</span>{<span class="st">'alpha'</span>:<span class="fl">0.1</span>})</span>
867867
<span id="cb16-3"><a href="#cb16-3" aria-hidden="true" tabindex="-1"></a>ax.<span class="bu">set</span>(xlim<span class="op">=</span>(<span class="dv">10</span>, <span class="dv">50</span>)) </span>
@@ -887,7 +887,7 @@ <h2 data-number="3.6" class="anchored" data-anchor-id="appendix-interactions"><s
887887
MeanBloodPressure= \beta_0 + \beta_1 \cdot BMI + \beta_2 \cdot Gender + \beta_3 \cdot BMI \cdot Gender
888888
\]</span></p>
889889
<p>Let’s see what happens:</p>
890-
<div id="1a6d315e" class="cell" data-execution_count="13">
890+
<div id="359383f2" class="cell" data-execution_count="13">
891891
<div class="sourceCode cell-code" id="cb18"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true" tabindex="-1"></a>y_train, X_train <span class="op">=</span> model_matrix(<span class="st">"MeanBloodPressure ~ BMI + Gender + BMI*Gender"</span>, nhanes_train)</span>
892892
<span id="cb18-2"><a href="#cb18-2" aria-hidden="true" tabindex="-1"></a>linear_reg <span class="op">=</span> linear_model.LinearRegression()</span>
893893
<span id="cb18-3"><a href="#cb18-3" aria-hidden="true" tabindex="-1"></a>linear_reg <span class="op">=</span> linear_reg.fit(X_train, y_train)</span>
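The interaction model above is cut off by the diff. As a minimal sketch of what the interaction term does, on hypothetical data where the BMI slope genuinely differs by gender (the 0/1 `male` indicator and all coefficients are invented; `BMI*Gender` in the formula corresponds to the product column built by hand here):

```python
import numpy as np

# Hypothetical interaction: blood pressure rises faster with BMI in one group
rng = np.random.default_rng(5)
n = 400
bmi = rng.uniform(18, 40, n)
male = rng.integers(0, 2, n)                 # 0/1 indicator for Gender
bp = 75 + 1.0 * bmi + 5 * male + 0.8 * bmi * male + rng.normal(0, 6, n)

# Design matrix with intercept, main effects, and the BMI*Gender product
X = np.column_stack([np.ones(n), bmi, male, bmi * male])
beta, *_ = np.linalg.lstsq(X, bp, rcond=None)

# The interaction coefficient is the *difference* in slopes between groups
slope_female, slope_male = beta[1], beta[1] + beta[3]
print(round(slope_female, 2), round(slope_male, 2))
```

Without the interaction column, the model would be forced to fit a single shared BMI slope, which is exactly what the two-line `lmplot` above argues against.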
-25 Bytes

0 commit comments
