Commit 88f69a7

Render course
1 parent 0849637 commit 88f69a7

8 files changed

Lines changed: 37 additions & 37 deletions

docs/01-Problem-Setup.html

Lines changed: 15 additions & 15 deletions
@@ -281,7 +281,7 @@ <h2 data-number="2.1" class="anchored" data-anchor-id="classification-model-exam
281281
</ol>
282282
<p>Let’s start with the easiest case for just <span class="math inline">\(Hypertension = f(Age)\)</span>, a single predictor.</p>
283283
<p>Before we fit models, we often visualize the data to get a sense of whether our setup makes sense.</p>
284-
<div id="00a807f6" class="cell" data-execution_count="1">
284+
<div id="411bb769" class="cell" data-execution_count="1">
285285
<div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> pandas <span class="im">as</span> pd</span>
286286
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> seaborn <span class="im">as</span> sns</span>
287287
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span>
@@ -308,7 +308,7 @@ <h2 data-number="2.1" class="anchored" data-anchor-id="classification-model-exam
308308
</div>
309309
<p>Okay, great, it looks like when someone’s BMI is higher, then it is more likely that the person has Hypertension.</p>
310310
<p>Now, let’s build the model <span class="math inline">\(Hypertension = f(BMI)\)</span> to make a prediction of <span class="math inline">\(Hypertension\)</span> given <span class="math inline">\(BMI\)</span>.</p>
311-
<div id="f0d0e807" class="cell" data-execution_count="2">
311+
<div id="ecc79aea" class="cell" data-execution_count="2">
312312
<div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> pandas <span class="im">as</span> pd</span>
313313
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> seaborn <span class="im">as</span> sns</span>
314314
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span>
@@ -346,7 +346,7 @@ <h2 data-number="2.1" class="anchored" data-anchor-id="classification-model-exam
346346
<p>Instead of boxplots, we plotted the data just using points, with “Hypertension” having a probability of 1 and “No Hypertension” having a probability of 0. We see that we have a fitted line in blue for every value of BMI, which represents our machine learning model <span class="math inline">\(f(BMI)\)</span>. This model is called <strong>Logistic Regression</strong>.</p>
347347
<p>The first thing we want to investigate about this model is how well it performs in terms of Classification. Just using <span class="math inline">\(BMI\)</span> as a variable, what is the Accuracy of <span class="math inline">\(f(BMI)\)</span> in classifying whether a person has <span class="math inline">\(Hypertension\)</span>? Notice that <span class="math inline">\(f(BMI)\)</span> first gives us continuous probability values: for example, given a BMI of 30, there is a 20% chance the person has Hypertension. We need a discrete cutoff on these probabilities to decide whether the person has Hypertension.</p>
348348
<p>A reasonable cutoff to start with is 50%: if the probability of having Hypertension is &gt;=50%, classify that person as having Hypertension; if it is &lt;50%, classify that person as not having Hypertension. This cutoff is called the <strong>Decision Boundary</strong>.</p>
349-
<div id="632919dc" class="cell" data-execution_count="3">
349+
<div id="81aa2b9d" class="cell" data-execution_count="3">
350350
<div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a>plt.clf()</span>
351351
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>plt.scatter(X.BMI, logit_model.predict(), color<span class="op">=</span><span class="st">"blue"</span>, label<span class="op">=</span><span class="st">"Fitted Line"</span>)</span>
352352
<span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a>plt.scatter(X.BMI, y, alpha<span class="op">=</span><span class="fl">.3</span>, color<span class="op">=</span><span class="st">"brown"</span>, label<span class="op">=</span><span class="st">"Data"</span>)</span>
@@ -364,7 +364,7 @@ <h2 data-number="2.1" class="anchored" data-anchor-id="classification-model-exam
364364
</div>
365365
</div>
366366
<p>Given this decision boundary, what is the accuracy?</p>
367-
<div id="0f3a7d84" class="cell" data-execution_count="4">
367+
<div id="415f34f5" class="cell" data-execution_count="4">
368368
<div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> sklearn.metrics <span class="im">import</span> (confusion_matrix, accuracy_score)</span>
369369
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a></span>
370370
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a>prediction_cut <span class="op">=</span> [<span class="dv">1</span> <span class="cf">if</span> x <span class="op">&gt;=</span> <span class="fl">.5</span> <span class="cf">else</span> <span class="dv">0</span> <span class="cf">for</span> x <span class="kw">in</span> logit_model.predict()]</span>
@@ -375,7 +375,7 @@ <h2 data-number="2.1" class="anchored" data-anchor-id="classification-model-exam
375375
</div>
376376
<p>Okay, that’s a starting point!</p>
377377
<p>We can break down classification accuracy into four additional results:</p>
378-
<div id="18903073" class="cell" data-execution_count="5">
378+
<div id="81c5fed6" class="cell" data-execution_count="5">
379379
<div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>tn, fp, fn, tp <span class="op">=</span> confusion_matrix(y, prediction_cut).ravel().tolist()</span>
380380
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="st">"True Positive:"</span>, tp, <span class="st">"</span><span class="ch">\n</span><span class="st">False Positive: "</span>, fp, <span class="st">"</span><span class="ch">\n</span><span class="st">True Negative: "</span>, tn, <span class="st">"</span><span class="ch">\n</span><span class="st">False Negative:"</span>, fn)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
381381
<div class="cell-output cell-output-stdout">
@@ -387,7 +387,7 @@ <h2 data-number="2.1" class="anchored" data-anchor-id="classification-model-exam
387387
</div>
388388
<p>define tp, fp, tn, fn</p>
389389
<p>define confusion matrix</p>
390-
<div id="42a4e264" class="cell" data-execution_count="6">
390+
<div id="28862d4d" class="cell" data-execution_count="6">
391391
<div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a>cm <span class="op">=</span> confusion_matrix(y, prediction_cut) </span>
392392
<span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="st">"Confusion Matrix : </span><span class="ch">\n</span><span class="st">"</span>, cm) </span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
393393
<div class="cell-output cell-output-stdout">
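The thresholding and counting steps shown in the hunks above can be sketched without the course data. This is a minimal, illustrative sketch: the probabilities and labels are made up, and `classify` and `confusion_counts` are hypothetical helper names that do not appear in the course code. It applies the 50% decision boundary and tallies the four outcomes in the same `tn, fp, fn, tp` order that `confusion_matrix(...).ravel()` yields for binary 0/1 labels.

```python
def classify(probabilities, cutoff=0.5):
    """Turn predicted probabilities into 0/1 labels using a decision boundary."""
    return [1 if p >= cutoff else 0 for p in probabilities]


def confusion_counts(y_true, y_pred):
    """Count (tn, fp, fn, tp) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly predicted negative
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted positive, truly negative
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted negative, truly positive
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly predicted positive
    return tn, fp, fn, tp


# Made-up values for illustration only (not NHANES data)
probs = [0.10, 0.40, 0.55, 0.80, 0.30, 0.65]
y_true = [0, 0, 0, 1, 1, 1]

y_pred = classify(probs)
tn, fp, fn, tp = confusion_counts(y_true, y_pred)

# Accuracy is the share of correct predictions among all predictions
accuracy = (tp + tn) / (tp + tn + fp + fn)
print("tn, fp, fn, tp =", (tn, fp, fn, tp), "accuracy =", accuracy)
```

Raising or lowering `cutoff` moves cases between the false-positive and false-negative counts, which is why accuracy alone does not describe how a classifier behaves.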
@@ -485,7 +485,7 @@ <h2 data-number="2.4" class="anchored" data-anchor-id="how-to-evaluate-and-pick-
485485
<section id="prediction" class="level3" data-number="2.4.1">
486486
<h3 data-number="2.4.1" class="anchored" data-anchor-id="prediction"><span class="header-section-number">2.4.1</span> Prediction</h3>
487487
<p>Suppose we try to use the single variable <span class="math inline">\(BMI\)</span> to predict <span class="math inline">\(BloodPressure\)</span> using a linear model.</p>
488-
<div id="7ca969f2" class="cell" data-execution_count="7">
488+
<div id="248c4bee" class="cell" data-execution_count="7">
489489
<div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> pandas <span class="im">as</span> pd</span>
490490
<span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> seaborn <span class="im">as</span> sns</span>
491491
<span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span>
@@ -517,27 +517,27 @@ <h3 data-number="2.4.1" class="anchored" data-anchor-id="prediction"><span class
517517
</div>
518518
<p>We examine how well our model performs in terms of prediction by seeing how close our model’s predicted <span class="math inline">\(BloodPressure\)</span> is to the Training Set’s true <span class="math inline">\(BloodPressure\)</span>: the <strong>Training Error</strong>. We also take the model to the Testing Set to predict <span class="math inline">\(BloodPressure\)</span> using predictors from the Test Set and compare to the true <span class="math inline">\(BloodPressure\)</span> in the Test Set: the <strong>Testing Error.</strong> We want the model’s Training Error to be adequately small on the Training Set, but what we really care about is the Testing Error, because it is a true test of how the model performs on unseen, new data, and allows us to see how generalizable the model is.</p>
519519
<p>Okay, let’s see how it does on the Training Set:</p>
520-
<div id="5d4b4820" class="cell" data-execution_count="8">
520+
<div id="4907bc81" class="cell" data-execution_count="8">
521521
<div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a>np.mean((results.fittedvalues <span class="op">-</span> y_train.BloodPressure) <span class="op">**</span> <span class="dv">2</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
522522
<div class="cell-output cell-output-display" data-execution_count="8">
523-
<pre><code>152.36993639963026</code></pre>
523+
<pre><code>np.float64(152.36993639963026)</code></pre>
524524
</div>
525525
</div>
526-
<div id="287be5fc" class="cell" data-execution_count="9">
526+
<div id="e32d79f4" class="cell" data-execution_count="9">
527527
<div class="sourceCode cell-code" id="cb14"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a>results.mse_resid</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
528528
<div class="cell-output cell-output-display" data-execution_count="9">
529-
<pre><code>152.4417920640486</code></pre>
529+
<pre><code>np.float64(152.44179206404883)</code></pre>
530530
</div>
531531
</div>
532532
<p>[graph here]</p>
533533
<p>And then on the Test Set:</p>
534-
<div id="983fa0c6" class="cell" data-execution_count="10">
534+
<div id="7bf725d9" class="cell" data-execution_count="10">
535535
<div class="sourceCode cell-code" id="cb16"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true" tabindex="-1"></a>np.mean((results.get_prediction(X_test).predicted_mean <span class="op">-</span> y_test.BloodPressure) <span class="op">**</span> <span class="dv">2</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
536536
<div class="cell-output cell-output-display" data-execution_count="10">
537-
<pre><code>155.83019738256442</code></pre>
537+
<pre><code>np.float64(155.83019738256448)</code></pre>
538538
</div>
539539
</div>
540-
<div id="af1edf69" class="cell" data-execution_count="11">
540+
<div id="93eea0df" class="cell" data-execution_count="11">
541541
<div class="sourceCode cell-code" id="cb18"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true" tabindex="-1"></a>plt.plot(X_test.BMI, results.get_prediction(X_test).predicted_mean, label<span class="op">=</span><span class="st">"fitted line"</span>)</span>
542542
<span id="cb18-2"><a href="#cb18-2" aria-hidden="true" tabindex="-1"></a>plt.scatter(X_test.BMI, y_test, alpha<span class="op">=</span><span class="fl">.3</span>, color<span class="op">=</span><span class="st">"black"</span>, label<span class="op">=</span><span class="st">"test set"</span>)</span>
543543
<span id="cb18-3"><a href="#cb18-3" aria-hidden="true" tabindex="-1"></a>plt.legend()<span class="op">;</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
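The two Training Set numbers in this hunk differ slightly (152.37 vs. 152.44), most likely because of their denominators: `np.mean` of the squared residuals divides by the number of observations, while statsmodels' `mse_resid` divides the same sum of squared residuals by the residual degrees of freedom (observations minus fitted parameters). The sketch below illustrates this on made-up data with a hand-rolled straight-line fit; `fit_line`, `predict`, and `mse` are hypothetical helpers, not part of the course code or of statsmodels.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(params, x):
    """Apply the fitted line to new predictor values."""
    intercept, slope = params
    return [intercept + slope * xi for xi in x]


def mse(y_true, y_pred, n_params=0):
    """Sum of squared errors divided by (n - n_params)."""
    ssr = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    return ssr / (len(y_true) - n_params)


# Made-up BMI / blood pressure values for illustration only
x_train = [20.0, 25.0, 30.0, 35.0, 40.0]
y_train = [110.0, 118.0, 121.0, 130.0, 131.0]
x_test = [22.0, 28.0, 38.0]
y_test = [115.0, 117.0, 135.0]

params = fit_line(x_train, y_train)

# Training Error, dividing by n (what np.mean of squared residuals computes)
train_mse = mse(y_train, predict(params, x_train))
# Same residuals, dividing by n - 2 (intercept and slope were estimated)
train_mse_resid = mse(y_train, predict(params, x_train), n_params=2)
# Testing Error: the fitted model applied to data it has not seen
test_mse = mse(y_test, predict(params, x_test))

print(train_mse, train_mse_resid, test_mse)
```

With thousands of training rows the two training figures nearly coincide, which matches the small gap in the output above; either one can be compared against the Testing Error as long as the choice is stated.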
@@ -551,7 +551,7 @@ <h3 data-number="2.4.1" class="anchored" data-anchor-id="prediction"><span class
551551
</div>
552552
<p>We see that the Training Error is fairly high, and the Testing Error is even higher. This is an example of <strong>Underfitting</strong>, where our model failed to capture the complexity of the data in both the Training and Testing Set.</p>
553553
<p>Let’s return to the drawing board and fit a new type of model that has more flexibility around complicated patterns of data. Let’s see how it does on the Training Set:</p>
554-
<div id="312d472c" class="cell" data-execution_count="12">
554+
<div id="ce8cbf11" class="cell" data-execution_count="12">
555555
<div class="sourceCode cell-code" id="cb19"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1"><a href="#cb19-1" aria-hidden="true" tabindex="-1"></a><span class="co">#y, X = model_matrix("BloodPressure ~ poly(BMI, degree=5)", nhanes)</span></span>
556556
<span id="cb19-2"><a href="#cb19-2" aria-hidden="true" tabindex="-1"></a></span>
557557
<span id="cb19-3"><a href="#cb19-3" aria-hidden="true" tabindex="-1"></a><span class="co">#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)</span></span>
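The commented-out lines above point toward a degree-5 polynomial in BMI. As a rough stand-in, the sketch below compares a straight line with a degree-5 polynomial on synthetic data. It uses `numpy.polyfit` rather than the formula interface in the course code, and the data-generating curve, the sample sizes, and the helper name `poly_mse` are all assumptions made for illustration.

```python
import numpy as np

# Synthetic, curved data for illustration only (not NHANES)
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=200)

# Simple 50/50 split into training and testing halves
x_train, y_train = x[:100], y[:100]
x_test, y_test = x[100:], y[100:]


def poly_mse(degree):
    """Fit a polynomial of the given degree on the training half,
    then return (training MSE, testing MSE)."""
    coefs = np.polyfit(x_train, y_train, deg=degree)
    train_err = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return float(train_err), float(test_err)


train_1, test_1 = poly_mse(1)  # straight line: less flexible
train_5, test_5 = poly_mse(5)  # degree-5 polynomial: more flexible

print("degree 1: train", train_1, "test", test_1)
print("degree 5: train", train_5, "test", test_5)
```

Training Error cannot increase when the degree goes up, because the straight line is a special case of the degree-5 polynomial. Whether the Testing Error also improves depends on the data, which is why both numbers are reported.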