Commit 3901cbf

2 parents a03d261 + 5b331cb commit 3901cbf

6 files changed

Lines changed: 47 additions & 525 deletions

File tree

docs/01-Problem-Setup.html

Lines changed: 14 additions & 14 deletions
@@ -323,7 +323,7 @@ <h2 data-number="2.3" class="anchored" data-anchor-id="the-conceptual-example-in
323323
<section id="visualizing-the-outcome" class="level3" data-number="2.3.1">
324324
<h3 data-number="2.3.1" class="anchored" data-anchor-id="visualizing-the-outcome"><span class="header-section-number">2.3.1</span> Visualizing the outcome</h3>
325325
<p>Building a sound machine learning model requires careful understanding of the data, and we often start by looking at the response variable.</p>
326-
<div id="ed172eb6" class="cell" data-execution_count="1">
326+
<div id="65b7fefb" class="cell" data-execution_count="1">
327327
<div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> pandas <span class="im">as</span> pd</span>
328328
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> seaborn <span class="im">as</span> sns</span>
329329
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span>
@@ -386,7 +386,7 @@ <h3 data-number="2.3.1" class="anchored" data-anchor-id="visualizing-the-outcome
386386
<section id="splitting-the-data" class="level3" data-number="2.3.2">
387387
<h3 data-number="2.3.2" class="anchored" data-anchor-id="splitting-the-data"><span class="header-section-number">2.3.2</span> Splitting the data</h3>
388388
<p>Our dataset has 7832 data points:</p>
389-
<div id="3ab9735f" class="cell" data-execution_count="2">
389+
<div id="4c3c31d9" class="cell" data-execution_count="2">
390390
<div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(nhanes.shape)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
391391
<div class="cell-output cell-output-stdout">
392392
<pre><code>(7832, 77)</code></pre>
@@ -402,11 +402,11 @@ <h3 data-number="2.3.2" class="anchored" data-anchor-id="splitting-the-data"><sp
402402
<li><p>The response data is small or has multiple peaks</p></li>
403403
</ul>
404404
<p>but random splitting will suffice for this example.</p>
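<p>To make the mechanics of a random split concrete, here is a minimal sketch on a made-up table. The <code>toy</code> DataFrame and its values are hypothetical stand-ins for <code>nhanes</code>; only <code>train_test_split</code> and its <code>test_size</code> and <code>random_state</code> arguments come from the example in this section.</p>

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the nhanes table: 100 rows, 2 columns of made-up values.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "BMI": rng.normal(27, 5, size=100),
    "MeanBloodPressure": rng.normal(85, 10, size=100),
})

# test_size=0.2 holds out 20% of the rows; random_state makes the shuffle reproducible.
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)

# The two pieces should not share any rows.
overlap = toy_train.index.intersection(toy_test.index)

print("Training size:", toy_train.shape)
print("Testing size:", toy_test.shape)
print("Rows in both sets:", len(overlap))
```

<p>With 100 rows, 80 land in the training set and 20 in the testing set, and no row appears in both.</p>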
405-
<div id="03017683" class="cell" data-execution_count="3">
405+
<div id="4c4d8304" class="cell" data-execution_count="3">
406406
<div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>nhanes_train, nhanes_test <span class="op">=</span> train_test_split(nhanes, test_size<span class="op">=</span><span class="fl">0.2</span>, random_state<span class="op">=</span><span class="dv">42</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
407407
</div>
408408
<p>And let’s look at the number of data points after splitting:</p>
409-
<div id="bea02520" class="cell" data-execution_count="4">
409+
<div id="14b858b5" class="cell" data-execution_count="4">
410410
<div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="st">"Training size:"</span>, nhanes_train.shape)</span>
411411
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="st">"Testing size:"</span>, nhanes_test.shape)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
412412
<div class="cell-output cell-output-stdout">
@@ -419,7 +419,7 @@ <h3 data-number="2.3.2" class="anchored" data-anchor-id="splitting-the-data"><sp
419419
<h3 data-number="2.3.3" class="anchored" data-anchor-id="exploratory-data-analysis"><span class="header-section-number">2.3.3</span> Exploratory Data Analysis</h3>
420420
<p>Now, using <em>only</em> the Training Set, we try to discern which variables might be good predictors of our response, as well as how they relate - is it linear, nonlinear, or something else? There are many ways to pick predictors for a model, ranging from Exploratory Data Analysis to quantitative methods, and we will be more comprehensive later in this course.</p>
421421
<p>Let’s look at the relationship between <span class="math inline">\(MeanBloodPressure\)</span> and a potential predictor, <span class="math inline">\(BMI\)</span>. We add a smooth line fit to the scatterplot because it shows the average trend between these two variables. The black dashed lines mark the range of healthy mean blood pressure from our response histogram.</p>
422-
<div id="1bd89c70" class="cell" data-execution_count="5">
422+
<div id="acbdacfd" class="cell" data-execution_count="5">
423423
<div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a>plt.clf()</span>
424424
<span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>ax <span class="op">=</span> sns.regplot(y<span class="op">=</span><span class="st">"MeanBloodPressure"</span>, x<span class="op">=</span><span class="st">"BMI"</span>, data<span class="op">=</span>nhanes_train, lowess<span class="op">=</span><span class="va">True</span>, scatter_kws<span class="op">=</span>{<span class="st">'alpha'</span>:<span class="fl">0.2</span>}, line_kws<span class="op">=</span>{<span class="st">'color'</span>:<span class="st">"r"</span>})</span>
425425
<span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a>ax.axhline(y<span class="op">=</span><span class="dv">70</span>, color<span class="op">=</span><span class="st">'black'</span>, linestyle<span class="op">=</span><span class="st">'--'</span>)</span>
@@ -436,7 +436,7 @@ <h3 data-number="2.3.3" class="anchored" data-anchor-id="exploratory-data-analys
436436
</div>
437437
<p>Okay, great, it looks like when someone’s BMI is higher, then it is more likely they have higher mean blood pressure.</p>
438438
<p>Let’s look at <span class="math inline">\(Age\)</span>:</p>
439-
<div id="0b34d1cb" class="cell" data-execution_count="6">
439+
<div id="4ffdd8e9" class="cell" data-execution_count="6">
440440
<div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a>plt.clf()</span>
441441
<span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a>ax <span class="op">=</span> sns.regplot(y<span class="op">=</span><span class="st">"MeanBloodPressure"</span>, x<span class="op">=</span><span class="st">"Age"</span>, data<span class="op">=</span>nhanes_train, lowess<span class="op">=</span><span class="va">True</span>, scatter_kws<span class="op">=</span>{<span class="st">'alpha'</span>:<span class="fl">0.2</span>}, line_kws<span class="op">=</span>{<span class="st">'color'</span>:<span class="st">"r"</span>})</span>
442442
<span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a>ax.axhline(y<span class="op">=</span><span class="dv">70</span>, color<span class="op">=</span><span class="st">'black'</span>, linestyle<span class="op">=</span><span class="st">'--'</span>)</span>
@@ -451,7 +451,7 @@ <h3 data-number="2.3.3" class="anchored" data-anchor-id="exploratory-data-analys
451451
</div>
452452
</div>
453453
<p>We see a similar trend. How about <span class="math inline">\(Gender\)</span>?</p>
454-
<div id="16c37a4c" class="cell" data-execution_count="7">
454+
<div id="b45c7ca8" class="cell" data-execution_count="7">
455455
<div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a>plt.clf()</span>
456456
<span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a>ax <span class="op">=</span> sns.boxplot(x<span class="op">=</span><span class="st">"Gender"</span>, y<span class="op">=</span><span class="st">"MeanBloodPressure"</span>, data<span class="op">=</span>nhanes_train)</span>
457457
<span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a>ax.axhline(y<span class="op">=</span><span class="dv">70</span>, color<span class="op">=</span><span class="st">'black'</span>, linestyle<span class="op">=</span><span class="st">'--'</span>)</span>
@@ -466,7 +466,7 @@ <h3 data-number="2.3.3" class="anchored" data-anchor-id="exploratory-data-analys
466466
</div>
467467
</div>
468468
<p>Males tend to have a higher <span class="math inline">\(MeanBloodPressure\)</span>. Let’s look at one more, <span class="math inline">\(DirectChol\)</span>:</p>
469-
<div id="fc08fe75" class="cell" data-execution_count="8">
469+
<div id="0e2ebb2b" class="cell" data-execution_count="8">
470470
<div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a>plt.clf()</span>
471471
<span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a>ax <span class="op">=</span> sns.regplot(y<span class="op">=</span><span class="st">"MeanBloodPressure"</span>, x<span class="op">=</span><span class="st">"DirectChol"</span>, data<span class="op">=</span>nhanes_train, lowess<span class="op">=</span><span class="va">True</span>, scatter_kws<span class="op">=</span>{<span class="st">'alpha'</span>:<span class="fl">0.1</span>}, line_kws<span class="op">=</span>{<span class="st">'color'</span>:<span class="st">"r"</span>})</span>
472472
<span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a>ax.axhline(y<span class="op">=</span><span class="dv">70</span>, color<span class="op">=</span><span class="st">'black'</span>, linestyle<span class="op">=</span><span class="st">'--'</span>)</span>
@@ -494,11 +494,11 @@ <h2 data-number="2.4" class="anchored" data-anchor-id="picking-a-model-linear-re
494494
\]</span></p>
495495
<p>where the unknown variables <span class="math inline">\(\beta_0\)</span>, <span class="math inline">\(\beta_1\)</span>, <span class="math inline">\(\beta_2\)</span>, <span class="math inline">\(\beta_3\)</span>, called <strong>parameters</strong> or <strong>coefficients</strong>, will be learned in the model training process.</p>
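<p>To see what "learning the parameters" means, here is a small sketch on synthetic data where we choose the true parameters ourselves and check that fitting recovers them. The data, the chosen values in <code>true_beta0</code> and <code>true_betas</code>, and the three generic predictor columns are all assumptions for illustration; they are not the NHANES variables.</p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic predictors: 200 rows, 3 columns standing in for three predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Parameters we pick ourselves (hypothetical values, not estimates from NHANES).
true_beta0 = 50.0
true_betas = np.array([0.4, 0.9, 3.0])

# Build the response exactly from the linear form, with no noise added.
y = true_beta0 + X @ true_betas

# Training estimates beta_0 (intercept_) and beta_1..beta_3 (coef_) from the data.
model = LinearRegression().fit(X, y)

print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
```

<p>Because the response here has no noise, the fitted intercept and coefficients match the chosen values up to floating-point error. On real data the estimates are only approximations of the underlying relationship.</p>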
496496
<p>We specify this form:</p>
497-
<div id="150936bd" class="cell" data-execution_count="9">
497+
<div id="b5589aec" class="cell" data-execution_count="9">
498498
<div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a>y_train, X_train <span class="op">=</span> model_matrix(<span class="st">"MeanBloodPressure ~ Age + BMI + Gender"</span>, nhanes_train)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
499499
</div>
500500
<p>And fit the model, which gives our parameters:</p>
501-
<div id="47c843d0" class="cell" data-execution_count="10">
501+
<div id="d9d94d11" class="cell" data-execution_count="10">
502502
<div class="sourceCode cell-code" id="cb13"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> sklearn <span class="im">import</span> linear_model</span>
503503
<span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a>linear_reg <span class="op">=</span> linear_model.LinearRegression()</span>
504504
<span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a>linear_reg <span class="op">=</span> linear_reg.fit(X_train, y_train)</span>
@@ -513,7 +513,7 @@ <h2 data-number="2.4" class="anchored" data-anchor-id="picking-a-model-linear-re
513513
<section id="picking-a-model-decision-tree" class="level2" data-number="2.5">
514514
<h2 data-number="2.5" class="anchored" data-anchor-id="picking-a-model-decision-tree"><span class="header-section-number">2.5</span> Picking a model: Decision Tree</h2>
515515
<p>A different model is called a <strong>Decision Tree</strong>. It is composed of a set of hierarchical if/then statements based on the predictors, each path ending in a <strong>node</strong> that dictates what the response prediction should be. Below is an example Decision Tree with three hierarchical if/then statements:</p>
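<p>Before fitting a tree with a library, it can help to write a tiny one by hand. The sketch below is a hypothetical tree with three if/then statements; the function name, the thresholds, and the predicted values are invented for illustration and were not learned from the data.</p>

```python
def predict_mean_bp(age, bmi):
    """Hand-written stand-in for a small regression tree.

    All thresholds and leaf values are made up for illustration.
    """
    if age < 40:          # first if/then statement
        if bmi < 25:      # second if/then statement
            return 78.0   # leaf: younger, lower BMI
        return 83.0       # leaf: younger, higher BMI
    if bmi < 30:          # third if/then statement
        return 88.0       # leaf: older, lower BMI
    return 93.0           # leaf: older, higher BMI


print(predict_mean_bp(age=30, bmi=22))
print(predict_mean_bp(age=55, bmi=33))
```

<p>Every input follows exactly one path through the if/then statements and ends at one leaf, so the model can only ever output as many distinct predictions as it has leaves (four here). A fitted tree works the same way, except the thresholds and leaf values are chosen from the training data.</p>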
516-
<div id="07cdea8f" class="cell" data-execution_count="11">
516+
<div id="240f744c" class="cell" data-execution_count="11">
517517
<div class="sourceCode cell-code" id="cb15"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> sklearn.tree <span class="im">import</span> DecisionTreeRegressor</span>
518518
<span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> sklearn <span class="im">import</span> tree</span>
519519
<span id="cb15-3"><a href="#cb15-3" aria-hidden="true" tabindex="-1"></a></span>
@@ -544,7 +544,7 @@ <h2 data-number="2.6" class="anchored" data-anchor-id="model-evaluation"><span c
544544
</ul>
545545
<p>We will consider other model evaluation metrics throughout the course, especially in situations when the dataset isn’t big enough for separate training and testing sets.</p>
546546
<p>Let’s look at the MAE of our Linear Regression model on the test data:</p>
547-
<div id="c1565a4f" class="cell" data-execution_count="12">
547+
<div id="173c4edc" class="cell" data-execution_count="12">
548548
<div class="sourceCode cell-code" id="cb16"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> sklearn.metrics <span class="im">import</span> mean_absolute_error</span>
549549
<span id="cb16-2"><a href="#cb16-2" aria-hidden="true" tabindex="-1"></a>y_test, X_test <span class="op">=</span> model_matrix(<span class="st">"MeanBloodPressure ~ Age + BMI + Gender"</span>, nhanes_test)</span>
550550
<span id="cb16-3"><a href="#cb16-3" aria-hidden="true" tabindex="-1"></a>y_test_predicted <span class="op">=</span> linear_reg.predict(X_test)</span>
@@ -557,7 +557,7 @@ <h2 data-number="2.6" class="anchored" data-anchor-id="model-evaluation"><span c
557557
</div>
558558
<p>Okay, on average our model is off by 8.65 on the scale of <span class="math inline">\(MeanBloodPressure\)</span>.</p>
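<p>To make "off by, on average" concrete, here is a small worked example of the mean absolute error on four made-up values. The arrays <code>y_true</code> and <code>y_pred</code> are hypothetical and are not taken from the NHANES data.</p>

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Four made-up observed values and the corresponding made-up predictions.
y_true = np.array([80.0, 90.0, 100.0, 70.0])
y_pred = np.array([85.0, 88.0, 92.0, 75.0])

# Absolute errors are 5, 2, 8, 5; their mean is 20 / 4 = 5.0.
manual_mae = np.mean(np.abs(y_true - y_pred))

# The library function computes the same quantity.
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

<p>Both calculations give 5.0: the sign of each error is dropped, so over-predictions and under-predictions do not cancel out.</p>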
559559
<p>Let’s visualize this:</p>
560-
<div id="a348e67b" class="cell" data-execution_count="13">
560+
<div id="2e96be64" class="cell" data-execution_count="13">
561561
<div class="sourceCode cell-code" id="cb18"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true" tabindex="-1"></a>plt.clf()</span>
562562
<span id="cb18-2"><a href="#cb18-2" aria-hidden="true" tabindex="-1"></a>plt.scatter(y_test_predicted, y_test, alpha<span class="op">=</span><span class="fl">.5</span>)</span>
563563
<span id="cb18-3"><a href="#cb18-3" aria-hidden="true" tabindex="-1"></a>plt.axline((<span class="dv">70</span>, <span class="dv">70</span>), slope<span class="op">=</span><span class="dv">1</span>, color<span class="op">=</span><span class="st">'r'</span>, linestyle<span class="op">=</span><span class="st">'--'</span>)</span>
@@ -574,7 +574,7 @@ <h2 data-number="2.6" class="anchored" data-anchor-id="model-evaluation"><span c
574574
</div>
575575
</div>
576576
<p>Let’s do the same for the Decision Tree model:</p>
577-
<div id="be9394dd" class="cell" data-execution_count="14">
577+
<div id="a2ee88cb" class="cell" data-execution_count="14">
578578
<div class="sourceCode cell-code" id="cb19"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1"><a href="#cb19-1" aria-hidden="true" tabindex="-1"></a>y_test_predicted <span class="op">=</span> decision_tree.predict(X_test)</span>
579579
<span id="cb19-2"><a href="#cb19-2" aria-hidden="true" tabindex="-1"></a></span>
580580
<span id="cb19-3"><a href="#cb19-3" aria-hidden="true" tabindex="-1"></a>test_err <span class="op">=</span> <span class="bu">round</span>(mean_absolute_error(y_test_predicted, y_test), <span class="dv">2</span>)</span>
