
Commit 348b6fc

committed
2 parents 7debbe1 + ff0a35e commit 348b6fc

5 files changed

Lines changed: 143 additions & 12 deletions

File tree

docs/01-Problem-Setup.html

Lines changed: 117 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,40 @@
2020
margin: 0 0.8em 0.2em -1em; /* quarto-specific, see https://github.com/quarto-dev/quarto-cli/issues/4556 */
2121
vertical-align: middle;
2222
}
23+
/* CSS for syntax highlighting */
24+
pre > code.sourceCode { white-space: pre; position: relative; }
25+
pre > code.sourceCode > span { line-height: 1.25; }
26+
pre > code.sourceCode > span:empty { height: 1.2em; }
27+
.sourceCode { overflow: visible; }
28+
code.sourceCode > span { color: inherit; text-decoration: inherit; }
29+
div.sourceCode { margin: 1em 0; }
30+
pre.sourceCode { margin: 0; }
31+
@media screen {
32+
div.sourceCode { overflow: auto; }
33+
}
34+
@media print {
35+
pre > code.sourceCode { white-space: pre-wrap; }
36+
pre > code.sourceCode > span { display: inline-block; text-indent: -5em; padding-left: 5em; }
37+
}
38+
pre.numberSource code
39+
{ counter-reset: source-line 0; }
40+
pre.numberSource code > span
41+
{ position: relative; left: -4em; counter-increment: source-line; }
42+
pre.numberSource code > span > a:first-child::before
43+
{ content: counter(source-line);
44+
position: relative; left: -1em; text-align: right; vertical-align: baseline;
45+
border: none; display: inline-block;
46+
-webkit-touch-callout: none; -webkit-user-select: none;
47+
-khtml-user-select: none; -moz-user-select: none;
48+
-ms-user-select: none; user-select: none;
49+
padding: 0 4px; width: 4em;
50+
}
51+
pre.numberSource { margin-left: 3em; padding-left: 4px; }
52+
div.sourceCode
53+
{ }
54+
@media screen {
55+
pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; }
56+
}
2357
</style>
2458

2559

@@ -178,8 +212,18 @@ <h2 id="toc-title">Table of contents</h2>
178212

179213
<ul>
180214
<li><a href="#population-and-sample" id="toc-population-and-sample" class="nav-link active" data-scroll-target="#population-and-sample"><span class="header-section-number">2.1</span> Population and Sample</a></li>
181-
<li><a href="#how-to-evaluate-and-pick-a-model" id="toc-how-to-evaluate-and-pick-a-model" class="nav-link" data-scroll-target="#how-to-evaluate-and-pick-a-model"><span class="header-section-number">2.2</span> How to evaluate and pick a model?</a></li>
182-
<li><a href="#preview-linear-regression" id="toc-preview-linear-regression" class="nav-link" data-scroll-target="#preview-linear-regression"><span class="header-section-number">2.3</span> Preview: linear regression</a></li>
215+
<li><a href="#how-to-evaluate-and-pick-a-model" id="toc-how-to-evaluate-and-pick-a-model" class="nav-link" data-scroll-target="#how-to-evaluate-and-pick-a-model"><span class="header-section-number">2.2</span> How to evaluate and pick a model?</a>
216+
<ul class="collapse">
217+
<li><a href="#prediction" id="toc-prediction" class="nav-link" data-scroll-target="#prediction"><span class="header-section-number">2.2.1</span> Prediction</a></li>
218+
<li><a href="#inference" id="toc-inference" class="nav-link" data-scroll-target="#inference"><span class="header-section-number">2.2.2</span> Inference</a></li>
219+
</ul></li>
220+
<li><a href="#the-numpy-package" id="toc-the-numpy-package" class="nav-link" data-scroll-target="#the-numpy-package"><span class="header-section-number">2.3</span> The NumPy Package</a>
221+
<ul class="collapse">
222+
<li><a href="#subsetting" id="toc-subsetting" class="nav-link" data-scroll-target="#subsetting"><span class="header-section-number">2.3.1</span> Subsetting</a></li>
223+
<li><a href="#how-to-split-the-data-for-training-and-testing" id="toc-how-to-split-the-data-for-training-and-testing" class="nav-link" data-scroll-target="#how-to-split-the-data-for-training-and-testing"><span class="header-section-number">2.3.2</span> How to split the data for training and testing</a></li>
224+
</ul></li>
225+
<li><a href="#linear-regression-preview" id="toc-linear-regression-preview" class="nav-link" data-scroll-target="#linear-regression-preview"><span class="header-section-number">2.4</span> Linear Regression Preview</a></li>
226+
<li><a href="#appendix-other-terms" id="toc-appendix-other-terms" class="nav-link" data-scroll-target="#appendix-other-terms"><span class="header-section-number">2.5</span> Appendix: Other terms</a></li>
183227
</ul>
184228
<div class="toc-actions"><ul><li><a href="https://github.com/ottrproject/OTTR_Quarto/edit/main/01-Problem-Setup.qmd" class="toc-action"><i class="bi bi-github"></i>Edit this page</a></li><li><a href="https://docs.google.com/forms/d/e/1FAIpQLSfBvVELBg8lcynKj0TrzMlov1zil-Sbkh9VhMKRcSpeo1xo6g/viewform" class="toc-action"><i class="bi empty"></i>Report an issue</a></li></ul></div></nav>
185229
</div>
@@ -230,17 +274,84 @@ <h2 data-number="2.1" class="anchored" data-anchor-id="population-and-sample"><s
230274
<p><strong>Sample:</strong> A smaller collection of individual units that the researcher has selected to study. For NHANES, this could be a random sampling of the US population.</p>
231275
<p>In Machine Learning problems, we often like to take two non-overlapping samples from the population: the <strong>Training Set</strong> and the <strong>Test Set</strong>. We <strong>train</strong> our model using the Training Set, which gives us a function <span class="math inline">\(f()\)</span> that relates the predictors to the outcome. Then, for our main use cases:</p>
232276
<ol type="1">
233-
<li><strong>Prediction:</strong> We use the trained model to predict the outcome using predictors from the Test Set and compare the predicted outcome to the true value in the Test Set.</li>
277+
<li><strong>Prediction:</strong> We use the trained model to predict the outcome using predictors from the Test Set and compare to the true value in the Test Set.</li>
234278
<li><strong>Inference</strong>: We examine the function <span class="math inline">\(f()\)</span>’s trained values, which are called <strong>parameters</strong>. For instance, in <span class="math inline">\(f(Age,BMI,Income,…)=20 + 3 \cdot Age - .2 \cdot BMI + .00015 \cdot Income\)</span>, the values <span class="math inline">\(20\)</span>, <span class="math inline">\(3\)</span>, <span class="math inline">\(-.2\)</span>, and <span class="math inline">\(.00015\)</span> are the parameters. Because these parameters are derived from the Training Set, they are <em>estimated</em> quantities from a sample, just like other summary statistics such as the sample mean. Therefore, to say anything about the true population, we have to use statistical tools such as p-values and confidence intervals.</li>
235279
</ol>
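The train-then-evaluate workflow above can be sketched end to end in Python. This is a minimal sketch on made-up data, not NHANES; the toy parameters 20, 3, and -0.2 simply echo the example function above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample: blood pressure follows the toy relationship from the
# text, BP = 20 + 3*Age - 0.2*BMI, plus noise (illustrative numbers only)
n = 200
age = rng.uniform(20, 70, n)
bmi = rng.uniform(18, 40, n)
bp = 20 + 3 * age - 0.2 * bmi + rng.normal(0, 5, n)

# Two non-overlapping samples: shuffle the indices, then split them
idx = rng.permutation(n)
train_idx, test_idx = idx[:150], idx[150:]

# Train: least squares on the Training Set yields the parameters of f()
X = np.column_stack([np.ones(n), age, bmi])  # design matrix with intercept
params, *_ = np.linalg.lstsq(X[train_idx], bp[train_idx], rcond=None)

# 1. Prediction: apply f() to Test Set predictors, compare to true outcomes
pred = X[test_idx] @ params
test_error = np.mean((pred - bp[test_idx]) ** 2)

# 2. Inference: inspect the estimated parameters themselves
print(params)      # estimates close to the true 20, 3, -0.2
print(test_error)
```

Because the parameters come from a random sample, rerunning with a different seed would give slightly different estimates, which is exactly why inference needs statistical tools.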
236280
<p>If the concepts of population, sample, estimation, p-value, and confidence interval are new to you, we recommend doing a bit of reading here [todo].</p>
237281
</section>
238282
<section id="how-to-evaluate-and-pick-a-model" class="level2" data-number="2.2">
239283
<h2 data-number="2.2" class="anchored" data-anchor-id="how-to-evaluate-and-pick-a-model"><span class="header-section-number">2.2</span> How to evaluate and pick a model?</h2>
240-
<p>The little example model we showcased above is an example of a <strong>linear model</strong>, but we will look at several types of models in this course. In order to decide how to evaluate and pick a model, we will need to develop a framework to assess a model.</p>
284+
<p>The little example model we showcased above is an example of a <strong>linear model</strong>, but we will look at several types of models in this course. In order to decide how to evaluate and pick a model, we will need to develop a framework to assess a model. Let’s start with the use case of prediction.</p>
285+
<section id="prediction" class="level3" data-number="2.2.1">
286+
<h3 data-number="2.2.1" class="anchored" data-anchor-id="prediction"><span class="header-section-number">2.2.1</span> Prediction</h3>
287+
<p>Suppose we try to use the variable <span class="math inline">\(BMI\)</span> to predict <span class="math inline">\(BloodPressure\)</span> using a linear model.</p>
288+
<div id="2c1f20ff" class="cell" data-execution_count="1">
289+
<div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> pandas <span class="im">as</span> pd</span>
290+
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> seaborn <span class="im">as</span> sns</span>
291+
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a>nhanes <span class="op">=</span> pd.read_csv(<span class="st">"classroom_data/NHANES.csv"</span>)</span>
292+
<span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a>nhanes[<span class="st">'BloodPressure'</span>] <span class="op">=</span> nhanes[<span class="st">'BPDiaAve'</span>] <span class="op">+</span> (nhanes[<span class="st">'BPSysAve'</span>] <span class="op">-</span> nhanes[<span class="st">'BPDiaAve'</span>]) <span class="op">/</span> <span class="dv">3</span> </span>
293+
<span id="cb1-5"><a href="#cb1-5" aria-hidden="true" tabindex="-1"></a></span>
294+
<span id="cb1-6"><a href="#cb1-6" aria-hidden="true" tabindex="-1"></a>plot <span class="op">=</span> sns.lmplot(x<span class="op">=</span><span class="st">"BMI"</span>, y<span class="op">=</span><span class="st">"BloodPressure"</span>, data<span class="op">=</span>nhanes)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
295+
<div class="cell-output cell-output-display">
296+
<div>
297+
<figure class="figure">
298+
<p><img src="01-Problem-Setup_files/figure-html/cell-2-output-1.png" width="470" height="470" class="figure-img"></p>
299+
</figure>
300+
</div>
301+
</div>
302+
</div>
303+
<p>We examine how well our model performs in terms of prediction by seeing how close our model’s predicted <span class="math inline">\(BloodPressure\)</span> is to the Training Set’s true <span class="math inline">\(BloodPressure\)</span>: the <strong>Training Error</strong>. We also take the model to the Test Set: we predict <span class="math inline">\(BloodPressure\)</span> using predictors from the Test Set and compare to the true <span class="math inline">\(BloodPressure\)</span> in the Test Set: the <strong>Testing Error</strong>. We want the model’s Training Error to be adequately small, but what we really care about is the Testing Error, because it is a true test of how the model performs on new, unseen data and shows how generalizable the model is.</p>
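As a concrete illustration of Training Error versus Testing Error, here is a minimal sketch on synthetic data (the BMI-to-blood-pressure numbers are made up for illustration, and mean squared error stands in as the error metric):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up stand-in for the sample: a weak linear BMI -> BloodPressure signal
bmi = rng.uniform(18, 40, 120)
bp = 85 + 0.4 * bmi + rng.normal(0, 8, 120)
bmi_train, bp_train = bmi[:80], bp[:80]
bmi_test, bp_test = bmi[80:], bp[80:]

# Fit the line on the Training Set only
slope, intercept = np.polyfit(bmi_train, bp_train, deg=1)

def mse(x, y):
    """Mean squared error of the fitted line on data (x, y)."""
    return np.mean((intercept + slope * x - y) ** 2)

training_error = mse(bmi_train, bp_train)  # error on data the model saw
testing_error = mse(bmi_test, bp_test)     # error on held-out data
print(training_error, testing_error)
```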
304+
<p>Okay, let’s see how it does on the Training Set:</p>
305+
<p>[graph here]</p>
306+
<p>And then on the Test Set:</p>
307+
<p>[graph here]</p>
308+
<p>We see that the Training Error is fairly high, and the Testing Error is even higher. This is an example of <strong>Underfitting</strong>, where our model failed to capture the complexity of the data in both the Training and Test Sets.</p>
309+
<p>Let’s return to the drawing board and fit a new type of model that has more flexibility around complicated patterns of data. Let’s see how it does on the Training Set:</p>
310+
<p>[graph here]</p>
311+
<p>And then on the Test Set:</p>
312+
<p>[graph here]</p>
313+
<p>We see that the Training Error is low, but the Testing Error is huge! This is an example of <strong>Overfitting</strong>, in which our model fit the shape of the Training Set so closely that it fails to generalize to the Test Set.</p>
314+
<p>We want to find a model that is “just right”: one that neither underfits nor overfits the data. Usually, as the model becomes more flexible, the Training Error keeps decreasing, while the Testing Error decreases at first and then increases. See below:</p>
315+
<div class="quarto-figure quarto-figure-center">
316+
<figure class="figure">
317+
<p><img src="images/testing_error-01.png" class="img-fluid figure-img"></p>
318+
<figcaption>Source: An Introduction to Statistical Learning, Ch. 2, by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, Jonathan Taylor.</figcaption>
319+
</figure>
320+
</div>
321+
<p>Also see this interactive tutorial: <a href="https://mlu-explain.github.io/bias-variance/">https://mlu-explain.github.io/bias-variance/</a></p>
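The flexibility trade-off can also be demonstrated numerically. The sketch below uses toy data, with polynomial degree as a stand-in for model flexibility: the Training Error keeps falling as the degree grows, while the Testing Error typically falls and then rises:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy nonlinear data: y is a sine curve plus noise
x = rng.uniform(-3, 3, 60)
y = np.sin(x) + rng.normal(0, 0.3, 60)
x_train, y_train = x[:40], y[:40]
x_test, y_test = x[40:], y[40:]

train_errors, test_errors = [], []
for degree in (1, 3, 9):  # increasing flexibility
    coeffs = np.polyfit(x_train, y_train, degree)
    train_errors.append(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_errors.append(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    print(degree, round(train_errors[-1], 3), round(test_errors[-1], 3))
```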
322+
</section>
323+
<section id="inference" class="level3" data-number="2.2.2">
324+
<h3 data-number="2.2.2" class="anchored" data-anchor-id="inference"><span class="header-section-number">2.2.2</span> Inference</h3>
325+
<p>Let’s consider how we would evaluate and choose models for Inference.</p>
326+
<p>For models with a low number of predictors, there are diagnostic plots and metrics one can consider, such as the BIC.</p>
327+
<p>For models with a high number of predictors, we will discuss this in more detail in weeks 5 &amp; 6.</p>
328+
<p>Besides how flexible a model is, another categorization of machine learning models is how <strong>interpretable</strong> they are. The more interpretable a model is, the better one can describe how each variable acts as a predictor in the model. That makes the inference process easier.</p>
329+
<p>Below are some example models mapped along these two axes. The linear model sits in a similar position to the “Least Squares” models.</p>
330+
<div class="quarto-figure quarto-figure-center">
331+
<figure class="figure">
332+
<p><img src="images/flexibility_vs_interpretability.png" class="img-fluid figure-img" width="500"></p>
333+
<figcaption>Source: An Introduction to Statistical Learning, Ch. 2, by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, Jonathan Taylor</figcaption>
334+
</figure>
335+
</div>
336+
</section>
337+
</section>
338+
<section id="the-numpy-package" class="level2" data-number="2.3">
339+
<h2 data-number="2.3" class="anchored" data-anchor-id="the-numpy-package"><span class="header-section-number">2.3</span> The NumPy Package</h2>
340+
<section id="subsetting" class="level3" data-number="2.3.1">
341+
<h3 data-number="2.3.1" class="anchored" data-anchor-id="subsetting"><span class="header-section-number">2.3.1</span> Subsetting</h3>
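As a preview of this section, the core NumPy subsetting idioms are slicing, integer indexing, and boolean masking. A minimal sketch (the array values are made up for illustration):

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])

print(a[1:4])        # slicing: positions 1 through 3 -> [20 30 40]
print(a[[0, 2, 4]])  # integer (fancy) indexing -> [10 30 50]
print(a[a > 25])     # boolean masking -> [30 40 50]

m = np.arange(12).reshape(3, 4)  # a 3x4 two-dimensional array
print(m[0, :])   # first row -> [0 1 2 3]
print(m[:, 1])   # second column -> [1 5 9]
```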
342+
</section>
343+
<section id="how-to-split-the-data-for-training-and-testing" class="level3" data-number="2.3.2">
344+
<h3 data-number="2.3.2" class="anchored" data-anchor-id="how-to-split-the-data-for-training-and-testing"><span class="header-section-number">2.3.2</span> How to split the data for training and testing</h3>
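One simple way to split data with plain NumPy is to shuffle the row indices and cut them in two. A minimal sketch with toy arrays and an 80/20 split (the data here is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 10 observations, 2 predictors
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Shuffle row indices, then take ~80% for training and the rest for testing
idx = rng.permutation(len(X))
n_train = int(0.8 * len(X))
train_idx, test_idx = idx[:n_train], idx[n_train:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Shuffling before splitting matters: taking the first 80% of rows directly could bias both sets if the file is sorted by some variable.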
345+
</section>
346+
</section>
347+
<section id="linear-regression-preview" class="level2" data-number="2.4">
348+
<h2 data-number="2.4" class="anchored" data-anchor-id="linear-regression-preview"><span class="header-section-number">2.4</span> Linear Regression Preview</h2>
241349
</section>
242-
<section id="preview-linear-regression" class="level2" data-number="2.3">
243-
<h2 data-number="2.3" class="anchored" data-anchor-id="preview-linear-regression"><span class="header-section-number">2.3</span> Preview: linear regression</h2>
350+
<section id="appendix-other-terms" class="level2" data-number="2.5">
351+
<h2 data-number="2.5" class="anchored" data-anchor-id="appendix-other-terms"><span class="header-section-number">2.5</span> Appendix: Other terms</h2>
352+
<p>Parametric vs.&nbsp;Non-parametric</p>
353+
<p>Bias-Variance trade-off</p>
354+
<p>Supervised vs.&nbsp;Unsupervised</p>
244355

245356

246357
</section>

docs/images/testing_error-01.png

60.8 KB
