|
20 | 20 | margin: 0 0.8em 0.2em -1em; /* quarto-specific, see https://github.com/quarto-dev/quarto-cli/issues/4556 */ |
21 | 21 | vertical-align: middle; |
22 | 22 | } |
| 23 | +/* CSS for syntax highlighting */ |
| 24 | +pre > code.sourceCode { white-space: pre; position: relative; } |
| 25 | +pre > code.sourceCode > span { line-height: 1.25; } |
| 26 | +pre > code.sourceCode > span:empty { height: 1.2em; } |
| 27 | +.sourceCode { overflow: visible; } |
| 28 | +code.sourceCode > span { color: inherit; text-decoration: inherit; } |
| 29 | +div.sourceCode { margin: 1em 0; } |
| 30 | +pre.sourceCode { margin: 0; } |
| 31 | +@media screen { |
| 32 | +div.sourceCode { overflow: auto; } |
| 33 | +} |
| 34 | +@media print { |
| 35 | +pre > code.sourceCode { white-space: pre-wrap; } |
| 36 | +pre > code.sourceCode > span { display: inline-block; text-indent: -5em; padding-left: 5em; } |
| 37 | +} |
| 38 | +pre.numberSource code |
| 39 | + { counter-reset: source-line 0; } |
| 40 | +pre.numberSource code > span |
| 41 | + { position: relative; left: -4em; counter-increment: source-line; } |
| 42 | +pre.numberSource code > span > a:first-child::before |
| 43 | + { content: counter(source-line); |
| 44 | + position: relative; left: -1em; text-align: right; vertical-align: baseline; |
| 45 | + border: none; display: inline-block; |
| 46 | + -webkit-touch-callout: none; -webkit-user-select: none; |
| 47 | + -khtml-user-select: none; -moz-user-select: none; |
| 48 | + -ms-user-select: none; user-select: none; |
| 49 | + padding: 0 4px; width: 4em; |
| 50 | + } |
| 51 | +pre.numberSource { margin-left: 3em; padding-left: 4px; } |
| 52 | +div.sourceCode |
| 53 | + { } |
| 54 | +@media screen { |
| 55 | +pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; } |
| 56 | +} |
23 | 57 | </style> |
24 | 58 |
|
25 | 59 |
|
@@ -178,8 +212,18 @@ <h2 id="toc-title">Table of contents</h2> |
178 | 212 |
|
179 | 213 | <ul> |
180 | 214 | <li><a href="#population-and-sample" id="toc-population-and-sample" class="nav-link active" data-scroll-target="#population-and-sample"><span class="header-section-number">2.1</span> Population and Sample</a></li> |
181 | | - <li><a href="#how-to-evaluate-and-pick-a-model" id="toc-how-to-evaluate-and-pick-a-model" class="nav-link" data-scroll-target="#how-to-evaluate-and-pick-a-model"><span class="header-section-number">2.2</span> How to evaluate and pick a model?</a></li> |
182 | | - <li><a href="#preview-linear-regression" id="toc-preview-linear-regression" class="nav-link" data-scroll-target="#preview-linear-regression"><span class="header-section-number">2.3</span> Preview: linear regression</a></li> |
| 215 | + <li><a href="#how-to-evaluate-and-pick-a-model" id="toc-how-to-evaluate-and-pick-a-model" class="nav-link" data-scroll-target="#how-to-evaluate-and-pick-a-model"><span class="header-section-number">2.2</span> How to evaluate and pick a model?</a> |
| 216 | + <ul class="collapse"> |
| 217 | + <li><a href="#prediction" id="toc-prediction" class="nav-link" data-scroll-target="#prediction"><span class="header-section-number">2.2.1</span> Prediction</a></li> |
| 218 | + <li><a href="#inference" id="toc-inference" class="nav-link" data-scroll-target="#inference"><span class="header-section-number">2.2.2</span> Inference</a></li> |
| 219 | + </ul></li> |
| 220 | + <li><a href="#the-numpy-package" id="toc-the-numpy-package" class="nav-link" data-scroll-target="#the-numpy-package"><span class="header-section-number">2.3</span> The NumPy Package</a> |
| 221 | + <ul class="collapse"> |
| 222 | + <li><a href="#subsetting" id="toc-subsetting" class="nav-link" data-scroll-target="#subsetting"><span class="header-section-number">2.3.1</span> Subsetting</a></li> |
| 223 | + <li><a href="#how-to-split-the-data-for-training-and-testing" id="toc-how-to-split-the-data-for-training-and-testing" class="nav-link" data-scroll-target="#how-to-split-the-data-for-training-and-testing"><span class="header-section-number">2.3.2</span> How to split the data for training and testing</a></li> |
| 224 | + </ul></li> |
| 225 | + <li><a href="#linear-regression-preview" id="toc-linear-regression-preview" class="nav-link" data-scroll-target="#linear-regression-preview"><span class="header-section-number">2.4</span> Linear Regression Preview</a></li>
| 226 | + <li><a href="#appendix-other-terms" id="toc-appendix-other-terms" class="nav-link" data-scroll-target="#appendix-other-terms"><span class="header-section-number">2.5</span> Appendix: Other terms</a></li> |
183 | 227 | </ul> |
184 | 228 | <div class="toc-actions"><ul><li><a href="https://github.com/ottrproject/OTTR_Quarto/edit/main/01-Problem-Setup.qmd" class="toc-action"><i class="bi bi-github"></i>Edit this page</a></li><li><a href="https://docs.google.com/forms/d/e/1FAIpQLSfBvVELBg8lcynKj0TrzMlov1zil-Sbkh9VhMKRcSpeo1xo6g/viewform" class="toc-action"><i class="bi empty"></i>Report an issue</a></li></ul></div></nav> |
185 | 229 | </div> |
@@ -230,17 +274,84 @@ <h2 data-number="2.1" class="anchored" data-anchor-id="population-and-sample"><s |
230 | 274 | <p><strong>Sample:</strong> A smaller collection of individual units that the researcher has selected to study. For NHANES, this could be a random sampling of the US population.</p> |
231 | 275 | <p>In Machine Learning problems, we often like to take two, non-overlapping samples from the population: the <strong>Training Set</strong>, and the <strong>Test Set</strong>. We <strong>train</strong> our model using the Training Set, which gives us a function <span class="math inline">\(f()\)</span> that relates the predictors to the outcome. Then, for our main use cases:</p> |
232 | 276 | <ol type="1"> |
233 | | -<li><strong>Prediction:</strong> We use the trained model to predict the outcome using predictors from the Test Set and compare the predicted outcome to the true value in the Test Set.</li> |
| 277 | +<li><strong>Prediction:</strong> We use the trained model to predict the outcome using predictors from the Test Set and compare to the true value in the Test Set.</li> |
234 | 278 | <li><strong>Inference</strong>: We examine the function <span class="math inline">\(f()\)</span>’s trained values, which are called <strong>parameters</strong>. For instance, <span class="math inline">\(f(Age,BMI,Income,…)=20 + 3 \cdot Age - .2 \cdot BMI + .00015 \cdot Income\)</span>, the values <span class="math inline">\(20\)</span>, <span class="math inline">\(3\)</span>, <span class="math inline">\(-.2\)</span>, and <span class="math inline">\(.00015\)</span> are the parameters. Because these parameters are derived from the Training Set, they are an <em>estimated</em> quantity from a sample, similar to other summary statistics like the mean of a sample. Therefore, to say anything about the true population, we have to use statistical tools such as p-values and confidence intervals.</li> |
235 | 279 | </ol> |
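<p>To make the inference example concrete, here is a minimal sketch of the example function above with its parameters written out. The input values below are hypothetical; only the parameter values (20, 3, -.2, .00015) come from the text.</p>

```python
# The example trained function f() from the text, with its parameters
# (intercept 20; coefficients 3, -0.2, and 0.00015) written out explicitly.
def f(age, bmi, income):
    return 20 + 3 * age - 0.2 * bmi + 0.00015 * income

# Hypothetical individual: 40 years old, BMI 25, income 50,000
print(f(40, 25, 50000))  # 20 + 120 - 5 + 7.5 = 142.5
```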
236 | 280 | <p>If the concepts of population, sample, estimation, p-value, and confidence interval are new to you, we recommend doing a bit of reading here [todo].</p>
237 | 281 | </section> |
238 | 282 | <section id="how-to-evaluate-and-pick-a-model" class="level2" data-number="2.2"> |
239 | 283 | <h2 data-number="2.2" class="anchored" data-anchor-id="how-to-evaluate-and-pick-a-model"><span class="header-section-number">2.2</span> How to evaluate and pick a model?</h2> |
240 | | -<p>The little example model we showcased above is an example of a <strong>linear model</strong>, but we will look at several types of models in this course. In order to decide how to evaluate and pick a model, we will need to develop a framework to assess a model.</p> |
| 284 | +<p>The little example model we showcased above is an example of a <strong>linear model</strong>, but we will look at several types of models in this course. In order to decide how to evaluate and pick a model, we will need to develop a framework to assess a model. Let’s start with the use case of prediction.</p> |
| 285 | +<section id="prediction" class="level3" data-number="2.2.1"> |
| 286 | +<h3 data-number="2.2.1" class="anchored" data-anchor-id="prediction"><span class="header-section-number">2.2.1</span> Prediction</h3> |
| 287 | +<p>Suppose we try to use the variable <span class="math inline">\(BMI\)</span> to predict <span class="math inline">\(BloodPressure\)</span> using a linear model.</p> |
| 288 | +<div id="2c1f20ff" class="cell" data-execution_count="1"> |
| 289 | +<div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> pandas <span class="im">as</span> pd</span> |
| 290 | +<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> seaborn <span class="im">as</span> sns</span> |
| 291 | +<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a>nhanes <span class="op">=</span> pd.read_csv(<span class="st">"classroom_data/NHANES.csv"</span>)</span> |
| 292 | +<span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a>nhanes[<span class="st">'BloodPressure'</span>] <span class="op">=</span> nhanes[<span class="st">'BPDiaAve'</span>] <span class="op">+</span> (nhanes[<span class="st">'BPSysAve'</span>] <span class="op">-</span> nhanes[<span class="st">'BPDiaAve'</span>]) <span class="op">/</span> <span class="dv">3</span> </span> |
| 293 | +<span id="cb1-5"><a href="#cb1-5" aria-hidden="true" tabindex="-1"></a></span> |
| 294 | +<span id="cb1-6"><a href="#cb1-6" aria-hidden="true" tabindex="-1"></a>plot <span class="op">=</span> sns.lmplot(x<span class="op">=</span><span class="st">"BMI"</span>, y<span class="op">=</span><span class="st">"BloodPressure"</span>, data<span class="op">=</span>nhanes)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> |
| 295 | +<div class="cell-output cell-output-display"> |
| 296 | +<div> |
| 297 | +<figure class="figure"> |
| 298 | +<p><img src="01-Problem-Setup_files/figure-html/cell-2-output-1.png" width="470" height="470" class="figure-img"></p> |
| 299 | +</figure> |
| 300 | +</div> |
| 301 | +</div> |
| 302 | +</div> |
| 303 | +<p>We examine how well our model performs in terms of prediction by seeing how close our model’s predicted <span class="math inline">\(BloodPressure\)</span> is to the Training Set’s true <span class="math inline">\(BloodPressure\)</span>: the <strong>Training Error</strong>. We also take the model to the Test Set, predict <span class="math inline">\(BloodPressure\)</span> using the Test Set’s predictors, and compare to the Test Set’s true <span class="math inline">\(BloodPressure\)</span>: the <strong>Testing Error</strong>. We want the model’s Training Error to be adequately small, but what we really care about is the Testing Error, because it is a true test of how the model performs on unseen, new data and shows how generalizable the model is.</p>
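<p>The Training and Testing Errors just described can be computed numerically. Below is a minimal sketch using NumPy with synthetic stand-in data (the chapter itself uses the NHANES <code>BMI</code> and <code>BloodPressure</code> columns); the stand-in values and the 75/25 split are illustrative choices, not from the text.</p>

```python
import numpy as np

# Synthetic stand-in for the NHANES data: BloodPressure loosely related to BMI
rng = np.random.default_rng(0)
bmi = rng.uniform(18, 40, size=200)
bp = 80 + 0.8 * bmi + rng.normal(0, 8, size=200)

# Non-overlapping Training and Test Sets (here, 150 vs. 50 rows)
idx = rng.permutation(200)
train, test = idx[:150], idx[150:]

# Train a linear model on the Training Set only
slope, intercept = np.polyfit(bmi[train], bp[train], deg=1)

def predict(x):
    return intercept + slope * x

# Training Error: predictions vs. true values on the Training Set
train_error = np.mean((bp[train] - predict(bmi[train])) ** 2)
# Testing Error: predictions vs. true values on the unseen Test Set
test_error = np.mean((bp[test] - predict(bmi[test])) ** 2)
print(f"Training Error (MSE): {train_error:.1f}")
print(f"Testing Error (MSE): {test_error:.1f}")
```

<p>Mean squared error is used here as the error metric; it is one common choice among several.</p>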
| 304 | +<p>Okay, let’s see how it does on the Training Set:</p>
| 305 | +<p>[graph here]</p> |
| 306 | +<p>And then on the Test Set:</p> |
| 307 | +<p>[graph here]</p> |
| 308 | +<p>We see that the Training Error is fairly high, and the Testing Error is even higher. This is an example of <strong>Underfitting</strong>, where our model fails to capture the complexity of the data in both the Training and Test Sets.</p>
| 309 | +<p>Let’s return to the drawing board and fit a new type of model that has more flexibility to capture complicated patterns in the data. Let’s see how it does on the Training Set:</p>
| 310 | +<p>[graph here]</p> |
| 311 | +<p>And then on the Test Set:</p> |
| 312 | +<p>[graph here]</p> |
| 313 | +<p>We see that the Training Error is low, but the Testing Error is huge! This is an example of <strong>Overfitting</strong>, in which our model fit the shape of the Training Set so well that it fails to generalize to the Test Set.</p>
| 314 | +<p>We want to find a model that is “just right”: one that neither underfits nor overfits the data. Usually, as the model becomes more flexible, the Training Error keeps decreasing, while the Testing Error decreases for a while and then starts increasing. See below:</p>
| 315 | +<div class="quarto-figure quarto-figure-center"> |
| 316 | +<figure class="figure"> |
| 317 | +<p><img src="images/testing_error-01.png" class="img-fluid figure-img"></p> |
| 318 | +<figcaption>Source: An Introduction to Statistical Learning, Ch. 2, by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, Jonathan Taylor.</figcaption>
| 319 | +</figure> |
| 320 | +</div> |
| 321 | +<p>Also see this interactive tutorial: <a href="https://mlu-explain.github.io/bias-variance/">https://mlu-explain.github.io/bias-variance/</a></p>
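<p>The U-shaped Testing Error curve in the figure above can be sketched by fitting increasingly flexible models. Here is a minimal sketch using polynomials of growing degree on synthetic stand-in data; the particular degrees and data are illustrative assumptions, not from the text.</p>

```python
import numpy as np

# Synthetic, mildly nonlinear data
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=80)
y = np.sin(x) + rng.normal(0, 0.3, size=80)

idx = rng.permutation(80)
train, test = idx[:60], idx[60:]

# As the polynomial degree (flexibility) grows, the Training Error keeps
# falling, while the Testing Error eventually turns back up (overfitting).
train_errors, test_errors = {}, {}
for degree in (1, 3, 5, 9):
    coeffs = np.polyfit(x[train], y[train], deg=degree)
    train_errors[degree] = np.mean((y[train] - np.polyval(coeffs, x[train])) ** 2)
    test_errors[degree] = np.mean((y[test] - np.polyval(coeffs, x[test])) ** 2)
    print(f"degree {degree}: train {train_errors[degree]:.3f}, "
          f"test {test_errors[degree]:.3f}")
```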
| 322 | +</section> |
| 323 | +<section id="inference" class="level3" data-number="2.2.2"> |
| 324 | +<h3 data-number="2.2.2" class="anchored" data-anchor-id="inference"><span class="header-section-number">2.2.2</span> Inference</h3> |
| 325 | +<p>Let’s consider how we would evaluate and choose models for Inference.</p> |
| 326 | +<p>For models with a low number of predictors, there are diagnostic plots and metrics one can consider, such as the BIC.</p>
| 327 | +<p>For models with a high number of predictors, we will discuss model selection in more detail in weeks 5 and 6.</p>
| 328 | +<p>Besides how flexible a model is, another way to categorize machine learning models is how <strong>interpretable</strong> they are. The more interpretable a model is, the better one can describe how each variable contributes as a predictor in the model. That makes the inference process easier.</p>
| 329 | +<p>Below are some example models mapped along these two dimensions. The linear model we have been using sits close to the “Least Squares” models.</p>
| 330 | +<div class="quarto-figure quarto-figure-center"> |
| 331 | +<figure class="figure"> |
| 332 | +<p><img src="images/flexibility_vs_interpretability.png" class="img-fluid figure-img" width="500"></p> |
| 333 | +<figcaption>Source: An Introduction to Statistical Learning, Ch. 2, by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, Jonathan Taylor</figcaption>
| 334 | +</figure> |
| 335 | +</div> |
| 336 | +</section> |
| 337 | +</section> |
| 338 | +<section id="the-numpy-package" class="level2" data-number="2.3"> |
| 339 | +<h2 data-number="2.3" class="anchored" data-anchor-id="the-numpy-package"><span class="header-section-number">2.3</span> The NumPy Package</h2> |
| 340 | +<section id="subsetting" class="level3" data-number="2.3.1"> |
| 341 | +<h3 data-number="2.3.1" class="anchored" data-anchor-id="subsetting"><span class="header-section-number">2.3.1</span> Subsetting</h3> |
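<p>A brief sketch of the main ways to subset a NumPy array; the values here are illustrative.</p>

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])

print(a[1])          # single element by position: 20
print(a[1:4])        # slice of positions 1 through 3: [20 30 40]
print(a[a > 25])     # boolean mask keeps matching elements: [30 40 50]
print(a[[0, 2, 4]])  # a list of positions (fancy indexing): [10 30 50]

# The same ideas extend to 2-D arrays, using row and column indices
m = np.arange(12).reshape(3, 4)
print(m[0, :])       # first row: [0 1 2 3]
print(m[:, 1])       # second column: [1 5 9]
```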
| 342 | +</section> |
| 343 | +<section id="how-to-split-the-data-for-training-and-testing" class="level3" data-number="2.3.2"> |
| 344 | +<h3 data-number="2.3.2" class="anchored" data-anchor-id="how-to-split-the-data-for-training-and-testing"><span class="header-section-number">2.3.2</span> How to split the data for training and testing</h3> |
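<p>One common way to split data into Training and Test Sets, sketched with NumPy: shuffle the row indices, then cut at the desired training fraction. The 80/20 fraction and the stand-in arrays are illustrative assumptions.</p>

```python
import numpy as np

# Stand-in data: 10 rows of predictors and 10 outcome values
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Shuffling the indices first makes the two sets random and non-overlapping
rng = np.random.default_rng(0)
idx = rng.permutation(len(X))
cut = int(0.8 * len(X))          # 80% train / 20% test
train_idx, test_idx = idx[:cut], idx[cut:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```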
| 345 | +</section> |
| 346 | +</section> |
| 347 | +<section id="linear-regression-preview" class="level2" data-number="2.4"> |
| 348 | +<h2 data-number="2.4" class="anchored" data-anchor-id="linear-regression-preview"><span class="header-section-number">2.4</span> Linear Regression Preview</h2>
241 | 349 | </section> |
242 | | -<section id="preview-linear-regression" class="level2" data-number="2.3"> |
243 | | -<h2 data-number="2.3" class="anchored" data-anchor-id="preview-linear-regression"><span class="header-section-number">2.3</span> Preview: linear regression</h2> |
| 350 | +<section id="appendix-other-terms" class="level2" data-number="2.5"> |
| 351 | +<h2 data-number="2.5" class="anchored" data-anchor-id="appendix-other-terms"><span class="header-section-number">2.5</span> Appendix: Other terms</h2> |
| 352 | +<p>Parametric vs. Non-parametric</p> |
| 353 | +<p>Bias-Variance trade-off</p> |
| 354 | +<p>Supervised vs. Unsupervised</p> |
244 | 355 |
|
245 | 356 |
|
246 | 357 | </section> |
|