<p>Okay, great: it looks like the higher someone’s BMI is, the more likely that person is to have Hypertension.</p>
<p>Now, let’s build the model <span class="math inline">\(Hypertension = f(BMI)\)</span> to make a prediction of <span class="math inline">\(Hypertension\)</span> given <span class="math inline">\(BMI\)</span>.</p>
<p>Instead of boxplots, we plotted the data just using points, with “Hypertension” at a probability of 1 and “No Hypertension” at a probability of 0. The fitted blue line gives a value for every value of BMI and represents our machine learning model <span class="math inline">\(f(BMI)\)</span>. This model is called <strong>Logistic Regression</strong>.</p>
<p>The first thing we want to investigate about this model is how well it performs at Classification. Using just <span class="math inline">\(BMI\)</span> as a variable, what is the Accuracy of <span class="math inline">\(f(BMI)\)</span> in classifying whether a person has <span class="math inline">\(Hypertension\)</span>? Notice that <span class="math inline">\(f(BMI)\)</span> gives us continuous probability values: for example, given a BMI of 30, it might say there is a 20% chance the person has Hypertension. We need a discrete cutoff on these probabilities to decide whether to classify the person as having Hypertension.</p>
<p>A reasonable cutoff to start with is 50%: if the probability of having Hypertension is at least 50%, classify that person as having Hypertension; if it is below 50%, classify them as not having it. This cutoff is called the <strong>Decision Boundary</strong>.</p>
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a>prediction_cut <span class="op">=</span> [<span class="dv">1</span> <span class="cf">if</span> x <span class="op">>=</span> <span class="fl">.5</span> <span class="cf">else</span> <span class="dv">0</span> <span class="cf">for</span> x <span class="kw">in</span> logit_model.predict()]</span>
<p>Suppose we try to use the single variable <span class="math inline">\(BMI\)</span> to predict <span class="math inline">\(BloodPressure\)</span> using a linear model.</p>
</div>
<p>We examine how well our model performs in terms of prediction by seeing how close the model’s predicted <span class="math inline">\(BloodPressure\)</span> is to the Training Set’s true <span class="math inline">\(BloodPressure\)</span>: the <strong>Training Error</strong>. We also take the model to the Testing Set, predicting <span class="math inline">\(BloodPressure\)</span> from the Test Set’s predictors and comparing to the Test Set’s true <span class="math inline">\(BloodPressure\)</span>: the <strong>Testing Error</strong>. We want the model’s Training Error to be adequately small, but what we really care about is the Testing Error, because it is a true test of how the model performs on unseen, new data, and it shows us how generalizable the model is.</p>
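<p>The split-fit-evaluate procedure can be sketched as follows, using mean squared error and synthetic stand-in data (hypothetical values; <code>np.polyfit</code> stands in for whatever fitting routine the chapter uses):</p>

```python
import numpy as np

# Synthetic stand-in for the chapter's data (hypothetical values)
rng = np.random.default_rng(0)
bmi = rng.uniform(18, 45, size=300)
bp = 90 + 1.2 * bmi + rng.normal(0, 8, size=300)

# Hold out a Test Set the model never sees during fitting
idx = rng.permutation(300)
train, test = idx[:240], idx[240:]

# Fit BloodPressure ~ BMI on the Training Set only
coef = np.polyfit(bmi[train], bp[train], deg=1)

# Training Error: predictions vs truth on the data used for fitting
train_error = np.mean((np.polyval(coef, bmi[train]) - bp[train]) ** 2)
# Testing Error: predictions vs truth on held-out, unseen data
test_error = np.mean((np.polyval(coef, bmi[test]) - bp[test]) ** 2)
```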
<p>Okay, let’s see how it does on the Training Set:</p>
<span id="cb18-3"><a href="#cb18-3" aria-hidden="true" tabindex="-1"></a>plt.legend()<span class="op">;</span></span></code></pre></div>
</div>
<p>We see that the Training Error is fairly high, and the Testing Error is even higher. This is an example of <strong>Underfitting</strong>, where our model fails to capture the complexity of the data on both the Training and Testing Sets.</p>
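<p>Underfitting can be illustrated on synthetic data (all values hypothetical): when the true pattern is curved, a straight line leaves a large error everywhere, while a more flexible fit, here a degree-5 polynomial chosen purely for illustration, tracks the pattern much better.</p>

```python
import numpy as np

# Synthetic curved data (hypothetical): y follows a sine-like pattern
rng = np.random.default_rng(1)
x = rng.uniform(0, 3, size=200)
y = np.sin(2 * x) + rng.normal(0, 0.2, size=200)

idx = rng.permutation(200)
train, test = idx[:160], idx[160:]

line = np.polyfit(x[train], y[train], deg=1)   # too simple: underfits
curve = np.polyfit(x[train], y[train], deg=5)  # more flexible fit

def mse(sel, coef):
    return np.mean((np.polyval(coef, x[sel]) - y[sel]) ** 2)

# The straight line's error is high on BOTH sets: underfitting
underfit_train, underfit_test = mse(train, line), mse(test, line)
flexible_train = mse(train, curve)
```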
<p>Let’s return to the drawing board and fit a new type of model with more flexibility to capture complicated patterns in the data. Let’s see how it does on the Training Set:</p>