Commit a03d261

1 parent 2d4cb0f commit a03d261

2 files changed

Lines changed: 38 additions & 18 deletions

02-Regression.qmd

Lines changed: 1 addition & 1 deletion
@@ -66,7 +66,7 @@ Let's take a look what are the assumptions needed for a sound model, and what we

The linear regression model assumes that there is a straight-line (linear) relationship between the predictors and the response. It doesn't ask for the straight-line relationship to be perfect, but rather that, on average, the cloud of points has a linear shape. If that is not true, then our predictions are going to be less accurate.

-To check for this relationship, we have to calculate the **residual**, which is the difference between the response value and the predicted response value (similar to a type of model performance metrics we examined last week). Then, we can make a **residual plot** of the predicted response vs. residual. Ideally, this residual plot should have no pattern - some residuals above 0, some below 0, but no strong trend.
+We can check this relationship by seeing whether each predictor is linear with the predicted response value, but this is cumbersome with multiple predictors. Instead, we typically calculate the **residual**, which is the difference between the response value and the predicted response value (similar to a type of model performance metric we examined last week). Then, we can make a **residual plot** of the predicted response vs. residual. Ideally, this residual plot should have no pattern - some residuals above 0, some below 0, but no strong trend.

If there's a trend in the data, that means there are non-linear associations between some of the predictors and the response.
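As a sketch of that residual check (made-up data and a made-up fitted model, mirroring the scikit-learn workflow used in this course, not the actual lesson data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data with a roughly linear trend, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.5, size=100)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# Residual = response value - predicted response value
residuals = y - y_pred

# For the residual plot, one would scatter y_pred (x-axis) vs. residuals (y-axis);
# with an intercept in the model, the residuals average out to essentially 0
print(abs(residuals.mean()) < 1e-9)
```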

03-Classification.qmd

Lines changed: 37 additions & 17 deletions
@@ -2,7 +2,7 @@

*In the third week of class, we will look at classification...*

-So far, we have looked at making predictions in which the outcome is a continous value. Today, we will look at **classification**, in which our outcome is a categorical value. We first start with binary classification, in which there is only two possible outcomes, usually defined by `True` or `False` values. However, for many classification models, they predict the *probability* of whether an event happens or not, on a continuous scale from 0 to 1. Then, given the predicted probability, we say after a threshold, we classify the outcome as `True` or `False` values.
+So far, we have looked at making predictions in which the outcome is a continuous value. Today, we will look at **classification**, in which our outcome is a categorical value. We first start with binary classification, in which there are only two possible outcomes, usually encoded as `True` or `False` values. However, many classification models predict the *probability* of whether an event happens or not, on a continuous scale from 0 to 1. Then, given the predicted probability, we draw a boundary to classify the outcome as `True` or `False` values.

Using the same data as before, we define someone as being at risk for $Hypertension$ if their diastolic pressure is greater than 80 or their systolic pressure is greater than 130. Our goal is to classify whether someone is at high risk for $Hypertension$.
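To make the boundary idea concrete, here is a minimal sketch (the probabilities are made up, not output from any model in the lesson):

```python
import numpy as np

# Hypothetical predicted probabilities from some classifier
probs = np.array([0.10, 0.45, 0.62, 0.90])

# A decision boundary at 0.5 turns probabilities into True/False classes
classes = probs >= 0.5
print(classes.tolist())  # [False, False, True, True]
```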

@@ -17,7 +17,7 @@ import matplotlib.pyplot as plt
from formulaic import model_matrix
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
-from sklearn.metrics import mean_squared_error, accuracy_score, confusion_matrix, ConfusionMatrixDisplay
+from sklearn.metrics import mean_squared_error, log_loss, accuracy_score, confusion_matrix, ConfusionMatrixDisplay
nhanes = pd.read_csv("classroom_data/NHANES.csv")
nhanes.drop_duplicates(inplace=True)
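For reference, the $Hypertension$ label described above could be derived along these lines (toy readings and column names are assumptions here; the actual NHANES column names may differ):

```python
import pandas as pd

# Toy blood-pressure readings, for illustration only
df = pd.DataFrame({"Diastolic": [70, 85, 75], "Systolic": [120, 125, 135]})

# At risk for Hypertension if diastolic > 80 OR systolic > 130
df["Hypertension"] = (df["Diastolic"] > 80) | (df["Systolic"] > 130)
print(df["Hypertension"].tolist())  # [False, True, True]
```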
@@ -36,7 +36,6 @@ We see that there is a lot more data for the No Hypertension group (88%) compare

We then split our data in training and testing, as usual:

```{python}
-#train test
nhanes_train, nhanes_test = train_test_split(nhanes, test_size=0.2, random_state=42)
```

@@ -49,7 +48,7 @@ ax.set_ylabel('')
plt.show()
```

-Great, there seems to be an association. However, recall that our classification model is going to be *making predictions of probabilit*y on a continuous scale of 0 to 1 before we classify it into two categories. Therefore, it makes sense to examine the relationship between BMI and empirical Hypertension probability in our data exploration. To do so, we will need to *bin* our data by small chunks of BMI values and calculate the empirical Hypertension probability for that bin. We plot the midpoint binned BMI value vs. empirical Hypertension probability for 20 bins:
+Great, there seems to be an association. However, recall that our classification model is going to be *making predictions of probability on a continuous scale of 0 to 1* before we classify them into two categories. Therefore, it makes sense to examine the relationship between BMI and the empirical Hypertension probability in our data exploration. To do so, we will need to *bin* our data into small chunks of BMI values and calculate the empirical Hypertension probability for each bin. We plot the midpoint of each BMI bin vs. its empirical Hypertension probability for 20 bins:

```{python}
nhanes_train['bins'] = pd.cut(nhanes_train['BMI'], bins=20)
@@ -83,11 +82,15 @@ Now, let's build the model $P(Hypertension) = f(BMI)$ to make a prediction of $H

### Logistic Transformation

-Our usual Linear Regression model $P(Hypertension)=\beta_0+\beta_1 \cdot BMI$ *does not* give us outputs between 0 and 1. To deal with this, we perform the **Logistic Transformation:**
+Our usual Linear Regression model
+
+$$P(Hypertension)=\beta_0+\beta_1 \cdot BMI$$
+
+*does not* give us outputs between 0 and 1. To deal with this, we perform the **Logistic Transformation:**

$$P(Hypertension) = \frac{1}{1+e^{-(\beta_0 + \beta_1 \cdot BMI)}}$$

-Therefore, the relationship between the X and Y axis is not going to be a straight line, but rather a non-linear, "S-shaped" one. Let's fit this model and look at the model visually to understand.
+This forces the right-hand side of the equation to be between 0 and 1, which is the scale of a probability. The relationship between the X and Y axes is not going to be a straight line, but rather a non-linear, "S-shaped" one. Let's fit this model and look at it visually to understand.
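Numerically, the transformation squashes any linear output into the interval (0, 1). A quick sketch with made-up coefficients (chosen for illustration, not the fitted values):

```python
import numpy as np

def logistic(z):
    # The logistic (sigmoid) function maps any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

# Made-up coefficients for illustration only
b0, b1 = -6.0, 0.2
for bmi in [20, 30, 50]:
    print(bmi, round(float(logistic(b0 + b1 * bmi)), 3))
```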
```{python}
y_train, X_train = model_matrix("Hypertension ~ BMI", nhanes_train)
@@ -96,8 +99,6 @@ logit_model = LogisticRegression().fit(X_train, y_train)
plt.clf()
plt.scatter(X_train.BMI, logit_model.predict_proba(X_train)[:, 1], color="red", label="Fitted Line")
plt.scatter(nhanes_train_binned['bin_midpoint'], nhanes_train_binned['p'], color='blue')
-
-
plt.xlabel('BMI')
plt.ylabel('Probability of Hypertension')
plt.ylim(0, 1)
@@ -129,9 +130,9 @@ Now we see why exactly our logistic regression model was limited to fit the rela

### Model Evaluation

-Remember that we still need to get from probability to classification. We will set a reasonable, interpretable cutoff of 50%: if the probability of having Hypertension is \>=50%, then classify that person having Hypertension. Otherwise, they do not have Hypertension This cutoff called the **Decision Boundary**.
+Remember that we still need to get from probability to classification. We will set a reasonable, interpretable cutoff of 50%: if the probability of having Hypertension is \>=50%, then we classify that person as having Hypertension. Otherwise, they do not have Hypertension. This cutoff is called the **Decision Boundary**.

-As an aside, we can also evaluate the model based on just the probability it predicted, and it actually contains more information than if we had set our decision boundary to make it a classifier. However, these metrics of evaluation, namely [**Cross Entropy** and **Brier Scores**](https://aml4td.org/chapters/cls-metrics.html#sec-cls-metrics-soft), are harder to interpret, and are less commonly reported in biomedical research.
+As an aside, we can also evaluate the model based just on the probabilities it predicted, which actually contain more information than if we had set our decision boundary and classified our response as a True/False dichotomy. However, metrics that evaluate probabilities, namely [**Cross Entropy** and **Brier Scores**](https://aml4td.org/chapters/cls-metrics.html#sec-cls-metrics-soft), are harder to interpret and are less commonly reported in biomedical research. We will stick with classification metrics for this course.
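For completeness, both probability-based metrics are available in scikit-learn. A small sketch with made-up labels and probabilities (not the NHANES model's output):

```python
from sklearn.metrics import log_loss, brier_score_loss

# Made-up true labels and predicted probabilities, for illustration only
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

print(log_loss(y_true, y_prob))          # Cross Entropy: lower is better
print(brier_score_loss(y_true, y_prob))  # Brier Score: lower is better
```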

Given this decision boundary, let's evaluate the model on the test set, and look at its accuracy rate:

@@ -164,7 +165,7 @@ Therefore, we do a pretty terrible job of predicting the Hypertension cases!

What happened exactly? Let's look back at the Training Data: it seems from the plots that we are making predictions of Hypertension for BMI of 50 or more. However, there are so few people with such a high BMI that even if most of those folks have Hypertension, the model missed most of the folks with Hypertension in the 20-40 BMI range. This range wasn't high enough for our decision boundary of 50% probability, so we missed most of our Hypertension people.

-What can we do? We can change the decision boundary to be lower, which will improve our sensitivity at the expense of our specificity, and vice versa if we change the decision boundary to be higher. What if we set the new decision boundary to be .2?
+What can we do? There are lots of things we can change about the model, but let's tinker with the decision boundary for a moment. We can lower the decision boundary, which will improve our sensitivity at the expense of our specificity, and vice versa if we raise it. What if we set the new decision boundary to be 0.2?

```{python}
@@ -176,20 +177,41 @@ disp.plot()
plt.show()
```

+Our **Sensitivity** (accuracy on Hypertension events) is defined as $\frac{TP}{TP+FN}$, which is 254/(254+86) ≈ 75%.
+
+Our **Specificity** (accuracy on No Hypertension events) is defined as $\frac{TN}{TN+FP}$, which is 582/(582+570) ≈ 51%.
+
+We have improved our sensitivity at the cost of our specificity! You can explore the range of tradeoffs from the decision boundary cutoff using the Receiver Operating Characteristic (ROC) curve in the Appendix.
+
+Let's pause model evaluation here for now, and look back at the assumptions of logistic regression, as we did for linear regression.
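Both ratios can be read straight off the confusion matrix. As a sketch, using the counts quoted above:

```python
# Counts from the confusion matrix above
TP, FN = 254, 86   # Hypertension cases: caught vs. missed
TN, FP = 582, 570  # No Hypertension cases: correct vs. falsely flagged

sensitivity = TP / (TP + FN)  # accuracy on the Hypertension events
specificity = TN / (TN + FP)  # accuracy on the No Hypertension events
print(round(sensitivity, 2), round(specificity, 2))
```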
## Assumptions of logistic regression

### Linearity of log odds - predictor relationship

-We can rewrite $P(Hyptertension) = \frac{e^{\beta_0 + \beta_1X}}{1+e^{\beta_0 + \beta_1X}}$ as $log(\frac{P(Hyptertension)}{1 - P(Hyptertension)}) = \beta_0 + \beta_1 \cdot BMI$
+Recall that our logistic regression model
+
+$$P(Hypertension) = \frac{1}{1+e^{-(\beta_0 + \beta_1 \cdot BMI)}}$$
+
+can be rewritten as:
+
+$$\log\left(\frac{P(Hypertension)}{1 - P(Hypertension)}\right) = \beta_0 + \beta_1 \cdot BMI$$

-where the left hand side is called the **log odds** or the **logit**.
+where the left hand side is called the **log odds** or the **logit**. For logistic regression, we need the log odds of the response to have a linear relationship with each predictor.
+
+In linear regression, we can check linearity between the response and multiple predictors by calculating the **residual** and comparing it to the predicted response. There is a similar analysis in logistic regression: **residual deviance**.
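The algebra behind this rewriting takes two short steps. Writing $z = \beta_0 + \beta_1 \cdot BMI$ and $P = P(Hypertension)$ for brevity:

$$1 - P = 1 - \frac{1}{1+e^{-z}} = \frac{e^{-z}}{1+e^{-z}}, \qquad \frac{P}{1-P} = \frac{1}{e^{-z}} = e^{z}$$

so taking the log of the odds recovers $\log\left(\frac{P}{1-P}\right) = z = \beta_0 + \beta_1 \cdot BMI$.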

```{python}
+import statsmodels.api as sm
+
+model = sm.Logit(y_train, X_train)
+results = model.fit()
+
+plt.clf()
+plt.scatter(X_train.BMI, results.resid_dev)
+plt.xlabel('BMI')
+plt.ylabel('Residual Deviance')
+plt.show()
```

-Deviance: https://stackoverflow.com/questions/50975774/calculate-residual-deviance-from-scikit-learn-logistic-regression-model

### Predictors are not collinear

covered last week
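As a quick refresher, one common screen for collinearity is the variance inflation factor (VIF), computed from the $R^2$ of regressing each predictor on the others. A sketch with made-up data, where column `x2` is deliberately nearly a copy of `x1`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up predictors; x2 is nearly a copy of x1, so both should get large VIFs
X = np.column_stack([
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],   # x1
    [1.1, 2.1, 2.9, 4.2, 5.1, 5.9],   # x2 (collinear with x1)
    [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],   # x3
])

def vif(X, j):
    # Regress column j on the remaining columns; VIF = 1 / (1 - R^2)
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

# A VIF well above roughly 5-10 flags problematic collinearity
for j in range(X.shape[1]):
    print(j, round(vif(X, j), 1))
```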
@@ -202,8 +224,6 @@ covered last week

covered last week

-### interactions: H-statistic?
-
## Appendix: Inference for Logistic Regression

Let's do the same for our Logistic Regression Classifier model, which has an equation of:
