## 02-Regression.qmd

Let's take a look at the assumptions needed for a sound model.
The linear regression model assumes that there is a straight-line (linear) relationship between the predictors and the response. It doesn't require the straight-line relationship to be perfect; rather, on average, the cloud of points should have a linear shape. If that is not true, then our predictions are going to be less accurate.
We can check this relationship by seeing whether each predictor is linear with the predicted response value, but this is cumbersome with multiple predictors. Rather, we typically calculate the **residual**, which is the difference between the response value and the predicted response value (similar to a type of model performance metric we examined last week). Then, we can make a **residual plot** of the predicted response vs. the residual. Ideally, this residual plot should have no pattern - some residuals above 0, some below 0, but no strong trend.
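As a minimal sketch of this check, on made-up data (not the course dataset) with a truly linear relationship, so the residual plot should show no trend:

```{python}
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 1, size=200)  # truly linear, plus noise

model = LinearRegression().fit(X, y)
predicted = model.predict(X)
residual = y - predicted  # response minus predicted response

# Residual plot: predicted response vs. residual
plt.scatter(predicted, residual)
plt.axhline(0, color="red")
plt.xlabel("Predicted response")
plt.ylabel("Residual")
plt.show()
```

Because the relationship here really is linear, the residuals scatter evenly above and below 0.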
If there's a trend in the data, that means there are non-linear associations between some of the predictors and the response.
## 03-Classification.qmd
*In the third week of class, we will look at classification...*
So far, we have looked at making predictions in which the outcome is a continuous value. Today, we will look at **classification**, in which our outcome is a categorical value. We first start with binary classification, in which there are only two possible outcomes, usually encoded as `True` or `False` values. However, many classification models predict the *probability* of whether an event happens, on a continuous scale from 0 to 1. Then, given the predicted probability, we draw a boundary to classify the outcome as `True` or `False`.
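A tiny sketch of that last step, with made-up probabilities standing in for model output:

```{python}
import numpy as np

# Made-up predicted probabilities for four observations
predicted_prob = np.array([0.05, 0.40, 0.51, 0.93])

# Draw the boundary at 0.5: probabilities at or above it become True
threshold = 0.5
predicted_class = predicted_prob >= threshold
print(predicted_class)  # [False False  True  True]
```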
Using the same data as before, we define someone as at risk for $Hypertension$ if their diastolic pressure is greater than 80 or their systolic pressure is greater than 130. Our goal is to classify whether someone is at high risk for $Hypertension$.
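The labeling rule can be sketched on a toy DataFrame; note the column names `Diastolic` and `Systolic` are assumptions for illustration, and the real NHANES columns may be named differently:

```{python}
import pandas as pd

# Toy data; "Diastolic"/"Systolic" are assumed column names, not NHANES's
toy = pd.DataFrame({"Diastolic": [70, 85, 78],
                    "Systolic": [120, 125, 135]})

# At risk if diastolic > 80 OR systolic > 130
toy["Hypertension"] = (toy["Diastolic"] > 80) | (toy["Systolic"] > 130)
print(toy["Hypertension"].tolist())  # [False, True, True]
```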
```{python}
import pandas as pd
import matplotlib.pyplot as plt
from formulaic import model_matrix
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error, log_loss, accuracy_score, confusion_matrix, ConfusionMatrixDisplay

nhanes = pd.read_csv("classroom_data/NHANES.csv")
nhanes.drop_duplicates(inplace=True)
```
We see that there is a lot more data for the No Hypertension group (88%) compared to the Hypertension group.

We then split our data into training and testing, as usual:
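The usual split can be sketched like this, with a toy DataFrame standing in for the real data:

```{python}
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the NHANES data
toy = pd.DataFrame({"BMI": [22, 31, 28, 40, 19, 35, 27, 30],
                    "Hypertension": [False, True, False, True,
                                     False, True, False, True]})

# Hold out 25% of rows for testing; fix the seed for reproducibility
train, test = train_test_split(toy, test_size=0.25, random_state=0)
print(len(train), len(test))  # 6 2
```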
Great, there seems to be an association. However, recall that our classification model is going to be *making predictions of probability on a continuous scale of 0 to 1* before we classify it into two categories. Therefore, it makes sense to examine the relationship between BMI and empirical Hypertension probability in our data exploration. To do so, we will need to *bin* our data into small chunks of BMI values and calculate the empirical Hypertension probability within each bin. We plot the midpoint of each binned BMI value vs. the empirical Hypertension probability for 20 bins:
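The binning step can be sketched as follows, on synthetic data in place of NHANES (the BMI-Hypertension trend below is made up for illustration):

```{python}
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"BMI": rng.uniform(18, 50, 500)})
# Made-up trend: higher BMI, higher chance of Hypertension
df["Hypertension"] = rng.random(500) < (df["BMI"] - 18) / 40

# Cut BMI into 20 equal-width bins, then take the mean of the
# True/False labels per bin: the empirical probability in that bin
bmi_bins = pd.cut(df["BMI"], bins=20)
empirical = df.groupby(bmi_bins, observed=True)["Hypertension"].mean()

# Bin midpoints for the x-axis of the plot
midpoints = [interval.mid for interval in empirical.index]
print(len(midpoints))
```

These midpoints and empirical probabilities are what get plotted against each other.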
Now, let's build the model $P(Hypertension) = f(BMI)$ to make a prediction of $Hypertension$.

### Logistic Transformation

Our usual Linear Regression model

$$P(Hypertension)=\beta_0+\beta_1 \cdot BMI$$

*does not* give us outputs between 0 and 1. To deal with this, we perform the **Logistic Transformation:**

$$P(Hypertension) = \frac{e^{\beta_0 + \beta_1 \cdot BMI}}{1+e^{\beta_0 + \beta_1 \cdot BMI}}$$
This forces the right hand side of the equation to be between 0 and 1, which is on the scale of a probability. The relationship between the X and Y axes is not going to be a straight line, but rather a non-linear, "S-shaped" one. Let's fit this model and look at it visually to understand.
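A quick sketch of that S-shape, plotting the logistic transformation over a range of linear-predictor values:

```{python}
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-6, 6, 200)            # beta_0 + beta_1 * BMI values
p = np.exp(x) / (1 + np.exp(x))        # logistic transformation: always in (0, 1)

plt.plot(x, p)
plt.xlabel("Linear predictor")
plt.ylabel("Probability")
plt.show()
```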
Now we see why exactly our logistic regression model was limited in fitting the relationship.

### Model Evaluation

Remember that we still need to get from probability to classification. We will set a reasonable, interpretable cutoff of 50%: if the probability of having Hypertension is \>=50%, then we classify that person as having Hypertension; otherwise, they do not. This cutoff is called the **Decision Boundary**.
As an aside, we can also evaluate the model based just on the probability it predicted, which actually contains more information than if we had set a decision boundary and classified our response as a True/False dichotomy. However, metrics that evaluate probabilities, namely [**Cross Entropy** and **Brier Scores**](https://aml4td.org/chapters/cls-metrics.html#sec-cls-metrics-soft), are harder to interpret and less commonly reported in biomedical research. We will stick with classification evaluation metrics for this course.
Given this decision boundary, let's evaluate the model on the test set and look at its accuracy rate:
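As a minimal sketch of that evaluation, with made-up labels and probabilities standing in for the test set:

```{python}
from sklearn.metrics import accuracy_score

# Made-up true labels and predicted probabilities, for illustration only
y_true = [True, False, True, False, False]
predicted_prob = [0.7, 0.2, 0.4, 0.6, 0.1]

# Apply the 50% decision boundary, then score the classifications
y_pred = [p >= 0.5 for p in predicted_prob]
print(accuracy_score(y_true, y_pred))  # 0.6
```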
Therefore, we do a pretty terrible job of predicting the Hypertension cases!
What happened exactly? Let's look back at the Training Data: it seems from the plots that we are making predictions of Hypertension for a BMI of 50 or more. However, there are so few people with such a high BMI that even if most of those folks have Hypertension, the model missed most of the folks with Hypertension in the 20-40 BMI range. This range wasn't high enough to clear our decision boundary of 50% probability, so we missed most of our Hypertension cases.
What can we do? There are lots of things we can change about the model, but let's tinker with the decision boundary for a moment. We can lower the decision boundary, which will improve our sensitivity at the expense of our specificity, and vice versa if we raise it. What if we set the new decision boundary to be .2?
```{python}
disp.plot()
plt.show()
```
Our **Sensitivity** (accuracy of Hypertension events) is defined as: $\frac{TP}{TP+FN}$, which is 254/(254+86) = 74%
Our **Specificity** (accuracy of No Hypertension events) is defined as: $\frac{TN}{TN+FP}$, which is 582/(582+570) = 51%.
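We can recompute both quantities directly from the confusion-matrix counts quoted above:

```{python}
# Counts from the confusion matrix above
TP, FN = 254, 86   # Hypertension cases: caught vs. missed
TN, FP = 582, 570  # No-Hypertension cases: correct vs. false alarms

sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)
print(f"{sensitivity:.1%} {specificity:.1%}")
```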
We have improved our sensitivity at the cost of our specificity! You can explore the range of tradeoffs from the decision boundary cutoff using the Receiver Operating Characteristic (ROC) curve in the Appendix.
Let's pause model evaluation here for now, and look back at the assumptions of logistic regression, as we did for linear regression.
## Assumptions of logistic regression
### Linearity of log odds - predictor relationship
We can rewrite $P(Hypertension) = \frac{e^{\beta_0 + \beta_1X}}{1+e^{\beta_0 + \beta_1X}}$ as $log(\frac{P(Hypertension)}{1 - P(Hypertension)}) = \beta_0 + \beta_1 \cdot BMI$, where the left hand side is called the **log odds** or the **logit**. From exploratory data analysis, we need the log odds of the response to be in a linear relationship with each predictor in logistic regression.
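A quick numerical check that the logistic transformation and the logit undo each other:

```{python}
import numpy as np

log_odds = np.array([-2.0, 0.0, 1.5])           # beta_0 + beta_1 * BMI values
p = np.exp(log_odds) / (1 + np.exp(log_odds))   # logistic transformation
recovered = np.log(p / (1 - p))                 # logit: back to log odds
print(np.allclose(recovered, log_odds))  # True
```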
In linear regression, we can check the linearity between the response and multiple predictors by calculating the **residual** and comparing it to the predicted response. There is a similar analysis in logistic regression: **residual deviance**.
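As a hedged sketch (this is the standard deviance-residual formula, not code from these notes), deviance residuals for a binary outcome $y \in \{0, 1\}$ and predicted probabilities $p$ can be computed as:

```{python}
import numpy as np

# Made-up outcomes and predicted probabilities
y = np.array([1, 0, 1, 0])
p = np.array([0.8, 0.3, 0.4, 0.9])

# Per-observation log-likelihood under the predicted probabilities
loglik = y * np.log(p) + (1 - y) * np.log(1 - p)

# Deviance residual: signed square root of -2 * log-likelihood
dev_resid = np.sign(y - p) * np.sqrt(-2 * loglik)
print(np.round(dev_resid, 2))
```

Large-magnitude deviance residuals flag observations the model fits poorly, much like large residuals do in linear regression.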