
Commit 8cd9fb6

Update 02-Regression.qmd
Parent: b8bcf5c

1 file changed, 20 additions & 12 deletions


02-Regression.qmd

@@ -51,21 +51,23 @@ This model is formed by making the line of best fit determined by the minimu

Another illustration:

![Image source: https://kenndanielso.github.io/mlrefined/blog_posts/8_Linear_regression/8_1_Least_squares_regression.html](https://kenndanielso.github.io/mlrefined/mlrefined_images/superlearn_images/Least_Squares.png){width="800"}

Left panel: difference between response and predicted response (residual). Right panel: squared difference between predicted response and response.

## Assumptions of linear regression

Any model that one uses has some assumptions about the data that allow the model to make good predictions. *Note that there are other types of assumptions if your modeling technique is focused on inference*.

Let's take a look at the assumptions needed for a sound model, and at what we can do when they are not upheld.

### Linearity of the response-predictor relationship

The linear regression model assumes that there is a straight line (linear) relationship between the predictors and the response. It doesn't require the relationship to be perfectly straight, but rather that, on average, the cloud of points has a linear shape. If that is not true, then our predictions are going to be less accurate.

To check for this relationship, we have to calculate the **residual**, which is the difference between the response value and the predicted response value (similar to one of the model performance metrics we examined last week). Then, we can make a **residual plot** of the predicted response vs. the residual. Ideally, this residual plot should have no pattern - some residuals above 0, some below 0, but no strong trend.

If there's a trend in the residual plot, that means there are non-linear associations between some of the predictors and the response.

```{python}
# residual: observed response minus predicted response
residual = y_train - y_train_predicted
@@ -79,17 +81,23 @@ plt.show()

We see there's a slight curve in our residual plot. We will look at ways to deal with this later in this lecture.

### No outliers

An **outlier** is an observation for which the response is far from the predicted response value (y-axis). An observation has high **leverage** if it has an unusual predictor value (x-axis). Outliers and high leverage observations may arise from incorrect measurements, among many other causes. When these observations cause significant changes to the regression model, they are called **influential**. They may contribute greatly to the Mean Squared Error, as observations far from the majority of the data will have very large squared residuals.

Some possible solutions:

- We can eyeball potential influential points during exploratory data analysis and see how the model changes if we remove them. To diagnose the cause, we may troubleshoot the instruments that generated the data in the first place.

- We can detect an influential point by computing the studentized residuals or Cook's distance and decide whether it makes sense to remove it (see the sketch after this list).

- We can use a different linear regression method, called Huber loss regression, that is more tolerant of outliers (also sketched below).

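As a rough sketch of how the last two ideas might look in code - this is an illustration, not the lecture's official solution, and it assumes the `X_train` and `y_train` objects from the earlier code - `statsmodels` can compute studentized residuals and Cook's distance for every observation:

```{python}
# Influence diagnostics with statsmodels (assumes X_train / y_train exist).
import numpy as np
import statsmodels.api as sm

ols_fit = sm.OLS(y_train, sm.add_constant(X_train)).fit()
influence = ols_fit.get_influence()

studentized = influence.resid_studentized_external  # studentized residuals
cooks_d = influence.cooks_distance[0]               # Cook's distance per observation

# One common rule of thumb: flag observations with Cook's distance > 4/n.
print(np.where(cooks_d > 4 / len(y_train))[0])
```

For Huber loss regression, scikit-learn provides `HuberRegressor`; a minimal sketch with the same assumed data:

```{python}
# Huber loss down-weights observations with large residuals instead of
# squaring them, so outliers pull the fitted line less.
from sklearn.linear_model import HuberRegressor

huber = HuberRegressor()  # epsilon (default 1.35) controls outlier tolerance
huber.fit(X_train, y_train)
y_train_predicted_huber = huber.predict(X_train)
```

These diagnostics only suggest candidates - whether an observation should actually be removed is still a judgment call.
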
### Predictors are not collinear

**Collinearity** is the situation when two or more predictors are closely linearly related to each other. If we put collinear predictors into our regression model, they provide redundant information to the model and can degrade predictive performance.

Some possible solutions:

- We can detect collinearity by looking at the correlation matrix between predictors, as sketched below. This works well for pairwise correlations.

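As a possible illustration (again assuming the predictor DataFrame `X_train` from the earlier code):

```{python}
# Pairwise Pearson correlations between predictors (assumes X_train exists).
import matplotlib.pyplot as plt

corr_matrix = X_train.corr()
print(corr_matrix)

# Visualize: cells near +1 or -1 flag potentially collinear pairs.
plt.matshow(corr_matrix)
plt.colorbar()
plt.show()
```
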
@@ -123,7 +131,7 @@ ax.set_xlim([10, 50])
plt.show()
```

### Number of predictors is less than the number of samples

Sometimes, in machine learning, we have more predictors than the number of samples. This is called a **high-dimensional** problem. Our regression method will not work here, and we need to find ways to reduce the number of predictors.

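A small synthetic illustration (made-up data, not the lecture's) of why this breaks down: with more predictors than samples, least squares can fit pure noise perfectly.

```{python}
# 10 samples, 50 predictors, response generated independently of X.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_wide = rng.normal(size=(10, 50))
y_wide = rng.normal(size=10)

wide_fit = LinearRegression().fit(X_wide, y_wide)
print(wide_fit.score(X_wide, y_wide))  # training R^2 is 1.0: memorization, not learning
```
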
@@ -198,7 +206,7 @@ $$
MeanBloodPressure = \beta_0 + \beta_1 \cdot BMI + \beta_2 \cdot Gender
$$
According to our model, the relationship between $BMI$ and $MeanBloodPressure$ is a straight line with slope $\beta_1$, and the additional predictor of $Gender$ will change our prediction by only a constant, $\beta_2$. Visually, that would look like two parallel lines, with $\beta_2$ dictating the distance between them. However, this plot suggests that our original model isn't quite right: the additional predictor of $Gender$ changes our prediction by more than a constant - it depends on $BMI$ as well.

When multiple predictors have a synergistic effect on the outcome, their effects occur jointly - this is called an **interaction**. To incorporate this into our model, we add an interaction term:

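$$
MeanBloodPressure = \beta_0 + \beta_1 \cdot BMI + \beta_2 \cdot Gender + \beta_3 \cdot (BMI \cdot Gender)
$$

As a sketch of how such a model could be fit in code (assuming a hypothetical DataFrame `df` holding the `MeanBloodPressure`, `BMI`, and `Gender` columns), `statsmodels` formula syntax expands `BMI * Gender` into both main effects plus their interaction:

```{python}
# BMI * Gender expands to BMI + Gender + BMI:Gender, where the last
# product term is the interaction (df is a hypothetical DataFrame).
import statsmodels.formula.api as smf

interaction_fit = smf.ols("MeanBloodPressure ~ BMI * Gender", data=df).fit()
print(interaction_fit.params)
```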