
Commit b63982d: week 1 sketchup
1 parent (b059990)
5 files changed; Lines changed: 10073 additions & 3 deletions

File tree: 01-Problem-Setup.qmd (Lines changed: 72 additions & 3 deletions)
@@ -32,13 +32,82 @@ The way we formulate machine learning models is based on some fundamental concepts
In Machine Learning problems, we often take two non-overlapping samples from the population: the **Training Set** and the **Test Set**. We **train** our model using the Training Set, which gives us a function $f()$ that relates the predictors to the outcome. Then, for our main use cases:
1. **Prediction:** We use the trained model to predict the outcome using predictors from the Test Set and compare to the true value in the Test Set.
2. **Inference**: We examine the function $f()$'s trained values, which are called **parameters**. For instance, in $f(Age, BMI, Income, \ldots) = 20 + 3 \cdot Age - .2 \cdot BMI + .00015 \cdot Income$, the values $20$, $3$, $-.2$, and $.00015$ are the parameters. Because these parameters are derived from the Training Set, they are *estimated* quantities from a sample, similar to other summary statistics like the mean of a sample. Therefore, to say anything about the true population, we have to use statistical tools such as p-values and confidence intervals. A minimal splitting sketch follows this list.
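To make the split concrete, here is a minimal sketch using scikit-learn's `train_test_split`; the tiny DataFrame and its values are made up purely for illustration.

```{python}
import pandas as pd
from sklearn.model_selection import train_test_split

# A tiny, made-up dataset purely for illustration.
df = pd.DataFrame({
    "Age":     [25, 40, 31, 58, 46, 33, 62, 29],
    "BMI":     [22.1, 27.5, 24.3, 30.2, 26.8, 23.9, 29.4, 21.7],
    "Outcome": [110, 128, 115, 140, 132, 118, 138, 112],
})

# Randomly hold out 25% of the rows as the Test Set; the remaining 75%
# is the Training Set. The two sets are non-overlapping by construction.
train_set, test_set = train_test_split(df, test_size=0.25, random_state=0)
print(train_set.shape, test_set.shape)
```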
If the concepts of population, sample, estimation, p-value, and confidence interval are new to you, we recommend doing a bit of reading here \[todo\].
## How to evaluate and pick a model?
The little model we showcased above is an example of a **linear model**, but we will look at several other types of models in this course. To decide how to evaluate and pick a model, we will need to develop a framework for assessing models. Let's start with the use case of prediction.
### Prediction
Suppose we try to use the variable $BMI$ to predict $BloodPressure$ using a linear model.
```{python}
import pandas as pd
import seaborn as sns

nhanes = pd.read_csv("classroom_data/NHANES.csv")

# Approximate blood pressure as the mean arterial pressure: the diastolic
# average (BPDiaAve) plus one third of the pulse pressure (BPSysAve - BPDiaAve).
nhanes['BloodPressure'] = nhanes['BPDiaAve'] + (nhanes['BPSysAve'] - nhanes['BPDiaAve']) / 3

# Scatterplot of BMI against BloodPressure with a fitted regression line.
plot = sns.lmplot(x="BMI", y="BloodPressure", data=nhanes)
```
We examine how well our model performs in terms of prediction by seeing how close the model's predicted $BloodPressure$ is to the Training Set's true $BloodPressure$: the **Training Error**. We also take the model to the Test Set, predict $BloodPressure$ using predictors from the Test Set, and compare the predictions to the true $BloodPressure$ in the Test Set: the **Testing Error**. We want the model's Training Error to be adequately small, but what we really care about is the Testing Error, because it is a true test of how the model performs on unseen, new data, and it shows how generalizable the model is.
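As a sketch of how these two errors could be computed for the linear model above (assuming the `nhanes` DataFrame from the chunk above; the split proportion, random seed, and choice of mean squared error as the metric are all illustrative choices):

```{python}
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Keep the two columns we need and drop rows with missing values.
dat = nhanes[["BMI", "BloodPressure"]].dropna()
train, test = train_test_split(dat, test_size=0.25, random_state=0)

# Train the model on the Training Set only.
model = LinearRegression().fit(train[["BMI"]], train["BloodPressure"])

# Training Error: predictions vs. true values on the Training Set.
train_error = mean_squared_error(train["BloodPressure"], model.predict(train[["BMI"]]))

# Testing Error: the same comparison on the held-out Test Set.
test_error = mean_squared_error(test["BloodPressure"], model.predict(test[["BMI"]]))
print(train_error, test_error)
```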
Okay, let's see how it does on the Training Set:
\[graph here\]
And then on the Test Set:
\[graph here\]
We see that the Training Error is fairly high, and the Testing Error is even higher. This is an example of **Underfitting**, where our model fails to capture the complexity of the data in both the Training and Test Sets.
Let's return to the drawing board and fit a new type of model that is flexible enough to capture complicated patterns in the data. Let's see how it does on the Training Set:
\[graph here\]
And then on the Test Set:
\[graph here\]
We see that the Training Error is low, but the Testing Error is huge! This is an example of **Overfitting**, in which our model fits the shape of the Training Set so closely that it fails to generalize to the Test Set.
We want to find a model that is "just right": one that neither underfits nor overfits the data. Usually, as the model becomes more flexible, the Training Error keeps decreasing, while the Testing Error decreases for a while before increasing again. See below:
![Source: An Introduction to Statistical Learning, Ch. 2, by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, Jonathan Taylor.](images/testing_error-01.png)
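Here is a minimal sketch that reproduces this pattern on synthetic data, using polynomial degree as the measure of flexibility (the data-generating curve, noise level, and degree range are all illustrative assumptions):

```{python}
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: a smooth curve plus noise, split into Training and Test Sets.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 120)
y = np.sin(x) + rng.normal(0, 0.3, size=x.size)
x_train, y_train = x[:80, None], y[:80]
x_test, y_test = x[80:, None], y[80:]

degrees = range(1, 13)
train_errors, test_errors = [], []
for d in degrees:
    # A higher polynomial degree means a more flexible model.
    model = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    model.fit(x_train, y_train)
    train_errors.append(mean_squared_error(y_train, model.predict(x_train)))
    test_errors.append(mean_squared_error(y_test, model.predict(x_test)))

# Training Error keeps falling with flexibility; Testing Error falls, then rises.
plt.plot(degrees, train_errors, label="Training Error")
plt.plot(degrees, test_errors, label="Testing Error")
plt.xlabel("Flexibility (polynomial degree)")
plt.ylabel("Mean squared error")
plt.legend()
plt.show()
```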
Also see this interactive tutorial: [https://mlu-explain.github.io/bias-variance/](https://mlu-explain.github.io/bias-variance/)
### Inference
Let's consider how we would evaluate and choose models for Inference.
For models with a small number of predictors, there are some plots and metrics one can consider, such as the BIC.
For models with a large number of predictors, we will discuss model choice in more detail in Weeks 5 and 6.
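As a small preview of one such metric, here is a sketch comparing the BIC of two candidate linear models with `statsmodels` (assuming the `nhanes` data from above also has an `Age` column, as in the example function earlier):

```{python}
import statsmodels.formula.api as smf

# Fit two candidate models on the same data; a lower BIC is preferred.
fit_bmi = smf.ols("BloodPressure ~ BMI", data=nhanes).fit()
fit_bmi_age = smf.ols("BloodPressure ~ BMI + Age", data=nhanes).fit()
print(fit_bmi.bic, fit_bmi_age.bic)
```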
Besides how flexible a model is, another way to categorize machine learning models is how **interpretable** they are. The more interpretable a model is, the better one can describe how each variable acts as a predictor in the model, which makes the inference process easier.
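For example, a linear model is highly interpretable: each parameter can be read off directly, along with inferential quantities such as p-values and confidence intervals. A sketch, reusing the `fit_bmi_age` model fit in the previous chunk:

```{python}
# Each coefficient says how the predicted BloodPressure changes per
# one-unit increase in that predictor, holding the other predictors fixed.
print(fit_bmi_age.params)

# p-values and 95% confidence intervals for the same parameters.
print(fit_bmi_age.pvalues)
print(fit_bmi_age.conf_int())
```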
Below are some example models mapped onto these two dimensions. Our linear model sits near the "Least Squares" models.
![Source: An Introduction to Statistical Learning, Ch. 2, by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, Jonathan Taylor](images/flexibility_vs_interpretability.png){width="500"}
## The NumPy Package
### Subsetting
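A minimal sketch of common NumPy subsetting patterns (the array values are made up):

```{python}
import numpy as np

arr = np.array([10, 20, 30, 40, 50])
print(arr[1:4])        # slicing: positions 1 through 3
print(arr[[0, 2, 4]])  # fancy indexing: pick positions by list
print(arr[arr > 25])   # boolean mask: keep elements greater than 25
```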
### How to split the data for training and testing
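A minimal sketch of one way to split with NumPy, shuffling row indices and slicing (assuming the `nhanes` DataFrame from earlier; the 75/25 split is an arbitrary choice):

```{python}
import numpy as np

# Shuffle the row indices, then slice: the first 75% of the shuffled rows
# become the Training Set and the rest become the Test Set.
rng = np.random.default_rng(0)
indices = rng.permutation(len(nhanes))
cutoff = int(0.75 * len(nhanes))
train_rows = nhanes.iloc[indices[:cutoff]]
test_rows = nhanes.iloc[indices[cutoff:]]
print(len(train_rows), len(test_rows))
```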
## Linear Regression Preview?
## Appendix: Other terms
Parametric vs. Non-parametric
Bias-Variance trade-off
Supervised vs. Unsupervised
