Where $f(Age, BMI)$ is a machine learning model that takes in the variables $Age$ and $BMI$.
A machine learning model, such as the one described above, has *two main uses:*
1. **Classification and Prediction (Focus of this course):** How accurately can we classify or predict the outcome?
    - Classification: Given a new person's $Age$ and $BMI$, classify whether the person has $Hypertension$. The outcome is a yes/no classification.
    - Prediction: Given a person's $Age$ and $BMI$, predict the person's $BloodPressure$ value. The outcome is a continuous value.
2. **Inference (Secondary in this course):** Which predictors are associated with the response, and how strong is the association?
    - Classification model example: What is the odds ratio of $Age$ on $Hypertension$? If the odds ratio of $Age$ on $Hypertension$ is 2, then an increase of 1 in $Age$ multiplies the odds of $Hypertension$ by 2.
    - Prediction model example: Suppose the model is $BloodPressure = f(Age, BMI) = 20 + 3 \cdot Age - .2 \cdot BMI$. Each variable has a relationship to the outcome: an increase of $Age$ by 1 leads to an increase of $BloodPressure$ by 3. This measures the strength of association between a variable and the outcome.
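To make the inference reading concrete, here is a minimal sketch of the hypothetical linear model above (the coefficients are the illustrative ones from the text, not fitted values):

```python
# Hypothetical model from the text: BloodPressure = 20 + 3*Age - 0.2*BMI
def f(age, bmi):
    return 20 + 3 * age - 0.2 * bmi

# Increasing Age by 1 (holding BMI fixed) raises the prediction by exactly 3,
# the coefficient on Age.
base = f(50, 25)
plus_one = f(51, 25)
print(base, plus_one - base)  # 165.0 3.0
```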
Okay, great, it looks like when someone's BMI is higher, then it is more likely that they have $Hypertension$.
Now, let's build the model $Hypertension = f(BMI)$ to make a prediction of $Hypertension$ given $BMI$.
```{python}
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
```
```{python}
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y, prediction_cut)
print("Confusion Matrix : \n", cm)
```
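To see exactly what a 2x2 confusion matrix counts, here is a toy, hand-computed version with made-up labels (rows index the true class, columns the predicted class):

```python
# Hypothetical true labels and model predictions, for illustration only.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

cm = [[0, 0], [0, 0]]  # cm[t][p] counts cases with true class t predicted as p
for t, p in zip(y_true, y_pred):
    cm[t][p] += 1

print(cm)  # [[1, 1], [1, 2]]: correct cases on the diagonal, mistakes off it
```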
### Summary of Example
So what have we done so far? / Preview of what is to come:
Let's review the List data structure. For any data structure, we ask the following:
- What can it do (in terms of functions)?
And if it "makes sense" to us, then it is a well-designed data structure.
Formally, a data structure in Python (also known as an **Object**) may contain the following:
- **Value** that holds the essential data for the data structure.
- **Attributes** that hold subsets of or additional data for the data structure.
- Functions called **Methods** that belong to the data structure and *have to* take the variable they are called on as an input.
Let's see how this applies to the **List**:
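For example, on a small list (the values here are arbitrary), the Value is the sequence itself and the Methods are the functions attached to it:

```python
# The Value of a list is the sequence of elements it holds.
my_list = [3, 1, 4, 1, 5]

# Methods are functions attached to the list that operate on my_list itself.
my_list.append(9)   # adds an element to the end
my_list.sort()      # sorts in place
print(my_list)      # [1, 1, 3, 4, 5, 9]
```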
How about **Dataframe**?
- **Methods** that can be used on the object: [df.merge(other_df, on="column_name")](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)
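A minimal sketch of that Method in action, using two made-up mini-tables keyed by a shared `id` column:

```python
import pandas as pd

# Hypothetical tables, for illustration only.
patients = pd.DataFrame({"id": [1, 2], "BMI": [22.0, 31.5]})
labs = pd.DataFrame({"id": [1, 2], "BloodPressure": [118, 135]})

# merge() is a Method: it is called on one DataFrame and takes another as input.
merged = patients.merge(labs, on="id")
print(list(merged.columns))  # ['id', 'BMI', 'BloodPressure']
```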
Feel free to look at the [cheatsheet on data structures from Intro to Python](https://docs.google.com/document/d/1IHD9_Edg3mbMY9lilAF0QWVjaKBK0eE0-hbo0fSywAA/edit?tab=t.0#heading=h.2bko76vfr8r6) to refresh your memory.
### NumPy
A new Data Structure we will work with in this course is NumPy's ndarray ("n-dimensional array") data structure, commonly referred to as the "**NumPy Array**". It is very similar to a Dataframe, but has the following characteristics for building machine learning models:
- All elements are homogeneous and numeric.
- There are no column or row names.
- Mathematical operations are optimized to be fast.
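The homogeneity point can be checked directly: if you mix ints and floats, NumPy upcasts everything to a single numeric type.

```python
import numpy as np

mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64: the ints were upcast so all elements share one type
```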
So, let's see some examples:
- **Value**: the 2-dimensional numerical table. It actually can be any dimension, but we will just work with 1-dimensional (similar to a List) and 2-dimensional.
- **Attributes** that store additional values:
    - `data.shape` gives the shape of the NumPy Array. `data.ndim` will tell you the number of dimensions of the NumPy Array.
- Two-dimensional subsetting, similar to lists: `data[:5, :3]` subsets the first 5 rows and first three columns. `data[:5, [0, 2, 3]]` subsets the first 5 rows and the 1st, 3rd, and 4th columns.
- **Methods** that can be used on the object:
    - `data.sum(axis=0)` sums down the rows, giving one sum per column; `data.sum(axis=1)` sums across the columns, giving one sum per row.
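Putting the Value, Attributes, Methods, and subsetting together on a small hypothetical array:

```python
import numpy as np

data = np.array([[1, 2, 3],
                 [4, 5, 6]])   # the Value: a 2-dimensional numerical table

print(data.shape)        # (2, 3) -- an Attribute
print(data.ndim)         # 2      -- an Attribute
print(data[:1, :2])      # [[1 2]] -- two-dimensional subsetting
print(data.sum(axis=0))  # [5 7 9] -- a Method: one sum per column
print(data.sum(axis=1))  # [ 6 15] -- a Method: one sum per row
```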
For this course, we often load a dataset in the Pandas Dataframe format, and then once we pick our outcome and predictors, we transform the Dataframe into a NumPy Array, as in this line of code we saw earlier: `y, X = model_matrix("Hypertension ~ BMI", nhanes)`. We specify our outcome, predictor, and Dataframe for the `model_matrix()` function, and the outputs are two NumPy Arrays, one for the outcome and one for the predictors. Any downstream Machine Learning modeling works off the NumPy Arrays `y` and `X`.
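If you want to see the Dataframe-to-NumPy step by hand, here is a minimal sketch with a made-up mini-dataset standing in for `nhanes` (this shows only the column-selection-and-conversion idea, not the actual `model_matrix()` internals):

```python
import pandas as pd

# Hypothetical stand-in for the nhanes Dataframe.
df = pd.DataFrame({"Hypertension": [0, 1, 1], "BMI": [22.0, 31.5, 28.9]})

# Pick the outcome and predictor columns, then convert to NumPy Arrays.
y = df["Hypertension"].to_numpy()
X = df[["BMI"]].to_numpy()
print(y.shape, X.shape)  # (3,) (3, 1)
```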
A longer introduction can be found in [NumPy's tutorial guide](https://numpy.org/devdocs/user/absolute_beginners.html).
### What is this data structure?
If you are not sure what your variable's data structure is, use the `type()` function, such as `type(mystery_data)`, and it will tell you.
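For example:

```python
import numpy as np

print(type([1, 2, 3]))            # <class 'list'>
print(type(np.array([1, 2, 3])))  # <class 'numpy.ndarray'>
```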
0 commit comments