fhdsl
diff --git a/‎01-Problem-Setup.qmd‎
Lines changed: 72 additions & 3 deletions b/‎01-Problem-Setup.qmd‎
@@ -32,13 +32,82 @@ The way we formulate machine learning model is based on some fundamental concept
 
 In Machine Learning problems, we often like to take two, non-overlapping samples from the population: the **Training Set**, and the **Test Set**. We **train** our model using the Training Set, which gives us a function $f()$ that relates the predictors to the outcome. Then, for our main use cases:
 
-1.  **Prediction:** We use the trained model to predict the outcome using predictors from the Test Set and compare the predicted outcome to the true value in the Test Set.
+1.  **Prediction:** We use the trained model to predict the outcome using predictors from the Test Set and compare the predictions to the true values in the Test Set.
 2.  **Inference**: We examine the function $f()$'s trained values, which are called **parameters**. For instance, in $f(Age,BMI,Income,…)=20 + 3 \cdot Age - .2 \cdot BMI + .00015 \cdot Income$, the values $20$, $3$, $-.2$, and $.00015$ are the parameters. Because these parameters are derived from the Training Set, they are *estimated* quantities from a sample, similar to other summary statistics like the mean of a sample. Therefore, to say anything about the true population, we have to use statistical tools such as p-values and confidence intervals.
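
As a minimal sketch of what "parameters" means here, the example function above can be written out in Python. The constant names and the input values for the hypothetical person are assumptions made for illustration; the parameter values are the ones from the example, not estimates from real data.

```python
# Parameters of the example linear model from the text
# (illustrative values, not estimated from real data).
INTERCEPT = 20
COEF_AGE = 3
COEF_BMI = -0.2
COEF_INCOME = 0.00015


def f(age, bmi, income):
    """Return the predicted outcome for one person under the example model."""
    return INTERCEPT + COEF_AGE * age + COEF_BMI * bmi + COEF_INCOME * income


# Hypothetical person (assumed inputs): 40 years old, BMI of 25, income of 50,000.
prediction = f(age=40, bmi=25, income=50_000)
print(prediction)
```

For Inference, the quantities of interest are the four constants at the top; for Prediction, the quantity of interest is the value returned by `f`.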
 
 If the concepts of population, sample, estimation, p-value, and confidence interval are new to you, we recommend doing a bit of reading here \[todo\].
 
 ## How to evaluate and pick a model?
 
-The little example model we showcased above is an example of a **linear model**, but we will look at several types of models in this course. In order to decide how to evaluate and pick a model, we will need to develop a framework to assess a model.
+The little example model we showcased above is an example of a **linear model**, but we will look at several types of models in this course. In order to decide how to evaluate and pick a model, we will need to develop a framework to assess a model. Let's start with the use case of prediction.
 
-## Preview: linear regression
+### Prediction
+
+Suppose we try to use the variable $BMI$ to predict $BloodPressure$ using a linear model.
+
+```{python}
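+import pandas as pd
+import seaborn as sns
+
+# Load the NHANES survey data.
+nhanes = pd.read_csv("classroom_data/NHANES.csv")
+# Combine the diastolic and systolic averages into one value (the mean arterial pressure formula).
+nhanes['BloodPressure'] = nhanes['BPDiaAve'] + (nhanes['BPSysAve'] - nhanes['BPDiaAve']) / 3
+
+# Scatter plot of BMI against BloodPressure with a fitted linear regression line.
+plot = sns.lmplot(x="BMI", y="BloodPressure", data=nhanes)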
+```
+
+We examine how well our model performs in terms of prediction by seeing how close our model's predicted $BloodPressure$ is to the Training Set's true $BloodPressure$: the **Training Error**. We also take the model to the Test Set, predict $BloodPressure$ using predictors from the Test Set, and compare the predictions to the true $BloodPressure$ in the Test Set: the **Testing Error**. We want the model's Training Error to be adequately small, but what we really care about is the Testing Error, because it is a true test of how the model performs on unseen, new data, and shows us how generalizable the model is.
+
+Okay, let's see how it does on the Training Set:
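
A minimal sketch of how these two errors could be computed is shown below. It makes several assumptions that are not from the text: the data are synthetic stand-ins rather than NHANES, the split is 80/20, and the error is measured as the mean squared error (MSE), which is one common choice among several.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): the outcome is a noisy linear function of x.
n = 200
x = rng.uniform(15, 45, size=n)
y = 60 + 0.8 * x + rng.normal(0, 8, size=n)

# Randomly split the rows into non-overlapping Training (80%) and Test (20%) Sets.
shuffled = rng.permutation(n)
train_idx, test_idx = shuffled[:160], shuffled[160:]

# Fit a straight line using the Training Set only.
# np.polyfit returns coefficients from the highest degree down: [slope, intercept].
slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)


def mean_squared_error(x_values, y_true):
    """Average squared gap between the true outcomes and the line's predictions."""
    y_pred = intercept + slope * x_values
    return np.mean((y_true - y_pred) ** 2)


training_error = mean_squared_error(x[train_idx], y[train_idx])
testing_error = mean_squared_error(x[test_idx], y[test_idx])
print(f"Training Error (MSE): {training_error:.2f}")
print(f"Testing Error (MSE): {testing_error:.2f}")
```

The model never sees the Test Set rows while it is being fit, which is what makes the Testing Error a fair check on unseen data.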
+
+\[graph here\]
+
+And then on the Test Set:
+
+\[graph here\]
+
+We see that the Training Error is fairly high, and the Testing Error is even higher. This is an example of **Underfitting**, where our model fails to capture the complexity of the data in both the Training and Test Sets.
+
+Let's return to the drawing board and fit a new type of model that has more flexibility around complicated patterns of data. Let's see how it does on the Training Set:
+
+\[graph here\]
+
+And then on the Test Set:
+
+\[graph here\]
+
+We see that the Training Error is low, but the Testing Error is huge! This is an example of **Overfitting**, in which our model fits the shape of the Training Set so closely that it fails to generalize to the Test Set.
+
+We want to find a model that is "just right": one that neither underfits nor overfits the data. Usually, as the model becomes more flexible, the Training Error keeps decreasing, while the Testing Error decreases at first and then starts to increase. See below:
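
The pattern described above can be explored with a small sketch. Everything here is an assumption made for illustration: the data are synthetic (a noisy sine curve), flexibility is represented by the degree of a polynomial fit, and the error is measured as mean squared error.

```python
import numpy as np

rng = np.random.default_rng(1)


def make_data(n):
    """Draw n synthetic points from a noisy sine curve (assumed data-generating process)."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + rng.normal(0, 0.3, size=n)
    return x, y


x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

# A higher polynomial degree means a more flexible model.
degrees = [1, 3, 6, 9]
train_errors, test_errors = [], []
for degree in degrees:
    # Fit on the Training Set only, then score on both sets.
    coefs = np.polyfit(x_train, y_train, deg=degree)
    train_errors.append(np.mean((y_train - np.polyval(coefs, x_train)) ** 2))
    test_errors.append(np.mean((y_test - np.polyval(coefs, x_test)) ** 2))
    print(f"degree={degree}: training MSE={train_errors[-1]:.3f}, "
          f"testing MSE={test_errors[-1]:.3f}")
```

Because each higher-degree polynomial can reproduce any lower-degree one, the Training Error cannot go up as the degree increases. The Testing Error has no such guarantee, which is why it is the one to watch when choosing how flexible the model should be.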
+
+![Source: An Introduction to Statistical Learning, Ch. 2, by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, Jonathan Taylor.](images/testing_error-01.png)
+
+Also see this interactive tutorial: [https://mlu-explain.github.io/bias-variance/](https://mlu-explain.github.io/bias-variance/)
+
+### Inference
+
+Let's consider how we would evaluate and choose models for Inference.
+
+For models with a small number of predictors, there are some plots and metrics one would consider, such as the Bayesian Information Criterion (BIC).
+
+For models with a large number of predictors, we will go into more detail in weeks 5 & 6.
+
+Besides how flexible a model is, another way to categorize machine learning models is by how **interpretable** they are. The more interpretable a model is, the better one can describe how each predictor affects the model's outcome. That makes the inference process easier.
+
+Below are some example models mapped along these two dimensions. The linear model sits very close to the "Least Squares" model in the figure.
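
As a rough sketch of how BIC can be used to compare models, the example below fits two linear models by least squares and scores each one. The assumptions are: the data are synthetic, the errors are treated as Gaussian, and BIC is computed as $n \ln(RSS/n) + k \ln(n)$, which drops an additive constant that is the same for every model fit to the same data. The function names `gaussian_bic` and `fit_rss` are hypothetical helpers, not part of any library.

```python
import numpy as np


def gaussian_bic(rss, n, k):
    """BIC for a Gaussian linear model, up to a constant shared by all models.

    rss: residual sum of squares, n: number of observations,
    k: number of estimated coefficients (including the intercept).
    """
    return n * np.log(rss / n) + k * np.log(n)


def fit_rss(X, y):
    """Fit least squares with an intercept; return (RSS, number of coefficients)."""
    design = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coefs
    return float(residuals @ residuals), design.shape[1]


rng = np.random.default_rng(2)
n = 150
useful = rng.normal(size=(n, 1))            # one predictor that truly matters
noise_predictors = rng.normal(size=(n, 5))  # five predictors unrelated to y
y = 2.0 + 1.5 * useful[:, 0] + rng.normal(0, 1.0, size=n)

rss_small, k_small = fit_rss(useful, y)
rss_large, k_large = fit_rss(np.column_stack([useful, noise_predictors]), y)

print("BIC, 1 predictor: ", gaussian_bic(rss_small, n, k_small))
print("BIC, 6 predictors:", gaussian_bic(rss_large, n, k_large))
```

A lower BIC is preferred. The $k \ln(n)$ term penalizes each extra coefficient, so adding predictors only pays off if they reduce the residual sum of squares by enough to offset the penalty.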
+
+![Source: An Introduction to Statistical Learning, Ch. 2, by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, Jonathan Taylor](images/flexibility_vs_interpretability.png){width="500"}
+
+## The NumPy Package
+
+### Subsetting
+
+### How to split the data for training and testing
+
+## Linear Regression Preview?
+
+## Appendix: Other terms 
+
+Parametric vs. Non-parametric
+
+Bias-Variance trade-off
+
+Supervised vs. Unsupervised