StatLearning/10-inference-withnotes.Rmd at main · hq286/StatLearning · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
# Inference using simulation methods


[This "chapter" is essentially the same as \@ref(inference), but with some of the notes I wrote in the document as we went.]

This section is kind of a grab-bag of computational techniques in support of statistics and inference. We'll introduce some simulation methods: randomization and the bootstrap. We'll use the `infer` package.


```{r, echo=FALSE, message=FALSE, warning=FALSE}
library(knitr)
opts_chunk$set(eval = TRUE, message = FALSE, error = TRUE)
```

## Reading in external data

So far, we have been working with data that came from within R packages. But, most data we are interested in doesn't come built into R! So, we need to read the data into our environment. The R package `readr` will help with many common file formats. (Another file-reading package is called `haven`, and helps you read in specialty data formats from other statistical packages like SAS and SPSS. There's also `readxl` for Excel sheets. If you use a particular kind of data, you can probably find a package to read it into R!)

```{r, eval=TRUE}
library(readr)
```

We are going to work with two datasets, one called `Physicians.csv` and the other called `policekillings.csv` (I don't have provenance information on these datasets, sorry!)

Download those two .csv files.

You can locate the files on [my GitHub](https://github.com/AmeliaMN/StatLearning/tree/main/data). Select "raw" to see the raw version of the file.

![](img/data_raw.png)

Then, right-click to download the file.

![](img/right-click.png)


Once you've downloaded those files, place them into the same folder as your RMarkdown file, or inside a sub-folder called something like `data`.


Let's read in `Physicians.csv` first.

```{r, eval = TRUE}
doctors <- read_csv("data/Physicians.csv")
```


As with other tidyverse functions, `read_csv` gives us a message, telling us how it parsed the data. We can change those defaults if we want to.

Right off the bat, we notice something strange-- the number of doctors is being read in as a character variable, meaning it is not numeric. We'll come back to that.


#### Renaming variables

The variable names in this dataset are hard to use,
```{r}
names(doctors)
```

In order to run something like `skim()`, we'd have to use backtics to write out the variable name.

```{r}
library(skimr)
doctors %>%
  skim(`Physicians per 100,000 Population`)
```

We can manually rename that variable,

```{r}
library(dplyr)
doctors %>%
  rename(Physicians = `Physicians per 100,000 Population`)
```
That's better!

Or, we could use a package to automatically fix all the names in a dataset.

```{r}
# install.packages("janitor")
library(janitor)

doctors %>%
  clean_names() %>%
  skim(physicians_per_100_000_population)
```

With either of these approaches, you would need to overwrite the original data with the renamed data in order to use it. For the sake of having a short name, let's use the `rename()` function.

```{r}
doctors <- doctors %>%
  rename(Physicians = `Physicians per 100,000 Population`)
```

#### Searching for strange issues in variables
But, we've still got a variable type issue. When we run summary statistics, R complains because there is something weird in the data. Let's try `count()`ing up the values in the data.

```{r}
doctors %>%
  count(Physicians) %>%
  arrange(desc(Physicians))
```

Now we see it! There's a strange NA value that R doesn't know how to deal with. We can actually add this into our data-reading-in code in `read_csv`.

```{r}
doctors <- read_csv("data/Physicians.csv",
                    na = c("", NA, "N/A"))
```

Of course, we then have to redo our renaming.
```{r}
doctors <- doctors %>%
  rename(Physicians = `Physicians per 100,000 Population`)
```

```{r}
doctors %>%
  summarize(mean = mean(Physicians, na.rm = TRUE))
```

1. Now, you try reading in the `policekillings.csv` dataset.

```{r, echo=FALSE}
policekillings <- read_csv("data/policekillings.csv")
```

### Joining data
In the last section, we talked about one-table verbs from `dplyr`. But, another task we often need to do is combine data from one table with data from another table.

The command that allows you to combine data frames is a "join", which corresponds to the database operation called a `JOIN`. We will focus on the `left_join()` here, but there are many other types of joins.

When two database tables (or data frames) are joined, the rows of one table get matched to the rows of the other. Computers are exceptionally better suited for this task than humans, but the computer needs to be told what the criteria is for matching up the rows. The variable(s) upon which the matching is based are called a **key**. Both tables must contain the key columns, but there are variety of ways in which the matching can be done.

Joins can be complicated, so you may want to look at a visual explanation:

- [visualization of database JOINs](http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html)
- [shiny app about R joins](https://beta.rstudioconnect.com/content/1867/)
- [data transformation cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf)

Let's say we want to join our `doctors` data and our `policekillings` data. We need a variable to "join on." In this case, we want to match states with states. Unfortunately, `doctors` has state names written out and `policekillings` just has the state abbreviations. We need to change one to match the other.

Luckily, R comes with a dataset on state names

```{r}
state.name
state.abb
```

We can use this to make a new variable in the `policekillings` dataset.

```{r}
policekillings <- policekillings %>%
  mutate(state_long = state.name[match(state, state.abb)])
skim(policekillings)
```

Now, we can do some joining. After all, if we want to understand the relationship between physicians and police killings, we have to have both variables in the same data frame.

Before we start, let's consider the dimensions of our two data frames.
```{r}
dim(policekillings)
dim(doctors)
```

They don't have the same number of rows, so we should think about how large we want our data to be at the end of our join. Do we want to keep all 53 rows in `doctors` and match elements from `policekillings`, filling with NA values where there aren't matches? Or, are we more interested in `policekillings` and we want to just match the 47 states with elements from `doctors`?

In databases, the default JOIN type is INNER JOIN. In this case only the rows that are present in *both* data frames get returned. In R, I find I most often use a `left_join`, which retains *all* of the records from the first data frame, regardless of whether there is a match in the second data frame.

Let's try a `left_join`.

```{r, error=TRUE}
library(dplyr)
joinedData <- left_join(policekillings, doctors)
```

This throws an error, because it is expecting there to be a variable with the same name in each data set. To get it to work, we need to specify which variables to join on.

```{r}
joinedData <- left_join(policekillings, doctors,
  by = c("state_long" = "State")
)
skim(joinedData)
```

Notice that this dataset only has 47 observations, because there were only 47 in `policekillings` (the "left" in our `left_join`). We could have reversed the order of the datasets to get something slightly different.

```{r}
joinedData <- left_join(doctors, policekillings,
  by = c("State" = "state_long")
)
skim(joinedData)
```
Now, the data has 51 observations! (We could have gotten the same result by running `right_join` with the data specified the same way as our first join.)

## Fitting a model

While that was a good example for joining data, we shouldn't use it for modeling because a major condition for linear regression is violated with the data set. (Which one?)

So, let's read in another dataset to make some new models.

```{r}
library(readr)
FirstYearGPA <- read_csv("data/FirstYearGPA.csv")
```

## Testing and training data

One computational strategy for **assess**ing a model is using training and testing data. We can use testing and training data in order to determine if our model is good because it works well on the particular data we fit it on, or if it can be generalized to a larger set of data.

The basic idea is to break your dataset into two parts: one for training the model (running `lm()`) and one for testing the model (running `predict()`).

```{r}
set.seed(42)
which_train <- sample(1:219, size = 131, replace = FALSE)

training <- FirstYearGPA %>%
  slice(which_train)
testing <- FirstYearGPA %>%
  slice(-which_train)
```

Notice that I had to "set the seed" to make this document reproducible. That essentially means I'm placing the beginning of the pseudo-random number generator in R at the same spot every time.

We can now run a model,

```{r}
m1 <- lm(GPA ~ HSGPA + HU + White, data = training)
summary(m1)
```

Notice that this model is slightly different than if we'd run the model on the full dataset,

```{r}
coef(lm(GPA ~ HSGPA + HU + White, data = FirstYearGPA))
```

We can use our trained model to make predictions on our testing data.

```{r}
testing <- testing %>%
  mutate(yhats = predict(m1, newdata = testing))
```

We might also want to have the residuals for these predictions, so we can compute those as well.

```{r}
testing <- testing %>%
  mutate(residuals = GPA - yhats)
```

Look at your data object in your environment. What has happened to it?

Once we have this information, we can compute all sorts of useful stuff with it. For example, we could find the "cross-validation correlation," the correlation between the predictions and the actual values.

```{r}
testing %>%
  summarize(cor = cor(GPA, yhats))
```

If we square this number, it's akin to an $R^2$ value,

```{r}
testing %>%
  summarize(cor = cor(GPA, yhats)) %>%
  mutate(R2 = cor^2)
```

Sometimes, we want to quantify the difference between the $R^2$ for the training data and the squared cross validation correlation.

```{r}
testing %>%
  summarize(cor = cor(GPA, yhats)) %>%
  mutate(R2 = cor^2, shrinkage = summary(m1)$r.squared - R2)
```

You want smaller shrinkage values.

We can also look at the residual plots for our testing data,

```{r}
library(ggplot2)
ggplot(testing, aes(x = yhats, y = residuals)) +
  geom_point() +
  geom_smooth(se = FALSE)
```

Those don't look too bad to me. We could look closer to see if the mean and standard deviation of the prediction errors look good.

```{r}
testing %>%
  summarize(mean(residuals), sd(residuals))
```

We'd like the mean to be close to 0, and the standard deviation to be close to the standard deviation of the error term from the fit to the training sample.

```{r}
sd(residuals(m1))
```

Those look pretty good to me!

## Inference based on distributional approximations

```{r}
m1 <- lm(GPA ~ HSGPA + HU + White, data = FirstYearGPA)
summary(m1)
```

If you've taken a statistics class before, you have likely seen inference based on distributional approximations. This is inference using known probability distributions like the Normal distribution, t-distribution, F-distribution, etc.

In the case of a slope coefficient, one way to make inference would be to create a confidence interval,

"what are some other reasonable values we could have observed?"

$$
\hat{\beta}_1 \pm t^* \cdot SE_{\hat{\beta}_1}
$$

"We are 95\% confident that for a 1-unit increase in x, we would see an [increase/decrease] in the response of between [bottom] and [top] units."

The other inferential task often done with distributional approximations is hypothesis testing. For regression, the hypotheses are

"is this number different from 0?"

$$
H_0: \beta_1 = 0 \\
H_A: \beta_1 \neq 0
$$

We can test the hypothesis by locating

$$
t = \frac{\hat{\beta_1}}{SE_{\hat{\beta_1}}}
$$

in a t-distribution with $df=n-k-1$. When you look at a regression summary, you see p-values based on $t$ values,

```{r}
summary(m1)
```

You can generate confidence intervals based on distributional approximations using `confint()`,

```{r}
confint(m1)
```

However, these distributional approximations only hold if the conditions are upheld. A more general solution is to use simulation-based inference.


## Simulation-based inference

Recall that the inference for linear regression relies on the assumptions that the following three conditions are met:

1. Linearity
2. Normality of Residuals
2. Equality of Variance

We know how to assess whether these conditions are met, and we have learned a few techniques for correcting them when they are not (e.g. transformations). Today, we will learn a few techniques based on simulation for making *non-parametric* inferences. Such inferences do not rely on stringent assumptions about the distribution on of the residuals.

## Randomization (Permutation) Tests

Goes with hypothesis tests. "Is this number different than 0?"

We'll consider how we could make inference about a relationship using simulation methods. To do this, we'll use a randomization test.

Before we begin, let's examine the relationship between two variables from the `FirstYearGPA` dataset graphically.

```{r}
ggplot(data = FirstYearGPA, aes(x = SATV, y = GPA)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)
```

There appears to be some linear association between these variables, but it is not particularly strong. We can quantify this relationship using the slope coefficient,

```{r}
m1 <- lm(GPA~SATV, data = FirstYearGPA)
summary(m1)
```

In this case the value of the slope is 0.00169, which is not large, but appears to be statistically significant. However, the validity of the hypothesis test shown in the default regression table relies on the conditions for simple linear regression being met. Let's assume that in this case the assumptions are **not** met. Could we still feel confident that the slope is non-zero?

If `GPA` and `SATV` were really related, then there is a real relationship binding the $i^{th}$ value of `GPA` to the $i^{th}$ value of `SATV`. In this case it would not make sense to link the $i^{th}$ value of `GPA` to some other value of `SATV`. But if the relationship between these two variables was in fact zero, then it wouldn't matter how we matched up the entries in the variables!

The basic idea of the permutation test is to shuffle the mapping between the two variables many times (i.e. sample *without replacement*), and examine the distribution of the resulting slope coefficient. If the actual value of the slope is a rare member of that distribution, then we have evidence that the null hypothesis of $\beta_1=0$ might not be true.

- Execute the following code several times. Get a feel for how the regression line can change, depending on the permutation of the data.
- Do you still believe that the slope is non-zero?

```{r}
ggplot(data = FirstYearGPA, aes(x = mosaic::shuffle(SATV),
                                y = GPA)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)
```

The procedure for the randomization test is simple. We simply shuffle the response variable and compute the resulting slope with the explanatory variable. But we do this many times, until we have a sense of the distribution of that slope coefficient. We then examine where the observed slope falls in that distribution.

```{r, cache = TRUE}
library(infer)
slopetest <- FirstYearGPA %>%
  specify(response = GPA, explanatory = SATV) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5000, type = "permute") %>%
  calculate(stat = "slope")
library(ggplot2)
ggplot(data = slopetest, aes(x = stat)) +
  geom_density() +
  geom_vline(xintercept = 0.0016986, color = "red")
```

This looks pretty weird, visually!

Of course, we can also explicitly find where in the distribution the observed slope lies.

```{r}
slopetest %>%
  get_p_value(obs_stat = 0.0016986, direction = "both")
```

`direction` tells R whether to compute the amount of data to the right or the left of the test statistic, or in both directions. In this case, our alternative hypothesis is $\beta_1 \neq 0$, so we want both sides of the distribution.

Finally, we can find a non-parametric 95% confidence interval for the correlation coefficient. The interpretation here is if our actual slope value fell in this interval, we would **not**  consider it statistically significant.

```{r}
get_ci(slopetest)
```

- Compare this confidence interval to the one returned by `confint()` on original model. Why are they different?
-  Perform the above procedure to test the slope value to predict `GPA` using `SATM`.

```{r, cache = TRUE}
slopetest <- FirstYearGPA %>%
  specify(response = GPA, explanatory = SATM) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5000, type = "permute") %>%
  calculate(stat = "slope")

lm(GPA~SATM, data=FirstYearGPA)

ggplot(data = slopetest, aes(x = stat)) +
  geom_density() +
  geom_vline(xintercept = 0.001202, color = "red")
```

- Perform the above procedure to test the slope value to predict `GPA` using `HSGPA`.

```{r}
library(car)
data(Salaries)
m1 <- lm(salary ~ discipline, data = Salaries)
summary(m1)
```

```{r, cache = TRUE}
slopetest <- Salaries %>%
  specify(response = salary, explanatory = sex) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5000, type = "permute") %>%
  calculate(stat = "slope")

lm(salary~sex, data=Salaries)

ggplot(data = slopetest, aes(x = stat)) +
  geom_density() +
  geom_vline(xintercept = 14008, color = "red")
```


#### The Bootstrap

A similar simulation technique is the bootstrap, however the bootstrap is most useful for answering the question "what are some other values we could have observed?" (confidence intervals).

In bootstrapping, we repeatedly sample cases from our data set, *with replacement*, and approximate the sampling distribution of a statistic. The key idea here is that even though the solution to the regression model is deterministic, the data itself is assumed to be randomly sampled, and so all of the estimates that we make in a regression model are **random**. If the data changes, then the estimates change as well. The bootstrap gives us a non-parametric understanding of the distribution of those estimates. Once again, the advantage to this method is that we can construct meaningful confidence intervals for, say, the slope of the regression line, without having to assume that the residuals are normally distributed.

We'll use this technique to create a confidence interval for the same slope coefficient we were studying in the randomization problem. Remember, here's what that relationship looks like:

```{r}
ggplot(data = FirstYearGPA, aes(x = SATV, y = GPA)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)
```

To create a bootstrap sample, we select rows from the data frame uniformly at random, but with replacement.

```{r}
# this needs the mosaic package again!
ggplot(data = mosaic::resample(FirstYearGPA),
       aes(x = SATV, y = GPA)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm, se = FALSE)
```

* Note the differences between this plot and the previous one. What do you notice?
* Run the preceding code several times. What is changing? What is staying the same?

##### Bootstrap distributions and confidence intervals

One advantage of the bootstrap is that it allows us to construct a sampling distribution for the slope coefficient that is not dependent upon the conditions for linear regression being met.

The original confidence intervals for our SLR model depend upon the conditions being true.

```{r}
summary(m1)
confint(m1)
```

Now let's create a bootstrap distribution for the regression coefficients.

```{r, cache=TRUE}
# I'm only doing 100 samples, but you should do more!
slopeboot <- FirstYearGPA %>%
  specify(response = GPA, explanatory = SATV) %>%
  generate(reps = 5000, type = "bootstrap") %>%
  calculate(stat = "slope")

ggplot(data = slopeboot, aes(x = stat)) +
  geom_histogram() +
  geom_vline(xintercept =  0.0016986, color = "red")
```

The bootstrap distribution will always be centered around the sample statistic from our real data, but shows us some other likely values for the coefficient (essentially, sampling error). One way to quantify this variability is to create a confidence interval.

There are several methods constructing confidence intervals from the bootstrap. My preferred method does not require that the bootstrap distribution be normal, but works best when it is roughly symmetric. In this case we simply use the percentiles of the bootstrap distribution to build confidence intervals. This method makes the most sense in the most cases.

```{r}
get_ci(slopeboot, level = 0.9)
visualize(slopeboot)
```


"I am 95% confident that for a 1-point increase in SATV, the college GPA will increase between 0.000944 and 0.00242." This interval does not cross 0, so we are pretty confident that there **is** a relationship between SATV and GPA.

- Create a bootstrap sample of at least 2000 and construct a confidence intervals using the bootstrap. How does this interval differ from the typical confidence interval?

- Construct bootstrapped confidence intervals for the $GPA \sim SATV$ and $GPA \sim SATM$ models.

```{r, cache=TRUE}
data("Cars2015", package = "Lock5Data")
m1 <- lm(UTurn~PageNum, data=Cars2015)
summary(m1)

slopetest <- Cars2015 %>%
  specify(response = UTurn, explanatory = PageNum) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5000, type = "permute") %>%
  calculate(stat = "slope")


ggplot(data = slopetest, aes(x = stat)) +
  geom_density() +
  geom_vline(xintercept = (-0.014043), color = "red")

slopeboot <- Cars2015 %>%
  specify(response = UTurn, explanatory = PageNum) %>%
  generate(reps = 5000, type = "bootstrap") %>%
  calculate(stat = "slope")

ggplot(data = slopeboot, aes(x = stat)) +
  geom_histogram()
get_ci(slopeboot, level = 0.99)
```


"I'm 99% confident that as we page through Consumer reports, the u-turn diameter for a car increases between 0.00145 feet and -0.0308 feet"


## Further Reading

- The Boostrapping and Randomization chapters from @IsmKim.
- @Hest2015
- @Efr1979

## Homework

Use the data from the `nycflights13` package to answer some questions about airline flight delays.

```{r}
library(nycflights13)
```

1. First, you will want to familiarize yourself with the datasets available in the package. There are multiple datasets:

```{r}
?airlines
?airports
?flights
?planes
?weather
```

You may need to make some data visualizations or summary statistics to better understand the data.

2. Determine which airline had the longest delays. Since we do not have airline codes memorized, we want to see the full name of the airlines, not the codes. In order to answer this question, you will need to join two datasets together. One dataset you will need is `flights`.

Use the following code skeleton to get started.
```{r}
library(dplyr)
library(tidyr)
flights %>%
  drop_na(arr_delay) %>%
  left_join(airlines, by = c("carrier" = "carrier")) %>% # join
  group_by(name) %>%
  summarize(meandelay = mean(arr_delay)) %>% # compute arrival delays
  arrange(desc(meandelay)) # arrange arrival delays
```

3. Considering only flights arriving in Baltimore (BWI), see which single variable is the best predictor for delay.

```{r}

```

4. Use bootstrapping to generate a confidence interval for the slope coefficient of your model.