Skip to content

Commit bb32e0d

Browse files
committed
brms-rstan: rstan moved as suggested.
1 parent 194ee75 commit bb32e0d

2 files changed

Lines changed: 265 additions & 260 deletions

File tree

content/tutorials/r_brms/brms_eng/workshop_1_mcmc_en_brms_eng.Rmd

Lines changed: 133 additions & 131 deletions
Original file line numberDiff line numberDiff line change
@@ -1063,8 +1063,135 @@ comp_waic %>%
10631063
Both based on the PPC and the comparisons with different model selection criteria, we can conclude that the second Poisson model with random intercepts fits the data best. In principle, we could have expected this based on our own intuition and the design of the study, i.e. the use of the Poisson distribution to model numbers and the use of random intercepts to control for a hierarchical design (habitats nested within sites).
10641064

10651065

1066-
## Deep Dive: `rstan`
1067-
### Stan: What? Why?!
1066+
1067+
# Final model results
1068+
1069+
When we look at the model fit object, we see results that are similar to results we see when we fit a frequentist model. On the one hand we get an estimate of all parameters with their uncertainty, but on the other hand we see that this is clearly the output of a Bayesian model. We get information about the parameters we used for the MCMC algorithm, we get a 95% credible interval (CI) instead of a confidence interval and we also get the $\hat{R}$ value for each parameter as discussed earlier.
1070+
1071+
```{r results-fit-poisson}
1072+
# Look at the fit object of the Poisson model with random effects
1073+
fit_poisson2
1074+
```
1075+
1076+
A useful package for visualising the results of our final model is the [tidybayes](https://mjskay.github.io/tidybayes/articles/tidy-brms.html) package. Through this package, you can work with the posterior distributions as you would work with any dataset through the **tidyverse** package.
1077+
1078+
With the function `gather_draws()` you can take a certain number of samples from the posterior distributions of certain parameters and convert them into a long format table. You usually do not want to select all posterior samples because there are sometimes unnecessarily many. By specifying a 'seed' you ensure that these are the same samples every time you run the script again. You can then calculate certain summary statistics via the classic **dplyr** functions.
1079+
1080+
```{r results-fit-poisson-2}
1081+
fit_poisson2 %>%
1082+
# gather 1000 posterior samples for 2 parameters in long format
1083+
gather_draws(b_Intercept, b_habitatForest, ndraws = 1000, seed = 123) %>%
1084+
# calculate summary statistics for each variable
1085+
group_by(.variable) %>%
1086+
summarise(min = min(.value),
1087+
q_05 = quantile(.value, probs = 0.05),
1088+
q_20 = quantile(.value, probs = 0.20),
1089+
mean = mean(.value),
1090+
median = median(.value),
1091+
q_80 = quantile(.value, probs = 0.80),
1092+
q_95 = quantile(.value, probs = 0.95),
1093+
max = max(.value))
1094+
```
1095+
1096+
Useful functions of the **tidybayes** package are also `median_qi()`, `mean_qi()` ... after `gather_draws()` which you can use instead of `group_by()` and `summarise()` .
1097+
1098+
We would now like to visualise the estimated number of species per habitat type with associated uncertainty. With the function `spread_draws()` you can take a certain number of samples from the posterior distribution and convert them into a wide format table. The average number of species in bogs according to our model is $\exp(\beta_0)$ and in forests $\exp(\beta_0+\beta_1)$. We show the posterior distributions with the posterior median and 60 and 90% credible intervals.
1099+
1100+
```{r resultats-fit-poisson-3}
1101+
fit_poisson2 %>%
1102+
# spread 1000 posterior samples for 2 parameters in wide format
1103+
spread_draws(b_Intercept, b_habitatForest, ndraws = 1000, seed = 123) %>%
1104+
# calculate average numbers and convert to long format for visualisation
1105+
mutate(bog = exp(b_Intercept),
1106+
forest = exp(b_Intercept + b_habitatForest)) %>%
1107+
pivot_longer(cols = c("bog", "forest"), names_to = "habitat",
1108+
values_to = "sp_rich") %>%
1109+
# visualise via ggplot()
1110+
ggplot(aes(y = sp_rich, x = habitat)) +
1111+
stat_eye(point_interval = "median_qi", .width = c(0.6, 0.9)) +
1112+
scale_y_continuous(limits = c(0, NA))
1113+
```
1114+
1115+
In addition to `stat_eye()` you will find [here](https://mjskay.github.io/tidybayes/articles/tidy-brms.html#other-visualizations-of-distributions-stat_slabinterval) some nice ways to visualise posterior distributions .
1116+
1117+
We see a clear difference in the number of species between the two habitats. Is there a significant difference between the number of species in bogs and forests? We test the hypothesis that numbers are equal in bogs and forests.
1118+
1119+
$$
1120+
\exp(\beta_0) = \exp(\beta_0+\beta_1)\\
1121+
\Rightarrow \beta_0 = \beta_0 + \beta_1\\
1122+
\Rightarrow \beta_1 = 0\\
1123+
$$
1124+
1125+
This can easily be done via the `hypothesis()` function of the **brms** package.
1126+
The argument `alpha` specifies the size of the credible interval.
1127+
This allows hypothesis testing in a similar way to the frequentist null hypothesis testing framework.
1128+
1129+
```{r resultats-hypothesis-test}
1130+
# Test hypothesis difference between habitats
1131+
hyp <- hypothesis(fit_poisson2, "habitatForest = 0", alpha = 0.1)
1132+
hyp
1133+
```
1134+
1135+
```{r resultats-hypothesis-test-vis}
1136+
# Plot posterior distribution hypothesis
1137+
plot(hyp)
1138+
```
1139+
1140+
We can conclude that there is a significant difference since 0 is not included in the 90% credible interval.
1141+
1142+
Finally, we visualise the random effects of the sites. We sort them from high to low species richness.
1143+
1144+
```{r resultats-visualise-random-effects}
1145+
# Take the mean of SD of random effects
1146+
# to add to figure later
1147+
sd_mean <- fit_poisson2 %>%
1148+
spread_draws(sd_site__Intercept, ndraws = 1000, seed = 123) %>%
1149+
summarise(mean_sd = mean(sd_site__Intercept)) %>%
1150+
pull()
1151+
1152+
# Take random effects and plot
1153+
fit_poisson2 %>%
1154+
spread_draws(r_site[site,], ndraws = 1000, seed = 123) %>%
1155+
ungroup() %>%
1156+
mutate(site = reorder(site, r_site)) %>%
1157+
ggplot(aes(x = r_site, y = site)) +
1158+
geom_vline(xintercept = 0, color = "darkgrey", linewidth = 1) +
1159+
geom_vline(xintercept = c(sd_mean * qnorm(0.05), sd_mean * qnorm(0.95)),
1160+
color = "darkgrey", linetype = 2) +
1161+
stat_halfeye(point_interval = "median_qi", .width = 0.9, size = 2/3,
1162+
fill = "cornflowerblue")
1163+
```
1164+
1165+
1166+
# Comparison with frequentist statistics
1167+
1168+
Let's go back to our very first model where we used the Normal distribution. This was equivalent to a linear regression with categorical variable. A linear regression with categorical variable is also called ANOVA and if there are only two groups, an ANOVA is equivalent to a t-test. We can therefore take the opportunity to compare the results of our first model (a Bayesian model) with the results of a classical (frequentist) t-test.
1169+
1170+
```{r compare-frequentist}
1171+
# Extract summary statistics from the Bayesian model
1172+
sum_fit_normal1 <- summary(fit_normal1, prob = 0.9)
1173+
diff_bog1 <- sum_fit_normal1$fixed$Estimate[2]
1174+
ll_diff_bog1 <- sum_fit_normal1$fixed$`l-90% CI`[2]
1175+
ul_diff_bog1 <- sum_fit_normal1$fixed$`u-90% CI`[2]
1176+
1177+
sum_fit_normal1
1178+
```
1179+
1180+
```{r compare-frequentist-t-test}
1181+
# Perform t-test and extract summary statistics
1182+
t_test_normal1 <- t.test(sp_rich ~ habitat, data = ants_df, conf.level = 0.9)
1183+
diff_bog2 <- t_test_normal1$estimate[2] - t_test_normal1$estimate[1]
1184+
ll_diff_bog2 <- -t_test_normal1$conf.int[2]
1185+
ul_diff_bog2 <- -t_test_normal1$conf.int[1]
1186+
1187+
t_test_normal1
1188+
```
1189+
1190+
We see that this indeed produces almost exactly the same results. Our Bayesian model estimates that on average `r round(diff_bog1, 3)` more ant species occur in forests than in bogs (90% credible interval: `r round(ll_diff_bog1, 3)` to `r round(ul_diff_bog1, 3)`). The t-test estimates that on average `r round(diff_bog2, 3)` more ant species occur in forests than in bogs (90% confidence interval: `r round(ll_diff_bog2, 3)` to `r round(ul_diff_bog2, 3)`).
1191+
1192+
1193+
# Deep Dive: `rstan`
1194+
## Stan: What? Why?!
10681195
The `brms` package is a convenience wrapper for the `rstan` package, which in turn ports `stan` functionality to R.
10691196
Stan is a modeling framework written in the `C` programming language, which implements many probabilistic ("Bayesian") modeling tools.
10701197
More info can be found on [the Stan website](https://mc-stan.org).
@@ -1074,7 +1201,7 @@ The advantage of `brms` is usability: many functions work out-of-the-box, with r
10741201
However, the relative ease-of-use comes at the cost of flexibility, and do some degree, readability.
10751202

10761203
In contrast, Stan and `rstan` lean more to the mathematical formulation of models.
1077-
Every aspect of the model has to be explicitly set, which can be an advantage (e.g. if you face non-standard use cases), or disadvantage (e.g. if you secify models in non-optional ways).
1204+
Every aspect of the model has to be explicitly set, which can be an advantage (e.g. if you face non-standard use cases), or disadvantage (e.g. if you specify models in non-optimal ways).
10781205

10791206

10801207
To briefly give an impression, we will build the same models as above, using the Stan framework.
@@ -1086,7 +1213,7 @@ conflicted::conflicts_prefer(rstan::extract)
10861213
conflicted::conflicts_prefer(brms::loo)
10871214
```
10881215

1089-
### Model Definition
1216+
## Model Definition
10901217
RMarkdown can handle `stan` code chunks, though more general model definition is outsourced to a separate "*.stan" file.
10911218
Alternatively, you can define your model in a big text block, as shown below.
10921219
The simple poisson model resembles [one of the `stan`-dard examples](https://mc-stan.org/docs/stan-users-guide/posterior-prediction.html#posterior-prediction-for-regressions), which you can refer to for all further details and more.
@@ -1135,7 +1262,7 @@ stan_poisson_model <- stan_model(
11351262
```
11361263

11371264

1138-
### Sampling
1265+
## Sampling
11391266

11401267
Sampling does pretty much the same as above, since at the core, `brms` is just `stan`.
11411268

@@ -1172,7 +1299,7 @@ In other cases, it might pay off.
11721299
Know that Stan is there for you, do not hesitate to turn to its extensive documentation, and do not fear to give it a try!
11731300

11741301

1175-
### Homework: Hierarchical Model
1302+
## Homework: Hierarchical Model
11761303
To take your modeling skills even further, you may implement and sample the "random intercept" model.
11771304
In "Bayesian" terms, the [general terminology is "hierarchical" model](https://mc-stan.org/docs/stan-users-guide/regression.html#hierarchical-regression).
11781305

@@ -1237,131 +1364,6 @@ stan_poisson_fit
12371364
With Stan, po(i)ssibilities are almost endless - don't get lost in model building!
12381365

12391366

1240-
# Final model results
1241-
1242-
When we look at the model fit object, we see results that are similar to results we see when we fit a frequentist model. On the one hand we get an estimate of all parameters with their uncertainty, but on the other hand we see that this is clearly the output of a Bayesian model. We get information about the parameters we used for the MCMC algorithm, we get a 95% credible interval (CI) instead of a confidence interval and we also get the $\hat{R}$ value for each parameter as discussed earlier.
1243-
1244-
```{r results-fit-poisson}
1245-
# Look at the fit object of the Poisson model with random effects
1246-
fit_poisson2
1247-
```
1248-
1249-
A useful package for visualising the results of our final model is the [tidybayes](https://mjskay.github.io/tidybayes/articles/tidy-brms.html) package. Through this package, you can work with the posterior distributions as you would work with any dataset through the **tidyverse** package.
1250-
1251-
With the function `gather_draws()` you can take a certain number of samples from the posterior distributions of certain parameters and convert them into a long format table. You usually do not want to select all posterior samples because there are sometimes unnecessarily many. By specifying a 'seed' you ensure that these are the same samples every time you run the script again. You can then calculate certain summary statistics via the classic **dplyr** functions.
1252-
1253-
```{r results-fit-poisson-2}
1254-
fit_poisson2 %>%
1255-
# gather 1000 posterior samples for 2 parameters in long format
1256-
gather_draws(b_Intercept, b_habitatForest, ndraws = 1000, seed = 123) %>%
1257-
# calculate summary statistics for each variable
1258-
group_by(.variable) %>%
1259-
summarise(min = min(.value),
1260-
q_05 = quantile(.value, probs = 0.05),
1261-
q_20 = quantile(.value, probs = 0.20),
1262-
mean = mean(.value),
1263-
median = median(.value),
1264-
q_80 = quantile(.value, probs = 0.80),
1265-
q_95 = quantile(.value, probs = 0.95),
1266-
max = max(.value))
1267-
```
1268-
1269-
Useful functions of the **tidybayes** package are also `median_qi()`, `mean_qi()` ... after `gather_draws()` which you can use instead of `group_by()` and `summarise()` .
1270-
1271-
We would now like to visualise the estimated number of species per habitat type with associated uncertainty. With the function `spread_draws()` you can take a certain number of samples from the posterior distribution and convert them into a wide format table. The average number of species in bogs according to our model is $\exp(\beta_0)$ and in forests $\exp(\beta_0+\beta_1)$. We show the posterior distributions with the posterior median and 60 and 90% credible intervals.
1272-
1273-
```{r resultats-fit-poisson-3}
1274-
fit_poisson2 %>%
1275-
# spread 1000 posterior samples for 2 parameters in wide format
1276-
spread_draws(b_Intercept, b_habitatForest, ndraws = 1000, seed = 123) %>%
1277-
# calculate average numbers and convert to long format for visualisation
1278-
mutate(bog = exp(b_Intercept),
1279-
forest = exp(b_Intercept + b_habitatForest)) %>%
1280-
pivot_longer(cols = c("bog", "forest"), names_to = "habitat",
1281-
values_to = "sp_rich") %>%
1282-
# visualise via ggplot()
1283-
ggplot(aes(y = sp_rich, x = habitat)) +
1284-
stat_eye(point_interval = "median_qi", .width = c(0.6, 0.9)) +
1285-
scale_y_continuous(limits = c(0, NA))
1286-
```
1287-
1288-
In addition to `stat_eye()` you will find [here](https://mjskay.github.io/tidybayes/articles/tidy-brms.html#other-visualizations-of-distributions-stat_slabinterval) some nice ways to visualise posterior distributions .
1289-
1290-
We see a clear difference in the number of species between the two habitats. Is there a significant difference between the number of species in bogs and forests? We test the hypothesis that numbers are equal in bogs and forests.
1291-
1292-
$$
1293-
\exp(\beta_0) = \exp(\beta_0+\beta_1)\\
1294-
\Rightarrow \beta_0 = \beta_0 + \beta_1\\
1295-
\Rightarrow \beta_1 = 0\\
1296-
$$
1297-
1298-
This can easily be done via the `hypothesis()` function of the **brms** package.
1299-
The argument `alpha` specifies the size of the credible interval.
1300-
This allows hypothesis testing in a similar way to the frequentist null hypothesis testing framework.
1301-
1302-
```{r resultats-hypothesis-test}
1303-
# Test hypothesis difference between habitats
1304-
hyp <- hypothesis(fit_poisson2, "habitatForest = 0", alpha = 0.1)
1305-
hyp
1306-
```
1307-
1308-
```{r resultats-hypothesis-test-vis}
1309-
# Plot posterior distribution hypothesis
1310-
plot(hyp)
1311-
```
1312-
1313-
We can conclude that there is a significant difference since 0 is not included in the 90% credible interval.
1314-
1315-
Finally, we visualise the random effects of the sites. We sort them from high to low species richness.
1316-
1317-
```{r resultats-visualise-random-effects}
1318-
# Take the mean of SD of random effects
1319-
# to add to figure later
1320-
sd_mean <- fit_poisson2 %>%
1321-
spread_draws(sd_site__Intercept, ndraws = 1000, seed = 123) %>%
1322-
summarise(mean_sd = mean(sd_site__Intercept)) %>%
1323-
pull()
1324-
1325-
# Take random effects and plot
1326-
fit_poisson2 %>%
1327-
spread_draws(r_site[site,], ndraws = 1000, seed = 123) %>%
1328-
ungroup() %>%
1329-
mutate(site = reorder(site, r_site)) %>%
1330-
ggplot(aes(x = r_site, y = site)) +
1331-
geom_vline(xintercept = 0, color = "darkgrey", linewidth = 1) +
1332-
geom_vline(xintercept = c(sd_mean * qnorm(0.05), sd_mean * qnorm(0.95)),
1333-
color = "darkgrey", linetype = 2) +
1334-
stat_halfeye(point_interval = "median_qi", .width = 0.9, size = 2/3,
1335-
fill = "cornflowerblue")
1336-
```
1337-
1338-
1339-
# Comparison with frequentist statistics
1340-
1341-
Let's go back to our very first model where we used the Normal distribution. This was equivalent to a linear regression with categorical variable. A linear regression with categorical variable is also called ANOVA and if there are only two groups, an ANOVA is equivalent to a t-test. We can therefore take the opportunity to compare the results of our first model (a Bayesian model) with the results of a classical (frequentist) t-test.
1342-
1343-
```{r compare-frequentist}
1344-
# Extract summary statistics from the Bayesian model
1345-
sum_fit_normal1 <- summary(fit_normal1, prob = 0.9)
1346-
diff_bog1 <- sum_fit_normal1$fixed$Estimate[2]
1347-
ll_diff_bog1 <- sum_fit_normal1$fixed$`l-90% CI`[2]
1348-
ul_diff_bog1 <- sum_fit_normal1$fixed$`u-90% CI`[2]
1349-
1350-
sum_fit_normal1
1351-
```
1352-
1353-
```{r compare-frequentist-t-test}
1354-
# Perform t-test and extract summary statistics
1355-
t_test_normal1 <- t.test(sp_rich ~ habitat, data = ants_df, conf.level = 0.9)
1356-
diff_bog2 <- t_test_normal1$estimate[2] - t_test_normal1$estimate[1]
1357-
ll_diff_bog2 <- -t_test_normal1$conf.int[2]
1358-
ul_diff_bog2 <- -t_test_normal1$conf.int[1]
1359-
1360-
t_test_normal1
1361-
```
1362-
1363-
We see that this indeed produces almost exactly the same results. Our Bayesian model estimates that on average `r round(diff_bog1, 3)` more ant species occur in forests than in bogs (90% credible interval: `r round(ll_diff_bog1, 3)` to `r round(ul_diff_bog1, 3)`). The t-test estimates that on average `r round(diff_bog2, 3)` more ant species occur in forests than in bogs (90% confidence interval: `r round(ll_diff_bog2, 3)` to `r round(ul_diff_bog2, 3)`).
1364-
13651367

13661368
# References {-}
13671369

0 commit comments

Comments
 (0)