diff --git a/content/tutorials/r_brms/brms_eng/workshop_1_mcmc_en_brms_eng.Rmd b/content/tutorials/r_brms/brms_eng/workshop_1_mcmc_en_brms_eng.Rmd
@@ -1063,8 +1063,135 @@ comp_waic %>%
 Both based on the PPC and the comparisons with different model selection criteria, we can conclude that the second Poisson model with random intercepts fits the data best. In principle, we could have expected this based on our own intuition and the design of the study, i.e. the use of the Poisson distribution to model numbers and the use of random intercepts to control for a hierarchical design (habitats nested within sites).
 
 
-## Deep Dive: `rstan`
-### Stan: What? Why?!
+
+# Final model results
+
+When we look at the model fit object, we see output that resembles that of a frequentist model: we get an estimate of each parameter with its uncertainty. At the same time, it is clearly the output of a Bayesian model: we get information about the settings used for the MCMC algorithm, a 95% credible interval (CI) instead of a confidence interval, and the $\hat{R}$ value for each parameter, as discussed earlier.
+
+```{r results-fit-poisson}
+# Look at the fit object of the Poisson model with random effects
+fit_poisson2
+```
+
+A useful package for visualising the results of our final model is the [tidybayes](https://mjskay.github.io/tidybayes/articles/tidy-brms.html) package. Through this package, you can work with the posterior distributions as you would work with any dataset through the **tidyverse** package.
+
+With the function `gather_draws()` you can take a certain number of samples from the posterior distributions of selected parameters and convert them into a long format table. You usually do not need all posterior samples, as there are often more than necessary. By specifying a `seed` you ensure that the same samples are drawn every time you run the script. You can then calculate summary statistics with the classic **dplyr** functions.
+
+```{r results-fit-poisson-2}
+fit_poisson2 %>%
+  # gather 1000 posterior samples for 2 parameters in long format
+  gather_draws(b_Intercept, b_habitatForest, ndraws = 1000, seed = 123) %>%
+  # calculate summary statistics for each variable
+  group_by(.variable) %>%
+  summarise(min = min(.value),
+            q_05 = quantile(.value, probs = 0.05),
+            q_20 = quantile(.value, probs = 0.20),
+            mean = mean(.value),
+            median = median(.value),
+            q_80 = quantile(.value, probs = 0.80),
+            q_95 = quantile(.value, probs = 0.95),
+            max = max(.value))
+```
+
+Other useful functions of the **tidybayes** package are `median_qi()`, `mean_qi()`, etc., which you can use after `gather_draws()` instead of `group_by()` and `summarise()`.
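+
+As a minimal sketch of that shortcut, the chunk below replaces the `group_by()` and `summarise()` steps with a single `median_qi()` call. The interval widths of 60% and 90% are an arbitrary choice for illustration, and the chunk assumes that `fit_poisson2` is available as in the chunks above.
+
+```{r results-fit-poisson-median-qi}
+library(dplyr)
+library(tidybayes)
+
+fit_poisson2 %>%
+  # gather the same 1000 posterior samples in long format
+  gather_draws(b_Intercept, b_habitatForest, ndraws = 1000, seed = 123) %>%
+  # posterior median with 60% and 90% quantile intervals per variable
+  median_qi(.width = c(0.6, 0.9))
+```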
+
+We would now like to visualise the estimated number of species per habitat type with the associated uncertainty. With the function `spread_draws()` you can take a certain number of samples from the posterior distribution and convert them into a wide format table. According to our model, the average number of species is $\exp(\beta_0)$ in bogs and $\exp(\beta_0+\beta_1)$ in forests. We show the posterior distributions with the posterior median and the 60% and 90% credible intervals.
+
+```{r resultats-fit-poisson-3}
+fit_poisson2 %>%
+  # spread 1000 posterior samples for 2 parameters in wide format
+  spread_draws(b_Intercept, b_habitatForest, ndraws = 1000, seed = 123) %>%
+  # calculate average numbers and convert to long format for visualisation
+  mutate(bog = exp(b_Intercept),
+         forest = exp(b_Intercept + b_habitatForest)) %>%
+  pivot_longer(cols = c("bog", "forest"), names_to = "habitat", 
+               values_to = "sp_rich") %>%
+  # visualise via ggplot()
+  ggplot(aes(y = sp_rich, x = habitat)) +
+    stat_eye(point_interval = "median_qi", .width = c(0.6, 0.9)) +
+    scale_y_continuous(limits = c(0, NA))
+```
+
+In addition to `stat_eye()`, you will find some other nice ways to visualise posterior distributions [here](https://mjskay.github.io/tidybayes/articles/tidy-brms.html#other-visualizations-of-distributions-stat_slabinterval).
+
+We see a clear difference in the number of species between the two habitats. Is there a significant difference between the number of species in bogs and forests? We test the hypothesis that numbers are equal in bogs and forests.
+
+$$
+\exp(\beta_0) = \exp(\beta_0+\beta_1)\\
+\Rightarrow \beta_0 = \beta_0 + \beta_1\\
+\Rightarrow \beta_1 = 0\\
+$$
+
+This can easily be done via the `hypothesis()` function of the **brms** package.
+The argument `alpha` sets the alpha level of the test; for a two-sided hypothesis such as this one, `alpha = 0.1` corresponds to a 90% credible interval.
+This allows hypothesis testing in a similar way to the frequentist null hypothesis testing framework.
+
+```{r resultats-hypothesis-test}
+# Test hypothesis difference between habitats
+hyp <- hypothesis(fit_poisson2, "habitatForest = 0", alpha = 0.1)
+hyp
+```
+
+```{r resultats-hypothesis-test-vis}
+# Plot posterior distribution hypothesis
+plot(hyp)
+```
+
+We can conclude that there is a significant difference since 0 is not included in the 90% credible interval.
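+
+The same conclusion can be checked directly on the posterior draws. The sketch below assumes that `as_draws_df()` (re-exported by **brms** from the **posterior** package) returns a column `b_habitatForest`, matching the parameter name used with `gather_draws()` above. The object name `draws_poisson2` is introduced here for illustration only.
+
+```{r resultats-hypothesis-test-draws}
+library(brms)
+
+# Extract all posterior draws as a data frame
+draws_poisson2 <- as_draws_df(fit_poisson2)
+
+# 90% quantile-based credible interval for the habitat effect
+quantile(draws_poisson2$b_habitatForest, probs = c(0.05, 0.95))
+
+# Proportion of posterior draws with more species in forests than in bogs
+mean(draws_poisson2$b_habitatForest > 0)
+```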
+
+Finally, we visualise the random effects of the sites. We sort them from high to low species richness.
+
+```{r resultats-visualise-random-effects}
+# Take the posterior mean of the SD of the random effects
+# to add to the figure later
+sd_mean <- fit_poisson2 %>%
+  spread_draws(sd_site__Intercept, ndraws = 1000, seed = 123) %>%
+  summarise(mean_sd = mean(sd_site__Intercept)) %>%
+  pull()
+
+# Take random effects and plot
+fit_poisson2 %>%
+  spread_draws(r_site[site,], ndraws = 1000, seed = 123) %>%
+  ungroup() %>%
+  mutate(site = reorder(site, r_site)) %>%
+  ggplot(aes(x = r_site, y = site)) +
+    geom_vline(xintercept = 0, color = "darkgrey", linewidth = 1) +
+    geom_vline(xintercept = c(sd_mean * qnorm(0.05), sd_mean * qnorm(0.95)),
+               color = "darkgrey", linetype = 2) +
+    stat_halfeye(point_interval = "median_qi", .width = 0.9, size = 2/3,
+                 fill = "cornflowerblue")
+```
+
+
+# Comparison with frequentist statistics
+
+Let's go back to our very first model, where we used the Normal distribution. This was equivalent to a linear regression with a categorical variable. A linear regression with a categorical variable is also called ANOVA, and if there are only two groups, an ANOVA is equivalent to a t-test. We can therefore take the opportunity to compare the results of our first model (a Bayesian model) with the results of a classical (frequentist) t-test.
+
+```{r compare-frequentist}
+# Extract summary statistics from the Bayesian model
+sum_fit_normal1 <- summary(fit_normal1, prob = 0.9)
+diff_bog1 <- sum_fit_normal1$fixed$Estimate[2]
+ll_diff_bog1 <- sum_fit_normal1$fixed$`l-90% CI`[2]
+ul_diff_bog1 <- sum_fit_normal1$fixed$`u-90% CI`[2]
+
+sum_fit_normal1
+```
+
+```{r compare-frequentist-t-test}
+# Perform t-test and extract summary statistics
+t_test_normal1 <- t.test(sp_rich ~ habitat, data = ants_df, conf.level = 0.9)
+diff_bog2 <- t_test_normal1$estimate[2] - t_test_normal1$estimate[1]
+ll_diff_bog2 <- -t_test_normal1$conf.int[2]
+ul_diff_bog2 <- -t_test_normal1$conf.int[1]
+
+t_test_normal1
+```
+
+We see that this indeed produces almost exactly the same results. Our Bayesian model estimates that on average `r round(diff_bog1, 3)` more ant species occur in forests than in bogs (90% credible interval: `r round(ll_diff_bog1, 3)` to `r round(ul_diff_bog1, 3)`). The t-test estimates that on average `r round(diff_bog2, 3)` more ant species occur in forests than in bogs (90% confidence interval: `r round(ll_diff_bog2, 3)` to `r round(ul_diff_bog2, 3)`).
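+
+Note that `t.test()` applies the Welch correction for unequal variances by default, whereas a linear regression assumes a common residual variance. As a sketch of the equivalence mentioned earlier, the chunk below fits the frequentist linear regression with `lm()` and repeats the t-test with `var.equal = TRUE`. The object names `lm_normal1` and `t_test_equal_var` are introduced here for illustration, and we assume that our first Bayesian model used a single residual standard deviation.
+
+```{r compare-frequentist-lm}
+# Frequentist linear regression with habitat as categorical variable
+lm_normal1 <- lm(sp_rich ~ habitat, data = ants_df)
+
+# Estimates and 90% confidence intervals of the regression coefficients
+coef(lm_normal1)
+confint(lm_normal1, level = 0.9)
+
+# Student t-test with equal variances: same interval as lm(), but with
+# the sign flipped because t.test() reports first group minus second group
+t_test_equal_var <- t.test(sp_rich ~ habitat, data = ants_df,
+                           conf.level = 0.9, var.equal = TRUE)
+t_test_equal_var
+```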
+
+
+# Deep Dive: `rstan`
+## Stan: What? Why?!
 The `brms` package is a convenience wrapper for the `rstan` package, which in turn ports `stan` functionality to R. 
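-Stan is a modeling framework written in the `C` programming language, which implements many probabilistic ("Bayesian") modeling tools.
+Stan is a modeling framework written in the `C++` programming language, which implements many probabilistic ("Bayesian") modeling tools.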
 More info can be found on [the Stan website](https://mc-stan.org).
@@ -1074,7 +1201,7 @@ The advantage of `brms` is usability: many functions work out-of-the-box, with r
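-However, the relative ease-of-use comes at the cost of flexibility, and do some degree, readability.
+However, the relative ease-of-use comes at the cost of flexibility, and to some degree, readability.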
 
 In contrast, Stan and `rstan` lean more to the mathematical formulation of models.
-Every aspect of the model has to be explicitly set, which can be an advantage (e.g. if you face non-standard use cases), or disadvantage (e.g. if you secify models in non-optional ways).
+Every aspect of the model has to be explicitly set, which can be an advantage (e.g. if you face non-standard use cases), or disadvantage (e.g. if you specify models in non-optimal ways).
 
 
 To briefly give an impression, we will build the same models as above, using the Stan framework.
@@ -1086,7 +1213,7 @@ conflicted::conflicts_prefer(rstan::extract)
 conflicted::conflicts_prefer(brms::loo)
 ```
 
-### Model Definition
+## Model Definition
 RMarkdown can handle `stan` code chunks, though more general model definition is outsourced to a separate "*.stan" file.
 Alternatively, you can define your model in a big text block, as shown below.
 The simple poisson model resembles [one of the `stan`-dard examples](https://mc-stan.org/docs/stan-users-guide/posterior-prediction.html#posterior-prediction-for-regressions), which you can refer to for all further details and more.
@@ -1135,7 +1262,7 @@ stan_poisson_model <- stan_model(
 ```
 
 
-### Sampling
+## Sampling
 
 Sampling does pretty much the same as above, since at the core, `brms` is just `stan`.
 
@@ -1172,7 +1299,7 @@ In other cases, it might pay off.
 Know that Stan is there for you, do not hesitate to turn to its extensive documentation, and do not fear to give it a try!
 
 
-### Homework: Hierarchical Model
+## Homework: Hierarchical Model
 To take your modeling skills even further, you may implement and sample the "random intercept" model.
 In "Bayesian" terms, the [general terminology is "hierarchical" model](https://mc-stan.org/docs/stan-users-guide/regression.html#hierarchical-regression).
 
@@ -1237,131 +1364,6 @@ stan_poisson_fit
 With Stan, po(i)ssibilities are almost endless - don't get lost in model building!
 
 
-# Final model results
-
-When we look at the model fit object, we see results that are similar to results we see when we fit a frequentist model. On the one hand we get an estimate of all parameters with their uncertainty, but on the other hand we see that this is clearly the output of a Bayesian model. We get information about the parameters we used for the MCMC algorithm, we get a 95% credible interval (CI) instead of a confidence interval and we also get the $\hat{R}$ value for each parameter as discussed earlier.
-
-```{r results-fit-poisson}
-# Look at the fit object of the Poisson model with random effects
-fit_poisson2
-```
-
-A useful package for visualising the results of our final model is the [tidybayes](https://mjskay.github.io/tidybayes/articles/tidy-brms.html) package. Through this package, you can work with the posterior distributions as you would work with any dataset through the **tidyverse** package.
-
-With the function `gather_draws()` you can take a certain number of samples from the posterior distributions of certain parameters and convert them into a long format table. You usually do not want to select all posterior samples because there are sometimes unnecessarily many. By specifying a 'seed' you ensure that these are the same samples every time you run the script again. You can then calculate certain summary statistics via the classic **dplyr** functions.
-
-```{r results-fit-poisson-2}
-fit_poisson2 %>%
-  # gather 1000 posterior samples for 2 parameters in long format
-  gather_draws(b_Intercept, b_habitatForest, ndraws = 1000, seed = 123) %>%
-  # calculate summary statistics for each variable
-  group_by(.variable) %>%
-  summarise(min = min(.value),
-            q_05 = quantile(.value, probs = 0.05),
-            q_20 = quantile(.value, probs = 0.20),
-            mean = mean(.value),
-            median = median(.value),
-            q_80 = quantile(.value, probs = 0.80),
-            q_95 = quantile(.value, probs = 0.95),
-            max = max(.value))
-```
-
-Useful functions of the **tidybayes** package are also `median_qi()`, `mean_qi()` ... after `gather_draws()` which you can use instead of `group_by()` and `summarise()` .
-
-We would now like to visualise the estimated number of species per habitat type with associated uncertainty. With the function `spread_draws()` you can take a certain number of samples from the posterior distribution and convert them into a wide format table. The average number of species in bogs according to our model is $\exp(\beta_0)$ and in forests $\exp(\beta_0+\beta_1)$. We show the posterior distributions with the posterior median and 60 and 90% credible intervals.
-
-```{r resultats-fit-poisson-3}
-fit_poisson2 %>%
-  # spread 1000 posterior samples for 2 parameters in wide format
-  spread_draws(b_Intercept, b_habitatForest, ndraws = 1000, seed = 123) %>%
-  # calculate average numbers and convert to long format for visualisation
-  mutate(bog = exp(b_Intercept),
-         forest = exp(b_Intercept + b_habitatForest)) %>%
-  pivot_longer(cols = c("bog", "forest"), names_to = "habitat", 
-               values_to = "sp_rich") %>%
-  # visualise via ggplot()
-  ggplot(aes(y = sp_rich, x = habitat)) +
-    stat_eye(point_interval = "median_qi", .width = c(0.6, 0.9)) +
-    scale_y_continuous(limits = c(0, NA))
-```
-
-In addition to `stat_eye()` you will find [here](https://mjskay.github.io/tidybayes/articles/tidy-brms.html#other-visualizations-of-distributions-stat_slabinterval) some nice ways to visualise posterior distributions .
-
-We see a clear difference in the number of species between the two habitats. Is there a significant difference between the number of species in bogs and forests? We test the hypothesis that numbers are equal in bogs and forests.
-
-$$
-\exp(\beta_0) = \exp(\beta_0+\beta_1)\\
-\Rightarrow \beta_0 = \beta_0 + \beta_1\\
-\Rightarrow \beta_1 = 0\\
-$$
-
-This can easily be done via the `hypothesis()` function of the **brms** package.
-The argument `alpha` specifies the size of the credible interval.
-This allows hypothesis testing in a similar way to the frequentist null hypothesis testing framework.
-
-```{r resultats-hypothesis-test}
-# Test hypothesis difference between habitats
-hyp <- hypothesis(fit_poisson2, "habitatForest = 0", alpha = 0.1)
-hyp
-```
-
-```{r resultats-hypothesis-test-vis}
-# Plot posterior distribution hypothesis
-plot(hyp)
-```
-
-We can conclude that there is a significant difference since 0 is not included in the 90% credible interval.
-
-Finally, we visualise the random effects of the sites. We sort them from high to low species richness.
-
-```{r resultats-visualise-random-effects}
-# Take the mean of SD of random effects
-# to add to figure later
-sd_mean <- fit_poisson2 %>%
-  spread_draws(sd_site__Intercept, ndraws = 1000, seed = 123) %>%
-  summarise(mean_sd = mean(sd_site__Intercept)) %>%
-  pull()
-
-# Take random effects and plot
-fit_poisson2 %>%
-  spread_draws(r_site[site,], ndraws = 1000, seed = 123) %>%
-  ungroup() %>%
-  mutate(site = reorder(site, r_site)) %>%
-  ggplot(aes(x = r_site, y = site)) +
-    geom_vline(xintercept = 0, color = "darkgrey", linewidth = 1) +
-    geom_vline(xintercept = c(sd_mean * qnorm(0.05), sd_mean * qnorm(0.95)),
-               color = "darkgrey", linetype = 2) +
-    stat_halfeye(point_interval = "median_qi", .width = 0.9, size = 2/3,
-                 fill = "cornflowerblue")
-```
-
-
-# Comparison with frequentist statistics
-
-Let's go back to our very first model where we used the Normal distribution. This was equivalent to a linear regression with categorical variable. A linear regression with categorical variable is also called ANOVA and if there are only two groups, an ANOVA is equivalent to a t-test. We can therefore take the opportunity to compare the results of our first model (a Bayesian model) with the results of a classical (frequentist) t-test.
-
-```{r compare-frequentist}
-# Extract summary statistics from the  Bayesian model
-sum_fit_normal1 <- summary(fit_normal1, prob = 0.9)
-diff_bog1 <- sum_fit_normal1$fixed$Estimate[2]
-ll_diff_bog1 <- sum_fit_normal1$fixed$`l-90% CI`[2]
-ul_diff_bog1 <- sum_fit_normal1$fixed$`u-90% CI`[2]
-
-sum_fit_normal1
-```
-
-```{r compare-frequentist-t-test}
-# Perform t-test and extract summary statistics
-t_test_normal1 <- t.test(sp_rich ~ habitat, data = ants_df, conf.level = 0.9)
-diff_bog2 <- t_test_normal1$estimate[2] - t_test_normal1$estimate[1]
-ll_diff_bog2 <- -t_test_normal1$conf.int[2]
-ul_diff_bog2 <- -t_test_normal1$conf.int[1]
-
-t_test_normal1
-```
-
-We see that this indeed produces almost exactly the same results. Our Bayesian model estimates that on average `r round(diff_bog1, 3)` more ant species occur in forests than in bogs (90% credible interval:  `r round(ll_diff_bog1, 3)` to `r round(ul_diff_bog1, 3)`). The t-test estimates that on average `r round(diff_bog2, 3)` more ant species occur in forests than in bogs (90% confidence interval: `r round(ll_diff_bog2, 3)` to `r round(ul_diff_bog2, 3)`).
-
 
 # References {-}