brms_rstan: general updates

falkmielke · falkmielke · commit d09e99e21432 · 2024-11-21T14:18:20.000+01:00
diff --git a/content/tutorials/r_brms/brms_eng/workshop_1_mcmc_en_brms_eng.Rmd b/content/tutorials/r_brms/brms_eng/workshop_1_mcmc_en_brms_eng.Rmd
@@ -297,7 +297,7 @@ confint(lm1, level = 0.9)
 We source some short functions to calculate the (log) likelihood and the prior and to execute the MCMC metropolis algorithm.
 
 ```{r}
-source(file = "./source/mcmc_functions.R")
+source(file = "./mcmc_functions.R")
 ```
 
 For this simple model with a small data set, we can calculate and plot the posterior for a large number of combinations of 'beta_0' and 'beta_1'.
@@ -345,7 +345,7 @@ ggplot(df, aes(x = beta_0, y = beta_1, z = post)) +
 
 The starting value is quite far from the maximum likelihood.
 It takes a while for the MCMC to stabilize in the vicinity of the top of the mountain.
-Therefore, the first part of the MCMC is never used. This is called the 'burn-in' or 'warm-up'.
+Therefore, the first part of the MCMC is discarded. This phase is called the 'warmup', 'burn-in' or 'tuning'.
 
 We now run the same model, but with a much longer MCMC (more iterations):
 
@@ -392,7 +392,7 @@ The MCMC for each parameter can be displayed in a so called 'trace plot' where w
 ann_text <- tibble(param = c("beta_0", "beta_1"),
                    x = 210,
                    y = c(max(mcmc_l$beta_0), max(mcmc_l$beta_1)),
-                   lab = "burn-in")
+                   lab = "warmup")
 mcmc_l %>%
   dplyr::select(iter, beta_0, beta_1) %>%
   pivot_longer(cols = c("beta_0", "beta_1"), names_to = "param") %>%
@@ -521,9 +521,8 @@ Some alternatives also exist:
 
 ![Overview of various Stan software (source: https://jtimonen.github.io/posts/post-01/)](software.png)
 
-# Fitting a model with brms
 
-## Loading the dataset and data exploration
+# Loading the dataset and data exploration
 
 We load a dataset on the number of ant species in New England (USA). Type `?ants` into the console for more info.
 
@@ -593,6 +592,9 @@ ants_df %>%
 As an exercise, we will create a model to compare the number of species between both habitats.
 From the data exploration, we already saw that the number of species seems to be higher in forests and that sites with a higher number in bogs often also have a higher number in forests.
 
+
+# Fitting a model with `brms`
+
 ## Specification of a linear regression
 
 ### Model specification
@@ -617,8 +619,8 @@ First of all we decide which MCMC parameters we will use. Type `?brm` to see wha
 ```{r simple-model-mcmc-par}
 # Set MCMC parameters
 nchains <- 3 # number of chains
-niter <- 2000 # number of iterations (incl. burn-in, see next)
-burnin <- niter / 4 # number of initial samples to remove (= burn-in)
+niter <- 2000 # number of iterations (incl. warmup, see next)
+warmup <- niter / 4 # number of initial samples to remove (= warmup)
 nparallel <- nchains # number of cores for parallel computing
 thinning <- 1 # thinning factor (here 1 = no thinning)
 ```
@@ -630,14 +632,14 @@ The model is fitted using the `brm()` function. The syntax is very similar to fu
 - `file` and `file_refit` to save the model object after it has been fitted. If you run the code again and the model has already been saved, `brm()` will simply load this model instead of refitting it.
 
 
-```{r simple-model-fit-poisson}
+```{r simple-model-fit-poisson, class.source = 'fold-show'}
 # Fit Normal model
 fit_normal1 <- brm(
   formula = sp_rich ~ habitat, # specify the model
   family = gaussian(),         # we use the Normal distribution
   data = ants_df,              # specify data
   chains = nchains,            # MCMC parameters
-  warmup = burnin, 
+  warmup = warmup, 
   iter = niter,
   cores = nparallel,
   thin = thinning,
@@ -649,7 +651,7 @@ Before we look at the results, we first check whether the model converges well.
 
 ### MCMC convergence
 
-There are several ways to check convergence of the MCMC algorithm for each parameter. The burn-in samples are not taken into account. First and foremost, you have *visual controls*.
+There are several ways to check convergence of the MCMC algorithm for each parameter. The warmup samples are not taken into account. First and foremost, there are *visual checks*.
 
 We can obtain the MCMC samples with the `as_draws()` functions or visualise them at once via the [**bayesplot**](https://mc-stan.org/bayesplot/) package that is compatible with brmsfit objects.
 
@@ -836,14 +838,14 @@ $$
 
 So we need to estimate two parameters: $\beta_0$ and $\beta_1$ We use the same MCMC parameters as before. The only thing we need to adjust is the choice `family = poisson()`.
 
-```{r poisson-model-fit}
+```{r poisson-model-fit, class.source = 'fold-show'}
 # Fit Poisson model
 fit_poisson1 <- brm(
   formula = sp_rich ~ habitat, # specify the model
   family = poisson(),          # we use the Poisson distribution
   data = ants_df,              # specify the data
   chains = nchains,            # MCMC parameters
-  warmup = burnin, 
+  warmup = warmup, 
   iter = niter,
   cores = nparallel,
   thin = thinning,
@@ -898,14 +900,14 @@ $$
 b_0 \sim N(0, \sigma_b)
 $$
 
-```{r rand-intercept-model-fit}
+```{r rand-intercept-model-fit, class.source = 'fold-show'}
 # Fit Poisson model with random intercept per site
 fit_poisson2 <- brm(
   formula = sp_rich ~ habitat + (1|site),
   family = poisson(),
   data = ants_df,
   chains = nchains,
-  warmup = burnin, 
+  warmup = warmup, 
   iter = niter,
   cores = nparallel,
   thin = thinning,
@@ -937,11 +939,11 @@ pp_check(fit_poisson2, type = "dens_overlay_grouped", ndraws = 100,
 
 How can we objectively compare these models?
 
-# Compare models
+## Compare models
 
 Based on the PPCs we can already see which model fits the data best. Furthermore, there are some functions that **brms** provides to compare different models. With the function `add_criterion()` you can add model fit criteria to model objects. Type `?add_criterion()` to see which ones are available. See also <https://mc-stan.org/loo/articles/online-only/faq.html>
 
-## Leave-one-out cross validation
+### Leave-one-out cross validation
 
 Cross-validation (CV) is a family of techniques that attempts to estimate how well a model would predict unknown data through predictions of the model fitted to the known data. You do not necessarily have to collect new data for this. You can split your own data into a test and training dataset. You fit the model to the training dataset and then use that model to estimate how well it can predict the data in the test dataset. With leave-one-out CV (LOOCV) you leave out one observation each time as test dataset and refit the model based on all other observations (= training dataset).
 
@@ -977,7 +979,7 @@ comp_loo %>%
          ul_diff = elpd_diff  + qnorm(0.95) * se_diff)
 ```
 
-## K-fold cross-validation
+### K-fold cross-validation
 
 With K-fold cross-validation, the data is split into $K$ groups. We will use $K = 10$ groups (= folds) here. So instead of leaving out a single observation each time, as with leave-one-out CV, we will leave out one $10^{th}$ of the data here. Via the arguments `folds = "stratified"` and `group = "habitat"` we ensure that the relative frequencies of habitat are preserved for each group. This technique will therefore be less precise than the previous one, but will be faster to calculate if you work with a lot of data.
 
@@ -1023,7 +1025,7 @@ comp_kfold %>%
          ul_diff = elpd_diff  + qnorm(0.95) * se_diff)
 ```
 
-## WAIC
+### WAIC
 
 The Widely Applicable Information Criterion (WAIC) does not use cross-validation but is a computational way to estimate the ELPD. How this happens exactly is beyond the purpose of this tutorial. It is yet another measure to apply model selection.
 
@@ -1055,10 +1057,11 @@ comp_waic %>%
          ul_diff = elpd_diff  + qnorm(0.95) * se_diff)
 ```
 
-## Conclusion
+### Conclusion
 
 Both based on the PPC and the comparisons with different model selection criteria, we can conclude that the second Poisson model with random intercepts fits the data best. In principle, we could have expected this based on our own intuition and the design of the study, i.e. the use of the Poisson distribution to model numbers and the use of random intercepts to control for a hierarchical design (habitats nested within sites).
 
+
 # Final model results
 
 When we look at the model fit object, we see results that are similar to results we see when we fit a frequentist model. On the one hand we get an estimate of all parameters with their uncertainty, but on the other hand we see that this is clearly the output of a Bayesian model. We get information about the parameters we used for the MCMC algorithm, we get a 95% credible interval (CI) instead of a confidence interval and we also get the $\hat{R}$ value for each parameter as discussed earlier.
diff --git a/content/tutorials/r_brms/brms_nl/workshop_1_mcmc_en_brms.Rmd b/content/tutorials/r_brms/brms_nl/workshop_1_mcmc_en_brms.Rmd
@@ -293,7 +293,7 @@ confint(lm1, level = 0.9)
 We lezen enkele korte functies in om de (log) likelihood en de prior te berekenen en het MCMC metropolis algoritme uit te voeren.
 
 ```{r}
-source(file = "./source/mcmc_functions.R")
+source(file = "./mcmc_functions.R")
 ```
 
 Voor dit eenvoudig model met een kleine dataset kunnen we de posterior voor een groot aantal combinaties voor `beta_0` en `beta_1` uitrekenen en plotten.
@@ -340,7 +340,7 @@ ggplot(df, aes(x = beta_0, y = beta_1, z = post)) +
   theme(legend.position="none")
 ```
 
-De startwaarde ligt een eind van de maximale likelihood. Het duurt even voor de MCMC stabiliseert in de omgeving van de top van de berg. Daarom wordt het eerste deel van de MCMC nooit gebruikt. Dit is de 'burn-in' (of warmp-up in brms).
+De startwaarde ligt een eind van de maximale likelihood. Het duurt even voor de MCMC stabiliseert in de omgeving van de top van de berg. Daarom wordt het eerste deel van de MCMC nooit gebruikt. Dit is de 'warmup' (Engels voor opwarmfase; ook 'burn-in' of 'tuning' in andere bibliotheken).
 
 We runnen nu hetzelde model, maar met een veel langere mcmc
 
@@ -387,7 +387,7 @@ Voor elke parameter kan de MCMC worden weergegeven in een 'trace plot'.
 ann_text <- tibble(param = c("beta_0", "beta_1"),
                    x = 210,
                    y = c(max(mcmc_l$beta_0), max(mcmc_l$beta_1)),
-                   lab = "burn-in")
+                   lab = "warmup")
 mcmc_l %>%
   dplyr::select(iter, beta_0, beta_1) %>%
   pivot_longer(cols = c("beta_0", "beta_1"), names_to = "param") %>%
@@ -509,9 +509,8 @@ Er bestaan ook enkele alternatieven:
 ![Overzicht van verschillende Stan software (bron: https://jtimonen.github.io/posts/post-01/)](software.png)
 
 
-# Een model fitten met brms
 
-## Dataset laden en data exploratie
+# Dataset laden en data exploratie
 
 We laden een dataset in over het aantal mierensoorten in New England (USA). Typ `?ants` in de console voor meer info.
 
@@ -583,6 +582,9 @@ ants_df %>%
 
 Als oefening zullen we een model maken om het aantal soorten te vergelijken tussen beide habitats. Uit de data exploratie zagen we al dat het aantal hoger lijkt te liggen in bossen en dat sites met een hoger aantal in moerassen vaak ook een hoger aantal in bossen hebben.
 
+
+# Een model fitten met `brms`
+
 ## Specificatie van een lineaire regressie
 
 ### Model specificatie
@@ -607,8 +609,8 @@ Eerst en vooral besluiten we welke MCMC parameters we zullen gebruiken. Typ `?br
 ```{r simpel-model-mcmc-par}
 # Instellen MCMC parameters
 nchains <- 3           # aantal chains
-niter <- 2000          # aantal iteraties (incl. burn-in, zie volgende)
-burnin <- niter / 4    # aantal initiële samples om te verwijderen (= burn-in)
+niter <- 2000          # aantal iteraties (incl. warmup, zie volgende)
+warmup <- niter / 4    # aantal initiële samples om te verwijderen (= warmup)
 nparallel <- nchains   # aantal cores voor parallel computing
 thinning <- 1          # verdunningsfactor (hier 1 = geen verdunning)
 ```
@@ -620,14 +622,14 @@ Het model wordt gefit a.d.h.v. de `brm()` functie. De syntax is zeer gelijkaardi
 -   `file` en `file_refit` om het model object op te slaan nadat het gefit is. Als je de code opnieuw runt en het model is al eens opgeslaan, dan zal `brm()` dit model gewoon inladen in plaats van het opnieuw te fitten.
 
 
-```{r simpel-model-fit-poisson}
+```{r simpel-model-fit-poisson, class.source = 'fold-show'}
 # Fit Normaal model
 fit_normal1 <- brm(
   formula = sp_rich ~ habitat, # beschrijving van het model
   family = gaussian(),         # we gebruiken de Normaal verdeling
   data = ants_df,              # ingeven data
   chains = nchains,            # MCMC parameters
-  warmup = burnin, 
+  warmup = warmup, 
   iter = niter,
   cores = nparallel,
   thin = thinning,
@@ -638,7 +640,7 @@ Voor we de resultaten bekijken, controleren we eerst of het model goed convergee
 
 ### MCMC convergentie
 
-Er zijn verschillende manieren om convergentie van het MCMC algoritme voor elke parameter te controleren. Hierbij worden de burn-in samples niet in rekening genomen. Eerst en vooral heb je *visuele controles*.
+Er zijn verschillende manieren om convergentie van het MCMC algoritme voor elke parameter te controleren. Hierbij worden de warmup samples niet in rekening genomen. Eerst en vooral heb je *visuele controles*.
 
 We kunnen de MCMC samples met de `as_draws()` functies verkrijgen ofwel ineens visualiseren via de [bayesplot](https://mc-stan.org/bayesplot/) package die compatibel is met brmsfit objecten.
 
@@ -825,14 +827,14 @@ $$
 
 We moeten dus twee parameters schatten: $\beta_0$ en $\beta_1$ We gebruiken dezelfde MCMC parameters als voordien. Het enige wat we moeten aanpassen is de keuze `family = poisson()`.
 
-```{r poisson-model-fit}
+```{r poisson-model-fit, class.source = 'fold-show'}
 # Fit Poisson model
 fit_poisson1 <- brm(
   formula = sp_rich ~ habitat, # beschrijving van het model
   family = poisson(),          # we gebruiken de Poisson verdeling
   data = ants_df,              # ingeven data
   chains = nchains,            # MCMC parameters
-  warmup = burnin, 
+  warmup = warmup, 
   iter = niter,
   cores = nparallel,
   thin = thinning,
@@ -888,14 +890,14 @@ $$
 b_0 \sim N(0, \sigma_b)
 $$
 
-```{r rand-intercept-model-fit}
+```{r rand-intercept-model-fit, class.source = 'fold-show'}
 # Fit Poisson model met random intercepten per site
 fit_poisson2 <- brm(
   formula = sp_rich ~ habitat + (1|site),
   family = poisson(),
   data = ants_df,
   chains = nchains,
-  warmup = burnin, 
+  warmup = warmup, 
   iter = niter,
   cores = nparallel,
   thin = thinning,
@@ -929,11 +931,11 @@ Hoe kunnen we deze modellen nu objectief gaan vergelijken?
 
 
 
-# Vergelijken van modellen
+## Vergelijken van modellen
 
 Op basis van de PPCs kunnen we reeds zien welk model het best past bij de data. Verder zijn er nog enkele functies die **brms** voorziet om verschillende modellen te vergelijken. Met de functie `add_criterion()` kan je model fit criteria toevoegen aan model objecten. Typ `?add_criterion()` om te zien welke beschikbaar zijn. Zie ook <https://mc-stan.org/loo/articles/online-only/faq.html>
 
-## Leave-one-out cross-validation
+### Leave-one-out cross-validation
 
 Cross-validation (CV) is een familie van technieken die probeert in te schatten hoe goed een model onbekende data zou voorspellen via predicties van het model gefit op de bekende data. Hiervoor moet je niet per se nieuwe data gaan inzamelen. Je kan jouw eigen data opsplitsen in een test en training dataset. Je fit het model op de training dataset en je gebruikt dat model dan om te schatten hoe goed het de data in de test dataset kan voorspellen. Bij leave-one-out CV (LOOCV) ga je telkens één observatie weglaten en het model opnieuw fitten o.b.v. alle andere observaties.
 
@@ -969,7 +971,7 @@ comp_loo %>%
          ul_diff = elpd_diff  + qnorm(0.95) * se_diff)
 ```
 
-## K-fold cross-validation
+### K-fold cross-validation
 
 Bij K-fold cross-validation worden de data in $K$ groepen opgesplitst. Wij zullen hier $K = 10$ groepen (= folds) gebruiken. In plaats van dus telkens één enkele observatie weg te laten zoals bij leave-one-out CV gaan we hier $1/10$e van de data weglaten. Via de argumenten `folds = "stratified"` en `group = "habitat"` zorgen we ervoor dat voor elke groep de relatieve frequenties van habitat bewaard blijven. Deze techniek zal dus minder precies zijn dan de vorige, maar zal sneller zijn om te berekenen indien je met heel veel data werkt.
 
@@ -1014,7 +1016,7 @@ comp_kfold %>%
          ul_diff = elpd_diff  + qnorm(0.95) * se_diff)
 ```
 
-## WAIC
+### WAIC
 
 Het Widely Applicable Information Criterion (WAIC) maakt geen gebruik van cross-validation maar is een computationele manier om de ELPD te schatten. Hoe dit precies gebeurt valt buiten het doel van deze workshop. Het is nog een andere maat om model selectie toe te passen.
 
@@ -1046,10 +1048,11 @@ comp_waic %>%
          ul_diff = elpd_diff  + qnorm(0.95) * se_diff)
 ```
 
-## Conclusie
+### Conclusie
 
 Zowel o.b.v. de PPC als vergelijkingen met verschillende model selectie criteria, kunnen we besluiten dat het tweede Poisson model met random intercepts het best past bij de data. In principe konden we dit ook verwachten op basis van onze eigen intuïtie en het design van de studie, nl. het gebruik van de Poisson distributie om aantallen te modelleren en het gebruik van random intercepts om te controleren voor een hiërarchisch design (habitats genest in sites).
 
+
 # Resultaten finale model
 
 Als we het model fit object bekijken, zien we resultaten die vergelijkbaar zijn met resultaten zoals we die zien als we een frequentist model gefit hebben. We krijgen enerzijds een schatting van alle parameters met hun onzekerheid maar anderzijds zien we dat dit duidelijk de output is van een Bayesiaans model. Zo krijgen we info over de parameters die we gebruikt hebben voor het MCMC algoritme, we krijgen een 95 % credible interval (CI) in plaats van een confidence interval en we krijgen bij elke parameter ook de $\hat{R}$ waarde die we eerder besproken hebben.
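
Note on the `burnin` to `warmup` rename: a call such as `brm(warmup = warmup, ...)` is valid R, because the left-hand side names the argument while the right-hand side is evaluated in the calling environment. The sketch below illustrates this with a hypothetical stand-in function, `count_kept_draws()` (not part of brms), and shows how many draws remain after the warmup is discarded under the MCMC settings used in this diff. It assumes that exactly `iter - warmup` draws are kept per chain, which holds for `thin = 1`; other thinning factors may round differently in practice.

```r
# Hypothetical stand-in for brm(): it fits nothing and only counts retained draws.
# Assumption: each chain keeps (iter - warmup) / thin draws (exact for thin = 1).
count_kept_draws <- function(chains, iter, warmup, thin = 1) {
  chains * (iter - warmup) / thin
}

# MCMC settings as defined in the tutorial chunk
nchains <- 3           # number of chains
niter <- 2000          # number of iterations (incl. warmup)
warmup <- niter / 4    # number of initial samples to remove
thinning <- 1          # no thinning

# `warmup = warmup`: argument name on the left, the variable defined above on the right
kept <- count_kept_draws(chains = nchains, iter = niter,
                         warmup = warmup, thin = thinning)
print(kept)
```

With 3 chains, 2000 iterations and 500 warmup iterations per chain, this gives 4500 retained draws in total.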