Commit d2f32bb

add quarto vignettes
1 parent 598996e commit d2f32bb

4 files changed

Lines changed: 896 additions & 0 deletions

vignettes/common_analyses.qmd

Lines changed: 183 additions & 0 deletions
@@ -0,0 +1,183 @@
---
title: "Common Analyses with betydata"
vignette: >
  %\VignetteIndexEntry{Common Analyses with betydata}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

::: {.callout-note}
## What you will learn

- How to extract and summarize yield data for specific genera
- How to link management practices (fertilization, planting) to yield observations
- Patterns for site-level aggregation, author-based queries, and variable lookups
:::

## Setup

```{r}
library(betydata)
library(dplyr)
```

## Extracting Yield Data for a Genus {#sec-yields}

A common starting point is pulling yield observations for a particular genus and summarizing them. The `Ayield` trait represents above-ground annual yield in Mg/ha.

```{r}
miscanthus_yields <- traitsview |>
  filter(
    genus == "Miscanthus",
    trait == "Ayield"
  ) |>
  select(id, mean, date, sitename, scientificname)

miscanthus_yields

nrow(miscanthus_yields)
```

::: {.callout-tip}
## Tibble Printing

All tables are tibbles, which display the first 10 rows by default. With key columns ordered first (`trait`, `mean`, `units`, `scientificname`, `genus`), the default output is immediately informative without needing `head()` or column subsetting.
:::
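
When the default 10 rows are not enough, the tibble print method accepts `n` and `width` arguments. The sketch below uses a small invented tibble, `toy_tbl`, as a stand-in for any betydata table; its name and values are assumptions made for illustration.

```r
library(dplyr)

# Toy tibble with made-up values; stands in for any betydata table.
toy_tbl <- tibble::tibble(
  id   = 1:25,
  mean = seq(0.5, 12.5, by = 0.5)
)

print(toy_tbl, n = 15)       # show 15 rows instead of the default 10
print(toy_tbl, n = Inf)      # show every row
print(toy_tbl, width = Inf)  # show every column, however wide the table
glimpse(toy_tbl)             # transposed view: one line per column
```

The same calls should work on `traitsview`, though `n = Inf` on a large table will flood the console.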

## Working with Management Practices {#sec-management}

Management practices (planting dates, fertilization rates, harvest methods) are stored in the `managements` table and linked to experimental treatments through the `managements_treatments` junction table. Joining through that junction table on `treatment_id` attaches management details to the yield observations in `traitsview`. The join is declared `many-to-many` because a treatment may have several management records as well as several yield observations; each yield row is repeated once per linked management record.

```{r}
# Attach management details to each management-treatment link
mgmt_treat <- managements_treatments |>
  left_join(
    managements |> select(id, mgmttype, level, units, date),
    by = c("management_id" = "id")
  )

# Yield rows repeat once per management record linked to their treatment
grass_yields <- traitsview |>
  filter(
    genus %in% c("Miscanthus", "Panicum"),
    trait == "Ayield"
  ) |>
  left_join(mgmt_treat, by = "treatment_id", relationship = "many-to-many")

# Count yield-management pairs by genus and management type
grass_yields |>
  filter(!is.na(mgmttype)) |>
  count(genus, mgmttype, sort = TRUE)
```
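
The row repetition caused by this kind of join is easy to overlook, so here is a minimal sketch of it on toy data. The `toy_*` data frames and all their values are invented for illustration; only the column names mirror the tables used above.

```r
library(dplyr)

# Toy stand-ins for the package tables; every value is invented for illustration.
toy_managements <- data.frame(
  id       = c(1, 2, 3),
  mgmttype = c("planting", "fertilizer_N", "harvest")
)
toy_managements_treatments <- data.frame(
  management_id = c(1, 2, 3),
  treatment_id  = c(10, 10, 20)
)
toy_yields <- data.frame(
  id           = c(101, 102, 103),
  treatment_id = c(10, 20, 30),
  mean         = c(12.5, 8.0, 15.2)
)

toy_mgmt_treat <- toy_managements_treatments |>
  left_join(toy_managements, by = c("management_id" = "id"))

toy_joined <- toy_yields |>
  left_join(toy_mgmt_treat, by = "treatment_id", relationship = "many-to-many")

# Observation 101 appears twice because its treatment has two management records
toy_joined |>
  count(id, name = "n_rows") |>
  filter(n_rows > 1)

# Collapse back to one row per observation before averaging yields
toy_joined |>
  distinct(id, .keep_all = TRUE) |>
  summarise(mean_yield = mean(mean))
```

Three yield rows become four after the join. Any statistic computed on the joined table therefore weights observations by how many management records they have, unless the rows are collapsed first.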

## Nitrogen Fertilization Rates {#sec-nitrogen}

Joining nitrogen application rates to yield data makes it possible to explore how yield responds to nitrogen. Nitrogen management is recorded in the `mgmttype` column as either `fertilizer_N` or `fertilizer_N_rate`, with the rate in `level` and its unit in `units`. Because the join is many-to-many, a yield observation linked to several nitrogen records appears once per record, so the counts in the table are yield-by-rate pairs rather than unique observations.

```{r}
# One row per nitrogen record and treatment it applies to
nitrogen_rates <- managements |>
  filter(mgmttype %in% c("fertilizer_N", "fertilizer_N_rate")) |>
  left_join(
    managements_treatments |> select(management_id, treatment_id),
    by = c("id" = "management_id")
  ) |>
  select(treatment_id, nrate = level, units)

# traitsview has its own `units`, so the join suffixes both columns
# (`units.x` for the yield, `units.y` for the nitrogen rate)
yields_with_n <- traitsview |>
  filter(
    trait == "Ayield",
    genus %in% c("Miscanthus", "Panicum")
  ) |>
  left_join(nitrogen_rates, by = "treatment_id", relationship = "many-to-many")

yields_with_n |>
  filter(!is.na(nrate)) |>
  summarise(
    n = n(),
    mean_N = round(mean(nrate, na.rm = TRUE), 1),
    mean_yield = round(mean(mean, na.rm = TRUE), 1),
    .by = genus
  ) |>
  knitr::kable(col.names = c("Genus", "N obs", "Mean N rate", "Mean Yield (Mg/ha)"))
```
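
The table above reports group means only. One way to look at the relationship itself is a straight-line fit per genus, sketched below on invented numbers. `toy_yields_with_n` stands in for `yields_with_n` and its values are not real BETYdb records. The linear form is an assumption of the sketch, not a claim about how yield responds to nitrogen.

```r
library(dplyr)

# Invented values standing in for `yields_with_n`; not real BETYdb records.
toy_yields_with_n <- data.frame(
  genus = rep(c("Miscanthus", "Panicum"), each = 4),
  nrate = rep(c(0, 60, 120, 180), times = 2),
  mean  = c(10, 12, 14, 16, 5, 8, 11, 14)
)

# One straight-line fit per genus: change in yield per unit of nitrogen applied
toy_slopes <- toy_yields_with_n |>
  filter(!is.na(nrate), !is.na(mean)) |>
  summarise(
    n_pairs = n(),
    slope   = unname(coef(lm(mean ~ nrate))["nrate"]),
    .by = genus
  )

toy_slopes
```

On real data it would be prudent to check that all nitrogen rates share one unit before fitting, since `level` values recorded in different units are not comparable.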

## Site-Level Aggregation {#sec-sites}

Aggregating trait data by site is useful for spatial analysis and for mapping how densely each research location is sampled.

```{r}
#| label: tbl-site-summary
#| tbl-cap: "Top 15 research sites by number of records"
site_summary <- traitsview |>
  filter(!is.na(lat), !is.na(lon)) |>
  summarise(
    n_records = n(),
    n_traits = n_distinct(trait),
    n_species = n_distinct(species_id),
    .by = c(site_id, sitename, lat, lon)
  )

site_summary |>
  arrange(desc(n_records)) |>
  head(15) |>
  knitr::kable()
```

::: {.callout-tip}
## Geographic Data

Coordinates are stored in the `lat` and `lon` columns of both `traitsview` and the `sites` table; they are `NA` for sites without a recorded location. The `sites` table additionally contains `mat` (mean annual temperature) and `map` (mean annual precipitation) for sites where climate data is available.
:::
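
To combine a site-level summary with the climate columns, the summary can be joined to `sites`. The sketch below uses invented toy tables and assumes that `sites$id` is the key that `site_id` refers to; that key relationship is an assumption here, not something stated above.

```r
library(dplyr)

# Toy stand-ins with invented values; `toy_sites$id` is assumed to match `site_id`.
toy_site_summary <- data.frame(
  site_id   = c(1, 2, 3),
  n_records = c(120, 45, 8)
)
toy_sites <- data.frame(
  id  = c(1, 2, 3),
  mat = c(11.2, NA, 24.5),   # mean annual temperature
  map = c(1040, 650, NA)     # mean annual precipitation
)

# Keep only sites where both climate variables are present
toy_site_climate <- toy_site_summary |>
  left_join(toy_sites |> select(id, mat, map), by = c("site_id" = "id")) |>
  filter(!is.na(mat), !is.na(map))

toy_site_climate
```

Only the first toy site survives the filter, which shows how quickly requiring complete climate data can shrink a site list.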

## Finding Data by Author {#sec-author}

```{r}
lebauer_data <- traitsview |>
  filter(grepl("LeBauer", author, ignore.case = TRUE))

lebauer_data |>
  count(trait, author, citation_year, sort = TRUE)
```

## Most Data-Rich Citations {#sec-citations}

```{r}
#| label: tbl-citations
#| tbl-cap: "Top 10 citations by number of records"
traitsview |>
  count(citation_id, author, citation_year, sort = TRUE) |>
  head(10) |>
  knitr::kable()
```

## Variable and Trait Lookups {#sec-variables}

The `variables` table provides units, descriptions, and valid ranges for each measured trait. This is useful for understanding what a trait measures and checking whether observed values are within expected bounds.

```{r}
variables |>
  filter(name %in% c("SLA", "Vcmax", "leaf_respiration_rate_m2", "Ayield")) |>
  select(name, units, description, min, max) |>
  knitr::kable()
```
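
The bounds check mentioned above can be sketched as a join followed by a comparison. The toy tables and their bounds are invented for illustration. The sketch also assumes `min` and `max` are numeric; if the package stores them as text they would need converting first.

```r
library(dplyr)

# Invented bounds and observations; not the real contents of `variables`.
toy_variables <- data.frame(
  name = c("SLA", "Ayield"),
  min  = c(0, 0),
  max  = c(100, 80)
)
toy_traits <- data.frame(
  id    = 1:4,
  trait = c("SLA", "SLA", "Ayield", "Ayield"),
  mean  = c(22.5, 150, 12.0, -1)
)

# Attach each trait's bounds, then flag values that fall outside them
toy_flagged <- toy_traits |>
  left_join(toy_variables, by = c("trait" = "name")) |>
  mutate(in_range = mean >= min & mean <= max)

toy_flagged |>
  filter(!in_range)
```

Here observations 2 and 4 are flagged. A trait with no match in the bounds table would get `NA` for `in_range`, and `filter(!in_range)` silently drops those rows, so unmatched traits need a separate check.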

## Performance {#sec-performance}

Because every table is an in-memory R data frame, filters and joins run locally with no database connection or network round trip.

```{r}
system.time({
  result <- traitsview |>
    filter(
      genus %in% c("Miscanthus", "Panicum", "Populus"),
      trait %in% c("SLA", "Vcmax", "Ayield"),
      checked == 1
    ) |>
    summarise(
      n = n(),
      mean = mean(mean, na.rm = TRUE),
      .by = c(genus, trait)
    )
})
```

## References

- LeBauer, D. S., et al. (2018). BETYdb: a yield, trait, and ecosystem service database applied to second-generation bioenergy feedstock production. *GCB Bioenergy*. [doi:10.1111/gcbb.12420](https://doi.org/10.1111/gcbb.12420)

vignettes/getting_started.qmd

Lines changed: 193 additions & 0 deletions
@@ -0,0 +1,193 @@
---
title: "Getting Started with betydata"
vignette: >
  %\VignetteIndexEntry{Getting Started with betydata}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

::: {.callout-note}
## What you will learn

- What data is available in betydata and how the 16 tables relate to each other
- How to explore trait and yield observations using dplyr
- Key concepts: traits, yields, QA/QC flags, and Plant Functional Types
:::

## What is betydata?

The `betydata` package provides offline access to public data from [BETYdb](https://betydb.org), the Biofuel Ecophysiological Traits and Yields database. BETYdb is a centralized repository of plant trait measurements and crop yield data used in ecosystem modeling and agricultural research.

A **trait** is a measurable characteristic of a plant, for example Specific Leaf Area (SLA, m2/kg), maximum carboxylation rate (Vcmax, umol/m2/s), or leaf nitrogen content (%). A **yield** is a measure of crop production per unit area (typically Mg/ha). Together, traits and yields form the foundation of ecosystem model parameterization and agricultural research.

## Loading the Package

```{r}
library(betydata)
library(dplyr)
```

## Data Architecture {#sec-architecture}

The package contains **16 tables** organized in three tiers: the primary `traitsview` table, metadata tables, and relationship tables. The chunk below lists every table with its title:

```{r}
#| label: tbl-tables
#| tbl-cap: "All tables available in betydata"
data(package = "betydata")$results[, c("Item", "Title")] |>
  as.data.frame() |>
  knitr::kable()
```

::: {.callout-tip}
## Data Model

The tables follow a relational structure:

- **`traitsview`** is the primary denormalized table (pre-joined for convenience)
- **Metadata tables** (`species`, `sites`, `variables`, `citations`, etc.) provide reference data
- **Relationship tables** (`pfts_species`, `pfts_priors`, etc.) are many-to-many junction tables

You can use `traitsview` for most analyses without joining anything. The metadata and relationship tables are available when you need additional detail or custom aggregations.
:::
54+
## The Primary Table: traitsview {#sec-traitsview}
55+
56+
The `traitsview` table is a denormalized view combining traits and yields with associated metadata. Key analytical columns are placed first for convenient interactive use:
57+
58+
```{r}
59+
traitsview
60+
```
61+
62+
### Key Columns {#sec-columns}
63+
64+
| Column | Description | Example Values |
65+
|------------------|--------------------------------------------------|---------------------------|
66+
| `trait` | Variable name | SLA, Vcmax, Ayield |
67+
| `mean` | Observed value | 22.5, 38.1 |
68+
| `units` | Measurement units | m2/kg, umol/m2/s |
69+
| `scientificname` | Full species name | *Miscanthus x giganteus* |
70+
| `genus` | Genus | Miscanthus, Panicum |
71+
| `sitename` | Research site | Energy Farm, Urbana IL |
72+
| `author` | Citation author | Heaton 2008 |
73+
| `checked` | QA/QC status (0 = unchecked, 1 = verified) | 0, 1 |
74+
75+
## Basic Exploration
76+
77+
```{r}
78+
#| label: tbl-trait-counts
79+
#| tbl-cap: "Top 15 most common traits in betydata"
80+
traitsview |>
81+
count(trait, sort = TRUE) |>
82+
head(15) |>
83+
knitr::kable()
84+
```

## Data Quality: The `checked` Column {#sec-checked}

::: {.callout-important}
## Quality Control

The `checked` column indicates data verification status:

- **`1`** = Verified by an independent reviewer
- **`0`** = Not yet reviewed (use with appropriate caution)
- **`-1`** = Flagged as incorrect (**excluded** from this package)

All data in this package is public (BETYdb `access_level = 4`).
:::

```{r}
table(traitsview$checked, useNA = "ifany")

verified <- traitsview |>
  filter(checked == 1)
nrow(verified)
```
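
Before restricting an analysis to verified records, it can help to see how much data each trait would lose. A minimal sketch on invented values follows; `toy_checked_obs` is a hypothetical stand-in for `traitsview`.

```r
library(dplyr)

# Invented QA/QC flags; stands in for the `trait` and `checked` columns of traitsview.
toy_checked_obs <- data.frame(
  trait   = c("SLA", "SLA", "SLA", "Vcmax", "Vcmax"),
  checked = c(1, 0, 1, 0, 0)
)

# Share of records per trait that have been verified
toy_qc <- toy_checked_obs |>
  summarise(
    n_total       = n(),
    n_verified    = sum(checked == 1),
    prop_verified = mean(checked == 1),
    .by = trait
  )

toy_qc
```

In this toy case, filtering on `checked == 1` would remove every `Vcmax` record, which is the kind of side effect worth knowing about before applying the filter globally.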

## Support Tables {#sec-support}

### Species Taxonomy

The `species` table contains `r format(nrow(species), big.mark = ",")` entries with full taxonomic information:

```{r}
species |>
  select(id, scientificname, genus, commonname)
```

### Variables (Trait Definitions)

The `variables` table documents units, descriptions, and valid ranges for each measured trait:

```{r}
variables |>
  filter(name %in% c("SLA", "Vcmax", "leaf_respiration_rate_m2", "Ayield")) |>
  select(name, units, description)
```

### Sites

```{r}
sites_with_climate <- sites |>
  filter(!is.na(mat), !is.na(map))
nrow(sites_with_climate)
```

## Example: Bioenergy Crop Yields {#sec-bioenergy}

```{r}
#| label: tbl-bioenergy
#| tbl-cap: "Yield summary for key bioenergy genera"
bioenergy_genera <- c("Miscanthus", "Panicum", "Populus", "Salix", "Saccharum")

yields <- traitsview |>
  filter(
    trait == "Ayield",
    genus %in% bioenergy_genera,
    !is.na(mean)
  ) |>
  select(genus, mean, units, sitename, author, citation_year, lat, lon)

yields |>
  summarise(
    n = n(),
    mean_yield = round(mean(mean, na.rm = TRUE), 1),
    sd_yield = round(sd(mean, na.rm = TRUE), 1),
    .by = genus
  ) |>
  knitr::kable(col.names = c("Genus", "N", "Mean Yield (Mg/ha)", "SD"))
```
162+
## Working with Plant Functional Types (PFTs) {#sec-pfts}
163+
164+
::: {.callout-note}
165+
## What is a PFT?
166+
167+
A **Plant Functional Type** groups species with similar ecological characteristics for ecosystem modeling. Instead of parameterizing models for each species individually, PFTs like "temperate deciduous trees" or "C4 grasses" define shared parameter distributions. This approach is essential when species-level data is sparse and makes modeling tractable at large scales.
168+
:::
169+
170+
```{r}
171+
miscanthus_sp <- species |>
172+
filter(genus == "Miscanthus") |>
173+
pull(id)
174+
175+
pfts_species |>
176+
filter(specie_id %in% miscanthus_sp) |>
177+
left_join(pfts |> select(id, name), by = c("pft_id" = "id")) |>
178+
distinct(name)
179+
```
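
The same junction table can be used in the other direction, to pool trait observations by PFT. The sketch below runs on invented toy tables. The PFT names and all values are hypothetical; only the key columns (`pft_id`, `specie_id`, `species_id`) follow the names used in this vignette.

```r
library(dplyr)

# Toy tables with invented values; PFT names are hypothetical.
toy_pfts <- data.frame(
  id   = c(1, 2),
  name = c("toy.c4_grass", "toy.deciduous")
)
toy_pfts_species <- data.frame(
  pft_id    = c(1, 1, 2),
  specie_id = c(10, 11, 20)
)
toy_pft_obs <- data.frame(
  species_id = c(10, 10, 11, 20),
  trait      = "SLA",
  mean       = c(20, 24, 28, 15)
)

# Map each observation to its PFT, then summarise per PFT and trait
toy_pft_summary <- toy_pft_obs |>
  left_join(toy_pfts_species, by = c("species_id" = "specie_id")) |>
  left_join(toy_pfts, by = c("pft_id" = "id")) |>
  summarise(
    n          = n(),
    mean_value = mean(mean),
    .by = c(name, trait)
  )

toy_pft_summary
```

In the toy data each species belongs to exactly one PFT. If a species were listed under several PFTs, its observations would be repeated once per PFT after the first join, which is intended when summarising per PFT but would inflate any overall total.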

## Next Steps

| Vignette                       | Description                                   |
|--------------------------------|-----------------------------------------------|
| `vignette("common_analyses")`  | Common analysis patterns with dplyr           |
| `vignette("pfts-priors")`      | Working with PFTs and Bayesian priors         |
| `vignette("manuscript")`       | Reproduce analyses from LeBauer et al. (2018) |

## References

- LeBauer, D. S., et al. (2018). BETYdb: a yield, trait, and ecosystem service database applied to second-generation bioenergy feedstock production. *GCB Bioenergy*. [doi:10.1111/gcbb.12420](https://doi.org/10.1111/gcbb.12420)
- LeBauer, D. S., et al. (2013). Facilitating feedbacks between field measurements and ecosystem models. *Ecological Monographs*, 83(2), 133--154.
- BETYdb documentation: <https://betydb.org>
