feat: Add validation infrastructure and review skills (v3)#183
feat: Add validation infrastructure and review skills (v3)#183rafdoodle wants to merge 1 commit into
Conversation
yulric
left a comment
There was a problem hiding this comment.
When running the quarto preview ./ceps/cep-015-variable-tools/cep-015-variable-tools.qmd its giving an error in another CEP that needs to be fixed.
I only reviewed the one file for now but I'll add more reviews as I go through the others.
|
|
||
| ### The coverage problem | ||
|
|
||
| Not all variables are available for all cycles or database types. A researcher planning a 2001-2023 trend analysis needs to know that `age_start_smoking` has no PUMF source for 2022-2023, or that PUMF pack-years have \~15-20% relative error versus Master. This information exists in worksheet metadata but requires manual inspection. |
There was a problem hiding this comment.
This text PUMF pack-years have \~15-20% relative error versus Master is not a coverage problem but a limitation that users should be aware of when choosing that variable. Recommend moving it to a new section.
|
|
||
| - **`R/variable-discovery.R`** — metadata queries (`get_harmonized_variables()`, `get_source_mappings()`, `find_variable_in_data()`), recommended variable tags, subject/section filtering. Currently v1, smoking-focused. | ||
| - **`recommended` metadata tag** — `{recommended:primary}` / `{recommended:secondary}` in the `notes` field of `variables.csv`. Started for smoking but not applied broadly. | ||
| - **`R/table-generators.R`** — generates summary tables (cycle coverage, variable counts). Primarily for documentation, not user-facing. |
|
|
||
| ### In downstream projects | ||
|
|
||
| Three big-life-lab repositories have independently implemented variable selection and dependency resolution. These are the primary consumers that CEP-015 aims to consolidate. |
There was a problem hiding this comment.
I don't see how any of the resources mentioned here implement variable selection. If I understand correctly from this CEP, variable selection would mean tools that would help a researcher choose the variables that would help them answer a research questions. All the tools in this sections are used to help with the dependency problem mentioned above.
|
|
||
| ## Evaluation | ||
|
|
||
| Requirements were scored against seven criteria. The current project context: 381 variables, 3,698 variable_details rows, 225 DerivedVar entries, 17 recommended tags (smoking only), \~2,900 CRAN downloads/year, 7+ downstream repositories within big-life-lab (bllflow, cvd-trends-Canada, phiat-yll, raiflow, huipain, chmsflow, calibrationTutorial). |
There was a problem hiding this comment.
| Requirements were scored against seven criteria. The current project context: 381 variables, 3,698 variable_details rows, 225 DerivedVar entries, 17 recommended tags (smoking only), \~2,900 CRAN downloads/year, 7+ downstream repositories within big-life-lab (bllflow, cvd-trends-Canada, phiat-yll, raiflow, huipain, chmsflow, calibrationTutorial). | |
| Requirements were scored against seven criteria. The current project context: 381 variables, 3,698 variable details rows, 225 derived variables entries, 17 recommended tags (smoking only), \~2,900 CRAN downloads/year, 7+ downstream repositories within big-life-lab (bllflow, cvd-trends-Canada, phiat-yll, raiflow, huipain, chmsflow, calibrationTutorial). |
|
|
||
| Requirements were scored against seven criteria. The current project context: 381 variables, 3,698 variable_details rows, 225 DerivedVar entries, 17 recommended tags (smoking only), \~2,900 CRAN downloads/year, 7+ downstream repositories within big-life-lab (bllflow, cvd-trends-Canada, phiat-yll, raiflow, huipain, chmsflow, calibrationTutorial). | ||
|
|
||
| | Req | Value | Success | Complexity | Maintenance | User proximity | Incremental | Alternative today | |
There was a problem hiding this comment.
I would rename this column from Req (Requirement) to Feature since it lists things that we're considering building as opposed to a requirement which is something that has to be built. This change also has downstream effects that will need to addressed.
| ### In scope | ||
|
|
||
| 1. **Dependency resolver** — given target variable(s), resolve the full dependency graph | ||
| 2. **Coverage checker** — report variable availability by cycle and database type |
There was a problem hiding this comment.
There is not concept of a cycle within cchsflow, just database. How would you extract the cycle?
| 1. **Dependency resolver** — given target variable(s), resolve the full dependency graph | ||
| 2. **Coverage checker** — report variable availability by cycle and database type | ||
| 3. **Project worksheet generator** — produce minimal `variables.csv` and `variable_details.csv` for a set of target variables | ||
| 4. **Variable discovery** — search and filter harmonised variables by domain, type, coverage |
There was a problem hiding this comment.
By coverage do you mean database?
|
|
||
| - Ordered list of variables from leaves (no dependencies) to roots (target variables) | ||
| - For each variable: its direct feeders, whether it's a DerivedVar, and which `Func::` function implements it | ||
| - Cycle of circular dependencies or missing feeders |
There was a problem hiding this comment.
I assume these are errors that should be reported? Would also give examples.
| **Outputs:** | ||
|
|
||
| - Matrix: variable x database → available / not available / available with caveats | ||
| - Caveats: PUMF precision warnings, optional module flags, Master-only indicators |
| - Produces project-specific worksheets | ||
| - Explains what harmonisation transforms are applied | ||
|
|
||
| ## Architecture considerations |
There was a problem hiding this comment.
Would move the content within this section to the Requirements section since that's what it is. No reason to split them.
yulric
left a comment
There was a problem hiding this comment.
This just reviews the cchsflow-derived folder.
- All the Gold templates are missing Roxygen documentation. Is there a reason for that?
- The
foundations.mddocument mentions that the Bronze tier should have minimal Roxygen comments but none of the pattern documents do that.
| ## What it is | ||
|
|
||
| A function that converts categorical ranges into continuous values using | ||
| midpoint imputation. The input is a categorical variable (e.g., "1-2 years"), |
There was a problem hiding this comment.
"1-2 years" is an example of a category and not a categorical variable. Recommend rewording.
|
|
||
| ## When to use | ||
|
|
||
| - Input is categorical with ordered ranges |
There was a problem hiding this comment.
| - Input is categorical with ordered ranges | |
| - Input is ordinal with ranges for each category |
There was a problem hiding this comment.
Is there a reason for the three templates as opposed to just going with the Gold? I assume that's the best one?
| ```r | ||
| calculate_my_var_cont <- function(cat_var, continuous_companion = NULL, | ||
| output_format = "tagged_na") { | ||
| # Step 1: Clean inputs — always tagged_na for Step 2 |
There was a problem hiding this comment.
Can you expand on the comment 'always tagged_na for Step 2'?
|
|
||
| ## Reference implementations | ||
|
|
||
| - `calculate_SMK_06A_cont()` — R/smoking-cessation.R (quit timing midpoints) |
There was a problem hiding this comment.
I don't think this is in the branch. In any case, I would not reference other places in the repo unless we've set up something that gives an error if they've been removed.
| | Level | Name | Purpose | Example | | ||
| |-------|------|---------|---------| | ||
| | L1 | Foundational utility | Low-level missing data, cleaning, pattern detection | `any_missing()`, `clean_variables()`, `assign_missing()` | | ||
| | L2 | Midpoint mapping | Convert categorical ranges to continuous values via lookup table | `smkg_age_midpoint()` | |
There was a problem hiding this comment.
Isn't this all done in the worksheets rather than through a function.
| |-------|------|---------|---------| | ||
| | L1 | Foundational utility | Low-level missing data, cleaning, pattern detection | `any_missing()`, `clean_variables()`, `assign_missing()` | | ||
| | L2 | Midpoint mapping | Convert categorical ranges to continuous values via lookup table | `smkg_age_midpoint()` | | ||
| | L3 | Single-source pass-through | Wrap and clean a single input, worksheet handles routing | `calculate_age_start_smoking()` | |
There was a problem hiding this comment.
Isn't this all possible in the worksheet without the need for a custom function?
| | L1 | Foundational utility | Low-level missing data, cleaning, pattern detection | `any_missing()`, `clean_variables()`, `assign_missing()` | | ||
| | L2 | Midpoint mapping | Convert categorical ranges to continuous values via lookup table | `smkg_age_midpoint()` | | ||
| | L3 | Single-source pass-through | Wrap and clean a single input, worksheet handles routing | `calculate_age_start_smoking()` | | ||
| | L4 | Categorical-to-continuous conversion | Apply midpoint imputation with domain logic | `calculate_SMK_06A_cont()` | |
There was a problem hiding this comment.
The example mentioned is not in the codebase.
| | L1 | Foundational utility | Low-level missing data, cleaning, pattern detection | `any_missing()`, `clean_variables()`, `assign_missing()` | | ||
| | L2 | Midpoint mapping | Convert categorical ranges to continuous values via lookup table | `smkg_age_midpoint()` | | ||
| | L3 | Single-source pass-through | Wrap and clean a single input, worksheet handles routing | `calculate_age_start_smoking()` | | ||
| | L4 | Categorical-to-continuous conversion | Apply midpoint imputation with domain logic | `calculate_SMK_06A_cont()` | |
There was a problem hiding this comment.
Can you go over what you mean by domain logic? Because otherwise it's the same as L2?
There was a problem hiding this comment.
Can you explain the reason behind this file? As opposed to limiting the examples for each pattern to the definition for each one? I can see this file getting out of sync really fast.
yulric
left a comment
There was a problem hiding this comment.
This reviews the files in the .claude/skills/cchsflow-worksheets folder.
| Derived variable (`Func::`) blocks — **no `NA::b else` row ever**: | ||
| - Continuous output: exactly 3 rows — Func:: + NA::a + NA::b | ||
| - Continuous output: up to 3 rows — Func:: + NA::a + NA::b | ||
| - Categorical output: Func:: row, then N category rows (ascending recEnd), then NA::a + NA::b |
There was a problem hiding this comment.
The categorical Func:: bullet reads Func:: row, then N category rows (ascending recEnd), then NA::a + NA::b with no up to qualifier, implying both NA rows are mandatory. But the exception text on line 108 (NA::a / NA::b may be omitted if the function never returns the corresponding tagged_na) is written generically and applies to both continuous and categorical Func:: blocks. Line 105 (continuous) was updated to up to 3 rows, but line 106 was left unchanged — same asymmetry exists in csv-conventions.md:171-180. Suggest rewording line 106 to make the NA::a/NA::b rows optional (parallel to line 105), or making the exception paragraph explicitly cover both row types.
|
|
||
| **Union rule** — `variables.csv databaseStart` = union of all era block `databaseStart` values; `variables.csv variableStart` = union of all explicit tokens and `[VAR]`/`DerivedVar::` patterns across all era blocks. | ||
|
|
||
| **Derived variable variableStart format (§9)** — when every era block in `variable_details.csv` uses multi-variable inputs (`DerivedVar::[X, Y, …]` Func:: rows OR `cycle::[X, Y, …]` per-cycle tokens) with no `[VAR]` defaults and no singular `cycle::SOURCE` mappings, `variables.csv variableStart` should collapse to a single `DerivedVar::[union of all feeders]` token. Otherwise keep cycle-specific tokens. |
There was a problem hiding this comment.
The §9 reference is ambiguous because SKILL.md links into multiple docs (csv-conventions.md, pumf-master-harmonization.md, etc.) and §9 alone doesn't say which file's section 9 is meant. Suggest converting it to a markdown link pointing at docs/csv-conventions.md §9 (Derived variable inputs in variables.csv variableStart).
| Full canonical order: | ||
| ``` | ||
| cchs2001_p, cchs2003_p, cchs2005_p, cchs2007_2008_p, | ||
| cchs2009_2010_p, cchs2010_p, | ||
| cchs2011_2012_p, cchs2012_p, | ||
| cchs2013_2014_p, cchs2014_p, | ||
| cchs2015_2016_p, cchs2017_2018_p, cchs2019_2020_p, cchs2022_p, cchs2023_p, | ||
| cchs2001_m, cchs2003_m, cchs2005_m, cchs2007_2008_m, | ||
| cchs2009_2010_m, cchs2009_m, cchs2010_m, | ||
| cchs2011_2012_m, cchs2012_m, | ||
| cchs2013_2014_m, cchs2014_m, | ||
| cchs2015_2016_m, cchs2017_2018_m, cchs2019_2020_m, cchs2021_m, cchs2022_m, cchs2023_m | ||
| ``` |
There was a problem hiding this comment.
Keeping a canonical list is useful but would require work to maintain everytime the list changes. I also think it can be built by the LLMs by following the rules outlined in this section and with the full list of supported databases. Keeping the reasoning in mind, I would instead clearly outline the three rules and provide a minimal example that showcases them.
- The list of supported databases should live at
inst/metadata/documentation/database_metadata.yamlreferenced from here. This would fix a dangling reference ininst/metadata/documentation/metadata_registry.yaml(line 168) that declares thatdatabase_metadata.yamlexists at that path, but the file itself is missing. As well, I can see the metadata being useful for other purposes like programmatically checking if the sheets are using the right databases. - To be clear, the three rules are,
- PUMF databases (_p suffix) come before the Master databases (_m) and
both should be listed in chronological order; - Child databases come after parent databases; and
- cchs2021_p is not a valid database name
- PUMF databases (_p suffix) come before the Master databases (_m) and
| | Parent cycle | Child cycle(s) | | ||
| |---|---| | ||
| | `cchs2009_2010_p` | `cchs2010_p` | | ||
| | `cchs2011_2012_p` | `cchs2012_p` | | ||
| | `cchs2013_2014_p` | `cchs2014_p` | | ||
| | `cchs2009_2010_m` | `cchs2009_m`, `cchs2010_m` | | ||
| | `cchs2011_2012_m` | `cchs2012_m` | | ||
| | `cchs2013_2014_m` | `cchs2014_m` | | ||
|
|
||
| ### Rules | ||
|
|
||
| | Condition | Action | | ||
| |---|---| | ||
| | `cchs2009_2010_m` in databaseStart | Also add `cchs2009_m` and `cchs2010_m` | | ||
| | `cchs2011_2012_m` in databaseStart | Also add `cchs2012_m` | | ||
| | `cchs2013_2014_m` in databaseStart | Also add `cchs2014_m` | | ||
| | `cchs2009_2010_p` in databaseStart | Also add `cchs2010_p` | | ||
| | `cchs2011_2012_p` in databaseStart | Also add `cchs2012_p` | | ||
| | `cchs2013_2014_p` in databaseStart | Also add `cchs2014_p` | | ||
| | Child present **without** parent | Remove the child | |
There was a problem hiding this comment.
Same issue as the canonical cycle list. The parent → child mapping table (lines 41-48) and the per-case rule table (lines 50-60) both need to be updated every time a new parent-child pair is added.
- Move the parent → child mapping itself to
inst/metadata/documentation/database_metadata.yamlalongside the canonical cycle list, so it becomes a single source of truth that R code and validation tooling can also consume. - In this doc, replace both tables with the two underlying rules plus a
minimal example:- If a parent cycle is in
databaseStart, also include all of its
children (as declared indatabase_metadata.yaml). - If a child appears without its parent, remove it.
# Parent present → add children databaseStart: cchs2009_2010_m → add cchs2009_m, cchs2010_m # Child present without parent → remove child databaseStart: cchs2012_m (no cchs2011_2012_m) → remove cchs2012_m - If a parent cycle is in
- Keep the conceptual sentence on line 39 (children are sub-releases sharing
source variables with the parent) and the variableStart child-inheritance example on lines 62-72 — those are authoring behaviors, not reference data.
| - `variables.csv databaseStart` = union of all `variable_details.csv databaseStart` values for that variable | ||
| - `variables.csv variableStart` = union of all explicit `cycle::SOURCE` tokens and `[VAR]` / `DerivedVar::` patterns across all era blocks | ||
|
|
||
| No cycle should appear in `variables.csv databaseStart` without a corresponding era block in `variable_details.csv` (exception: cycles explicitly deferred as "UNMATCHED" pending new era blocks). |
There was a problem hiding this comment.
The parenthetical (exception: cycles explicitly deferred as "UNMATCHED" pending new era blocks) references a convention that isn't defined anywhere. "UNMATCHED" does not appear in any other skill doc, in the worksheets (variables.csv / variable_details.csv), or in the R package. Without a concrete definition the exception clause can't be applied or validated. Either:
- Remove the clause if no such convention exists, simplifying the rule to
"every cycle indatabaseStartmust have a matching era block". - Or define it explicitly — what does "UNMATCHED" look like in a worksheet
(a comment? a placeholder era block? a specificrecEndvalue?), how should validation tooling treat it, and how does a cycle transition out of the UNMATCHED state once an era block is authored?
| calculate_SMKG040_cont <- function(SMKG203_cont, SMKG207_cont, ...) | ||
|
|
||
| # Domain logic with well-known source names | ||
| calculate_cigs_per_day <- function(SMKDSTY_A, SMK_204, SMK_208, ...) |
There was a problem hiding this comment.
The example function signature on this line:
calculate_cigs_per_day <- function(SMKDSTY_A, SMK_204, SMK_208, ...)uses SMKDSTY_A as an acceptable "well-known source name" — but the same name appears in the "Avoid (source-specific)" column of the parameter naming table on line 102, where smoking_status is the recommended replacement. The same name is in both the "use" and "avoid" lists.
On top of the contradiction, SMKDSTY_A isn't a real variable (see the entire-file comment above). The actual calculate_cigs_per_day signature uses SMKDSTY_original:
calculate_cigs_per_day <- function(SMKDSTY_original, ...) # R/smoke-intensity.R:121Fix both issues by updating the example to match reality and reconcile the table — e.g., have the line 102 "Avoid" column show a truly avoidable example (a raw CCHS source name like SMKADSTY from 2001), and keep SMKDSTY_original here as an example of an acceptable well-known harmonized source name.
| | Pattern | Example function | File | | ||
| |---------|-----------------|------| | ||
| | Pass-through | `calculate_age_start_smoking()` | `R/smoke-start.R` | | ||
| | Domain routing | `calculate_cigs_per_day()` | `R/smoke-intensity.R` | | ||
| | Multi-input calculation | `calculate_pack_years()` | `R/smoke-pack-years.R` | | ||
| | Categorical binning | `calculate_pack_years_categorical()` | `R/smoke-pack-years.R` | | ||
| | Source combining | `calculate_time_quit_smoking()` | `R/smoking-cessation.R` | | ||
| | Doc stub | `calculate_SMK_204()` | `R/smoke-intensity.R` | |
There was a problem hiding this comment.
The "Pattern" column uses category names that don't match the cchsflow-derive/ docs/patterns/ filenames, so a reader who wants to drill into a pattern can't trivially find its full documentation.
| This doc's name | cchsflow-derive/docs/patterns/ filename |
|---|---|
| Pass-through | pass-through.md ✓ |
| Domain routing | multi-source-routing.md |
| Multi-input calculation | formula-calculation.md |
| Categorical binning | category-grouping.md or cat-to-continuous.md (ambiguous) |
| Source combining | (no direct equivalent in cchsflow-derive) |
| Doc stub | (not in cchsflow-derive) |
Two skills using different vocabularies for the same patterns is the kind of drift the line 41 consolidation comment is meant to prevent. If this table stays in cchsflow-worksheets rather than moving to cchsflow-derive, at least align the pattern names with the cchsflow-derive filenames and link each row to the corresponding pattern doc.
| | Domain routing | `calculate_cigs_per_day()` | `R/smoke-intensity.R` | | ||
| | Multi-input calculation | `calculate_pack_years()` | `R/smoke-pack-years.R` | | ||
| | Categorical binning | `calculate_pack_years_categorical()` | `R/smoke-pack-years.R` | | ||
| | Source combining | `calculate_time_quit_smoking()` | `R/smoking-cessation.R` | |
There was a problem hiding this comment.
The row reads:
| Source combining | calculate_time_quit_smoking() | R/smoking-cessation.R |
but calculate_time_quit_smoking() actually lives in R/smoking.R:851, not R/smoking-cessation.R. Worse, that function is flagged as legacy / v2 to be deprecated in cchsflow-derive/docs/function-inventory.md:49-60, where it's listed with Bronze tier and modern replacements (calculate_time_quit_smoking_complete/daily — both in R/smoking-cessation.R).
So the table points to a deprecated function while naming the new location. Suggest updating to the modern equivalent so readers don't copy a legacy pattern as a "reference implementation":
| Source combining | calculate_time_quit_smoking_complete() | R/smoking-cessation.R |
| } | ||
| ``` | ||
|
|
||
| The helper (`R/utility-functions.R`) handles NULL, empty input, and `clean_variables()` in one call. Use this when: |
There was a problem hiding this comment.
derive_passthrough() is defined in R/clean-variables.R (line 493), not R/utility-functions.R.
| - **Step 1** looks up *input* variable patterns — e.g., `SMKDSTY_A` has valid range 1-6 with codes 7/8/9 as missing | ||
| - **Step 3** looks up *output* variable patterns — e.g., `pack_years_der` has valid range 0-165 | ||
|
|
||
| A function that skips Step 3 is a bug, not a simplification. The only exception is `derive_passthrough()`, where input and output are the same variable. |
There was a problem hiding this comment.
The current wording ("The only exception is derive_passthrough(), where input and output are the same variable") describes the condition but not why it matters. It also reads as if pass-through functions skip Step 1 / Step 3, which contradicts the immediately preceding sentence ("A function that skips Step 3 is a bug, not a simplification.").
Clarify that the helper encapsulates Step 1 and Step 3 rather than skipping them — the work still happens, it's just inside the helper instead of in the caller's function body. Suggested rewrite:
A function that skips Step 3 is a bug, not a simplification. The
only exception is derived pass-through variables that use the
derive_passthrough()helper — in that case Step 1 and Step 3 are
still happening, just encapsulated inside the helper rather than
written explicitly in the function body. Since input and output are
the same variable, a singleclean_variables()lookup is enough.
Hello @yulric,
I have brought all the relevant skills and validation files from PR #181 for your review, so there should not be anymore merge conflicts.
Here is Doug's text from the original PR:
Summary
check_worksheet(),fix_worksheet(),scope_worksheets(), YAML schemas, and CLI scripts (exec/check-worksheets.R,exec/fix-worksheets.R)Test plan
Rscript exec/check-worksheets.Rruns without errorsRscript exec/fix-worksheets.Rruns without errorsdevtools::load_all()loads successfullyRscript exec/check-worksheets.R --variables "SDCGCGT"Please let me know what you think or if you need anything else.
Sincerely,
Rafidul