
Linkml conversion tooling #387

Draft

yarikoptic wants to merge 57 commits into master from linkml-conversion

Conversation

@yarikoptic (Member) commented Mar 20, 2026

This is an extract, with amendments, from

whose branch linkml-auto-converted would keep merging this branch into itself, reflecting changes in this branch (which could be rebased or gain merges from master), and which can also accumulate or drop "patch branches" from within its script that defines what to patch with.

This way linkml-auto-converted would represent the current state of the conversion.

TODO/PLAN

  • Establish branch linkml-auto-converted -- that one is in WiP: Branch with auto converted linkml model #381
  • Made a `hatch` script (you could add pydantic2linkml as a dependency there) to convert the original models.py into dandischema/models.yaml: hatch ... TODO
  • Translated the original models.py into dandischema/models.yaml and overlaid it with the [dandischema/models_overlay.yaml] overlay file.
  • Script tools/linkml_conversion to convert into `linkml-auto-converted`.
  • Define model_instances.yaml (or alike) which would define pre-populated records such as standards (bids, nwb, ...); aim for potentially multiple classes there.
  • Add a GitHub workflow here which would react to changes in master and this branch, and to manual dispatch: it would first merge master into this branch, then run the script, and push the results to the linkml-auto-converted branch. This way we would always have an up-to-date, automatically updated state of that branch.
  • Address "notes" about failed conversions one way (changing the current dandi-schema pydantic model) or another (pydantic2linkml), or:
    • we can add a custom script to "enhance" the auto-generated linkml model to address any changes needed programmatically!
    • we can have a branch (or just a .patch file) with changes to perform on top of the converted linkml
  • ...
  • Then produce a Pydantic model out of this patched model, sufficient (although potentially more relaxed) to replace the current Pydantic model; a minimal sketch of that step follows this list.
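
A minimal sketch of that regeneration step, assuming linkml's PydanticGenerator API (paths illustrative; in this PR the step is wrapped in a hatch script rather than called directly):

    # Regenerate the Pydantic models from the patched LinkML schema.
    from linkml.generators.pydanticgen import PydanticGenerator

    code = PydanticGenerator("dandischema/models.yaml").serialize()
    with open("dandischema/models.py", "w") as f:
        f.write(code)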

candleindark and others added 17 commits March 13, 2026 17:01
Specify Hatch-managed env for auto-converting `dandischema.models` to a LinkML schema and back to Pydantic models.
Provide script to translate `dandischema.models` into a LinkML schema and overlay it with definitions provided by an overlay file.
Provide script to translate `dandischema/models.yaml` back to Pydantic models and store them in `dandischema/models.py`.
The previous BRE pattern used `\+` (a GNU sed extension) which silently fails on macOS BSD sed. Switch to `-E` (extended regex) with the POSIX character class `[^[:space:]]` instead of `\S` (also unsupported by BSD sed), making the normalization work on both macOS and Linux.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Expand comment for linkml-auto-converted hatch env with usage instructions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There is no prefix defined as `dandi_default`; the intended default prefix is `dandi`.
…ed and some symbols from _orig for now

We do it so it does not overlay models.py, since git would then be unable to track renames; we had to maintain the original filename for models.py to apply patches easily.
Comment thread tools/linkml_conversion Outdated
# Poor man's patch queue implementation
# Edit this list if you want to merge or drop PR branches to be patched with.
# Order matters
branches_to_merge=( remove-discriminated-unions )
Member Author

That is where we define branches from PRs to merge!

@codecov bot commented Mar 20, 2026

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.83%. Comparing base (4b89e4f) to head (07e7d47).
⚠️ Report is 6 commits behind head on master.

Files with missing lines            Patch %   Lines
dandischema/models_importstab.py    0.00%     2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #387      +/-   ##
==========================================
- Coverage   97.92%   97.83%   -0.09%     
==========================================
  Files          18       19       +1     
  Lines        2405     2407       +2     
==========================================
  Hits         2355     2355              
- Misses         50       52       +2     
Flag        Coverage Δ
unittests   97.83% <0.00%> (-0.09%) ⬇️


@candleindark force-pushed the linkml-conversion branch 2 times, most recently from 59b0587 to c0fbd02 on March 31, 2026 00:59
…nator

`dandischema.models` uses `schemaKey` in each Pydantic model as a de facto type designator in LinkML. However, direct translation to LinkML based on an individual model's definition is not possible. The override provided in the merge file completes the translation.
Comment thread pyproject.toml Outdated
candleindark and others added 29 commits April 6, 2026 18:48
This note is specified in the `models_merge.yaml` file to be added to the resulting LinkML translation of `dandischema.models`.
…tools

Move `remove_impossible_slot_usage_notes.py` and `sanitize-yaml` into
`tools/linkml_conversion_tools/` alongside the other linkml conversion
helpers, and update the `2linkml` script in `pyproject.toml` accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tterns

Rename remove_impossible_slot_usage_notes.py to remove_notes_by_pattern.py
and switch from a single substring match to a list of regex patterns
matched via re.search, so additional note families can be stripped in
the same pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
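
A hypothetical sketch of such a pattern-based pass (pattern list and YAML layout illustrative, not the script's actual contents):

    import re
    import yaml

    # Illustrative patterns; the real list lives in remove_notes_by_pattern.py.
    PATTERNS = [re.compile(p) for p in [
        r"Unable to translate",
        r"discriminat",
    ]]

    with open("dandischema/models.yaml") as f:
        schema = yaml.safe_load(f)

    # Drop any slot_usage note matching one of the patterns, in a single pass.
    for cls in schema.get("classes", {}).values():
        for slot in (cls.get("slot_usage") or {}).values():
            notes = slot.get("notes")
            if isinstance(notes, list):
                kept = [n for n in notes
                        if not any(p.search(str(n)) for p in PATTERNS)]
                if kept:
                    slot["notes"] = kept
                else:
                    slot.pop("notes")

    with open("dandischema/models.yaml", "w") as f:
        yaml.safe_dump(schema, f, sort_keys=False)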
…rings

Translating the max-length constraint on strings by encoding the constraint in the string's pattern is the best expression available in LinkML; there is no direct expression in LinkML for a max-length constraint on a string range.
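
For illustration (the limit and pattern below are invented, not taken from the schema), a max_length constraint and its pattern-encoded LinkML counterpart accept the same strings:

    import re

    value = "x" * 120
    assert len(value) <= 150                      # Pydantic-style max_length check
    assert re.fullmatch(r"[\s\S]{0,150}", value)  # the pattern-encoded equivalent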
… definitions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Ensures `enums.LicenseType.permissible_values` entries are sorted by key
for stable, readable diffs in the auto-converted LinkML YAML output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
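
A sketch of the idea, assuming the enum sits at the obvious place in the YAML (file layout illustrative):

    import yaml

    with open("dandischema/models.yaml") as f:
        schema = yaml.safe_load(f)

    # Re-insert LicenseType's permissible_values in key order so repeated
    # conversions produce stable, diff-friendly output.
    pvs = schema["enums"]["LicenseType"]["permissible_values"]
    schema["enums"]["LicenseType"]["permissible_values"] = dict(sorted(pvs.items()))

    with open("dandischema/models.yaml", "w") as f:
        yaml.safe_dump(schema, f, sort_keys=False)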
ATM needed by the genschemata helper; potentially we could just patch there.
Lock to the versions currently resolved via pydantic2linkml to prevent
unintended changes when new linkml versions are released.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
These three Typer scripts under .claude/skills/dandi-linkml-validation-report/scripts/
make up a reproducible pipeline for validating dandiset metadata
against dandischema/models.yaml using the linkml-validate CLI:

- fetch_metadata.py: download raw metadata for the draft and every
  published version of every dandiset on a DANDI instance, plus a small
  info.json per version capturing schemaVersion / status / modified.
- validate_metadata.py: shell out to linkml-validate per version with
  the right target class (Dandiset for drafts, PublishedDandiset
  otherwise), writing validation.txt, validation.json and SUMMARY.md
  alongside each metadata.json.
- generate_report.py: aggregate the per-version JSON records into a
  top-level REPORT.md grouped by class x schemaVersion, with
  most-common problem patterns and links to per-version summaries.

The dandi client is added as a dependency of the linkml-auto-converted
hatch env so the fetch script can run inside it.
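
Roughly, the fetch stage amounts to something like this sketch (assuming the dandi client's DandiAPIClient/RemoteDandiset API; field names and on-disk layout are illustrative, not the script's exact output):

    import json
    from pathlib import Path

    from dandi.dandiapi import DandiAPIClient

    root = Path("linkml-validation-reports/data")
    with DandiAPIClient.for_dandi_instance("dandi") as client:
        for dandiset in client.get_dandisets():
            for version in dandiset.get_versions():
                vdir = root / dandiset.identifier / version.identifier
                vdir.mkdir(parents=True, exist_ok=True)
                meta = dandiset.for_version(version).get_raw_metadata()
                (vdir / "metadata.json").write_text(json.dumps(meta, indent=2))
                info = {
                    "schemaVersion": meta.get("schemaVersion"),
                    "status": version.status.value,
                    "modified": version.modified.isoformat(),
                }
                (vdir / "info.json").write_text(json.dumps(info, indent=2))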
Previously a partial run could leave a corrupted metadata.json or
info.json behind, and a subsequent run with the default refresh=False
would happily skip the version because both files looked present.

Now the function does all network work first, pre-renders both JSON
payloads, and then writes them via temp files plus os.replace() so the
destination paths only appear once both have been written successfully
-- even under SIGKILL or other abrupt termination.

Drive-by simplifications addressed inline review notes:
- drop the parent.mkdir() side effect from the JSON writer; the caller
  now creates version_dir explicitly,
- drop json.dumps(default=str) since we already isoformat() datetimes
  before stashing them into info,
- use VersionStatus.value directly and skip the modified-is-None branch
  (per the dandi client's Version model both fields are non-optional),
- use PEP 604 `int | None` for the --limit option type.
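
A minimal sketch of that write discipline (helper name invented):

    import json
    import os
    from pathlib import Path

    def write_json_atomic(path: Path, payload: dict) -> None:
        text = json.dumps(payload, indent=2)   # render first; nothing on disk yet
        tmp = path.with_name(path.name + ".tmp")
        tmp.write_text(text)
        os.replace(tmp, path)                  # atomic rename: all-or-nothing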
Renames _normalise_problem -> _normalize_problem (and updates the
caller), and replaces 'organised' with 'organized' in a comment.
`validate_metadata.py` now drives validation through the LinkML
`Validator` Python API (configured with the same
`JsonschemaValidationPlugin(closed=True)` the CLI uses by default)
instead of shelling out. One run produces both a structured
`validation.json` (carrying the `ValidationResult` fields plus
`source.validator` / `validator_value` for grouping) and a
`validation.txt` transcript byte-equivalent to what `linkml-validate`
would have printed.

`@context` is stripped from the data instance before validation so
the JSON-schema check doesn't flag the JSON-LD framing key as an
unexpected property — see linkml/linkml#3442.

`generate_report.py` is updated in lockstep: problem patterns are now
grouped by `[severity] <validator> message`, sourced from the
structured record rather than regex-scrubbed CLI text.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
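
A sketch of the API-driven check described above, assuming the linkml `Validator` / `JsonschemaValidationPlugin` interfaces named in the message (target class and paths illustrative):

    import json

    from linkml.validator import Validator
    from linkml.validator.plugins import JsonschemaValidationPlugin

    validator = Validator(
        "dandischema/models.yaml",
        validation_plugins=[JsonschemaValidationPlugin(closed=True)],
    )

    with open("metadata.json") as f:
        instance = json.load(f)
    # The JSON-LD framing key would be flagged as an unexpected property;
    # see linkml/linkml#3442.
    instance.pop("@context", None)

    report = validator.validate(instance, target_class="Dandiset")
    for result in report.results:
        print(f"[{result.severity.value}] {result.message}")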
The column carries the per-version `status` field returned by the
DANDI Archive API (e.g. `VALID`, `Published`), not anything the LinkML
validator computed. Labelling it "API Status" makes that distinction
explicit so a reader doesn't mistake it for a validation outcome.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Authors the skill following the agentskills.io specification:

- SKILL.md carries the required `name` / `description` frontmatter
  plus a `compatibility` note for the hatch env, and a tight
  workflow body (when to use, prerequisites, three-stage pipeline
  with exact `hatch run` commands).
- references/OUTPUT.md documents the on-disk layout and JSON field
  shapes the pipeline emits, so callers don't have to read the
  scripts to understand the output.
- references/DESIGN.md captures the rationale behind non-obvious
  choices (LinkML Python API instead of CLI, closed=True plugin,
  `@context` strip per linkml/linkml#3442, byte-equivalent CLI
  transcript, all-or-nothing fetch writes, resume semantics).

Split this way to take advantage of progressive disclosure: SKILL.md
stays under ~100 lines for the activation tier, with the deeper
material loaded only when the agent actually needs it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds an `allowed-tools` declaration to SKILL.md so the three hatch-run
pipeline commands and the supporting `git rev-parse`/`git show`/`git
restore` calls don't prompt for permission while the skill is active.
The patterns are scoped (`Bash(git:*) Bash(hatch:*)`), so unrelated
shell commands still go through the normal approval flow.

`allowed-tools` is marked experimental in the agentskills.io spec, so
behavior may vary between agent implementations.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Naming the aggregated report `README.md` means GitHub renders it
automatically as the landing view when the output directory is
presented as a repo, so a reader lands on the validation report
without having to click into a file. The on-disk layout is otherwise
unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`references/DESIGN.md` and `references/OUTPUT.md` largely duplicated
content already present in the three scripts' module docstrings and
inline comments — the maintenance cost of two parallel sources
outweighed the progressive-disclosure benefit at this scale. Anyone
needing the on-disk layout or design rationale can read the relevant
script directly.

SKILL.md's "Further reading" section is replaced with a one-line
pointer to the script docstrings.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`validate_metadata.py` now runs ``dandischema.metadata.migrate`` on
each version's raw metadata before validating. The migrated instance
is persisted as ``metadata_migrated.json`` (the verbatim
``metadata.json`` is left untouched) and is what the LinkML validator
sees. Versions whose migration fails are recorded with the error and
skipped for validation — the validator never sees something the
migrator couldn't handle.

`validation.json` gains ``migration_status`` (``"success"`` /
``"failed"``) and ``migration_error`` fields. On migration failure
``problems`` is empty and ``exit_code`` is null. The CLI-equivalent
transcript is replaced by a one-line ``Migration failed: …`` notice
and ``SUMMARY.md`` calls the failure out instead of rendering a
problems block.

`generate_report.py` distinguishes migration failures in both the
overall headline (``N valid / M migration-failed / P with problems``)
and per-bucket sections (extra ``Migration failed:`` count when
non-zero) and renders an ``[migration failed]`` cell in the
per-version table linking to the version's ``SUMMARY.md`` for the
failure detail. Migration-failed versions are excluded from problem
pattern grouping since validation never ran for them.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
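
The gate itself is small; a sketch using `dandischema.metadata.migrate` as named above (record fields mirror the ones described, the rest is illustrative):

    import json
    from pathlib import Path

    from dandischema.metadata import migrate

    raw = json.loads(Path("metadata.json").read_text())
    try:
        migrated = migrate(raw)
    except Exception as e:
        # The validator never sees what the migrator couldn't handle.
        record = {"migration_status": "failed", "migration_error": str(e),
                  "problems": [], "exit_code": None}
    else:
        Path("metadata_migrated.json").write_text(json.dumps(migrated, indent=2))
        record = {"migration_status": "success", "migration_error": None}
        # ... run the LinkML validator on `migrated`, not on `raw` ...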
The output tree no longer namespaces under a `<short-sha>/` directory.
All artifacts live under one flat root:

    linkml-validation-reports/
    ├── README.md
    └── data/<dandiset>/<version>/{metadata.json, info.json,
                                   metadata_migrated.json,
                                   validation.json, validation.txt,
                                   SUMMARY.md}

Raw metadata is schema-independent and only fetched once; subsequent
runs against a different schema reuse it.

`validate_metadata.py`'s resume guard is now schema-aware. Each
`validation.json` is stamped with the SHA-256 of the schema file's
bytes (`schema_sha256` field). On a re-run the guard skips a version
only when its stamp matches the current schema, so a schema-content
change — committed *or* uncommitted — re-runs migration and
validation automatically without `--refresh`. `--refresh` is now
documented as a forceful override only.

Per-version logging moved into `_validate_one`. The function used to
return a `(target_class, migration_status, n_problems)` tuple
consumed *only* by the orchestrator's per-version log line. With the
log emitted at the decision site (resumed / migration failed /
migrated and validated), the return value carried no information and
has been dropped.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
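
A sketch of the schema-aware guard (function names invented):

    import hashlib
    import json
    from pathlib import Path

    def schema_digest(schema_path: Path) -> str:
        return hashlib.sha256(schema_path.read_bytes()).hexdigest()

    def should_skip(validation_json: Path, schema_path: Path, refresh: bool) -> bool:
        if refresh or not validation_json.exists():
            return False
        stamp = json.loads(validation_json.read_text()).get("schema_sha256")
        return stamp == schema_digest(schema_path)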
- patch -p1 --forward: skip hunks that are already applied (default
  `Assume -R?` would silently revert them since stdin is the diff).
- After the patch loop, inspect captured output: abort on real
  conflicts ("failed" / "FAILED"), tolerate non-zero exit only when a
  skip indicator ("previously applied" / "reversed" / "skipping patch")
  is present.
- Delete .rej files left by --forward.
- Replace `git commit -a` with `git add -A; git commit` so newly
  introduced files from patched branches are included in the merge
  commit.
Comment thread tools/linkml_conversion
Comment on lines +64 to +73
if [ "$status" -ne 0 ]; then
if grep -qi 'failed' <<<"$out"; then
echo "patch FAILED for branch $b — see rejects" >&2
exit 1
fi
if ! grep -qiE 'previously applied|reversed|skipping patch' <<<"$out"; then
echo "patch exited $status for branch $b without a recognized skip indicator; aborting" >&2
exit 1
fi
fi
Member Author

Suggested change

if [ "$status" -ne 0 ]; then
    if grep -qi 'failed' <<<"$out"; then
        echo "patch FAILED for branch $b — see rejects" >&2
        exit 1
    fi
    if ! grep -qiE 'previously applied|reversed|skipping patch' <<<"$out"; then
        echo "patch exited $status for branch $b without a recognized skip indicator; aborting" >&2
        exit 1
    fi
fi

if [ "$status" -ne 0 ]; then
    if grep -qiE 'previously applied|reversed|skipping patch' <<<"$out"; then
        echo "patch exited $status for branch $b but it was about an already-applied patch; ignoring"
    else
        echo "patch FAILED for branch $b with exit $status — see rejects" >&2
        exit $status
    fi
fi
