
Linkml conversion tooling #387

Draft

yarikoptic wants to merge 57 commits into master from linkml-conversion

Conversation

@yarikoptic (Member) commented Mar 20, 2026

This is an extract, with amendments, from

whose branch linkml-auto-converted would keep merging this branch into itself, reflecting changes in this branch (which could be rebased or gain merges from master), and which can also accumulate or drop "patch branches" from within its script that defines what to patch with.

This way linkml-auto-converted would represent the current state of the conversion.

TODO/PLAN

  • Establish branch linkml-auto-converted -- that one is in WiP: Branch with auto converted linkml model #381
  • Made a `hatch` script (you could add pydantic2linkml as a dependency there) to convert the original models.py into dandischema/models.yaml: hatch ... TODO
  • Translated the original models.py into dandischema/models.yaml and overlaid it with the [dandischema/models_overlay.yaml] overlay file.
  • Script tools/linkml_conversion to convert into `linkml-auto-converted`.
  • Define model_instances.yaml (or alike) which would define pre-populated records such as standards (bids, nwb, ...); aim for potentially multiple classes there.
  • Add a GitHub workflow here which would react to changes in master and this branch, and to manual dispatch: it would first merge master into this branch, then run the script, and push the results to the linkml-auto-converted branch. This way we would always have an up-to-date, automatically updated state of that branch.
  • Address "notes" about failed conversions one way (changing the current dandi-schema pydantic model) or another (pydantic2linkml), or:
    • we can add a custom script to "enhance" the auto-generated linkml model to address any changes needed programmatically!
    • we can have a branch (or just a .patch file) with changes to perform on top of the converted linkml
  • ...
  • Then produce a Pydantic model out of this patched model, sufficient (although potentially more relaxed) to replace the current Pydantic model; a minimal sketch of that step follows this list.
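
A minimal sketch of that regeneration step, assuming linkml's PydanticGenerator API (paths illustrative; in this PR the step is wrapped in a hatch script rather than called directly):

    # Regenerate the Pydantic models from the patched LinkML schema.
    from linkml.generators.pydanticgen import PydanticGenerator

    code = PydanticGenerator("dandischema/models.yaml").serialize()
    with open("dandischema/models.py", "w") as f:
        f.write(code)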

candleindark and others added 17 commits March 13, 2026 17:01
Specify Hatch-managed env for auto-converting `dandischema.models` to a LinkML schema and back to Pydantic models.
Provide script to translate `dandischema.models` into a LinkML schema and overlay it with definitions provided by an overlay file.
Provide script to translate `dandischema/models.yaml` back to Pydantic models and store them in `dandischema/models.py`.
The previous BRE pattern used `\+` (a GNU sed extension) which silently fails on macOS BSD sed. Switch to `-E` (extended regex) with the POSIX character class `[^[:space:]]` instead of `\S` (also unsupported by BSD sed), making the normalization work on both macOS and Linux.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Expand comment for linkml-auto-converted hatch env with usage instructions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There is no prefix defined as `dandi_default`; the intended default prefix is `dandi`.
…ed and some symbols from _orig for now

We do it so it does not overlay models.py, since git would then be unable to track renames; we had to maintain the original filename for models.py to apply patches easily.
Comment thread tools/linkml_conversion Outdated
# Poor man's patch queue implementation
# Edit this list if you want to merge or drop PR branches to be patched with.
# Order matters
branches_to_merge=( remove-discriminated-unions )
Member Author

That is where we define branches from PRs to merge!

@codecov bot commented Mar 20, 2026

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.83%. Comparing base (4b89e4f) to head (07e7d47).
⚠️ Report is 6 commits behind head on master.

Files with missing lines            Patch %   Lines
dandischema/models_importstab.py    0.00%     2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #387      +/-   ##
==========================================
- Coverage   97.92%   97.83%   -0.09%     
==========================================
  Files          18       19       +1     
  Lines        2405     2407       +2     
==========================================
  Hits         2355     2355              
- Misses         50       52       +2     
Flag        Coverage Δ
unittests   97.83% <0.00%> (-0.09%) ⬇️


@candleindark force-pushed the linkml-conversion branch 2 times, most recently from 59b0587 to c0fbd02 on March 31, 2026 00:59
…nator

`dandischema.models` uses `schemaKey` in each Pydantic model as a de facto type designator in LinkML. However, direct translation to LinkML based on an individual model's definition is not possible. The override provided in the merge file completes the translation.
Comment thread pyproject.toml Outdated
candleindark and others added 29 commits April 6, 2026 18:48
This note is specified in the `models_merge.yaml` file to be added to the resulting LinkML translation of `dandischema.models`.
…tools

Move `remove_impossible_slot_usage_notes.py` and `sanitize-yaml` into
`tools/linkml_conversion_tools/` alongside the other linkml conversion
helpers, and update the `2linkml` script in `pyproject.toml` accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tterns

Rename remove_impossible_slot_usage_notes.py to remove_notes_by_pattern.py
and switch from a single substring match to a list of regex patterns
matched via re.search, so additional note families can be stripped in
the same pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
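
A hypothetical sketch of such a pattern-based pass (pattern list and YAML layout illustrative, not the script's actual contents):

    import re
    import yaml

    # Illustrative patterns; the real list lives in remove_notes_by_pattern.py.
    PATTERNS = [re.compile(p) for p in [
        r"Unable to translate",
        r"discriminat",
    ]]

    with open("dandischema/models.yaml") as f:
        schema = yaml.safe_load(f)

    # Drop any slot_usage note matching one of the patterns, in a single pass.
    for cls in schema.get("classes", {}).values():
        for slot in (cls.get("slot_usage") or {}).values():
            notes = slot.get("notes")
            if isinstance(notes, list):
                kept = [n for n in notes
                        if not any(p.search(str(n)) for p in PATTERNS)]
                if kept:
                    slot["notes"] = kept
                else:
                    slot.pop("notes")

    with open("dandischema/models.yaml", "w") as f:
        yaml.safe_dump(schema, f, sort_keys=False)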
…rings

Translating the max-length constraint on strings by encoding the constraint in the string's pattern is the best expression available in LinkML; there is no direct expression in LinkML for a max-length constraint on a string range.
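
For illustration (the limit and pattern below are invented, not taken from the schema), a max_length constraint and its pattern-encoded LinkML counterpart accept the same strings:

    import re

    value = "x" * 120
    assert len(value) <= 150                      # Pydantic-style max_length check
    assert re.fullmatch(r"[\s\S]{0,150}", value)  # the pattern-encoded equivalent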
… definitions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Ensures `enums.LicenseType.permissible_values` entries are sorted by key
for stable, readable diffs in the auto-converted LinkML YAML output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
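
A sketch of the idea, assuming the enum sits at the obvious place in the YAML (file layout illustrative):

    import yaml

    with open("dandischema/models.yaml") as f:
        schema = yaml.safe_load(f)

    # Re-insert LicenseType's permissible_values in key order so repeated
    # conversions produce stable, diff-friendly output.
    pvs = schema["enums"]["LicenseType"]["permissible_values"]
    schema["enums"]["LicenseType"]["permissible_values"] = dict(sorted(pvs.items()))

    with open("dandischema/models.yaml", "w") as f:
        yaml.safe_dump(schema, f, sort_keys=False)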
ATM needed by the genschemata helper; potentially we could just patch there.
Lock to the versions currently resolved via pydantic2linkml to prevent
unintended changes when new linkml versions are released.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
These three Typer scripts under .claude/skills/dandi-linkml-validation-report/scripts/
make up a reproducible pipeline for validating dandiset metadata
against dandischema/models.yaml using the linkml-validate CLI:

- fetch_metadata.py: download raw metadata for the draft and every
  published version of every dandiset on a DANDI instance, plus a small
  info.json per version capturing schemaVersion / status / modified.
- validate_metadata.py: shell out to linkml-validate per version with
  the right target class (Dandiset for drafts, PublishedDandiset
  otherwise), writing validation.txt, validation.json and SUMMARY.md
  alongside each metadata.json.
- generate_report.py: aggregate the per-version JSON records into a
  top-level REPORT.md grouped by class x schemaVersion, with
  most-common problem patterns and links to per-version summaries.

The dandi client is added as a dependency of the linkml-auto-converted
hatch env so the fetch script can run inside it.
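
Roughly, the fetch stage amounts to something like this sketch (assuming the dandi client's DandiAPIClient/RemoteDandiset API; field names and on-disk layout are illustrative, not the script's exact output):

    import json
    from pathlib import Path

    from dandi.dandiapi import DandiAPIClient

    root = Path("linkml-validation-reports/data")
    with DandiAPIClient.for_dandi_instance("dandi") as client:
        for dandiset in client.get_dandisets():
            for version in dandiset.get_versions():
                vdir = root / dandiset.identifier / version.identifier
                vdir.mkdir(parents=True, exist_ok=True)
                meta = dandiset.for_version(version).get_raw_metadata()
                (vdir / "metadata.json").write_text(json.dumps(meta, indent=2))
                info = {
                    "schemaVersion": meta.get("schemaVersion"),
                    "status": version.status.value,
                    "modified": version.modified.isoformat(),
                }
                (vdir / "info.json").write_text(json.dumps(info, indent=2))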
Previously a partial run could leave a corrupted metadata.json or
info.json behind, and a subsequent run with the default refresh=False
would happily skip the version because both files looked present.

Now the function does all network work first, pre-renders both JSON
payloads, and then writes them via temp files plus os.replace() so the
destination paths only appear once both have been written successfully
-- even under SIGKILL or other abrupt termination.

Drive-by simplifications addressed inline review notes:
- drop the parent.mkdir() side effect from the JSON writer; the caller
  now creates version_dir explicitly,
- drop json.dumps(default=str) since we already isoformat() datetimes
  before stashing them into info,
- use VersionStatus.value directly and skip the modified-is-None branch
  (per the dandi client's Version model both fields are non-optional),
- use PEP 604 `int | None` for the --limit option type.
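
A minimal sketch of that write discipline (helper name invented):

    import json
    import os
    from pathlib import Path

    def write_json_atomic(path: Path, payload: dict) -> None:
        text = json.dumps(payload, indent=2)   # render first; nothing on disk yet
        tmp = path.with_name(path.name + ".tmp")
        tmp.write_text(text)
        os.replace(tmp, path)                  # atomic rename: all-or-nothing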
Renames _normalise_problem -> _normalize_problem (and updates the
caller), and replaces 'organised' with 'organized' in a comment.
`validate_metadata.py` now drives validation through the LinkML
`Validator` Python API (configured with the same
`JsonschemaValidationPlugin(closed=True)` the CLI uses by default)
instead of shelling out. One run produces both a structured
`validation.json` (carrying the `ValidationResult` fields plus
`source.validator` / `validator_value` for grouping) and a
`validation.txt` transcript byte-equivalent to what `linkml-validate`
would have printed.

`@context` is stripped from the data instance before validation so
the JSON-schema check doesn't flag the JSON-LD framing key as an
unexpected property — see linkml/linkml#3442.

`generate_report.py` is updated in lockstep: problem patterns are now
grouped by `[severity] <validator> message`, sourced from the
structured record rather than regex-scrubbed CLI text.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
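
A sketch of the API-driven check described above, assuming the linkml `Validator` / `JsonschemaValidationPlugin` interfaces named in the message (target class and paths illustrative):

    import json

    from linkml.validator import Validator
    from linkml.validator.plugins import JsonschemaValidationPlugin

    validator = Validator(
        "dandischema/models.yaml",
        validation_plugins=[JsonschemaValidationPlugin(closed=True)],
    )

    with open("metadata.json") as f:
        instance = json.load(f)
    # The JSON-LD framing key would be flagged as an unexpected property;
    # see linkml/linkml#3442.
    instance.pop("@context", None)

    report = validator.validate(instance, target_class="Dandiset")
    for result in report.results:
        print(f"[{result.severity.value}] {result.message}")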
The column carries the per-version `status` field returned by the
DANDI Archive API (e.g. `VALID`, `Published`), not anything the LinkML
validator computed. Labelling it "API Status" makes that distinction
explicit so a reader doesn't mistake it for a validation outcome.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Authors the skill following the agentskills.io specification:

- SKILL.md carries the required `name` / `description` frontmatter
  plus a `compatibility` note for the hatch env, and a tight
  workflow body (when to use, prerequisites, three-stage pipeline
  with exact `hatch run` commands).
- references/OUTPUT.md documents the on-disk layout and JSON field
  shapes the pipeline emits, so callers don't have to read the
  scripts to understand the output.
- references/DESIGN.md captures the rationale behind non-obvious
  choices (LinkML Python API instead of CLI, closed=True plugin,
  `@context` strip per linkml/linkml#3442, byte-equivalent CLI
  transcript, all-or-nothing fetch writes, resume semantics).

Split this way to take advantage of progressive disclosure: SKILL.md
stays under ~100 lines for the activation tier, with the deeper
material loaded only when the agent actually needs it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds an `allowed-tools` declaration to SKILL.md so the three hatch-run
pipeline commands and the supporting `git rev-parse`/`git show`/`git
restore` calls don't prompt for permission while the skill is active.
The patterns are scoped (`Bash(git:*) Bash(hatch:*)`), so unrelated
shell commands still go through the normal approval flow.

`allowed-tools` is marked experimental in the agentskills.io spec, so
behavior may vary between agent implementations.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Naming the aggregated report `README.md` means GitHub renders it
automatically as the landing view when the output directory is
presented as a repo, so a reader lands on the validation report
without having to click into a file. The on-disk layout is otherwise
unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`references/DESIGN.md` and `references/OUTPUT.md` largely duplicated
content already present in the three scripts' module docstrings and
inline comments — the maintenance cost of two parallel sources
outweighed the progressive-disclosure benefit at this scale. Anyone
needing the on-disk layout or design rationale can read the relevant
script directly.

SKILL.md's "Further reading" section is replaced with a one-line
pointer to the script docstrings.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`validate_metadata.py` now runs ``dandischema.metadata.migrate`` on
each version's raw metadata before validating. The migrated instance
is persisted as ``metadata_migrated.json`` (the verbatim
``metadata.json`` is left untouched) and is what the LinkML validator
sees. Versions whose migration fails are recorded with the error and
skipped for validation — the validator never sees something the
migrator couldn't handle.

`validation.json` gains ``migration_status`` (``"success"`` /
``"failed"``) and ``migration_error`` fields. On migration failure
``problems`` is empty and ``exit_code`` is null. The CLI-equivalent
transcript is replaced by a one-line ``Migration failed: …`` notice
and ``SUMMARY.md`` calls the failure out instead of rendering a
problems block.

`generate_report.py` distinguishes migration failures in both the
overall headline (``N valid / M migration-failed / P with problems``)
and per-bucket sections (extra ``Migration failed:`` count when
non-zero) and renders an ``[migration failed]`` cell in the
per-version table linking to the version's ``SUMMARY.md`` for the
failure detail. Migration-failed versions are excluded from problem
pattern grouping since validation never ran for them.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
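
The gate itself is small; a sketch using `dandischema.metadata.migrate` as named above (record fields mirror the ones described, the rest is illustrative):

    import json
    from pathlib import Path

    from dandischema.metadata import migrate

    raw = json.loads(Path("metadata.json").read_text())
    try:
        migrated = migrate(raw)
    except Exception as e:
        # The validator never sees what the migrator couldn't handle.
        record = {"migration_status": "failed", "migration_error": str(e),
                  "problems": [], "exit_code": None}
    else:
        Path("metadata_migrated.json").write_text(json.dumps(migrated, indent=2))
        record = {"migration_status": "success", "migration_error": None}
        # ... run the LinkML validator on `migrated`, not on `raw` ...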
The output tree no longer namespaces under a `<short-sha>/` directory.
All artifacts live under one flat root:

    linkml-validation-reports/
    ├── README.md
    └── data/<dandiset>/<version>/{metadata.json, info.json,
                                   metadata_migrated.json,
                                   validation.json, validation.txt,
                                   SUMMARY.md}

Raw metadata is schema-independent and only fetched once; subsequent
runs against a different schema reuse it.

`validate_metadata.py`'s resume guard is now schema-aware. Each
`validation.json` is stamped with the SHA-256 of the schema file's
bytes (`schema_sha256` field). On a re-run the guard skips a version
only when its stamp matches the current schema, so a schema-content
change — committed *or* uncommitted — re-runs migration and
validation automatically without `--refresh`. `--refresh` is now
documented as a forceful override only.

Per-version logging moved into `_validate_one`. The function used to
return a `(target_class, migration_status, n_problems)` tuple
consumed *only* by the orchestrator's per-version log line. With the
log emitted at the decision site (resumed / migration failed /
migrated and validated), the return value carried no information and
has been dropped.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
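
A sketch of the schema-aware guard (function names invented):

    import hashlib
    import json
    from pathlib import Path

    def schema_digest(schema_path: Path) -> str:
        return hashlib.sha256(schema_path.read_bytes()).hexdigest()

    def should_skip(validation_json: Path, schema_path: Path, refresh: bool) -> bool:
        if refresh or not validation_json.exists():
            return False
        stamp = json.loads(validation_json.read_text()).get("schema_sha256")
        return stamp == schema_digest(schema_path)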
- patch -p1 --forward: skip hunks that are already applied (default
  `Assume -R?` would silently revert them since stdin is the diff).
- After the patch loop, inspect captured output: abort on real
  conflicts ("failed" / "FAILED"), tolerate non-zero exit only when a
  skip indicator ("previously applied" / "reversed" / "skipping patch")
  is present.
- Delete .rej files left by --forward.
- Replace `git commit -a` with `git add -A; git commit` so newly
  introduced files from patched branches are included in the merge
  commit.
Comment thread tools/linkml_conversion
Comment on lines +64 to +73
if [ "$status" -ne 0 ]; then
if grep -qi 'failed' <<<"$out"; then
echo "patch FAILED for branch $b — see rejects" >&2
exit 1
fi
if ! grep -qiE 'previously applied|reversed|skipping patch' <<<"$out"; then
echo "patch exited $status for branch $b without a recognized skip indicator; aborting" >&2
exit 1
fi
fi
Member Author

Suggested change

if [ "$status" -ne 0 ]; then
    if grep -qi 'failed' <<<"$out"; then
        echo "patch FAILED for branch $b — see rejects" >&2
        exit 1
    fi
    if ! grep -qiE 'previously applied|reversed|skipping patch' <<<"$out"; then
        echo "patch exited $status for branch $b without a recognized skip indicator; aborting" >&2
        exit 1
    fi
fi

if [ "$status" -ne 0 ]; then
    if grep -qiE 'previously applied|reversed|skipping patch' <<<"$out"; then
        echo "patch exited $status for branch $b but it was about an already-applied patch; ignoring"
    else
        echo "patch FAILED for branch $b with exit $status — see rejects" >&2
        exit $status
    fi
fi
