Add multilingual entity alias guard #422
@algora-pbc /claim #17
Hardening update pushed in 1c90584: multilingual alias normalization now applies Unicode NFKC normalization and collapses internal whitespace before alias lookup. This prevents trusted translated aliases from being suppressed only because an ontology export used decomposed accents or a manuscript had extra spacing. I added a regression that failed before the fix with Validation refreshed locally:
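As a rough illustration of the normalization described in this update, the sketch below applies Unicode NFKC normalization and collapses whitespace before building a lookup key. The helper name `normalizeAliasKey`, the `aliasIndex` map, and the entity ID are hypothetical and only stand in for whatever the slice actually uses.

```javascript
// Hypothetical helper; the name and exact steps are assumptions, not the PR's code.
function normalizeAliasKey(text) {
  if (typeof text !== 'string') return null; // leave malformed text for a separate hold path
  return text
    .normalize('NFKC')     // compose decomposed accents and fold compatibility forms
    .replace(/\s+/g, ' ')  // collapse runs of whitespace into a single space
    .trim();
}

// Synthetic alias index keyed by the normalized form.
const aliasIndex = new Map([
  [normalizeAliasKey('\u00e1cido desoxirribonucleico'), 'entity:dna'],
]);

// A decomposed accent (a + U+0301) and extra spacing still resolve to the same key.
const messy = 'a\u0301cido   desoxirribonucleico ';
console.log(aliasIndex.get(normalizeAliasKey(messy))); // entity:dna
```

Normalizing both the index keys and the incoming mention with the same function is what keeps the lookup symmetric; normalizing only one side would reintroduce the mismatch.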
Hardening update pushed in b912229: language tags are now normalized for alias lookup while the original mention language tag is still preserved in the decision packet. This prevents trusted translated aliases from being suppressed just because a corpus/export used Verification refreshed:
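The language-tag handling described in this update could look roughly like the sketch below: the tag is normalized only for lookup, falls back to the base language, and the original tag is carried through unchanged. The names `normalizeLanguageTag` and `lookupAlias`, the shape of `aliasesByLanguage`, and the entity ID are all assumptions made for illustration.

```javascript
// Hypothetical sketch; function names and data shapes are assumptions.
function normalizeLanguageTag(tag) {
  if (typeof tag !== 'string' || tag.trim() === '') return null;
  // Treat underscores like hyphens, e.g. es_MX -> es-MX.
  const parts = tag.trim().replace(/_/g, '-').split('-');
  const base = parts[0].toLowerCase();
  // Simplification: the second subtag is treated as a region (script subtags are ignored here).
  const region = parts[1] ? parts[1].toUpperCase() : null;
  return { original: tag, base, lookup: region ? `${base}-${region}` : base };
}

function lookupAlias(aliasesByLanguage, mentionTag, key) {
  const tag = normalizeLanguageTag(mentionTag);
  if (!tag) return null;
  const exact = aliasesByLanguage[tag.lookup];
  const fallback = aliasesByLanguage[tag.base]; // regional tags fall back to the base language
  const entityId = (exact && exact[key]) || (fallback && fallback[key]) || null;
  // The original mention tag is preserved in the result, not the normalized one.
  return { entityId, language: tag.original };
}

const aliasesByLanguage = { es: { adn: 'entity:dna' } };
console.log(lookupAlias(aliasesByLanguage, 'es_MX', 'adn'));
// { entityId: 'entity:dna', language: 'es_MX' }
```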
Follow-up competitive hardening pass for the multilingual entity alias guard. What changed in
Validation refreshed locally:
Follow-up competitive hardening pass for the multilingual entity alias guard. What changed in
Why this matters:
Validation refreshed locally:
Hardening update pushed in This closes a graph-safety gap where the extractor could propose one canonical entity while multilingual alias lookup resolved the text to a different canonical entity. The guard now holds that disagreement as Fresh validation from
This keeps #422 distinct from #379: #422 protects multilingual alias/entity recommendation correctness, while #379 covers geospatial field-sample provenance and safe location-edge publication.
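To make the extractor-candidate versus alias disagreement concrete, here is a minimal sketch of how such a check might hold a mention instead of recommending it. The function name, the `candidateEntityId` field, the `candidate-alias-conflict` reason label, and the entity IDs are assumptions chosen for illustration; the PR's actual identifiers are not shown in this thread.

```javascript
// Hypothetical sketch; field names and the reason label are assumptions.
function checkCandidateAgainstAlias(mention, aliasEntityId) {
  const candidate = (mention && mention.candidateEntityId) || null;
  if (candidate && aliasEntityId && candidate !== aliasEntityId) {
    // The extractor and the alias lookup disagree: hold for curator review
    // rather than letting either ID reach recommendations.
    return {
      status: 'hold',
      reason: 'candidate-alias-conflict',
      candidateEntityId: candidate,
      aliasEntityId,
      recommendationSafe: false,
    };
  }
  const entityId = aliasEntityId || candidate;
  if (!entityId) return { status: 'unresolved', recommendationSafe: false };
  return { status: 'accept', entityId, recommendationSafe: true };
}

console.log(
  checkCandidateAgainstAlias({ candidateEntityId: 'entity:lead-metal' }, 'entity:lead-author').status
); // hold
```

Keeping both IDs in the held result gives a curator the evidence needed to decide which side was wrong.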
Hardening update pushed in This closes two malformed-evidence crash paths in the multilingual alias guard:
Fresh validation from
This keeps #422 focused on multilingual alias/entity recommendation correctness and distinct from #379 geospatial provenance and #515 organism/strain boundary work.
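The kind of crash path discussed here can be illustrated with a defensive triage step that turns malformed rows into curator holds instead of exceptions. This is only a sketch: `triageMentions` and the `text` field are hypothetical, while the reason and action strings are the ones quoted in the PR description.

```javascript
// Hypothetical sketch; the function name and the `text` field are assumptions.
function triageMentions(corpus) {
  const holds = [];
  const usable = [];
  // Tolerate a missing or malformed corpus or mentions array instead of throwing.
  const mentions = corpus && Array.isArray(corpus.mentions) ? corpus.mentions : [];
  mentions.forEach((mention, index) => {
    if (mention === null || typeof mention !== 'object' || Array.isArray(mention)) {
      holds.push({
        index,
        reason: 'malformed-mention-entry',
        action: 'review-multilingual-malformed-mention',
        priority: 'high',
      });
    } else if (typeof mention.text !== 'string' || mention.text.trim() === '') {
      holds.push({
        index,
        reason: 'malformed-mention-text',
        action: 'review-multilingual-malformed-mention',
        priority: 'high',
      });
    } else {
      usable.push(mention);
    }
  });
  return { holds, usable };
}

const result = triageMentions({ mentions: [null, { text: 42 }, { text: 'ADN', language: 'es' }] });
console.log(result.holds.map((hold) => hold.reason));
// [ 'malformed-mention-entry', 'malformed-mention-text' ]
```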
Pushed focused hardening commit New regression now holds
Hardened malformed top-level corpus packets in What changed:
Verified locally:
/claim #17
@algora-pbc /claim #17
Summary
Adds a distinct `multilingual-entity-alias-guard/` slice for Scientific Knowledge Graph Integration.

The guard normalizes multilingual scientific mentions before they become graph nodes, entity-page aliases, or recommendation signals. It:
- accepts trusted translated aliases only when numeric confidence evidence is present;
- preserves original language tags while normalizing language-tag casing and underscore or hyphen regional subtags for lookup;
- emits JSON-LD/schema.org-style entity packets;
- holds the following for curator review: homographs, false friends, same-language alias collisions, extractor-candidate/alias conflicts, regional-tag homographs, malformed top-level corpus packets, malformed alias evidence, malformed mention entries or mention text, and mixed-script Latin-language lookalikes, including lowercase Greek or Cyrillic confusables;
- suppresses low-confidence or missing-confidence aliases before graph recommendations are shown;
- handles sparse or malformed ontology/corpus exports without runtime failures when corpus shape, localized names, mentions, or homograph policies are omitted or malformed.
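One of the checks listed above, the mixed-script lookalike hold, can be approximated with Unicode script property escapes. The sketch below is an assumption about how such a check might work, not the guard's actual rule; `hasMixedScriptLookalike` is a hypothetical name.

```javascript
// Hypothetical sketch using Unicode script property escapes (requires the `u` flag).
const LATIN_LETTER = /\p{Script=Latin}/u;
const GREEK_OR_CYRILLIC_LETTER = /[\p{Script=Greek}\p{Script=Cyrillic}]/u;

// Flags text that mixes Latin letters with Greek or Cyrillic letters, which is
// how a visually confusable mention could slip past an exact alias lookup.
function hasMixedScriptLookalike(text) {
  return (
    typeof text === 'string' &&
    LATIN_LETTER.test(text) &&
    GREEK_OR_CYRILLIC_LETTER.test(text)
  );
}

console.log(hasMixedScriptLookalike('insulin'));       // false
console.log(hasMixedScriptLookalike('insul\u0456n'));  // true (U+0456 is a Cyrillic letter)
console.log(hasMixedScriptLookalike('\u03b2-catenin')); // true (Greek beta with Latin letters)
```

The last case shows why a check like this is better suited to a curator hold than to outright rejection: legitimate scientific names can contain Greek letters, so a real implementation would likely need an allowlist or a review step.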
Hardening Updates
- 44c44c3: malformed top-level corpus packets such as `evaluateAliasGuard(null)` now emit a high-priority `malformed-corpus-packet` curator hold and `review-multilingual-malformed-corpus` action instead of crashing at `corpus.entities` before graph evidence is produced.
- `malformed-corpus-packet.json` and refreshed the generated Markdown evidence so reviewers can inspect the held malformed-corpus path directly.
- `mentions: [null]` now emit `malformed-mention-entry` curator holds with high-priority `review-multilingual-malformed-mention` actions instead of crashing before graph packets are produced.
- `aliasEvidenceIssues` preserved in the reviewer packet instead of crashing alias indexing.
- `malformed-mention-text` curator holds with high-priority `review-multilingual-malformed-mention` actions instead of crashing normalization or reaching recommendation-safe IDs.
- `es-MX` and underscore tags such as `es_MX` fall back to base-language alias and homograph policy while preserving the original tag.
- `localizedNames`, `mentions`, or `homographs` are omitted, instead of throwing before curator policy can run.

Non-overlap
This is scoped to multilingual scientific alias quality before graph nodes and recommendations are produced. It does not duplicate broad entity extraction/navigation, ontology deprecation or synonym migration, recommendation visibility/diversity, geospatial provenance, organism/strain boundaries, clinical trial, biological accession, software runtime, protocol deviation, sample custody, or temporal validity guards.
Validation
- `TypeError: Cannot read properties of null (reading 'entities')` when `evaluateAliasGuard(null)` ran before graph packet generation.
- `TypeError: Cannot read properties of null (reading 'language')` when `mentions` contained a null row.
- `cd multilingual-entity-alias-guard && npm test` passed: 22 tests.
- `cd multilingual-entity-alias-guard && npm run check` passed test, demo, and video generation.
- `node --check` passed for `index.js`, `demo.js`, and `test.js`.
- `malformed-corpus-packet.json`.
- `ffprobe` verified `multilingual-entity-alias-guard/reports/demo.mp4` as H.264, 1280x720, 4s, 30fps, 120 frames, 50,413 bytes.
- `git diff --check` and `git diff --cached --check` passed; Git only reported Windows line-ending normalization warnings before staging.
- `multilingual-entity-alias-guard/`.
- `CLEAN`; no check contexts are reported for this branch.

Demo Artifacts
- `multilingual-entity-alias-guard/reports/alias-guard-packet.json`
- `multilingual-entity-alias-guard/reports/sparse-alias-guard-packet.json`
- `multilingual-entity-alias-guard/reports/candidate-alias-conflict-packet.json`
- `multilingual-entity-alias-guard/reports/malformed-alias-evidence-packet.json`
- `multilingual-entity-alias-guard/reports/malformed-mention-text-packet.json`
- `multilingual-entity-alias-guard/reports/malformed-mention-entry-packet.json`
- `multilingual-entity-alias-guard/reports/malformed-corpus-packet.json`
- `multilingual-entity-alias-guard/reports/alias-guard-report.md`
- `multilingual-entity-alias-guard/reports/summary.svg`
- `multilingual-entity-alias-guard/reports/demo.mp4`

Synthetic data only. No credentials, private corpora, live ontology calls, search indexes, recommendation systems, or external APIs are used.
AI-assisted with OpenAI Codex; I reviewed and locally verified the diff before submitting.