Add multilingual entity alias guard by KoiosSG · Pull Request #422 · SCIBASE-AI/SCIBASE.AI

KoiosSG · 2026-05-28T06:30:41Z

Summary

Adds a distinct multilingual-entity-alias-guard/ slice for Scientific Knowledge Graph Integration.

The guard normalizes multilingual scientific mentions before they become graph nodes, entity-page aliases, or recommendation signals. In particular, it:

Accepts trusted translated aliases only when numeric confidence evidence is present.
Preserves original language tags while normalizing language-tag casing and both underscore and hyphen regional subtags for lookup.
Emits JSON-LD/schema.org-style entity packets.
Holds the following for curator review: homographs, false friends, same-language alias collisions, extractor-candidate/alias conflicts, regional-tag homographs, malformed top-level corpus packets, malformed alias evidence, malformed mention entries or mention text, and mixed-script Latin-language lookalikes, including lowercase Greek or Cyrillic confusables.
Suppresses low-confidence or missing-confidence aliases before graph recommendations are shown.
Handles sparse or malformed ontology/corpus exports without runtime failures when corpus shape, localized names, mentions, or homograph policies are omitted or malformed.

Hardening Updates

44c44c3: malformed top-level corpus packets such as evaluateAliasGuard(null) now emit a high-priority malformed-corpus-packet curator hold and review-multilingual-malformed-corpus action instead of crashing at corpus.entities before graph evidence is produced.
Added malformed-corpus-packet.json and refreshed the generated Markdown evidence so reviewers can inspect the held malformed-corpus path directly.
Malformed mention rows such as mentions: [null] now emit malformed-mention-entry curator holds with high-priority review-multilingual-malformed-mention actions instead of crashing before graph packets are produced.
Malformed localized-name evidence is omitted from alias lookup and JSON-LD alternate names, with aliasEvidenceIssues preserved in the reviewer packet instead of crashing alias indexing.
Malformed mention text values emit malformed-mention-text curator holds with high-priority review-multilingual-malformed-mention actions instead of crashing normalization or reaching recommendation-safe IDs.
Extractor candidate IDs that disagree with trusted multilingual alias lookup are held for curator review instead of silently overriding either signal.
Same-language translated alias collisions are held for curator review instead of silently attaching a mention to the wrong canonical entity.
Unicode NFKC normalization and whitespace collapsing prevent trusted translated aliases from being suppressed because of composed/decomposed accents or spacing differences.
Language-tag casing is normalized for alias lookup while the original tag remains preserved in the decision packet.
Regional language tags such as es-MX and underscore tags such as es_MX fall back to base-language alias and homograph policy while preserving the original tag.
Missing or non-numeric confidence evidence suppresses candidate alias recommendations instead of allowing an unproven canonical entity mapping.
Mixed-script Latin-language aliases with Cyrillic or Greek lookalike characters are held for curator review before graph edges or recommendations are produced.
Sparse ontology/corpus exports emit deterministic empty review or entity-alias evidence when localizedNames, mentions, or homographs are omitted, instead of throwing before curator policy can run.
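
As an illustration of the same-language collision hold listed above, here is a minimal sketch rather than the code in index.js. The function names `buildAliasIndex` and `resolveAlias`, the entity shape (`localizedNames` as language-keyed arrays of strings), the entity IDs, and the `alias-collision` reason string are all assumptions made for the example; only the three decision names come from this PR.

```javascript
// Hypothetical sketch: index aliases per language and surface collisions.
// Assumed entity shape: { id, localizedNames: { <lang>: [alias, ...] } }.
function buildAliasIndex(entities) {
  const index = new Map(); // "lang:alias" -> Set of entity ids
  for (const entity of entities) {
    for (const [lang, aliases] of Object.entries(entity.localizedNames ?? {})) {
      for (const alias of aliases) {
        const key = `${lang.toLowerCase()}:${alias.toLowerCase()}`;
        if (!index.has(key)) index.set(key, new Set());
        index.get(key).add(entity.id);
      }
    }
  }
  return index;
}

function resolveAlias(index, lang, text) {
  const ids = index.get(`${lang.toLowerCase()}:${text.toLowerCase()}`);
  if (!ids) return { decision: 'suppress-recommendation', candidates: [] };
  if (ids.size > 1) {
    // Two canonical entities claim the same alias in the same language:
    // hold instead of attaching the mention to either one.
    return {
      decision: 'hold-for-curator-review',
      reason: 'alias-collision', // hypothetical reason name
      candidates: [...ids],
    };
  }
  return { decision: 'accept-canonical-entity', candidates: [...ids] };
}

const demoIndex = buildAliasIndex([
  { id: 'entity:experimental-control', localizedNames: { es: ['control'] } },
  { id: 'entity:quality-control', localizedNames: { es: ['Control'] } },
  { id: 'entity:gene-therapy', localizedNames: { es: ['terapia génica'] } },
]);
console.log(resolveAlias(demoIndex, 'es', 'control').decision); // hold-for-curator-review
console.log(resolveAlias(demoIndex, 'es', 'terapia génica').decision); // accept-canonical-entity
```

Keeping every claimant in a `Set` lets the curator packet list all candidates, so the hold does not silently favour whichever entity was indexed first.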

Non-overlap

This is scoped to multilingual scientific alias quality before graph nodes and recommendations are produced. It does not duplicate broad entity extraction/navigation, ontology deprecation or synonym migration, recommendation visibility/diversity, geospatial provenance, organism/strain boundaries, clinical trial, biological accession, software runtime, protocol deviation, sample custody, or temporal validity guards.

Validation

Latest malformed top-level corpus regression failed before implementation with TypeError: Cannot read properties of null (reading 'entities') when evaluateAliasGuard(null) ran before graph packet generation.
Prior malformed-row regression failed before implementation with TypeError: Cannot read properties of null (reading 'language') when mentions contained a null row.
Earlier red regressions covered malformed localized-name evidence and malformed mention text crashes before implementation.
cd multilingual-entity-alias-guard && npm test passed: 22 tests.
cd multilingual-entity-alias-guard && npm run check passed test, demo, and video generation.
node --check passed for index.js, demo.js, and test.js.
Parsed all seven generated JSON packets successfully, including malformed-corpus-packet.json.
ffprobe verified multilingual-entity-alias-guard/reports/demo.mp4 as H.264, 1280x720, 4s, 30fps, 120 frames, 50,413 bytes.
git diff --check and git diff --cached --check passed; Git only reported Windows line-ending normalization warnings before staging.
Staged allowlist validation confirmed all staged paths were under multilingual-entity-alias-guard/.
Focused payout/contact, credential, and token scan returned no matches.
GitHub PR merge state after push: CLEAN; no check contexts are reported for this branch.

Demo Artifacts

multilingual-entity-alias-guard/reports/alias-guard-packet.json
multilingual-entity-alias-guard/reports/sparse-alias-guard-packet.json
multilingual-entity-alias-guard/reports/candidate-alias-conflict-packet.json
multilingual-entity-alias-guard/reports/malformed-alias-evidence-packet.json
multilingual-entity-alias-guard/reports/malformed-mention-text-packet.json
multilingual-entity-alias-guard/reports/malformed-mention-entry-packet.json
multilingual-entity-alias-guard/reports/malformed-corpus-packet.json
multilingual-entity-alias-guard/reports/alias-guard-report.md
multilingual-entity-alias-guard/reports/summary.svg
multilingual-entity-alias-guard/reports/demo.mp4

Synthetic data only. No credentials, private corpora, live ontology calls, search indexes, recommendation systems, or external APIs are used.

AI-assisted with OpenAI Codex; I reviewed and locally verified the diff before submitting.

KoiosSG · 2026-05-28T06:34:32Z

@algora-pbc /claim #17

KoiosSG · 2026-05-28T15:52:35Z

Hardening update pushed in 1c90584: multilingual alias normalization now applies Unicode NFKC normalization and collapses internal whitespace before alias lookup. This prevents trusted translated aliases from being suppressed only because an ontology export used decomposed accents or a manuscript had extra spacing.

I added a regression for a Spanish gene-therapy mention containing composed/decomposed accent and whitespace differences. Before the fix it failed, returning suppress-recommendation where accept-canonical-entity was expected; it now passes.
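
For reviewers who want to see the idea in isolation, a minimal sketch of this normalization follows. `normalizeAliasText` is a hypothetical name, and trimming leading/trailing whitespace is an assumption beyond the internal-whitespace collapsing described above.

```javascript
// Hypothetical helper: NFKC-normalize, then collapse runs of whitespace.
function normalizeAliasText(text) {
  return text
    .normalize('NFKC')     // composes e + U+0301 into é, folds compatibility forms
    .replace(/\s+/g, ' ')  // collapse internal whitespace runs to one space
    .trim();               // assumption: also drop leading/trailing whitespace
}

const composed = 'terapia g\u00E9nica';      // precomposed é
const decomposed = 'terapia  ge\u0301nica '; // e + combining acute, extra spaces
console.log(normalizeAliasText(composed) === normalizeAliasText(decomposed)); // true
```

Applying the same function to both the ontology alias and the manuscript mention is what makes the lookup keys comparable; normalizing only one side would leave the mismatch in place.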

Validation refreshed locally:

npm test -> 7 multilingual entity alias guard tests passed
npm run check -> tests, demo, and demo video regenerated successfully
ffprobe on reports/demo.mp4 -> H.264, 1280x720, 4s, 30fps, 46,481 bytes
git diff --check
sensitive-term scan with rg -n "(password|secret|wallet|paypal|bank|passport|private key|api key)" multilingual-entity-alias-guard returned no matches

KoiosSG · 2026-05-29T15:58:22Z

Hardening update pushed in b912229: language tags are now normalized for alias lookup while the original mention language tag is still preserved in the decision packet. This prevents trusted translated aliases from being suppressed just because a corpus/export used ES while ontology aliases were keyed as es.
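
A minimal sketch of that split between the preserved tag and the lookup key, assuming a hypothetical `lookupAlias` helper and an alias table keyed by lowercase language and then by alias text (the real packet field names may differ):

```javascript
// Hypothetical sketch: lowercase the tag for lookup only, keep the original.
function lookupAlias(aliasesByLanguage, mention) {
  const lookupLanguage = String(mention.language ?? '').toLowerCase();
  const entityId = aliasesByLanguage[lookupLanguage]?.[mention.text] ?? null;
  return {
    language: mention.language, // original tag, preserved as received
    lookupLanguage,             // normalized key used internally
    entityId,
    decision: entityId ? 'accept-canonical-entity' : 'suppress-recommendation',
  };
}

const aliasTable = { es: { 'terapia génica': 'entity:gene-therapy' } };
console.log(lookupAlias(aliasTable, { language: 'ES', text: 'terapia génica' }));
```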

Verification refreshed:

Red regression first: npm test failed on the uppercase language-tag alias case (suppress-recommendation vs accept-canonical-entity).
Green: npm test passes with 8 multilingual entity alias guard tests.
npm run check passes: tests, demo packet/report/SVG, and demo MP4 generation.
ffprobe confirms reports/demo.mp4 is H.264, 1280x720, 30fps, 4s, 46,481 bytes.
git diff --check and git diff --cached --check pass.
Credential/payout-focused scan across changed code/docs/reports returned no matches.

KoiosSG · 2026-05-29T17:55:02Z

Follow-up competitive hardening pass for the multilingual entity alias guard.

What changed in 90f1648:

Added coverage for regional language tags such as es-MX falling back to base-language alias lookup (es) while preserving the original regional tag in decision output.
The same base-language fallback is used for homograph/false-friend policy, so es-MX:control is still held for curator review under the existing Spanish homograph rule instead of bypassing it.
Updated README, requirements map, and acceptance notes to make regional-subtag handling part of the reviewer-visible contract.
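
The base-language fallback described above could be sketched as follows. This is an illustration under assumptions: `languageLookupKeys` and `isHeldHomograph` are hypothetical names, and the homograph policy is modelled as a set of `lang:term` keys purely for the example.

```javascript
// Hypothetical sketch: try the regional tag first, then its base language.
function languageLookupKeys(tag) {
  const normalized = tag.toLowerCase();
  const base = normalized.split('-')[0];
  return base === normalized ? [normalized] : [normalized, base];
}

// Assumed policy shape: a Set of "lang:term" keys, e.g. "es:control".
function isHeldHomograph(homographKeys, tag, text) {
  const term = text.toLowerCase();
  return languageLookupKeys(tag).some((lang) => homographKeys.has(`${lang}:${term}`));
}

const policy = new Set(['es:control']);
console.log(languageLookupKeys('es-MX'));                 // [ 'es-mx', 'es' ]
console.log(isHeldHomograph(policy, 'es-MX', 'control')); // true
console.log(isHeldHomograph(policy, 'en-US', 'control')); // false
```

Using one key list for both alias lookup and homograph policy is the point of the change: a regional tag cannot reach an accepted alias while skipping the hold that the base language would have triggered.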

Validation refreshed locally:

Confirmed the regional-alias regression failed before implementation with suppress-recommendation instead of accept-canonical-entity.
npm test -> 10 multilingual entity alias guard tests passed.
npm run demo -> regenerated JSON/Markdown/SVG artifacts.
npm run video -> regenerated reports/demo.mp4.
npm run check -> test, demo, and video generation passed.
node --check on index/demo/test passed.
ffprobe verified reports/demo.mp4 as H.264, 1280x720, 30 fps, 4.0s, 46,481 bytes.
git diff --check and git diff --cached --check passed; only Git line-ending normalization warnings appeared on Windows.
Credential/payout-focused scan returned no matches.

KoiosSG · 2026-05-29T19:31:03Z

Follow-up competitive hardening pass for the multilingual entity alias guard.

What changed in 8011d06:

Added a regression for trusted translated aliases that omit confidence evidence.
Missing or non-numeric confidence now suppresses recommendations instead of accepting a canonical entity mapping.
Candidate entity evidence remains auditable, but the entity is not included in safe recommendation IDs or entity-page mentions until confidence evidence is present.
README, requirements map, and acceptance notes now explicitly cover missing-confidence alias suppression.
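
A minimal sketch of the confidence gate, for illustration only. The `decideAlias` name, the packet field names, the reason strings, and the 0.8 threshold are assumptions; the PR states only that missing, non-numeric, or low confidence suppresses the recommendation while the candidate stays auditable.

```javascript
// Hypothetical sketch: require finite numeric confidence before accepting.
function decideAlias(alias, threshold = 0.8 /* assumed value */) {
  const c = alias.confidence;
  const numeric = typeof c === 'number' && Number.isFinite(c);
  const accepted = numeric && c >= threshold;
  return {
    decision: accepted ? 'accept-canonical-entity' : 'suppress-recommendation',
    candidateEntityId: alias.entityId,                   // kept for audit either way
    safeRecommendationIds: accepted ? [alias.entityId] : [],
    reason: accepted ? null : numeric ? 'low-confidence' : 'missing-confidence',
  };
}

console.log(decideAlias({ entityId: 'entity:gene-therapy', confidence: 0.95 }).decision);
console.log(decideAlias({ entityId: 'entity:gene-therapy' }).decision);
console.log(decideAlias({ entityId: 'entity:gene-therapy', confidence: '0.95' }).decision);
```

The `typeof` check matters because a string such as `'0.95'` would otherwise pass a plain `>=` comparison through implicit coercion.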

Validation refreshed locally:

Confirmed the new regression failed before implementation with accept-canonical-entity instead of suppress-recommendation.
npm test -> 11 multilingual entity alias guard tests passed.
npm run check -> test, demo, and video generation passed.
npm run demo -> regenerated alias packet/report/SVG artifacts.
npm run video -> regenerated reports/demo.mp4.
ffprobe verified reports/demo.mp4 as H.264, 1280x720, 30 fps, 4.0s, 46,481 bytes.
git diff --check and git diff --cached --check passed; only Git line-ending normalization warnings appeared on Windows.
Sensitive-term scan returned no payout or credential strings.
GitHub PR merge state after push: CLEAN.

KoiosSG · 2026-05-29T21:48:52Z

Follow-up competitive hardening pass for the multilingual entity alias guard.

What changed in f90337e:

Normalized underscore regional language tags such as es_MX to the same lookup path as hyphenated tags such as es-MX.
Preserved the original mention language tag in decision output while applying the normalized lookup key internally.
Added regression coverage showing both trusted alias acceptance and homograph/false-friend holds still apply under underscore regional tags.
Updated README, requirements map, and acceptance notes so this separator normalization is reviewer-visible contract, not incidental behavior.
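
One way to picture the separator normalization above, as a sketch with hypothetical names (`toLookupTag`, `describeLanguage`) and assumed output fields:

```javascript
// Hypothetical sketch: treat "_" and "-" as the same subtag separator.
function toLookupTag(tag) {
  return String(tag).trim().replace(/_/g, '-').toLowerCase();
}

function describeLanguage(tag) {
  const lookupTag = toLookupTag(tag);
  return {
    language: tag,                         // original tag, reported unchanged
    lookupTag,                             // internal key after normalization
    baseLanguage: lookupTag.split('-')[0], // used for fallback lookup
  };
}

console.log(describeLanguage('es_MX'));
// { language: 'es_MX', lookupTag: 'es-mx', baseLanguage: 'es' }
```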

Validation refreshed locally:

Confirmed the new regression failed before implementation with suppress-recommendation instead of accept-canonical-entity for es_MX.
npm test -> 12 multilingual entity alias guard tests passed.
npm run check -> test, demo, and video generation passed.
node --check on index/demo/test passed.
ffprobe verified reports/demo.mp4 as H.264, 1280x720, 30 fps, 4.0s, 46,481 bytes.
git diff --check and git diff --cached --check passed.
Sensitive-term scan returned no payout or credential strings.

KoiosSG · 2026-05-29T23:46:43Z

Follow-up competitive hardening pass for the multilingual entity alias guard.

What changed in e7ffd6c:

Added detection for Latin-language scientific aliases that mix normal Latin text with Cyrillic or Greek lookalike characters.
These visually confusable aliases are now held for curator review instead of becoming quiet unknowns or trusted graph mappings.
The candidate entity evidence remains visible in the decision packet, but the mention is blocked from entity-page/recommendation outputs until reviewed.
README, requirements map, acceptance notes, and generated demo artifacts now make this mixed-script guard part of the reviewer-visible contract.
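
A minimal sketch of a mixed-script check of this kind, using Unicode script property escapes. It is an assumption-laden illustration: `hasMixedScriptLookalike` is a hypothetical name, and this version flags any Cyrillic or Greek character next to Latin text, whereas the module's detector may be restricted to a specific confusables list.

```javascript
// Hypothetical sketch: flag text that mixes Latin with Cyrillic or Greek.
const LATIN_CHAR = /\p{Script=Latin}/u;
const LOOKALIKE_SCRIPT_CHAR = /[\p{Script=Cyrillic}\p{Script=Greek}]/u;

function hasMixedScriptLookalike(text) {
  return LATIN_CHAR.test(text) && LOOKALIKE_SCRIPT_CHAR.test(text);
}

console.log(hasMixedScriptLookalike('CRISPR'));       // false: pure Latin
console.log(hasMixedScriptLookalike('\u0421RISPR'));  // true: Cyrillic Es + Latin
console.log(hasMixedScriptLookalike('\u0434\u043D\u043A')); // false: pure Cyrillic
```

A check this broad would also hold legitimate names such as β-catenin in a Latin-language document, which is one reason the outcome is a curator hold rather than an outright rejection.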

Validation refreshed locally:

Confirmed the new regression failed before implementation with suppress-recommendation instead of hold-for-curator-review.
npm test -> 13 multilingual entity alias guard tests passed.
npm run check -> test, demo, and video generation passed.
node --check on index/demo/test passed.
ffprobe verified reports/demo.mp4 as H.264, 1280x720, 30 fps, 4.0s, 46,481 bytes.
git diff --check and git diff --cached --check passed.
Sensitive-term scan returned no payout, credential, or token strings.
GitHub PR merge state after push: CLEAN.

KoiosSG · 2026-05-30T08:07:29Z

Follow-up competitive hardening pass for the multilingual entity alias guard.

What changed in 2e4c822:

Added a regression for Latin-language aliases using lowercase Greek lookalike characters, e.g. CRISPR-Cαs9 with Greek alpha.
The script-confusable detector now covers lowercase Greek confusables, so visually spoofed Latin-language scientific aliases are held for curator review instead of quietly becoming suppressed/unknown aliases.
The demo corpus now includes mention-crispr-greek-alpha-spoof, and README/requirements/acceptance notes plus reviewer artifacts were refreshed so this gate is visible.

Why this matters:

Lowercase Greek letters are common visual spoofing characters in scientific names and identifiers. A graph alias guard that catches only uppercase Greek leaves a realistic mixed-script bypass.
This keeps PR #422 focused on multilingual alias quality while strengthening the reviewer-facing graph/recommendation safety contract.
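
To make the bypass concrete, the sketch below contrasts an uppercase-only Greek range with a script-wide match. Both patterns are illustrative assumptions and not the expressions used in index.js.

```javascript
// Hypothetical patterns for illustration only.
const UPPERCASE_GREEK_ONLY = /[\u0391-\u03A9]/; // Greek capital alpha..omega
const ANY_GREEK = /\p{Script=Greek}/u;          // any Greek-script character

const spoofedAlias = 'CRISPR-C\u03B1s9'; // Greek small alpha in place of Latin "a"

console.log(UPPERCASE_GREEK_ONLY.test(spoofedAlias)); // false: the spoof slips through
console.log(ANY_GREEK.test(spoofedAlias));            // true: lowercase alpha is caught
```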

Validation refreshed locally:

Confirmed the new regression failed before implementation with suppress-recommendation instead of hold-for-curator-review.
npm test -> 14 multilingual entity alias guard tests passed.
npm run check -> test, demo, and video generation passed.
npm run demo -> regenerated alias packet/report/SVG with held curator-review mentions now 3.
npm run video -> regenerated reports/demo.mp4.
node --check passed for index.js, demo.js, and test.js.
ffprobe verified reports/demo.mp4 as H.264, 1280x720, 30 fps, 4.0s, 46,481 bytes.
git diff --check and git diff --cached --check passed; Git only reported Windows line-ending normalization warnings.
Focused sensitive scan returned no payout, credential, or token strings.
Expanded private-term scan only matched explicit safety-boundary wording in docs/report.
GitHub PR merge state after push: CLEAN; no checks are reported for this branch.

KoiosSG · 2026-05-30T10:13:58Z

Follow-up competitive hardening pass for the multilingual entity alias guard.

What changed in 2b9700d:

Added sparse ontology/corpus regressions for omitted localizedNames, mentions, and homographs.
Entity packets now emit empty localized-name/alternate-name evidence instead of crashing on partial ontology exports.
Missing mention lists produce deterministic empty review evidence, and missing homograph policy defaults to an empty policy.
Demo/docs now include reports/sparse-alias-guard-packet.json for reviewer inspection.
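
The defaulting behaviour listed above might look roughly like this. `readSparseCorpus` and the returned field names are hypothetical, and `localizedNames` is assumed to be a language-keyed object of string arrays; `alternateName` is the schema.org property name.

```javascript
// Hypothetical sketch: default omitted sections so downstream policy can run.
function readSparseCorpus(corpus) {
  const entities = Array.isArray(corpus.entities) ? corpus.entities : [];
  const mentions = Array.isArray(corpus.mentions) ? corpus.mentions : [];
  const homographs = corpus.homographs ?? {}; // missing policy -> empty policy

  const entityPackets = entities.map((entity) => ({
    '@id': entity.id,
    // Missing localizedNames yields an empty alternateName list, not a throw.
    alternateName: Object.values(entity.localizedNames ?? {})
      .flatMap((names) => (Array.isArray(names) ? names : [])),
  }));

  return { entityPackets, mentions, homographs };
}

console.log(readSparseCorpus({ entities: [{ id: 'entity:gene-therapy' }] }));
```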

Why this matters:

Knowledge graph imports often arrive from partial ontology exports or incremental corpus batches. The alias guard should emit auditable graph evidence rather than crash before curator/recommendation policy can run.
This keeps PR #422 focused on multilingual alias quality while making it more robust than a normalization-only slice.

Validation refreshed locally:

Confirmed localized-name regression failed before implementation with TypeError: Cannot convert undefined or null to object at Object.entries(entity.localizedNames).
Confirmed missing-homograph regression failed before implementation with TypeError: Cannot read properties of undefined.
npm test -> 17 multilingual entity alias guard tests passed.
npm run demo, npm run video, and npm run check passed.
ffprobe verified reports/demo.mp4 as H.264, 1280x720, 30 fps, 4.0s, 50,395 bytes.
git diff --check and git diff --cached --check passed; only Windows line-ending normalization warnings appeared.
Sensitive-term scan returned no payout, credential, or token strings.
GitHub PR merge state after push: CLEAN; no checks are reported for this branch.

KoiosSG · 2026-05-30T12:43:39Z

Hardening update pushed in 449d7d8.

This closes a graph-safety gap where the extractor could propose one canonical entity while multilingual alias lookup resolved the text to a different canonical entity. The guard now holds that disagreement as candidate-alias-conflict, emits a high-priority review-multilingual-candidate-alias-conflict curator action, and keeps the conflicted mention out of recommendation-safe entity IDs.
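
A minimal sketch of that disagreement check follows. The `reconcileCandidate` name and the field names (`candidateEntityId`, `curatorAction`, `safeEntityIds`) are assumptions; the reason and action strings are the ones named above.

```javascript
// Hypothetical sketch: hold when extractor and alias lookup disagree.
function reconcileCandidate(mention, aliasEntityId) {
  const candidate = mention.candidateEntityId ?? null;
  if (candidate && aliasEntityId && candidate !== aliasEntityId) {
    return {
      decision: 'hold-for-curator-review',
      reason: 'candidate-alias-conflict',
      curatorAction: {
        type: 'review-multilingual-candidate-alias-conflict',
        priority: 'high',
      },
      evidence: { candidate, aliasEntityId }, // both signals stay visible
      safeEntityIds: [],                      // neither ID is recommendation-safe
    };
  }
  if (!aliasEntityId) {
    return { decision: 'suppress-recommendation', safeEntityIds: [] };
  }
  return { decision: 'accept-canonical-entity', safeEntityIds: [aliasEntityId] };
}

console.log(
  reconcileCandidate({ candidateEntityId: 'entity:a' }, 'entity:b').decision
); // hold-for-curator-review
```

Returning both IDs in the evidence, and neither in the safe list, reflects the stated goal of not letting either signal silently override the other.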

Fresh validation from multilingual-entity-alias-guard/:

Red/green regression: before implementation, the conflicting candidate/alias fixture returned accept-canonical-entity; after implementation it returns hold-for-curator-review.
npm test passed: 18 tests.
npm run check passed, including demo and video generation.
npm run demo added reports/candidate-alias-conflict-packet.json for reviewer inspection.
ffprobe verified reports/demo.mp4 as H.264, 1280x720, 4s, 30fps, 50,395 bytes.
Parsed all JSON reports successfully: main packet 6 accepted/3 held/1 suppressed, conflict packet 0 accepted/1 held/0 suppressed, sparse packet 0/0/0.
git diff --check and git diff --cached --check passed; only Windows line-ending normalization warnings appeared.
Restricted-term scan of the module returned no matches, and report scanning found no unexpected private fixture terms.

This keeps #422 distinct from #379: #422 protects multilingual alias/entity recommendation correctness, while #379 covers geospatial field-sample provenance and safe location-edge publication.

KoiosSG · 2026-05-31T10:41:57Z

Hardening update pushed in 58d6051.

This closes two malformed-evidence crash paths in the multilingual alias guard:

malformed localized-name entries are omitted from alias lookup and JSON-LD alternate names, with aliasEvidenceIssues preserved for review instead of crashing alias indexing;
malformed mention text now emits a malformed-mention-text curator hold with high-priority review-multilingual-malformed-mention, keeping it out of recommendation-safe IDs.
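
The first of these two paths, omitting malformed localized-name entries while recording them, could be sketched like this. `indexLocalizedNames`, the `malformed-localized-name` issue label, and the assumed `localizedNames` shape are illustrative; `aliasEvidenceIssues` is the field named above.

```javascript
// Hypothetical sketch: skip non-string alias evidence but keep a record of it.
function indexLocalizedNames(localizedNames) {
  const aliases = [];
  const aliasEvidenceIssues = [];
  for (const [lang, names] of Object.entries(localizedNames ?? {})) {
    for (const name of Array.isArray(names) ? names : [names]) {
      if (typeof name === 'string' && name.trim() !== '') {
        // Safe to normalize: name is known to be a string here.
        aliases.push({ lang, text: name.normalize('NFKC') });
      } else {
        aliasEvidenceIssues.push({ lang, issue: 'malformed-localized-name' });
      }
    }
  }
  return { aliases, aliasEvidenceIssues };
}

console.log(indexLocalizedNames({ es: ['terapia génica', 42, null] }));
```

Checking the type before calling `normalize` is what removes the `term.normalize is not a function` failure, and the issues array keeps the dropped evidence reviewable.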

Fresh validation from multilingual-entity-alias-guard/:

Red regressions first: malformed localized-name and malformed mention-text fixtures both failed before the fix with TypeError: term.normalize is not a function.
npm test passed: 20 tests.
npm run check passed, including demo and video generation.
Added reviewer packets: reports/malformed-alias-evidence-packet.json and reports/malformed-mention-text-packet.json.
Parsed all JSON reports successfully.
ffprobe verified reports/demo.mp4 as H.264, 1280x720, 4s, 30fps, 50,413 bytes.
node --check, git diff --check, git diff --cached --check, and the focused credential/payout/token scan passed.

This keeps #422 focused on multilingual alias/entity recommendation correctness and distinct from #379 geospatial provenance and #515 organism/strain boundary work.

KoiosSG · 2026-06-01T19:03:28Z

Pushed focused hardening commit ad8ee9b for malformed alias mention rows.

The new regression now holds mentions: [null] with malformed-mention-entry, a high-priority review-multilingual-malformed-mention action, and malformed-mention-entry-packet.json instead of crashing before graph packets are produced.

Verification passed:

Red regression captured first.
npm test (21 tests).
npm run check.
node --check.
6 JSON packet parses.
ffprobe: H.264, 1280x720, 30fps, 4s.
Diff checks.
Staged allowlist check.
Focused restricted-string scan.

PR body has the refreshed evidence list.
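
A minimal sketch of screening mention rows before normalization, covering both the malformed-entry and malformed-text holds. `screenMention`, the `continue` decision, and treating empty text as malformed are assumptions made for the example.

```javascript
// Hypothetical sketch: classify malformed mention rows instead of throwing.
function screenMention(mention) {
  const hold = (reason) => ({
    decision: 'hold-for-curator-review',
    reason,
    curatorAction: { type: 'review-multilingual-malformed-mention', priority: 'high' },
  });
  if (mention === null || typeof mention !== 'object' || Array.isArray(mention)) {
    return hold('malformed-mention-entry'); // e.g. mentions: [null]
  }
  if (typeof mention.text !== 'string' || mention.text.trim() === '') {
    return hold('malformed-mention-text');  // assumption: empty text is malformed too
  }
  return { decision: 'continue', mention }; // hypothetical pass-through marker
}

console.log([null, { text: 42 }, { text: 'terapia génica', language: 'es' }]
  .map((row) => screenMention(row).reason ?? 'ok'));
// [ 'malformed-mention-entry', 'malformed-mention-text', 'ok' ]
```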

KoiosSG · 2026-06-03T23:43:21Z

Hardened malformed top-level corpus packets in 44c44c3.

What changed:

evaluateAliasGuard(null) now emits a deterministic malformed-corpus-packet curator hold instead of crashing at corpus.entities before graph evidence generation.
Added reports/malformed-corpus-packet.json plus README/report/acceptance-note coverage so the malformed-corpus path is reviewable.
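
The top-level guard can be pictured with the sketch below. It is not the body of evaluateAliasGuard: `evaluateAliasGuardSketch` and the packet fields are hypothetical, and only the hold and action names are taken from this update.

```javascript
// Hypothetical sketch: validate the corpus packet before touching corpus.entities.
function evaluateAliasGuardSketch(corpus) {
  if (corpus === null || typeof corpus !== 'object' || Array.isArray(corpus)) {
    return {
      holds: [{ reason: 'malformed-corpus-packet' }],
      curatorActions: [
        { type: 'review-multilingual-malformed-corpus', priority: 'high' },
      ],
      entityPackets: [],
      safeRecommendationIds: [],
    };
  }
  const entities = Array.isArray(corpus.entities) ? corpus.entities : [];
  return {
    holds: [],
    curatorActions: [],
    entityPackets: entities.map((entity) => ({ '@id': entity.id })),
    safeRecommendationIds: [],
  };
}

console.log(evaluateAliasGuardSketch(null).holds[0].reason); // malformed-corpus-packet
```

Returning a fixed-shape packet for bad input keeps the output deterministic, so the malformed path can be captured as a report file and diffed like any other packet.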

Verified locally:

Red regression first reproduced TypeError: Cannot read properties of null (reading 'entities').
npm test passed: 22 tests.
npm run check passed test, demo, and video generation.
node --check passed for index.js, demo.js, and test.js.
Parsed all 7 generated JSON packets, including malformed-corpus-packet.json.
ffprobe verified reports/demo.mp4 as H.264, 1280x720, 30 fps, 4s, 120 frames, 50,413 bytes.
git diff --check, git diff --cached --check, staged allowlist validation, and focused credential/payout/token scan passed.

Add multilingual entity alias guard

c2e6b97

KoiosSG mentioned this pull request May 28, 2026

Scientific Knowledge Graph Integration #17

Open

taherdhanera mentioned this pull request May 28, 2026

Add geospatial sample provenance guard #379

Open

Harden multilingual alias collision handling

f8eb8dd

algora-pbc Bot added the 🙋 Bounty claim label May 28, 2026

Harden multilingual alias normalization

1c90584

Harden multilingual language tag lookup

b912229

Support regional language alias lookup

90f1648

Require alias confidence evidence

8011d06

Normalize underscored language regions

f90337e

Hold mixed-script multilingual aliases

e7ffd6c

Detect lowercase Greek alias spoofs

2e4c822

Handle sparse multilingual alias payloads

2b9700d

Harden multilingual alias candidate conflicts

449d7d8

Harden malformed alias evidence

58d6051

Harden malformed alias mention entries

ad8ee9b

Harden malformed alias corpus packets

44c44c3
