feat(pattern): dictionary regrouping + 14-country jurisdiction-scoped pattern pack #282
Merged
Dictionaries are now grouped by scope rather than thrown into
one `world/` bucket:
- `assets/dictionaries/world/` holds language-agnostic brand-name
  lists. It now contains only `cryptocurrencies` (BTC, Ethereum,
  Tether, …; the same strings in every language).
- `assets/dictionaries/en/` holds terms written in English that
  must be translated when the document language changes:
  - `finance/currencies` (USD, US Dollar, …)
  - `personal/languages` (English, en, French, fr, …)
  - `personal/nationalities` (American, French, Japanese, …)
  - `personal/religions` (Christian, Muslim, Buddhist, …)
Each English dictionary TOML declares `languages = ["en"]`, so
the recognizer filters them out at runtime when the caller
asserts a non-English document via `RecognizerInput.language`.
Future Spanish, French, German variants land at
`assets/dictionaries/<lang>/` without touching the runtime.
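A minimal sketch of the language filter described above. The `Dictionary` struct and both functions are hypothetical stand-ins, not the crate's real types; it also assumes that `world/` dictionaries declare no languages and that a caller who asserts no language gets every dictionary.

```rust
/// Hypothetical stand-in for a parsed dictionary TOML.
struct Dictionary {
    name: &'static str,
    /// Assumption: an empty list marks a language-agnostic dictionary.
    languages: Vec<&'static str>,
}

/// Keep a dictionary when it is language-agnostic, when the caller asserted
/// no language, or when the asserted language is one it declares.
fn applies_to(dict: &Dictionary, document_language: Option<&str>) -> bool {
    match document_language {
        None => true,
        Some(lang) => {
            dict.languages.is_empty()
                || dict.languages.iter().any(|l| l.eq_ignore_ascii_case(lang))
        }
    }
}

/// Names of the dictionaries that stay active for the given language.
fn active_dictionaries(all: &[Dictionary], lang: Option<&str>) -> Vec<&'static str> {
    all.iter()
        .filter(|d| applies_to(d, lang))
        .map(|d| d.name)
        .collect()
}
```

With this shape, adding a `de/` tree only means shipping TOMLs that declare `languages = ["de"]`; the filter itself does not change.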
Scores: lowered the dictionary baseline from `0.85` to `0.4`
(`[0.85, 0.30, 0.85]` → `[0.4, 0.2, 0.4]` on the multi-column
languages list) to match Microsoft Presidio's typical
deny-list-recognizer baseline. At 0.85 every "American" or
"Christian" mention in any English document flooded results;
at 0.4 the context-enhancer pipeline lifts them when nearby
keywords like "nationality" / "religion" / "currency" fire.
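To illustrate why a low baseline plus context works, here is a rough sketch of a keyword-driven score lift. The function name, the `CONTEXT_BOOST` value, and the whole-text keyword search are assumptions made for illustration; the real context enhancer presumably looks at a window around the match.

```rust
/// Baseline from this commit for dictionary hits.
const DICTIONARY_BASELINE: f64 = 0.4;
/// Illustrative boost only; the actual enhancer's factor is not stated here.
const CONTEXT_BOOST: f64 = 0.35;

/// Lift the score when any context keyword appears in the text
/// (case-insensitive), capping the result at 1.0.
fn score_with_context(baseline: f64, text: &str, keywords: &[&str]) -> f64 {
    let lower = text.to_lowercase();
    if keywords.iter().any(|k| lower.contains(&k.to_lowercase())) {
        (baseline + CONTEXT_BOOST).min(1.0)
    } else {
        baseline
    }
}
```

Under these assumptions, "Nationality: American" scores above the bare mention in "An American diner", which stays at the 0.4 baseline.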
`shipped::dictionaries::{en,world}` module re-shuffled to
match the asset tree.
Dropped count-asserting tests on the `all()` functions in
both `shipped::patterns` and `shipped::dictionaries` — they
break every time a new pattern or dictionary ships without
catching any real bug; replaced with a single
`every_shipped_pattern_has_variants` / `every_shipped_dictionary_parses`
loop.
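A sketch of the invariant-style test that replaces the count assertions. The `Pattern` struct, the contents of `all()`, and the helper are hypothetical; only the test name comes from the commit.

```rust
/// Hypothetical shape of a shipped pattern.
struct Pattern {
    name: &'static str,
    variants: Vec<&'static str>,
}

/// Stand-in for `shipped::patterns::all()`.
fn all() -> Vec<Pattern> {
    vec![
        Pattern { name: "de.plz", variants: vec![r"\b\d{5}\b"] },
        Pattern { name: "ca.sin", variants: vec![r"\b\d{3}[- ]?\d{3}[- ]?\d{3}\b"] },
    ]
}

/// Names of patterns that ship without any variant.
fn patterns_without_variants(patterns: &[Pattern]) -> Vec<&'static str> {
    patterns
        .iter()
        .filter(|p| p.variants.is_empty())
        .map(|p| p.name)
        .collect()
}

#[cfg(test)]
mod tests {
    use super::*;

    // Checks a property of every entry instead of asserting `all().len()`,
    // so adding a pattern never requires touching the test.
    #[test]
    fn every_shipped_pattern_has_variants() {
        let offenders = patterns_without_variants(&all());
        assert!(offenders.is_empty(), "patterns without variants: {offenders:?}");
    }
}
```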
Asset docs:
- Added `assets/README.md`: walks the directory tree, lists
the score-tier conventions, names the built-in validators,
points to PRESIDIO.md.
- Renamed `assets/NOTICE.md` to `assets/PRESIDIO.md` and
retitled it from "third-party attribution" to "Presidio
attribution" — the file always was just the upstream MIT
license requirement; the new name says so.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Ports 13 DE pattern TOMLs (identity, health, vehicle, contact, finance) and 9 checksum validators covering BSNR, LANR, KVNR (§290 SGB V), nPA + legacy ID card (ICAO 7-3-1), passport, RVNR (VKVV §4), Steuer-IdNr (ISO 7064 Mod 11,10), USt-IdNr, and PLZ sentinel-range rejection.
Introduces a `company_id` builtin label for public company registers (Handelsregister, Companies House, etc.); the existing `internal_id` label describes operator-defined IDs and misrepresented these.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
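For reference, the ISO 7064 Mod 11,10 check named for the Steuer-IdNr can be sketched as follows. The function name is hypothetical, and this covers only the check digit; the structural rules on digit repetition that a full validator would also apply are left out.

```rust
/// ISO 7064 Mod 11,10 check over an 11-digit Steuer-IdNr.
/// Sketch only: digit-repetition rules are not checked here.
fn steuer_idnr_checksum_ok(input: &str) -> bool {
    // Accept exactly 11 ASCII digits.
    let digits: Vec<u32> = match input
        .chars()
        .map(|c| c.to_digit(10))
        .collect::<Option<Vec<_>>>()
    {
        Some(d) if d.len() == 11 => d,
        _ => return false,
    };

    // Hybrid recursion: sum mod 10 (with 0 mapped to 10), then doubled mod 11.
    let mut product = 10;
    for &d in &digits[..10] {
        let mut sum = (d + product) % 10;
        if sum == 0 {
            sum = 10;
        }
        product = (sum * 2) % 11;
    }

    // `product` is always in 1..=10, so the subtraction cannot underflow.
    let check = match 11 - product {
        10 => 0,
        c => c,
    };
    check == digits[10]
}
```

`86095742719` is a commonly used test number that passes this check; changing its last digit makes it fail.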
The new Docs CI job mirrors the Check/Clippy job layout. Broken intra-doc links and stale paths slipped through prior PRs because no CI step ran rustdoc; this closes the gap.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…+ validators
Ports 28 pattern TOMLs across 7 countries with 21 new checksum validators covering the well-documented national identifiers:
- ES: NIF/DNI (Mod-23), NIE (XYZ prefix → Mod-23), CIF (entity-class control char), passport, postal code.
- IT: Codice Fiscale (odd/even-mapped Mod-26 with omocodia-aware regex), Partita IVA (Luhn-like), carta d'identità (paper + CIE 2.0/3.0), passport, patente.
- PL: PESEL (weighted Mod-10), NIP (Mod-11), REGON (9/14-digit Mod-11), kod pocztowy.
- AU: ABN (Mod-89 with leading-digit adjustment), ACN (Mod-10), Medicare (Mod-10 weighted), TFN (Mod-11).
- CA: SIN (Luhn, 0/8 reserved prefix), postal code.
- FI: HETU (Mod-31 control-char lookup, century separator).
- SE: personnummer (Luhn + date validity incl. samordningsnummer), organisationsnummer (Luhn + third-digit rule), postnummer.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
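As one concrete instance from the list above, the Spanish DNI Mod-23 check and the NIE prefix mapping can be sketched like this. The function names are hypothetical and the sketch expects the compact 9-character form with no separators.

```rust
/// Control letters indexed by `number % 23`.
const CONTROL_LETTERS: &[u8; 23] = b"TRWAGMYFPDXBNJZSQVHLCKE";

/// DNI: 8 digits followed by a control letter chosen by Mod-23.
fn dni_ok(input: &str) -> bool {
    let bytes = input.as_bytes();
    if bytes.len() != 9 || !bytes[..8].iter().all(u8::is_ascii_digit) {
        return false;
    }
    // Safe to slice and parse: the first 8 bytes are ASCII digits.
    let number: u32 = match input[..8].parse() {
        Ok(n) => n,
        Err(_) => return false,
    };
    CONTROL_LETTERS[(number % 23) as usize] == bytes[8].to_ascii_uppercase()
}

/// NIE: leading X/Y/Z maps to 0/1/2, then the DNI rule applies.
fn nie_ok(input: &str) -> bool {
    let bytes = input.as_bytes();
    if bytes.len() != 9 {
        return false;
    }
    let prefix = match bytes[0].to_ascii_uppercase() {
        b'X' => '0',
        b'Y' => '1',
        b'Z' => '2',
        _ => return false,
    };
    // The first byte is ASCII, so index 1 is a valid char boundary.
    let mut mapped = String::with_capacity(9);
    mapped.push(prefix);
    mapped.push_str(&input[1..]);
    dni_ok(&mapped)
}
```

The textbook examples `12345678Z` (DNI) and `X1234567L` (NIE) both pass.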
Ports 11 pattern TOMLs across India and South Korea, with 7 new checksum validators including the first non-Western algorithms:
- IN: Aadhaar (Verhoeff dihedral checksum + palindrome rejection + leading-digit >= 2), PAN (10-char structural format with entity-class letter at position 4), GSTIN (15-char state + PAN + base-36 weighted check), passport, voter ID (EPIC), vehicle registration.
- KR: RRN (13-digit weighted Mod-11 with region code + date), FRN (RRN shape with gender 5-8 and (13 - sum) Mod-10), BRN (magic-keys [1,3,7,1,3,7,1,3,5] Mod-10 with special 9th-digit handling), passport, driver license (region-code allowlist).
The IN module is declared as `r#in` since `in` is a Rust keyword; the dotted validator name (`"in.aadhaar"`) and the asset directory (`in/`) keep the ISO 3166 alpha-2 convention.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
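The Verhoeff checksum mentioned for Aadhaar can be sketched with the standard dihedral-group tables. The function name `verhoeff_ok` is hypothetical, and the Aadhaar-specific rules (palindrome rejection, leading digit) are deliberately left out.

```rust
/// Multiplication table of the dihedral group D5.
const D: [[u8; 10]; 10] = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
];

/// Position-dependent permutation table (period 8).
const P: [[u8; 10]; 8] = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
];

/// True when the digit string (check digit last) passes Verhoeff.
fn verhoeff_ok(input: &str) -> bool {
    if input.is_empty() {
        return false;
    }
    let mut c = 0usize;
    // Walk right to left; position 0 is the check digit itself.
    for (i, ch) in input.chars().rev().enumerate() {
        let Some(digit) = ch.to_digit(10) else {
            return false;
        };
        c = D[c][P[i % 8][digit as usize] as usize] as usize;
    }
    c == 0
}
```

Because the tables are independent of the identifier, the same function can back more than one validator.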
Ports 11 pattern TOMLs across Singapore, Turkey, Nigeria, and Thailand with 7 new checksum validators, closing the deferred country list:
- SG: NRIC/FIN (weighted Mod-11 with prefix-specific letter table; S/T, F/G, M each get their own), UEN (three formats A/B/C with format-specific weight vectors and alphabets; format C enforces an entity-type allowlist), 6-digit postal code.
- TR: TC Kimlik No (TCKN, 11-digit two-step weighted checksum per Nüfus ve Vatandaşlık İşleri spec), license plate (province 01-81 + Turkish-alphabet letters + serial), 5-digit posta kodu.
- NG: NIN (11-digit Verhoeff checksum per NIMC), 2011-format vehicle plate.
- TH: National ID (13-digit weighted Mod-11; sourced directly from Department of Provincial Administration spec; no Presidio recognizer existed), 5-digit postal code.
Promotes Verhoeff helper from `validators/in/verhoeff.rs` to shared `validators/verhoeff.rs`; both IN Aadhaar and NG NIN use the same dihedral-group tables.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
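The two-step TCKN checksum can be sketched as below, following the commonly published form of the algorithm. The function name is hypothetical and the validator in this PR may differ in details.

```rust
/// TCKN: 11 digits, first digit non-zero, two derived check digits.
fn tckn_ok(input: &str) -> bool {
    let d: Vec<u32> = match input
        .chars()
        .map(|c| c.to_digit(10))
        .collect::<Option<Vec<_>>>()
    {
        Some(d) if d.len() == 11 && d[0] != 0 => d,
        _ => return false,
    };

    // Step 1: 10th digit from the odd and even positions (1-based) of the first 9.
    let odd = d[0] + d[2] + d[4] + d[6] + d[8];
    let even = d[1] + d[3] + d[5] + d[7];
    // `even` is at most 36, so adding 100 keeps the unsigned subtraction
    // non-negative without changing the result modulo 10.
    let tenth = (odd * 7 + 100 - even) % 10;

    // Step 2: 11th digit is the sum of the first 10 digits modulo 10.
    let eleventh = d[..10].iter().sum::<u32>() % 10;

    tenth == d[9] && eleventh == d[10]
}
```

`10000000146` is a frequently used test value that satisfies both steps.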
The previous IBAN validator only checked the ISO 13616 mod-97 checksum, so it accepted any country code at any length. `iban_validate` ships the SWIFT IBAN registry on top of the same checksum, so per-country length and BBAN structure are enforced (e.g. DE = 22 chars, GB = 22 chars, the right alphabet for the bank code, etc.) and the country list stays current as the registry adds new members.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
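To make the difference concrete, here is a stdlib-only sketch of a bare mod-97 check next to a length-aware check. It does not use or reproduce the `iban_validate` API; the function names are hypothetical and the length table holds only the two lengths named above, whereas the real registry covers every member country and the BBAN structure as well.

```rust
/// Bare mod-97 check: move the first four characters to the end, map
/// letters to 10..=35, and require the remainder to be 1.
fn mod97_ok(iban: &str) -> bool {
    if iban.len() < 5 || !iban.bytes().all(|b| b.is_ascii_alphanumeric()) {
        return false;
    }
    let (head, tail) = iban.split_at(4);
    let mut rem: u32 = 0;
    for b in tail.bytes().chain(head.bytes()) {
        let b = b.to_ascii_uppercase();
        rem = if b.is_ascii_digit() {
            (rem * 10 + u32::from(b - b'0')) % 97
        } else {
            // Letters expand to two decimal digits.
            (rem * 100 + u32::from(b - b'A') + 10) % 97
        };
    }
    rem == 1
}

/// Illustrative excerpt of a per-country length table (assumption:
/// unknown countries are rejected in this sketch).
fn expected_len(country: &str) -> Option<usize> {
    match country {
        "DE" | "GB" => Some(22),
        _ => None,
    }
}

/// Checksum plus country-specific length.
fn iban_ok(iban: &str) -> bool {
    let Some(country) = iban.get(..2) else {
        return false;
    };
    expected_len(&country.to_ascii_uppercase()) == Some(iban.len()) && mod97_ok(iban)
}
```

The bare check has no notion of which country codes exist or how long their IBANs are, which is the gap the registry-backed crate closes.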
Summary
- … `world/` for universal brand names, `en/` for English-only personal/finance terms), lowers their baseline confidence to 0.4
- … `iban_validate` crate
- … `Docs` CI job to gate on `RUSTDOCFLAGS="-D warnings" cargo doc --workspace --no-deps`

Validators implementing real published checksums
- … `iban_validate`), phone (already `phonenumber`), date, Bitcoin (Base58Check/Bech32), Verhoeff (shared by IN/NG)

A new `company_id` entity label was added for public company-registry identifiers (Handelsregister, ACRA UEN, Bolagsverket, etc.), replacing the misleading `internal_id` mapping.

Test plan
- `cargo test --workspace`: green
- `cargo clippy --workspace --all-targets -- -D warnings`: clean
- `cargo deny check`: advisories, bans, licenses, sources all OK
- … `testdata/inputs/<country>/`) + `tests/builtin_<country>.rs`

🤖 Generated with Claude Code