refactor + feat: code quality + Presidio-aligned shipped pattern set + chunk context propagation by martsokha · Pull Request #279 · nvisycom/runtime

martsokha · 2026-06-14T19:27:24Z

Summary

13 commits split into three loose groups; full list at the bottom.

Architecture / refactor

Extract nvisy-context crate. Engine wires the per-label keyword-boost enhancer as a wrapping layer around recognizers. Context owns BoostRule, Enhancer, SubstringMatcher, LemmaMatcher, ContextEnhanced<R>, Tokens/Token.
nvisy-context module split. enhancer/{mod,context,window}.rs, matching/{mod,matcher,lemma}.rs, io/{mod,tokens,wrapper}.rs. Public surface stays at the crate root via re-exports.
Rename Detector→Pattern→Regex in nvisy-pattern. Detection rule is Regex (with Vec<Variant>), matching Presidio's recognizer shape. Inline PatternRegistry into PatternRecognizerBuilder.
Country scope. Add CountryCode to nvisy-core::primitive (ISO 3166-1 alpha-2 via celes); RecognizerInput.country: Option<CountryCode> + applies_to_country; per-call filtering inside PatternRecognizer::recognize. Shipped pattern + dictionary assets reorganized into world/, us/, uk/ subtrees.
Per-language context keywords with primary-subtag matching via LanguageTag::matches.
SuppressionLayer in nvisy-toolkit::deduplication — allow-list false-positive filter slotted between fuse and resolve.

Presidio-aligned shipped pattern set

Audited every shipped pattern (15 world + 10 US + 6 UK) against Microsoft Presidio. Findings written up and addressed.
Tier 1 correctness bugs: IBAN regex accepts UK/IE/MT/hyphenated forms; private_key matches the full PEM block; us/medical_license gains \b anchors; uk.nino rejects O at position 0; us/passport gains context; us/postal_code drops score 0.5→0.1 + gains context + us.postal_code validator that rejects 00000.
Tier 2 coverage: bitcoin Taproot (Bech32m up to 62 chars) + Base58Check validator via bs58; credit_card adds Mastercard 2-series + score 0.5→0.3 (Presidio parity); aws_key broadens to 7 prefixes + secret-key variant; github_pat_; generic_api_key accepts whitespace separator (Authorization: Bearer); uk DL + vehicle-reg validators (Presidio parity).
world/phone validator now uses the phonenumber crate (region-aware libphonenumber port) instead of a regex + length check.
3 UK validators added: uk.driving_licence, uk.vehicle_registration, plus existing uk.nhs + uk.nino extended.
Drive-by: rig 0.38's Gemini provider now implements CompletionClient, so the gemini::completion::CompletionModel::new workaround is gone.

Chunk-level context propagation

RecognizerInput.context_hints: Vec<String> — out-of-band context strings the enhancer should treat as in-context. Threaded through ContextEnhanced wrapper.
Enhancer::enhance takes a Context bundle (text + tokens + language + hints) instead of four loose args. Hint-path runs as a fallback when the in-text word window doesn't fire; at most one boost per rule per entity.
Chunk<M> gains pub hints: Vec<String> populated by next_chunk alongside data and location. No new Handler trait method, no double traversal.
CSV handler populates from chunk.location.column_name.
HTML handler populates from the nearest block-level ancestor's text (curated is_block_element set — html5ever/scraper don't expose one). <p>...the payment card <code>4111…</code> is on file</p> gives the <code> chunk a hint of "the payment card is on file".
JSON handler populates from the enclosing object key (threaded through parse_value/parse_array; array elements inherit the containing key).
Toolkit detect() reads chunk.hints directly and forwards to RecognizerInput::with_context_hints.

Tests + style

tests/shipped_detection.rs split into per-region binaries (builtin.rs, builtin_us.rs, builtin_uk.rs) with shared tests/fixtures/mod.rs. 11 region+domain tests; substring + label assertions with #[track_caller].
testdata/inputs/ reorganized to mirror the asset tree: inputs/{contact,credentials,finance,network,personal}.txt (world) + inputs/us/{identity,finance,health}.txt + inputs/uk/{identity,contact,vehicle}.txt.
Workspace-wide import-style refactor: hoisted ~94 inline fully-qualified paths (std + crate paths) into use statements across 51 files. Exceptions for #[async_trait::async_trait] and tracing::* macros/attrs preserved.

Test plan

cargo test --workspace — all green
cargo clippy --workspace --all-targets -- -D warnings
RUSTDOCFLAGS="-D warnings" cargo doc --workspace --no-deps
cargo test -p nvisy-pattern --doc
e2e codec tests: CSV, HTML, JSON, XLSX, DOCX, PDF, TXT — all detect payment_card (CC=0.3) via the new context-hint path
17 enhancer unit tests (including 2 new hint-path tests)
11 shipped_detection e2e tests across 3 region binaries

Commits

e34babce refactor: hoist fully-qualified paths into use imports
c63a0e40 feat(codec,toolkit): chunk-level context hints for HTML + JSON
2137db58 feat(pattern,context): Presidio-aligned audit fixes + tabular context propagation
ae6a6409 test(pattern): split shipped_detection into per-region e2e binaries
85940d9f feat(pattern,core): country scope + Presidio-aligned shipped pattern set
40d49750 feat(toolkit): SuppressionLayer with allow-list false-positive filter
756a9fa0 feat(pattern,context): per-language context keywords + primary-subtag matching
ae12d261 refactor(pattern,context): rename Boosting→ContextEnhanced, split build()
628227cf refactor(pattern): inline VariantBuilder, drop Terms, normalize all docs
111bbac0 refactor(pattern,context): rename Pattern→Regex, inline registries, normalize docs
682913cb style: cargo fmt
534c1c73 chore(deps): clean up workspace deps + normalize per-crate manifests
f25f21cb refactor(context): extract nvisy-context crate + wire engine enhancer pass

🤖 Generated with Claude Code

… pass Lifts crate::context out of nvisy-core into a sibling nvisy-context crate so the SDK base stays primitives-only for third-party recognizer authors. Adds NerRecognizer::context_registry (mirroring PatternRegistry::context_registry) and wires ContextEnhancer into DetectionPhase: build_for_request now returns DetectionResources { recognizers, enhancer }, the enhancer runs in block-local coordinates between recognizer dispatch and modality lifting, and the substring path runs by default (Tokens artifact wiring follows when an NlpEngine is plumbed in). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Drops 9 unused workspace deps (hmac, include_dir, quick-xml, reqwest, serde_with, smallvec, stop-words, walkdir, zip) and reorders the root [workspace.dependencies] foundation-first: primitives → runtime → domain (text/document/image/audio) → integration (HTTP, AI, server, CLI) → storage → utilities. Removes per-crate machete-flagged deps and aligns every crate manifest with the new group names and order. Keeps calamine and unicode-segmentation in workspace deps for upcoming xlsx + word-boundary work. Marks humantime-serde as ignored in nvisy-llm/nvisy-engine/nvisy-server where it's used via serde `with =` strings. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ormalize docs - nvisy-pattern: rename Detector→Pattern→Regex; inline PatternRegistry into PatternRecognizerBuilder; split CompiledPattern out; export built-in validators by bare-noun names (luhn, iban, ssn, phone, date); add Scoring::get + per-column resolution; convert pattern assets to TOML; normalize module/function docs (returns-form for predicates, reference-form doc-links, # Errors + code examples for public types). - nvisy-context: extract registry/declaration into rule + wrapper; trim enhancer/matcher/tokens surface. - nvisy-toolkit: drop stale PatternRegistry usage in pipeline example; fix broken rustdoc links in redaction module. - nvisy-engine, nvisy-ner: knock-on updates for the new pattern and context surfaces. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Variant: replace derive_builder with `new(regex)?` + `with_score` / `with_validator` chain (matches Term::new style). - Drop the Terms newtype; Dictionary::terms is `Vec<Term>` and the parsers move to associated fns on Term: - Term::from_text(&str) -> Vec<Term> (infallible) - Term::from_csv(&str) -> Result<Vec<Term>, Error> Signatures now match Regex::from_toml / Dictionary::from_toml. - Rewrite every public-item docblock in nvisy-pattern for a consistent style: noun-phrase openers for types, imperative for constructors/setters, returns-form for predicates, reference-form doc-links at the bottom, `# Errors` only where fallible, code examples on top-level types. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ld() - nvisy-context: `Boosting<R>` → `ContextEnhanced<R>` (more self-descriptive; reads as "an R that's been context-enhanced"). - nvisy-pattern: `PatternRecognizerBuilder::build()` now returns the bare `PatternRecognizer`; the wrapped form moves to `build_context_enhanced() -> ContextEnhanced<PatternRecognizer>`. Callers opt into the keyword-boost layer explicitly. - Engine config, shipped-detection / user-rules / enhancer roundtrip tests, toolkit fixtures + example flipped to `build_context_enhanced()` to preserve prior behavior. - README + module/struct docs rewritten to describe both methods without historical framing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… matching - nvisy-pattern: new `Context` enum (Global | PerLanguage) replaces `Vec<String>` on Regex and Dictionary. Untagged serde keeps the flat TOML form working unchanged; new form is `[context.en] = [...]`. Shipped phone, credit_card, date_of_birth, datetime patterns now carry EN/ES/DE/FR keyword sets. - nvisy-context: BoostRule gains `language: Option<LanguageTag>`. Enhancer storage flips to `HashMap<Label, Vec<BoostRule>>` — one bucket per label, distinct language scopes inside. Enhancer::enhance takes a language hint; ContextEnhanced threads input.language through. - nvisy-core: LanguageTag::new returns nvisy_core::Error; LanguageTag::matches compares primary subtags case-insensitively so `en` matches `en-US` / `en-GB`. BoostRule::applies_to_language and RecognizerInput::applies_to_language switch from `==` to `matches()` so language-scoped rules fire under regional variants. - Tests: TOML round-trip both forms; per-language boost fires for matching language; no boost for non-matching language; no-hint unions all per-language keywords; regional variants (`en-US`) trigger `en`-scoped rules. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- New `nvisy-toolkit::deduplication::suppress::{SuppressionLayer, SuppressionParams}`. Three independent allow-list shapes apply by union: - `allow_values` — exact, ASCII case-insensitive - `allow_values_substring` — entity text contains the value - `allow_values_regex` — regex matched against entity text - Operates on the entity's resolved text via `TextAt::text_at`. Fail-open when the resolver returns `None`: keep the entity rather than silently drop something we can't verify. - Empty entries are filtered at construction (otherwise an empty substring would drop every entity via `str::contains("")`). - `LayerParams` gains a nested `suppression: SuppressionParams` field; `LayerPipeline::from_params` becomes fallible and inserts the new layer between fuse and resolve, growing the canonical recipe to: calibrate → filter → fuse → suppress → resolve. - Six `from_params` call sites updated (engine pipeline, engine tests, toolkit fixtures, toolkit example). - 15 unit tests in `suppress::tests` (exact / substring / regex modes, case insensitivity, partial-overlap semantics, empty- entry filtering, union across modes, unresolved-location fail-open, invalid-regex error). Plus a pipeline-order test in `pipeline::tests` that pins the architectural intent that fuse collapses before suppress sees, suppress drops before resolve adjudicates. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- nvisy-core: new `CountryCode` (ISO 3166-1 alpha-2, validated via `celes`). `RecognizerInput.country: Option<CountryCode>` + `applies_to_country` mirror the existing language scoping. - nvisy-pattern: `Regex.countries` / `Dictionary.countries` (Vec<CountryCode>, empty = world). `PatternRecognizer::recognize` honours per-call jurisdiction hints alongside language hints. - Asset tree reorganized into `world/`, `us/`, `uk/` subtrees; shipped accessors split into per-region modules (`shipped::patterns::{world,us,uk}`, dictionaries::world). Macro helpers exported as `__shipped_pattern` / `__shipped_dictionary` so sub-modules resolve their own include_str! paths. - Pattern count grows 23 → 34: world unchanged at 18; us 5 → 10 (+itin, npi, mbi, bank_account, medical_license); uk added at 6 (nhs, nino, driving_licence, postcode, vehicle_registration, passport). - Validators split into per-country sub-modules with dotted names (`us.ssn`, `us.aba_routing`, `us.npi`, `us.dea_number`, `uk.nhs`, `uk.nino`). Shared `luhn`, `iban`, `phone`, `date` stay flat. World pattern set extended (brand-aware credit_card, RFC5322-loose email, Cisco-form MAC, IPv4 CIDR, comprehensive IPv6 alternation set). - Pattern scores normalized to a single conservative-baseline scheme (most regex-only matches land at 0.1–0.5 before context boost). Confidence threshold in the toolkit test fixture lowered to 0.35 to match. - `assets/NOTICE.md` documents third-party regex provenance for the shipped pattern assets. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Restructure pattern crate end-to-end tests: - Replace single tests/shipped_detection.rs with three per-region binaries: tests/builtin.rs (5 world tests), tests/builtin_us.rs (3 US tests), tests/builtin_uk.rs (3 UK tests). 11 tests total. - Move shared scan + assert_match + assert_label_present helpers to tests/fixtures/mod.rs, declared via mod fixtures; from each binary. Both helpers carry #[track_caller] for better failure attribution. - Reshape testdata/inputs/ to mirror the asset tree: - world fixtures move from monolithic domain files into inputs/{contact,credentials,finance,network,personal}.txt - inputs/us/{identity,finance,health}.txt - inputs/uk/{identity,contact,vehicle}.txt (split from old uk.txt) - Each test scans one fixture, asserting substring + label matches against a recognizer loaded with every shipped pattern and dictionary via build_context_enhanced. - builtin_uk_identity asserts NATIONALITY (world dictionary firing on "British") to keep assert_label_present reachable across all three binaries. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… propagation Tier 1 (correctness bugs): - world/iban regex: extend middle groups from \d{4} to [A-Z0-9]{4} and separator \s? to [\s\-]?; accepts IBANs with letters past position 8 (UK NWBK, IE, MT, MK, GI) and hyphenated forms that the mod-97 validator was already prepared to handle. - world/private_key: match the full BEGIN..END PEM block instead of only the header line; add ENCRYPTED PRIVATE KEY, PGP, SSH2, and PuTTY-User-Key-File-{2,3} variants. - us/medical_license: add \b anchors to both DEA variants; prior pattern matched inside longer alphanumeric tokens. - uk.nino validator: reject O as the position-0 letter (HMRC reserved); the character class blocks D/F/I/Q/U/V but allows O via the j-p range. - us/passport: add Presidio's context = [passport, passport#, travel document, us passport, united states passport] so the 0.1-base pattern can boost above threshold. - us/postal_code: drop score 0.5 -> 0.1, add context, ship a us.postal_code validator that rejects 00000. Tier 2 (coverage + scoring): - world/bitcoin_address: split legacy (Base58) from Bech32; bump Bech32 cap {25,39}->{25,59} for Taproot (bc1p...). Add a crypto.btc validator using bs58::decode_check. - world/credit_card: add Mastercard 2-series (2221-2720); drop score 0.5 -> 0.3 to match Presidio's deliberate baseline that expects context boost to do the rest. - world/aws_key: broaden access-key ID prefix to also catch ASIA/AIDA/AROA/ANPA/AGPA/AIPA; add a second variant for the 40-char secret access key; ship Presidio-style context. - world/github_token: add github_pat_[A-Z0-9_]{82} variant for the fine-grained PAT format introduced in 2022. - world/generic_api_key: accept whitespace separator alongside [:=] so `Authorization: Bearer <token>` matches. - uk/driving_licence + uk/vehicle/registration: add the Presidio validators we'd left on the table (99999 surname rejection, age-ID range 02-29 / 51-79 for current-format plates). Validator infrastructure: - world/phone: replace the regex+length validator with a phonenumber-crate-backed region-aware validator. The validator parses E.164 directly and falls back to the caller-specified country (via RecognizerInput.country) when present. Introduces a workspace-wide phonenumber = 0.3 dependency. - ValidatorRegistry::with_simple convenience for the ten context-free validators; with() stays the canonical entry point for ctx-aware validators (only phone today). Context-enhancer architecture: - Add RecognizerInput.context_hints: Vec<String> for out-of-band context strings (CSV column headers, JSON keys, log field names) the caller wants treated as in-context. - nvisy-context::Enhancer::enhance now takes a Context bundle (text + tokens + language + hints) instead of four loose arguments. The hint path runs as a fallback when the in-text word window doesn't fire; at most one boost per rule per entity. - LiftedFromText in nvisy-toolkit gains chunk_hints; Tabular surfaces column_name as a hint so a `card` column header lifts a per-cell CC=0.3 match to ~0.65 via the existing boost pipeline (no synthetic score patching to clear the threshold). nvisy-context module split: - enhancer.rs -> enhancer/{mod, context, window}.rs - matcher.rs -> matching/{mod, matcher, lemma}.rs - tokens.rs + wrapper.rs -> io/{mod, tokens, wrapper}.rs - Public surface (Context, Enhancer, BoostRule, KeywordMatcher, SubstringMatcher, LemmaMatcher, ContextEnhanced, Token, Tokens) stays at the crate root via re-exports. - Drop 3 redundant enhancer tests (suffix-symmetry duplicate, unicode-too-distant duplicate, token-window symmetry); keep the 13 unique behaviors plus 2 new hint-path tests. Known gap: html_codec_e2e payment_card assertion fails because HTML chunks at text-node boundary and `<code>4111...</code>` loses the surrounding "payment card" context. The fix requires moving chunk_hints from LiftedFromText onto the Handler trait and overriding it on HtmlHandler to emit parent-element text as a hint. Tracked as follow-up. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Extend the tabular-cell hint mechanism to all chunked formats by making hints a first-class field on Chunk<M>, populated alongside data + location during next_chunk. Architecture: - Chunk<M> gains pub hints: Vec<String>. Hints are metadata the chunk's structural neighbours surface — CSV column headers, HTML parent-element text, JSON object keys — for downstream context-aware recognizers. - nvisy-toolkit detect() reads chunk.hints directly. The earlier LiftedFromText::chunk_hints and the briefly-added Handler::chunk_hints methods are both removed; the field on Chunk avoids a second handler call to recompute information next_chunk already had. - Handlers without useful out-of-band metadata (TXT, PDF, image, audio) initialise the field with Vec::new(). CSV handler: - Populates chunk.hints from chunk.location.column_name. Replaces the prior LiftedFromText::chunk_hints<TabularLocation> override, which had the same effect via a less-direct path. HTML handler: - RedactableItem gains pub hints: Vec<String>. The DOM walk in build_items computes a per-text-node hint by collecting the text of the node's nearest block-level ancestor (excluding the node's own text). nearest_block_ancestor walks parents until it finds a tag in is_block_element's curated set (p, div, li, td, th, h1-h6, blockquote, dt, dd, section, article, aside, header, footer, main, nav, figcaption, caption). - Stopping at the immediate inline parent would yield only the chunk's own text — the surrounding sentence lives in the enclosing block. `<p>...the payment card <code>4111…</code> is on file</p>` gives the <code> chunk a hint of "the payment card is on file", which lifts CC=0.3 above threshold via the existing context-enhancer. - Note: neither html5ever, markup5ever, nor scraper exposes a block/inline classifier; the curated list is the simplest honest implementation. Future HTML elements not in the list graciously degrade (walk continues to root, hints stay empty) rather than corrupting detection. JSON handler: - Leaf gains pub hints: Vec<String>. parse_value and parse_array now thread an Option<&str> key_context; parse_object captures the just-parsed key and passes it to parse_value for the value. Array elements inherit the containing object's key so {"cards": ["4111…", "5555…"]} gives both PANs the "cards" hint. Top-level scalars and keys themselves stay hint-less. - The leaf's hint is copied onto Chunk.hints in next_chunk so recognizers see it via input.context_hints. Resolves codec_e2e_html and codec_e2e_json payment_card assertions without touching scores. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Apply the prefer-use-imports style across the workspace: replace inline `foo::bar::Baz::method(...)` and `impl std::fmt::Debug ...` patterns with `use` lines at the top of each file, then refer to the imported name directly. Touches 51 files across nvisy-core, nvisy-context, nvisy-codec, nvisy-engine, nvisy-fake, nvisy-llm, nvisy-pattern, nvisy-server, nvisy-toolkit. Hoisted both crate paths (axum, aide, tower_http, rig, image, symphonia, nvisy_core::*) and std paths (std::fmt, std::cmp::{Ordering, Reverse}, std::marker::PhantomData, std::any::type_name, std::io::ErrorKind, std::path::Path, std::slice, std::vec). Exceptions preserved: - `#[async_trait::async_trait]` attributes stay inlined. - `tracing::*` macros and attributes stay inlined. - `*macros.rs` files keep fully-qualified paths for macro hygiene. - String literals inside `#[serde(...)]` attributes are not code-position paths. For collisions with locally-defined types (e.g. `Path`, `Json` in the server crate), imports use aliasing — `axum::extract::Path as AxumPath` rather than inlining. Drive-by simplification in nvisy-llm/src/backend/rig/mod.rs: rig 0.38 now implements CompletionClient for Gemini, so the `gemini::completion::CompletionModel::new(...)` workaround from rig-core 0.31 is gone and the build site uses `client.completion_model(...)` like the other providers. No behavior changes; workspace test, clippy, doc all clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

CI clippy on Rust 1.95 flags the nested if let Node::Element(e) = node.value() { if is_block_element(...) { ... } } as `clippy::collapsible-if`. Combine the two conditions with `&&` so the chained let pattern + boolean check live on one expression. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

martsokha and others added 5 commits June 14, 2026 13:29

style: cargo fmt

682913c

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

martsokha self-assigned this Jun 14, 2026

martsokha added pattern refactor code restructuring without behavior change labels Jun 14, 2026

martsokha and others added 2 commits June 14, 2026 23:40

martsokha added the toolkit label Jun 15, 2026

martsokha and others added 6 commits June 15, 2026 06:51

martsokha changed the title ~~refactor: rename Pattern→Regex, extract nvisy-context, normalize pattern docs~~ refactor + feat: code quality + Presidio-aligned shipped pattern set + chunk context propagation Jun 16, 2026

martsokha merged commit 5ade4ef into main Jun 16, 2026
5 checks passed

martsokha deleted the refactor/code-quality branch June 16, 2026 09:19

martsokha added recognition pattern, NER, LLM, and OCR backends (elide::recognition::*) engine redaction engine, pipeline runtime, orchestration, configuration and removed pattern labels Jun 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

refactor + feat: code quality + Presidio-aligned shipped pattern set + chunk context propagation#279

refactor + feat: code quality + Presidio-aligned shipped pattern set + chunk context propagation#279
martsokha merged 14 commits into
mainfrom
refactor/code-quality

martsokha commented Jun 14, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

martsokha commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Architecture / refactor

Presidio-aligned shipped pattern set

Chunk-level context propagation

Tests + style

Test plan

Commits

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

martsokha commented Jun 14, 2026 •

edited

Loading