refactor + feat: code quality + Presidio-aligned shipped pattern set + chunk context propagation#279
Merged
Merged
Conversation
… pass
Lifts crate::context out of nvisy-core into a sibling nvisy-context crate so
the SDK base stays primitives-only for third-party recognizer authors. Adds
NerRecognizer::context_registry (mirroring PatternRegistry::context_registry)
and wires ContextEnhancer into DetectionPhase: build_for_request now returns
DetectionResources { recognizers, enhancer }, the enhancer runs in block-local
coordinates between recognizer dispatch and modality lifting, and the
substring path runs by default (Tokens artifact wiring follows when an
NlpEngine is plumbed in).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drops 9 unused workspace deps (hmac, include_dir, quick-xml, reqwest, serde_with, smallvec, stop-words, walkdir, zip) and reorders the root [workspace.dependencies] foundation-first: primitives → runtime → domain (text/document/image/audio) → integration (HTTP, AI, server, CLI) → storage → utilities. Removes per-crate machete-flagged deps and aligns every crate manifest with the new group names and order. Keeps calamine and unicode-segmentation in workspace deps for upcoming xlsx + word-boundary work. Marks humantime-serde as ignored in nvisy-llm/nvisy-engine/nvisy-server where it's used via serde `with =` strings. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ormalize docs - nvisy-pattern: rename Detector→Pattern→Regex; inline PatternRegistry into PatternRecognizerBuilder; split CompiledPattern out; export built-in validators by bare-noun names (luhn, iban, ssn, phone, date); add Scoring::get + per-column resolution; convert pattern assets to TOML; normalize module/function docs (returns-form for predicates, reference-form doc-links, # Errors + code examples for public types). - nvisy-context: extract registry/declaration into rule + wrapper; trim enhancer/matcher/tokens surface. - nvisy-toolkit: drop stale PatternRegistry usage in pipeline example; fix broken rustdoc links in redaction module. - nvisy-engine, nvisy-ner: knock-on updates for the new pattern and context surfaces. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Variant: replace derive_builder with `new(regex)?` + `with_score` / `with_validator` chain (matches Term::new style). - Drop the Terms newtype; Dictionary::terms is `Vec<Term>` and the parsers move to associated fns on Term: - Term::from_text(&str) -> Vec<Term> (infallible) - Term::from_csv(&str) -> Result<Vec<Term>, Error> Signatures now match Regex::from_toml / Dictionary::from_toml. - Rewrite every public-item docblock in nvisy-pattern for a consistent style: noun-phrase openers for types, imperative for constructors/setters, returns-form for predicates, reference-form doc-links at the bottom, `# Errors` only where fallible, code examples on top-level types. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ld() - nvisy-context: `Boosting<R>` → `ContextEnhanced<R>` (more self-descriptive; reads as "an R that's been context-enhanced"). - nvisy-pattern: `PatternRecognizerBuilder::build()` now returns the bare `PatternRecognizer`; the wrapped form moves to `build_context_enhanced() -> ContextEnhanced<PatternRecognizer>`. Callers opt into the keyword-boost layer explicitly. - Engine config, shipped-detection / user-rules / enhancer roundtrip tests, toolkit fixtures + example flipped to `build_context_enhanced()` to preserve prior behavior. - README + module/struct docs rewritten to describe both methods without historical framing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… matching - nvisy-pattern: new `Context` enum (Global | PerLanguage) replaces `Vec<String>` on Regex and Dictionary. Untagged serde keeps the flat TOML form working unchanged; new form is `[context.en] = [...]`. Shipped phone, credit_card, date_of_birth, datetime patterns now carry EN/ES/DE/FR keyword sets. - nvisy-context: BoostRule gains `language: Option<LanguageTag>`. Enhancer storage flips to `HashMap<Label, Vec<BoostRule>>` — one bucket per label, distinct language scopes inside. Enhancer::enhance takes a language hint; ContextEnhanced threads input.language through. - nvisy-core: LanguageTag::new returns nvisy_core::Error; LanguageTag::matches compares primary subtags case-insensitively so `en` matches `en-US` / `en-GB`. BoostRule::applies_to_language and RecognizerInput::applies_to_language switch from `==` to `matches()` so language-scoped rules fire under regional variants. - Tests: TOML round-trip both forms; per-language boost fires for matching language; no boost for non-matching language; no-hint unions all per-language keywords; regional variants (`en-US`) trigger `en`-scoped rules. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- New `nvisy-toolkit::deduplication::suppress::{SuppressionLayer,
SuppressionParams}`. Three independent allow-list shapes apply
by union:
- `allow_values` — exact, ASCII case-insensitive
- `allow_values_substring` — entity text contains the value
- `allow_values_regex` — regex matched against entity text
- Operates on the entity's resolved text via `TextAt::text_at`.
Fail-open when the resolver returns `None`: keep the entity
rather than silently drop something we can't verify.
- Empty entries are filtered at construction (otherwise an empty
substring would drop every entity via `str::contains("")`).
- `LayerParams` gains a nested `suppression: SuppressionParams`
field; `LayerPipeline::from_params` becomes fallible and inserts
the new layer between fuse and resolve, growing the canonical
recipe to: calibrate → filter → fuse → suppress → resolve.
- Six `from_params` call sites updated (engine pipeline, engine
tests, toolkit fixtures, toolkit example).
- 15 unit tests in `suppress::tests` (exact / substring / regex
modes, case insensitivity, partial-overlap semantics, empty-
entry filtering, union across modes, unresolved-location
fail-open, invalid-regex error). Plus a pipeline-order test in
`pipeline::tests` that pins the architectural intent that fuse
collapses before suppress sees, suppress drops before resolve
adjudicates.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- nvisy-core: new `CountryCode` (ISO 3166-1 alpha-2, validated via
`celes`). `RecognizerInput.country: Option<CountryCode>` +
`applies_to_country` mirror the existing language scoping.
- nvisy-pattern: `Regex.countries` / `Dictionary.countries`
(Vec<CountryCode>, empty = world). `PatternRecognizer::recognize`
honours per-call jurisdiction hints alongside language hints.
- Asset tree reorganized into `world/`, `us/`, `uk/` subtrees;
shipped accessors split into per-region modules
(`shipped::patterns::{world,us,uk}`, dictionaries::world).
Macro helpers exported as `__shipped_pattern` /
`__shipped_dictionary` so sub-modules resolve their own
include_str! paths.
- Pattern count grows 23 → 34: world unchanged at 18; us 5 → 10
(+itin, npi, mbi, bank_account, medical_license); uk added at 6
(nhs, nino, driving_licence, postcode, vehicle_registration,
passport).
- Validators split into per-country sub-modules with dotted names
(`us.ssn`, `us.aba_routing`, `us.npi`, `us.dea_number`,
`uk.nhs`, `uk.nino`). Shared `luhn`, `iban`, `phone`, `date`
stay flat. World pattern set extended (brand-aware credit_card,
RFC5322-loose email, Cisco-form MAC, IPv4 CIDR, comprehensive
IPv6 alternation set).
- Pattern scores normalized to a single conservative-baseline
scheme (most regex-only matches land at 0.1–0.5 before context
boost). Confidence threshold in the toolkit test fixture
lowered to 0.35 to match.
- `assets/NOTICE.md` documents third-party regex provenance for
the shipped pattern assets.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Restructure pattern crate end-to-end tests:
- Replace single tests/shipped_detection.rs with three per-region
binaries: tests/builtin.rs (5 world tests), tests/builtin_us.rs
(3 US tests), tests/builtin_uk.rs (3 UK tests). 11 tests total.
- Move shared scan + assert_match + assert_label_present helpers to
tests/fixtures/mod.rs, declared via mod fixtures; from each binary.
Both helpers carry #[track_caller] for better failure attribution.
- Reshape testdata/inputs/ to mirror the asset tree:
- world fixtures move from monolithic domain files into
inputs/{contact,credentials,finance,network,personal}.txt
- inputs/us/{identity,finance,health}.txt
- inputs/uk/{identity,contact,vehicle}.txt (split from old uk.txt)
- Each test scans one fixture, asserting substring + label matches
against a recognizer loaded with every shipped pattern and
dictionary via build_context_enhanced.
- builtin_uk_identity asserts NATIONALITY (world dictionary firing
on "British") to keep assert_label_present reachable across all
three binaries.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… propagation
Tier 1 (correctness bugs):
- world/iban regex: extend middle groups from \d{4} to [A-Z0-9]{4}
and separator \s? to [\s\-]?; accepts IBANs with letters past
position 8 (UK NWBK, IE, MT, MK, GI) and hyphenated forms that
the mod-97 validator was already prepared to handle.
- world/private_key: match the full BEGIN..END PEM block instead
of only the header line; add ENCRYPTED PRIVATE KEY, PGP, SSH2,
and PuTTY-User-Key-File-{2,3} variants.
- us/medical_license: add \b anchors to both DEA variants; prior
pattern matched inside longer alphanumeric tokens.
- uk.nino validator: reject O as the position-0 letter (HMRC
reserved); the character class blocks D/F/I/Q/U/V but allows
O via the j-p range.
- us/passport: add Presidio's context = [passport, passport#,
travel document, us passport, united states passport] so the
0.1-base pattern can boost above threshold.
- us/postal_code: drop score 0.5 -> 0.1, add context, ship a
us.postal_code validator that rejects 00000.
Tier 2 (coverage + scoring):
- world/bitcoin_address: split legacy (Base58) from Bech32; bump
Bech32 cap {25,39}->{25,59} for Taproot (bc1p...). Add a
crypto.btc validator using bs58::decode_check.
- world/credit_card: add Mastercard 2-series (2221-2720); drop
score 0.5 -> 0.3 to match Presidio's deliberate baseline that
expects context boost to do the rest.
- world/aws_key: broaden access-key ID prefix to also catch
ASIA/AIDA/AROA/ANPA/AGPA/AIPA; add a second variant for the
40-char secret access key; ship Presidio-style context.
- world/github_token: add github_pat_[A-Z0-9_]{82} variant for
the fine-grained PAT format introduced in 2022.
- world/generic_api_key: accept whitespace separator alongside
[:=] so `Authorization: Bearer <token>` matches.
- uk/driving_licence + uk/vehicle/registration: add the Presidio
validators we'd left on the table (99999 surname rejection,
age-ID range 02-29 / 51-79 for current-format plates).
Validator infrastructure:
- world/phone: replace the regex+length validator with a
phonenumber-crate-backed region-aware validator. The validator
parses E.164 directly and falls back to the caller-specified
country (via RecognizerInput.country) when present.
Introduces a workspace-wide phonenumber = 0.3 dependency.
- ValidatorRegistry::with_simple convenience for the ten
context-free validators; with() stays the canonical entry
point for ctx-aware validators (only phone today).
Context-enhancer architecture:
- Add RecognizerInput.context_hints: Vec<String> for out-of-band
context strings (CSV column headers, JSON keys, log field
names) the caller wants treated as in-context.
- nvisy-context::Enhancer::enhance now takes a Context bundle
(text + tokens + language + hints) instead of four loose
arguments. The hint path runs as a fallback when the in-text
word window doesn't fire; at most one boost per rule per
entity.
- LiftedFromText in nvisy-toolkit gains chunk_hints; Tabular
surfaces column_name as a hint so a `card` column header
lifts a per-cell CC=0.3 match to ~0.65 via the existing boost
pipeline (no synthetic score patching to clear the threshold).
nvisy-context module split:
- enhancer.rs -> enhancer/{mod, context, window}.rs
- matcher.rs -> matching/{mod, matcher, lemma}.rs
- tokens.rs + wrapper.rs -> io/{mod, tokens, wrapper}.rs
- Public surface (Context, Enhancer, BoostRule, KeywordMatcher,
SubstringMatcher, LemmaMatcher, ContextEnhanced, Token,
Tokens) stays at the crate root via re-exports.
- Drop 3 redundant enhancer tests (suffix-symmetry duplicate,
unicode-too-distant duplicate, token-window symmetry); keep
the 13 unique behaviors plus 2 new hint-path tests.
Known gap: html_codec_e2e payment_card assertion fails because
HTML chunks at text-node boundary and `<code>4111...</code>`
loses the surrounding "payment card" context. The fix requires
moving chunk_hints from LiftedFromText onto the Handler trait
and overriding it on HtmlHandler to emit parent-element text
as a hint. Tracked as follow-up.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extend the tabular-cell hint mechanism to all chunked formats by
making hints a first-class field on Chunk<M>, populated alongside
data + location during next_chunk.
Architecture:
- Chunk<M> gains pub hints: Vec<String>. Hints are metadata the
chunk's structural neighbours surface — CSV column headers,
HTML parent-element text, JSON object keys — for downstream
context-aware recognizers.
- nvisy-toolkit detect() reads chunk.hints directly. The earlier
LiftedFromText::chunk_hints and the briefly-added
Handler::chunk_hints methods are both removed; the field on
Chunk avoids a second handler call to recompute information
next_chunk already had.
- Handlers without useful out-of-band metadata (TXT, PDF, image,
audio) initialise the field with Vec::new().
CSV handler:
- Populates chunk.hints from chunk.location.column_name.
Replaces the prior LiftedFromText::chunk_hints<TabularLocation>
override, which had the same effect via a less-direct path.
HTML handler:
- RedactableItem gains pub hints: Vec<String>. The DOM walk in
build_items computes a per-text-node hint by collecting the
text of the node's nearest block-level ancestor (excluding
the node's own text). nearest_block_ancestor walks parents
until it finds a tag in is_block_element's curated set (p,
div, li, td, th, h1-h6, blockquote, dt, dd, section, article,
aside, header, footer, main, nav, figcaption, caption).
- Stopping at the immediate inline parent would yield only the
chunk's own text — the surrounding sentence lives in the
enclosing block. `<p>...the payment card <code>4111…</code>
is on file</p>` gives the <code> chunk a hint of "the payment
card is on file", which lifts CC=0.3 above threshold via the
existing context-enhancer.
- Note: neither html5ever, markup5ever, nor scraper exposes a
block/inline classifier; the curated list is the simplest
honest implementation. Future HTML elements not in the list
graciously degrade (walk continues to root, hints stay empty)
rather than corrupting detection.
JSON handler:
- Leaf gains pub hints: Vec<String>. parse_value and parse_array
now thread an Option<&str> key_context; parse_object captures
the just-parsed key and passes it to parse_value for the
value. Array elements inherit the containing object's key so
{"cards": ["4111…", "5555…"]} gives both PANs the "cards"
hint. Top-level scalars and keys themselves stay hint-less.
- The leaf's hint is copied onto Chunk.hints in next_chunk so
recognizers see it via input.context_hints.
Resolves codec_e2e_html and codec_e2e_json payment_card
assertions without touching scores.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Apply the prefer-use-imports style across the workspace: replace
inline `foo::bar::Baz::method(...)` and `impl std::fmt::Debug ...`
patterns with `use` lines at the top of each file, then refer to
the imported name directly.
Touches 51 files across nvisy-core, nvisy-context, nvisy-codec,
nvisy-engine, nvisy-fake, nvisy-llm, nvisy-pattern, nvisy-server,
nvisy-toolkit. Hoisted both crate paths (axum, aide, tower_http,
rig, image, symphonia, nvisy_core::*) and std paths
(std::fmt, std::cmp::{Ordering, Reverse}, std::marker::PhantomData,
std::any::type_name, std::io::ErrorKind, std::path::Path,
std::slice, std::vec).
Exceptions preserved:
- `#[async_trait::async_trait]` attributes stay inlined.
- `tracing::*` macros and attributes stay inlined.
- `*macros.rs` files keep fully-qualified paths for macro hygiene.
- String literals inside `#[serde(...)]` attributes are not
code-position paths.
For collisions with locally-defined types (e.g. `Path`, `Json` in
the server crate), imports use aliasing — `axum::extract::Path
as AxumPath` rather than inlining.
Drive-by simplification in nvisy-llm/src/backend/rig/mod.rs: rig
0.38 now implements CompletionClient for Gemini, so the
`gemini::completion::CompletionModel::new(...)` workaround from
rig-core 0.31 is gone and the build site uses
`client.completion_model(...)` like the other providers.
No behavior changes; workspace test, clippy, doc all clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CI clippy on Rust 1.95 flags the nested
if let Node::Element(e) = node.value() {
if is_block_element(...) { ... }
}
as `clippy::collapsible-if`. Combine the two conditions with `&&`
so the chained let pattern + boolean check live on one expression.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
13 commits split into three loose groups; full list at the bottom.
Architecture / refactor
nvisy-contextcrate. Engine wires the per-label keyword-boost enhancer as a wrapping layer around recognizers. Context ownsBoostRule,Enhancer,SubstringMatcher,LemmaMatcher,ContextEnhanced<R>,Tokens/Token.nvisy-contextmodule split.enhancer/{mod,context,window}.rs,matching/{mod,matcher,lemma}.rs,io/{mod,tokens,wrapper}.rs. Public surface stays at the crate root via re-exports.Detector→Pattern→Regexinnvisy-pattern. Detection rule isRegex(withVec<Variant>), matching Presidio's recognizer shape. InlinePatternRegistryintoPatternRecognizerBuilder.CountryCodetonvisy-core::primitive(ISO 3166-1 alpha-2 viaceles);RecognizerInput.country: Option<CountryCode>+applies_to_country; per-call filtering insidePatternRecognizer::recognize. Shipped pattern + dictionary assets reorganized intoworld/,us/,uk/subtrees.LanguageTag::matches.nvisy-toolkit::deduplication— allow-list false-positive filter slotted between fuse and resolve.Presidio-aligned shipped pattern set
\banchors;uk.ninorejectsOat position 0; us/passport gains context; us/postal_code drops score 0.5→0.1 + gains context +us.postal_codevalidator that rejects00000.bs58; credit_card adds Mastercard 2-series + score 0.5→0.3 (Presidio parity); aws_key broadens to 7 prefixes + secret-key variant; github_pat_; generic_api_key accepts whitespace separator (Authorization: Bearer); uk DL + vehicle-reg validators (Presidio parity).world/phonevalidator now uses thephonenumbercrate (region-aware libphonenumber port) instead of a regex + length check.uk.driving_licence,uk.vehicle_registration, plus existinguk.nhs+uk.ninoextended.gemini::completion::CompletionModel::newworkaround is gone.Chunk-level context propagation
RecognizerInput.context_hints: Vec<String>— out-of-band context strings the enhancer should treat as in-context. Threaded throughContextEnhancedwrapper.Enhancer::enhancetakes aContextbundle (text + tokens + language + hints) instead of four loose args. Hint-path runs as a fallback when the in-text word window doesn't fire; at most one boost per rule per entity.Chunk<M>gainspub hints: Vec<String>populated bynext_chunkalongsidedataandlocation. No newHandlertrait method, no double traversal.chunk.location.column_name.is_block_elementset —html5ever/scraperdon't expose one).<p>...the payment card <code>4111…</code> is on file</p>gives the<code>chunk a hint of "the payment card is on file".parse_value/parse_array; array elements inherit the containing key).detect()readschunk.hintsdirectly and forwards toRecognizerInput::with_context_hints.Tests + style
tests/shipped_detection.rssplit into per-region binaries (builtin.rs,builtin_us.rs,builtin_uk.rs) with sharedtests/fixtures/mod.rs. 11 region+domain tests; substring + label assertions with#[track_caller].testdata/inputs/reorganized to mirror the asset tree:inputs/{contact,credentials,finance,network,personal}.txt(world) +inputs/us/{identity,finance,health}.txt+inputs/uk/{identity,contact,vehicle}.txt.usestatements across 51 files. Exceptions for#[async_trait::async_trait]andtracing::*macros/attrs preserved.Test plan
cargo test --workspace— all greencargo clippy --workspace --all-targets -- -D warningsRUSTDOCFLAGS="-D warnings" cargo doc --workspace --no-depscargo test -p nvisy-pattern --docpayment_card(CC=0.3) via the new context-hint pathCommits
🤖 Generated with Claude Code