28 changes: 28 additions & 0 deletions .github/workflows/build.yml
Original file line number Diff line number Diff line change
Expand Up @@ -83,6 +83,34 @@ jobs:
- name: Clippy
run: cargo clippy --workspace -- -D warnings

docs:
name: Docs
runs-on: ubuntu-latest

steps:
- name: Checkout
uses: actions/checkout@v6

- name: Install Rust toolchain
uses: dtolnay/rust-toolchain@master
with:
toolchain: ${{ env.RUST_TOOLCHAIN }}

- name: Cache cargo registry and build
uses: actions/cache@v5
with:
path: |
~/.cargo/registry
~/.cargo/git
target
key: ${{ runner.os }}-cargo-docs-${{ hashFiles('**/Cargo.lock') }}
restore-keys: ${{ runner.os }}-cargo-docs-

- name: Cargo doc
run: cargo doc --workspace --no-deps
env:
RUSTDOCFLAGS: "-D warnings"

test:
name: Test
runs-on: ubuntu-latest
Expand Down
10 changes: 10 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -102,6 +102,7 @@ unicode-normalization = { version = "0.1", features = [] }

# Checksum / encoding
bs58 = { version = "0.5", features = ["check"] }
iban_validate = { version = "5.0" }
phonenumber = { version = "0.3", default-features = false }

# Tabular document parsing
Expand Down
2 changes: 1 addition & 1 deletion crates/nvisy-context/src/io/tokens.rs
Original file line number Diff line number Diff line change
Expand Up @@ -106,7 +106,7 @@ impl Token {
/// borrows the underlying slice via [`as_slice`] and walks it by
/// count when scoring the entity's neighbourhood.
///
/// [`Enhancer`]: super::Enhancer
/// [`Enhancer`]: crate::Enhancer
/// [`as_slice`]: Tokens::as_slice
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct Tokens(Vec<Token>);
Expand Down
2 changes: 2 additions & 0 deletions crates/nvisy-core/src/entity/label/builtins.rs
Original file line number Diff line number Diff line change
Expand Up @@ -81,6 +81,7 @@ label!(pub SIGNATURE, "signature","Handwritten signature.", ["visual", "pii"]);
label!(pub LOGO, "logo","Brand or organisation logo.", ["visual"]);
label!(pub BARCODE, "barcode","Barcode or QR code.", ["visual"]);
label!(pub ORGANIZATION_NAME, "organization_name","Organization or company name.", ["organization"]);
label!(pub COMPANY_ID, "company_id","Public company-registry identifier (Handelsregisternummer, Companies House number, etc.).", ["organization"]);
label!(pub DEPARTMENT_NAME, "department_name","Department or business-unit name.", ["organization"]);
label!(pub FACILITY_NAME, "facility_name","Physical facility or location name.", ["organization"]);
label!(pub CASE_NUMBER, "case_number","Case, matter, or docket number.", ["organization"]);
Expand Down Expand Up @@ -150,6 +151,7 @@ pub(super) static BUILT_INS: &[&LazyLock<EntityLabel>] = &[
&LOGO,
&BARCODE,
&ORGANIZATION_NAME,
&COMPANY_ID,
&DEPARTMENT_NAME,
&FACILITY_NAME,
&CASE_NUMBER,
Expand Down
9 changes: 4 additions & 5 deletions crates/nvisy-pattern/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -38,17 +38,16 @@ derive_more = { workspace = true, features = ["from"] }
# Async runtime and parallelism
async-trait = { workspace = true, features = [] }

# Text processing (regex + Aho-Corasick literal matching)
# Text processing
regex = { workspace = true, features = [] }
aho-corasick = { workspace = true, features = [] }

# Tabular document parsing (dictionary loading from CSV)
# Tabular document parsing
csv = { workspace = true, features = [] }

# Base58Check decoder for the crypto.btc validator
# Checksum / encoding
bs58 = { workspace = true, features = ["check"] }

# Region-aware phone-number parsing for the phone validator
iban_validate = { workspace = true }
phonenumber = { workspace = true }

[dev-dependencies]
Expand Down
Original file line number Diff line number Diff line change
@@ -1,13 +1,16 @@
# Third-party attribution: shipped pattern assets
# Presidio attribution

Several shipped pattern TOMLs under this directory carry regular
expressions adapted from [Microsoft Presidio][presidio]
(`microsoft/presidio`, MIT licensed), specifically the
Several shipped pattern TOMLs under `patterns/` carry regular
expressions ported or adapted from [Microsoft Presidio][presidio]
(`microsoft/presidio`, MIT-licensed), specifically the
`presidio-analyzer/presidio_analyzer/predefined_recognizers/`
classes referenced inline in each TOML's leading comment.
classes. Validators (Luhn, IBAN mod-97, ABA, DEA, NPI, NHS,
NINO, etc.) were re-implemented in Rust from the same upstream
algorithms.

The Presidio MIT license text is reproduced below, per its
`Permission notice` clause.
The Presidio MIT license text is reproduced below to satisfy its
"include this permission notice in all copies or substantial
portions" clause.

[presidio]: https://github.com/microsoft/presidio

Expand Down
94 changes: 94 additions & 0 deletions crates/nvisy-pattern/assets/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
# Shipped asset tree

The `nvisy-pattern` crate compiles every TOML and term-source
file under this directory into the binary via `include_str!`,
so adding a pattern or dictionary is as simple as:

1. Drop the asset into the right subtree.
2. Wire a `shipped_pattern!` / `shipped_dictionary!` accessor
in `src/shipped/{patterns,dictionaries}/<scope>.rs`.
3. Append the accessor to the sub-module's `all()`.

The recognizer's per-call language and country fields filter
patterns and dictionaries at runtime; see
[`RecognizerInput::applies_to_language`] and
[`RecognizerInput::applies_to_country`].

## Layout

```
assets/
patterns/
world/ jurisdiction-agnostic regex patterns
contact/ email, phone, url
credentials/ aws, github, stripe, generic api, private key
finance/ credit card, iban, swift, btc, eth
network/ ipv4, ipv6, mac
personal/ date of birth, datetime
us/ US-jurisdiction patterns
identity/ ssn, itin, drivers_license, passport, postal_code
finance/ bank_routing, bank_account
health/ npi, mbi, medical_license (DEA)
uk/ UK-jurisdiction patterns
identity/ nhs, nino, driving_licence, passport
contact/ postcode
vehicle/ registration

dictionaries/
world/ universal: brand names + codes
finance/ cryptocurrencies (BTC, ETH, Bitcoin, …)
en/ English-language terms
finance/ currencies (USD, US Dollar, EUR, …)
personal/ languages, nationalities, religions
```

Each pattern is a TOML file (`<name>.toml`). Each dictionary
pairs a TOML metadata sidecar with a term source:
`.csv` for multi-column lists (term + alias columns with
per-column scores), `.txt` for one-per-line lists.

## Scoring conventions

Scores are baseline confidences: the context enhancer (in
`nvisy-context`) lifts them when configured keywords appear
nearby. The toolkit's default confidence threshold is `0.35`;
any score below it needs a context boost or an out-of-band hint
(CSV column header, JSON object key, HTML parent text) to
clear the threshold.

| Tier | Score | Use |
|------|-------|-----|
| Strong | 0.95–0.98 | Branded credential headers (`AKIA…`, `-----BEGIN PRIVATE KEY-----`, `gh[pousr]_…`) |
| Solid | 0.4–0.5 | Format with a checksum or restrictive structure (IBAN, NHS, NPI, MBI, IPv4, MAC, IPv6) |
| Loose | 0.3 | Brand-aware with weak structural specificity (credit_card, dictionaries) |
| Weak | 0.1 | Generic shape that *requires* context to clear threshold (passport, postal_code, DoB) |
| Trace | 0.05 | Last-resort generic regex (bank_account `\b\d{8,17}\b`) |

The targets mirror Microsoft Presidio's deliberately conservative
baselines: most of Presidio's predefined recognizers sit in the
0.1–0.5 range because the context enhancer is expected to lift
hits to 0.6+ when surrounding tokens match.
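
As a rough illustration of the threshold rule described in this section, the sketch below checks whether a baseline score clears the default `0.35` threshold on its own. The names `DEFAULT_THRESHOLD`, `needs_lift`, and `boosted` are hypothetical, and the additive, capped boost is an assumption made for the example; the actual enhancer in `nvisy-context` may combine scores differently.

```rust
/// Default confidence threshold quoted in this README.
const DEFAULT_THRESHOLD: f64 = 0.35;

/// Returns true when a baseline score cannot clear the threshold
/// without a context boost or an out-of-band hint.
fn needs_lift(baseline: f64) -> bool {
    baseline < DEFAULT_THRESHOLD
}

/// Hypothetical boost model (an assumption, not the crate's API):
/// add the boost to the baseline and cap the result at 1.0.
fn boosted(baseline: f64, boost: f64) -> f64 {
    (baseline + boost).min(1.0)
}
```

Under this model a "Weak" pattern at `0.1` is dropped unless context lifts it, while a "Solid" pattern at `0.4` passes unaided.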

## Validators

A pattern variant can declare `validator = "<name>"` to drop
matches that fail a post-match structural check. Built-in
names (resolved via `ValidatorRegistry::builtin`):

- Universal: `luhn`, `iban`, `phone`, `date`, `crypto.btc`
- US: `us.ssn`, `us.aba_routing`, `us.npi`, `us.dea_number`,
`us.postal_code`
- UK: `uk.nhs`, `uk.nino`, `uk.driving_licence`,
`uk.vehicle_registration`

Each lives in `src/validators/` under the matching submodule.
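
To make the post-match check concrete, here is a minimal sketch of name-based validator dispatch using only the standard library. The `Validator` alias, `builtin_sketch`, and `keep_match` are hypothetical names for illustration and are not the `ValidatorRegistry` API. Dropping matches whose validator name is unknown is also an assumption. The `luhn` function is the standard Luhn checksum.

```rust
use std::collections::HashMap;

/// Hypothetical validator signature: matched text in, verdict out.
type Validator = fn(&str) -> bool;

/// Standard Luhn check; spaces and hyphens are ignored.
fn luhn(input: &str) -> bool {
    let digits: Option<Vec<u32>> = input
        .chars()
        .filter(|c| !c.is_whitespace() && *c != '-')
        .map(|c| c.to_digit(10))
        .collect();
    let digits = match digits {
        Some(d) if d.len() >= 2 => d,
        _ => return false,
    };
    // Walk from the right, doubling every second digit.
    let sum: u32 = digits
        .iter()
        .rev()
        .enumerate()
        .map(|(i, &d)| {
            if i % 2 == 1 {
                let doubled = d * 2;
                if doubled > 9 { doubled - 9 } else { doubled }
            } else {
                d
            }
        })
        .sum();
    sum % 10 == 0
}

/// Hypothetical registry holding a single built-in name.
fn builtin_sketch() -> HashMap<&'static str, Validator> {
    let mut map: HashMap<&'static str, Validator> = HashMap::new();
    map.insert("luhn", luhn);
    map
}

/// Keeps a match when the variant declares no validator, or when the
/// named validator accepts the text. Unknown names drop the match
/// (an assumption made for this sketch).
fn keep_match(
    validators: &HashMap<&'static str, Validator>,
    name: Option<&str>,
    text: &str,
) -> bool {
    match name {
        None => true,
        Some(n) => validators.get(n).map_or(false, |check| check(text)),
    }
}
```

A variant with `validator = "luhn"` would then keep `4111 1111 1111 1111` and drop a candidate whose last digit is altered.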

## Attribution

Many patterns and validators are ports of upstream Microsoft
Presidio recognizers. See [`PRESIDIO.md`](PRESIDIO.md) for the
MIT-license attribution and the upstream class references that
each adapted TOML's leading comment links to.

[`RecognizerInput::applies_to_language`]: ../../nvisy-core/src/recognition/input.rs
[`RecognizerInput::applies_to_country`]: ../../nvisy-core/src/recognition/input.rs
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
name = "currencies"
label = "currency"
score = 0.85
languages = ["en"]
score = 0.4
Original file line number Diff line number Diff line change
@@ -1,7 +1,8 @@
name = "languages"
label = "language"
languages = ["en"]

# column 0 = long-form names (`English`, `Spanish`, ...)
# column 1 = ISO 639-1 codes (`en`, `es`, ...)
# column 2 = alternate long-form names (`Farsi` for Persian)
score = [0.85, 0.30, 0.85]
score = [0.4, 0.2, 0.4]
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
name = "nationalities"
label = "nationality"
score = 0.85
languages = ["en"]
score = 0.4
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
name = "religions"
label = "religion"
score = 0.85
languages = ["en"]
score = 0.4
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
name = "cryptocurrencies"
label = "currency"
score = 0.85
score = 0.4
25 changes: 25 additions & 0 deletions crates/nvisy-pattern/assets/patterns/au/finance/abn.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
# Australian Business Number (ABN): 11-digit ID issued by the
# Australian Business Register. The leading digit is decremented
# by 1 (0 wraps to 9) before a weighted sum that must be
# divisible by 89. The conventional rendering is `NN NNN NNN NNN`.

name = "au-abn"
label = "company_id"
countries = ["AU"]
languages = ["en"]
context = [
"australian business number",
"abn",
"abr",
"australian business register",
]

[[variants]]
regex = '\b\d{2}\s\d{3}\s\d{3}\s\d{3}\b'
score = 0.4
validator = "au.abn"

[[variants]]
regex = '\b\d{11}\b'
score = 0.2
validator = "au.abn"
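
The checksum described in the header comment of `abn.toml` can be sketched as follows. This is an illustrative stand-in rather than the crate's `au.abn` implementation: the function name `abn_is_valid` is hypothetical, and the weight vector is an assumption taken from the commonly documented ABN algorithm, not from this diff.

```rust
/// Hypothetical stand-in for the `au.abn` validator.
fn abn_is_valid(input: &str) -> bool {
    // Assumed weights from the commonly documented ABN algorithm.
    const WEIGHTS: [u32; 11] = [10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19];

    // Collect digits, ignoring whitespace; reject any other character.
    let digits: Option<Vec<u32>> = input
        .chars()
        .filter(|c| !c.is_whitespace())
        .map(|c| c.to_digit(10))
        .collect();
    let mut d = match digits {
        Some(d) if d.len() == 11 => d,
        _ => return false,
    };

    // Rewrite the leading digit: subtract 1, wrapping 0 to 9.
    d[0] = (d[0] + 9) % 10;

    // The weighted sum must be divisible by 89.
    let sum: u32 = d.iter().zip(WEIGHTS.iter()).map(|(x, w)| x * w).sum();
    sum % 89 == 0
}
```

Because both variants share the validator, the lower-scored `\d{11}` variant only emits 11-digit runs that also pass this check.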
24 changes: 24 additions & 0 deletions crates/nvisy-pattern/assets/patterns/au/finance/acn.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
# Australian Company Number (ACN): 9-digit ID issued by ASIC.
# Weighted-sum mod-10 checksum, complement-to-10. Conventional
# rendering is `NNN NNN NNN`.

name = "au-acn"
label = "company_id"
countries = ["AU"]
languages = ["en"]
context = [
"australian company number",
"acn",
"asic",
"australian securities and investments commission",
]

[[variants]]
regex = '\b\d{3}\s\d{3}\s\d{3}\b'
score = 0.4
validator = "au.acn"

[[variants]]
regex = '\b\d{9}\b'
score = 0.1
validator = "au.acn"
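
A minimal sketch of the "weighted-sum mod-10, complement-to-10" check that `acn.toml` refers to. The function name `acn_is_valid` is hypothetical, and the descending weights 8 to 1 are an assumption based on the commonly documented ACN algorithm; the crate's `au.acn` validator may differ in detail.

```rust
/// Hypothetical stand-in for the `au.acn` validator.
fn acn_is_valid(input: &str) -> bool {
    // Collect digits, ignoring whitespace; reject any other character.
    let digits: Option<Vec<u32>> = input
        .chars()
        .filter(|c| !c.is_whitespace())
        .map(|c| c.to_digit(10))
        .collect();
    let d = match digits {
        Some(d) if d.len() == 9 => d,
        _ => return false,
    };

    // Assumed weights 8, 7, ..., 1 over the first eight digits.
    let sum: u32 = d[..8]
        .iter()
        .zip((1..=8u32).rev())
        .map(|(x, w)| x * w)
        .sum();

    // The check digit is the complement to 10 of the remainder
    // (a complement of 10 maps to 0).
    (10 - sum % 10) % 10 == d[8]
}
```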
25 changes: 25 additions & 0 deletions crates/nvisy-pattern/assets/patterns/au/health/medicare.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
# Australian Medicare card number: 10-digit ID where the first
# digit is in 2-6 and the 9th digit is a mod-10 weighted check.
# Rendering is `NNNN NNNNN N` (card number + individual reference)
# or `NNNN NNNNN N N` (with issue number).

name = "au-medicare"
label = "insurance_id"
countries = ["AU"]
languages = ["en"]
context = [
"medicare",
"medicare card",
"medicare number",
"services australia",
]

[[variants]]
regex = '\b[2-6]\d{3}\s\d{5}\s\d\b'
score = 0.4
validator = "au.medicare"

[[variants]]
regex = '\b[2-6]\d{9}\b'
score = 0.2
validator = "au.medicare"
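
The Medicare check from the header comment of `medicare.toml` could look roughly like this. The name `medicare_is_valid` is hypothetical, and the repeating weights 1, 3, 7, 9 are an assumption based on the commonly documented Medicare algorithm; only the first-digit range and the position of the check digit come from the comment itself.

```rust
/// Hypothetical stand-in for the `au.medicare` validator.
fn medicare_is_valid(input: &str) -> bool {
    // Assumed weights from the commonly documented algorithm.
    const WEIGHTS: [u32; 8] = [1, 3, 7, 9, 1, 3, 7, 9];

    // Collect digits, ignoring whitespace; reject any other character.
    let digits: Option<Vec<u32>> = input
        .chars()
        .filter(|c| !c.is_whitespace())
        .map(|c| c.to_digit(10))
        .collect();
    let d = match digits {
        Some(d) if d.len() == 10 => d,
        _ => return false,
    };

    // The first digit must be in 2..=6.
    if !(2..=6).contains(&d[0]) {
        return false;
    }

    // The 9th digit is the weighted sum of the first eight, mod 10.
    let sum: u32 = d[..8].iter().zip(WEIGHTS.iter()).map(|(x, w)| x * w).sum();
    sum % 10 == d[8]
}
```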
24 changes: 24 additions & 0 deletions crates/nvisy-pattern/assets/patterns/au/identity/tfn.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
# Australian Tax File Number (TFN): 9-digit ID issued by the
# Australian Taxation Office. Weighted mod-11 checksum.
# Rendering is `NNN NNN NNN`.

name = "au-tfn"
label = "tax_id"
countries = ["AU"]
languages = ["en"]
context = [
"tax file number",
"tfn",
"australian taxation office",
"ato",
]

[[variants]]
regex = '\b\d{3}\s\d{3}\s\d{3}\b'
score = 0.4
validator = "au.tfn"

[[variants]]
regex = '\b\d{9}\b'
score = 0.1
validator = "au.tfn"
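
For the "weighted mod-11 checksum" mentioned in `tfn.toml`, a sketch under stated assumptions: the function name `tfn_is_valid` is hypothetical, and the weight vector comes from the commonly documented 9-digit TFN algorithm rather than from this diff.

```rust
/// Hypothetical stand-in for the `au.tfn` validator.
fn tfn_is_valid(input: &str) -> bool {
    // Assumed weights from the commonly documented TFN algorithm.
    const WEIGHTS: [u32; 9] = [1, 4, 3, 7, 5, 8, 6, 9, 10];

    // Collect digits, ignoring whitespace; reject any other character.
    let digits: Option<Vec<u32>> = input
        .chars()
        .filter(|c| !c.is_whitespace())
        .map(|c| c.to_digit(10))
        .collect();
    let d = match digits {
        Some(d) if d.len() == 9 => d,
        _ => return false,
    };

    // The weighted sum over all nine digits must be divisible by 11.
    let sum: u32 = d.iter().zip(WEIGHTS.iter()).map(|(x, w)| x * w).sum();
    sum % 11 == 0
}
```

The TFN and ACN patterns share the same regex shapes, so the validators and context keywords carry most of the weight in telling the two labels apart.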
20 changes: 20 additions & 0 deletions crates/nvisy-pattern/assets/patterns/ca/contact/postal_code.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
# Canadian postal code: `A1A 1A1` format where the letter
# alphabet excludes D, F, I, O, Q, U; W and Z also never appear
# as the first letter (Canada Post Address Standard).

name = "ca-postal-code"
label = "postal_code"
countries = ["CA"]
languages = ["en", "fr"]
context = [
"postal code",
"code postal",
"canada post",
"postes canada",
"mailing address",
"adresse postale",
]

[[variants]]
regex = '\b[ABCEGHJ-NPRSTVXY]\d[A-CEGHJ-NPRSTV-Z][ -]?\d[A-CEGHJ-NPRSTV-Z]\d\b'
score = 0.5
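
To show what the character classes in the `ca-postal-code` regex accept, here is a hand-written matcher using only the standard library. It is a sketch, not how the crate evaluates patterns: the function names are hypothetical, it tests a whole string instead of searching with `\b` boundaries, and it accepts ASCII digits only.

```rust
/// First position: the letter set without D, F, I, O, Q, U, W, Z.
fn is_first_letter(c: char) -> bool {
    matches!(
        c,
        'A' | 'B' | 'C' | 'E' | 'G' | 'H' | 'J'..='N' | 'P' | 'R' | 'S' | 'T' | 'V' | 'X' | 'Y'
    )
}

/// Later letter positions: the letter set without D, F, I, O, Q, U.
fn is_inner_letter(c: char) -> bool {
    matches!(
        c,
        'A'..='C' | 'E' | 'G' | 'H' | 'J'..='N' | 'P' | 'R' | 'S' | 'T' | 'V'..='Z'
    )
}

/// Whole-string check for `A1A 1A1`, `A1A-1A1`, or `A1A1A1`.
fn is_ca_postal_code(input: &str) -> bool {
    let chars: Vec<char> = input.chars().collect();
    // An optional single space or hyphen separates the two halves.
    let (head, tail) = match chars.len() {
        6 => (&chars[..3], &chars[3..]),
        7 if chars[3] == ' ' || chars[3] == '-' => (&chars[..3], &chars[4..]),
        _ => return false,
    };
    is_first_letter(head[0])
        && head[1].is_ascii_digit()
        && is_inner_letter(head[2])
        && tail[0].is_ascii_digit()
        && is_inner_letter(tail[1])
        && tail[2].is_ascii_digit()
}
```

The pattern declares no validator, so the shape alone has to carry the `0.5` baseline score.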