feat: add anonymize core crate#217

Open
jan-kubica wants to merge 44 commits into main from codex/anonymize-core-redaction

jan-kubica (Contributor) commented Jun 24, 2026

Summary

  • add a strict Cargo workspace and internal stella-anonymize-core crate
  • model redaction-domain behavior in Rust modules for placeholders, normalization, UTF-16 spans, and result construction
  • add a typed Rust search index over Stella literal, regex, and fuzzy core crates while preserving the existing UTF-16 offset contract
  • wire Rust format, lint, and test checks into the root scripts and CI without changing published TypeScript package exports

CC on behalf of @sok0

@gemini-code-assist

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

jan-kubica force-pushed the codex/anonymize-core-redaction branch from 341eb7f to b916fa1 on June 24, 2026 07:24
jan-kubica marked this pull request as ready for review on June 24, 2026 07:25

github-actions Bot commented Jun 24, 2026

Dependency Review

The following issues were found:

  • ✅ 0 vulnerable package(s)
  • ✅ 0 package(s) with incompatible licenses
  • ✅ 0 package(s) with invalid SPDX license definitions
  • ⚠️ 30 package(s) with unknown licenses.
  • ⚠️ 2 packages with OpenSSF Scorecard issues.

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b916fa1630

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread crates/anonymize-core/src/placeholders.rs Outdated
Comment thread crates/anonymize-core/src/normalize.rs Outdated
Comment thread crates/anonymize-core/src/redact.rs Outdated
Comment thread crates/anonymize-core/src/search.rs Outdated
Comment thread crates/anonymize-core/src/redact.rs Outdated
Comment thread crates/anonymize-core/src/normalize.rs Outdated

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c664ed09a6

Comment thread crates/anonymize-core/src/normalize.rs

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fcbb328f84

Comment thread crates/anonymize-core/src/prepared.rs Outdated
Comment thread crates/anonymize-core/src/resolution/boundary.rs Outdated
Comment thread crates/anonymize-core/src/redact.rs Outdated

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fd14f11c0c

Comment thread packages/anonymize/scripts/migration-fixture-perf.mjs
Comment thread crates/anonymize-napi/src/lib.rs Outdated
Comment thread crates/anonymize-core/src/resolution/sanitize.rs Outdated

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 163ee75fca

Comment thread crates/anonymize-core/src/diagnostics.rs Outdated
Comment thread crates/anonymize-napi/src/lib.rs
Comment thread packages/anonymize/src/build-unified-search.ts Outdated

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4fc897746a

Comment thread packages/anonymize/scripts/migration-fixture-perf.mjs Outdated
Comment thread crates/anonymize-napi/src/lib.rs Outdated
Comment thread crates/anonymize-core/src/false_positives.rs Outdated
Comment thread crates/anonymize-adapter-contract/src/lib.rs

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 31dc9df699

Comment thread crates/anonymize-core/src/search.rs
Comment thread packages/data/dictionaries/index.ts Outdated
Comment thread packages/data/dictionaries/index.ts
Comment thread crates/anonymize-core/src/prepared.rs Outdated
Comment thread crates/anonymize-core/src/triggers.rs Outdated

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 32ef71ca81

Comment thread crates/anonymize-core/src/triggers.rs Outdated
Comment thread packages/anonymize/src/build-unified-search.ts
Comment thread crates/anonymize-core/src/prepared.rs Outdated
Comment thread crates/anonymize-core/src/prepared.rs

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4d6303598b

boundary_search: literal_search(data.boundary_words)?,
br_cep_cue_search: literal_search(data.br_cep_cue_words)?,
postal_code_re: compile_regex(
r"(?u)(?:\d{3}\s\d{2}|\d{2}[-‐‑‒–—―]\d{3}|\d{5}|\d{5}[-‐‑‒–—―]\d{3}|\d{5}[-‐‑‒–—―]\d{4})",

P2: Match hyphenated postal codes before ZIP5

For hyphenated values such as CA 94304-1050 or Brazilian CEP 01001-000, this alternation matches the bare \d{5} branch first, since the alternatives are tried in order. postal_boundaries then rejects that five-digit prefix because the next character is a dash, and find_iter never retries the longer ZIP+4/CEP alternatives at the same start. As a result, the native address-seed path skips the us_zip_plus_four_shape_re and br_cep_shape_re branches entirely, so state-qualified ZIP+4 fragments and cue-gated CEPs that the TS detector covers are missed. Put the longer hyphenated alternatives before bare \d{5}, or remove the bare case from this regex.
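
The ordering problem can be reproduced without a regex engine. The sketch below is a simplified model, not the crate's code: `first_match_len`, `trailing_boundary_ok`, and `postal_seed` are hypothetical names, shapes use `d` for an ASCII digit and `-` for a plain hyphen (the Unicode dash variants are omitted), and the boundary check only stands in for the part of postal_boundaries described above.

```rust
/// Returns the length of the first shape, in listed order, that matches the
/// start of `text`. 'd' stands for one ASCII digit; any other byte is literal.
/// This models an ordered alternation that commits to the first branch that
/// succeeds (an assumption about the engine, taken from the review comment).
fn first_match_len(text: &str, shapes: &[&str]) -> Option<usize> {
    let bytes = text.as_bytes();
    shapes.iter().find_map(|shape| {
        let shape = shape.as_bytes();
        if bytes.len() < shape.len() {
            return None;
        }
        let ok = shape.iter().zip(bytes).all(|(&s, &b)| match s {
            b'd' => b.is_ascii_digit(),
            literal => literal == b,
        });
        ok.then_some(shape.len())
    })
}

/// Hypothetical stand-in for the boundary check: a candidate is rejected when
/// it is immediately followed by a hyphen or another digit.
fn trailing_boundary_ok(text: &str, len: usize) -> bool {
    !matches!(
        text.as_bytes().get(len).copied(),
        Some(b'-') | Some(b'0'..=b'9')
    )
}

/// Combines both steps. Like the behavior described in the comment, it does
/// not retry later shapes once the first matching shape fails the boundary.
fn postal_seed<'a>(text: &'a str, shapes: &[&str]) -> Option<&'a str> {
    let len = first_match_len(text, shapes)?;
    trailing_boundary_ok(text, len).then(|| &text[..len])
}

/// Bare five-digit shape listed before the hyphenated shapes.
const BARE_FIRST: [&str; 3] = ["ddddd", "ddddd-ddd", "ddddd-dddd"];
/// Hyphenated shapes listed before the bare five-digit shape.
const LONGEST_FIRST: [&str; 3] = ["ddddd-dddd", "ddddd-ddd", "ddddd"];
```

With `BARE_FIRST`, `postal_seed("94304-1050", &BARE_FIRST)` yields `None`, because the five-digit prefix is followed by a hyphen. With `LONGEST_FIRST`, the same input yields the full `94304-1050`, and a bare `94304` is still accepted.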

Comment on lines +225 to +228
if rule.requires_exact_case
&& !matches_trigger_case(full_text, &offsets, found, rule)?
{
continue;

P2: Honor case-insensitive trigger matches

When a configured trigger is all uppercase and has no spaces (for example CPF, CNPJ, DNI, or CP from the trigger data), the native search still finds lowercase source variants because the trigger index is case-insensitive, but this extra exact-case check drops them before value extraction. The TypeScript trigger path lowercases the patterns and relies on case-insensitive matching, so native-static misses lowercase legal/OCR forms such as cpf: 123... that the existing pipeline redacts. Remove or narrow this exact-case gate for configured case-insensitive trigger rules.
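
A minimal sketch of the interaction, assuming an ASCII-only trigger and looking only at the first occurrence. `find_trigger_ci` and `trigger_hit` are hypothetical helpers written for illustration; they are not the crate's `matches_trigger_case` or its search index.

```rust
/// Finds `trigger` in `text`, ignoring ASCII case, and returns the byte offset
/// of the first occurrence. Stands in for the case-insensitive trigger index.
fn find_trigger_ci(text: &str, trigger: &str) -> Option<usize> {
    let (hay, needle) = (text.as_bytes(), trigger.as_bytes());
    if needle.is_empty() {
        return None;
    }
    hay.windows(needle.len())
        .position(|window| window.eq_ignore_ascii_case(needle))
}

/// Applies an optional exact-case gate on top of the case-insensitive hit.
/// With `requires_exact_case = true`, a lowercase source variant is discarded,
/// which mirrors the `continue` in the reviewed snippet.
fn trigger_hit(text: &str, trigger: &str, requires_exact_case: bool) -> Option<usize> {
    let start = find_trigger_ci(text, trigger)?;
    let found = &text.as_bytes()[start..start + trigger.len()];
    if requires_exact_case && found != trigger.as_bytes() {
        return None;
    }
    Some(start)
}
```

Here `trigger_hit("cpf: 123", "CPF", true)` returns `None` while `trigger_hit("cpf: 123", "CPF", false)` returns `Some(0)`, so the gate alone decides whether the lowercase form reaches value extraction.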

Comment on lines +155 to +157
static_redaction_result_to_utf16_binding(result, full_text)
.map_err(|error| to_py_contract_error(&error))
.map(to_py_static_redaction_result)

P2: Return Python-native offsets from PyO3 results

When Python callers use redact_static_entities on text containing an astral character before a match, this converts core byte offsets to UTF-16 offsets, but Python strings are indexed by Unicode code points, not UTF-16 code units. For example, after an emoji prefix the returned start is one position too far right for full_text[start:end], so Python users slice or highlight the wrong entity even though JS parity still passes. Convert PyO3 class results to Python code-point offsets, or expose the UTF-16 contract only on the JSON/JS compatibility path.
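
The difference between the two offset units can be shown with the standard library alone. This is a sketch with hypothetical helper names (`utf16_offset`, `code_point_offset`), not the binding code from the PR; both helpers assume `byte_offset` falls on a character boundary and panic otherwise.

```rust
/// Converts a UTF-8 byte offset into a UTF-16 code-unit offset, the unit used
/// by JavaScript string indices.
fn utf16_offset(text: &str, byte_offset: usize) -> usize {
    text[..byte_offset].encode_utf16().count()
}

/// Converts a UTF-8 byte offset into a Unicode code-point offset, the unit
/// used by Python `str` indexing and slicing.
fn code_point_offset(text: &str, byte_offset: usize) -> usize {
    text[..byte_offset].chars().count()
}
```

For the text `"😀 CPF"`, the match `CPF` starts at byte 5. The emoji takes four UTF-8 bytes, two UTF-16 code units, and one code point, so `utf16_offset` gives 3 and `code_point_offset` gives 2, which is the one-position drift described in the comment.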

Comment on lines +896 to +897
if (!meta || !nativeSupportsRegexMeta(meta)) {
continue;

P2: Do not drop regexes with unsupported validators

When nativeSupportsRegexMeta returns false, this silently omits the pattern from the native config instead of falling back or failing preparation. Valid identifiers whose regexes have validators outside NATIVE_REGEX_VALIDATOR_IDS (for example CN RIC, crypto wallets, or many EU VAT/NIP/PPS/codice fiscale patterns already present in REGEX_META) are still redacted by the TypeScript pipeline but are never searched by native-static. Those sensitive values therefore remain in the output whenever the native prepared config is used.
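
One way to avoid the silent drop is to keep track of every pattern the native path cannot handle, so the caller can route those to a fallback or fail preparation. The sketch below is in Rust although the reviewed file is TypeScript, and everything in it is hypothetical: the `RegexMeta` shape, the validator ids in `NATIVE_VALIDATORS`, and the function names are assumptions for illustration only.

```rust
/// Hypothetical, simplified pattern metadata: an id plus an optional validator id.
struct RegexMeta {
    id: &'static str,
    validator: Option<&'static str>,
}

/// Hypothetical set of validator ids that the native path implements.
const NATIVE_VALIDATORS: [&str; 2] = ["luhn", "mod11"];

/// A pattern is natively supported if it has no validator or a known one.
fn native_supports(meta: &RegexMeta) -> bool {
    meta.validator
        .map_or(true, |validator| NATIVE_VALIDATORS.contains(&validator))
}

/// Splits pattern ids into (native, fallback) instead of skipping unsupported
/// ones, so no pattern disappears without the caller knowing about it.
fn split_patterns(metas: &[RegexMeta]) -> (Vec<&'static str>, Vec<&'static str>) {
    let mut native = Vec::new();
    let mut fallback = Vec::new();
    for meta in metas {
        if native_supports(meta) {
            native.push(meta.id);
        } else {
            fallback.push(meta.id);
        }
    }
    (native, fallback)
}

/// Strict variant: fails preparation and names the unsupported patterns.
fn require_native(metas: &[RegexMeta]) -> Result<Vec<&'static str>, String> {
    let (native, fallback) = split_patterns(metas);
    if fallback.is_empty() {
        Ok(native)
    } else {
        Err(format!("unsupported validators for: {}", fallback.join(", ")))
    }
}
```

Either variant makes the gap visible: `split_patterns` lets a caller run the fallback list through another pipeline, while `require_native` refuses to build a config that would leave patterns unsearched.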
