feat: add anonymize core crate #217
341eb7f to b916fa1
Dependency Review
The following issues were found:
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b916fa1630
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
💡 Codex Review
Reviewed commit: c664ed09a6
💡 Codex Review
Reviewed commit: fcbb328f84
💡 Codex Review
Reviewed commit: fd14f11c0c
💡 Codex Review
Reviewed commit: 163ee75fca
💡 Codex Review
Reviewed commit: 4fc897746a
💡 Codex Review
Reviewed commit: 31dc9df699
💡 Codex Review
Reviewed commit: 32ef71ca81
💡 Codex Review
Reviewed commit: 4d6303598b
boundary_search: literal_search(data.boundary_words)?,
br_cep_cue_search: literal_search(data.br_cep_cue_words)?,
postal_code_re: compile_regex(
    r"(?u)(?:\d{3}\s\d{2}|\d{2}[-‐‑‒–—―]\d{3}|\d{5}|\d{5}[-‐‑‒–—―]\d{3}|\d{5}[-‐‑‒–—―]\d{4})",
Match hyphenated postal codes before ZIP5
For hyphenated values such as CA 94304-1050 or Brazilian CEP 01001-000, the alternation matches the bare \d{5} branch first, so only the leading five digits are captured. postal_boundaries then rejects that prefix because the next character is a dash, and find_iter never retries the longer ZIP+4/CEP alternatives at the same start. As a result, the native address-seed path skips the us_zip_plus_four_shape_re and br_cep_shape_re branches entirely, and state-qualified ZIP+4 fragments and cue-gated CEPs that the TS detector covers are missed. Put the longer hyphenated alternatives before bare \d{5}, or remove the bare case from this regex.
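The ordering problem can be reproduced outside the crate. The following is a minimal sketch in Python rather than Rust, relying on the fact that Python's re module also takes the first alternative that matches at a given start position. The boundary_ok helper is a hypothetical stand-in for postal_boundaries (the real function is not shown in this thread), and the dash class is assumed to be the ASCII hyphen plus U+2010 through U+2015.

```python
import re

# Dash characters assumed to mirror the reviewed regex: ASCII hyphen plus U+2010..U+2015.
DASH_CHARS = "-\u2010\u2011\u2012\u2013\u2014\u2015"
DASH = "[" + DASH_CHARS + "]"

# Alternation order from the reviewed line: bare \d{5} precedes the hyphenated forms.
ORIGINAL = re.compile("|".join([
    r"\d{3}\s\d{2}",
    r"\d{2}" + DASH + r"\d{3}",
    r"\d{5}",
    r"\d{5}" + DASH + r"\d{3}",
    r"\d{5}" + DASH + r"\d{4}",
]))

# Suggested order: longer hyphenated alternatives first, bare \d{5} last.
REORDERED = re.compile("|".join([
    r"\d{5}" + DASH + r"\d{4}",
    r"\d{5}" + DASH + r"\d{3}",
    r"\d{3}\s\d{2}",
    r"\d{2}" + DASH + r"\d{3}",
    r"\d{5}",
]))


def boundary_ok(text, start, end):
    """Hypothetical boundary check: reject a match glued to a digit or a dash."""
    def is_glue(ch):
        return ch.isdigit() or ch in DASH_CHARS

    before = text[start - 1] if start > 0 else " "
    after = text[end] if end < len(text) else " "
    return not is_glue(before) and not is_glue(after)


def find_postal(pattern, text):
    """Return every match of `pattern` that also passes the boundary check."""
    return [
        m.group()
        for m in pattern.finditer(text)
        if boundary_ok(text, m.start(), m.end())
    ]
```

Under these assumptions, find_postal(ORIGINAL, "CA 94304-1050") finds nothing, because the five-digit prefix is matched and then rejected at the dash, while find_postal(REORDERED, "CA 94304-1050") returns the full ZIP+4.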
if rule.requires_exact_case
    && !matches_trigger_case(full_text, &offsets, found, rule)?
{
    continue;
Honor case-insensitive trigger matches
When a configured trigger is all uppercase and has no spaces, for example CPF, CNPJ, DNI, or CP from the trigger data, the native search still finds lowercase source variants because the trigger index is case-insensitive, but this extra exact-case check drops them before value extraction. The TypeScript trigger path lowercases the patterns and relies on case-insensitive matching, so native-static misses lowercase legal/OCR forms such as cpf: 123... that the existing pipeline redacts; remove or narrow this exact-case gate for configured case-insensitive trigger rules.
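To make the effect of the gate concrete, here is a small Python sketch, not the crate's Rust code. The function name, the `trigger:` value syntax, and the sample numbers are all hypothetical; only the idea of a case-insensitive search followed by an exact-case filter comes from the comment above.

```python
import re


def find_trigger_values(text, trigger, requires_exact_case):
    """Collect the values that follow `trigger:` in `text` (hypothetical format)."""
    # The search itself is case-insensitive, like the trigger index described above.
    pattern = re.compile(re.escape(trigger) + r":\s*([\d.\-]+)", re.IGNORECASE)
    values = []
    for m in pattern.finditer(text):
        found = text[m.start():m.start() + len(trigger)]
        if requires_exact_case and found != trigger:
            # The gate under review: a lowercase variant is dropped here,
            # before its value is ever extracted.
            continue
        values.append(m.group(1))
    return values


SAMPLE = "CPF: 123.456.789-09 / cpf: 987.654.321-00"
```

With requires_exact_case=True only the uppercase occurrence in SAMPLE yields a value; with the gate disabled both values are extracted, which is the behavior the comment attributes to the TypeScript trigger path.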
static_redaction_result_to_utf16_binding(result, full_text)
    .map_err(|error| to_py_contract_error(&error))
    .map(to_py_static_redaction_result)
Return Python-native offsets from PyO3 results
When Python callers use redact_static_entities on text containing an astral character before a match, this converts core byte offsets to UTF-16 offsets; Python strings are indexed by Unicode code points, not UTF-16 code units. For example, after an emoji prefix the returned start is one position too far right for full_text[start:end], so Python users slice/highlight the wrong entity even though JS parity still passes; convert PyO3 class results to Python code-point offsets, or expose the UTF-16 contract only on the JSON/JS compatibility path.
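The off-by-one can be checked directly in Python. The two helpers below are a hypothetical sketch of the conversions involved, assuming the core reports UTF-8 byte offsets that fall on character boundaries; they are not the binding's actual functions.

```python
def byte_to_codepoint(text, byte_offset):
    """Convert a UTF-8 byte offset into a Python str index (code points).

    Raises UnicodeDecodeError if the offset falls inside a multi-byte character.
    """
    return len(text.encode("utf-8")[:byte_offset].decode("utf-8"))


def byte_to_utf16(text, byte_offset):
    """Convert a UTF-8 byte offset into a UTF-16 code-unit offset (the JS convention)."""
    prefix = text.encode("utf-8")[:byte_offset].decode("utf-8")
    return len(prefix.encode("utf-16-le")) // 2


# An astral character (U+1F600) before the entity: 4 UTF-8 bytes,
# 2 UTF-16 code units, but a single Python code point.
FULL_TEXT = "\U0001F600 Alice"
START_BYTE = FULL_TEXT.encode("utf-8").index(b"Alice")
END_BYTE = START_BYTE + len(b"Alice")
```

Slicing FULL_TEXT with the code-point offsets recovers "Alice", while slicing with the UTF-16 offsets starts one position too far right and yields "lice".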
if (!meta || !nativeSupportsRegexMeta(meta)) {
    continue;
Do not drop regexes with unsupported validators
When nativeSupportsRegexMeta returns false, this silently omits the pattern from the native config instead of falling back or failing preparation. Valid identifiers whose regexes have validators outside NATIVE_REGEX_VALIDATOR_IDS, for example CN RIC, crypto wallets, or many EU VAT/NIP/PPS/codice fiscale patterns already present in REGEX_META, are still redacted by the TypeScript pipeline but are never searched by native-static, so those sensitive values remain in the output whenever the native prepared config is used.
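One possible shape for the fix, sketched in Python rather than the PR's TypeScript: partition the patterns instead of skipping them, so unsupported ones are either handed to a fallback path or cause preparation to fail. Every name here (NATIVE_VALIDATORS, native_supports, prepare_config, the validator ids, and the metadata layout) is a hypothetical stand-in, not the repository's actual API.

```python
# Hypothetical stand-in for NATIVE_REGEX_VALIDATOR_IDS.
NATIVE_VALIDATORS = {"luhn", "cpf"}


def native_supports(meta):
    """A pattern is native-ready if it has no validator or a supported one."""
    validator = meta.get("validator")
    return validator is None or validator in NATIVE_VALIDATORS


def prepare_config(regex_meta, strict=False):
    """Split pattern ids into native and fallback sets instead of dropping any."""
    native, fallback = [], []
    for pattern_id, meta in regex_meta.items():
        if native_supports(meta):
            native.append(pattern_id)
        else:
            fallback.append(pattern_id)
    if strict and fallback:
        # Failing loudly is the alternative to running a fallback detector.
        raise ValueError("patterns without native validators: " + ", ".join(fallback))
    return {"native": native, "fallback": fallback}


EXAMPLE_META = {
    "credit_card": {"validator": "luhn"},
    "plain_id": {},
    "eu_vat": {"validator": "vat_checksum"},
}
```

Either branch keeps the unsupported pattern visible to the caller; the silent continue is what lets sensitive values through.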
Summary
stella-anonymize-core crate
CC on behalf of @sok0