Open EntityLabel system + per-request catalog + non-generic Policy + engine split (closes #252) #278

Merged: martsokha merged 3 commits into main from refactor/entity-label on Jun 14, 2026

@martsokha martsokha commented Jun 13, 2026

Summary

Three-commit branch that replaces the closed EntityKind enum with an open-vocabulary label system, wires labels per-request through policy catalogs, then restructures the engine into independent detection/redaction halves.

Commit 1 — open EntityLabel system (#252)

  • EntityLabel (name + optional description + free-form tags) replaces the closed EntityKind enum.
  • EntityLabelRef is the HipStr-wrapped name-only handle stored on every detected Entity.
  • EntityLabelCatalog = runtime-constructed name-indexed lookup. The workspace ships 66 built-in labels in entity::builtins as a toolkit/SDK API; EntityLabelCatalog::with_builtins() is the convenience constructor.
  • EntitySelector matches by labels: Vec<EntityLabelRef> + tags: Vec<HipStr>. EntityCategory deleted; category groupings survive as tags on built-in labels.
  • Touches every recognizer (pattern, NER, LLM), nvisy-fake's per-label generator, nvisy-toolkit dedup + redaction registries, all asset TOMLs, and workspace tests.
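A minimal sketch of the shape described in Commit 1, for readers who want to see how a label, its name-only handle, and a catalog fit together. Field types and method names here are assumptions: the PR stores names in `HipStr`, while this sketch uses plain `String`, and `insert`, `get`, and `has_tag` are hypothetical helpers rather than the crate's actual API.

```rust
use std::collections::HashMap;

/// Label definition: name, optional description, free-form tags.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct EntityLabel {
    pub name: String,
    pub description: Option<String>,
    pub tags: Vec<String>,
}

impl EntityLabel {
    /// Hypothetical convenience constructor for this sketch.
    pub fn new(name: &str, tags: &[&str]) -> Self {
        Self {
            name: name.to_string(),
            description: None,
            tags: tags.iter().map(|t| t.to_string()).collect(),
        }
    }
}

/// Name-only handle stored on a detected entity.
/// Assumption: `String` stands in for the PR's `HipStr`.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct EntityLabelRef(pub String);

/// Name-indexed lookup constructed at runtime.
#[derive(Debug, Default)]
pub struct EntityLabelCatalog {
    by_name: HashMap<String, EntityLabel>,
}

impl EntityLabelCatalog {
    /// Registers a label, replacing any earlier label with the same name.
    pub fn insert(&mut self, label: EntityLabel) {
        self.by_name.insert(label.name.clone(), label);
    }

    /// Resolves a handle back to its full definition, if registered.
    pub fn get(&self, label: &EntityLabelRef) -> Option<&EntityLabel> {
        self.by_name.get(&label.0)
    }

    /// True when the referenced label is registered and carries `tag`.
    pub fn has_tag(&self, label: &EntityLabelRef, tag: &str) -> bool {
        self.get(label)
            .map_or(false, |l| l.tags.iter().any(|t| t == tag))
    }
}
```

Because the handle carries only the name, a detected entity stays cheap to clone, and anything that needs the description or tags goes through the catalog.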

Commit 2 — per-request catalog supplied via policies

  • New Policy.labels: Vec<EntityLabel> field. Each submitted policy declares the labels its rules operate over.
  • DetectionInput::unify_labels() unions every policy's labels into a per-request EntityLabelCatalog. Conflicts (same name with non-equal (description, tags)) → HTTP 400.
  • DetectionInput::validate_selector_labels() rejects requests where a selector targets a label no policy declares.
  • DetectionConfig::build_for_request(catalog) builds a fresh RecognizerRegistry per request: patterns/dictionaries filtered against the catalog, NER's zero-shot label list sourced from the catalog.
  • EntitySelector::matches(entity, catalog) dereferences tags against the request catalog. BUILTIN_CATALOG: LazyLock deleted.
  • Engine-startup Arc<RecognizerRegistry> and the Detection.labels post-filter both gone; pattern/NER filtering happens at registry construction.
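The union-with-conflict-detection step can be sketched as follows. This is an illustration under stated assumptions, not the engine's code: `unify_labels` and `validate_selector_labels` are written as free functions over simplified types, `LabelError` is a hypothetical error enum (the mapping to HTTP 400 would happen in the server layer), and tag equality is assumed to be order-sensitive.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct EntityLabel {
    pub name: String,
    pub description: Option<String>,
    pub tags: Vec<String>,
}

impl EntityLabel {
    /// Hypothetical convenience constructor for this sketch.
    pub fn new(name: &str, tags: &[&str]) -> Self {
        Self {
            name: name.to_string(),
            description: None,
            tags: tags.iter().map(|t| t.to_string()).collect(),
        }
    }
}

/// Simplified policy: only the declared labels matter here.
pub struct Policy {
    pub labels: Vec<EntityLabel>,
}

/// Hypothetical error type; the server would translate it to HTTP 400.
#[derive(Debug, PartialEq, Eq)]
pub enum LabelError {
    Conflict { name: String },
    Undeclared { name: String },
}

/// Unions every policy's labels. Identical redeclarations are accepted;
/// the same name with a different description or tags is a conflict.
pub fn unify_labels(policies: &[Policy]) -> Result<BTreeMap<String, EntityLabel>, LabelError> {
    let mut catalog: BTreeMap<String, EntityLabel> = BTreeMap::new();
    for label in policies.iter().flat_map(|p| &p.labels) {
        if let Some(existing) = catalog.get(&label.name) {
            if existing != label {
                return Err(LabelError::Conflict { name: label.name.clone() });
            }
            continue;
        }
        catalog.insert(label.name.clone(), label.clone());
    }
    Ok(catalog)
}

/// Rejects selector label names that no policy declared.
pub fn validate_selector_labels(
    selector_labels: &[String],
    catalog: &BTreeMap<String, EntityLabel>,
) -> Result<(), LabelError> {
    match selector_labels.iter().find(|n| !catalog.contains_key(n.as_str())) {
        Some(name) => Err(LabelError::Undeclared { name: name.clone() }),
        None => Ok(()),
    }
}
```

Two policies that declare `email` identically unify cleanly; if one of them tags it differently, the request fails before any recognizer is built.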

Commit 3 — non-generic Policy + engine split + module restructure

Policy + Action shape:

  • Policy and PolicyRule non-generic. Action::Redact(ModalityRedactions) carries per-modality operator specs in one rule; Action::Suppress(SuppressAction) + Action::Audit(AuditAction) for the other verbs.
  • Audit-side Decision<M>.action is ResolvedAction<M> (per-entity resolved outcome, distinct from policy-author Action).
  • ProjectRedaction trait does per-modality projection of ModalityRedactions::operator_for::<M>(). RedactionConfig.default_operators is the deployment-wide fallback.
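To illustrate how one non-generic rule can still yield a modality-specific operator, here is a rough sketch of the projection with a deployment-wide fallback. Everything beyond the names taken from the PR is an assumption: the `Modality` trait, the operator enums, the payload fields of `SuppressAction` and `AuditAction`, and the free function `resolve_operator` are hypothetical stand-ins, and the real `ProjectRedaction` trait may look quite different.

```rust
/// Marker types for two modalities (the PR also mentions audio).
pub struct Text;
pub struct Image;

#[derive(Debug, Clone, PartialEq)]
pub enum TextOperator { Mask(char), Replace(String) }

#[derive(Debug, Clone, PartialEq)]
pub enum ImageOperator { Blur, Blackout }

/// Per-modality operator specs carried by a single rule.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ModalityRedactions {
    pub text: Option<TextOperator>,
    pub image: Option<ImageOperator>,
}

/// Hypothetical payloads; only "optional reason" and "severity hint"
/// come from the PR text.
#[derive(Debug, Clone, PartialEq)]
pub struct SuppressAction { pub reason: Option<String> }
#[derive(Debug, Clone, PartialEq)]
pub struct AuditAction { pub severity: u8 }

#[derive(Debug, Clone, PartialEq)]
pub enum Action {
    Redact(ModalityRedactions),
    Suppress(SuppressAction),
    Audit(AuditAction),
}

/// Hypothetical trait: each modality knows which field it projects.
pub trait Modality {
    type Operator: Clone;
    fn project(redactions: &ModalityRedactions) -> Option<&Self::Operator>;
}

impl Modality for Text {
    type Operator = TextOperator;
    fn project(redactions: &ModalityRedactions) -> Option<&TextOperator> {
        redactions.text.as_ref()
    }
}

impl Modality for Image {
    type Operator = ImageOperator;
    fn project(redactions: &ModalityRedactions) -> Option<&ImageOperator> {
        redactions.image.as_ref()
    }
}

impl ModalityRedactions {
    pub fn operator_for<M: Modality>(&self) -> Option<&M::Operator> {
        M::project(self)
    }
}

/// Uses the rule's operator when it covers the modality, otherwise the
/// deployment-wide default. Non-redacting actions yield no operator.
pub fn resolve_operator<M: Modality>(
    action: &Action,
    defaults: &ModalityRedactions,
) -> Option<M::Operator> {
    match action {
        Action::Redact(spec) => spec
            .operator_for::<M>()
            .or_else(|| defaults.operator_for::<M>())
            .cloned(),
        Action::Suppress(_) | Action::Audit(_) => None,
    }
}
```

With this layout the rule type needs no modality parameter; the modality only appears at the call site, for example `resolve_operator::<Text>(&action, &defaults)`.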

Engine + context + plan split:

  • RunContext → DetectionContext + RedactionContext with shared PhaseContext trait.
  • Plan → DetectionPlan + RedactionPlan; HTTP bodies carry the matching plan.
  • Engine → DetectionEngine + RedactionEngine. RedactionEngine::from_detection(&detection) shares the registry, runtime config, optional key provider, and an in-memory DetectionState read-handle for the detect→redact handoff.
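The detect→redact handoff can be pictured as two engines holding `Arc` handles to the same state. The sketch below is an assumption-heavy simplification: `DetectionState` is modeled as a lock-guarded map from a hypothetical run id to label names, `record`, `entities_for`, and `max_concurrency` are invented helpers, and the registry and key provider are left out.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

#[derive(Debug)]
pub struct RuntimeConfig {
    pub max_concurrency: usize,
}

/// Hypothetical in-memory store of detection results, keyed by run id.
#[derive(Debug, Default)]
pub struct DetectionState {
    entities_by_run: RwLock<HashMap<u64, Vec<String>>>,
}

pub struct DetectionEngine {
    runtime: Arc<RuntimeConfig>,
    state: Arc<DetectionState>,
}

pub struct RedactionEngine {
    runtime: Arc<RuntimeConfig>,
    detections: Arc<DetectionState>,
}

impl DetectionEngine {
    pub fn new(runtime: RuntimeConfig) -> Self {
        Self {
            runtime: Arc::new(runtime),
            state: Arc::new(DetectionState::default()),
        }
    }

    /// Stores the entities found by one detection run.
    pub fn record(&self, run_id: u64, entities: Vec<String>) {
        self.state
            .entities_by_run
            .write()
            .expect("detection state lock poisoned")
            .insert(run_id, entities);
    }
}

impl RedactionEngine {
    /// Shares the detection engine's resources instead of copying them.
    pub fn from_detection(detection: &DetectionEngine) -> Self {
        Self {
            runtime: Arc::clone(&detection.runtime),
            detections: Arc::clone(&detection.state),
        }
    }

    /// Read-only view of what a detection run produced.
    pub fn entities_for(&self, run_id: u64) -> Option<Vec<String>> {
        self.detections
            .entities_by_run
            .read()
            .expect("detection state lock poisoned")
            .get(&run_id)
            .cloned()
    }

    pub fn max_concurrency(&self) -> usize {
        self.runtime.max_concurrency
    }
}
```

Since both engines point at the same `DetectionState`, results recorded after `from_detection` was called are still visible to the redaction side.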

Module restructure:

  • pipeline/ deleted. detection/ and redaction/ promoted to crate root.
  • Configs split: core/config.rs (RuntimeConfig + EngineConfig + ResourceLimits); detection/{config,extraction,plan}; redaction/{config,plan}.
  • phases/ split: detection-side phases → detection/phases/, redaction-side phases → redaction/phases/, ingestion helpers → core/ingestion/.
  • Convenience re-exports removed; consumers import from canonical paths.

Test plan

  • cargo build --workspace --all-features — green
  • cargo test --workspace --all-features — green (0 failures)
  • cargo clippy --workspace --all-features --all-targets -- -D warnings — green
  • RUSTDOCFLAGS="-D warnings" cargo doc --workspace --all-features --no-deps — green
  • CI green on the open PR

🤖 Generated with Claude Code

martsokha and others added 2 commits June 13, 2026 17:39
…system

Resolves #252. Switches the workspace from a fixed `EntityKind` enum to
an open-vocabulary label system so deployments can mint custom entity
labels without touching workspace code.

Shape:
- `EntityLabel` carries name + optional description + free-form tags.
- `EntityLabelRef` is a `HipStr`-wrapped name-only handle stored on
  every detected `Entity` (cheap clone, hot-path type).
- `EntityLabelCatalog` is a runtime-constructed name-indexed lookup;
  the workspace ships 66 built-in labels in `entity::builtins`, each
  reachable as a constant (`builtins::PERSON_NAME`, …) and bundled into
  `EntityLabelCatalog::with_builtins()`.
- `EntitySelector` matches by `labels: Vec<EntityLabelRef>` plus
  `tags: Vec<HipStr>` (dereferenced via the catalog).

Deletes `EntityCategory` entirely; category groupings (`pii`, `phi`,
`pci`, `personal_identity`, `financial`, …) survive as tags on each
built-in label so selectors keep the same expressive power without an
enum.
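As an illustration of how tags can stand in for the deleted `EntityCategory`, the sketch below matches an entity either by its label name or by a tag looked up in the catalog. The matching semantics are an assumption (label match OR tag match, with an empty selector matching nothing); the PR does not spell them out. `LabelTags` is a hypothetical stand-in for the catalog, and `String` replaces `HipStr`.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the catalog: label name to its tags.
pub type LabelTags = HashMap<String, Vec<String>>;

pub struct Entity {
    pub label: String,
    pub text: String,
}

#[derive(Debug, Default)]
pub struct EntitySelector {
    pub labels: Vec<String>,
    pub tags: Vec<String>,
}

impl EntitySelector {
    /// Assumed semantics: true when the entity's label is listed directly,
    /// or when the catalog says that label carries one of the selector's tags.
    pub fn matches(&self, entity: &Entity, catalog: &LabelTags) -> bool {
        if self.labels.iter().any(|l| *l == entity.label) {
            return true;
        }
        catalog
            .get(&entity.label)
            .map_or(false, |tags| self.tags.iter().any(|t| tags.contains(t)))
    }
}
```

A selector with `tags: ["pii"]` then covers every label tagged `pii`, including custom labels a deployment mints later, which is the grouping the enum used to provide.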

Updates every recognizer (pattern, NER, LLM), nvisy-fake's per-label
generator dispatch, nvisy-toolkit dedup + redaction registries, all
asset TOMLs (`entity_kind` → `label`), and workspace-wide tests.

Labels are now per-request rather than per-deployment. Each
submitted `Policy<M>` carries the labels it operates over; the
engine unions them into an `EntityLabelCatalog` driving recognizer
dispatch and selector tag matching for that request.

Engine flow:
- New `Policy::labels: Vec<EntityLabel>` field. `unify_labels`
  unions every policy's labels with conflict detection (same name
  with non-equal description/tags → HTTP 400).
- `validate_selector_labels` ensures every selector label name
  exists in the unioned catalog.
- `DetectionConfig::build_for_request(catalog)` builds a fresh
  `RecognizerRegistry` per request: patterns/dictionaries are
  filtered against the catalog (entries with unregistered labels
  never run), NER's zero-shot label list is sourced from the
  catalog.
- `EntitySelector::matches(entity, catalog)` dereferences tags
  against the request catalog. The `BUILTIN_CATALOG: LazyLock` in
  selector.rs is gone.
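The registry-construction-time filtering described above might look roughly like this. It is a sketch under assumptions: the catalog is reduced to a set of label names, `PatternSpec` and the `ner_labels` field are hypothetical, and dictionaries are omitted because they would be filtered the same way as patterns.

```rust
use std::collections::BTreeSet;

/// Hypothetical pattern entry from the deployment's detection config.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PatternSpec {
    pub label: String,
    pub regex: String,
}

/// Deployment-level template; cloned from, never mutated per request.
#[derive(Debug, Clone, Default)]
pub struct DetectionConfig {
    pub patterns: Vec<PatternSpec>,
}

/// Fresh per-request registry.
#[derive(Debug, Default)]
pub struct RecognizerRegistry {
    pub patterns: Vec<PatternSpec>,
    pub ner_labels: Vec<String>,
}

impl DetectionConfig {
    /// Keeps only patterns whose label the request declared, and hands
    /// the declared label names to the zero-shot NER recognizer.
    pub fn build_for_request(&self, catalog: &BTreeSet<String>) -> RecognizerRegistry {
        let patterns = self
            .patterns
            .iter()
            .filter(|p| catalog.contains(&p.label))
            .cloned()
            .collect();
        // BTreeSet iterates in sorted order, so the NER label list is stable.
        let ner_labels = catalog.iter().cloned().collect();
        RecognizerRegistry { patterns, ner_labels }
    }
}
```

Filtering at construction means a pattern for an undeclared label never executes, so no post-filter over detected entities is needed.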

Deletions:
- Engine-startup `Arc<RecognizerRegistry>` and
  `DetectionEngineState.recognizer_registry` — replaced by an
  `Arc<DetectionConfig>` template.
- `Detection` plan node + `cfg.labels.contains(...)` post-filter
  in `detection.rs` — superseded by registry-construction-time
  filtering.
- `default_text_labels()` — no more hardcoded label allowlist.

Plus a workspace-wide style sweep: inline `tracing::*` macros and
attributes (`#[tracing::instrument(...)]`, `tracing::Level::INFO`)
and convert remaining collapsed-form rustdoc links `[X][path]` to
the reference form `[X]` + `[X]: path` at the bottom of doc blocks.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
martsokha self-assigned this Jun 13, 2026
martsokha added the engine, server, ontology, and refactor labels Jun 13, 2026
…n, restructure modules

Major restructure of the engine crate driven by three threads of
work:

**Policy + Action shape**

- `Policy` and `PolicyRule` are no longer generic on modality.
- `Action::Redact(ModalityRedactions)` carries per-modality
  operator specs in one rule, so a single rule covers text +
  image + audio without duplication. `Action::Suppress(SuppressAction)`
  carries an optional reason. `Action::Audit(AuditAction)` is a
  new variant that flags entities for review without redacting
  them (the detection pass already produces audit entries; this
  tags them with a severity hint).
- Audit-side `Decision<M>.action` becomes `ResolvedAction<M>` — the
  per-entity resolved outcome, distinct from the policy-author
  `Action`. Naming collision resolved.
- New `ProjectRedaction` trait does per-modality projection of
  `ModalityRedactions::operator_for::<M>()`. `RedactionConfig`
  gains a `default_operators` field for deployment-wide fallback
  when a rule doesn't cover the entity's modality.

**Per-pass context + plan + engine split**

- `RunContext` deleted. Replaced by `DetectionContext` +
  `RedactionContext`, each carrying only the engine resources
  its side actually consumes. Shared surface
  (shared/cancel/concurrency) sits on a new `PhaseContext` trait.
- `Plan` deleted. Replaced by `DetectionPlan` (extraction +
  deduplication knobs) + `RedactionPlan` (redaction + validation
  knobs). `NewDetection` and `NewRedaction` HTTP bodies carry
  the matching plan so callers can't supply irrelevant fields.
- `Engine` deleted. Replaced by `DetectionEngine` and
  `RedactionEngine`, each owning its side's state independently.
  `RedactionEngine::from_detection(&detection)` shares the
  detection engine's registry, runtime config, optional key
  provider, and an in-memory `DetectionState` read-handle for the
  detect→redact handoff.
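One way to picture the shared `PhaseContext` surface is a small trait that both contexts implement, so phase-agnostic helpers can be written once. The method names, the fields on each context, and the `planned_workers` helper are assumptions made for this sketch; only the idea of a shared cancel/concurrency surface comes from the commit message.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical shared surface: cancellation and concurrency limits.
pub trait PhaseContext {
    fn is_cancelled(&self) -> bool;
    fn max_concurrency(&self) -> usize;
}

/// Detection side: carries only what detection phases consume.
pub struct DetectionContext {
    pub cancel: Arc<AtomicBool>,
    pub max_concurrency: usize,
    pub recognizers: Vec<String>,
}

/// Redaction side: carries only what redaction phases consume.
pub struct RedactionContext {
    pub cancel: Arc<AtomicBool>,
    pub max_concurrency: usize,
    pub default_operator: String,
}

impl PhaseContext for DetectionContext {
    fn is_cancelled(&self) -> bool {
        self.cancel.load(Ordering::SeqCst)
    }
    fn max_concurrency(&self) -> usize {
        self.max_concurrency
    }
}

impl PhaseContext for RedactionContext {
    fn is_cancelled(&self) -> bool {
        self.cancel.load(Ordering::SeqCst)
    }
    fn max_concurrency(&self) -> usize {
        self.max_concurrency
    }
}

/// Works for either side: how many workers to spawn for `items` units
/// of work, or zero if the request was cancelled.
pub fn planned_workers<C: PhaseContext>(ctx: &C, items: usize) -> usize {
    if ctx.is_cancelled() {
        0
    } else {
        items.min(ctx.max_concurrency())
    }
}
```

Side-specific resources stay on the concrete structs, so a redaction phase cannot accidentally reach for recognizers and a detection phase cannot reach for operators.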

**Module restructure**

- `pipeline/` module deleted entirely.
- `detection/` and `redaction/` promoted to crate root.
- Configuration split per-side: `core/config.rs` carries
  `RuntimeConfig`, `EngineConfig`, `ResourceLimits`;
  `detection/config/`, `detection/extraction/`, `detection/plan.rs`
  carry detection-side configs; `redaction/config.rs`,
  `redaction/plan.rs` carry redaction-side configs.
- `phases/` split too: detection-side phases (extraction,
  detection, deduplication) move under `detection/phases/`;
  redaction-side phases (redaction, validation) under
  `redaction/phases/`; ingestion helpers move to
  `core/ingestion/`.
- Convenience re-exports (`pub use crate::phases::*`) removed
  from `pipeline/mod.rs`; consumers import from canonical paths.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@martsokha martsokha changed the title Open EntityLabel system + per-request label catalog (closes #252) Open EntityLabel system + per-request catalog + non-generic Policy + engine split (closes #252) Jun 14, 2026
@martsokha

closes #252

@martsokha martsokha merged commit 4713f9e into main Jun 14, 2026
5 checks passed
@martsokha martsokha deleted the refactor/entity-label branch June 14, 2026 00:08
Labels

engine (redaction engine, pipeline runtime, orchestration, configuration); ontology (entities, policies, contexts); refactor (code restructuring without behavior change); server (server, API handlers, middleware)
