Open EntityLabel system + per-request catalog + non-generic Policy + engine split (closes #252) #278
Merged
…system

Resolves #252. Switches the workspace from a fixed `EntityKind` enum to an open-vocabulary label system so deployments can mint custom entity labels without touching workspace code.

Shape:

- `EntityLabel` carries name + optional description + free-form tags.
- `EntityLabelRef` is a `HipStr`-wrapped name-only handle stored on every detected `Entity` (cheap clone, hot-path type).
- `EntityLabelCatalog` is a runtime-constructed name-indexed lookup; the workspace ships 66 built-in labels in `entity::builtins`, each reachable as a constant (`builtins::PERSON_NAME`, …) and bundled into `EntityLabelCatalog::with_builtins()`.
- `EntitySelector` matches by `labels: Vec<EntityLabelRef>` plus `tags: Vec<HipStr>` (dereferenced via the catalog).

Deletes `EntityCategory` entirely; category groupings (`pii`, `phi`, `pci`, `personal_identity`, `financial`, …) survive as tags on each built-in label so selectors keep the same expressive power without an enum.

Updates every recognizer (pattern, NER, LLM), nvisy-fake's per-label generator dispatch, nvisy-toolkit dedup + redaction registries, all asset TOMLs (`entity_kind` → `label`), and workspace-wide tests.
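The label/ref/catalog/selector shape described in this commit can be sketched as follows. This is a minimal illustration, not the workspace's code: `Arc<str>` and `String` stand in for `HipStr`, the `insert`/`get` methods and the `demo` function are hypothetical, and the matching rule (a direct label hit OR any shared tag) is an assumption, since the commit message does not say how labels and tags combine.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Full label definition: name + optional description + free-form tags.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct EntityLabel {
    pub name: String,
    pub description: Option<String>,
    pub tags: Vec<String>,
}

/// Name-only handle stored on each entity. `Arc<str>` is a stand-in for `HipStr`.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct EntityLabelRef(Arc<str>);

impl EntityLabelRef {
    pub fn new(name: &str) -> Self {
        Self(Arc::from(name))
    }
    pub fn name(&self) -> &str {
        &self.0
    }
}

/// Runtime-constructed, name-indexed lookup (method names are hypothetical).
#[derive(Debug, Default)]
pub struct EntityLabelCatalog {
    by_name: HashMap<String, EntityLabel>,
}

impl EntityLabelCatalog {
    pub fn insert(&mut self, label: EntityLabel) {
        self.by_name.insert(label.name.clone(), label);
    }
    pub fn get(&self, label: &EntityLabelRef) -> Option<&EntityLabel> {
        self.by_name.get(label.name())
    }
}

pub struct Entity {
    pub label: EntityLabelRef,
}

#[derive(Default)]
pub struct EntitySelector {
    pub labels: Vec<EntityLabelRef>,
    pub tags: Vec<String>,
}

impl EntitySelector {
    /// Assumed rule: match when the entity's label is listed directly, or when
    /// the catalog entry for that label carries at least one selector tag.
    pub fn matches(&self, entity: &Entity, catalog: &EntityLabelCatalog) -> bool {
        if self.labels.contains(&entity.label) {
            return true;
        }
        catalog
            .get(&entity.label)
            .map_or(false, |def| self.tags.iter().any(|t| def.tags.contains(t)))
    }
}

/// Returns (matches a `pii` tag selector, matches a `pci` tag selector)
/// for a custom label that is tagged `pii` only.
pub fn demo() -> (bool, bool) {
    let mut catalog = EntityLabelCatalog::default();
    catalog.insert(EntityLabel {
        name: "employee_badge_id".to_string(),
        description: None,
        tags: vec!["pii".to_string()],
    });
    let entity = Entity { label: EntityLabelRef::new("employee_badge_id") };
    let pii = EntitySelector { labels: vec![], tags: vec!["pii".to_string()] };
    let pci = EntitySelector { labels: vec![], tags: vec!["pci".to_string()] };
    (pii.matches(&entity, &catalog), pci.matches(&entity, &catalog))
}
```

Keeping only the name on the entity and resolving tags through the catalog is what lets category groupings live as data rather than as an enum.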
Labels are now per-request rather than per-deployment. Each submitted `Policy<M>` carries the labels it operates over; the engine unions them into an `EntityLabelCatalog` driving recognizer dispatch and selector tag matching for that request.

Engine flow:

- New `Policy::labels: Vec<EntityLabel>` field. `unify_labels` unions every policy's labels with conflict detection (same name with non-equal description/tags → HTTP 400).
- `validate_selector_labels` ensures every selector label name exists in the unioned catalog.
- `DetectionConfig::build_for_request(catalog)` builds a fresh `RecognizerRegistry` per request: patterns/dictionaries are filtered against the catalog (entries with unregistered labels never run), NER's zero-shot label list is sourced from the catalog.
- `EntitySelector::matches(entity, catalog)` dereferences tags against the request catalog. The `BUILTIN_CATALOG: LazyLock` in selector.rs is gone.

Deletions:

- Engine-startup `Arc<RecognizerRegistry>` and `DetectionEngineState.recognizer_registry`: replaced by an `Arc<DetectionConfig>` template.
- `Detection` plan node + `cfg.labels.contains(...)` post-filter in `detection.rs`: superseded by registry-construction-time filtering.
- `default_text_labels()`: no more hardcoded label allowlist.

Plus a workspace-wide style sweep: inline `tracing::*` macros and attributes (`#[tracing::instrument(...)]`, `tracing::Level::INFO`) and convert remaining collapsed-form rustdoc links `[X][path]` to the reference form `[X]` + `[X]: path` at the bottom of doc blocks.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
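The union-with-conflict-detection and selector validation steps above could look roughly like this. It is a sketch under stated assumptions: the functions are free functions rather than methods, `Policy::selector_labels` is a hypothetical flattening of the rule selectors, `CatalogError` is an invented error type (the real code maps these cases to HTTP 400), and a `BTreeMap` stands in for `EntityLabelCatalog`.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct EntityLabel {
    pub name: String,
    pub description: Option<String>,
    pub tags: Vec<String>,
}

/// Hypothetical, simplified policy: declared labels plus the label names
/// its selectors reference.
pub struct Policy {
    pub labels: Vec<EntityLabel>,
    pub selector_labels: Vec<String>,
}

/// Hypothetical error type; both variants would surface as HTTP 400.
#[derive(Debug, PartialEq, Eq)]
pub enum CatalogError {
    Conflict { name: String },
    UnknownSelectorLabel { name: String },
}

/// Union every policy's labels. Re-declaring an identical label is fine;
/// the same name with a different description or tag list is a conflict.
/// Assumption: tag lists are compared in order (derived `PartialEq`).
pub fn unify_labels(policies: &[Policy]) -> Result<BTreeMap<String, EntityLabel>, CatalogError> {
    let mut catalog: BTreeMap<String, EntityLabel> = BTreeMap::new();
    for policy in policies {
        for label in &policy.labels {
            if let Some(existing) = catalog.get(&label.name) {
                if existing != label {
                    return Err(CatalogError::Conflict { name: label.name.clone() });
                }
            } else {
                catalog.insert(label.name.clone(), label.clone());
            }
        }
    }
    Ok(catalog)
}

/// Reject selectors that name a label absent from the unioned catalog.
pub fn validate_selector_labels(
    policies: &[Policy],
    catalog: &BTreeMap<String, EntityLabel>,
) -> Result<(), CatalogError> {
    for policy in policies {
        for name in &policy.selector_labels {
            if !catalog.contains_key(name) {
                return Err(CatalogError::UnknownSelectorLabel { name: name.clone() });
            }
        }
    }
    Ok(())
}

/// Small constructor used only to keep examples short.
pub fn label(name: &str, tags: &[&str]) -> EntityLabel {
    EntityLabel {
        name: name.to_string(),
        description: None,
        tags: tags.iter().map(|t| t.to_string()).collect(),
    }
}
```

Running validation against the unioned catalog, rather than per policy, means a selector in one policy may target a label declared by another policy in the same request.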
…n, restructure modules

Major restructure of the engine crate driven by three threads of work:

**Policy + Action shape**

- `Policy` and `PolicyRule` are no longer generic on modality.
- `Action::Redact(ModalityRedactions)` carries per-modality operator specs in one rule, so a single rule covers text + image + audio without duplication. `Action::Suppress(SuppressAction)` carries an optional reason. `Action::Audit(AuditAction)` is a new variant that flags entities for review without redacting them (the detection pass already produces audit entries; this tags them with a severity hint).
- Audit-side `Decision<M>.action` becomes `ResolvedAction<M>`, the per-entity resolved outcome, distinct from the policy-author `Action`. Naming collision resolved.
- New `ProjectRedaction` trait does per-modality projection of `ModalityRedactions::operator_for::<M>()`. `RedactionConfig` gains a `default_operators` field for deployment-wide fallback when a rule doesn't cover the entity's modality.

**Per-pass context + plan + engine split**

- `RunContext` deleted. Replaced by `DetectionContext` + `RedactionContext`, each carrying only the engine resources its side actually consumes. Shared surface (shared/cancel/concurrency) sits on a new `PhaseContext` trait.
- `Plan` deleted. Replaced by `DetectionPlan` (extraction + deduplication knobs) + `RedactionPlan` (redaction + validation knobs). `NewDetection` and `NewRedaction` HTTP bodies carry the matching plan so callers can't supply irrelevant fields.
- `Engine` deleted. Replaced by `DetectionEngine` and `RedactionEngine`, each owning its side's state independently. `RedactionEngine::from_detection(&detection)` shares the detection engine's registry, runtime config, optional key provider, and an in-memory `DetectionState` read-handle for the detect→redact handoff.

**Module restructure**

- `pipeline/` module deleted entirely.
- `detection/` and `redaction/` promoted to crate root.
- Configuration split per-side: `core/config.rs` carries `RuntimeConfig`, `EngineConfig`, `ResourceLimits`; `detection/config/`, `detection/extraction/`, `detection/plan.rs` carry detection-side configs; `redaction/config.rs`, `redaction/plan.rs` carry redaction-side configs.
- `phases/` split too: detection-side phases (extraction, detection, deduplication) move under `detection/phases/`; redaction-side phases (redaction, validation) under `redaction/phases/`; ingestion helpers move to `core/ingestion/`.
- Convenience re-exports (`pub use crate::phases::*`) removed from `pipeline/mod.rs`; consumers import from canonical paths.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
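To illustrate the non-generic `Action` with per-modality projection and a deployment-wide fallback, here is a rough sketch. Only the names `Action`, `ModalityRedactions`, `SuppressAction`, `AuditAction`, `ResolvedAction` and `operator_for::<M>()` come from the commit message; `OperatorSpec`, the `Modality` trait with its marker types, the field layouts, the `resolve` function and the `Unhandled` outcome are assumptions made for illustration (the real `ResolvedAction<M>` is generic, and the real projection goes through the `ProjectRedaction` trait).

```rust
/// Hypothetical operator description, e.g. "mask" or "blur".
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OperatorSpec(pub String);

/// One rule can carry an operator per modality (field layout assumed).
#[derive(Debug, Clone, Default)]
pub struct ModalityRedactions {
    pub text: Option<OperatorSpec>,
    pub image: Option<OperatorSpec>,
    pub audio: Option<OperatorSpec>,
}

/// Hypothetical marker trait selecting the slot that belongs to a modality.
pub trait Modality {
    fn pick(set: &ModalityRedactions) -> Option<&OperatorSpec>;
}

pub struct Text;
pub struct Image;
pub struct Audio;

impl Modality for Text {
    fn pick(set: &ModalityRedactions) -> Option<&OperatorSpec> {
        set.text.as_ref()
    }
}
impl Modality for Image {
    fn pick(set: &ModalityRedactions) -> Option<&OperatorSpec> {
        set.image.as_ref()
    }
}
impl Modality for Audio {
    fn pick(set: &ModalityRedactions) -> Option<&OperatorSpec> {
        set.audio.as_ref()
    }
}

impl ModalityRedactions {
    pub fn operator_for<M: Modality>(&self) -> Option<&OperatorSpec> {
        M::pick(self)
    }
}

pub struct SuppressAction {
    pub reason: Option<String>,
}

pub struct AuditAction {
    pub severity: Option<String>,
}

/// Policy-author action, no longer generic on modality.
pub enum Action {
    Redact(ModalityRedactions),
    Suppress(SuppressAction),
    Audit(AuditAction),
}

/// Per-entity outcome (simplified to a non-generic enum for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub enum ResolvedAction {
    Redact(OperatorSpec),
    Suppress,
    Audit,
    /// Assumed outcome when neither the rule nor the defaults cover the modality.
    Unhandled,
}

/// Project a rule onto modality `M`, falling back to deployment-wide defaults.
pub fn resolve<M: Modality>(action: &Action, defaults: &ModalityRedactions) -> ResolvedAction {
    match action {
        Action::Redact(ops) => {
            match ops.operator_for::<M>().or_else(|| defaults.operator_for::<M>()) {
                Some(op) => ResolvedAction::Redact(op.clone()),
                None => ResolvedAction::Unhandled,
            }
        }
        Action::Suppress(_) => ResolvedAction::Suppress,
        Action::Audit(_) => ResolvedAction::Audit,
    }
}
```

With a rule that sets only a text operator and defaults that set only an image operator, `resolve::<Text>` picks the rule's operator and `resolve::<Image>` picks the default, which is the "single rule covers several modalities" behaviour the commit describes.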
closes #252
Summary

Three-commit branch that replaces the closed `EntityKind` enum with an open-vocabulary label system, wires labels per-request through policy catalogs, then restructures the engine into independent detection/redaction halves.

**Commit 1: open EntityLabel system (#252)**

- `EntityLabel` (name + optional description + free-form tags) replaces the closed `EntityKind` enum.
- `EntityLabelRef` is the `HipStr`-wrapped name-only handle stored on every detected `Entity`.
- `EntityLabelCatalog` = runtime-constructed name-indexed lookup. The workspace ships 66 built-in labels in `entity::builtins` as a toolkit/SDK API; `EntityLabelCatalog::with_builtins()` is the convenience constructor.
- `EntitySelector` matches by `labels: Vec<EntityLabelRef>` + `tags: Vec<HipStr>`.
- `EntityCategory` deleted; category groupings survive as tags on built-in labels.
- Updates `nvisy-fake`'s per-label generator, `nvisy-toolkit` dedup + redaction registries, all asset TOMLs, and workspace tests.

**Commit 2: per-request catalog supplied via policies**

- `Policy.labels: Vec<EntityLabel>` field. Each submitted policy declares the labels its rules operate over.
- `DetectionInput::unify_labels()` unions every policy's labels into a per-request `EntityLabelCatalog`. Conflicts (same name with non-equal `(description, tags)`) → HTTP 400.
- `DetectionInput::validate_selector_labels()` rejects requests where a selector targets a label no policy declares.
- `DetectionConfig::build_for_request(catalog)` builds a fresh `RecognizerRegistry` per request: patterns/dictionaries filtered against the catalog, NER's zero-shot label list sourced from the catalog.
- `EntitySelector::matches(entity, catalog)` dereferences tags against the request catalog. `BUILTIN_CATALOG: LazyLock` deleted.
- `Arc<RecognizerRegistry>` and the `Detection.labels` post-filter both gone; pattern/NER filtering happens at registry construction.

**Commit 3: non-generic Policy + engine split + module restructure**

Policy + Action shape:

- `Policy` and `PolicyRule` non-generic.
- `Action::Redact(ModalityRedactions)` carries per-modality operator specs in one rule; `Action::Suppress(SuppressAction)` + `Action::Audit(AuditAction)` for the other verbs.
- `Decision<M>.action` is `ResolvedAction<M>` (per-entity resolved outcome, distinct from policy-author `Action`).
- `ProjectRedaction` trait does per-modality projection of `ModalityRedactions::operator_for::<M>()`. `RedactionConfig.default_operators` is the deployment-wide fallback.

Engine + context + plan split:

- `RunContext` → `DetectionContext` + `RedactionContext` with shared `PhaseContext` trait.
- `Plan` → `DetectionPlan` + `RedactionPlan`; HTTP bodies carry the matching plan.
- `Engine` → `DetectionEngine` + `RedactionEngine`. `RedactionEngine::from_detection(&detection)` shares the registry, runtime config, optional key provider, and an in-memory `DetectionState` read-handle for the detect→redact handoff.

Module restructure:

- `pipeline/` deleted. `detection/` and `redaction/` promoted to crate root.
- `core/config.rs` (`RuntimeConfig` + `EngineConfig` + `ResourceLimits`); `detection/{config,extraction,plan}`; `redaction/{config,plan}`.
- `phases/` split: detection-side phases → `detection/phases/`, redaction-side phases → `redaction/phases/`, ingestion helpers → `core/ingestion/`.

Test plan

- `cargo build --workspace --all-features`: green
- `cargo test --workspace --all-features`: green (0 failures)
- `cargo clippy --workspace --all-features --all-targets -- -D warnings`: green
- `RUSTDOCFLAGS="-D warnings" cargo doc --workspace --all-features --no-deps`: green

🤖 Generated with Claude Code