feat(ner): schema-driven GLiNER2 NER service (single engine) #28
Open
martsokha wants to merge 6 commits into
Rewrite the NER service as multi-model: load a configurable whitelist of Hugging Face backends at startup and let each request pick one by id. Spans come back with each model's own native labels and the modelId that produced them; mapping labels onto a canonical vocabulary moves to the consumer (the runtime owns that map), keeping this service engine-native.

Contract (ner.v1):
- NerRequest: drop `kinds`, add optional `model` (HF id; None = default).
- Entity.label is the model-native string, not an EntityKind; classProbs keyed by native labels.
- NerResponse.modelId reports the model that ran.

Service (nvisy-ner):
- registry.py wraps GLiNER and HF token-classification pipelines behind one Backend interface; dispatch by id prefix.
- Preload NVISY_NER_MODELS (config, not code); 400 on an unloaded id; NVISY_NER_DEFAULT_MODEL (or first listed) serves requests omitting `model`.
- Per-model + per-param batched dispatch within a BentoML batch.

Remove the taxonomy from this repo: delete nvisy_core.entity (EntityKind / EntityCategory) and nvisy_ner.label_map. Add transformers as a direct dep; regenerate requirements.txt and docs/openapi/ner.json; refresh the NER README, docs, and design doc (gliner.md -> ner.md).

Breaking change: ner.v1 wire contract changes with no migration shim. The coordinated runtime change (consume `model`, own the label map) is tracked separately in nvisy-runtime.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
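The "per-model + per-param batched dispatch" in this commit can be pictured with a small sketch. It is not the code in registry.py: `Req`, `group_requests`, `dispatch`, and the `run_batch` callback are hypothetical names, and the grouping key (model id, threshold) is an assumption about which parameters matter.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Req:
    """Hypothetical stand-in for NerRequest (only the fields used here)."""
    text: str
    model: Optional[str] = None  # None means "use the default model"
    threshold: float = 0.5


def group_requests(requests: List[Req], default_model: str) -> Dict[Tuple[str, float], List[int]]:
    """Map (model id, threshold) to the indices of the requests sharing it."""
    groups: Dict[Tuple[str, float], List[int]] = defaultdict(list)
    for index, req in enumerate(requests):
        groups[(req.model or default_model, req.threshold)].append(index)
    return dict(groups)


def dispatch(
    requests: List[Req],
    default_model: str,
    run_batch: Callable[[str, List[str], float], list],
) -> list:
    """Run one backend call per group, then restore the original request order."""
    results: list = [None] * len(requests)
    for (model_id, threshold), indices in group_requests(requests, default_model).items():
        outputs = run_batch(model_id, [requests[i].text for i in indices], threshold)
        for i, output in zip(indices, outputs):
            results[i] = output
    return results
```

Each backend call then sees a uniform set of parameters, while callers still get results in the order they submitted them.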
The Audit job flagged three pre-existing CVEs in the workspace lock:
- aiohttp 3.13.5 (CVE-2026-34993, CVE-2026-47265): fixed in 3.14.0. aiohttp is transitive via bentoml and not declared directly, so a lock edit alone would be reverted on the next resolve. Add a workspace-level `constraint-dependencies` floor (>=3.14.0), uv's mechanism for pinning a transitive dep without making it a direct dependency, so the fix sticks.
- torch 2.12.0 (CVE-2025-3000): torch.jit.script memory corruption, CVSS 5.3, local; no patched release exists, so no bump resolves it. Ignore it explicitly in the audit step with a comment to drop the flag once upstream ships a fix.

Relock + regenerate per-service requirements.txt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Three bugs found by checking the registry against the real libraries; all were masked by fakes that matched the wrong assumptions.

GLiNER is zero-shot and has NO built-in label list: labels must be supplied at call time. The previous code read a nonexistent `config.labels`, fell back to `[]`, and so silently found zero entities (the default backend was a no-op). Fix: add an optional `labels` field to NerRequest; GLiNER requires it (zero-shot models without labels are rejected with 400 via MissingLabelsError -> InvalidArgument), token-classification models carry their own labels and ignore it. Also switch GLiNER to its batched `inference()` entry point (predict_entities/batch_predict_entities are the per-item / deprecated paths) and read its real span keys (text/label/score/start/end).

HF token-classification pipeline (verified against transformers 5.1):
- start/end come back as None when the tokenizer has no offset mapping; int(None) would crash. Drop offset-less spans instead of emitting broken entities.
- the span label key is "entity_group" under aggregation but "entity" with strategy "none"; read entity_group with an "entity" fallback.

Dispatch now groups by (labels, threshold, return_class_probs) so each backend call shares a uniform label set. Tests rewritten to mirror the real APIs, plus direct parser unit tests for the None-offset / fallback-key / threshold paths. Regenerate ner.json; refresh README + design doc to state that GLiNER labels come from the request.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
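The two token-classification fixes described in this commit (dropping offset-less spans and falling back from "entity_group" to "entity"), together with the post-inference threshold, might look roughly like the sketch below. The function name `parse_token_cls_spans` and the output dict shape are illustrative assumptions, not the actual parser; the input keys follow the span dicts named in the commit message.

```python
from typing import Any, Dict, List


def parse_token_cls_spans(raw_spans: List[Dict[str, Any]], threshold: float) -> List[Dict[str, Any]]:
    """Turn raw pipeline span dicts into entity dicts, skipping unusable spans."""
    entities: List[Dict[str, Any]] = []
    for span in raw_spans:
        start, end = span.get("start"), span.get("end")
        # Without an offset mapping the pipeline reports None offsets;
        # drop the span rather than calling int(None).
        if start is None or end is None:
            continue
        # "entity_group" is present under aggregation, "entity" otherwise.
        label = span.get("entity_group") or span.get("entity")
        if label is None:
            continue
        score = float(span["score"])
        # For token classification the threshold is a post-inference filter.
        if score < threshold:
            continue
        entities.append(
            {
                "label": label,
                "text": span.get("word", ""),
                "start": int(start),
                "end": int(end),
                "score": score,
            }
        )
    return entities
```

A direct unit test can feed this function hand-written span dicts covering the None-offset, fallback-key, and threshold paths without loading any model.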
Close out the remaining review corners:
- Startup: wrap each backend load so a misconfigured whitelist fails liveness with an error naming the offending model id, instead of half-loading and 500-ing on the first request (#7).
- threshold: document in the contract that it is the model's native confidence threshold for GLiNER but a post-inference score filter for token classification: same cutoff, different interaction with span selection (#6).
- Resources: README note that every whitelisted model loads into each worker, so cpu/memory and batch size must be tuned to the list (#8).
- Real-model coverage (#5): the suite fakes models by repo convention (no CI downloads). Add a manual smoke script that runs the real GLiNER and token-classification backends end to end, and verified locally that both produce correct native-labelled spans with valid offsets, confirming the fakes match the real libraries. README documents how to run it.

The token-cls batched return shape (#4) was already verified against the transformers source during the previous fix; the list-normalisation path is unit-tested.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
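The startup change in this commit (fail fast with an error that names the offending model id) could be sketched as follows. `ModelLoadError`, `load_backends`, and the `load_one` callback are hypothetical names used for illustration; the real service's loader and exception types may differ.

```python
from typing import Any, Callable, Dict, Iterable


class ModelLoadError(RuntimeError):
    """Hypothetical startup error that carries the failing model id."""


def load_backends(model_ids: Iterable[str], load_one: Callable[[str], Any]) -> Dict[str, Any]:
    """Load every whitelisted model, aborting on the first failure."""
    backends: Dict[str, Any] = {}
    for model_id in model_ids:
        try:
            backends[model_id] = load_one(model_id)
        except Exception as exc:
            # Re-raise with the model id in the message so the failure is
            # attributable at startup instead of surfacing as a 500 later.
            raise ModelLoadError(f"failed to load NER model {model_id!r}: {exc}") from exc
    return backends
```

Raising during startup means the worker never reports itself live with a partially loaded whitelist.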
Faked tests can't catch a fake that encodes the wrong library API, which is exactly the GLiNER bug that shipped green. Add real-model tests that close that gap without taxing every PR:
- New tests/test_real_models.py (marked `real`): load the actual GLiNER and token-classification models and assert native-labelled spans with valid offsets. Replaces the standalone smoke script.
- Register the `real` marker and exclude it by default (addopts `-m "not real"`) in both the workspace root and the nvisy-ner package config; pytest uses the nearest config, so it must be declared in both. Default suite stays fast and fully faked (53 passed, 2 deselected, <1s).
- New `real-models` CI job runs `pytest -m real`, gated to schedule (weekly) + workflow_dispatch only, never on PRs/pushes, since it downloads weights and runs inference.

Verified locally: `pytest -m real` -> 2 passed against real weights.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the multi-engine NER service with a single, SOTA engine (GLiNER2) and a schema-driven contract. GLiNER2 tops span-level F1 for self-hosted PII extraction (arxiv 2605.09973) and its schema interface (entities + classification + structured records) subsumes what zero-shot NER and fixed-taxonomy token classifiers did separately, so one engine serves the whole contract.

Contract (nvisy_core/ner/v1.py), a clean schema-driven rewrite:
- Request: text + Schema(entities | classifications | structures) + threshold.
- Response: entities, classifications (single or multi-label), structures (named records of field -> spans), modelId.
- Labels stay model-native; the consumer owns taxonomy mapping.

Service (nvisy-ner), single model, self-hosted:
- One GLiNER2 model from NVISY_NER_MODEL (default fastino/gliner2-privacy-filter-PII-multi). No whitelist, no per-request model.
- engine.py translates the wire schema to gliner2.Schema (incl. per-field regex validators), runs batch_extract, projects the result (confidence -> score).
- Collapse registry.py + backends/ into config.py + engine.py.
- Drop gliner + transformers direct deps (gliner2 keeps them transitively).

Security (the self-hosting differentiator):
- HF_HUB_OFFLINE / TRANSFORMERS_OFFLINE baked into the service; verified the model loads + infers fully offline.
- Reject inputs over the 512-token limit rather than letting the model silently truncate (and miss PII in the tail).
- No payload logging; regression test asserts the hosted GLiNER2API/from_api path is never referenced.

Every gliner2 API shape (batch_extract, the result dict for all three task groups, the tokenizer, offline load, quantize) was verified against the real model, not assumed from docs.

Tests: faked default suite + real-marked engine tests; README + docs/design/ner.md rewritten. OpenAPI + requirements regenerated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
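The over-length rejection listed under Security can be illustrated with a minimal sketch. It assumes a `tokenize` callable is available; the real service would count tokens with the model's own tokenizer, while the usage below substitutes whitespace splitting. `InputTooLongError` and `check_length` are hypothetical names.

```python
from typing import Callable, Sequence


class InputTooLongError(ValueError):
    """Hypothetical error for inputs that exceed the token limit."""


def check_length(
    text: str,
    tokenize: Callable[[str], Sequence],
    max_tokens: int = 512,
) -> int:
    """Return the token count, or raise instead of letting the model truncate."""
    n_tokens = len(tokenize(text))
    if n_tokens > max_tokens:
        # The message reports counts only, never the payload itself.
        raise InputTooLongError(f"input has {n_tokens} tokens; limit is {max_tokens}")
    return n_tokens


# Usage with a whitespace tokenizer standing in for the model tokenizer.
token_count = check_length("Ada Lovelace lives in London", str.split, max_tokens=512)
```

Rejecting up front keeps the failure visible to the caller, who can then chunk the text, rather than silently missing PII in a truncated tail.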
Reworks `nvisy-ner` into a self-hosted, schema-driven information-extraction service backed by a single SOTA engine: GLiNER2. The earlier commits on this branch explored a multi-engine design (GLiNER v1 + GLiNER2 + token-classification); the branch converges on GLiNER2 alone, because it tops span-level F1 for self-hosted PII (arxiv 2605.09973) and its schema interface subsumes what the other engines did separately.

Contract (`nvisy_core/ner/v1.py`): schema-driven rewrite
- Request: `text` + `schema` + `threshold`. The schema composes three optional, verified-against-the-real-model groups:
  - `entities`: zero-shot spans (per-label description steers the model)
  - `classifications`: single- or multi-label text classification
  - `structures`: named records of fields (each field: dtype, choices, description, regex `pattern`)
- Response: `entities`, `classifications`, `structures`, `modelId`. Labels stay model-native; the runtime consumer owns taxonomy mapping.

Service (`nvisy-ner`): single model, self-hosted
- One GLiNER2 model from `NVISY_NER_MODEL` (default `fastino/gliner2-privacy-filter-PII-multi`). No whitelist, no per-request model, no engine selection.
- Collapse `registry.py` + `backends/` → `config.py` + `engine.py`.
- `engine.py` translates the wire schema to `gliner2.Schema`, runs `batch_extract`, projects the result (`confidence` → `score`).
- Drop `gliner` + `transformers` direct deps (gliner2 keeps them transitively).

Security (the self-hosting differentiator)
- `HF_HUB_OFFLINE` / `TRANSFORMERS_OFFLINE` baked into the service; verified the model loads + infers fully offline.
- Reject inputs over the token limit rather than letting the model silently truncate (`NVISY_NER_MAX_TOKENS`).
- No payload logging; a regression test asserts the hosted `GLiNER2API` / `from_api` path is never referenced.

Verification
- `real`-marked tests pass against the actual GLiNER2 weights (entities + classification + structured record, offline load+infer, over-length). Gated to the opt-in `real-models` CI job.
- `--check`, requirements `--check` all clean.

Every gliner2 API shape was verified against `fastino/gliner2-privacy-filter-PII-multi`, not assumed from docs (the docs were wrong several times).

Flags for review
- Breaking change: `ner.v1` rewrite, no migration shim (pre-release).
- The branch name (`spike/ner-multi-model`) is now a misnomer; left as-is to preserve PR continuity.

🤖 Generated with Claude Code