feat(ner): schema-driven GLiNER2 NER service (single engine)#28

Open
martsokha wants to merge 6 commits into main from spike/ner-multi-model

@martsokha martsokha commented Jun 12, 2026

Reworks nvisy-ner into a self-hosted, schema-driven information-extraction service backed by a single SOTA engine — GLiNER2. The earlier commits on this branch explored a multi-engine design (GLiNER v1 + GLiNER2 + token-classification); the branch converges on GLiNER2 alone, because it tops span-level F1 for self-hosted PII (arxiv 2605.09973) and its schema interface subsumes what the other engines did separately.

History is kept intentionally — the multi-engine commits show how the design was reached. The final commit (feat(ner): rework as schema-driven GLiNER2 service) is the end state being proposed.

Contract (nvisy_core/ner/v1.py) — schema-driven rewrite

  • Request: text + schema + threshold. The schema composes three optional, verified-against-the-real-model groups:
    • entities — zero-shot spans (per-label description steers the model)
    • classifications — single- or multi-label text classification
    • structures — named records of fields (each field: dtype, choices, description, regex pattern)
  • Response: entities, classifications, structures, modelId. Labels stay model-native; the runtime consumer owns taxonomy mapping.
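To make the request shape concrete, here is a minimal sketch of the schema-driven contract using plain dataclasses. Every class and field name below is a hypothetical stand-in chosen for illustration, and the default threshold is an assumption; the actual definitions live in nvisy_core/ner/v1.py and may differ.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical names throughout; the real contract is nvisy_core/ner/v1.py.


@dataclass
class EntitySpec:
    label: str
    description: Optional[str] = None  # per-label text that steers the model


@dataclass
class ClassificationSpec:
    name: str
    labels: list[str]
    multi_label: bool = False  # single-label unless stated otherwise


@dataclass
class FieldSpec:
    name: str
    dtype: str = "str"
    choices: Optional[list[str]] = None
    description: Optional[str] = None
    pattern: Optional[str] = None  # regex the extracted value must match


@dataclass
class StructureSpec:
    name: str
    fields: list[FieldSpec]


@dataclass
class Schema:
    # All three groups are optional and can be combined in one request.
    entities: list[EntitySpec] = field(default_factory=list)
    classifications: list[ClassificationSpec] = field(default_factory=list)
    structures: list[StructureSpec] = field(default_factory=list)

    def is_empty(self) -> bool:
        return not (self.entities or self.classifications or self.structures)


@dataclass
class NerRequest:
    text: str
    schema: Schema
    threshold: float = 0.5  # assumed default, not taken from the PR


request = NerRequest(
    text="Contact Ada at ada@example.com",
    schema=Schema(entities=[EntitySpec("email", "an email address")]),
)
```

A request with an empty schema asks for nothing, so a service built on this shape would probably want to reject it up front.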

Service (nvisy-ner) — single model, self-hosted

  • One GLiNER2 model from NVISY_NER_MODEL (default fastino/gliner2-privacy-filter-PII-multi). No whitelist, no per-request model, no engine selection.
  • Collapsed registry.py + backends/ into config.py + engine.py. engine.py translates the wire schema to gliner2.Schema, runs batch_extract, projects the result (confidence -> score).
  • Dropped gliner + transformers direct deps (gliner2 keeps them transitively).

Security (the self-hosting differentiator)

  • Offline: HF_HUB_OFFLINE / TRANSFORMERS_OFFLINE baked into the service — verified the model loads + infers fully offline.
  • Reject, don't truncate: the encoder caps at 512 tokens and silently truncates above it (verified) — which would drop PII in the tail. Over-length input is rejected (NVISY_NER_MAX_TOKENS).
  • No payload logging; a regression test asserts the hosted GLiNER2API/from_api path is never referenced.
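The reject-don't-truncate rule can be sketched as a length guard that runs before inference. This is an illustration only: `InputTooLongError` and `check_length` are hypothetical names, and the tokenizer is passed in as a plain callable so the example does not assume any particular gliner2 or transformers API. The offline variables are set before any Hugging Face import, which is when those libraries read them.

```python
import os
from typing import Callable, Sequence

# Set before importing any Hugging Face library so no network call is attempted.
os.environ.setdefault("HF_HUB_OFFLINE", "1")
os.environ.setdefault("TRANSFORMERS_OFFLINE", "1")

MAX_TOKENS = 512  # encoder cap stated in this PR


class InputTooLongError(ValueError):
    """Hypothetical error type; the service maps over-length input to a rejection."""


def check_length(
    text: str,
    tokenize: Callable[[str], Sequence],
    max_tokens: int = MAX_TOKENS,
) -> int:
    """Return the token count, or raise instead of letting the model truncate."""
    count = len(tokenize(text))
    if count > max_tokens:
        raise InputTooLongError(f"input is {count} tokens; limit is {max_tokens}")
    return count


# Whitespace splitting stands in for the real tokenizer in this sketch.
token_count = check_length("Ada lives in London", str.split)
```

Counting with the model's own tokenizer matters in practice, since a whitespace count will usually underestimate subword tokens.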

Verification

  • 51 default tests pass (faked — contract, config, schema translation, projection, over-length, schema-grouping, security regression).
  • 3 real-marked tests pass against the actual GLiNER2 weights (entities + classification + structured record, offline load+infer, over-length). Gated to the opt-in real-models CI job.
  • Full service boots and serves offline end-to-end against the real model.
  • Lint, format, OpenAPI --check, requirements --check all clean.

Every gliner2 API shape was verified against fastino/gliner2-privacy-filter-PII-multi, not assumed from docs (the docs were wrong several times).

Flags for review

  • Breaking ner.v1 rewrite, no migration shim (pre-release).
  • The runtime-side consumer must adopt the schema-driven request + structured response — coordinated change, not in this PR.
  • Branch name (spike/ner-multi-model) is now a misnomer; left as-is to preserve PR continuity.

🤖 Generated with Claude Code

Rewrite the NER service as multi-model: load a configurable whitelist of
Hugging Face backends at startup and let each request pick one by id. Spans
come back with each model's own native labels and the modelId that produced
them; mapping labels onto a canonical vocabulary moves to the consumer (the
runtime owns that map), keeping this service engine-native.

Contract (ner.v1):
- NerRequest: drop `kinds`, add optional `model` (HF id; None = default).
- Entity.label is the model-native string, not an EntityKind; classProbs
  keyed by native labels.
- NerResponse.modelId reports the model that ran.

Service (nvisy-ner):
- registry.py wraps GLiNER and HF token-classification pipelines behind one
  Backend interface; dispatch by id prefix.
- Preload NVISY_NER_MODELS (config, not code); 400 on an unloaded id;
  NVISY_NER_DEFAULT_MODEL (or first listed) serves requests omitting `model`.
- Per-model + per-param batched dispatch within a BentoML batch.

Remove the taxonomy from this repo: delete nvisy_core.entity (EntityKind /
EntityCategory) and nvisy_ner.label_map. Add transformers as a direct dep;
regenerate requirements.txt and docs/openapi/ner.json; refresh the NER README,
docs, and design doc (gliner.md -> ner.md).

Breaking change: ner.v1 wire contract changes with no migration shim. The
coordinated runtime change (consume `model`, own the label map) is tracked
separately in nvisy-runtime.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@martsokha martsokha self-assigned this Jun 12, 2026
@martsokha martsokha added the labels refactor (code restructuring without behavior change) and ner (NLP inference service: GLiNER, NER, etc.) Jun 12, 2026
martsokha and others added 5 commits June 13, 2026 05:40
The Audit job flagged three pre-existing CVEs in the workspace lock:

- aiohttp 3.13.5 (CVE-2026-34993, CVE-2026-47265): fixed in 3.14.0. aiohttp
  is transitive via bentoml and not declared directly, so a lock edit alone
  would be reverted on the next resolve. Add a workspace-level
  `constraint-dependencies` floor (>=3.14.0) — uv's mechanism for pinning a
  transitive dep without making it a direct dependency — so the fix sticks.
- torch 2.12.0 (CVE-2025-3000): torch.jit.script memory corruption, CVSS 5.3,
  local; no patched release exists, so no bump resolves it. Ignore it
  explicitly in the audit step with a comment to drop the flag once upstream
  ships a fix.

Relock + regenerate per-service requirements.txt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Three bugs found by checking the registry against the real libraries; all
were masked by fakes that matched the wrong assumptions.

GLiNER is zero-shot and has NO built-in label list — labels must be supplied
at call time. The previous code read a nonexistent `config.labels`, fell back
to `[]`, and so silently found zero entities (the default backend was a
no-op). Fix: add an optional `labels` field to NerRequest; GLiNER requires it
(zero-shot models without labels are rejected with 400 via MissingLabelsError
-> InvalidArgument), token-classification models carry their own labels and
ignore it. Also switch GLiNER to its batched `inference()` entry point
(predict_entities/batch_predict_entities are the per-item / deprecated paths)
and read its real span keys (text/label/score/start/end).

HF token-classification pipeline (verified against transformers 5.1):
- start/end come back as None when the tokenizer has no offset mapping;
  int(None) would crash. Drop offset-less spans instead of emitting broken
  entities.
- the span label key is "entity_group" under aggregation but "entity" with
  strategy "none"; read entity_group with an "entity" fallback.
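A minimal sketch of the two parser fixes above, assuming the pipeline returns one dict per span with `score`, `word`, `start`, `end`, and either `entity_group` or `entity`. The function name `parse_span` and the output keys are hypothetical; the score filter reflects the post-inference threshold described for token classification.

```python
from typing import Any, Optional


def parse_span(raw: dict[str, Any], threshold: float) -> Optional[dict[str, Any]]:
    """Convert one raw pipeline span to an entity dict, or None if unusable."""
    start, end = raw.get("start"), raw.get("end")
    # Without an offset mapping the pipeline reports None; int(None) would raise.
    if start is None or end is None:
        return None

    # "entity_group" under aggregation, "entity" when aggregation is off.
    label = raw.get("entity_group", raw.get("entity"))
    if label is None:
        return None

    score = float(raw["score"])
    if score < threshold:
        return None

    return {
        "label": label,
        "score": score,
        "start": int(start),
        "end": int(end),
        "text": raw.get("word", ""),
    }
```

Returning `None` keeps the caller simple: it can filter the parsed list rather than handle exceptions per span.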

Dispatch now groups by (labels, threshold, return_class_probs) so each backend
call shares a uniform label set. Tests rewritten to mirror the real APIs, plus
direct parser unit tests for the None-offset / fallback-key / threshold paths.
Regenerate ner.json; refresh README + design doc to state that GLiNER labels
come from the request.
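The grouping step can be sketched as follows. This is an illustration under assumptions, not the code in this branch: `Request`, `dispatch`, and the `run_batch` callable are hypothetical names, and labels are modelled as a tuple so the group key is hashable.

```python
from collections import defaultdict
from typing import Callable, NamedTuple, Optional


class Request(NamedTuple):
    text: str
    labels: Optional[tuple[str, ...]]  # tuple, so it can be part of a dict key
    threshold: float
    return_class_probs: bool


# Hypothetical backend entry point: one call per uniform parameter set.
BatchFn = Callable[[list[str], Optional[tuple[str, ...]], float, bool], list]


def dispatch(requests: list[Request], run_batch: BatchFn) -> list:
    """Run one backend call per (labels, threshold, return_class_probs) group."""
    groups: dict[tuple, list[int]] = defaultdict(list)
    for index, req in enumerate(requests):
        groups[(req.labels, req.threshold, req.return_class_probs)].append(index)

    # Scatter outputs back so results line up with the incoming batch order.
    results: list = [None] * len(requests)
    for (labels, threshold, probs), indices in groups.items():
        texts = [requests[i].text for i in indices]
        outputs = run_batch(texts, labels, threshold, probs)
        for i, out in zip(indices, outputs):
            results[i] = out
    return results
```

Tracking original indices is what lets requests with different label sets share one incoming batch without their results being reordered.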

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Close out the remaining review corners:

- Startup: wrap each backend load so a misconfigured whitelist fails liveness
  with an error naming the offending model id, instead of half-loading and
  500-ing on the first request (#7).
- threshold: document in the contract that it is the model's native confidence
  threshold for GLiNER but a post-inference score filter for token
  classification — same cutoff, different interaction with span selection (#6).
- Resources: README note that every whitelisted model loads into each worker,
  so cpu/memory and batch size must be tuned to the list (#8).
- Real-model coverage (#5): the suite fakes models by repo convention (no CI
  downloads). Add a manual smoke script that runs the real GLiNER and
  token-classification backends end to end, and verified locally that both
  produce correct native-labelled spans with valid offsets — confirming the
  fakes match the real libraries. README documents how to run it.
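The startup fix in the first bullet might look roughly like this. `ModelLoadError` and `load_all` are hypothetical names, and the loader is passed in as a callable so the sketch does not depend on any real backend API.

```python
from typing import Any, Callable, Iterable


class ModelLoadError(RuntimeError):
    """Hypothetical startup error that names the model id that failed."""


def load_all(model_ids: Iterable[str], loader: Callable[[str], Any]) -> dict[str, Any]:
    """Load every whitelisted model, failing fast on the first bad id."""
    loaded: dict[str, Any] = {}
    for model_id in model_ids:
        try:
            loaded[model_id] = loader(model_id)
        except Exception as exc:
            # Re-raise with the id so the liveness failure is actionable.
            raise ModelLoadError(
                f"failed to load NER model {model_id!r}: {exc}"
            ) from exc
    return loaded
```

Raising at startup means the worker never reports healthy with a partial registry, which is the half-loaded state the bullet describes.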

The token-cls batched return shape (#4) was already verified against the
transformers source during the previous fix; the list-normalisation path is
unit-tested.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Faked tests can't catch a fake that encodes the wrong library API — exactly
the GLiNER bug that shipped green. Add real-model tests that close that gap
without taxing every PR:

- New tests/test_real_models.py (marked `real`): load the actual GLiNER and
  token-classification models and assert native-labelled spans with valid
  offsets. Replaces the standalone smoke script.
- Register the `real` marker and exclude it by default (addopts `-m "not real"`)
  in both the workspace root and the nvisy-ner package config — pytest uses the
  nearest config, so it must be declared in both. Default suite stays fast and
  fully faked (53 passed, 2 deselected, <1s).
- New `real-models` CI job runs `pytest -m real`, gated to schedule (weekly) +
  workflow_dispatch only — never on PRs/pushes, since it downloads weights and
  runs inference.

Verified locally: `pytest -m real` -> 2 passed against real weights.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the multi-engine NER service with a single, SOTA engine — GLiNER2 —
and a schema-driven contract. GLiNER2 tops span-level F1 for self-hosted PII
extraction (arxiv 2605.09973) and its schema interface (entities +
classification + structured records) subsumes what zero-shot NER and
fixed-taxonomy token classifiers did separately, so one engine serves the whole
contract.

Contract (nvisy_core/ner/v1.py) — clean rewrite, schema-driven:
- Request: text + Schema(entities | classifications | structures) + threshold.
- Response: entities, classifications (single or multi-label), structures
  (named records of field -> spans), modelId.
- Labels stay model-native; the consumer owns taxonomy mapping.

Service (nvisy-ner) — single model, self-hosted:
- One GLiNER2 model from NVISY_NER_MODEL (default
  fastino/gliner2-privacy-filter-PII-multi). No whitelist, no per-request model.
- engine.py translates the wire schema to gliner2.Schema (incl. per-field regex
  validators), runs batch_extract, projects the result (confidence -> score).
- Collapse registry.py + backends/ into config.py + engine.py.
- Drop gliner + transformers direct deps (gliner2 keeps them transitively).

Security (the self-hosting differentiator):
- HF_HUB_OFFLINE / TRANSFORMERS_OFFLINE baked into the service; verified the
  model loads + infers fully offline.
- Reject inputs over the 512-token limit rather than letting the model silently
  truncate (and miss PII in the tail).
- No payload logging; regression test asserts the hosted GLiNER2API/from_api
  path is never referenced.
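One way to write the regression check in the last bullet is a plain text scan over the service sources. This is a sketch, not the test in this branch: the helper names are hypothetical, and the scan is a simple substring match rather than an AST walk.

```python
from pathlib import Path

# Identifiers named in this PR as the hosted path that must never be used.
FORBIDDEN = ("GLiNER2API", "from_api")


def scan_source(source: str, forbidden: tuple[str, ...] = FORBIDDEN) -> list[tuple[int, str]]:
    """Return (line number, token) for each forbidden token found in the source."""
    return [
        (lineno, token)
        for lineno, line in enumerate(source.splitlines(), start=1)
        for token in forbidden
        if token in line
    ]


def scan_tree(root: Path) -> list[tuple[str, int, str]]:
    """Scan every .py file under root; an empty list means the tree is clean."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        for lineno, token in scan_source(path.read_text(encoding="utf-8")):
            hits.append((str(path.relative_to(root)), lineno, token))
    return hits
```

A test built on this would point `scan_tree` at the service package only, since the test file itself necessarily contains the forbidden strings.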

Every gliner2 API shape (batch_extract, the result dict for all three task
groups, the tokenizer, offline load, quantize) was verified against the real
model, not assumed from docs. Tests: faked default suite + real-marked engine
tests; README + docs/design/ner.md rewritten. OpenAPI + requirements regenerated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@martsokha martsokha changed the title feat(ner): per-request model selection over multiple backends feat(ner): schema-driven GLiNER2 NER service (single engine) Jun 13, 2026