v3 collapse closures + tokenizers.Encoding offset API by hallerite · Pull Request #87 · PrimeIntellect-ai/renderers

hallerite · 2026-06-17T23:50:20Z

Context

#95 (remove fastokens) merged into main. With fastokens gone, the offset path collapses to "use the user's tokenizer's backend_tokenizer directly" — no second tokenizer load, no probe-verify, no AutoTokenizer fallback. This PR builds on that with two related refactors.

What this does

1. v3 collapse-or-fallback closures (8 renderers)

emit_text_segments closures in qwen3, qwen35, glm45, glm5, deepseek_v3, nemotron3, laguna_xs2, minimax_m2 get a "collapse adjacent same-label segments, then attribute the rest" pattern:

collapsed: list[tuple[str, bool]] = []
for text, label in segments:
    if not text:
        continue
    if collapsed and collapsed[-1][1] == label:
        collapsed[-1] = (collapsed[-1][0] + text, label)
    else:
        collapsed.append((text, label))
if not collapsed:
    return
if len(collapsed) == 1:
    text, label = collapsed[0]
    emit_text(text, msg_idx, is_sampled=is_sampled, is_content=label)
    return
for tok_id, is_content in attribute_text_segments(self._tokenizer, collapsed):
    tokens.append(tok_id)
    indices.append(msg_idx)
    sampled.append(is_sampled)
    content_mask.append(is_content)

Homogeneous-label runs (most rendering paths after collapse) go through a single emit_text — preserves internal BPE merges, skips the offset attribution path entirely. Only genuinely mixed-label runs hit attribute_text_segments.

2. `attribute_text_segments` rewritten on the `tokenizers.Encoding` API

Previously called tokenizer(text, return_offsets_mapping=True) (transformers dict API). Now uses the Rust library's native Encoding.ids / Encoding.offsets directly:

encoding = offset_tokenizer.encode(full_text, add_special_tokens=False)
token_ids = list(encoding.ids)
offsets = list(encoding.offsets)

This unblocks the future transformers-optional path (issue #31): a BYO tokenizers.Tokenizer (no transformers wrapper) works directly. minimax_m2.emit_token_overlap_body and qwen3_vl._Emitter._flush are updated to the same API.

3. `_get_offset_tokenizer` simplified to two paths

def _get_offset_tokenizer(tokenizer):
    from tokenizers import Tokenizer as RustTokenizer
    if isinstance(tokenizer, RustTokenizer):
        return tokenizer
    backend = getattr(tokenizer, "backend_tokenizer", None)
    if isinstance(backend, RustTokenizer):
        return backend
    raise RuntimeError("…fast tokenizer with a tokenizers.Tokenizer backend…")

Direct tokenizers.Tokenizer (BYO Rust BPE) or extract .backend_tokenizer from a PreTrainedTokenizerFast. No second tokenizer load, no probe-verify, no AutoTokenizer fallback — all of those existed in this PR's pre-rebase version to coordinate with the fastokens shim, which is gone after #95.

4. `tokenizers>=0.20` explicit dep

Already transitive via transformers, but attribute_text_segments imports from tokenizers at module level so we declare it.

Tests

Suite: 2248 passed, 88 skipped, 1 xfailed (baseline parity with Remove fastokens entirely #95).
test_get_offset_tokenizer_rejects_offsetless_byo updated to match the new error message ("fast tokenizer with a tokenizers.Tokenizer backend").

🤖 Generated with Claude Code

Note

Drop `transformers` from the offset encoding path by using `tokenizers.Tokenizer` directly

Replaces the return_offsets_mapping=True tokenizer call in _get_offset_tokenizer (renderers/base.py) with a direct return of a tokenizers.Tokenizer (either passed directly or via backend_tokenizer).
Adds tokenizers>=0.20 as a runtime dependency in pyproject.toml, removing the implicit dependency on transformers for offset-aware encoding.
Updates attribute_text_segments to use tokenizers.Encoding.offsets instead of the HuggingFace fast tokenizer offset mapping API.
Adds segment collapsing (merging adjacent same-label segments, skipping empty segments) and a homogeneous fast-path to emit_text_segments across all renderer implementations.
Behavioral Change: callers must now provide a PreTrainedTokenizerFast (with a Rust backend) or a bare tokenizers.Tokenizer; slow tokenizers without a backend_tokenizer now raise a RuntimeError.

^{Macroscope summarized afa4c9b.}

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit c3c51a7. Configure here.}

macroscopeapp · 2026-06-17T23:51:38Z

Approvability

Verdict: Needs human review

This PR changes core tokenization functionality by switching from HuggingFace's tokenizer API to the underlying Rust tokenizers library directly, adds segment collapsing optimization logic across multiple renderers, and introduces a new dependency. These runtime behavior changes to core functionality warrant human review.

^{You can customize Macroscope's approvability policy. Learn more.}

Two related refactors of the emit_text_segments / attribute_text_segments pipeline: 1. ``emit_text_segments`` closures across 8 hand-coded renderers (qwen3, qwen35, glm45, glm5, deepseek_v3, nemotron3, laguna_xs2, minimax_m2) get a "collapse-or-fallback" pattern: adjacent same-label segments are folded into one ``emit_text`` call (preserves internal BPE merges, skips the offset path); only genuinely mixed-label runs go through ``attribute_text_segments``. Most rendering paths end up homogeneous after collapse, so the offset machinery only runs when it actually has to. 2. ``attribute_text_segments`` is rewritten to use the Rust ``tokenizers.Encoding`` API directly — ``.encode().ids`` / ``.encode().offsets`` — instead of going through ``transformers``'s ``return_offsets_mapping=True`` dict API. This unblocks the future ``transformers``-optional path (issue #31): a BYO ``tokenizers.Tokenizer`` works without any ``transformers`` wrapper. ``_get_offset_tokenizer`` becomes a 2-path resolver (direct Rust tokenizer, or extract ``.backend_tokenizer`` from a ``PreTrainedTokenizerFast``); no second tokenizer load, no probe-verify, no AutoTokenizer fallback — all of those existed in the previous version of this PR to coordinate with the fastokens shim, which is gone after #95. ``minimax_m2.emit_token_overlap_body`` and ``qwen3_vl._Emitter._flush`` are updated to call the new ``Encoding``-based offset API directly. ``tokenizers>=0.20`` becomes an explicit core dependency — it was already a transitive of ``transformers``, but the new ``attribute_text_segments`` imports from ``tokenizers`` at the module level so we declare it. Tests: 2248 passed, 88 skipped, 1 xfailed (baseline parity with #95). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cursor Bot reviewed Jun 17, 2026

View reviewed changes

Comment thread renderers/minimax_m2.py

hallerite mentioned this pull request Jun 26, 2026

Remove fastokens entirely #95

Merged

hallerite force-pushed the drop-emit-text-segments branch 2 times, most recently from 7878261 to bf84a66 Compare June 26, 2026 03:30

hallerite changed the title ~~feat(base): drop transformers from the offset path~~ v3 collapse closures + tokenizers.Encoding offset API Jun 26, 2026

hallerite force-pushed the drop-emit-text-segments branch from bf84a66 to afa4c9b Compare June 26, 2026 03:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

v3 collapse closures + tokenizers.Encoding offset API#87

v3 collapse closures + tokenizers.Encoding offset API#87
hallerite wants to merge 1 commit into
mainfrom
drop-emit-text-segments

hallerite commented Jun 17, 2026 •

edited by macroscopeapp Bot

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

macroscopeapp Bot commented Jun 17, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

hallerite commented Jun 17, 2026 • edited by macroscopeapp Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Context

What this does

1. v3 collapse-or-fallback closures (8 renderers)

2. attribute_text_segments rewritten on the tokenizers.Encoding API

3. _get_offset_tokenizer simplified to two paths

4. tokenizers>=0.20 explicit dep

Tests

Drop transformers from the offset encoding path by using tokenizers.Tokenizer directly

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

macroscopeapp Bot commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Approvability

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

hallerite commented Jun 17, 2026 •

edited by macroscopeapp Bot

Loading

2. `attribute_text_segments` rewritten on the `tokenizers.Encoding` API

3. `_get_offset_tokenizer` simplified to two paths

4. `tokenizers>=0.20` explicit dep

Drop `transformers` from the offset encoding path by using `tokenizers.Tokenizer` directly

macroscopeapp Bot commented Jun 17, 2026 •

edited

Loading