v3 collapse closures + tokenizers.Encoding offset API#87

Open
hallerite wants to merge 1 commit into main from drop-emit-text-segments

@hallerite hallerite commented Jun 17, 2026

Member

Context

#95 (remove fastokens) merged into main. With fastokens gone, the offset path collapses to "use the user's tokenizer's backend_tokenizer directly" — no second tokenizer load, no probe-verify, no AutoTokenizer fallback. This PR builds on that with two related refactors.

What this does

1. v3 collapse-or-fallback closures (8 renderers)

emit_text_segments closures in qwen3, qwen35, glm45, glm5, deepseek_v3, nemotron3, laguna_xs2, minimax_m2 get a "collapse adjacent same-label segments, then attribute the rest" pattern:

```python
collapsed: list[tuple[str, bool]] = []
for text, label in segments:
    if not text:
        continue
    if collapsed and collapsed[-1][1] == label:
        collapsed[-1] = (collapsed[-1][0] + text, label)
    else:
        collapsed.append((text, label))
if not collapsed:
    return
if len(collapsed) == 1:
    text, label = collapsed[0]
    emit_text(text, msg_idx, is_sampled=is_sampled, is_content=label)
    return
for tok_id, is_content in attribute_text_segments(self._tokenizer, collapsed):
    tokens.append(tok_id)
    indices.append(msg_idx)
    sampled.append(is_sampled)
    content_mask.append(is_content)
```

Homogeneous-label runs (most rendering paths after collapse) go through a single emit_text call, which preserves internal BPE merges and skips the offset attribution path entirely. Only genuinely mixed-label runs hit attribute_text_segments.
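
To make the collapse step concrete, here is a minimal standalone sketch of just the folding logic, lifted out of the closure. The helper name `collapse_segments` and the sample segments are illustrative assumptions, not names from this PR; the real closures go on to call emit_text or attribute_text_segments as shown above.

```python
def collapse_segments(segments: list[tuple[str, bool]]) -> list[tuple[str, bool]]:
    """Fold adjacent same-label segments together and drop empty ones.

    Hypothetical helper: it mirrors the loop in the closures, but is not a
    function that exists in the PR.
    """
    collapsed: list[tuple[str, bool]] = []
    for text, label in segments:
        if not text:
            continue  # empty segments contribute no tokens
        if collapsed and collapsed[-1][1] == label:
            # Same label as the previous run: concatenate so the tokenizer
            # sees one string and can merge across the old boundary.
            collapsed[-1] = (collapsed[-1][0] + text, label)
        else:
            collapsed.append((text, label))
    return collapsed


# Homogeneous after collapse: one run remains, so a single emit_text suffices.
homogeneous = collapse_segments([("Hello", True), ("", False), (", world", True)])

# Genuinely mixed: two runs survive, so offset attribution is still needed.
mixed = collapse_segments([("<think>", False), ("plan", True), (" more", True)])
```

Note that the empty segment is skipped before labels are compared, so it does not split the two content segments on either side of it.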

2. attribute_text_segments rewritten on the tokenizers.Encoding API

Previously, attribute_text_segments called tokenizer(text, return_offsets_mapping=True) (the transformers dict API). It now uses the Rust library's native Encoding.ids / Encoding.offsets directly:

```python
encoding = offset_tokenizer.encode(full_text, add_special_tokens=False)
token_ids = list(encoding.ids)
offsets = list(encoding.offsets)
```

This unblocks the future transformers-optional path (issue #31): a BYO tokenizers.Tokenizer (no transformers wrapper) works directly. minimax_m2.emit_token_overlap_body and qwen3_vl._Emitter._flush are updated to the same API.
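
As a rough illustration of how ids and offsets of this shape can be mapped back to segment labels, the sketch below assigns each token the label of the segment that contains its start offset. The function name `attribute_by_offsets`, the start-offset rule, and the token ids and offsets in the example are all assumptions made for illustration; the PR does not show how attribute_text_segments resolves tokens that straddle a segment boundary. The sketch also assumes the offsets index into the same concatenated string the segments were joined into, and that the segments are non-empty.

```python
from bisect import bisect_right


def attribute_by_offsets(token_ids, offsets, segments):
    """Pair each token id with the label of the segment holding its start offset.

    token_ids / offsets are parallel lists shaped like Encoding.ids and
    Encoding.offsets: offsets are (start, end) spans into the concatenated
    segment text. The start-offset rule is an assumption of this sketch.
    """
    # Position at which each segment begins in the concatenated text.
    starts, labels, pos = [], [], 0
    for text, label in segments:
        starts.append(pos)
        labels.append(label)
        pos += len(text)

    out = []
    for tok_id, (start, _end) in zip(token_ids, offsets):
        # Index of the last segment that begins at or before `start`.
        seg = bisect_right(starts, start) - 1
        out.append((tok_id, labels[seg]))
    return out


segments = [("<think>", False), ("plan more", True)]  # boundary at offset 7

# Made-up ids and offsets where tokens line up with the segment boundary.
aligned = attribute_by_offsets([10, 11, 12], [(0, 7), (7, 11), (11, 16)], segments)

# Made-up offsets where the second token (5, 9) straddles the boundary.
straddling = attribute_by_offsets([20, 21, 22], [(0, 5), (5, 9), (9, 16)], segments)
```

Under this rule a straddling token inherits the label of the earlier segment. A different policy (for example, marking a token as content if any part of it overlaps a content segment) would only change how `seg` is chosen.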

3. _get_offset_tokenizer simplified to two paths

```python
def _get_offset_tokenizer(tokenizer):
    from tokenizers import Tokenizer as RustTokenizer
    if isinstance(tokenizer, RustTokenizer):
        return tokenizer
    backend = getattr(tokenizer, "backend_tokenizer", None)
    if isinstance(backend, RustTokenizer):
        return backend
    raise RuntimeError("…fast tokenizer with a tokenizers.Tokenizer backend…")
```

The resolver either returns a tokenizers.Tokenizer passed in directly (BYO Rust BPE) or extracts .backend_tokenizer from a PreTrainedTokenizerFast; anything else raises a RuntimeError. There is no second tokenizer load, no probe-verify, and no AutoTokenizer fallback. All of those existed in this PR's pre-rebase version to coordinate with the fastokens shim, which is gone after #95.

4. tokenizers>=0.20 explicit dep

Already transitive via transformers, but attribute_text_segments imports from tokenizers at module level so we declare it.

Tests

  • Suite: 2248 passed, 88 skipped, 1 xfailed (baseline parity with #95, "Remove fastokens entirely").
  • test_get_offset_tokenizer_rejects_offsetless_byo updated to match the new error message ("fast tokenizer with a tokenizers.Tokenizer backend").

🤖 Generated with Claude Code

Note

Drop transformers from the offset encoding path by using tokenizers.Tokenizer directly

  • Replaces the return_offsets_mapping=True tokenizer call in _get_offset_tokenizer (renderers/base.py) with a direct return of a tokenizers.Tokenizer (either passed directly or via backend_tokenizer).
  • Adds tokenizers>=0.20 as a runtime dependency in pyproject.toml, removing the implicit dependency on transformers for offset-aware encoding.
  • Updates attribute_text_segments to use tokenizers.Encoding.offsets instead of the HuggingFace fast tokenizer offset mapping API.
  • Adds segment collapsing (merging adjacent same-label segments, skipping empty segments) and a homogeneous fast-path to emit_text_segments across all renderer implementations.
  • Behavioral Change: callers must now provide a PreTrainedTokenizerFast (with a Rust backend) or a bare tokenizers.Tokenizer; slow tokenizers without a backend_tokenizer now raise a RuntimeError.

Macroscope summarized afa4c9b.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit c3c51a7.

Comment thread renderers/minimax_m2.py
@macroscopeapp

macroscopeapp Bot commented Jun 17, 2026

Approvability

Verdict: Needs human review

This PR changes core tokenization functionality by switching from HuggingFace's tokenizer API to the underlying Rust tokenizers library directly, adds segment collapsing optimization logic across multiple renderers, and introduces a new dependency. These runtime behavior changes to core functionality warrant human review.

@hallerite hallerite force-pushed the drop-emit-text-segments branch 2 times, most recently from 7878261 to bf84a66, June 26, 2026 03:30
@hallerite hallerite changed the title from "feat(base): drop transformers from the offset path" to "v3 collapse closures + tokenizers.Encoding offset API", Jun 26, 2026
Two related refactors of the emit_text_segments / attribute_text_segments
pipeline:

1. ``emit_text_segments`` closures across 8 hand-coded renderers
   (qwen3, qwen35, glm45, glm5, deepseek_v3, nemotron3, laguna_xs2,
   minimax_m2) get a "collapse-or-fallback" pattern: adjacent
   same-label segments are folded into one ``emit_text`` call
   (preserves internal BPE merges, skips the offset path); only
   genuinely mixed-label runs go through ``attribute_text_segments``.
   Most rendering paths end up homogeneous after collapse, so the
   offset machinery only runs when it actually has to.

2. ``attribute_text_segments`` is rewritten to use the Rust
   ``tokenizers.Encoding`` API directly — ``.encode().ids`` /
   ``.encode().offsets`` — instead of going through
   ``transformers``'s ``return_offsets_mapping=True`` dict API. This
   unblocks the future ``transformers``-optional path (issue #31): a
   BYO ``tokenizers.Tokenizer`` works without any ``transformers``
   wrapper. ``_get_offset_tokenizer`` becomes a 2-path resolver
   (direct Rust tokenizer, or extract ``.backend_tokenizer`` from a
   ``PreTrainedTokenizerFast``); no second tokenizer load, no
   probe-verify, no AutoTokenizer fallback — all of those existed in
   the previous version of this PR to coordinate with the
   fastokens shim, which is gone after #95.

``minimax_m2.emit_token_overlap_body`` and ``qwen3_vl._Emitter._flush``
are updated to call the new ``Encoding``-based offset API directly.

``tokenizers>=0.20`` becomes an explicit core dependency — it was
already a transitive of ``transformers``, but the new ``attribute_text_segments``
imports from ``tokenizers`` at the module level so we declare it.

Tests: 2248 passed, 88 skipped, 1 xfailed (baseline parity with #95).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hallerite hallerite force-pushed the drop-emit-text-segments branch from bf84a66 to afa4c9b, June 26, 2026 03:36