Skip to content

Update tokenizers requirement from ^0.21 to ^0.23#17

Merged
awadell1 merged 2 commits into
mainfrom
dependabot/cargo/tokenizers-tw-0.23
May 11, 2026
Merged

Update tokenizers requirement from ^0.21 to ^0.23#17
awadell1 merged 2 commits into
mainfrom
dependabot/cargo/tokenizers-tw-0.23

Conversation

@dependabot
Copy link
Copy Markdown
Contributor

@dependabot dependabot Bot commented on behalf of github May 10, 2026

Updates the requirements on tokenizers to permit the latest version.

Release notes

Sourced from tokenizers's releases.

Release v0.23.1

TL;DR

tokenizers 0.23.1 is the first proper stable release in the 0.23 line — 0.23.0 only ever shipped as rc0 because the release pipeline itself was broken (Node side hadn't shipped multi-platform binaries since 2023, Python side was on pyo3 0.27 without free-threaded support). 0.23.1 is the version where everything actually goes out the door together: full Node multi-platform wheels for the first time in years, Python 3.14 (regular and free-threaded 3.14t), full type hints for every Python class, and a stack of measurable perf wins on the BPE / added-vocab hot paths.

There is no functional 0.23.0 published — we tag 0.23.1 directly so users don't accidentally pull a never-shipped version.


🚨 Breaking changes

  • Drop Python 3.9 (#1952) — requires-python = ">=3.10"; 3.9 users stay on 0.22.x.
  • add_tokens normalizes content at insertion (#1995) — re-saved tokenizer.json may differ in the added_tokens block. Existing files load unchanged.
  • Type stubs are precise (#1928, #1997) — methods that returned Any now return real types; mypy --strict may surface previously-hidden errors. Stub layout also moved from tokenizers/<sub>/__init__.pyi to tokenizers/<sub>.pyi. This breaks the surface of some of the processors like RobertaProcessign's __init__ .
  • 3.14t-only: setters/getters return PyResult<T> because of Arc<RwLock<Tokenizer>>; a poisoned lock surfaces as PyException instead of a panic.

⚡ Performance — measured locally on this Mac, not lifted from PRs

Run with cargo bench --bench <name> -- --save-baseline v0_22_2 on v0.22.2, then --baseline v0_22_2 on v0.23.1. Numbers are point-in-time wall clock on a single laptop; relative deltas are what matters, absolute numbers will differ on CI hardware.

Added-vocabulary deserialize — the headline win (#1995, #1999)

bench: improve added_vocab_deserialize to reflect real-world workloads (#2000) is now representative of how transformers actually loads tokenizer.json files. The combined effect of daachorse for the matching automaton plus the normalize-on-insert refactor is enormous on this workload:

benchmark v0.22.2 v0.23.1 change
100k tokens, special, no norm ~410 ms 248 ms −40%
100k tokens, non-special, no norm ~7.1 s 273 ms −96%
100k tokens, special, NFKC ~395 ms 235 ms −40%
100k tokens, non-special, NFKC ~7.4 s 290 ms −96%
400k tokens, special, no norm ~15 s 980 ms −94%

Real-world impact: loading a Llama-3-style tokenizer with a large set of added tokens dropped from "noticeable pause" to "instant".

BPE encode

benchmark v0.22.2 v0.23.1 change
BPE GPT2 encode batch, no cache 530 ms 446 ms −16%
BPE GPT2 encode batch (cached) 690 ms 685 ms noise
BPE GPT2 encode (single) 1.95 s 1.94 s noise
BPE Train (small) 32.6 ms 31.5 ms −3%
BPE Train (big) 1.01 s 988 ms −2%

The BPE per-thread cache PR (#2028) shows much larger wins on highly-parallel workloads (+47–62% at 88+ threads on a server box, per the PR's own measurements on Vera). Single-thread batch numbers above are flat or slightly improved because cache-hit overhead was already low without contention.

Llama-3 encode

... (truncated)

Commits

@dependabot dependabot Bot added dependencies Pull requests that update a dependency file rust Pull requests that update rust code labels May 10, 2026
@awadell1
Copy link
Copy Markdown
Member

@dependabot rebase

Updates the requirements on [tokenizers](https://github.com/huggingface/tokenizers) to permit the latest version.
- [Release notes](https://github.com/huggingface/tokenizers/releases)
- [Changelog](https://github.com/huggingface/tokenizers/blob/main/RELEASE.md)
- [Commits](huggingface/tokenizers@v0.21.0...v0.23.1)

---
updated-dependencies:
- dependency-name: tokenizers
  dependency-version: 0.23.1
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
@dependabot dependabot Bot force-pushed the dependabot/cargo/tokenizers-tw-0.23 branch from 8221924 to bc8c663 Compare May 11, 2026 01:05
@awadell1 awadell1 merged commit 5e849fa into main May 11, 2026
9 checks passed
@awadell1 awadell1 deleted the dependabot/cargo/tokenizers-tw-0.23 branch May 11, 2026 01:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Pull requests that update a dependency file rust Pull requests that update rust code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant