Added
Added a new tokenizer_kwargs argument to chunkerify() allowing users to pass custom keyword arguments to their tokenizers and token counters. tokenizer_kwargs can be used to override the new default behavior of treating any encountered special tokens as normal text when using a tiktoken or transformers tokenizer.
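A minimal sketch of how this might look, assuming the package is imported as semchunk and that the keys are forwarded to the tokenizer's encoding call; tiktoken's allowed_special is used purely as an illustration of such a key:

```python
import tiktoken
import semchunk  # assumed import name

# Sketch only: forward extra keyword arguments to the underlying tokenizer.
# Which keys are valid depends on the tokenizer in use.
chunker = semchunk.chunkerify(
    tiktoken.get_encoding("cl100k_base"),
    chunk_size=512,
    tokenizer_kwargs={"allowed_special": "all"},  # treat special tokens as special again
)

chunks = chunker("A long document containing <|endoftext|> markers...")
```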
Started treating special tokens as normal text when a tiktoken or transformers tokenizer is used, instead of raising an error (tiktoken) or treating them as special tokens (transformers).
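For context, this is the default tiktoken behavior being worked around; a standalone sketch, not the library's internal code:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# By default, tiktoken refuses to encode text containing a special token.
try:
    enc.encode("before <|endoftext|> after")
except ValueError:
    pass  # "<|endoftext|>" is disallowed by default

# With disallowed_special=(), the marker is tokenized as ordinary text,
# which is the behavior the change above now applies by default.
tokens = enc.encode("before <|endoftext|> after", disallowed_special=())
```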
Added support for Python 3.14.
Changed
Demoted asterisks in the hierarchy of splitters from sentence terminators to clause separators to better reflect their typical syntactic function.
Dramatically improved performance when handling extremely long sequences of punctuation characters.
All arguments to chunkerify() except the first two, tokenizer_or_token_counter and chunk_size, are now keyword-only.
All arguments to chunk() except the first three, text, chunk_size, and token_counter, are now keyword-only.
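A short sketch of the calling convention after this change, assuming the package is imported as semchunk and accepts a model name string; memoize is used only as an example of a later parameter:

```python
import semchunk

# The leading parameters can still be passed positionally...
chunker = semchunk.chunkerify("gpt-4", 512)     # tokenizer_or_token_counter, chunk_size
chunks = semchunk.chunk("some text", 512, len)  # text, chunk_size, token_counter

# ...but any later argument must now be passed by keyword.
chunker = semchunk.chunkerify("gpt-4", 512, memoize=True)  # OK
# semchunk.chunkerify("gpt-4", 512, True)                  # TypeError
```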
Significantly improved performance in cases where merge_splits() was the biggest bottleneck by switching from joining splits with splitters to indexing into the original text.
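An illustrative sketch of the technique described above (hypothetical names, not the library's actual code): record each split's offsets and slice the original text rather than re-joining splits with their splitter.

```python
def merge_by_slicing(text: str, offsets: list[tuple[int, int]], start: int, stop: int) -> str:
    # offsets[i] holds the (begin, end) indices of the i-th split within text.
    # Slicing from the first split's start to the last split's end recovers the
    # merged chunk, including any splitters between them, without building
    # intermediate strings.
    return text[offsets[start][0]:offsets[stop - 1][1]]
```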
Slightly sped up merge_splits() by replacing its previous search implementation with the standard library's bisect_left() function.
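A sketch of the kind of binary search this enables, assuming a list of per-split token counts (hypothetical helper, not the library's code):

```python
from bisect import bisect_left
from itertools import accumulate

def splits_that_fit(token_counts: list[int], chunk_size: int) -> int:
    """Return how many consecutive leading splits fit within chunk_size tokens."""
    cumulative = list(accumulate(token_counts))
    # The index of the first cumulative count exceeding chunk_size is exactly
    # the number of leading splits whose combined count is still within budget.
    return bisect_left(cumulative, chunk_size + 1)
```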