OKC Dumb Tokenizer (okc-dumb_pre)

The reference implementation for word-level, reversible text tokenization and compression, serving as the open, auditable baseline for the OKCompressor pipeline.

What is this?

A minimal, readable Python toolkit that:

Tokenizes raw text at the word level
Compresses by mapping each unique word to an integer (dictionary encoding)
Restores the original text perfectly (full round-trip)
Benchmarks against industry standards (e.g., tiktoken)
Publishes all benchmark results for verification and collaboration

Why?

Too many modern NLP and LLM tokenizers are “black boxes.”
OKC Dumb Tokenizer sets the baseline for open, fully auditable text preprocessing—so everyone can see, verify, and build on our results.

Transparent: No hidden steps, 100% reproducible
Collaborative: Designed for research, extension, and real-world testing
Comparable: Benchmarked directly against OpenAI tiktoken (and others)

How does it work?

Transform: Reads the text, splits it into words, builds a dictionary, and encodes the text as integer IDs.
Restore: Uses the dictionary and IDs to reconstruct the exact original text.
Benchmark: Measures token count, compression size, time, and correctness; outputs results as CSV.
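
The Transform and Restore steps can be illustrated with a minimal sketch. This is not the code from this repository: the function names `transform` and `restore` and the splitting rule (alternating runs of word and non-word characters, so that whitespace and punctuation survive the round trip) are assumptions made for illustration only.

```python
import re

def transform(text):
    """Split text into word / non-word runs and dictionary-encode them.

    Assumption: the splitting rule below is illustrative, not the
    rule used by dumb_pre itself.
    """
    # Every character is either \w or \W, so the runs cover the whole text.
    tokens = re.findall(r"\w+|\W+", text)
    vocab = {}  # token -> integer ID, assigned in first-seen order
    ids = []
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
        ids.append(vocab[tok])
    dictionary = list(vocab)  # index i holds the token with ID i
    return dictionary, ids

def restore(dictionary, ids):
    """Rebuild the exact original text from the dictionary and the IDs."""
    return "".join(dictionary[i] for i in ids)

text = "the cat sat on the mat, the end.\n"
dictionary, ids = transform(text)
assert restore(dictionary, ids) == text  # full round trip
```

Keeping the separators as tokens of their own is what makes this sketch reversible; a splitter that discards whitespace would need to store it elsewhere to restore the text exactly.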

Quick Start

git clone https://github.com/OKCompressor/dumb_pre.git
cd dumb_pre
pip install -r requirements.txt

# Place your test corpus in ./data/ (e.g., enwik8)
python bench_dumb.py

🚀 OKC Dumb Tokenizer: Reference Implementation & Benchmarks Released

We’re excited to share the first open-source reference for word-level, reversible tokenization and compression—the new baseline for OKCompressor.

🔍 Why does this matter?

Most tokenizers and compressors used in LLM pipelines are closed-source or difficult to audit.
OKC Dumb Tokenizer is fully open, reproducible, and delivers real numbers:

Method	Gzipped Size (MB)	Unique Tokens	Round-trip Lossless
dumb_pre	35.8	424,268	✅ True
tiktoken	37.0	71,161	✅ True

(enwik8, gzip compressed)
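
For readers who want to see how columns like those above might be computed, here is a rough sketch using only the standard library. It is not `bench_dumb.py`: the `benchmark` function, the JSON serialization of the dictionary and ID stream, and the CSV column names are all assumptions, so the numbers it produces will not match the table.

```python
import csv
import gzip
import io
import json
import re
import time

def benchmark(name, text):
    """Return one result row for a word-level dictionary encoding of text."""
    start = time.perf_counter()
    tokens = re.findall(r"\w+|\W+", text)  # assumed splitting rule
    vocab = {}
    ids = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
    dictionary = list(vocab)
    elapsed = time.perf_counter() - start

    # Assumed on-disk form: dictionary and IDs as one JSON document.
    payload = json.dumps({"dict": dictionary, "ids": ids}).encode("utf-8")
    restored = "".join(dictionary[i] for i in ids)
    return {
        "method": name,
        "gzipped_bytes": len(gzip.compress(payload)),
        "total_tokens": len(ids),
        "unique_tokens": len(dictionary),
        "seconds": round(elapsed, 6),
        "lossless": restored == text,
    }

row = benchmark("sketch", "to be or not to be\n")

# Write the row as CSV (to an in-memory buffer here instead of a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

To try it on a real corpus, read the file into `text` first (for example a copy of enwik8 placed in `./data/`) and write the CSV to disk instead of the buffer.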

📊 Benchmark Results

🛠️ Features

Python-based, minimal, and readable
100% round-trip guarantee
Benchmark scripts and public CSV results included
Easy to extend, fork, or integrate

📈 Roadmap

Rust implementation (aiming for 10x speed)
Binary output & advanced layers
Multi-lingual and character tokenization
Seamless integration with OKCompressor suite

🤝 Join Us!

Try it out, fork, open an issue, or suggest a feature.
We are looking for collaborators, researchers, and partners, especially for the next-gen Rust build and real-world testing.

Transparency is the new baseline. Benchmarks are public. The journey is open.

github.com/OKCompressor/dumb_pre
