OKCompressor Core

⚠️ dev/scene disclaimer:
This is a personal, in-progress R&D compression playground.
Code, modules, and docs are messy, fluid, and often “notes to self.”
Some submodules are private, drafts, or experimental WIP; not all code is open right now.
Occasionally, Rust executables or binaries may be dropped in to speed up bottlenecks.
No production support, no guarantees—this repo is “backup-first,” not “release-ready.”
Fork, hack, or lurk at your own risk. PRs welcome but bring patience!

Modular, open corpus compression for the LLM era.

Directory Structure

/OKCompressor
  /core           # Orchestration, main scripts, API/readme, entrypoint

  /dumb_pre       # Baseline reversible word tokenization/dicts
  /redumb         # Rust-based, WIP, dumb_pre

  /ngram_pos      # N-gram positional tools
  /cc_nlp         # NLP & AI transforms, DAWG, codebook
    /ngram-dawg   # Modular DAWG & automata toolkit
    /rengrams     # Rust n-grams/faster routines

  /crux           # Custom compression transforms (BWT, MTF, etc.)

  /stego          # Steganography/watermarking layers (R&D)
  /mapper         # Symbol remapping, index management
  /cypher         # pgp, aes per file/block for now. cc_PQC later. FHE.

    /mDAWG        # Multi-level DAWG
    /nGPE         # Next-gen prefix encoding (future/experimental)

  /ranking        # Token/symbol ranking modules

  /entrop         # Entropy coding: rANS/ANS/constriction hooks

* /pLM            # pseudo LM, statistical word models from the ngrams of OKC

API Table: Inputs, Outputs, Entrypoints

Module	Entrypoint/Script	Input(s)	Output(s)	Notes
dumb_pre	dumb_pre.py / redumb	raw.txt	dict.txt, out.txt	Baseline reversible
cc_nlp	proc_post.py	dicts/	.bwtmtf.txt, .meta.npz	Mocked crux postproc
ngram_pos	aggregate.py	.bwtmtf.txt	.npz, .tsv	N-gram sweep/agg
cc_nlp	ngram_analyzer.py	.tsv	codebook.json	Semantic grouping
cc_nlp	replace_ngrams.py	ngram_db, input_dir	output_dir	N-gram replacement
ngram-dawg	runner.py	token files	.edgelist, .order.npy	DAWG build/export
crux	crux.py	token streams	transformed streams	Custom & transforms
core	bench_logger.py	output_dir	.tsv/.json logs	Benchmarking/summary
...	...	...	...	...
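
To make "baseline reversible" in the dumb_pre row concrete, here is a minimal sketch of a word-level tokenizer whose output can be joined back into the exact input. It is not the actual dumb_pre.py code: the function names, the regex split, and the in-memory `vocab`/`ids` pair (standing in for dict.txt and out.txt) are all assumptions made for illustration.

```python
import re

def tokenize(text):
    # Alternating runs of word and non-word characters; every character lands
    # in exactly one token, so "".join(tokens) == text.
    return re.findall(r"\w+|\W+", text)

def encode(text):
    vocab = []   # unique tokens in first-seen order (hypothetical stand-in for dict.txt)
    index = {}   # token -> position in vocab
    ids = []     # one id per token (hypothetical stand-in for out.txt)
    for tok in tokenize(text):
        if tok not in index:
            index[tok] = len(vocab)
            vocab.append(tok)
        ids.append(index[tok])
    return vocab, ids

def decode(vocab, ids):
    # Reverse step: look every id up in the dictionary and concatenate.
    return "".join(vocab[i] for i in ids)

sample = "the cat sat on the mat.\nthe end"
vocab, ids = encode(sample)
assert decode(vocab, ids) == sample
```

Whitespace and punctuation are kept as tokens of their own, which is what makes the round trip lossless without any extra side information.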

Minimal, Reversible Pipeline (v0.3-minimal-pipeline)

Core flow:

dumb_pre → cc_nlp (category tokenization) → ngram_pos (replacement) → crux2 BWT+MTF+RLE → final 7z archive.

    The archive should include only the minimal files needed for lossless reversal.

Pipeline Stages

    Dumb Preprocessing:

        Input: data/enwik6

        Output: output/enwik6/00_dumb/output.txt, dict.txt

    CC-NLP Category Chunking:

        Output: sub_idxs_*.npy, cats_*.base4, cat{n}_commons.txt, cat{n}_uniqs.txt

    N-gram Aggregation (Replacement):

        Output: ngrams_cc_temp.db, ngrams_cc_dicts.npz

    N-gram Replacement (Codebook):

        Output: repl_subidx_*.npy, ngram_used_codebook.txt/.npz

    BWT → MTF → RLE Chain:

        Output: repl_subidx_*_bwtmtfrle.txt

    Minimal Archive Set for Decompression:

        repl_subidx_*_bwtmtfrle.txt, cats_*.base4, cat{n}_commons.txt, cat{n}_uniqs.txt, ngram_used_codebook.txt/.npz
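
As a rough sketch of how the minimal archive set above could be picked out of a stage output directory, the snippet below filters file names against glob patterns transcribed from that list. Reading `{n}` as a wildcard, the `select_minimal` helper, and the example file names are assumptions; the step that actually writes the 7z archive is left out.

```python
from fnmatch import fnmatchcase

# Patterns transcribed from the minimal archive set; "{n}" is treated as "*".
MINIMAL_PATTERNS = [
    "repl_subidx_*_bwtmtfrle.txt",
    "cats_*.base4",
    "cat*_commons.txt",
    "cat*_uniqs.txt",
    "ngram_used_codebook.txt",
    "ngram_used_codebook.npz",
]

def select_minimal(filenames):
    # Keep only the names that match at least one pattern (case-sensitive).
    return sorted(
        name for name in filenames
        if any(fnmatchcase(name, pat) for pat in MINIMAL_PATTERNS)
    )

# Hypothetical directory listing, for illustration only.
listing = [
    "repl_subidx_0_bwtmtfrle.txt", "repl_subidx_0.npy", "cats_0.base4",
    "cat1_commons.txt", "cat1_uniqs.txt", "ngram_used_codebook.npz",
    "ngrams_cc_temp.db", "dict.txt",
]
keep = select_minimal(listing)
```

Intermediate files such as `ngrams_cc_temp.db` and the unpacked `.npy` arrays fall through the filter, which matches the intent that only what the reverse pass needs gets archived.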

Reverse:
Unarchive → inverse RLE → inverse MTF → inverse BWT → codebook restore → category dicts → join tokens → original file.
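
The BWT → MTF → RLE chain and its reverse can be sketched on a plain list of token ids as below. This is a naive textbook version for illustration, not the crux code: the rotation-sort BWT is quadratic, the primary index and the alphabet are carried as side information, and all function names are assumptions.

```python
def bwt(seq):
    # Sort all rotations; emit the last column plus the row of the original.
    n = len(seq)
    if n == 0:
        return [], 0
    order = sorted(range(n), key=lambda i: seq[i:] + seq[:i])
    return [seq[(i - 1) % n] for i in order], order.index(0)

def inverse_bwt(last, primary):
    # A stable sort of the last column maps each row to the row that is the
    # same rotation shifted left by one symbol.
    nxt = sorted(range(len(last)), key=lambda i: last[i])
    out, i = [], primary
    for _ in range(len(last)):
        i = nxt[i]
        out.append(last[i])
    return out

def mtf_encode(seq, alphabet):
    table, out = list(alphabet), []
    for s in seq:
        r = table.index(s)
        out.append(r)
        table.insert(0, table.pop(r))   # move the symbol to the front
    return out

def mtf_decode(ranks, alphabet):
    table, out = list(alphabet), []
    for r in ranks:
        s = table.pop(r)
        out.append(s)
        table.insert(0, s)
    return out

def rle_encode(seq):
    runs = []                           # list of [value, run_length]
    for s in seq:
        if runs and runs[-1][0] == s:
            runs[-1][1] += 1
        else:
            runs.append([s, 1])
    return runs

def rle_decode(runs):
    return [s for s, n in runs for _ in range(n)]

def compress(tokens):
    alphabet = sorted(set(tokens))
    last, primary = bwt(tokens)
    return rle_encode(mtf_encode(last, alphabet)), primary, alphabet

def decompress(runs, primary, alphabet):
    # Reverse order: RLE, then MTF, then BWT.
    return inverse_bwt(mtf_decode(rle_decode(runs), alphabet), primary)

tokens = [3, 1, 3, 1, 3, 2, 1, 3, 1, 3]
assert decompress(*compress(tokens)) == tokens
```

BWT groups equal symbols together, MTF turns those groups into runs of small ranks (mostly zeros), and RLE collapses the runs, so the stream handed to the final archiver is much more repetitive than the raw token ids.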

Quickstart

# Python 3.11+ recommended; PyPy optional for speed
pip install -r requirements.txt
# Or, for max speed:
pypy3 -m pip install -r requirements.txt

# If using Rust modules (experimental, for redumb/rengrams):
cargo build --release

cd core
python main.py config.yaml
# Or, for PyPy:
pypy3 main.py config.yaml

# Outputs: output/{your_corpus}/99_dicts_final/
