⚠️ dev/scene disclaimer:
This is a personal, in-progress R&D compression playground.
Code, modules, and docs are messy, fluid, and often “notes to self.”
Some submodules are private, drafts, or WIP (experimental); not all code is open right now.
Occasionally, Rust executables or binaries may be dropped in to speed up bottlenecks.
No production support, no guarantees; this repo is “backup-first,” not “release-ready.”
Fork, hack, or lurk at your own risk. PRs welcome but bring patience!
Project tree:
```
/OKCompressor
  /core        # Orchestration, main scripts, API/readme, entrypoint
  /dumb_pre    # Baseline reversible word tokenization/dicts
  /redumb      # Rust port of dumb_pre (WIP)
  /ngram_pos   # N-gram positional tools
  /cc_nlp      # NLP & AI transforms, DAWG, codebook
  /ngram-dawg  # Modular DAWG & automata toolkit
  /rengrams    # Rust n-grams/faster routines
  /crux        # Custom compression transforms (BWT, MTF, etc.)
  /stego       # Steganography/watermarking layers (R&D)
  /mapper      # Symbol remapping, index management
  /cypher      # PGP/AES per file/block for now; cc_PQC later; FHE
  /mDAWG       # Multi-level DAWG
  /nGPE        # Next-gen prefix encoding (future/experimental)
  /ranking     # Token/symbol ranking modules
  /entrop      # Entropy coding: rANS/ANS/constriction hooks
  /pLM         # Pseudo LM: statistical word models from OKC n-grams
```
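As a rough illustration of the `dumb_pre` idea (reversible word tokenization into a dict plus an index stream), here is a minimal sketch; the function names and the word/whitespace split are illustrative assumptions, not the module's actual API:

```python
import re

def dumb_pre(text):
    # Alternate word/whitespace tokens so that joining them back is lossless.
    tokens = re.findall(r"\S+|\s+", text)
    vocab = {}   # token -> id
    ids = []
    for t in tokens:
        if t not in vocab:
            vocab[t] = len(vocab)
        ids.append(vocab[t])
    dictionary = list(vocab)  # id -> token (dicts preserve insertion order)
    return dictionary, ids

def dumb_post(dictionary, ids):
    # Exact inverse: map ids back to tokens and concatenate.
    return "".join(dictionary[i] for i in ids)
```

Round-tripping `dumb_post(*dumb_pre(text))` returns the original text byte-for-byte, which is what makes this a safe first stage.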
| Module | Entrypoint/Script | Input(s) | Output(s) | Notes |
|---|---|---|---|---|
| dumb_pre | dumb_pre.py / redumb | raw.txt | dict.txt, out.txt | Baseline reversible |
| cc_nlp | proc_post.py | dicts/ | .bwtmtf.txt, .meta.npz | Mocked crux postproc |
| ngram_pos | aggregate.py | .bwtmtf.txt | .npz, .tsv | N-gram sweep/agg |
| cc_nlp | ngram_analyzer.py | .tsv | codebook.json | Semantic grouping |
| cc_nlp | replace_ngrams.py | ngram_db, input_dir | output_dir | N-gram replacement |
| ngram-dawg | runner.py | token files | .edgelist, .order.npy | DAWG build/export |
| crux | crux.py | token streams | transformed streams | Custom transforms |
| core | bench_logger.py | output_dir | .tsv/.json logs | Benchmarking/summary |
| ... | ... | ... | ... | ... |
Core flow:
dumb_pre → cc_nlp (category tokenization) → ngram_pos (replacement) → crux2 BWT+MTF+RLE → final 7z archive.
The archive should include only the minimal files needed for lossless reversal.
Pipeline Stages
1. Dumb preprocessing
   - Input: `data/enwik6`
   - Output: `output/enwik6/00_dumb/output.txt`, `dict.txt`
2. CC-NLP category chunking
   - Output: `sub_idxs_*.npy`, `cats_*.base4`, `cat{n}_commons.txt`, `cat{n}_uniqs.txt`
3. N-gram aggregation (replacement)
   - Output: `ngrams_cc_temp.db`, `ngrams_cc_dicts.npz`
4. N-gram replacement (codebook)
   - Output: `repl_subidx_*.npy`, `ngram_used_codebook.txt/.npz`
5. BWT → MTF → RLE chain
   - Output: `repl_subidx_*_bwtmtfrle.txt`
6. Minimal archive set for decompression
   - `repl_subidx_*_bwtmtfrle.txt`, `cats_*.base4`, `cat{n}_commons.txt`, `cat{n}_uniqs.txt`, `ngram_used_codebook.txt/.npz`

Reverse: unarchive → inverse RLE → inverse MTF → inverse BWT → codebook restore → category dicts → join tokens → original file.
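The BWT → MTF → RLE chain and its inverse can be illustrated with a naive, sentinel-based sketch; this is quadratic-time teaching code, not crux's actual implementation:

```python
def bwt(s, eos="\x00"):
    # Naive BWT: sort all rotations of s+sentinel, take the last column.
    s += eos
    rots = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rots)

def ibwt(b, eos="\x00"):
    # Rebuild the sorted rotation table one column at a time.
    table = [""] * len(b)
    for _ in range(len(b)):
        table = sorted(b[i] + table[i] for i in range(len(b)))
    return next(r for r in table if r.endswith(eos))[:-1]

def mtf(seq, alphabet):
    # Move-to-front: emit each symbol's index, then bump it to the front.
    alpha, out = list(alphabet), []
    for c in seq:
        i = alpha.index(c)
        out.append(i)
        alpha.insert(0, alpha.pop(i))
    return out

def imtf(idxs, alphabet):
    alpha, out = list(alphabet), []
    for i in idxs:
        out.append(alpha[i])
        alpha.insert(0, alpha.pop(i))
    return "".join(out)

def rle(idxs):
    # Collapse runs into (value, run_length) pairs.
    out = []
    for v in idxs:
        if out and out[-1][0] == v:
            out[-1][1] += 1
        else:
            out.append([v, 1])
    return out

def irle(pairs):
    return [v for v, n in pairs for _ in range(n)]
```

Applying `irle → imtf → ibwt` recovers the input exactly, which is the property the minimal archive set above relies on for lossless reversal.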
```bash
# Python 3.11+ recommended; PyPy optional for speed
pip install -r requirements.txt

# Or, for max speed:
pypy3 -m pip install -r requirements.txt

# If using the Rust modules (experimental, for redumb/rengrams):
cargo build --release
```

```bash
cd core
python main.py config.yaml

# Or, with PyPy:
pypy3 main.py config.yaml

# Outputs: output/{your_corpus}/99_dicts_final/
```