OKCompressor/rusty_repair

mdawg_repair (Rust)

Rust port of the mdawg_repair Re-Pair-style grammar transform used in beta3.

It operates on integer ID streams (one big whitespace-separated list of ints), finds frequent adjacent pairs (X, Y), and replaces them with new macro IDs. The transform is reversible given the rule table and base_vocab_size.
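As a rough illustration of a single pairing step described above, the following sketch counts adjacent pairs and replaces the most frequent one with a new ID. The function names `most_frequent_pair` and `replace_pair` are hypothetical and are not taken from this crate; tie-breaking and the handling of overlapping pairs are assumptions made only for this example.

```rust
use std::collections::HashMap;

/// Count adjacent pairs and return the most frequent one with its count.
/// Ties are broken in favour of the smaller pair so the result is deterministic.
/// ASSUMPTION: overlapping occurrences are counted naively ("7 7 7" counts
/// (7, 7) twice); the real implementation may count differently.
fn most_frequent_pair(seq: &[u32]) -> Option<((u32, u32), usize)> {
    let mut counts: HashMap<(u32, u32), usize> = HashMap::new();
    for w in seq.windows(2) {
        *counts.entry((w[0], w[1])).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .max_by(|a, b| a.1.cmp(&b.1).then_with(|| b.0.cmp(&a.0)))
}

/// Replace every non-overlapping occurrence of `pair`, scanning left to right,
/// with `new_id` (hypothetically the next unused macro ID).
fn replace_pair(seq: &[u32], pair: (u32, u32), new_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(seq.len());
    let mut i = 0;
    while i < seq.len() {
        if i + 1 < seq.len() && (seq[i], seq[i + 1]) == pair {
            out.push(new_id);
            i += 2; // consume both halves of the pair
        } else {
            out.push(seq[i]);
            i += 1;
        }
    }
    out
}
```

For the input `1 2 3 1 2 3 1 2` with base IDs 0..=3, the pair (1, 2) occurs three times, and replacing it with ID 4 gives `4 3 4 3 4`.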

The Rust implementation mirrors the Python API:

  • build_mdawg_repair(seq, cfg) -> (seq_prime, rules, base_vocab_size)
  • decode_mdawg_repair(seq_prime, rules, base_vocab_size) -> seq

and adds a small CLI so you can run it directly against the output/<core>/... assets.
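To make the reversibility claim concrete, here is a minimal decoding sketch. It assumes, without confirmation from the source, that rule `i` defines macro ID `base_vocab_size + i` and that IDs fit in `u32`. The name `decode_sketch` is hypothetical and is not the crate's `decode_mdawg_repair`.

```rust
/// Expand a transformed stream back into base IDs.
/// ASSUMPTION: rule `i` defines macro ID `base_vocab_size + i`.
/// Panics if a macro ID has no matching rule.
fn decode_sketch(seq_prime: &[u32], rules: &[(u32, u32)], base_vocab_size: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(seq_prime.len());
    // An explicit stack avoids deep recursion on long rule chains.
    let mut stack: Vec<u32> = Vec::new();
    for &id in seq_prime {
        stack.push(id);
        while let Some(top) = stack.pop() {
            if top < base_vocab_size {
                // Base symbol: emit as-is.
                out.push(top);
            } else {
                let (left, right) = rules[(top - base_vocab_size) as usize];
                // Push right first so that left is expanded first.
                stack.push(right);
                stack.push(left);
            }
        }
    }
    out
}
```

With `base_vocab_size = 4` and rules `[(1, 2), (4, 3)]`, the stream `5 5 4` expands to `1 2 3 1 2 3 1 2` under these assumptions.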


Build

From the repo root:

cd rust/mdawg_repair
cargo build --release

The binary will be at:

target/release/mdawg_repair

CLI usage

Basic:

mdawg_repair \
  --input /path/to/ids.txt \
  --output-dir /path/to/out_dir

This reads a whitespace-separated list of integers from --input, runs the transform, checks a round-trip decode, and writes:

  • out_dir/mdawg_stream.txt : transformed IDs (base + macro IDs)
  • out_dir/mdawg_rules.txt : one rule per line, "left right"
  • out_dir/mdawg_meta.txt : "base_vocab_size num_rules"
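The three files are simple enough to read back with a few lines of Rust. The sketch below parses their contents from strings, following only the layouts listed above. The helper names and the choice of `u32` for IDs are assumptions for illustration, not part of this crate.

```rust
/// Parse a whitespace-separated list of integer IDs (the mdawg_stream.txt layout).
fn parse_ids(text: &str) -> Result<Vec<u32>, std::num::ParseIntError> {
    text.split_whitespace().map(|t| t.parse::<u32>()).collect()
}

/// Parse rules, one "left right" pair per line (the mdawg_rules.txt layout).
fn parse_rules(text: &str) -> Result<Vec<(u32, u32)>, String> {
    let mut rules = Vec::new();
    for (n, line) in text.lines().enumerate() {
        if line.trim().is_empty() {
            continue; // tolerate blank lines
        }
        let ids = parse_ids(line).map_err(|e| format!("line {}: {}", n + 1, e))?;
        match ids.as_slice() {
            [left, right] => rules.push((*left, *right)),
            _ => return Err(format!("line {}: expected 2 IDs, found {}", n + 1, ids.len())),
        }
    }
    Ok(rules)
}

/// Parse the "base_vocab_size num_rules" line (the mdawg_meta.txt layout).
fn parse_meta(text: &str) -> Result<(u32, usize), String> {
    let ids = parse_ids(text).map_err(|e| e.to_string())?;
    match ids.as_slice() {
        [base, num_rules] => Ok((*base, *num_rules as usize)),
        _ => Err(format!("expected 2 integers, found {}", ids.len())),
    }
}
```

In practice each string would come from `std::fs::read_to_string` on the corresponding file in out_dir. A useful sanity check is that the number of parsed rules equals `num_rules` from the meta file.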

Options

  • --min-pair-freq N (default 2): Minimum frequency for a pair (X, Y) to be considered for a rule.

  • --max-rules N (default 0): Maximum number of rules/macros to create. 0 means “no explicit limit”.

  • --max-passes N (default 0): Maximum number of passes over the sequence. 0 means “no explicit limit”.

  • --base-vocab-size N (default 0): If > 0, use this as base_vocab_size. If 0, infer it as max(seq) + 1.

  • --block-entropy: Enable global H0·N stopping. After each merge, compute the total bits for the whole sequence; if a merge increases the total, revert to the best previous state and stop.

  • --verbose: Log per-pass stats (length, rule count, and entropy if --block-entropy is set).

  • --progress-every N (default 10): Log every N passes when --verbose is set.
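The H0·N quantity used by --block-entropy and the max(seq) + 1 inference can be sketched as follows. This is an illustrative calculation only: it costs the stream alone, and whether the tool also accounts for the rule table in its total is not stated in this README. Both function names are hypothetical.

```rust
use std::collections::HashMap;

/// Total zero-order cost in bits: H0 * N = sum over symbols of c * log2(N / c),
/// where c is the symbol's count and N the stream length.
/// ASSUMPTION: only the stream is costed, not the rule table.
fn h0_total_bits(seq: &[u32]) -> f64 {
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &id in seq {
        *counts.entry(id).or_insert(0) += 1;
    }
    let n = seq.len() as f64;
    counts
        .values()
        .map(|&c| {
            let c = c as f64;
            c * (n / c).log2()
        })
        .sum()
}

/// Infer the base vocabulary size as max(seq) + 1 (0 for an empty stream),
/// matching the documented behaviour of --base-vocab-size 0.
fn infer_base_vocab_size(seq: &[u32]) -> u32 {
    seq.iter().max().map_or(0, |&m| m + 1)
}
```

For example, the stream `0 0 1 1` has two equally likely symbols, so H0 is 1 bit per symbol and H0·N is 4 bits. A stopping rule of the kind described above would compare this total before and after each merge.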

Presets

A. “Entropy-safe” (recommended default)

mdawg_repair \
  --input output/enwik7_core/00_dumb/output.txt \
  --output-dir output/enwik7_core/01_mdawg_dumb_rs \
  --min-pair-freq 2 \
  --max-rules 0 \
  --max-passes 0 \
  --block-entropy \
  --verbose \
  --progress-every 20

  • Runs until a merge would increase H0·N.
  • Reverts to the best (lowest total bits) state.
  • Usually yields a moderate token reduction, and the final total H0·N is never higher than that of the best state seen during the run.

B. “Full squeeze” (aggressive)

mdawg_repair \
  --input output/enwik7_core/00_dumb/output.txt \
  --output-dir output/enwik7_core/01_mdawg_dumb_rs_full \
  --min-pair-freq 2 \
  --max-rules 200000 \
  --max-passes 0 \
  --verbose \
  --progress-every 20
  # note: no --block-entropy here

  • Keeps merging while any pair has frequency >= min_pair_freq, stopping early if max_rules or max_passes is reached.
  • Can produce very high rule counts and much shorter streams, at the cost of a higher H0 per symbol. Good for “how far can we push this?” experiments.

Refs (Re-Pair):

Larsson & Moffat (Re-Pair / recursive pairing)

Practical and Effective Re-Pair Compression (arXiv)
