Rust port of the mdawg_repair Re-Pair-style grammar transform used in beta3.
It operates on integer ID streams (one big whitespace-separated list of ints),
finds frequent adjacent pairs (X, Y), and replaces them with new macro IDs.
The transform is reversible given the rule table and base_vocab_size.
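A single merge step can be sketched as follows. This is a minimal illustration, not the code from this crate: the function names `count_pairs` and `replace_pair`, the `u32` ID type, and the naive handling of overlapping runs (such as `7 7 7`) are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Count every adjacent pair (X, Y) in `seq`.
/// Simplification: overlapping runs such as `7 7 7` are counted naively,
/// so (7, 7) is counted twice there even though only one merge is possible.
fn count_pairs(seq: &[u32]) -> HashMap<(u32, u32), usize> {
    let mut counts = HashMap::new();
    for w in seq.windows(2) {
        *counts.entry((w[0], w[1])).or_insert(0) += 1;
    }
    counts
}

/// Replace non-overlapping occurrences of `pair`, scanning left to right,
/// with the new symbol `macro_id`.
fn replace_pair(seq: &[u32], pair: (u32, u32), macro_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(seq.len());
    let mut i = 0;
    while i < seq.len() {
        if i + 1 < seq.len() && (seq[i], seq[i + 1]) == pair {
            out.push(macro_id);
            i += 2; // consume both halves of the pair
        } else {
            out.push(seq[i]);
            i += 1;
        }
    }
    out
}
```

For the stream `1 2 3 1 2 3 1 2`, the pair `(1, 2)` occurs three times; replacing it with a new ID `4` gives `4 3 4 3 4`.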
The Rust implementation mirrors the Python API:
build_mdawg_repair(seq, cfg) -> (seq_prime, rules, base_vocab_size)
decode_mdawg_repair(seq_prime, rules, base_vocab_size) -> seq
and adds a small CLI so you can run it directly against the output/<core>/...
assets.
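To illustrate why the rule table plus base_vocab_size is enough to reverse the transform, here is a hedged sketch of a decoder. It is not the crate's `decode_mdawg_repair`: the names `expand` and `decode_sketch` are hypothetical, and it assumes that rule number `k` (0-based, in file order) defines the macro ID `base_vocab_size + k`.

```rust
/// Expand one ID into base symbols, appending them to `out`.
/// Assumption: rule `k` defines macro ID `base_vocab_size + k`.
/// Panics if an ID refers to a rule that does not exist.
fn expand(id: u32, rules: &[(u32, u32)], base_vocab_size: u32, out: &mut Vec<u32>) {
    // Explicit stack instead of recursion, so deep rule chains cannot
    // overflow the call stack.
    let mut stack = vec![id];
    while let Some(cur) = stack.pop() {
        if cur < base_vocab_size {
            out.push(cur); // base symbol: emit as-is
        } else {
            let (left, right) = rules[(cur - base_vocab_size) as usize];
            // Push right first so that left is popped and expanded first.
            stack.push(right);
            stack.push(left);
        }
    }
}

/// Decode a transformed stream back into base symbols.
fn decode_sketch(seq_prime: &[u32], rules: &[(u32, u32)], base_vocab_size: u32) -> Vec<u32> {
    let mut out = Vec::new();
    for &id in seq_prime {
        expand(id, rules, base_vocab_size, &mut out);
    }
    out
}
```

With base_vocab_size 4 and rules `1 2` and `4 3`, ID 4 stands for `1 2` and ID 5 for `1 2 3`, so the stream `5 5 4` decodes to `1 2 3 1 2 3 1 2`.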
From the repo root:
cd rust/mdawg_repair
cargo build --release
The binary will be at:
target/release/mdawg_repair
Basic:
mdawg_repair \
--input /path/to/ids.txt \
--output-dir /path/to/out_dir
This reads a whitespace-separated list of integers from --input, runs the
transform, checks a round-trip decode, and writes:
- out_dir/mdawg_stream.txt: transformed IDs (base + macro IDs)
- out_dir/mdawg_rules.txt: one rule per line, "left right"
- out_dir/mdawg_meta.txt: "base_vocab_size num_rules"
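The text formats above are simple enough to read and write from other tools. The sketch below shows one way to do that; the helper names are hypothetical, and the `u32` ID type and the trailing newline after each rule are assumptions rather than documented behaviour of the binary.

```rust
use std::num::ParseIntError;

/// Parse a whitespace-separated list of integers (the --input layout).
/// Any mix of spaces, tabs, and newlines is accepted as a separator.
fn parse_ids(text: &str) -> Result<Vec<u32>, ParseIntError> {
    text.split_whitespace().map(|tok| tok.parse::<u32>()).collect()
}

/// One rule per line, "left right" (the mdawg_rules.txt layout).
fn format_rules(rules: &[(u32, u32)]) -> String {
    rules.iter().map(|(l, r)| format!("{} {}\n", l, r)).collect()
}

/// "base_vocab_size num_rules" (the mdawg_meta.txt layout).
fn format_meta(base_vocab_size: u32, num_rules: usize) -> String {
    format!("{} {}", base_vocab_size, num_rules)
}
```

Returning a `Result` from `parse_ids` lets the caller decide how to report a malformed token instead of panicking in the middle of a large file.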
- --min-pair-freq N (default 2): Minimum frequency for a pair (X, Y) to be considered for a rule.
- --max-rules N (default 0): Maximum number of rules/macros to create. 0 means “no explicit limit”.
- --max-passes N (default 0): Maximum number of passes over the sequence. 0 means “no explicit limit”.
- --base-vocab-size N (default 0): If > 0, use this as base_vocab_size. If 0, infer as max(seq) + 1.
- --block-entropy: Enable global H0·N stopping: after each merge, compute total bits for the whole sequence. If a merge increases bits, revert to the best previous state and stop.
- --verbose: Log per-pass stats (length, rule count, entropy if --block-entropy).
- --progress-every N (default 10): Log every N passes when verbose.
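Two of the options above come down to short formulas. The sketch below shows the max(seq) + 1 inference and a plain zeroth-order estimate of total bits, H0·N = Σ c·log2(N/c) over symbol counts c. The function names are hypothetical, returning 0 for an empty sequence is an assumption, and the estimate covers only the stream symbols; whether the tool also charges for the rule table is not stated here.

```rust
use std::collections::HashMap;

/// Infer base_vocab_size as max(seq) + 1.
/// Assumption: an empty sequence yields 0.
fn infer_base_vocab_size(seq: &[u32]) -> u32 {
    seq.iter().max().map_or(0, |&m| m + 1)
}

/// Total zeroth-order cost of `seq` in bits:
/// H0 * N = sum over distinct symbols of c * log2(N / c),
/// where c is the symbol's count and N the sequence length.
fn h0_total_bits(seq: &[u32]) -> f64 {
    let n = seq.len() as f64;
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &s in seq {
        *counts.entry(s).or_insert(0) += 1;
    }
    counts
        .values()
        .map(|&c| {
            let c = c as f64;
            c * (n / c).log2()
        })
        .sum()
}
```

For example, `0 1 0 1` has two symbols with equal counts, so each of the four symbols costs one bit and the total is 4 bits. A stream made of a single repeated symbol costs 0 bits under this measure.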
A. “Entropy-safe” (recommended default)
mdawg_repair \
--input output/enwik7_core/00_dumb/output.txt \
--output-dir output/enwik7_core/01_mdawg_dumb_rs \
--min-pair-freq 2 \
--max-rules 0 \
--max-passes 0 \
--block-entropy \
--verbose \
--progress-every 20
- Runs until a merge would increase H0·N.
- Reverts to the best (lowest bits) state.
- Usually yields a moderate token reduction, and the final H0·N is no higher than that of any earlier state in the run.
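The stopping rule in this preset can be sketched in isolation. The function below is a hypothetical helper, not part of the tool: it assumes the total bits of every state are already available as a slice (index 0 is the input, index k is the state after k merges), whereas the real run would compute them one merge at a time.

```rust
/// Return how many merges to keep under the "stop at the first increase" rule.
/// `bits_per_state[0]` is the input; `bits_per_state[k]` is the total after
/// k merges. A merge that leaves the total unchanged is kept.
fn merges_to_keep(bits_per_state: &[f64]) -> usize {
    let mut kept = 0;
    for k in 1..bits_per_state.len() {
        if bits_per_state[k] > bits_per_state[kept] {
            break; // this merge increased the total: revert to state `kept`
        }
        kept = k;
    }
    kept
}
```

Because every merge kept so far did not increase the total, the state just before the first increase is also the lowest-bits state seen, which is why reverting one step is enough in this sketch. Any later state that would have been cheaper is never reached.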
B. “Full squeeze” (aggressive)
mdawg_repair \
--input output/enwik7_core/00_dumb/output.txt \
--output-dir output/enwik7_core/01_mdawg_dumb_rs_full \
--min-pair-freq 2 \
--max-rules 200000 \
--max-passes 0 \
--verbose \
--progress-every 20
# note: no --block-entropy here
- Keeps merging while there are pairs with freq >= min_pair_freq and until max_rules/max_passes are hit.
- Can get very high rule counts and shorter streams, at the cost of higher H0 per symbol. Good for “how far can we push this?” experiments.
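The unbounded variant can be sketched as a greedy loop. This is an illustration under several assumptions, not the crate's implementation: the name `squeeze` is hypothetical, it creates exactly one rule per iteration and ignores max_passes, it recounts all pairs each time (simple but slow), it breaks ties by picking the smallest pair, and it numbers macros as `base_vocab_size + rule_index`.

```rust
use std::collections::HashMap;

/// Keep merging the most frequent adjacent pair while its frequency is
/// >= `min_pair_freq` and the rule budget is not used up.
/// `max_rules == 0` means "no explicit limit", as in the CLI.
fn squeeze(
    mut seq: Vec<u32>,
    base_vocab_size: u32,
    min_pair_freq: usize,
    max_rules: usize,
) -> (Vec<u32>, Vec<(u32, u32)>) {
    let mut rules = Vec::new();
    while max_rules == 0 || rules.len() < max_rules {
        // Recount all adjacent pairs from scratch.
        let mut counts: HashMap<(u32, u32), usize> = HashMap::new();
        for w in seq.windows(2) {
            *counts.entry((w[0], w[1])).or_insert(0) += 1;
        }
        // Highest count wins; ties go to the smallest pair for determinism.
        let best = counts
            .into_iter()
            .max_by(|a, b| a.1.cmp(&b.1).then(b.0.cmp(&a.0)));
        let (pair, freq) = match best {
            Some(x) => x,
            None => break, // fewer than two symbols left
        };
        if freq < min_pair_freq {
            break;
        }
        let macro_id = base_vocab_size + rules.len() as u32;
        // Replace non-overlapping occurrences, left to right.
        let mut out = Vec::with_capacity(seq.len());
        let mut i = 0;
        while i < seq.len() {
            if i + 1 < seq.len() && (seq[i], seq[i + 1]) == pair {
                out.push(macro_id);
                i += 2;
            } else {
                out.push(seq[i]);
                i += 1;
            }
        }
        rules.push(pair);
        seq = out;
    }
    (seq, rules)
}
```

On `1 2 3 1 2 3 1 2` with base_vocab_size 4 and min_pair_freq 2, this produces the rules `1 2` and `3 4` and the stream `4 5 5`. Each iteration shortens the stream by at least one symbol, so the loop terminates even with no rule limit.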
Refs (Re-Pair):
Larsson & Moffat (Re-Pair / recursive pairing)
Practical and Effective Re-Pair Compression (arXiv)