A modern C++17 implementation of the Bentley–McIlroy lossless compression algorithm, which detects and encodes long repeated substrings using a rolling Rabin–Karp fingerprint.
Implements the algorithm from:
J. Bentley, D. McIlroy
Data Compression Using Long Common Strings
IEEE Data Compression Conference (DCC), 1999.

The Bentley–McIlroy scheme compresses a string by finding and encoding long repeated substrings. The core idea is to slide a window of size `b` over the input, storing a fingerprint every `b` steps, then using those fingerprints to detect repetitions further along the string.
The algorithm in pseudocode, as given in the paper:

```
initialise fingerprint fp at position 0
for i = b to n-1:
    if i % b == 0:
        store(fp, i)
    update fp: include symbol at i, drop symbol at i-b
    checkformatch(fp, i)
```
`checkformatch` looks up `fp` in a hash table and encodes a match if one is found. The sliding window moves forward one symbol at a time; fingerprints are stored only at block boundaries.
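The loop above can be sketched in C++ as follows. This is a simplified, illustrative version: it uses a single-entry table instead of the implementation's FIFO chains of depth `k`, skips `memcmp` verification, and the names `RollingHash` and `findCandidates` are invented for this sketch, not taken from the codebase.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Polynomial rolling hash over a window of b symbols, standing in for
// the Rabin-Karp fingerprint; the base constant is arbitrary.
struct RollingHash {
    static constexpr uint64_t kBase = 257;
    uint64_t value = 0;  // hash of the current window
    uint64_t top = 1;    // kBase^(b-1), used to drop the oldest symbol
    RollingHash(const std::string& s, size_t b) {
        for (size_t i = 0; i < b; ++i) value = value * kBase + (unsigned char)s[i];
        for (size_t i = 0; i + 1 < b; ++i) top *= kBase;
    }
    void roll(unsigned char out, unsigned char in) {
        value = (value - out * top) * kBase + in;  // drop oldest, add newest
    }
};

// Returns the raw fingerprint hits -- (current window start, earlier
// stored start) -- before any verification or extension.
std::vector<std::pair<size_t, size_t>> findCandidates(const std::string& s, size_t b) {
    std::unordered_map<uint64_t, size_t> table;   // fingerprint -> block start
    std::vector<std::pair<size_t, size_t>> hits;
    if (b == 0 || s.size() < b) return hits;
    RollingHash fp(s, b);
    for (size_t i = b; i <= s.size(); ++i) {
        size_t start = i - b;                     // window covers [start, i)
        if (start % b == 0) table.emplace(fp.value, start);  // store at boundary
        auto it = table.find(fp.value);           // check at every position
        if (it != table.end() && it->second != start)
            hits.emplace_back(start, it->second);
        if (i < s.size()) fp.roll((unsigned char)s[start], (unsigned char)s[i]);
    }
    return hits;
}
```

On the README's example, `findCandidates("xabcdabcd", 2)` reports one raw hit: the window starting at 0-based position 6 (`bc`) against the stored block at position 2.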
Backward and forward extension. A raw fingerprint hit gives a match of exactly `b` symbols. Two greedy extensions improve it:
- Forward — advance both the current position and the candidate position one symbol at a time for as long as the symbols agree. There is no limit.
- Backward — step both pointers back one symbol at a time, up to a maximum of `b−1` steps. The bound exists by construction: if `b` or more symbols matched to the left, the block starting there would already have been stored in the hash table and the match would have been found in an earlier iteration.
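A sketch of the two extensions in C++, using 0-based indices; the `Match` struct and the `extend` name are invented for this illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// cand and cur are the 0-based starts of the earlier (candidate) and
// current occurrences of a raw b-symbol fingerprint hit.
struct Match { size_t cand; size_t cur; size_t len; };

Match extend(const std::string& s, size_t cand, size_t cur, size_t b) {
    size_t len = b;
    // Backward extension: bounded by b-1 steps -- a match of b or more
    // symbols to the left would already be in the hash table.
    size_t steps = 0;
    while (steps + 1 < b && cand > 0 && s[cand - 1] == s[cur - 1]) {
        --cand; --cur; ++len; ++steps;
    }
    // Forward extension: unlimited, as long as symbols agree and we
    // stay inside the string.
    while (cur + len < s.size() && s[cand + len] == s[cur + len]) ++len;
    return {cand, cur, len};
}
```

Applied to `xabcdabcd` with the raw hit at candidate start 2 and current start 6 (`b = 2`), this returns the match starting at 0-based position 1 with length 4, i.e. `<2,4>` in the README's 1-based notation.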
Example with b = 2 and input xabcdabcd:
```
xabcdabcd
123456789
```
When the window "bc" is encountered at positions 7..8, the hash table returns a corresponding occurrence at positions 3..4. The initial match is therefore: <3,2>.
Extending the match one character backward shows that positions 2..4 match 6..8, which improves the match to: <2,3>. Continuing with forward extension, we observe that positions 2..5 match 6..9. The final encoding therefore becomes:
```
xabcd<2,4>
```
Note In the examples throughout this README (and in .bm.report files), positions are written using 1-based indexing to make them easier to read and interpret for humans. The actual implementation, however, uses the conventional 0-based indexing used in C++ and most programming languages.
For a deeper dive into the algorithm, including implementation details and theoretical insights, see:
👉 https://angelido.github.io/#/posts/bentley_mcilroy
This implementation follows the paper closely. The compressor makes a single pass over the input:
- A rolling Rabin–Karp fingerprint (64-bit polynomial, low 32 bits exposed) slides one symbol at a time.
- At every canonical boundary (`position % b == 0`), the fingerprint is stored in a hash table with FIFO-eviction chains of depth `k`.
- At every position, the hash table is queried for candidate matches. Each candidate is first compared using the stored fingerprint and, if it matches, verified with `memcmp` to eliminate potential hash collisions.
- Confirmed matches are extended greedily forward (unlimited) and backward (up to `b−1` symbols).
- A match is emitted only if its encoded size — `FF 01` + two LEB128 varints — is strictly smaller than the literal cost of the same bytes (accounting for `0xFF` escaping). Otherwise the bytes are kept as literals.
- The resulting token stream is serialised into the `.bm` binary format.
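The LEB128 varints used in match tokens can be sketched as follows — a standard unsigned-LEB128 encoder; the function name is illustrative, not taken from this codebase:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Unsigned LEB128, as used for the <dist> and <len> fields of a match
// token: 7 data bits per byte, least-significant group first, with the
// high bit set on every byte except the last.
std::vector<uint8_t> leb128(uint64_t v) {
    std::vector<uint8_t> out;
    do {
        uint8_t byte = v & 0x7F;   // low 7 bits
        v >>= 7;
        if (v != 0) byte |= 0x80;  // continuation bit
        out.push_back(byte);
    } while (v != 0);
    return out;
}
```

Values below 128 encode in a single byte, so short-distance, short-length matches reach the minimum token size of 4 bytes (`FF 01` plus one byte per varint).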
```
.
├── src/
│   ├── bentley_mcilroy.cpp   # entry point — CLI dispatch, timing, report footer
│   ├── cmdline.hpp           # argument parsing and option validation
│   ├── codec.cpp             # compression and decompression implementation
│   └── hash.hpp              # Rabin–Karp fingerprint and hash table
│
├── tests/
│   ├── simple/               # minimal sanity tests (tiny.txt, repetition.txt, …)
│   ├── custom/               # synthetic datasets: source code, DNA, repetitive text
│   ├── corpora/
│   │   ├── canterbury/       # Canterbury Corpus
│   │   └── pizza&chili/      # Pizza & Chili Corpus (download separately)
│   ├── out/                  # generated outputs — gitignored
│   └── run_tests.sh          # automated round-trip test script
│
├── Makefile
├── LICENSE
└── README.md
```
| Component | Minimum version |
|---|---|
| C++ | C++17 |
| Compiler | GCC ≥ 7 or Clang ≥ 5 |
| Platform | Linux, macOS, or WSL |
| Build tool | GNU Make |
First clone the repository:
```sh
git clone https://github.com/Angelido/bentley_mcilroy.git
cd bentley_mcilroy
```

The project uses a Makefile-based build system. To build the binary, run:

```sh
make
```

This produces the `bm` binary in the project root, compiled in release mode (`-O3 -flto -DNDEBUG`).
The project uses a Makefile that supports multiple build configurations (release, debug, sanitizers, and native CPU optimization).
| Command | Flags | Use case |
|---|---|---|
| `make` | `-O3 -flto -DNDEBUG` | Default release build |
| `make debug` | `-O0 -g3 -DDEBUG` | Debug symbols, forces `-q 2` |
| `make sanitize` | `-O1 -g -fsanitize=address,undefined` | Memory and undefined-behaviour checks |
| `make native` | `-O3 -flto -march=native` | Optimised for the host CPU |
| `make clean` | — | Remove build artifacts and binary |
Run `make help` to print a summary of all available targets directly from the Makefile.
```sh
make run ARGS="-C -b 4096 input.txt"
make run-debug ARGS="-C -b 4096 input.txt"      # debug binary, forces -q 2
make run-sanitize ARGS="-C -b 4096 input.txt"   # sanitizer binary
```

Compress a file by specifying the block size (`-b`):

```sh
./bm -C -b 4096 input.txt
# → produces input.bm
```

Decompress a `.bm` file:

```sh
./bm -D input.bm
# → reconstructs input.txt
```

All compression parameters are recovered automatically from the file header — no extra flags are needed at decompression time.
```sh
./bm -C -b <N> [options] <input>
```
The -C and -b flags are required. All other flags are optional.
| Flag | Description |
|---|---|
| `-b <N>` | Block size in symbols — minimum match length (required) |
| `-s <1\|2\|4\|8>` | Stride for match scanning (default: 1) |
| `-e <enc>` | Input encoding: `ascii`, `utf-8`, `utf-16`, `utf-32`, `iso-8859-1` (default: `utf-8`) |
| `-A` | Allow stride smaller than the encoding unit width (may split code units) |
| `-o <file>` | Output path (default: `<input>.bm`, replacing the original extension) |
| `-r` | Remove the input file after a successful compression |
| `-f` | Overwrite the output file without prompting |
| `-q <0\|1\|2>` | Verbosity: 0 normal · 1 errors only · 2 debug + report |
| `-h` | Show help |
Example:
```sh
./bm -C -b 4096 -s 2 -e utf-16 document.txt
# → document.bm
```

```sh
./bm -D [options] <input.bm>
```

Compression parameters (`-b`, `-s`, `-e`) are read from the file header and must not be passed again.
| Flag | Description |
|---|---|
| `-o <file>` | Output path (default: derived from the header) |
| `-r` | Remove the input file after a successful decompression |
| `-f` | Overwrite the output file without prompting |
| `-q <0\|1\|2>` | Verbosity: 0 normal · 1 errors only · 2 debug |
| `-h` | Show help |
Example:
```sh
./bm -D document.bm
# → document.txt
```

The `tests/` directory contains several input files for experimenting with the compressor.

The `tests/simple/` subdirectory contains small, hand-crafted files designed to make the algorithm's behaviour easy to inspect. We recommend running these with debug output (`-q 2`) so that the `.bm.txt` report file is generated, showing the token stream in human-readable form.

The file `repetition.txt` contains:

```
the quick brown fox jumps over the lazy dogs the quick brown fox jumps over the lazy dogs
```

Run:

```sh
./bm -C -b 10 -q 2 tests/simple/repetition.txt
```

This produces two files: `repetition.bm` (the compressed binary) and `repetition.bm.txt` (the report).
Opening the report:
```
the quick brown fox jumps over the lazy dogs <1,44>
```

The first 44 bytes are emitted as literals. The second half of the input is identical, so the compressor replaces it with a single back-reference `<1,44>`: start at position 1, copy 44 symbols.
Similarly, `tiny.txt` contains:

```
xyzabcdyzabcdplxxyzabq
```

Run:

```sh
./bm -C -b 3 -q 2 tests/simple/tiny.txt
```

Report:

```
xyzabcd<2,6>plx<1,5>q
```

`<2,6>` means: go back to position 2, copy 6 symbols (`yzabcd`).
`<1,5>` means: go back to position 1, copy 5 symbols (`xyzab`).
Note: with small block sizes like `b=3`, not every short repetition is necessarily replaced with a match. Before emitting a match token the compressor compares its encoded cost (escape byte + two varints) against the literal cost of the same bytes. For very short matches the literal is sometimes cheaper, so it is kept as-is. In practice, this only affects matches whose length is close to `b` when `b` is very small (e.g. `b=3`).
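The cost comparison described in this note can be sketched as follows; the helper names are illustrative, not necessarily those used in `codec.cpp`:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Number of LEB128 bytes needed for an unsigned value (7 bits/byte).
size_t varintLen(uint64_t v) {
    size_t n = 1;
    while (v >>= 7) ++n;
    return n;
}

// Encoded cost of a match token: the two escape bytes FF 01 plus the
// distance and length varints.
size_t matchCost(uint64_t dist, uint64_t len) {
    return 2 + varintLen(dist) + varintLen(len);
}

// Literal cost of the same bytes: 0xFF must be escaped as FF 00
// (2 bytes); every other byte costs 1.
size_t literalCost(const std::string& bytes) {
    size_t n = 0;
    for (unsigned char c : bytes) n += (c == 0xFF) ? 2 : 1;
    return n;
}
```

With `b=3`, a minimal 3-symbol match costs at least 4 encoded bytes against 3 literal bytes, so it is rejected; the 6-symbol match `<2,6>` above costs 4 bytes against 6, so it is emitted.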
An automated round-trip test script is included:
```sh
make release
bash tests/run_tests.sh
```

The script scans all files under `tests/simple/`, `tests/custom/`, and `tests/corpora/`, then for each file:
- Compresses it with `./bm -C`
- Decompresses the result with `./bm -D`
- Runs `diff` to verify bit-perfect reconstruction
Outputs are written to tests/out/compressed/ and tests/out/decompressed/ (both gitignored).
📄 For more details on the test script and how to customise block sizes, see `tests/README.md`.
| Location | Contents |
|---|---|
| `tests/simple/` | Tiny hand-crafted files for debugging and sanity checking |
| `tests/custom/` | Synthetic files: source code, DNA-like sequences, repetitive text |
| `tests/corpora/canterbury/` | Canterbury Corpus |
| `tests/corpora/pizza&chili/` | Pizza & Chili Corpus — download separately |
📥 Download instructions for the Pizza & Chili datasets are in `tests/corpora/pizza&chili/README.md`.
Compressed .bm files begin with an 18-byte fixed header:
| Offset | Size | Field |
|---|---|---|
| 0 | 4 B | Magic number BM01 |
| 4 | 1 B | Flags — bits 0–1 encode the stride (0→1, 1→2, 2→4, 3→8) |
| 5 | 1 B | Encoding ID |
| 6 | 4 B | Block size in symbols (uint32, little-endian) |
| 10 | 8 B | Original file size in bytes (uint64, little-endian) |
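A sketch of serialising this header; the field layout follows the table above, and the function and parameter names are illustrative:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Builds the 18-byte .bm header: magic, flags, encoding ID, block
// size (uint32 LE), original size (uint64 LE).
std::vector<uint8_t> writeHeader(uint8_t strideBits, uint8_t encodingId,
                                 uint32_t blockSize, uint64_t originalSize) {
    std::vector<uint8_t> h;
    h.insert(h.end(), {'B', 'M', '0', '1'});   // offset 0: magic number
    h.push_back(strideBits & 0x03);            // offset 4: flags, bits 0-1 = stride
    h.push_back(encodingId);                   // offset 5: encoding ID
    for (int i = 0; i < 4; ++i)                // offset 6: block size, little-endian
        h.push_back((blockSize >> (8 * i)) & 0xFF);
    for (int i = 0; i < 8; ++i)                // offset 10: original size, little-endian
        h.push_back((originalSize >> (8 * i)) & 0xFF);
    return h;
}
```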
The body uses 0xFF as an escape byte:
| Sequence | Meaning |
|---|---|
| Any byte ≠ `FF` | Literal byte, passed through unchanged |
| `FF 00` | Literal `0xFF` byte |
| `FF 01 <dist varint> <len varint>` | Back-reference match (distance and length in symbols, LEB128-encoded) |
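A sketch of decoding this body format. The precise distance semantics (offset back from the current end of the output) are an assumption of this sketch; the byte-by-byte copy lets overlapping matches (distance smaller than length) self-extend correctly:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Decodes a .bm body: plain bytes are literals, FF 00 is a literal
// 0xFF, FF 01 is followed by distance and length LEB128 varints.
std::string decodeBody(const std::vector<uint8_t>& in) {
    std::string out;
    size_t i = 0;
    auto varint = [&]() {  // unsigned LEB128 decoder
        uint64_t v = 0; int shift = 0; uint8_t b;
        do { b = in[i++]; v |= uint64_t(b & 0x7F) << shift; shift += 7; }
        while (b & 0x80);
        return v;
    };
    while (i < in.size()) {
        uint8_t b = in[i++];
        if (b != 0xFF) { out.push_back(char(b)); continue; }  // literal
        uint8_t op = in[i++];
        if (op == 0x00) { out.push_back(char(0xFF)); continue; }  // escaped 0xFF
        uint64_t dist = varint(), len = varint();
        size_t from = out.size() - dist;
        for (uint64_t k = 0; k < len; ++k)  // byte-wise: overlap-safe
            out.push_back(out[from + k]);
    }
    return out;
}
```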
The current implementation prioritises clarity and correctness over maximum compression ratio or speed.
- Single-threaded — no parallel block processing
- Static (non-streaming) compression — the entire input is read into memory before compression begins; a streaming mode that compresses while reading would reduce peak memory usage
- Small block sizes — reduce compression ratio and increase the likelihood that matches are rejected by the cost check
- Encoding metadata — the encoding field is currently stored in the `.bm` header but is only used at compression time to validate stride compatibility; it could be removed to save 1 byte of overhead per file
This software is provided as is, without any warranty.
Although great care has been taken in the design, implementation, and testing of this compressor, no software can be guaranteed to be completely free of bugs.
Use this program at your own risk. The author assumes no responsibility for any loss of data or damage resulting from the use of this software.
If the data being compressed is important, it is recommended to keep a backup copy of the original files.
If you use this implementation in academic work:
```bibtex
@software{nardone_bentley_mcilroy_2026,
  author = {Nardone, Angelo},
  title  = {Bentley–McIlroy Compressor (C++ Implementation)},
  year   = {2026},
  url    = {https://github.com/Angelido/bentley_mcilroy}
}
```
or
Angelo Nardone.
Bentley–McIlroy Compressor (C++ Implementation), 2026.
https://github.com/Angelido/bentley_mcilroy
Original algorithm:
Bentley, J., McIlroy, D. (1999).
Data Compression Using Long Common Strings.
Proceedings of the IEEE Data Compression Conference (DCC).
Released under the MIT License. © 2026 Angelo Nardone.