
Bentley–McIlroy Compressor


A modern C++17 implementation of the Bentley–McIlroy lossless compression algorithm, which detects and encodes long repeated substrings using a rolling Rabin–Karp fingerprint.

Implements the algorithm from:

J. Bentley, D. McIlroy
Data Compression Using Long Common Strings
IEEE Data Compression Conference (DCC), 1999.

📘 About the Algorithm

The Bentley–McIlroy scheme compresses a string by finding and encoding long repeated substrings, as described in their 1999 paper. The core idea is to slide a window of size b over the input, storing a fingerprint every b steps, then using those fingerprints to detect repetitions further along the string.

The algorithm in pseudocode, as given in the paper:

initialise fingerprint fp at position 0
for i = b to n-1:
    if i % b == 0:
        store(fp, i)
    update fp: include symbol at i, drop symbol at i-b
    checkformatch(fp, i)

checkformatch looks up fp in a hash table and encodes a match if one is found. The sliding window moves forward one symbol at a time; fingerprints are stored only at block boundaries.
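The loop above can be sketched concretely. The following is a minimal, self-contained C++ illustration, not the repository's actual code: the hash base, the table type, and the function name first_match are all assumptions for the sketch, and a real compressor would verify candidates with memcmp before trusting them.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Illustrative sketch of the scan loop: store a fingerprint every b steps,
// slide one symbol at a time, and report the first fingerprint hit.
// Returns {candidate start, current start} in 0-based positions, or {-1,-1}.
std::pair<long, long> first_match(const std::string& s, long b) {
    const std::uint64_t B = 1000003;            // hash base (arbitrary prime)
    std::uint64_t pow_b = 1;                    // B^(b-1), for dropping the old symbol
    for (long i = 0; i < b - 1; ++i) pow_b *= B;

    std::unordered_map<std::uint64_t, long> table;  // fingerprint -> block start
    std::uint64_t fp = 0;
    for (long i = 0; i < b; ++i) fp = fp * B + (unsigned char)s[i];  // fp of s[0..b)

    for (long i = b; i < (long)s.size(); ++i) {
        if (i % b == 0) table.emplace(fp, i - b);   // store(fp, i) at block boundaries
        // update fp: drop symbol at i-b, include symbol at i
        fp = (fp - (unsigned char)s[i - b] * pow_b) * B + (unsigned char)s[i];
        auto it = table.find(fp);                   // checkformatch(fp, i)
        if (it != table.end())
            return {it->second, i - b + 1};
    }
    return {-1, -1};
}
```

With input xabcdabcd and b = 2 this returns (2, 6): the stored block bc at 0-based position 2 and its recurrence at 0-based position 6, matching the worked example below.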

Backward and forward extension. A raw fingerprint hit gives a match of exactly b symbols. Two greedy extensions improve it:

  • Forward — advance both the current position and the candidate position one symbol at a time for as long as the symbols agree. There is no limit.
  • Backward — step both pointers back one symbol at a time, up to a maximum of b−1 steps. The bound exists by construction: if b or more symbols matched to the left, the block starting there would already have been stored in the hash table and the match would have been found in an earlier iteration.
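The two extensions can be sketched as follows (again with illustrative names; the sketch assumes a confirmed b-symbol match starting at 0-based positions cand and cur, with cand < cur):

```cpp
#include <cstddef>
#include <string>
#include <tuple>

// Greedy extension of a confirmed b-symbol match.
// Returns {extended candidate start, extended current start, total length}.
std::tuple<std::size_t, std::size_t, std::size_t>
extend_match(const std::string& s, std::size_t cand, std::size_t cur, std::size_t b) {
    std::size_t len = b;
    // Backward: at most b-1 steps; a longer left context would already have
    // matched at a stored block boundary in an earlier iteration.
    std::size_t steps = 0;
    while (steps < b - 1 && cand > 0 && cur > 0 && s[cand - 1] == s[cur - 1]) {
        --cand; --cur; ++steps; ++len;
    }
    // Forward: unbounded, as long as the symbols keep agreeing.
    while (cur + len < s.size() && s[cand + len] == s[cur + len]) ++len;
    return {cand, cur, len};
}
```

On the example below (xabcdabcd, b = 2, raw hit at 0-based positions 2 and 6) this yields {1, 5, 4}, i.e. the final 1-based match <2,4>.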

Example with b = 2 and input xabcdabcd:

xabcdabcd
123456789

When the window "bc" is encountered at positions 7..8, the hash table returns a corresponding occurrence at positions 3..4. The initial match is therefore: <3,2>. Extending the match one character backward shows that positions 2..4 match 6..8, which improves the match to: <2,3>. Continuing with forward extension, we observe that positions 2..5 match 6..9. The final encoding therefore becomes:

xabcd<2,4>

Note In the examples throughout this README (and in the .bm.txt report files), positions are written using 1-based indexing to make them easier for humans to read and interpret. The implementation itself uses the conventional 0-based indexing of C++ and most programming languages.


📚 Further Reading

For a deeper dive into the algorithm, including implementation details and theoretical insights, see:

👉 https://angelido.github.io/#/posts/bentley_mcilroy


🧠 Implementation Overview

This implementation follows the paper closely. The compressor makes a single pass over the input:

  1. A rolling Rabin–Karp fingerprint (64-bit polynomial, low 32 bits exposed) slides one symbol at a time.
  2. At every canonical boundary (position % b == 0), the fingerprint is stored in a hash table with FIFO-eviction chains of depth k.
  3. At every position, the hash table is queried for candidate matches. Each candidate is first compared using the stored fingerprint and, if it matches, verified with memcmp to rule out false positives caused by hash collisions.
  4. Confirmed matches are extended greedily forward (unlimited) and backward (up to b−1 symbols).
  5. A match is emitted only if its encoded size — FF 01 + two LEB128 varints — is strictly smaller than the literal cost of the same bytes (accounting for 0xFF escaping). Otherwise the bytes are kept as literals.
  6. The resulting token stream is serialised into the .bm binary format.
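The bounded chaining in step 2 can be illustrated with a small sketch; the class name and layout here are assumptions for illustration, not the repository's actual data structure:

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <vector>

// Each fingerprint keeps at most k candidate positions; inserting into a
// full chain evicts the oldest entry first (FIFO).
class FingerprintTable {
public:
    explicit FingerprintTable(std::size_t k) : k_(k) {}

    void store(std::uint64_t fp, std::size_t pos) {
        auto& chain = chains_[fp];
        if (chain.size() == k_) chain.pop_front();  // evict oldest candidate
        chain.push_back(pos);
    }

    std::vector<std::size_t> candidates(std::uint64_t fp) const {
        auto it = chains_.find(fp);
        if (it == chains_.end()) return {};
        return {it->second.begin(), it->second.end()};
    }

private:
    std::size_t k_;
    std::unordered_map<std::uint64_t, std::deque<std::size_t>> chains_;
};
```

Keeping chains bounded caps both memory use and the number of candidates verified per position, trading a little compression ratio for predictable speed.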

📁 Project Structure

.
├── src/
│   ├── bentley_mcilroy.cpp   # entry point — CLI dispatch, timing, report footer
│   ├── cmdline.hpp           # argument parsing and option validation
│   ├── codec.cpp             # compression and decompression implementation
│   └── hash.hpp              # Rabin–Karp fingerprint and hash table
│
├── tests/
│   ├── simple/               # minimal sanity tests (tiny.txt, repetition.txt, …)
│   ├── custom/               # synthetic datasets: source code, DNA, repetitive text
│   ├── corpora/
│   │   ├── canterbury/       # Canterbury Corpus
│   │   └── pizza&chili/      # Pizza & Chili Corpus (download separately)
│   ├── out/                  # generated outputs — gitignored
│   └── run_tests.sh          # automated round-trip test script
│
├── Makefile
├── LICENSE
└── README.md

⚙️ Requirements

Component    Minimum version
C++          C++17
Compiler     GCC ≥ 7 or Clang ≥ 5
Platform     Linux, macOS, or WSL
Build tool   GNU Make

🔨 Build

First clone the repository:

git clone https://github.com/Angelido/bentley_mcilroy.git
cd bentley_mcilroy

The project uses a Makefile-based build system. To build the binary, run:

make

This produces the bm binary in the project root, compiled in release mode (-O3 -flto -DNDEBUG).

Build modes

The Makefile supports multiple build configurations: release, debug, sanitizers, and native CPU optimisation.

Command          Flags                                  Use case
make             -O3 -flto -DNDEBUG                     Default release build
make debug       -O0 -g3 -DDEBUG                        Debug symbols, forces -q 2
make sanitize    -O1 -g -fsanitize=address,undefined    Memory and undefined-behaviour checks
make native      -O3 -flto -march=native                Optimised for the host CPU
make clean                                              Remove build artifacts and binary

Run make help to print a summary of all available targets directly from the Makefile.

Running directly via Make

make run ARGS="-C -b 4096 input.txt"
make run-debug ARGS="-C -b 4096 input.txt"      # debug binary, forces -q 2
make run-sanitize ARGS="-C -b 4096 input.txt"   # sanitizer binary

🚀 Quick Start

Compress a file by specifying the block size (-b):

./bm -C -b 4096 input.txt
# → produces input.bm

Decompress a .bm file:

./bm -D input.bm
# → reconstructs input.txt

All compression parameters are recovered automatically from the file header — no extra flags are needed at decompression time.


📖 Usage

Compression

./bm -C -b <N> [options] <input>

The -C and -b flags are required. All other flags are optional.

Flag Description
-b <N> Block size in symbols — minimum match length (required)
-s <1|2|4|8> Stride for match scanning (default: 1)
-e <enc> Input encoding: ascii, utf-8, utf-16, utf-32, iso-8859-1 (default: utf-8)
-A Allow stride smaller than the encoding unit width (may split code units)
-o <file> Output path (default: <input>.bm, replacing the original extension)
-r Remove the input file after a successful compression
-f Overwrite the output file without prompting
-q <0|1|2> Verbosity: 0 normal · 1 errors only · 2 debug + report
-h Show help

Example:

./bm -C -b 4096 -s 2 -e utf-16 document.txt
# → document.bm

Decompression

./bm -D [options] <input.bm>

Compression parameters (-b, -s, -e) are read from the file header and must not be passed again.

Flag Description
-o <file> Output path (default: derived from the header)
-r Remove the input file after a successful decompression
-f Overwrite the output file without prompting
-q <0|1|2> Verbosity: 0 normal · 1 errors only · 2 debug
-h Show help

Example:

./bm -D document.bm
# → document.txt

🧪 Example

The tests/ directory contains several input files for experimenting with the compressor.

The tests/simple/ subdirectory contains small, hand-crafted files designed to make the algorithm's behaviour easy to inspect. We recommend running these with debug output (-q 2) so that the .bm.txt report file is generated, showing the token stream in human-readable form.

Example 1 — repetition.txt

the quick brown fox jumps over the lazy dogs the quick brown fox jumps over the lazy dogs

Run:

./bm -C -b 10 -q 2 tests/simple/repetition.txt

This produces two files: repetition.bm (the compressed binary) and repetition.bm.txt (the report). Opening the report:

the quick brown fox jumps over the lazy dogs <1,44>

The first 45 bytes (the sentence plus the separating space) are emitted as literals. The remainder of the input repeats the sentence, so the compressor replaces it with a single back-reference <1,44>: copy 44 symbols starting at position 1.


Example 2 — tiny.txt

xyzabcdyzabcdplxxyzabq

Run:

./bm -C -b 3 -q 2 tests/simple/tiny.txt

Report:

xyzabcd<2,6>plx<1,5>q

<2,6> means: copy 6 symbols starting at position 2 (yzabcd). <1,5> means: copy 5 symbols starting at position 1 (xyzab).

Note: with small block sizes like b=3, not every short repetition is necessarily replaced with a match. Before emitting a match token the compressor compares its encoded cost (escape byte + two varints) against the literal cost of the same bytes. For very short matches the literal is sometimes cheaper, so it is kept as-is. In practice, this only affects matches whose length is close to b when b is very small (e.g. b=3).
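The cost comparison described in this note can be sketched as follows (illustrative helper names; the real encoder works on byte buffers rather than strings):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Unsigned LEB128 uses one byte per 7 bits of payload.
std::size_t varint_size(std::uint64_t v) {
    std::size_t n = 1;
    while (v >= 0x80) { v >>= 7; ++n; }
    return n;
}

// A match is worth emitting only if its token (FF 01 + two varints) is
// strictly smaller than writing the same bytes as literals, where each
// literal 0xFF costs 2 bytes because it is escaped as FF 00.
bool match_is_profitable(std::uint64_t dist, const std::string& span) {
    std::size_t token = 2 + varint_size(dist) + varint_size(span.size());
    std::size_t literal = span.size();
    for (unsigned char c : span)
        if (c == 0xFF) ++literal;
    return token < literal;
}
```

For a 3-byte match the token costs at least 4 bytes (2 + 1 + 1), so the literal wins; a 6-byte match with small varints already pays for itself.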


✅ Running the Tests

An automated round-trip test script is included:

make release
bash tests/run_tests.sh

The script scans all files under tests/simple/, tests/custom/, and tests/corpora/, then for each file:

  1. Compresses it with ./bm -C
  2. Decompresses the result with ./bm -D
  3. Runs diff to verify bit-perfect reconstruction

Outputs are written to tests/out/compressed/ and tests/out/decompressed/ (both gitignored).

📄 For more details on the test script and how to customise block sizes, see tests/README.md.


📂 Datasets

Location Contents
tests/simple/ Tiny hand-crafted files for debugging and sanity checking
tests/custom/ Synthetic files: source code, DNA-like sequences, repetitive text
tests/corpora/canterbury/ Canterbury Corpus
tests/corpora/pizza&chili/ Pizza & Chili Corpus — download separately

📥 Download instructions for the Pizza & Chili datasets are in tests/corpora/pizza&chili/README.md.


🗜️ File Format

Compressed .bm files begin with an 18-byte fixed header:

Offset   Size   Field
0        4 B    Magic number BM01
4        1 B    Flags — bits 0–1 encode the stride (0→1, 1→2, 2→4, 3→8)
5        1 B    Encoding ID
6        4 B    Block size in symbols (uint32, little-endian)
10       8 B    Original file size in bytes (uint64, little-endian)
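A sketch of how such a header could be serialised; the function name is illustrative, and only the field layout follows the table above:

```cpp
#include <array>
#include <cstdint>

// Build the 18-byte header: magic, flags, encoding ID, then the
// little-endian block size and original file size.
std::array<std::uint8_t, 18> make_header(std::uint8_t stride_bits,
                                         std::uint8_t encoding_id,
                                         std::uint32_t block_size,
                                         std::uint64_t original_size) {
    std::array<std::uint8_t, 18> h{};
    h[0] = 'B'; h[1] = 'M'; h[2] = '0'; h[3] = '1';    // magic BM01
    h[4] = stride_bits & 0x03;                          // bits 0-1: stride code
    h[5] = encoding_id;
    for (int i = 0; i < 4; ++i)                         // uint32, little-endian
        h[6 + i] = (block_size >> (8 * i)) & 0xFF;
    for (int i = 0; i < 8; ++i)                         // uint64, little-endian
        h[10 + i] = (original_size >> (8 * i)) & 0xFF;
    return h;
}
```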

The body uses 0xFF as an escape byte:

Sequence                            Meaning
Any byte ≠ FF                       Literal byte, passed through unchanged
FF 00                               Literal 0xFF byte
FF 01 <dist varint> <len varint>    Back-reference match (distance and length in symbols, LEB128-encoded)
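The LEB128 varints used by the FF 01 token can be sketched as follows (illustrative helpers, not the repository's API):

```cpp
#include <cstdint>
#include <vector>

// Unsigned LEB128: 7 payload bits per byte, least-significant group first,
// high bit set on every byte except the last.
std::vector<std::uint8_t> leb128_encode(std::uint64_t v) {
    std::vector<std::uint8_t> out;
    do {
        std::uint8_t byte = v & 0x7F;
        v >>= 7;
        if (v) byte |= 0x80;        // continuation bit
        out.push_back(byte);
    } while (v);
    return out;
}

std::uint64_t leb128_decode(const std::vector<std::uint8_t>& in) {
    std::uint64_t v = 0;
    int shift = 0;
    for (std::uint8_t byte : in) {
        v |= (std::uint64_t)(byte & 0x7F) << shift;
        shift += 7;
        if (!(byte & 0x80)) break;  // last byte
    }
    return v;
}
```

For example, 300 encodes to the two bytes AC 02, so short distances and lengths usually cost only one or two bytes each.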

⚠️ Limitations

The current implementation prioritises clarity and correctness over maximum compression ratio or speed.

  • Single-threaded — no parallel block processing
  • Static (non-streaming) compression — the entire input is read into memory before compression begins; a streaming mode that compresses while reading would reduce peak memory usage
  • Small block sizes — reduce compression ratio and increase the likelihood that matches are rejected by the cost check
  • Encoding metadata — the encoding field is currently stored in the .bm header but is only used at compression time to validate stride compatibility; it could be removed to save 1 byte of overhead per file

⚠️ Disclaimer

This software is provided as is, without any warranty.

Although great care has been taken in the design, implementation, and testing of this compressor, no software can be guaranteed to be completely free of bugs.

Use this program at your own risk. The author assumes no responsibility for any loss of data or damage resulting from the use of this software.

If the data being compressed is important, it is recommended to keep a backup copy of the original files.


📄 How to Cite

If you use this implementation in academic work:

@software{nardone_bentley_mcilroy_2026,
  author = {Nardone, Angelo},
  title = {Bentley–McIlroy Compressor (C++ Implementation)},
  year = {2026},
  url = {https://github.com/Angelido/bentley_mcilroy}
}

or

Angelo Nardone.
Bentley–McIlroy Compressor (C++ Implementation), 2026.
https://github.com/Angelido/bentley_mcilroy

Original algorithm:

Bentley, J., McIlroy, D. (1999).
Data Compression Using Long Common Strings.
Proceedings of the IEEE Data Compression Conference (DCC).

📜 License

Released under the MIT License. © 2026 Angelo Nardone.
