This note describes how sshash build constructs a dictionary while keeping
peak resident memory bounded by the user-supplied -g (in GiB).
The design has two ideas, applied uniformly:
- Spill, don't accumulate. Every intermediate that grows with the input
size is written to a tmp file under
-d(tmp dir) rather than held in astd::vector/ bit-vector in RAM. Producers append through a small write buffer; consumers re-read through a small read buffer (buffered_record_stream<T>). - Cap working buffers at a fraction of
-g. Buffers that live only inside one step are sized asram_limit_in_GiB · GiB / N(withNtypically 2 or 8). The constants are picked so that even when several buffers are alive at the same time across overlapping steps, their sum stays under the user budget while heap fragmentation across step transitions is absorbed.
The build never materializes the final index in RAM. Instead, step 8
streams it directly to the user-supplied output file (or a tmp file, deleted
on exit, when the user did not pass -o). To run --check, tools/build.cpp
mmaps the saved file and runs the correctness queries against the mmap'd
dictionary.
The orchestration is in include/builder/dictionary_builder.hpp
(run_steps_1_through_7 + build). Per-step details are in
src/builder/{encode_strings,compute_minimizer_tuples,build_sparse_and_skew_index}.cpp.
| Step | What it produces | Where it lives between steps |
|---|---|---|
| 1 | Encoded strings bit-vector + strings_offsets |
tmp files (disk_backed_strings, disk_backed_offsets_builder) |
| 1.1 | Compressed weights (only if --weighted) |
weights::builder (in-RAM, bounded by run-length structure) |
| 2 | Per-thread sorted runs of minimizer_tuple |
tmp files, one per flushed buffer |
| 3 | Single sorted run of all minimizer_tuples |
tmp file (k-way external merge) |
| 4 | Minimizers MPHF F | tmp file (spilled at end of step 5) |
| 5 | Minimizer values replaced by F(minimizer); buffers re-flushed in F-order | new sorted runs, tmp files |
| 6 | Single sorted run keyed by F(minimizer) | tmp file |
| 7.1 | Sparse-index components (control_codewords, mid_load_buckets) |
tmp files |
| 7.2 | Skew-index components (heavy_load_buckets, per-partition MPHFs and positions) |
tmp files |
| 8 | Final on-disk index file | streamed to output, tmp files removed |
After step 8 the dictionary object d is not query-ready: the spilled
components were copied into the output file but never read back into d.
finalize_stats reports index_size_in_bytes via std::filesystem::file_size
on the saved path.
Iterates the input FASTA, producing the 2-bit-packed strings bit-vector
and the strings_offsets array (one offset per sequence + a sentinel).
Both go through disk-backed builders:
disk_backed_strings: appends 2-bit characters into a small in-RAM word buffer; flushes the buffer to a tmp file when full.disk_backed_offsets_builder<Offsets>: appends oneuint64_toffset per sequence into a small write buffer; flushes to a tmp file.
In-RAM footprint of step 1 is proportional to the buffer size, regardless of input size.
Only runs with --weighted. The weights builder uses run-length encoding:
its in-RAM size is proportional to the number of distinct weights, not to
the number of k-mers.
Each thread streams its assigned slice of the input via the disk-backed
strings/offsets readers and emits minimizer_tuple records into a private
in-RAM buffer:
buffer_size = (ram_limit · GiB) / (2 · sizeof(minimizer_tuple) · num_threads)When the buffer fills, the thread sorts it in parallel and flushes a sorted
run to a tmp file (minimizers_tuples::sort_and_flush). The factor of 2
in the denominator leaves headroom for std::sort's allocations and
inter-thread contention; the per-thread split makes the total in-RAM tuple
buffer ≈ ram_limit / 2.
The N tmp files from step 2 are merged into a single sorted run via a
winner-tree-based external-merge iterator (file_merging_iterator<T>).
Each input file is read through a buffered_record_stream<minimizer_tuple>
with default_buffer_records = 4096 records, so the total in-RAM merge
state is N · 4096 · sizeof(minimizer_tuple) ≈ tens of MB even for very
many runs. The output is written through a small std::ofstream buffer.
When N == 1 the merge degenerates to a rename + a single streaming scan to collect bucket statistics; same RAM bound.
Builds an external-memory partitioned PHF over distinct minimizers, using
pthash's build_in_external_memory. The minimizers are streamed from the
sorted run via streaming_minimizers_iterator (one buffered ifstream),
and pthash spills its own working hashes under tmp_dirname capped by
mphf_build_config.ram = ram_limit / 2.
The merged file is re-read in fixed-size blocks; each block is hashed in parallel and re-flushed as a new sorted run. Two RAM caps are combined:
RAM_available = ram_limit · GiB − sizeof(F) − offsets_builder.num_bytes()
buffer_unbounded = RAM_available / (3 · sizeof(minimizer_tuple)) // 3× = read+sort scratch+write
buffer_cap = (ram_limit · GiB / 8) / sizeof(minimizer_tuple)
buffer_size = min(buffer_unbounded, buffer_cap)The / 8 cap exists because step 5 leaves heap pages dirtied that linger
into later steps' allocations; capping at one-eighth of the budget keeps
the cumulative RSS under ram_limit when steps 6/7 start allocating.
After step 5, the minimizers MPHF F is spilled to disk and the in-RAM copy is freed: subsequent steps only ever use F(minimizer) values, not F itself.
Same machinery as step 3, applied to the new sorted runs from step 5.
Constructs control_codewords and mid_load_buckets. Both are produced as
on-disk bits::compact_vector files via streaming_compact_vector_writer,
so neither is ever materialized in RAM.
The most RAM-sensitive step; it has three internal phases, all disk-backed:
-
Phase B (k-mer extraction requests). Heavy-bucket entries become
kmer_extraction_requestrecords. They are external-sorted bystarting_posso that k-mer extraction reduces to a single forward scan overstrings. The request buffer is capped atram_limit / 8 / sizeof(kmer_extraction_request); flushed runs are merged withfile_merging_iterator. -
Per-partition kmer files. While walking
stringsin request-sorted order, each extracted k-mer is written to its partition's tmp file via a buffered writer; this file is the input to the partition's MPHF. -
Phase C (per-partition MPHF +
positions). For each skew partition:- Build the partition MPHF with pthash external-memory (
ram = ram_limit / 2, iterator:skew_partition_kmer_iteratorover the partition's tmp file). - Stream-read the partition file again, emit
(F(kmer), pos_in_bucket)tuples; external-sort them inram_limit / 8-sized buffers and merge. - Pack the sorted tuples into the partition's
positionscompact_vector viastreaming_compact_vector_writer.
Only the freshly-built MPHF for the current partition lives in RAM during phase C; once spilled (
essentials::save), it is freed before the next partition starts.positionsis fully on-disk. - Build the partition MPHF with pthash external-memory (
The dictionary d is walked by essentials::saver, but every spilled
component is intercepted via an address+type-keyed substitution map
(typed_address_sub): when the saver visits a registered (address, type)
pair, it appends the bytes of the corresponding tmp file straight into the
output stream instead of reading from d. The strings bit-vector goes
through the same mechanism via disk_backed_strings.
Concretely, the registered substitutions are:
| Component | Source tmp file |
|---|---|
m_ssi.codewords.control_codewords |
step 7.1 |
m_ssi.mid_load_buckets |
step 7.1 |
m_ssi.ski.heavy_load_buckets |
step 7.2 phase B |
m_ssi.codewords.mphf |
step 5 spill |
m_ssi.ski.positions[i] |
step 7.2 phase C, per partition |
m_ssi.ski.mphfs[i] |
step 7.2 phase C, per partition |
m_spss.strings |
step 1 (disk_backed_strings) |
Because the substitutions are by (address, type) pair, a struct's address
coinciding with its first member's address does not cause confusion.
After step 8 returns, the tmp files are removed and finalize_stats reads
the saved file's size with std::filesystem::file_size.
The on-disk index size grows with num_kmers. The build's resident
memory does not, because every component that scales with input size is
either:
-
always on disk (
strings,strings_offsets, all sorted minimizer runs, the merged minimizers file, the sparse-index compact_vectors, the skew-index per-partition kmer/positions files, the codewords MPHF and per-partition MPHFs), or -
bounded by a working buffer sized as a fraction of
ram_limit:Buffer Cap Step 2 per-thread minimizer buffer ram_limit / 2 / num_threadsStep 5 hashing buffer min(ram/8, RAM_available/3)Step 7.2 kmer-request external sort ram_limit / 8Step 7.2 phase-C position_tuplesortram_limit / 8pthash external-memory builds ram_limit / 2(its ownramfield)Every disk-backed reader/writer default_buffer_records ≈ 32 KiBEvery external merge front (per run) 4096 · sizeof(T)
The fractions (/2 for the dominant per-step buffer, /8 for buffers
that span step boundaries) are chosen so that overlapping allocations and
heap fragmentation between steps stay under ram_limit in practice. There
is a hard floor of min_ram_limit_in_GiB (enforced in
validate_and_normalize_build_config) below which step 4's MPHF builder
no longer has enough room to make progress.
The result: peak RSS during the build is governed by -g, not by
the input size or by the on-disk index size, and the saved index is
identical (byte-for-byte) to one written by an in-RAM builder followed by
essentials::save.