Every command supports --help. This page is a short description plus a working example per command.
Version note: items marked (4.0) shipped in release 4.0.0; (3.15) in 3.15.0.
One-time hardware probe (~10 s): RAM, CPU cores, sequential disk read (small unbuffered sample), and BigSmall's decode-kernel speeds on your machine. GPU facts are read via NVML / nvidia-smi only: no CUDA context is ever created, so a busy GPU is never disturbed. Saved to ~/.bigsmall/profile.json; later runs just print it.
bigsmall profile
profile: C:\Users\you\.bigsmall\profile.json
cpu: 8 logical cores | ram: 30.04 GB total, 21.32 GB free at probe
disk: 0.657 GB/s sequential (FILE_FLAG_NO_BUFFERING)
gpu: NVIDIA RTX A4500 | 19.5 GB usable [UNVERIFIED (GPU busy)]
kernels: bf16 decode 484.4 MB/s, int4 decode 853.7 MB/s (CPU, measured)
Flags: --force (re-probe), --path FILE (profile location), --sample-mb N (disk probe size, default 256), --json.
The autopilot decision for a .bs/.bsd file — one sentence, no execution. Picks the highest fidelity that runs at usable speed on your profiled hardware; any pick below the file's best fidelity is announced first.
bigsmall plan qwen2.5-0.5b.bsd
Running perfect mode (bit-exact, receipt verified) in CPU RAM at full CPU speed while the GPU is busy — fast mode (lossy INT4) available with --mode fast.
Flags:
--mode perfect|fast|tiny — override the pick (the downgrade announcement still prints).
--min-tok-s N — the usable-speed floor (default 1.0 tok/s). Below it, placements count as capacity picks and say so.
--json — the full decision: every placement considered, fit math, alternatives ladder.
Pick like plan, announce, then do it: load the chosen mode into host RAM, or stream it layer-by-layer when the pick (or --stream) calls for it. A busy GPU is never touched.
bigsmall run qwen2.5-0.5b.bsd
Running perfect mode (bit-exact, receipt verified) in CPU RAM at full CPU speed while the GPU is busy — fast mode (lossy INT4) available with --mode fast.
loaded 290 tensors into host RAM in 27.0s [mode=perfect] (bit-exact receipt honoured)
With --stream (forced layer-streaming executor — automatic when the planner picks a stream placement):
bigsmall run qwen2.5-0.5b.bsd --stream
--stream forced: layer-streaming executor on cpu_stream_nvme instead (planner's pick above: cpu_full); ~1.24 GB RAM resident promised
streamed 24 layers [mode=perfect]: 494 MB read, 716 MB decoded in 20.5s (35 MB/s raw, ~0.0488 passes/s)
resident: non-layer 0.27 GB + peak layer slot 0.03 GB; process peak 1.04 GB (planner promised ~1.2 GB)
Flags: --mode perfect|fast|tiny, --min-tok-s N, --stream, --no-progress.
Create a dual-fidelity .bsd (INT4 fast member + lossless residual) from a safetensors file, or inspect an existing one. The .bsd is the Ferrell Duo format — one file, two models: the fast one and the real one. Creation runs the bit-exact gate on every tensor before the file is written — a .bsd that cannot prove perfect-mode exactness is never created.
bigsmall dual model.safetensors -o model.bsd
dual: model.safetensors -> model.bsd
raw 988,065,536 -> file 680,590,500 (68.88%) | fast-only reads 211,720,831 (21.43%)
169/290 tensors dual | encode 151.98s | bit-exact gate PASS (290 tensors, pre-write)
Inspect / re-gate:
bigsmall dual model.bsd # manifest: modes, sizes, receipts
bigsmall dual model.bsd --verify # re-run the bit-exact gate from disk
Flags: -o OUT, --group N (INT4 group size, default 128), --workers N, --verify, --no-progress.
Compress a .safetensors file or a model directory into a .bs file.
bigsmall compress SRC [-o OUT] [--delta-from BASE] [--auto-delta]
[--force-delta] [--group-streams]
[--resume] [--ecc]
[--storage|--balanced|--inference]
compressed gpt2_src\model.safetensors -> gpt2.bs
source: 548,105,171 bytes
compressed: 413,973,591 bytes (75.53%)
saved: 134,131,580 bytes
elapsed: 19.4s
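The percentage and the saved-bytes figure in the output above follow directly from the two byte counts. A minimal sketch of that arithmetic, assuming the ratio is simply compressed size over source size (the function name `size_report` is hypothetical and not part of BigSmall):

```python
def size_report(source_bytes: int, compressed_bytes: int) -> dict:
    """Derive the ratio and savings lines from two byte counts."""
    return {
        # Compressed size as a percentage of the source, two decimals.
        "ratio_pct": round(100.0 * compressed_bytes / source_bytes, 2),
        # Absolute number of bytes saved.
        "saved_bytes": source_bytes - compressed_bytes,
    }


# Byte counts taken from the gpt2 example output above.
report = size_report(548_105_171, 413_973_591)
print(f"compressed: {report['ratio_pct']:.2f}%")   # compressed: 75.53%
print(f"saved: {report['saved_bytes']:,} bytes")   # saved: 134,131,580 bytes
```

A lower percentage means a smaller file; a value above 100% would mean the output grew.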
Flags:
--delta-from BASE — write as a delta against BASE (alias: --base). The delta engine measures delta AND standalone per tensor and keeps the smaller, so the output is never worse than standalone coding (3.15).
--auto-delta — auto-detect closest base by fingerprint. Pair with one or more --base-dir.
--force-delta (3.15) — silence the delta regime gate (the warning printed when 30% of matched weight bytes changed vs the base — the measured delta-does-not-pay regime). Size is protected either way.
--group-streams (3.15) — pack small same-role tensors (norm/bias chains, GQA k/v projections, MoE routers) into one coded stream per role when measured smaller; bit-exact, every loss gated to 0. Grouped files need bigsmall >= 3.15 to read and cannot be resharded without re-encoding.
--resume — tensor-level checkpointing in <OUT>.progress/. Auto-cleans on success.
--ecc — write a Reed-Solomon parity sidecar (<OUT>.ecc) for bit-rot recovery. Adds ~14% to total size. Slow on large files: encoding uses the pure-Python reedsolo backend and is O(file_size) at only a few MB/s, so a multi-hundred-MB shard can take several minutes (a one-line notice is printed for inputs over 100 MB). It is encoding, not hung.
--storage / --balanced / --inference — mode hint (informational for now).
--no-progress — disable tqdm progress bar.
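The ~14% overhead quoted for --ecc and the 16-bytes-per-223-byte-block tolerance quoted for repair are consistent with a Reed-Solomon layout of 32 parity bytes per 223 data bytes. That layout is an inference from the two numbers, not something this page states, and the real sidecar may carry extra framing. A rough sketch of the arithmetic under that assumption (both function names are hypothetical):

```python
def rs_budget(data_bytes_per_block: int, parity_bytes_per_block: int):
    """Return (correctable byte errors per block, parity overhead in percent)."""
    # Reed-Solomon corrects up to half as many unknown-position byte errors
    # as it has parity bytes.
    correctable = parity_bytes_per_block // 2
    overhead_pct = 100.0 * parity_bytes_per_block / data_bytes_per_block
    return correctable, overhead_pct


def sidecar_bytes(file_size: int, data: int = 223, parity: int = 32) -> int:
    """Parity bytes needed for file_size, ignoring any header (assumption)."""
    blocks = -(-file_size // data)  # ceiling division
    return blocks * parity


correctable, overhead_pct = rs_budget(223, 32)  # assumed block layout
print(f"{correctable} correctable bytes per block, ~{overhead_pct:.1f}% overhead")
```

Under this layout, corruption concentrated in one block is harder to recover than the same number of bad bytes spread across many blocks.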
FP8 (e4m3/e5m2), BF16, F16, and F32 tensors are all handled automatically — fp8 weights code to a measured 0.829 of their fp8 bytes (details).
Examples:
bigsmall compress mistral.safetensors -o mistral.bs
bigsmall compress finetune.safetensors --delta-from base.safetensors -o patch.bs
bigsmall compress huge.safetensors -o huge.bs --resume --ecc
Decompress a .bs file back to .safetensors. With --base, treats SRC as a delta.
bigsmall decompress SRC [-o OUT] [--base BASE]
decompressed gpt2.bs -> gpt2_back.safetensors (13.8s)
Example:
bigsmall decompress mistral.bs -o mistral.safetensors
bigsmall decompress patch.bs --base base.safetensors -o reconstructed.safetensors
Re-encode a .bs into a runtime-optimised .bsr for decode speed. Lossless: every re-encoded tensor is decode-verified against the source md5 before the output exists. Modes: speed (hot tensors raw + stream-split bulk — the data-picked default), balanced, ratio.
bigsmall transcode gpt2.bs gpt2.bsr --mode speed
transcode[speed]: gpt2.bs -> gpt2.bsr
413,973,591 -> 497,794,303 bytes (+20.25% vs source)
tensors: 160 (148 re-encoded, all md5-verified) wall: 8.9s
The size grows because speed mode trades ratio for decode rate — that is its job; on a bf16 model the measured trade is ~+5 pp of raw for several times the streaming decode speed. Flags: --mode ratio|balanced|speed, --n-streams N (stream-split count), --workers N, --no-progress.
Weight-streaming inference: weights stay compressed in tiered residency (VRAM / RAM / NVMe budgets you set); each layer decompresses into a reusable scratch slot during the forward pass, with optional prefetch of layer L+1 while L computes. The correctness gate behind it: streamed logits are bit-exact against a fully-loaded model (verified at 7B — receipts).
bigsmall serve-stream gpt2.bsr --config ./gpt2_src --ram-gb 0.2 --prefetch \
--prompt "The future of model compression is" --max-new-tokens 4
serve-stream: gpt2.bsr
layers: 12 construct: 38.0s
residency (compressed): vram 0.0 MB (SIMULATED on CPU) | ram 198.5 MB | nvme 299.3 MB
output: 'The future of model compression is uncertain. The current'
4 tokens in 362.0s (90.50 s/token; prefetch=on)
per-layer decode: median 33 ms over 48 decodes
Streaming generation is capacity mode — slow per token, honest about it, for models that otherwise would not run at all. Flags: --config DIR (HF config; default next to src), --device DEV (default cpu; CUDA placement code-complete, bench deferred — VRAM residency is simulated on CPU and labeled), --vram-gb N, --ram-gb N, --pin-layer IDX:TIER (repeatable), --prefetch, --decode-threads N, --prompt TEXT, --max-new-tokens N, --placement-json FILE.
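The decode-into-a-scratch-slot loop with optional prefetch of layer L+1 can be pictured with a toy model. The sketch below is not BigSmall's executor: zlib stands in for the real codec, `apply_layer` is a hypothetical stand-in for a transformer layer, and a single worker thread plays the role of the prefetcher. It only illustrates why the streamed result matches a fully-loaded run.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def apply_layer(weights: bytes, x: int) -> int:
    """Toy stand-in for a layer: mixes the decoded weights into x."""
    return (x * 31 + sum(weights)) % 1_000_003


def forward_streamed(compressed_layers, x, prefetch=True):
    """Decode one layer at a time; with prefetch, decode L+1 while L computes."""
    if not compressed_layers:
        return x
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(zlib.decompress, compressed_layers[0])
        for i in range(len(compressed_layers)):
            weights = pending.result()  # scratch slot for layer i
            has_next = i + 1 < len(compressed_layers)
            if has_next and prefetch:
                # Start decoding the next layer before computing this one.
                pending = pool.submit(zlib.decompress, compressed_layers[i + 1])
            x = apply_layer(weights, x)
            if has_next and not prefetch:
                # No overlap: decode the next layer only after computing.
                pending = pool.submit(zlib.decompress, compressed_layers[i + 1])
    return x


# Fully-loaded reference versus the streamed run on the same toy weights.
raw_layers = [bytes([i]) * 1000 for i in range(1, 5)]
compressed = [zlib.compress(layer) for layer in raw_layers]
reference = 7
for layer in raw_layers:
    reference = apply_layer(layer, reference)
print(forward_streamed(compressed, 7) == reference)
```

Only one decoded layer is alive at a time (two with prefetch on), which is the property that keeps resident memory near the size of a single layer slot.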
Reconstruct a fine-tune from a base model + delta patch. Same as decompress --base, but reads more naturally.
bigsmall apply BASE PATCH.bs -o OUT
Example:
bigsmall apply base.safetensors patch.bs -o finetune.safetensors
Print everything about a compressed file: size, ratio, codecs used, model type, base, layer count, streaming peak-RAM estimate, per-codec breakdown, best- and worst-compressed tensors.
bigsmall info gpt2.bs
BigSmall container: gpt2.bs
format fp32
tensor_count 160
file_size 413,973,591 bytes (394.80 MiB)
estimated_raw_bytes 548,090,880 bytes (522.70 MiB)
overall ratio_pct 75.53%
layer_count 12
streaming_peak_ram_est 181.28 MiB
codec_breakdown
fp32_se_ac 148 tensors
tied_ref 11 tensors
special 1 tensors
...
Analyse an uncompressed safetensors file or directory before you compress it. Detects BF16-native F32 (the upcast-to-F32 pattern in Whisper-class models) and recommends a compression strategy.
bigsmall scan ./gpt2_src/model.safetensors
scan: gpt2_src\model.safetensors
shards: 1
tensors total: 160
F32 tensors: 160
BF16-native F32: 12
BF16-native by bytes: 9.2%
recommendation: standard F32 (no BF16-native upcast)
Flag: --json for machine-readable output.
Checkpoint forensics (read-only; no effect on compression). For every float tensor: bf16-grain substream entropies (sign / exponent / mantissa, Miller-Madow corrected) compared against a matched-moments random control (dH and coder-transfer KL per substream), stat fingerprints (H_exp, H_mant, rank1_frac, sign_balance), and anomaly flags. Works on bf16 and fp8 checkpoints (4.0: fp8 tensors get their own lineage path and matched-fp8 random controls).
bf16_native_lineage — F32 tensor whose low-16 bits are >=10% zero (bf16 lineage; the codec exploits this case automatically).
lineage_residue — mantissa micro-structure marking fp16/bf16 precision lineage (present at initialization, not training content).
mantissa_carved — trained mantissa entropy below the matched-random control (the rare trained-in carve; mamba-1 conv1d class).
sign_imbalance — trained-in sign asymmetry (inits are symmetric).
exp_near_random — exponent indistinguishable from a matched normal. Trained exponents separate in every measured family, so a checkpoint where most tensors carry this flag looks untrained — the tool prints a loud warning (catches silently-randomized loads).
below_random_floor — total ratio under the random-Gaussian floor.
bigsmall xray model.safetensors # table + summary
bigsmall xray model_dir/ --json report.json # full per-tensor JSON
bigsmall xray model.bs --no-svd # works on .bs too
xray: model.safetensors
tensors analyzed: 62 (of 160 seen; min 4096 elements)
median H_exp 2.6007 bits median H_mant 6.9719 bits median ratio 66.08%
median mantissa coder-KL vs matched normal: 0.0001 bits/symbol (coder band 0.02)
flags: bf16_native_lineage=12, mantissa_carved=12, sign_imbalance=12, exp_near_random=5, below_random_floor=12
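The per-substream entropies reported by xray are described as Miller-Madow corrected. As a rough illustration, the sketch below splits a value into bf16 sign / exponent / mantissa fields (1, 8, and 7 bits) and applies the standard Miller-Madow correction, which adds (m - 1) / (2N) nats to the plug-in estimate for m observed symbols over N samples. It assumes truncation of float32 to its top 16 bits and is not the tool's actual code; both function names are hypothetical.

```python
import math
import struct
from collections import Counter


def bf16_fields(value: float):
    """Split a value into (sign, exponent, mantissa) at bf16 grain."""
    # Assumption: bf16 is taken as the top 16 bits of the float32 encoding.
    bits = struct.unpack("<I", struct.pack("<f", value))[0] >> 16
    return bits >> 15, (bits >> 7) & 0xFF, bits & 0x7F


def miller_madow_entropy_bits(symbols) -> float:
    """Plug-in Shannon entropy plus the Miller-Madow bias correction, in bits."""
    counts = Counter(symbols)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("need at least one symbol")
    plugin = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # (m - 1) / (2N) is in nats; dividing by ln 2 converts it to bits.
    correction = (len(counts) - 1) / (2 * n * math.log(2))
    return plugin + correction


weights = [1.0, -2.0, 1.5, 0.75, -0.5, 3.0]  # made-up sample values
exponents = [bf16_fields(w)[1] for w in weights]
print(f"H_exp ~ {miller_madow_entropy_bits(exponents):.4f} bits")
```

The correction matters most for small tensors, where the plug-in estimate undercounts entropy because rare symbols are missed.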
Per-tensor detail table, sortable.
bigsmall stat mistral.bs --sort ratio --reverse
bigsmall stat mistral.bs --tensor q_proj
Flags: --sort {ratio,size,name}, --reverse, --tensor NAME (substring filter).
Check integrity. Three modes from quick to thorough:
bigsmall verify --fast mistral.bs # header-only, seconds even on 30 GB
bigsmall verify --sample 0.001 mistral.bs # md5 on a random 0.1% of weights
bigsmall verify mistral.bs # full decode + md5 (slow)
OK means every checked weight is bit-identical to the original.
Compare two .bs files (added / removed / changed tensors). With --patch, writes a delta .bs from A to B.
bigsmall diff v1.bs v2.bs
bigsmall diff base.safetensors finetune.safetensors --patch patch.bs
Recover a corrupted .bs using its Reed-Solomon .ecc sidecar. Tolerates up to 16 corrupted bytes per 223-byte block.
bigsmall repair corrupted.bs -o recovered.bs
Compress then decompress, reporting throughput, peak RSS, and a per-layer-type breakdown.
bigsmall benchmark mistral.safetensors
Re-encode an older .bs file with current codecs. Writes a .bs.bak backup unless --no-backup.
bigsmall migrate old.bs --dry-run # report only
bigsmall migrate old.bs # rewrite in place
List your BigSmall HuggingFace repos. Compares each repo against your local compressed directories.
bigsmall status
bigsmall status --json
bigsmall status --user some_user --suffix -compressed
Resumable download → compress → upload pipeline. Crashes mid-upload? Re-run; it skips finished stages.
bigsmall pipeline run mistralai/Mistral-7B-Instruct-v0.3 ./mistral_bs --repo-id wpferrell/mistral-7b-bigsmall
Flags:
--no-upload / --no-compress — skip a stage.
--mode {storage,balanced,inference} — codec mode.
--lfs — LFS-based uploader (avoids HF Python upload issues on >2 GB files).
--workers N — parallel encode workers.
--token TOKEN — HF token (else env or ~/.huggingface/token).
bigsmall pipeline status DST_DIR shows the checkpoint for a destination directory.
Split, join, or rebalance one or more .bs files into a new shard layout along transformer-layer boundaries. No re-encoding happens — blobs are copied, so it is fast and every output tensor is md5-verified. Layer groups are never split across shards; non-layer tensors (embeddings, lm_head, final norm) always land in shard 0.
bigsmall reshard model.bs --output-dir resharded/ # pack into ~2 GB shards
bigsmall reshard shard-*.bs --output-dir merged/ --shards 4 # exactly 4 output shards
bigsmall reshard ./model_dir/ --output-dir one/ --join # join everything into one shard
Flags:
--output-dir DIR (required) — where to write the resharded output + new bigsmall.index.json.
--size-gb N — target output shard size in GB (default 2.0).
--shards N — split into exactly N shards (overrides --size-gb).
--join — pack every input tensor into a single model.bs (overrides the other two).
--no-progress — disable the per-shard progress bar.
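The layout rules for reshard (layer groups are never split, non-layer tensors always go to shard 0, shards aim for a target size) can be illustrated with a simple greedy packer. This is only a sketch of one way to satisfy those rules, not the tool's actual algorithm; `pack_layers` and its arguments are hypothetical, and sizes are plain byte counts.

```python
def pack_layers(non_layer_bytes: int, layer_bytes: list, target_bytes: int) -> list:
    """Greedily assign whole layer groups to shards of roughly target_bytes."""
    # Shard 0 always starts with the non-layer tensors (embeddings, lm_head, ...).
    shards = [{"layers": [], "bytes": non_layer_bytes}]
    for index, size in enumerate(layer_bytes):
        current = shards[-1]
        # Open a new shard when this layer would overflow a non-empty one.
        # A layer larger than the target still lands whole in its own shard.
        if current["bytes"] > 0 and current["bytes"] + size > target_bytes:
            current = {"layers": [], "bytes": 0}
            shards.append(current)
        current["layers"].append(index)
        current["bytes"] += size
    return shards


# Made-up sizes: 100 bytes of non-layer tensors, four 300-byte layers.
layout = pack_layers(100, [300, 300, 300, 300], target_bytes=700)
for number, shard in enumerate(layout):
    print(number, shard["layers"], shard["bytes"])
```

Because layers are placed whole, actual shard sizes can land under or over the target; the target is a goal, not a hard cap.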