CLI reference

Every command supports --help. This page gives a short description and a working example for each command.

Version note: items marked (4.0) shipped in release 4.0.0; (3.15) in 3.15.0.


bigsmall profile (4.0)

One-time hardware probe (~10 s): RAM, CPU cores, sequential disk read (small unbuffered sample), and BigSmall's decode-kernel speeds on your machine. GPU facts are read via NVML / nvidia-smi only: no CUDA context is ever created, so a busy GPU is never disturbed. The result is saved to ~/.bigsmall/profile.json; later runs just print it.

bigsmall profile
profile: C:\Users\you\.bigsmall\profile.json
  cpu: 8 logical cores | ram: 30.04 GB total, 21.32 GB free at probe
  disk: 0.657 GB/s sequential (FILE_FLAG_NO_BUFFERING)
  gpu: NVIDIA RTX A4500 | 19.5 GB usable [UNVERIFIED (GPU busy)]
  kernels: bf16 decode 484.4 MB/s, int4 decode 853.7 MB/s (CPU, measured)

Flags: --force (re-probe), --path FILE (profile location), --sample-mb N (disk probe size, default 256), --json.


bigsmall plan (4.0)

The autopilot decision for a .bs/.bsd file — one sentence, no execution. Picks the highest fidelity that runs at usable speed on your profiled hardware; any pick below the file's best fidelity is announced first.

bigsmall plan qwen2.5-0.5b.bsd
Running perfect mode (bit-exact, receipt verified) in CPU RAM at full CPU speed while the GPU is busy — fast mode (lossy INT4) available with --mode fast.

Flags:

  • --mode perfect|fast|tiny — override the pick (the downgrade announcement still prints).
  • --min-tok-s N — the usable-speed floor (default 1.0 tok/s). Below it, placements count as capacity picks and say so.
  • --json — the full decision: every placement considered, fit math, alternatives ladder.
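The selection rule described above can be sketched in a few lines. This is a minimal illustration, not BigSmall's planner: the Placement class, the est_tok_s field, the fidelity ordering perfect > fast > tiny, and the fall-back to the fastest candidate when nothing clears the floor are all assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed fidelity order, best first (mode names taken from the --mode flag).
FIDELITY = ["perfect", "fast", "tiny"]


@dataclass
class Placement:
    mode: str
    est_tok_s: float  # hypothetical speed estimate for this placement


def pick(placements, min_tok_s=1.0):
    """Return (placement, is_capacity_pick) for a non-empty candidate list."""
    # Highest fidelity first; within one fidelity, fastest first.
    ranked = sorted(placements, key=lambda p: (FIDELITY.index(p.mode), -p.est_tok_s))
    for p in ranked:
        if p.est_tok_s >= min_tok_s:
            return p, False
    # Nothing clears the floor: assumed fall-back is the fastest candidate,
    # flagged as a capacity pick so the caller can say so.
    return max(placements, key=lambda p: p.est_tok_s), True


choice, capacity = pick([Placement("perfect", 0.4), Placement("fast", 3.0)])
print(choice.mode, capacity)
```

Here perfect mode falls below the default 1.0 tok/s floor, so the sketch selects fast mode; a real planner would announce that downgrade before running.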

bigsmall run (4.0)

Pick like plan, announce, then do it: load the chosen mode into host RAM, or stream it layer-by-layer when the pick (or --stream) calls for it. A busy GPU is never touched.

bigsmall run qwen2.5-0.5b.bsd
Running perfect mode (bit-exact, receipt verified) in CPU RAM at full CPU speed while the GPU is busy — fast mode (lossy INT4) available with --mode fast.
loaded 290 tensors into host RAM in 27.0s [mode=perfect] (bit-exact receipt honoured)

With --stream (forced layer-streaming executor — automatic when the planner picks a stream placement):

bigsmall run qwen2.5-0.5b.bsd --stream
--stream forced: layer-streaming executor on cpu_stream_nvme instead (planner's pick above: cpu_full); ~1.24 GB RAM resident promised
streamed 24 layers [mode=perfect]: 494 MB read, 716 MB decoded in 20.5s (35 MB/s raw, ~0.0488 passes/s)
resident: non-layer 0.27 GB + peak layer slot 0.03 GB; process peak 1.04 GB (planner promised ~1.2 GB)

Flags: --mode perfect|fast|tiny, --min-tok-s N, --stream, --no-progress.
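The figures in the --stream output above can be cross-checked by hand. The sketch below assumes that "raw" in "35 MB/s raw" refers to decoded bytes (716 MB), since that is the quantity that reproduces the printed rate; the variable names are illustrative only.

```python
# Figures copied from the --stream example output.
read_mb = 494        # compressed bytes read from disk
decoded_mb = 716     # bytes after decoding
seconds = 20.5       # wall time for one pass over all 24 layers

decode_rate = decoded_mb / seconds       # decoded MB per second
passes_per_s = 1 / seconds               # one pass = every layer streamed once
read_fraction = read_mb / decoded_mb     # compressed size relative to decoded size
accounted_gb = 0.27 + 0.03               # non-layer tensors + peak layer slot

print(f"{decode_rate:.0f} MB/s, {passes_per_s:.4f} passes/s")
print(f"read/decoded = {read_fraction:.1%}, accounted resident = {accounted_gb:.2f} GB")
```

The accounted 0.30 GB is only the weight residency; the reported process peak (1.04 GB) is the number to compare against the planner's promise.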


bigsmall dual (4.0)

Create a dual-fidelity .bsd (INT4 fast member + lossless residual) from a safetensors file, or inspect an existing one. The .bsd is the Ferrell Duo format — one file, two models: the fast one and the real one. Creation runs the bit-exact gate on every tensor before the file is written — a .bsd that cannot prove perfect-mode exactness is never created.

bigsmall dual model.safetensors -o model.bsd
dual: model.safetensors -> model.bsd
  raw 988,065,536 -> file 680,590,500 (68.88%) | fast-only reads 211,720,831 (21.43%)
  169/290 tensors dual | encode 151.98s | bit-exact gate PASS (290 tensors, pre-write)

Inspect / re-gate:

bigsmall dual model.bsd            # manifest: modes, sizes, receipts
bigsmall dual model.bsd --verify   # re-run the bit-exact gate from disk

Flags: -o OUT, --group N (INT4 group size, default 128), --workers N, --verify, --no-progress.
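As a worked check of the creation output above, both percentages are taken relative to the raw safetensors size. The variable names below are illustrative, and the "skipped" figure is simply the difference of the two printed byte counts.

```python
# Byte counts copied from the `bigsmall dual` example output.
raw = 988_065_536
dual_file = 680_590_500
fast_only_reads = 211_720_831

file_pct = 100 * dual_file / raw              # whole .bsd relative to raw
fast_pct = 100 * fast_only_reads / raw        # bytes read in fast mode relative to raw
skipped_in_fast = dual_file - fast_only_reads # bytes a fast-only load never reads

print(f"file {file_pct:.2f}% | fast-only {fast_pct:.2f}% | skipped {skipped_in_fast:,} bytes")
```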


bigsmall compress

Compress a .safetensors file or a model directory into a .bs file.

bigsmall compress SRC [-o OUT] [--delta-from BASE] [--auto-delta]
                       [--force-delta] [--group-streams]
                       [--resume] [--ecc]
                       [--storage|--balanced|--inference]
compressed gpt2_src\model.safetensors -> gpt2.bs
  source:     548,105,171 bytes
  compressed: 413,973,591 bytes (75.53%)
  saved:      134,131,580 bytes
  elapsed:    19.4s

Flags:

  • --delta-from BASE — write as a delta against BASE (alias: --base). The delta engine measures delta AND standalone per tensor and keeps the smaller, so the output is never worse than standalone coding (3.15).
  • --auto-delta — auto-detect closest base by fingerprint. Pair with one or more --base-dir.
  • --force-delta (3.15) — silence the delta regime gate (the warning printed when more than 30% of matched weight bytes changed vs the base, the measured delta-does-not-pay regime). Size is protected either way.
  • --group-streams (3.15) — pack small same-role tensors (norm/bias chains, GQA k/v projections, MoE routers) into one coded stream per role when measured smaller; bit-exact, every loss gated to 0. Grouped files need bigsmall >= 3.15 to read and cannot be resharded without re-encoding.
  • --resume — tensor-level checkpointing in <OUT>.progress/. Auto-cleans on success.
  • --ecc — write a Reed-Solomon parity sidecar (<OUT>.ecc) for bit-rot recovery. Adds ~14% to total size. Slow on large files: encoding uses the pure-Python reedsolo backend and is O(file_size) at only a few MB/s, so a multi-hundred-MB shard can take several minutes (a one-line notice is printed for inputs over 100 MB). The process is still encoding, not hung.
  • --storage / --balanced / --inference — mode hint (informational for now).
  • --no-progress — disable tqdm progress bar.
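The "measure both, keep the smaller" rule behind --delta-from can be illustrated with a toy encoder. This is a sketch under stated assumptions, not the BigSmall delta engine: zlib stands in for the real codecs, a byte-wise XOR stands in for the real delta transform, and encode_tensor is a hypothetical helper.

```python
import random
import zlib
from typing import Optional, Tuple


def encode_tensor(tuned: bytes, base: Optional[bytes]) -> Tuple[str, bytes]:
    """Code one tensor both ways and keep whichever payload is smaller."""
    standalone = zlib.compress(tuned, 9)
    if base is None or len(base) != len(tuned):
        return "standalone", standalone       # no usable base for this tensor
    xor = bytes(a ^ b for a, b in zip(tuned, base))
    delta = zlib.compress(xor, 9)
    if len(delta) < len(standalone):
        return "delta", delta
    return "standalone", standalone           # delta did not pay, so it is not used


rng = random.Random(0)
base = bytes(rng.getrandbits(8) for _ in range(4096))

lightly_edited = bytearray(base)
lightly_edited[10] ^= 0xFF                    # one byte differs from the base
kind_a, _ = encode_tensor(bytes(lightly_edited), base)

unrelated = bytes(4096)                       # all zeros, nothing in common with base
kind_b, _ = encode_tensor(unrelated, base)

print(kind_a, kind_b)
```

Because the choice is made per tensor, the total can never exceed the all-standalone total in this sketch, which is the property the flag description claims for the real engine.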

FP8 (e4m3/e5m2), BF16, F16, and F32 tensors are all handled automatically — fp8 weights code to a measured 0.829 of their fp8 bytes (details).

Examples:

bigsmall compress mistral.safetensors -o mistral.bs
bigsmall compress finetune.safetensors --delta-from base.safetensors -o patch.bs
bigsmall compress huge.safetensors -o huge.bs --resume --ecc

bigsmall decompress

Decompress a .bs file back to .safetensors. With --base, treats SRC as a delta.

bigsmall decompress SRC [-o OUT] [--base BASE]
decompressed gpt2.bs -> gpt2_back.safetensors  (13.8s)

Examples:

bigsmall decompress mistral.bs -o mistral.safetensors
bigsmall decompress patch.bs --base base.safetensors -o reconstructed.safetensors

bigsmall transcode (4.0)

Re-encode a .bs into a runtime-optimised .bsr for decode speed. Lossless: every re-encoded tensor is decode-verified against the source md5 before the output exists. Modes: speed (hot tensors raw + stream-split bulk — the data-picked default), balanced, ratio.

bigsmall transcode gpt2.bs gpt2.bsr --mode speed
transcode[speed]: gpt2.bs -> gpt2.bsr
  413,973,591 -> 497,794,303 bytes (+20.25% vs source)
  tensors: 160 (148 re-encoded, all md5-verified)  wall: 8.9s

The size grows because speed mode trades ratio for decode rate — that is its job; on a bf16 model the measured trade is ~+5 pp of raw for several times the streaming decode speed. Flags: --mode ratio|balanced|speed, --n-streams N (stream-split count), --workers N, --no-progress.
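Two different measures appear above: growth relative to the source .bs, and percentage points of raw size. The sketch below computes both for the gpt2 example, assuming gpt2.bs is the same file produced in the compress example (the byte counts match) so that its raw size can be reused.

```python
# Byte counts copied from the compress and transcode example outputs.
raw = 548_105_171      # original safetensors (compress example)
bs = 413_973_591       # gpt2.bs
bsr = 497_794_303      # gpt2.bsr, speed mode

growth_vs_source_pct = 100 * (bsr / bs - 1)       # what transcode prints
pp_of_raw = 100 * bsr / raw - 100 * bs / raw      # same cost in points of raw size

print(f"+{growth_vs_source_pct:.2f}% vs source, +{pp_of_raw:.2f} pp of raw")
```

For this fp32 file the cost in points of raw is larger than the ~+5 pp quoted for bf16 models, so the bf16 figure should not be read off this example.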


bigsmall serve-stream (4.0)

Weight-streaming inference: weights stay compressed in tiered residency (VRAM / RAM / NVMe budgets you set); each layer decompresses into a reusable scratch slot during the forward pass, with optional prefetch of layer L+1 while L computes. The correctness gate behind it: streamed logits are bit-exact against a fully-loaded model (verified at 7B — receipts).

bigsmall serve-stream gpt2.bsr --config ./gpt2_src --ram-gb 0.2 --prefetch \
    --prompt "The future of model compression is" --max-new-tokens 4
serve-stream: gpt2.bsr
  layers: 12  construct: 38.0s
  residency (compressed): vram 0.0 MB (SIMULATED on CPU) | ram 198.5 MB | nvme 299.3 MB
  output: 'The future of model compression is uncertain. The current'
  4 tokens in 362.0s (90.50 s/token; prefetch=on)
  per-layer decode: median 33 ms over 48 decodes
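A quick consistency check on the numbers above: the decode count matches one decode per layer per generated token. The decode-time estimate uses the median, so it is only indicative; names are illustrative.

```python
# Figures copied from the serve-stream example output.
layers = 12
tokens = 4
wall_s = 362.0
decodes = 48
median_decode_ms = 33

s_per_token = wall_s / tokens                      # seconds per generated token
decodes_per_token = decodes / tokens               # should equal the layer count
approx_decode_s = decodes * median_decode_ms / 1000  # rough total, median-based

print(f"{s_per_token:.2f} s/token, {decodes_per_token:.0f} decodes/token, "
      f"~{approx_decode_s:.2f} s spent decoding")
```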

Streaming generation is capacity mode — slow per token, honest about it, for models that otherwise would not run at all. Flags: --config DIR (HF config; default next to src), --device DEV (default cpu; CUDA placement code-complete, bench deferred — VRAM residency is simulated on CPU and labeled), --vram-gb N, --ram-gb N, --pin-layer IDX:TIER (repeatable), --prefetch, --decode-threads N, --prompt TEXT, --max-new-tokens N, --placement-json FILE.
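The prefetch idea (decode layer L+1 while layer L computes) can be sketched with a single background worker. This is a toy model, not the serve-stream executor: zlib stands in for the real codec, stream_forward and compute are hypothetical names, and tiered residency is not modelled.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def stream_forward(compressed_layers, compute, x):
    """Run layers in order, decompressing the next layer in the background."""
    if not compressed_layers:
        return x
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(zlib.decompress, compressed_layers[0])
        for i in range(len(compressed_layers)):
            weights = pending.result()                # wait for layer i
            if i + 1 < len(compressed_layers):
                # Kick off layer i+1 before computing layer i.
                pending = pool.submit(zlib.decompress, compressed_layers[i + 1])
            x = compute(weights, x)                   # overlaps with the prefetch
    return x


# Three toy "layers" whose weights are just repeated bytes 1, 2 and 3.
toy_layers = [zlib.compress(bytes([k]) * 8) for k in (1, 2, 3)]
result = stream_forward(toy_layers, lambda w, x: x + w[0], 0)
print(result)
```

Only two decoded layers exist at any moment (the one computing and the one prefetched), which is the property that keeps peak memory bounded in a streaming design.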


bigsmall apply

Reconstruct a fine-tune from a base model + delta patch. Same as decompress --base, but reads more naturally.

bigsmall apply BASE PATCH.bs -o OUT

Example:

bigsmall apply base.safetensors patch.bs -o finetune.safetensors

bigsmall info

Print everything about a compressed file: size, ratio, codecs used, model type, base, layer count, streaming peak-RAM estimate, per-codec breakdown, best- and worst-compressed tensors.

bigsmall info gpt2.bs
BigSmall container: gpt2.bs
  format                     fp32
  tensor_count               160
  file_size                  413,973,591 bytes (394.80 MiB)
  estimated_raw_bytes        548,090,880 bytes (522.70 MiB)
  overall ratio_pct          75.53%
  layer_count                12
  streaming_peak_ram_est     181.28 MiB
  codec_breakdown
    fp32_se_ac           148 tensors
    tied_ref             11 tensors
    special              1 tensors
  ...

bigsmall scan

Analyse an uncompressed safetensors file or directory before you compress it. Detects BF16-native F32 (the upcast-to-F32 pattern in Whisper-class models) and recommends a compression strategy.

bigsmall scan ./gpt2_src/model.safetensors
scan: gpt2_src\model.safetensors
  shards:                 1
  tensors total:          160
  F32 tensors:            160
  BF16-native F32:        12
  BF16-native by bytes:   9.2%
  recommendation:         standard F32 (no BF16-native upcast)

Flag: --json for machine-readable output.


bigsmall xray (3.15)

Checkpoint forensics (read-only; no effect on compression). For every float tensor: bf16-grain substream entropies (sign / exponent / mantissa, Miller-Madow corrected) compared against a matched-moments random control (dH and coder-transfer KL per substream), stat fingerprints (H_exp, H_mant, rank1_frac, sign_balance), and anomaly flags. Works on bf16 and fp8 checkpoints (4.0: fp8 tensors get their own lineage path and matched-fp8 random controls).

  • bf16_native_lineage — F32 tensor whose low-16 bits are >=10% zero (bf16 lineage; the codec exploits this case automatically).
  • lineage_residue — mantissa micro-structure marking fp16/bf16 precision lineage (present at initialization, not training content).
  • mantissa_carved — trained mantissa entropy below the matched-random control (the rare trained-in carve; mamba-1 conv1d class).
  • sign_imbalance — trained-in sign asymmetry (inits are symmetric).
  • exp_near_random — exponent indistinguishable from a matched normal. Trained exponents separate in every measured family, so a checkpoint where most tensors carry this flag looks untrained — the tool prints a loud warning (catches silently-randomized loads).
  • below_random_floor — total ratio under the random-Gaussian floor.

bigsmall xray model.safetensors                 # table + summary
bigsmall xray model_dir/ --json report.json     # full per-tensor JSON
bigsmall xray model.bs --no-svd                 # works on .bs too
xray: model.safetensors
  tensors analyzed: 62 (of 160 seen; min 4096 elements)
  median H_exp  2.6007 bits   median H_mant 6.9719 bits   median ratio 66.08%
  median mantissa coder-KL vs matched normal: 0.0001 bits/symbol (coder band 0.02)
  flags: bf16_native_lineage=12, mantissa_carved=12, sign_imbalance=12, exp_near_random=5, below_random_floor=12
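The bf16_native_lineage test can be pictured as a bit check on each float32 value. The sketch below assumes the flag means "at least 10% of elements have all-zero low 16 bits"; low16_zero_fraction is a hypothetical helper, not part of the bigsmall API.

```python
import struct


def low16_zero_fraction(values):
    """Fraction of float32 values whose low 16 bits are all zero."""
    if not values:
        return 0.0
    raw = struct.pack(f"<{len(values)}f", *values)       # encode as little-endian f32
    words = struct.unpack(f"<{len(values)}I", raw)       # reinterpret as 32-bit ints
    return sum(1 for w in words if (w & 0xFFFF) == 0) / len(words)


# 1.0 and 0.5 are exactly representable in bf16; 0.1 is not.
frac = low16_zero_fraction([1.0, 0.5, 0.1])
print(f"{frac:.3f}", frac >= 0.10)
```

A float32 that was upcast from bf16 carries no information in its low 16 bits, which is why a high fraction suggests bf16 lineage.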

bigsmall stat

Per-tensor detail table, sortable.

bigsmall stat mistral.bs --sort ratio --reverse
bigsmall stat mistral.bs --tensor q_proj

Flags: --sort {ratio,size,name}, --reverse, --tensor NAME (substring filter).


bigsmall verify

Check integrity. Three modes from quick to thorough:

bigsmall verify --fast mistral.bs         # header-only, seconds even on 30 GB
bigsmall verify --sample 0.001 mistral.bs # md5 on a random 0.1% of weights
bigsmall verify mistral.bs                # full decode + md5 (slow)

OK means every checked weight is bit-identical to the original.


bigsmall diff

Compare two .bs files (added / removed / changed tensors). With --patch, writes a delta .bs from A to B.

bigsmall diff v1.bs v2.bs
bigsmall diff base.safetensors finetune.safetensors --patch patch.bs

bigsmall repair

Recover a corrupted .bs using its Reed-Solomon .ecc sidecar. Tolerates up to 16 corrupted bytes per 223-byte block.

bigsmall repair corrupted.bs -o recovered.bs
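The "16 bytes per 223-byte block" tolerance and the "~14%" size cost quoted for --ecc are consistent with a standard RS(255, 223) code. That code choice is an assumption inferred from the two figures, not something this page states; the arithmetic is:

```python
# Assumed code parameters: RS(255, 223) over bytes.
codeword_bytes = 255
data_bytes = 223

parity_bytes = codeword_bytes - data_bytes    # redundancy added per block
correctable = parity_bytes // 2               # Reed-Solomon corrects floor(parity/2) byte errors
overhead_pct = 100 * parity_bytes / data_bytes

print(f"{parity_bytes} parity bytes, corrects {correctable}, overhead {overhead_pct:.1f}%")
```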

bigsmall benchmark

Compress then decompress, reporting throughput, peak RSS, and a per-layer-type breakdown.

bigsmall benchmark mistral.safetensors

bigsmall migrate

Re-encode an older .bs file with current codecs. Writes a .bs.bak backup unless --no-backup.

bigsmall migrate old.bs --dry-run     # report only
bigsmall migrate old.bs               # rewrite in place

bigsmall status

List your BigSmall HuggingFace repos. Compares each repo against your local compressed directories.

bigsmall status
bigsmall status --json
bigsmall status --user some_user --suffix -compressed

bigsmall pipeline run

Resumable download → compress → upload pipeline. Crashes mid-upload? Re-run; it skips finished stages.

bigsmall pipeline run mistralai/Mistral-7B-Instruct-v0.3 ./mistral_bs --repo-id wpferrell/mistral-7b-bigsmall

Flags:

  • --no-upload / --no-compress — skip a stage.
  • --mode {storage,balanced,inference} — codec mode.
  • --lfs — LFS-based uploader (avoids HF Python upload issues on >2 GB files).
  • --workers N — parallel encode workers.
  • --token TOKEN — HF token (else env or ~/.huggingface/token).

bigsmall pipeline status DST_DIR shows the checkpoint for a destination directory.


bigsmall reshard

Split, join, or rebalance one or more .bs files into a new shard layout along transformer-layer boundaries. No re-encoding happens — blobs are copied, so it is fast and every output tensor is md5-verified. Layer groups are never split across shards; non-layer tensors (embeddings, lm_head, final norm) always land in shard 0.

bigsmall reshard model.bs --output-dir resharded/            # pack into ~2 GB shards
bigsmall reshard shard-*.bs --output-dir merged/ --shards 4  # exactly 4 output shards
bigsmall reshard ./model_dir/ --output-dir one/ --join       # join everything into one shard

Flags:

  • --output-dir DIR (required) — where to write the resharded output + new bigsmall.index.json.
  • --size-gb N — target output shard size in GB (default 2.0).
  • --shards N — split into exactly N shards (overrides --size-gb).
  • --join — pack every input tensor into a single model.bs (overrides the other two).
  • --no-progress — disable the per-shard progress bar.
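The layout rules (layer groups never split, non-layer tensors in shard 0, a target size per shard) can be sketched as a greedy packer. This is an illustration of the --size-gb case only, not the reshard implementation: plan_shards and its arguments are hypothetical, and sizes are plain byte counts.

```python
def plan_shards(non_layer_bytes, layer_bytes, target_bytes):
    """Greedy layout: shard 0 holds the non-layer tensors; layers stay whole."""
    shards = [{"layers": [], "bytes": non_layer_bytes}]
    for idx, size in enumerate(layer_bytes):
        current = shards[-1]
        # Open a new shard only if this one already holds something
        # and adding the whole layer would exceed the target.
        if current["bytes"] > 0 and current["bytes"] + size > target_bytes:
            current = {"layers": [], "bytes": 0}
            shards.append(current)
        current["layers"].append(idx)
        current["bytes"] += size
    return shards


layout = plan_shards(non_layer_bytes=100, layer_bytes=[60, 60, 60, 60], target_bytes=200)
for n, shard in enumerate(layout):
    print(n, shard["layers"], shard["bytes"])
```

A layer larger than the target still gets a shard to itself in this sketch, so the target is a goal rather than a hard limit.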