Forgather is a configuration-driven ML framework that uses template inheritance and code generation to eliminate configuration duplication and enable systematic experimentation. Instead of copying and modifying entire training scripts, you inherit from base templates and specify only what changes.
📚 Documentation: forgather.readthedocs.io is the complete documentation index. New users should head straight to Getting Started.
Most research ML codebases accrete: one training script becomes ten training scripts, each a near-copy of the others with subtle differences. Every variation is expensive to try. Small bugs -- a loss function wired wrong, a scheduler silently reset on resume, a CLI flag that didn't actually reach the tokenizer -- hide across forks.
Forgather addresses this with three layers:
- Template inheritance. A project config extends a parent; both are plain YAML with Jinja2 preprocessing. Overrides are explicit (`-- set ns.seq_len = 16384`), and every knob is documented on the parent.
- Portable model source. Model construction emits standalone Python source into the training run's output directory, and the training run then loads that source. You can zip that directory and load the model on another machine with just `transformers` installed -- no Forgather dependency. (The trainer, optimiser, datasets, and callbacks materialise straight from the config graph; the Python export is specifically for the model so it can live outside Forgather's runtime.)
- Live job control. Running training jobs register a small control endpoint. From another shell you can save a checkpoint on demand, gracefully stop a run, or abort a failed experiment -- this works uniformly across DDP, FSDP2, and pipeline-parallel jobs.
- No config duplication -- inherit from base templates and override only what changes; every knob is an explicit overridable block.
- Types as hyperparameters -- swap optimizers, models, datasets, trainers, and samplers directly in YAML via custom tags (`!partial`, `!factory`, `!singleton`). No Python edits required to try a new optimizer or a different layer norm.
- Full reproducibility -- automatic snapshots of configs and generated code with each run.
- Standalone generated models -- the PyTorch source written into each run's `output_models/` directory loads with plain `AutoModelForCausalLM.from_pretrained(..., trust_remote_code=True)`. No Forgather dependency at inference time; zip and ship.
- Pipeline-parallel trainer for bandwidth-limited environments -- requires dramatically less cross-device communication than DDP or FSDP. Forgather has trained a 7B model across two machines linked only by 1 Gbit Ethernet with bandwidth to spare, and the same design scales well on consumer hardware where FSDP stalls on PCIe. It should extend to more than two nodes on the same fabric.
- Low-memory training suite -- gradient checkpointing, CPU activation offloading, fused optimizer step, fused linear+cross-entropy loss (Liger / Apple CCE / torch.compile), Triton Adafactor with stochastic bf16 rounding, FP8 via torchao. Full-parameter (not LoRA) finetuning of 7B models at up to ~53K context on a single 24 GB GPU.
- Fast dataset loading -- large HF datasets (e.g. C4) normally take 10-20 minutes to open. Forgather indexes the Arrow files on first use and is ready in a few seconds after that, with stateful resume from any position.
- Packed sequences + Flex Attention -- pack multiple documents into every batch with explicit document-boundary tracking; flex-attention masks enforce no cross-document attention, so packing doesn't cost attention quality. Combined with the fused-loss kernel, this is a large throughput win on small-document corpora.
- Live job control -- save, stop, save-stop, or abort running training jobs from another shell. Coordinates automatically across DDP / FSDP-2 / pipeline-parallel workers.
- Extensive examples and documentation -- tutorials covering every step from scratch, pretraining recipes from 4M to ~1B params, finetune recipes for Llama-2/3, Mistral, Qwen3, Gemma-3, and a growing set of ablation-focused experiments. Every major subsystem has its own docs page.
- Speed-optimised -- meaningful effort has gone into wall-clock performance (Triton kernels, packed iterators, SDPA-backend selection, `torch.compile` everywhere it helps). A formal head-to-head benchmark against alternative frameworks is still to be done.
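The no-cross-document masking behind the packed-sequences feature can be sketched framework-agnostically. This is an illustrative predicate in the shape of FlexAttention's `mask_mod` (the real signature also takes batch and head indices); it is not Forgather's implementation:

```python
def doc_ids(lengths):
    """Map packed-document lengths to per-token document ids.
    e.g. [3, 2] -> [0, 0, 0, 1, 1]"""
    ids = []
    for doc, n in enumerate(lengths):
        ids.extend([doc] * n)
    return ids

def make_mask_mod(ids):
    """Allow attention only causally and within the same document."""
    def mask_mod(q_idx, kv_idx):
        return ids[q_idx] == ids[kv_idx] and kv_idx <= q_idx
    return mask_mod

keep = make_mask_mod(doc_ids([3, 2]))
print(keep(2, 0))  # True: both tokens in doc 0, causal direction
print(keep(3, 2))  # False: token 3 is in doc 1, can't see doc 0
print(keep(0, 1))  # False: anti-causal
```

FlexAttention compiles a predicate like this into a block-sparse kernel, so the masked-out cross-document blocks are skipped rather than computed and discarded.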
- Apr 2026 -- New recommended base templates: `projects/lm_training_project.yaml` (pretraining and finetuning) and `projects/finetune_v2.yaml` (finetune-specific layer). Token-budget-driven step computation, automatic batch-size-aware LR scaling, WSD scheduler, fully-documented parameter surface. Replaces several drifting older base templates.
- Apr 2026 -- Tiny Llama and H.P. Lovecraft tutorials rewritten around the v2 templates as README-first (no Jupyter required). Tiny Llama covers the full train → monitor → control → eval → inference → export flow.
- Mar 2026 -- YaRN and Llama-3 style RoPE scaling in the rotary-embeddings module. Configure via `rope_parameters` with `rope_type: yarn` or `rope_type: llama3`.
- Mar 2026 -- `forgather eval test` -- run any named eval config against a trained model and write markdown + JSON results to `{model}/evals/`.
- Feb 2026 -- Trainer job control (`forgather control list / status / save / stop / save-stop / abort`). Distributed-safe; works across DDP and pipeline-parallel runs.
- Feb 2026 -- Sharded-checkpoint abstraction with explicit state-sharing patterns (GLOBAL / PER_RANK / REPLICATED / PER_GROUP / PER_NODE) and per-checkpoint manifests. See `docs/checkpointing/`.
- Dec 2025 -- Fused linear-cross-entropy loss (paper) -- Liger / Apple CCE / PyTorch-compiled implementations. Large peak-memory reduction for training with big vocabularies. Example: `examples/finetune/samantha/templates/configs/llama3_1b/1gpu_default.yaml`.
- Dec 2025 -- Triton Adafactor -- `src/forgather/ml/optim/adafactor_triton.py` -- lower peak memory and faster training than the reference Adafactor.
- Dec 2025 -- Inference server supports `device_map="auto"`, so models too large for one GPU can be sharded across all visible GPUs for serving.
- Nov 2025 -- Overhauled model-conversion tool with support for Llama (incl. RoPE scaling, tied embeddings), Mistral, Qwen3, Gemma-3.
- Nov 2025 -- OpenAssistant dataset -- high-quality example of a custom dataset that generates examples on the fly (quality-weighted sampling from conversation trees, sequence packing, multi-language, deterministic). Demo finetune project.
- Nov 2025 -- Support for packed sequences and Flex Attention; KV cache in models.
- Torch Titan integration -- Use Forgather to configure Torch Titan.
vLLM integration is currently broken due to Forgather's move to Transformers v5, which vLLM does not yet support. Upstream is working on v5 compatibility; we'll re-enable the integration once that lands.
- Why Forgather?
- Key Benefits
- Quick Start
- Key Features
- Core Concepts
- Learning Forgather
- Featured Examples
- Command Overview
- Project Structure
- Development Setup
- Contributing
Full install walkthrough and first-training-run tutorial: docs/getting-started/README.md.
The short version, assuming Python 3.12+ and
`build-essential` / `python3-dev` / `graphviz` are already installed.
An NVIDIA GPU is strongly recommended but not required -- CPU-only
training works (Tiny Llama runs end-to-end on a Chromebook in a day);
it's just ~two to three orders of magnitude slower than the same
workload on a 4090. Non-CUDA accelerators (Intel, AMD) may work in
theory -- we've avoided hard CUDA dependencies -- but have not been
tested outside of CUDA and CPU.
# Create and activate a virtual environment first -- most modern distros
# refuse `pip install` into the system Python (PEP 668). venv, conda, or
# uv all work; pick one:
python3.12 -m venv ~/venvs/forgather
source ~/venvs/forgather/bin/activate
git clone https://github.com/jdinalt/forgather.git
cd forgather
pip install -e .
pip install "cut-cross-entropy @ git+https://github.com/apple/ml-cross-entropy.git"
cd examples/tutorials/tiny_llama
forgather -t v2.yaml train   # ~10 min on a single RTX 4090

See examples/tutorials/tiny_llama/README.md
for the full "train → monitor → control → eval → inference → export"
walkthrough, or docs/getting-started/README.md
for the install details.
Create new experiments by inheriting from existing configs and specifying only the differences:
-- extends 'base_experiment.yaml'
[config_metadata]
== super()
-- set ns.seq_len = 16384 # longer context
[optimizer]
== super()
lr: 1.0e-3 # override the LR, keep everything else

Use any Python class or function directly in configs. Custom YAML tags
(`!partial`, `!factory`, `!singleton`, `!var`, `!call`) describe how to
build live Python objects from the parsed graph:
optimizer: !partial:torch.optim.AdamW
lr: 1.0e-3
weight_decay: 0.01
[layer_factory]
# Experiment: swap PreLayerNorm for PostLayerNorm
layer_factory: &layer_factory !partial:.post_ln_layer:PostLNLayer@layer_factory
feedforward_factory: *feedforward_factory
attention_factory: *attention_factory
norm_factory: *layer_norm_factory
dropout: !var "layer_dropout"
residual_dropout: !var "layer_dropout"

See Syntax Reference for the full list of line statements and YAML tags.
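Conceptually -- a simplified sketch, not Forgather's actual resolver -- `!partial` behaves like `functools.partial`: it binds the YAML mapping as keyword arguments and defers the call until the trainer supplies the rest (`!singleton` additionally caches the constructed instance so every reference shares it):

```python
import functools

# Stand-in for torch.optim.AdamW (an illustrative assumption):
def AdamW(params, lr=1e-3, weight_decay=0.0):
    return {"params": params, "lr": lr, "weight_decay": weight_decay}

# optimizer: !partial:torch.optim.AdamW
#   lr: 1.0e-3
#   weight_decay: 0.01
optimizer_factory = functools.partial(AdamW, lr=1.0e-3, weight_decay=0.01)

# Later, the trainer supplies the argument the config couldn't know:
opt = optimizer_factory(params=["w1", "w2"])
print(opt["lr"], opt["weight_decay"])  # 0.001 0.01
```

This is why swapping an optimizer is a one-line YAML change: the config names the callable and its hyperparameters; the trainer provides the runtime objects.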
At runtime Forgather materialises the parsed node graph directly into Python objects -- no intermediate code-generation phase is involved. Python-source export is a separate function with two uses:
- Custom model source. When you construct a model for the first time, Forgather writes the equivalent Python source into the training run's output directory. The generated code has no Forgather dependency: any HF-compatible consumer (`transformers`, vLLM, etc.) can load the model without Forgather installed. This is what `trust_remote_code=True` resolves.

      from transformers import AutoModelForCausalLM
      model = AutoModelForCausalLM.from_pretrained(
          "output_models/v2",
          trust_remote_code=True,
      )

  If you want plain-HF weights without `trust_remote_code`, convert via `forgather convert --reverse --model-type llama <src> <dst>`. The converter supports Llama, Mistral, Qwen3, and Gemma-3.

- Config debugging / pedagogy. `forgather code --target X` prints the Python equivalent of any node in your config graph -- useful when you want to understand what a complex `!partial` / `!factory` chain actually constructs, or to see how template overrides materialise. Not used by training itself.
- `basic` trainer -- single-GPU, the fast path for small-model experiments.
- `ddp` trainer -- multi-GPU DistributedDataParallel, with optional PostLocalSGD for reduced communication frequency.
- `fsdp2` trainer -- FSDP-2 sharded data parallel, with configurable parameter/reduce/buffer dtypes and CPU offload.
- `pipeline` trainer -- pipeline parallelism. GPipe, 1F1B, Interleaved-1F1B, and zero-bubble schedules, multi-stage support, per-stage `torch.compile`. Designed for bandwidth-limited setups (multi-node over Ethernet or consumer GPUs over PCIe).
- DiLoCo -- distributed local SGD for very-low-bandwidth multi-machine training. Sync and async modes, Delayed Nesterov momentum, dynamic local-update adaptation. See `docs/trainers/diloco.md`.
- `AccelTrainer` -- legacy Hugging Face Accelerate wrapper, kept for a few older examples. Maintenance is low priority; prefer `ddp` / `fsdp2` / `pipeline` for new work.
- A small Transformers-Trainer compatibility shim also exists for pre-Forgather-trainer scripts. Legacy; low priority.
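The DiLoCo outer/inner loop above can be sketched in a few lines (illustrative only -- scalar parameters, plain SGD inner steps; the real trainer adds Nesterov outer momentum and overlapped communication):

```python
def diloco_round(workers, start, outer_lr=1.0, local_steps=4):
    """One DiLoCo round: independent local SGD on each worker, then an
    outer step on the averaged pseudo-gradient (start - local params)."""
    locals_ = []
    for w in workers:                      # inner loop: no communication
        p = list(start)
        for _ in range(local_steps):
            p = [pi - w["lr"] * gi for pi, gi in zip(p, w["grads"])]
        locals_.append(p)
    # Pseudo-gradient, averaged across workers -- the only sync point,
    # once per round instead of once per step.
    pg = [sum(start[i] - lp[i] for lp in locals_) / len(locals_)
          for i in range(len(start))]
    return [s - outer_lr * g for s, g in zip(start, pg)]

workers = [{"lr": 0.1, "grads": [1.0]}, {"lr": 0.1, "grads": [1.0]}]
new = diloco_round(workers, start=[1.0])
print(new)  # ~[0.6]: with outer_lr=1.0 this recovers plain averaging
```

The bandwidth win comes from syncing one pseudo-gradient every `local_steps` steps rather than every gradient every step.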
- Adafactor with bf16 stochastic rounding (the distinctive one)
-- Forgather's Triton Adafactor combines factored second-moment
estimation with per-parameter SR applied to bf16 weight updates, in
a single fused kernel. Stochastic rounding is critical for
pure-bf16 training (no fp32 master weights) -- without it,
small updates below the bf16 precision step are systematically
rounded to zero and the model's weight norms slowly drift.
To our knowledge this is the only Adafactor+SR implementation
available, and it runs faster in our tests than any other Adafactor
we've benchmarked. File: `src/forgather/ml/optim/adafactor_triton.py`.
- AdamW with SR -- Forgather ships a stochastic-rounding AdamW (`src/forgather/ml/optim/adamw.py`) that makes a real difference in pure-bf16 runs, but if you want quantized state on top of SR, prefer `torchao.optim.AdamW4bit` (4-bit optimizer state, SR enabled via `stochastic_rounding=True`). Example config: `examples/finetune/samantha/templates/configs/llama3_1b/ddp_adam4bit.yaml`.
- Apollo / Apollo-mini (arXiv:2412.05270) -- low-rank gradient projection for SGD-level memory with AdamW-level performance. Experimental -- interesting for small ablations and memory-constrained single-GPU runs, not production-hardened.
- Other optimizers -- SinkGD
(arXiv:2502.06742, stateless
matrix normalization), SGD, Muon (see the optimizer-comparison
experiment), plus a regex-based `multiopt` helper for per-parameter-group optimizer assignment.
- FP8 via torchao -- adapters for `tensorwise` / `rowwise` / `rowwise_with_gw_hp` recipes, orthogonal to bf16 mixed precision.
- Mixed precision -- bf16 (default) and fp16 (with GradScaler); TF32 matmul controls; SDPA backend selection (flash / mem-efficient / math); FP8 via torchao (`tensorwise`, `rowwise`, `rowwise_with_gw_hp` recipes).
- Learning-rate schedulers -- Warmup-Stable-Decay, Cosine, Infinite-LR, all with configurable warmup / decay budgets in tokens or steps.
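Why stochastic rounding matters for the pure-bf16 optimizers above can be demonstrated in pure Python (a bit-level sketch for positive floats; the real implementation is a fused Triton kernel):

```python
import random
import struct

def bf16_stochastic(x: float, rng: random.Random) -> float:
    """Stochastically round a positive fp32 value to bf16: add uniform
    noise below the kept precision, then truncate the low 16 bits.
    The result is unbiased: E[round(x)] == x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + rng.getrandbits(16)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

rng = random.Random(0)
w = 1.0
for _ in range(10_000):
    # 1e-4 is far below bf16's ~7.8e-3 ulp at 1.0: round-to-nearest
    # would discard the update every step and w would stay pinned at 1.0.
    w = bf16_stochastic(w + 1e-4, rng)
print(w)  # close to 2.0: the accumulated updates survive on average
```

With round-to-nearest, every one of the 10,000 updates is lost; with stochastic rounding, a small fraction of steps round up by one ulp, and the expectation of the weight tracks the true sum.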
Model parameters are HF-compatible. The weight shards are written
as a standard Hugging Face Safetensors layout (a *.safetensors shard
set plus a pytorch_model.bin.index.json / model.safetensors.index.json
manifest), not a bespoke Forgather format. That's the critical part:
any tool that reads an HF checkpoint -- transformers, vLLM,
llama.cpp conversion, remote eval harnesses -- can read the trained
model. Combined with forgather checkpoint link (which symlinks the
latest checkpoint's shards up into the model directory), a plain
AutoModelForCausalLM.from_pretrained("output_models/my_run") works
without trust_remote_code once the model has been converted to a
canonical HF architecture via forgather convert --reverse.
The rest of the checkpoint (optimizer state, LR-scheduler state, dataset iterator state, per-rank RNG state, trainer step counter, manifest) is Forgather-specific -- it has to be, since it encodes Forgather-specific trainer internals -- and is used only by Forgather itself for resume. So "zip and ship" to another framework is supported for the model; full-state resume is Forgather-only.
The Forgather coordination layer sits above the on-disk format:
explicit state-sharing patterns. Every checkpoint component
declares whether it's GLOBAL (rank-0 only), PER_RANK,
REPLICATED (across DDP replicas), PER_GROUP (within a PP / TP
group), or PER_NODE. Coordination barriers and load paths are
derived from those declarations, so pipeline-parallel and FSDP-2 runs
checkpoint correctly without per-trainer custom code.
Each checkpoint also writes a JSON manifest recording every component's size, sharing pattern, and origin ranks. Resume is partial by default (a missing optional component warns instead of failing). Optional replication validation (NONE / QUICK / TENSOR / FULL) catches DDP-synchronisation bugs by hashing parameters across replicas post-save.
See docs/checkpointing/ for the full
abstraction.
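The sharing-pattern idea reduces to a small routing table. The names below are illustrative assumptions for this sketch, not Forgather's actual API:

```python
from enum import Enum, auto

class Sharing(Enum):
    GLOBAL = auto()      # single logical copy, owned by rank 0
    PER_RANK = auto()    # each rank owns a distinct shard (e.g. RNG state)
    REPLICATED = auto()  # identical across DDP replicas; rank 0 writes it
    PER_GROUP = auto()   # one copy per PP/TP group; the group leader writes

def should_write(pattern: Sharing, rank: int, group_leader: int = 0) -> bool:
    """Derive save responsibility from the declared sharing pattern."""
    if pattern is Sharing.PER_RANK:
        return True
    if pattern is Sharing.PER_GROUP:
        return rank == group_leader
    return rank == 0     # GLOBAL / REPLICATED

# A DDP run on 4 ranks: only rank 0 writes the replicated model weights,
# while every rank writes its own dataset-iterator state.
print([should_write(Sharing.REPLICATED, r) for r in range(4)])
print([should_write(Sharing.PER_RANK, r) for r in range(4)])
```

Because save and load responsibility is derived from the declaration rather than hand-coded per trainer, adding a new parallelism scheme doesn't require rewriting checkpoint logic.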
Every Forgather experiment is a Project with this structure:
my_project/
├── meta.yaml # Project metadata
├── templates/
│ ├── project.yaml # Project-wide defaults
│ └── configs/ # Experiment configurations
│ ├── baseline.yaml
│ └── experiment_a.yaml
├── output_models/ # Generated code + runs (per config)
└── project_index.ipynb # Optional interactive notebook
A workspace groups related projects and centralises template
search paths. Use forgather ws create to scaffold one and
forgather project create to add projects to it.
Forgather uses Jinja2 + YAML with custom syntax:
- `-- extends 'template.yaml'` -- template inheritance (single parent)
- `[block_name]` -- named override-able sections
- `== super()` -- include parent's version of the current block
- `-- set ns.var = value` -- set a variable in the namespace
- `-- include 'template.yaml'` -- include template content inline
- `#---- inline.template.name ----` -- split a document into multiple templates
- `!partial:module:Class` / `!factory:...` / `!singleton:...` -- construct Python objects
- `!var "name"` -- variable references
Every config goes through the same pipeline, and each intermediate step is inspectable:
Templates → YAML → Node Graph → Python Objects
│
└──> (optional) Python source code
- model source export (for HF `trust_remote_code` loading)
- debugging / pedagogy
Forgather materialises the node graph directly into Python objects at runtime; the Python-source path is a separate export, not an intermediary step. Model construction uses the export path so the resulting model is framework-portable; everything else (trainer, optimiser, dataset, callbacks) is built by walking the graph.
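A toy version of "materialise the node graph into Python objects" looks like the sketch below. The `_target_`-style dict is an assumption of this sketch, not Forgather's node format -- the real graph also handles partials, singletons, and cross-references:

```python
def materialise(node, registry):
    """Depth-first walk: construct children, then call the node's constructor."""
    if isinstance(node, dict) and "_target_" in node:
        ctor = registry[node["_target_"]]
        kwargs = {k: materialise(v, registry)
                  for k, v in node.items() if k != "_target_"}
        return ctor(**kwargs)
    if isinstance(node, list):
        return [materialise(v, registry) for v in node]
    return node  # plain leaf value

class Adam:  # stand-in constructor for illustration
    def __init__(self, lr, betas):
        self.lr, self.betas = lr, betas

graph = {"_target_": "adam", "lr": 1e-3, "betas": [0.9, 0.95]}
opt = materialise(graph, {"adam": Adam})
print(opt.lr, opt.betas)  # 0.001 [0.9, 0.95]
```

The Python-source export is a different consumer of the same graph: instead of calling constructors, it emits the equivalent source text.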
Inspection commands:
forgather -t config.yaml pp # Preprocess Jinja2 → YAML
forgather -t config.yaml graph --format yaml # Parsed node graph
forgather -t config.yaml targets # Constructable objects in the graph
forgather -t config.yaml code --target model # Python-source export of a target (debug / model export)
forgather -t config.yaml construct --target model --call   # Materialise and show the constructed object

When you hit a config bug, start with `forgather ls -d` (dumps the
preprocessed file with YAML errors, or the Jinja2 error if
preprocessing itself failed), then escalate to `pp --debug` (dumps
every template in the chain).
- examples/tutorials/tiny_llama -- trains a 5M-param Llama in ~10 minutes; covers config anatomy, dynamic CLI args, monitoring, control, eval, inference, and exporting to plain HF format. Start here.
- examples/tutorials/projects_overview -- how Forgather's multi-project layout is organised.
- examples/tutorials/project_composition -- cross-project composition (datasets / models / evals as independent projects that reference each other).
- examples/tutorials/hp_lovecraft_project -- fine-tune Mistral-7B / Llama-2-7B on the complete works of H.P. Lovecraft on a single 24 GB GPU. Long-context (up to 53K tokens), YaRN, gradient checkpointing, activation offloading.
forgather -i

Drops you into a shell where every subcommand works without the
`forgather` prefix (so `pp`, `ls`, `train` instead of `forgather pp`,
etc.). Convenient for quick iteration inside a single project.
Forgather ships with a library of worked examples that go well beyond
the tutorials. The ones below are the best starting points for each
journey -- each has a detailed README with reproducible commands and,
where relevant, a headline result. For the full directory map, see
examples/README.md.
| Journey | Project |
|---|---|
| Pretrain from scratch | examples/pretrain/small-llm |
| Fine-tune a 7B model (multi-GPU) | examples/finetune/samantha |
| Instruction / reasoning fine-tune | examples/finetune/open-orca |
| Long-context fine-tuning + RoPE recipes | examples/tutorials/hp_lovecraft_project |
| Cut peak memory | examples/tiny_experiments/peak_memory |
| Pick an optimizer | examples/tiny_experiments/optimizers |
| Pipeline-parallel recipes | examples/tiny_experiments/pipeline_parallel |
| Decentralised / bandwidth-limited training | examples/tiny_experiments/diloco |
pretrain/small-llm -- a
162M-parameter Llama trained from scratch on the SmolLM corpus
(FineWeb-Edu + Cosmopedia) with packed sequences and flex-attention.
Ten production-ready configs covering 1× and 10× Chinchilla budgets,
AdamW / Adafactor / bf16 variants, and the "Canon-A" custom
architecture variant. Reproducible Chinchilla scaling-law plots via
forgather logs plot. Runs on the lm_training_project.yaml base
template.
finetune/samantha -- fine-tune
Mistral-7B or Llama-3.2-1B on the Samantha conversational dataset
across every trainer backend in the library. Configs cover single-GPU,
2/4-GPU pipeline parallel, FSDP-2, and DDP. Documented throughput
(~8.9K tok/s on 4× RTX 4090 pipeline) and multi-node training notes.
The most-referenced finetune project -- most other recipes cross-link
to it rather than duplicating the setup.
finetune/open-orca --
instruction + reasoning fine-tune on Open-Orca, complementing the
Samantha chat-persona work. The companion to Samantha for learners:
ChatML-formatted evaluation prompts covering chain-of-thought math,
logic puzzles, reading comprehension, summarisation, and
format-constrained instruction following (wired into the textgen
callback). Uses Forgather's fast iterable-dataset loader -- Llama-3.2-1B
on a 1B-token budget completes in ~11 hours on 4× RTX 4090,
with initialisation in seconds rather than the ~10 min a naive load
would take. Headline run includes a full inference-server eval
script as an appendix.
tutorials/hp_lovecraft_project
-- fine-tune Mistral-7B / Llama-2-7B on the complete works of H.P.
Lovecraft on a single 24 GB GPU. Fits up to 53 K tokens of
context at 7B. Its companion
long_context_experiments.md
documents a four-way RoPE comparison (plain, YaRN, Llama-3
NTK-by-parts, bumped θ) evaluating 8K-trained models out to 16K on
held-out text. Headline: bumping `rope_theta` to 500,000 is the
single biggest intervention for extrapolation, and Llama-3-style
scaling adds a small further win. YaRN with a factor that doesn't
cover the deployment window is catastrophic. The doc ends with a
follow-up proposal for pretraining recipes.
tiny_experiments/peak_memory
-- a systematic 9-way ablation of memory-optimisation techniques
(BF16, activation checkpointing, torch.compile, fused optimizer
step, activation-memory budget) on a 1.6 B model. Headline:
81% peak-memory reduction (BF16 + fused checkpointing + optimizer
fusion) at ~2.7× throughput over the unoptimised baseline.
Pareto-frontier plots included.
tiny_experiments/optimizers
-- empirical comparison of ten optimisers (Muon, Apollo, AdamW,
Adafactor, SinkGD, SGD, etc.) on a 30M Llama trained on the SmolLM
corpus. Headline: Muon wins at small batch (eval loss 2.6778 vs
AdamW 2.7392), and beta2 scaling becomes critical at large batch.
References Marek et al. on small-batch SGD viability, the Muon paper,
Apollo, SinkGD. Includes per-optimiser memory / throughput tiers and
implementation-maturity notes.
tiny_experiments/pipeline_parallel
-- test harness and reference configs for PyTorch's pipeline-parallel
schedules (GPipe, 1F1B, ZBV, interleaved), with checkpoint save/resume
coverage across 2/4-GPU setups.
tiny_experiments/diloco
-- DiLoCo (distributed local SGD) on a 4M-parameter model. Pseudo-
gradient compression, streaming-fragment overlap with backward pass,
sync and async modes. The lowest-communication-bandwidth trainer in
the library -- pair with the pipeline-parallel recipes above when
nodes aren't co-located.
- Scaffold a new project with `forgather project create` (inside an existing workspace) or `forgather ws create` (a brand-new workspace). These commands generate a minimum-working `meta.yaml` + `templates/` tree that extends the recommended base templates. Full walk-through: the Tiny Llama tutorial.
- `examples/base_lm_project` -- a bare harness that drives the raw `projects/lm_training_project.yaml` template with no project-specific overrides. Useful for inspecting what the base template does on its own, and for debugging changes to the base template itself, but not a typical starting point for new work.