Distribute Sinter QEC simulation jobs across cloud spot instances using SkyPilot.
Sinter handles Monte Carlo sampling of quantum error correction circuits on a single machine — multicore parallelism, adaptive batch sizing, decoder integration, and resumable CSV output. qec-kiln adds multi-node distribution by partitioning circuit files across spot instances, with each node running an unmodified sinter collect process.
From the Sinter documentation:
"Sinter doesn't support cloud compute, but it does scale well on a single machine."
For many QEC studies, a single machine is sufficient. However, large parameter sweeps — many code distances, noise rates, decoder comparisons, and code families — can take days on a single workstation. Chatterjee et al. reported running 8,640 Stim experiments over the course of weeks using four parallel workers, with individual runs at distance 25 taking approximately 62 hours.
qec-kiln distributes circuit variants across cloud spot instances. Each instance runs a standard sinter collect process. SkyPilot handles provisioning, spot recovery, and cost optimization. qec-kiln handles circuit partitioning and CSV merging.
The system has three layers:
- Sinter (inner loop): On each node,
sinter collectmanages multicore sampling, adaptive batch sizing, decoder invocation, and resumable CSV output. This code is unmodified. - SkyPilot (outer loop): Provisions spot/preemptible instances across cloud providers and Kubernetes clusters, handles preemption recovery, and selects the lowest-cost available compute.
- qec-kiln (glue): Partitions circuit files across jobs, launches SkyPilot managed jobs, and merges the resulting CSV fragments via
sinter combine.
┌─────────────────────────────────────────────────────────────┐
│ Researcher │
│ │
│ 1. Generate .stim circuit files (as you normally would) │
│ 2. ./scripts/launch.sh circuits/ --decoders pymatching │
│ 3. ./scripts/merge.py --bucket s3://sinter-results │
│ 4. sinter plot --in merged.csv ... │
└────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ SkyPilot Managed Jobs Controller │
│ Routes each circuit batch to cheapest spot instance │
└────────┬───────────┬───────────┬───────────┬────────────────┘
▼ ▼ ▼ ▼
┌──────────┐┌──────────┐┌──────────┐┌──────────┐
│ Spot #1 ││ Spot #2 ││ Spot #3 ││ Spot #N │
│ ││ ││ ││ │
│ sinter ││ sinter ││ sinter ││ sinter │
│ collect ││ collect ││ collect ││ collect │
│ d=3,5 ││ d=7,9 ││ d=11,13 ││ d=15,17 │
│ ↓ ││ ↓ ││ ↓ ││ ↓ │
│ stats.csv││ stats.csv││ stats.csv││ stats.csv│
└────┬─────┘└────┬─────┘└────┬─────┘└────┬─────┘
└───────────┴───────────┴───────────┘
│
▼
┌─────────────────┐
│ Cloud Storage │
│ (S3/GCS/PVC) │
│ │
│ CSV fragments │
│ per job │
└────────┬────────┘
│
▼
┌─────────────────┐
│ sinter combine │
│ → merged.csv │
│ → sinter plot │
└─────────────────┘
For small sweeps, sinter collect on a single machine is sufficient. qec-kiln is useful when:
- The sweep has many circuit variants (50+) and single-machine wall time becomes a constraint
- No high-core-count workstation is available and cloud spot instances are more cost-effective
- GPU-accelerated decoders or tsim (Clifford+T) circuits require GPU hardware not available locally
- Multiple decoder comparisons (PyMatching, fusion_blossom, Tesseract) can be run in parallel
- A shared lab cluster is oversubscribed and cloud compute can supplement it
qec-kiln uses SkyPilot for cloud orchestration rather than cloud-specific SDKs or custom job scheduling infrastructure.
- Multi-cloud resource selection. A single YAML specifies resource requirements (e.g., 16+ CPUs or 1x A100 GPU). SkyPilot selects the lowest-cost spot instance across configured backends (AWS, GCP, Azure, Kubernetes, and others). If one provider lacks capacity, it falls through to the next.
- Spot preemption recovery. When a spot instance is reclaimed, SkyPilot provisions a replacement and restarts the job. Combined with Sinter's
--save_resume_filepath, no work is lost. - Managed job controller.
sky jobs launchsubmits jobs to a controller that manages lifecycle, log streaming, and failure recovery. A single controller handles 2,000+ concurrent jobs. - Slurm and Kubernetes support. The same YAML can target HPC clusters via Slurm or Kubernetes deployments, in addition to cloud providers.
Sinter collection jobs have properties that align well with SkyPilot's execution model:
- Embarrassingly parallel. Each batch of circuits runs independently with no inter-node communication. This maps directly to independent managed jobs.
- Spot-compatible by construction. Sinter writes incremental CSV output and resumes from partial results. SkyPilot handles instance recovery; Sinter handles state recovery.
- Short-lived. Typical batches complete in 30–120 minutes. Short jobs have low preemption probability, and SkyPilot terminates instances on completion.
- GPU access without ownership. tsim circuits and GPU-accelerated decoders can run on spot A100 instances (e.g., ~$0.70–1.50/GPU-hour on current providers).
- Reproducible. The YAML is a complete job specification. Anyone with a configured cloud account can run the same sweep.
- Python 3.9+
- SkyPilot:
pip install "skypilot[kubernetes,aws,gcp]" - At least one backend passing
sky check - Stim circuit files (
.stim) with detectors and observables annotated
Use Stim's built-in generators or your own circuit construction code. qec-kiln does not modify this step.
import stim
for p in [0.001, 0.003, 0.005, 0.008, 0.01]:
for d in [3, 5, 7, 9, 11, 13]:
circuit = stim.Circuit.generated(
code_task="surface_code:rotated_memory_x",
distance=d,
rounds=d,
after_clifford_depolarization=p,
after_reset_flip_probability=p,
before_measure_flip_probability=p,
before_round_data_depolarization=p,
)
path = f"circuits/d={d},p={p},b=X,type=rotated_surface_memory.stim"
with open(path, "w") as f:
print(circuit, file=f)# Distribute circuits across spot instances, 6 circuits per job
./scripts/launch.sh circuits/ \
--decoders pymatching \
--max-shots 1_000_000 \
--max-errors 1000 \
--circuits-per-job 6
# Or dry-run to see what would be launched
./scripts/launch.sh circuits/ --dry-runsky jobs queue # job status
sky jobs logs <job_id> # stream sinter's progress output# Merge CSV fragments from all jobs
python scripts/merge.py --bucket s3://sinter-results --out merged.csv
# Plot using Sinter's built-in plotting (unchanged from normal workflow)
sinter plot \
--in merged.csv \
--group_func "f'Rotated Surface Code d={m.d}'" \
--x_func m.p \
--xaxis "[log]Physical Error Rate" \
--fig_size 1024 1024 \
--out threshold.pngqec-kiln/
├── README.md
├── LICENSE # Apache 2.0
├── SECURITY.md # Vulnerability reporting and security considerations
├── CONTRIBUTING.md # How to contribute
├── CODE_OF_CONDUCT.md # Community guidelines
├── CHANGELOG.md # Release history
├── .github/
│ ├── ISSUE_TEMPLATE/
│ │ ├── bug_report.md
│ │ └── feature_request.md
│ └── PULL_REQUEST_TEMPLATE.md
├── configs/
│ ├── sinter_job.yaml # SkyPilot task: runs sinter collect on one batch
│ └── sinter_job_gpu.yaml # GPU variant (for tsim circuits or GPU decoders)
├── scripts/
│ ├── launch.sh # Partition circuits and launch SkyPilot jobs
│ ├── merge.py # Merge CSV fragments using sinter combine
│ └── partition.py # Split circuit files into balanced batches
├── examples/
│ ├── generate_surface_codes.py # Generate example circuit files
│ └── generate_tsim_circuits.py # Generate Clifford+T circuits for tsim
├── docker/
│ ├── Dockerfile.cpu # Stim + Sinter + PyMatching (CPU)
│ └── Dockerfile.gpu # + tsim + CUDA (GPU)
├── docs/
│ └── design.md # Architecture decisions and trade-offs
└── .gitignore
Sinter already parallelizes shots across CPU cores within a single machine. Splitting shots for the same circuit across multiple machines would require external coordination to merge partial statistics correctly. Splitting circuits across machines is trivially parallel — each job gets a disjoint set of .stim files, runs sinter collect independently, and produces a self-contained CSV. Merging is just sinter combine (concatenation + deduplication by strong_id).
Sinter has a Sampler API that allows custom sampling backends. We could theoretically build a CloudSampler that dispatches work to remote nodes. But this would mean reimplementing job scheduling, failure recovery, and cost optimization — all things SkyPilot already does. Keeping the integration at the job level (one SkyPilot job = one sinter collect process) is simpler, more robust, and easier to reason about.
Sinter's CSV format is the standard interchange for QEC benchmarking. Sinter's combine and plot commands operate on it, PyMatching's documentation uses it, and research papers publish threshold plots generated from it. A custom format would require reimplementing existing tooling.
Sinter's --save_resume_filepath flag writes statistics incrementally. If a spot instance is preempted and SkyPilot restarts the job, Sinter reads the existing CSV and counts collected data toward its targets. No additional checkpointing is needed.
qec-kiln works with any simulator or decoder that Sinter supports:
Simulators: Stim (Clifford, CPU), tsim (Clifford+T, CPU/GPU)
Decoders: PyMatching, fusion_blossom, Tesseract, chromobius, NVIDIA TensorRT decoder, or any custom sinter.Decoder subclass
Infrastructure: Any SkyPilot backend — Kubernetes, AWS, GCP, Azure, Lambda, RunPod, Nebius, and 15+ others
For groups running qec-kiln regularly, a platform engineer can manage:
- Cluster configuration: K8s node pools, NVIDIA device plugins, Kueue for GPU scheduling
- Cost management: Spot instance policies, multi-cloud fallback, autostop, per-user budgets
- Container images: Pinned Stim/Sinter/decoder versions, CI rebuilds on upstream releases
- Monitoring: Job completion rates, GPU utilization, cost tracking
- Storage: S3/GCS bucket or K8s PVC configuration with retention policies
Apache-2.0