qec-kiln

Distribute Sinter QEC simulation jobs across cloud spot instances using SkyPilot.

Sinter handles Monte Carlo sampling of quantum error correction circuits on a single machine — multicore parallelism, adaptive batch sizing, decoder integration, and resumable CSV output. qec-kiln adds multi-node distribution by partitioning circuit files across spot instances, with each node running an unmodified sinter collect process.

Motivation

From the Sinter documentation:

"Sinter doesn't support cloud compute, but it does scale well on a single machine."

For many QEC studies, a single machine is sufficient. However, large parameter sweeps — many code distances, noise rates, decoder comparisons, and code families — can take days on a single workstation. Chatterjee et al. reported running 8,640 Stim experiments over the course of weeks using four parallel workers, with individual runs at distance 25 taking approximately 62 hours.

qec-kiln distributes circuit variants across cloud spot instances. Each instance runs a standard sinter collect process. SkyPilot handles provisioning, spot recovery, and cost optimization. qec-kiln handles circuit partitioning and CSV merging.

How it works

The system has three layers:

Sinter (inner loop): On each node, sinter collect manages multicore sampling, adaptive batch sizing, decoder invocation, and resumable CSV output. This code is unmodified.
SkyPilot (outer loop): Provisions spot/preemptible instances across cloud providers and Kubernetes clusters, handles preemption recovery, and selects the lowest-cost available compute.
qec-kiln (glue): Partitions circuit files across jobs, launches SkyPilot managed jobs, and merges the resulting CSV fragments via sinter combine.

┌─────────────────────────────────────────────────────────────┐
│  Researcher                                                 │
│                                                             │
│  1. Generate .stim circuit files (as you normally would)    │
│  2. ./scripts/launch.sh circuits/ --decoders pymatching     │
│  3. ./scripts/merge.py --bucket s3://sinter-results         │
│  4. sinter plot --in merged.csv ...                         │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│  SkyPilot Managed Jobs Controller                           │
│  Routes each circuit batch to cheapest spot instance        │
└────────┬───────────┬───────────┬───────────┬────────────────┘
         ▼           ▼           ▼           ▼
   ┌──────────┐┌──────────┐┌──────────┐┌──────────┐
   │ Spot #1  ││ Spot #2  ││ Spot #3  ││ Spot #N  │
   │          ││          ││          ││          │
   │ sinter   ││ sinter   ││ sinter   ││ sinter   │
   │ collect  ││ collect  ││ collect  ││ collect  │
   │ d=3,5    ││ d=7,9    ││ d=11,13  ││ d=15,17  │
   │ ↓        ││ ↓        ││ ↓        ││ ↓        │
   │ stats.csv││ stats.csv││ stats.csv││ stats.csv│
   └────┬─────┘└────┬─────┘└────┬─────┘└────┬─────┘
        └───────────┴───────────┴───────────┘
                         │
                         ▼
                ┌─────────────────┐
                │  Cloud Storage  │
                │  (S3/GCS/PVC)   │
                │                 │
                │  CSV fragments  │
                │  per job        │
                └────────┬────────┘
                         │
                         ▼
                ┌─────────────────┐
                │  sinter combine │
                │  → merged.csv   │
                │  → sinter plot  │
                └─────────────────┘

When to use this

For small sweeps, sinter collect on a single machine is sufficient. qec-kiln is useful when:

The sweep has many circuit variants (50+) and single-machine wall time becomes a constraint
No high-core-count workstation is available and cloud spot instances are more cost-effective
GPU-accelerated decoders or tsim (Clifford+T) circuits require GPU hardware not available locally
Multiple decoder comparisons (PyMatching, fusion_blossom, Tesseract) can be run in parallel
A shared lab cluster is oversubscribed and cloud compute can supplement it

Why SkyPilot?

qec-kiln uses SkyPilot for cloud orchestration rather than cloud-specific SDKs or custom job scheduling infrastructure.

General properties

Multi-cloud resource selection. A single YAML specifies resource requirements (e.g., 16+ CPUs or 1x A100 GPU). SkyPilot selects the lowest-cost spot instance across configured backends (AWS, GCP, Azure, Kubernetes, and others). If one provider lacks capacity, it falls through to the next.
Spot preemption recovery. When a spot instance is reclaimed, SkyPilot provisions a replacement and restarts the job. Combined with Sinter's --save_resume_filepath, no work is lost.
Managed job controller. sky jobs launch submits jobs to a controller that manages lifecycle, log streaming, and failure recovery. A single controller handles 2,000+ concurrent jobs.
Slurm and Kubernetes support. The same YAML can target HPC clusters via Slurm or Kubernetes deployments, in addition to cloud providers.

Fit for QEC simulation

Sinter collection jobs have properties that align well with SkyPilot's execution model:

Embarrassingly parallel. Each batch of circuits runs independently with no inter-node communication. This maps directly to independent managed jobs.
Spot-compatible by construction. Sinter writes incremental CSV output and resumes from partial results. SkyPilot handles instance recovery; Sinter handles state recovery.
Short-lived. Typical batches complete in 30–120 minutes. Short jobs have low preemption probability, and SkyPilot terminates instances on completion.
GPU access without ownership. tsim circuits and GPU-accelerated decoders can run on spot A100 instances (e.g., ~$0.70–1.50/GPU-hour on current providers).
Reproducible. The YAML is a complete job specification. Anyone with a configured cloud account can run the same sweep.

Quick start

Prerequisites

Python 3.9+
SkyPilot: pip install "skypilot[kubernetes,aws,gcp]"
At least one backend passing sky check
Stim circuit files (.stim) with detectors and observables annotated

1. Generate your circuits

Use Stim's built-in generators or your own circuit construction code. qec-kiln does not modify this step.

import stim

for p in [0.001, 0.003, 0.005, 0.008, 0.01]:
    for d in [3, 5, 7, 9, 11, 13]:
        circuit = stim.Circuit.generated(
            code_task="surface_code:rotated_memory_x",
            distance=d,
            rounds=d,
            after_clifford_depolarization=p,
            after_reset_flip_probability=p,
            before_measure_flip_probability=p,
            before_round_data_depolarization=p,
        )
        path = f"circuits/d={d},p={p},b=X,type=rotated_surface_memory.stim"
        with open(path, "w") as f:
            print(circuit, file=f)

2. Launch the sweep

# Distribute circuits across spot instances, 6 circuits per job
./scripts/launch.sh circuits/ \
  --decoders pymatching \
  --max-shots 1_000_000 \
  --max-errors 1000 \
  --circuits-per-job 6

# Or dry-run to see what would be launched
./scripts/launch.sh circuits/ --dry-run

3. Monitor

sky jobs queue                    # job status
sky jobs logs <job_id>            # stream sinter's progress output

4. Collect and plot

# Merge CSV fragments from all jobs
python scripts/merge.py --bucket s3://sinter-results --out merged.csv

# Plot using Sinter's built-in plotting (unchanged from normal workflow)
sinter plot \
  --in merged.csv \
  --group_func "f'Rotated Surface Code d={m.d}'" \
  --x_func m.p \
  --xaxis "[log]Physical Error Rate" \
  --fig_size 1024 1024 \
  --out threshold.png

Project structure

qec-kiln/
├── README.md
├── LICENSE                      # Apache 2.0
├── SECURITY.md                  # Vulnerability reporting and security considerations
├── CONTRIBUTING.md              # How to contribute
├── CODE_OF_CONDUCT.md           # Community guidelines
├── CHANGELOG.md                 # Release history
├── .github/
│   ├── ISSUE_TEMPLATE/
│   │   ├── bug_report.md
│   │   └── feature_request.md
│   └── PULL_REQUEST_TEMPLATE.md
├── configs/
│   ├── sinter_job.yaml          # SkyPilot task: runs sinter collect on one batch
│   └── sinter_job_gpu.yaml      # GPU variant (for tsim circuits or GPU decoders)
├── scripts/
│   ├── launch.sh                # Partition circuits and launch SkyPilot jobs
│   ├── merge.py                 # Merge CSV fragments using sinter combine
│   └── partition.py             # Split circuit files into balanced batches
├── examples/
│   ├── generate_surface_codes.py    # Generate example circuit files
│   └── generate_tsim_circuits.py    # Generate Clifford+T circuits for tsim
├── docker/
│   ├── Dockerfile.cpu           # Stim + Sinter + PyMatching (CPU)
│   └── Dockerfile.gpu           # + tsim + CUDA (GPU)
├── docs/
│   └── design.md                # Architecture decisions and trade-offs
└── .gitignore

Design decisions

Why partition circuits across jobs (not shots)?

Sinter already parallelizes shots across CPU cores within a single machine. Splitting shots for the same circuit across multiple machines would require external coordination to merge partial statistics correctly. Splitting circuits across machines is trivially parallel — each job gets a disjoint set of .stim files, runs sinter collect independently, and produces a self-contained CSV. Merging is just sinter combine (concatenation + deduplication by strong_id).

Why not build a custom Sinter sampler?

Sinter has a Sampler API that allows custom sampling backends. We could theoretically build a CloudSampler that dispatches work to remote nodes. But this would mean reimplementing job scheduling, failure recovery, and cost optimization — all things SkyPilot already does. Keeping the integration at the job level (one SkyPilot job = one sinter collect process) is simpler, more robust, and easier to reason about.

Why Sinter's CSV format?

Sinter's CSV format is the standard interchange for QEC benchmarking. Sinter's combine and plot commands operate on it, PyMatching's documentation uses it, and research papers publish threshold plots generated from it. A custom format would require reimplementing existing tooling.

Spot instance compatibility

Sinter's --save_resume_filepath flag writes statistics incrementally. If a spot instance is preempted and SkyPilot restarts the job, Sinter reads the existing CSV and counts collected data toward its targets. No additional checkpointing is needed.

Compatibility

qec-kiln works with any simulator or decoder that Sinter supports:

Simulators: Stim (Clifford, CPU), tsim (Clifford+T, CPU/GPU)

Decoders: PyMatching, fusion_blossom, Tesseract, chromobius, NVIDIA TensorRT decoder, or any custom sinter.Decoder subclass

Infrastructure: Any SkyPilot backend — Kubernetes, AWS, GCP, Azure, Lambda, RunPod, Nebius, and 15+ others

Operational considerations

For groups running qec-kiln regularly, a platform engineer can manage:

Cluster configuration: K8s node pools, NVIDIA device plugins, Kueue for GPU scheduling
Cost management: Spot instance policies, multi-cloud fallback, autostop, per-user budgets
Container images: Pinned Stim/Sinter/decoder versions, CI rebuilds on upstream releases
Monitoring: Job completion rates, GPU utilization, cost tracking
Storage: S3/GCS bucket or K8s PVC configuration with retention policies

License

Apache-2.0

References

Stim — Craig Gidney, Google Quantum AI
Sinter — Statistical collection subproject of Stim
tsim — Clifford+T simulator via ZX stabilizer rank decomposition
PyMatching — MWPM decoder
Tesseract — Search-based QEC decoder
SkyPilot — Multi-cloud job orchestration

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

qec-kiln

Motivation

How it works

When to use this

Why SkyPilot?

General properties

Fit for QEC simulation

Quick start

Prerequisites

1. Generate your circuits

2. Launch the sweep

3. Monitor

4. Collect and plot

Project structure

Design decisions

Why partition circuits across jobs (not shots)?

Why not build a custom Sinter sampler?

Why Sinter's CSV format?

Spot instance compatibility

Compatibility

Operational considerations

License

References

About

Uh oh!

Releases 2

Packages

Uh oh!

Uh oh!

Contributors 1

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.github		.github
patches		patches
spike_tsim		spike_tsim
upstream/archived		upstream/archived
.gitignore		.gitignore
BENCHMARK.md		BENCHMARK.md
CHANGELOG.md		CHANGELOG.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
Dockerfile.cpu		Dockerfile.cpu
Dockerfile.gpu		Dockerfile.gpu
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
collect_tsim.py		collect_tsim.py
design.md		design.md
generate_surface_codes.py		generate_surface_codes.py
generate_tsim_circuits.py		generate_tsim_circuits.py
launch.sh		launch.sh
merge.py		merge.py
partition.py		partition.py
partition_tsim.py		partition_tsim.py
sinter_job.yaml		sinter_job.yaml
sinter_job_docker.yaml		sinter_job_docker.yaml
sinter_job_gpu.yaml		sinter_job_gpu.yaml
smoke_test_fixed_observable.py		smoke_test_fixed_observable.py
test_collect_tsim.py		test_collect_tsim.py
test_generate_circuits.py		test_generate_circuits.py
tsim_sampler.py		tsim_sampler.py
validate_gpu.sh		validate_gpu.sh

Folders and files

Latest commit

History

Repository files navigation

qec-kiln

Motivation

How it works

When to use this

Why SkyPilot?

General properties

Fit for QEC simulation

Quick start

Prerequisites

1. Generate your circuits

2. Launch the sweep

3. Monitor

4. Collect and plot

Project structure

Design decisions

Why partition circuits across jobs (not shots)?

Why not build a custom Sinter sampler?

Why Sinter's CSV format?

Spot instance compatibility

Compatibility

Operational considerations

License

References

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Uh oh!

Contributors 1

Languages

Packages