Add demo_multiprocessing notebook (auto-tuning) by kochjens · Pull Request #22 · scqubits/scqubits-examples

kochjens · 2026-06-04T12:42:09Z

New demo_multiprocessing notebook centered on letting scqubits choose the multiprocessing settings: recommend_parallelization, num_cpus="auto", and the one-time calibrate_parallelization. It then explains what's being balanced — the grid break-even and the BLAS oversubscription cliff — and covers manual control and the __main__ guard for scripts.

Demonstrates the parallelization heuristic + calibration shipping in the corresponding scqubits release.

Demonstrates efficient parameter sweeps: the interaction between num_cpus and BLAS/LAPACK threads, capping BLAS threads before import, and timing num_cpus on the user's own machine. Cells are left unexecuted; timings are machine-specific.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The previous version led the reader to expect a num_cpus speedup, but its default example (small system, small grid) is in the regime where multiprocessing is net-negative -- so the first hands-on result showed num_cpus=2 slower than 1 with no explanation, and the "it helps for larger grids" caveat came afterwards. Restructure around expectation-setting and two measured regimes:

- Open with the mental model: parallelism helps only when grid-size x per-point cost greatly exceeds the per-task overhead; on small/cheap sweeps it does nothing or slows you down -- that is normal, keep num_cpus=1.
- Cap BLAS threads FIRST (before imports), with a threadpoolctl verification cell (the #1 cause of confusing/backwards timings is the cap not taking effect).
- Demo 1: a cheap system (dim 216, ~5 ms/pt) where num_cpus does not help.
- Demo 2: the same system at larger truncated_dim (dim 512, ~40 ms/pt) where num_cpus gives ~2.8x -- the only change is per-point cost.
- Explain the BLAS oversubscription footgun with the measured ~90x cliff, note num_cpus x BLAS-threads ~ cores, and point to sparse diagonalization as the bigger lever for large composite systems.

All illustrative numbers are real measurements (10-core laptop); timing cells are left runnable but unexecuted (run-it-yourself). Code paths smoke-tested.
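The mental model in this commit message (grid size x per-point cost versus overhead) can be sketched as a toy cost model. This is an illustration only, not the heuristic scqubits ships; the function name and the `startup_s` and `per_task_overhead_s` defaults are made-up placeholders, not measurements from this PR.

```python
import math


def estimated_speedup(n_points, per_point_s, num_cpus,
                      startup_s=0.1, per_task_overhead_s=0.001):
    """Toy estimate of serial_time / parallel_time for a parameter sweep.

    startup_s and per_task_overhead_s are assumed placeholder values;
    replace them with numbers timed on your own machine.
    """
    if num_cpus <= 1:
        return 1.0
    serial = n_points * per_point_s
    # Each worker handles ceil(n_points / num_cpus) points, and every point
    # pays its compute cost plus a fixed dispatch/IPC overhead.
    points_per_worker = math.ceil(n_points / num_cpus)
    parallel = startup_s + points_per_worker * (per_point_s + per_task_overhead_s)
    return serial / parallel


for n_points in (16, 96, 384):
    row = ", ".join(
        f"num_cpus={n}: {estimated_speedup(n_points, 0.005, n):.2f}x"
        for n in (1, 2, 4)
    )
    print(f"{n_points:>4} points -> {row}")
```

With these placeholder overheads the small grid comes out below 1x and the large grid well above it, which is the qualitative break-even behavior described above.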

The multiprocessing follow-up (per-task IPC reduced from O(N^2) to O(N)) lowered per-task overhead enough that the previous "light" system (dim 216) now benefits from num_cpus even at 96 points, so the old truncated_dim contrast no longer demonstrated "parallelism does not help." Re-anchor the two demos on GRID SIZE, the lever users actually vary, measured on the integration of all current perf branches:

- Demo 1: same system, 16 points -> num_cpus 1/2/4 are flat (0.10 s, ~1x/0.9x), the normal "too few points to amortize" regime.
- Demo 2: same system, 384 points -> 2.14 s -> 0.90 s at 4 cores (2.37x).

Single make_sweep(num_cpus, n_points) factory; refreshed illustrative tables; note the crossover also shifts with per-point cost. Verified against the combined branches: cap-first ordering keeps BLAS at 1 thread, and the matrix-element matmul fix means the coupled-transmon sweep now runs with zero spurious matmul warnings.
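As a small worked example of how the Demo 2 figures relate, the sketch below computes speedup and parallel efficiency from the two wall times quoted in this commit message. The helper name is hypothetical; because the quoted times are rounded to two decimals, the ratio agrees with the quoted 2.37x only up to rounding of the inputs.

```python
def speedup_and_efficiency(t_serial_s, t_parallel_s, num_cpus):
    """Return (speedup, efficiency), where efficiency = speedup / num_cpus."""
    speedup = t_serial_s / t_parallel_s
    return speedup, speedup / num_cpus


# Wall times quoted for the 384-point grid: 2.14 s serial, 0.90 s at 4 cores.
speedup, efficiency = speedup_and_efficiency(2.14, 0.90, 4)
print(f"speedup {speedup:.2f}x, parallel efficiency {efficiency:.0%}")
```

An efficiency well below 100% at 4 cores is expected here, since process startup and per-task overhead are not parallelized.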

Add a "Letting scqubits pick the settings" section showing recommend_parallelization (with a runnable cell over the 16- and 384-point grids), num_cpus="auto", AUTO_PARALLEL, and calibrate_parallelization; replace the dead tools/autotune_multiprocessing.py reference. Also fold in the earlier demo fixes (restored runnable second half, Mac mini timing labels, the __main__-guard note for plain scripts, and the corrected fork/spawn BLAS-cap wording).

Rewrite from scratch so the main story is the shipped capability: let scqubits choose num_cpus and the BLAS-thread cap. Lead with recommend_parallelization, num_cpus="auto", and AUTO_PARALLEL; show a serial-vs-auto correctness check; then the one-time calibrate_parallelization(). The manual mechanics (grid break-even, the BLAS oversubscription cliff, manual settings, sparse-first, the __main__ guard) are demoted to supporting "why" and "manual control" sections. American English throughout. 17 cells, down from 25.
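For context on the `__main__` guard mentioned above, here is a minimal standard-library sketch of the pattern for a plain script. It does not use scqubits; `point_cost` and `run_sweep` are hypothetical stand-ins for one sweep point and the sweep driver.

```python
import multiprocessing as mp


def point_cost(x):
    """Stand-in for one sweep point (a real sweep would diagonalize here)."""
    return x * x


def run_sweep(values, num_workers=1):
    """Evaluate point_cost over values, serially or with a process pool."""
    if num_workers <= 1:
        return [point_cost(v) for v in values]
    with mp.Pool(processes=num_workers) as pool:
        return pool.map(point_cost, values)


if __name__ == "__main__":
    # With the spawn start method (the default on macOS and Windows), each
    # worker re-imports this script. Unguarded top-level code would run
    # again in every worker, so the sweep is launched only under this guard.
    print(run_sweep(range(8), num_workers=2))
```

Notebook cells do not need the guard in the same way, which is presumably why the note is aimed at plain scripts.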

review-notebook-app · 2026-06-04T12:42:14Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

kochjens and others added 5 commits May 31, 2026 17:39

kochjens added 3 commits June 5, 2026 23:18

docs(demo): note MULTIPROC_BLAS_THREADS defaults to "auto"

5fa0767

The manual-control section now explains that the BLAS-thread cap is automatic by default ("auto" = cores // num_cpus); setting an int overrides it and None opts out.

docs(demo): reassure existing users, clarify auto vs AUTO_PARALLEL, recalibration

04a415f

docs(demo): drop update-from-older-scqubits aside; keep current-state intro

7878728
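The rule stated in commit 5fa0767 ("auto" = cores // num_cpus; an int overrides; None opts out) can be sketched as follows. This illustrates the described behavior only, not the scqubits implementation; the function name, the use of `os.cpu_count()`, and the floor of one thread are assumptions.

```python
import os


def resolve_blas_threads(setting, num_cpus, cores=None):
    """Illustrative resolution of a BLAS-thread cap setting (hypothetical).

    setting: "auto", an int, or None.
    Returns the thread cap to apply, or None to leave BLAS threading alone.
    """
    if setting is None:
        return None  # opt out: do not touch the BLAS thread count
    if setting == "auto":
        cores = cores or os.cpu_count() or 1
        # Keep num_cpus x BLAS threads at or below the core count;
        # the floor of 1 is an assumption of this sketch.
        return max(1, cores // max(1, num_cpus))
    return int(setting)  # an explicit integer overrides the automatic rule


for num_cpus in (1, 2, 4, 16):
    print(num_cpus, resolve_blas_threads("auto", num_cpus, cores=10))
```

Keeping the product of worker processes and BLAS threads near the core count is the same "num_cpus x BLAS-threads ~ cores" guideline given earlier in this PR.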

kochjens wants to merge 8 commits into main from add-multiprocessing-demo
