Add demo_multiprocessing notebook (auto-tuning) #22

Open
kochjens wants to merge 8 commits into main from add-multiprocessing-demo

kochjens (Member) commented Jun 4, 2026

New demo_multiprocessing notebook centered on letting scqubits choose the multiprocessing settings: recommend_parallelization, num_cpus="auto", and the one-time calibrate_parallelization. It then explains what is being balanced (the grid break-even and the BLAS oversubscription cliff) and covers manual control and the __main__ guard for scripts.

Demonstrates the parallelization heuristic + calibration shipping in the corresponding scqubits release.

kochjens and others added 5 commits May 31, 2026 17:39
Demonstrates efficient parameter sweeps: the interaction between num_cpus and
BLAS/LAPACK threads, capping BLAS threads before import, and timing num_cpus on
the user's own machine. Cells are left unexecuted; timings are machine-specific.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The previous version led the reader to expect a num_cpus speedup, but its default
example (small system, small grid) is in the regime where multiprocessing is
net-negative -- so the first hands-on result showed num_cpus=2 slower than 1 with
no explanation, and the "it helps for larger grids" caveat came afterwards.

Restructure around expectation-setting and two measured regimes:
- Open with the mental model: parallelism helps only when grid-size x per-point
  cost greatly exceeds the per-task overhead; on small/cheap sweeps it does
  nothing or slows you down -- that is normal, keep num_cpus=1.
- Cap BLAS threads FIRST (before imports), with a threadpoolctl verification cell
  (the #1 cause of confusing/backwards timings is the cap not taking effect).
- Demo 1: a cheap system (dim 216, ~5 ms/pt) where num_cpus does not help.
- Demo 2: the same system at larger truncated_dim (dim 512, ~40 ms/pt) where
  num_cpus gives ~2.8x -- the only change is per-point cost.
- Explain the BLAS oversubscription footgun with the measured ~90x cliff, note
  num_cpus x BLAS-threads ~ cores, and point to sparse diagonalization as the
  bigger lever for large composite systems.
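The cap-first ordering and the threadpoolctl check from the list above could look roughly like the sketch below. The environment variable names are the standard ones read by OpenMP, OpenBLAS, MKL, Apple's Accelerate, and numexpr. The helper names `cap_blas_threads` and `report_blas_threads` are illustrative only and are not part of scqubits, and the notebook's actual cells may differ.

```python
import os

# Environment variables read by common numerical backends when they load.
BLAS_ENV_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)


def cap_blas_threads(n_threads=1):
    """Set the thread caps. Must run before numpy/scipy/scqubits are imported."""
    for var in BLAS_ENV_VARS:
        os.environ[var] = str(n_threads)


def report_blas_threads():
    """Return (internal_api, num_threads) pairs, or None if the check is unavailable."""
    try:
        import numpy  # noqa: F401  (loads the BLAS library so it can be inspected)
        from threadpoolctl import threadpool_info
    except ImportError:
        return None
    return [(pool.get("internal_api"), pool.get("num_threads"))
            for pool in threadpool_info()]


cap_blas_threads(1)            # step 1: cap first
print(report_blas_threads())   # step 2: only then import and verify
```

If the reported thread counts are larger than the cap, the numerical library was most likely imported before the variables were set, which is the situation the verification cell is meant to catch.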

All illustrative numbers are real measurements (10-core laptop); timing cells are
left runnable but unexecuted (run-it-yourself). Code paths smoke-tested.

The multiprocessing follow-up (per-task IPC reduced from O(N^2) to O(N)) lowered
per-task overhead enough that the previous "light" system (dim 216) now benefits
from num_cpus even at 96 points, so the old truncated_dim contrast no longer
demonstrated "parallelism does not help." Re-anchor the two demos on GRID SIZE,
the lever users actually vary, measured on the integration of all current perf
branches:
- Demo 1: same system, 16 points -> num_cpus 1/2/4 are flat (0.10 s, ~1x/0.9x),
  the normal "too few points to amortize" regime.
- Demo 2: same system, 384 points -> 2.14 s -> 0.90 s at 4 cores (2.37x).
Single make_sweep(num_cpus, n_points) factory; refreshed illustrative tables;
note the crossover also shifts with per-point cost.
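The two grid-size regimes can be read through a deliberately crude cost model, sketched below. Only the 384-point timings (2.14 s serial, 0.90 s at 4 cores) come from the measurements above; the pool startup cost `T_STARTUP` and the per-task overhead `T_TASK` are hypothetical values chosen for illustration, not measured scqubits numbers.

```python
def serial_time(n_points, t_point):
    """Serial sweep: every point costs t_point seconds."""
    return n_points * t_point


def parallel_time(n_points, t_point, num_cpus, t_startup, t_task):
    """Fixed pool startup plus per-point work and overhead, split across workers."""
    return t_startup + n_points * (t_point + t_task) / num_cpus


def speedup(n_points, t_point, num_cpus, t_startup, t_task):
    return serial_time(n_points, t_point) / parallel_time(
        n_points, t_point, num_cpus, t_startup, t_task
    )


def break_even_points(t_point, num_cpus, t_startup, t_task):
    """Grid size at which parallel time equals serial time, or None if it never does."""
    gain_per_point = t_point - (t_point + t_task) / num_cpus
    return t_startup / gain_per_point if gain_per_point > 0 else None


T_POINT = 2.14 / 384   # per-point cost implied by the reported serial timing
T_STARTUP = 0.08       # ASSUMED pool startup cost in seconds (hypothetical)
T_TASK = 0.003         # ASSUMED per-task overhead in seconds (hypothetical)

for n in (16, 384):
    print(n, "points, modeled speedup at 4 cores:",
          round(speedup(n, T_POINT, 4, T_STARTUP, T_TASK), 2))
print("modeled break-even grid size:",
      break_even_points(T_POINT, 4, T_STARTUP, T_TASK))

# Speedup and parallel efficiency computed from the reported (rounded) timings.
measured_speedup = 2.14 / 0.90
measured_efficiency = measured_speedup / 4
print("measured speedup:", round(measured_speedup, 2),
      "efficiency:", round(measured_efficiency, 2))
```

In this model the break-even grid size grows with the fixed overhead and shrinks as the per-point cost rises, which matches the note that the crossover also shifts with per-point cost.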

Verified against the combined branches: cap-first ordering keeps BLAS at 1
thread, and the matrix-element matmul fix means the coupled-transmon sweep now
runs with zero spurious matmul warnings.

Add a "Letting scqubits pick the settings" section showing
recommend_parallelization (with a runnable cell over the 16- and 384-point
grids), num_cpus="auto", AUTO_PARALLEL, and calibrate_parallelization; replace the
dead tools/autotune_multiprocessing.py reference. Also fold in the earlier demo
fixes (restored runnable second half, Mac mini timing labels, the __main__-guard
note for plain scripts, and the corrected fork/spawn BLAS-cap wording).

Rewrite from scratch so the main story is the shipped capability: let scqubits
choose num_cpus and the BLAS-thread cap. Lead with recommend_parallelization,
num_cpus="auto", and AUTO_PARALLEL; show a serial-vs-auto correctness check; then
the one-time calibrate_parallelization(). The manual mechanics (grid break-even,
the BLAS oversubscription cliff, manual settings, sparse-first, the __main__ guard)
are demoted to supporting "why" and "manual control" sections. American English
throughout. 17 cells, down from 25.
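For readers unfamiliar with the two script-level points mentioned here, the sketch below shows a serial-vs-parallel correctness check and the __main__ guard using only the standard-library multiprocessing module. It is a generic stand-in, not the notebook's code: `point_value`, `serial_sweep`, `parallel_sweep`, and `results_match` are hypothetical names, and the per-point function merely imitates the work of one sweep point.

```python
import math
from multiprocessing import Pool


def point_value(x):
    """Stand-in for one sweep point (for example, one diagonalization)."""
    return math.cos(x) ** 2


def serial_sweep(values):
    return [point_value(x) for x in values]


def parallel_sweep(values, num_cpus):
    # Worker processes evaluate the same function; pool.map preserves order.
    with Pool(processes=num_cpus) as pool:
        return pool.map(point_value, values)


def results_match(a, b, tol=1e-12):
    """True if both result lists have equal length and agree within tol."""
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))


def main():
    grid = [0.1 * i for i in range(16)]
    serial = serial_sweep(grid)
    parallel = parallel_sweep(grid, num_cpus=2)
    print("serial and parallel agree:", results_match(serial, parallel))


if __name__ == "__main__":
    # Start methods that re-import this file in each worker (spawn, forkserver)
    # would otherwise execute main() again inside every worker process.
    main()
```

In a notebook the guard is usually unnecessary because cells are not re-imported by workers; it matters when the same code is saved and run as a plain script.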
@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

kochjens added 3 commits June 5, 2026 23:18
The manual-control section now explains that the BLAS-thread cap is automatic
by default ("auto" = cores // num_cpus); setting an int overrides it and None
opts out.
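The rule described in this commit message can be sketched as a small resolver. This is an illustration under stated assumptions, not the scqubits implementation: the function name `resolve_blas_threads` and its parameters are hypothetical, and the floor of one thread when `num_cpus` exceeds the core count is an added assumption that the message does not state.

```python
import os


def resolve_blas_threads(setting, num_cpus, total_cores=None):
    """Map a BLAS-thread setting to a concrete per-worker cap.

    "auto" -> total_cores // num_cpus
    int    -> used as given (explicit override)
    None   -> no cap applied (opt out)
    """
    if setting is None:
        return None
    if total_cores is None:
        total_cores = os.cpu_count() or 1
    if setting == "auto":
        # ASSUMPTION: never go below one thread per worker.
        return max(1, total_cores // num_cpus)
    if isinstance(setting, int) and setting >= 1:
        return setting
    raise ValueError(f"unsupported BLAS-thread setting: {setting!r}")


# On a 10-core machine with 4 workers, "auto" gives 10 // 4 = 2 threads each.
print(resolve_blas_threads("auto", num_cpus=4, total_cores=10))
```

Keeping num_cpus times the per-worker cap at or below the core count is what avoids the oversubscription cliff discussed earlier in the thread.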