Add demo_multiprocessing notebook (auto-tuning) #22
Open
kochjens wants to merge 8 commits into
Demonstrates efficient parameter sweeps: the interaction between num_cpus and BLAS/LAPACK threads, capping BLAS threads before import, and timing num_cpus on the user's own machine. Cells are left unexecuted; timings are machine-specific.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The previous version led the reader to expect a num_cpus speedup, but its default example (small system, small grid) is in the regime where multiprocessing is net-negative -- so the first hands-on result showed num_cpus=2 slower than 1 with no explanation, and the "it helps for larger grids" caveat came afterwards.

Restructure around expectation-setting and two measured regimes:

- Open with the mental model: parallelism helps only when grid-size x per-point cost greatly exceeds the per-task overhead; on small/cheap sweeps it does nothing or slows you down -- that is normal, keep num_cpus=1.
- Cap BLAS threads FIRST (before imports), with a threadpoolctl verification cell (the #1 cause of confusing/backwards timings is the cap not taking effect).
- Demo 1: a cheap system (dim 216, ~5 ms/pt) where num_cpus does not help.
- Demo 2: the same system at larger truncated_dim (dim 512, ~40 ms/pt) where num_cpus gives ~2.8x -- the only change is per-point cost.
- Explain the BLAS oversubscription footgun with the measured ~90x cliff, note num_cpus x BLAS-threads ~ cores, and point to sparse diagonalization as the bigger lever for large composite systems.

All illustrative numbers are real measurements (10-core laptop); timing cells are left runnable but unexecuted (run-it-yourself). Code paths smoke-tested.
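As a rough illustration of the "cap BLAS threads first, then verify" step, the sketch below sets the usual thread-count environment variables before NumPy is imported and then reads the pool sizes back with threadpoolctl. It is not the notebook's actual cell: the helper name `blas_thread_counts` is hypothetical, the choice of a cap of 1 is only an example, and the check degrades to an empty list when NumPy or threadpoolctl is not installed.

```python
import os

# Cap the BLAS/OpenMP pools BEFORE numpy or scipy are imported:
# most BLAS builds read these variables once, when the library loads.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"  # example cap; pick the value that suits the sweep

try:
    import numpy as np  # noqa: F401  (imported only after the cap is set)
    from threadpoolctl import threadpool_info
except ImportError:
    threadpool_info = None  # verification needs numpy and threadpoolctl


def blas_thread_counts():
    """Hypothetical helper: list (user_api, internal_api, num_threads)
    for every native thread pool threadpoolctl can see."""
    if threadpool_info is None:
        return []
    return [
        (pool["user_api"], pool["internal_api"], pool["num_threads"])
        for pool in threadpool_info()
    ]


# Each entry should report the capped thread count if the cap took effect.
print(blas_thread_counts())
```

If the printed thread counts are larger than the cap, the most likely cause is that a numerical library was already imported before the variables were set, for example by an earlier notebook cell.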
The multiprocessing follow-up (per-task IPC reduced from O(N^2) to O(N)) lowered per-task overhead enough that the previous "light" system (dim 216) now benefits from num_cpus even at 96 points, so the old truncated_dim contrast no longer demonstrated "parallelism does not help." Re-anchor the two demos on GRID SIZE, the lever users actually vary, measured on the integration of all current perf branches:

- Demo 1: same system, 16 points -> num_cpus 1/2/4 are flat (0.10 s, ~1x/0.9x), the normal "too few points to amortize" regime.
- Demo 2: same system, 384 points -> 2.14 s -> 0.90 s at 4 cores (2.37x).

Single make_sweep(num_cpus, n_points) factory; refreshed illustrative tables; note the crossover also shifts with per-point cost.

Verified against the combined branches: cap-first ordering keeps BLAS at 1 thread, and the matrix-element matmul fix means the coupled-transmon sweep now runs with zero spurious matmul warnings.
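To make the 384-point figures quoted above easier to interpret, the following sketch derives speedup, parallel efficiency, and the time not explained by a perfect split across workers. The function `speedup_report` and the "unexplained time" measure are illustrative assumptions, not something the notebook or scqubits provides; the inputs are the rounded timings from this commit message, so the derived values are approximate.

```python
def speedup_report(t_serial, t_parallel, workers):
    """Hypothetical helper: summarize one serial-vs-parallel timing pair."""
    speedup = t_serial / t_parallel
    return {
        "speedup": speedup,
        # Fraction of the ideal `workers`-fold speedup actually achieved.
        "efficiency": speedup / workers,
        # Wall time beyond a perfect t_serial / workers split; a crude
        # stand-in for startup, IPC, and scheduling overhead.
        "unexplained_s": t_parallel - t_serial / workers,
    }


# Rounded timings quoted for the 384-point sweep: 2.14 s serial, 0.90 s on 4 cores.
report = speedup_report(t_serial=2.14, t_parallel=0.90, workers=4)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

Running the same function on a 16-point sweep, where serial and parallel times are nearly equal, gives a speedup near 1 and an efficiency near 1/workers, which is the "too few points to amortize" regime in numbers.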
Add a "Letting scqubits pick the settings" section showing recommend_parallelization (with a runnable cell over the 16- and 384-point grids), num_cpus="auto", AUTO_PARALLEL, and calibrate_parallelization; replace the dead tools/autotune_multiprocessing.py reference. Also fold in the earlier demo fixes (restored runnable second half, Mac mini timing labels, the __main__-guard note for plain scripts, and the corrected fork/spawn BLAS-cap wording).
Rewrite from scratch so the main story is the shipped capability: let scqubits choose num_cpus and the BLAS-thread cap. Lead with recommend_parallelization, num_cpus="auto", and AUTO_PARALLEL; show a serial-vs-auto correctness check; then the one-time calibrate_parallelization(). The manual mechanics (grid break-even, the BLAS oversubscription cliff, manual settings, sparse-first, the __main__ guard) are demoted to supporting "why" and "manual control" sections. American English throughout. 17 cells, down from 25.
The manual-control section now explains that the BLAS-thread cap is automatic
by default ("auto" = cores // num_cpus); setting an int overrides it and None
opts out.
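The three-way rule described above ("auto", an explicit int, or None) can be sketched as a small resolver. This is only an illustration of the stated rule, not scqubits' implementation: the function name `resolve_blas_cap`, its signature, and the floor of one thread when num_cpus exceeds the core count are all assumptions.

```python
import os


def resolve_blas_cap(setting, num_cpus, cores=None):
    """Hypothetical resolver for the BLAS-thread cap.

    setting: "auto" -> cores // num_cpus (floored at 1, an assumption here)
             int    -> use that value as given
             None   -> opt out, leave BLAS threading untouched
    """
    if cores is None:
        cores = os.cpu_count() or 1
    if setting is None:
        return None
    if setting == "auto":
        return max(1, cores // num_cpus)
    return int(setting)


# On a 10-core machine with 4 worker processes, "auto" gives 10 // 4 threads each.
print(resolve_blas_cap("auto", num_cpus=4, cores=10))
print(resolve_blas_cap(3, num_cpus=4, cores=10))
print(resolve_blas_cap(None, num_cpus=4, cores=10))
```

The integer division keeps num_cpus x BLAS-threads at or below the core count, which is the condition for avoiding oversubscription.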
New `demo_multiprocessing` notebook centered on letting scqubits choose the multiprocessing settings: `recommend_parallelization`, `num_cpus="auto"`, and the one-time `calibrate_parallelization`. It then explains what's being balanced (the grid break-even and the BLAS oversubscription cliff) and covers manual control and the `__main__` guard for scripts.

Demonstrates the parallelization heuristic + calibration shipping in the corresponding scqubits release.