
Expose CoreML compute units as gpu device IDs in onnx-coreml backend#2401

Merged
borg323 merged 2 commits into LeelaChessZero:master from ChinChangYang:coreml-ml-compute-units
Mar 29, 2026

Conversation

ChinChangYang (Contributor) commented Mar 16, 2026

Summary

  • Exposes CoreML compute unit selection through the standard gpu device ID parameter in the onnx-coreml backend, allowing users to control which Apple hardware accelerators CoreML uses for inference.
  • Removes ProfileComputePlan=1 from CoreML provider options (previously caused ~111 s warm session creation; without it warm startup is ~46 s — a ~58% improvement).

Usage

The gpu parameter selects which CoreML compute units to use:

  • gpu=0 (default) — CPU + GPU (CPUAndGPU)
  • gpu=1 — CPU + Neural Engine (CPUAndNeuralEngine)
  • gpu=2 or higher — all available hardware (ALL: CPU, GPU, Neural Engine)

This is most useful with the multiplexing backend to run separate instances targeting different compute units simultaneously:

./lc0 backendbench -w <model> \
  -b multiplexing \
  -o "gpu(backend=onnx-coreml,gpu=0,threads=2,max_batch=64),npu(backend=onnx-coreml,gpu=1,threads=2,max_batch=64)" \
  --batch-step=16 --start-batch-size=16 --max-batch-size=64 -t 4
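
The mapping behind the gpu parameter can be sketched as a small helper. This is an illustrative sketch, not the actual lc0 code: the function name is hypothetical, and the string values follow the MLComputeUnits names quoted in this PR description (CPUAndGPU, CPUAndNeuralEngine, ALL).

```cpp
#include <string>

// Hypothetical helper mirroring the mapping this PR describes: the standard
// "gpu" integer option selects the MLComputeUnits value handed to the
// ONNX Runtime CoreML execution provider.
std::string ComputeUnitsForGpuId(int gpu_id) {
  switch (gpu_id) {
    case 0:
      return "CPUAndGPU";  // default: CPU + GPU
    case 1:
      return "CPUAndNeuralEngine";  // CPU + Neural Engine
    default:
      return "ALL";  // gpu=2 or higher: CPU, GPU, and Neural Engine
  }
}
```

Keeping the selection behind a single integer is what lets the multiplexing example above address the two accelerators as if they were two devices.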

Motivation

On Apple Silicon, the GPU and Neural Engine are separate hardware units that can run simultaneously. By running two onnx-coreml instances via the multiplexing backend — one targeting gpu=0 (CPUAndGPU) and one targeting gpu=1 (CPUAndNeuralEngine) — both accelerators can be kept busy in parallel, increasing overall inference throughput compared to a single instance.

Using the existing gpu parameter (instead of a custom compute_units string option) keeps the interface consistent with other backends like CUDA, where gpu=N selects a device.

Test plan

  • Build on macOS with ONNX Runtime CoreML provider
  • Run onnx-coreml backend with gpu=0 and gpu=1 and verify inference produces valid results
  • Confirm session creation time is reduced compared to ProfileComputePlan=1
  • Verify error histogram between gpu=0 and gpu=1 shows acceptable numerical differences
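
The last check can be sketched as a histogram over per-output absolute differences between two runs (e.g. gpu=0 vs gpu=1). This is an illustrative sketch, not lc0's actual verification code; the bucket edges are arbitrary choices for the example.

```cpp
#include <array>
#include <cmath>
#include <vector>

// Bucket absolute differences between two backends' outputs into four bins:
// [0, 1e-4), [1e-4, 1e-3), [1e-3, 1e-2), [1e-2, inf).
std::array<int, 4> ErrorHistogram(const std::vector<float>& a,
                                  const std::vector<float>& b) {
  std::array<int, 4> bins{};
  for (size_t i = 0; i < a.size() && i < b.size(); ++i) {
    const float err = std::fabs(a[i] - b[i]);
    if (err < 1e-4f) {
      ++bins[0];
    } else if (err < 1e-3f) {
      ++bins[1];
    } else if (err < 1e-2f) {
      ++bins[2];
    } else {
      ++bins[3];
    }
  }
  return bins;
}
```

A mostly-empty top bin would indicate the GPU and Neural Engine paths agree to within acceptable numerical tolerance.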

🤖 Generated with Claude Code

Replace the hardcoded ProfileComputePlan=1 provider option with a
configurable MLComputeUnits option (default: ALL). Accepts the same
values as the CoreML MLComputeUnits enum: ALL, CPU_AND_NE, CPU_ONLY,
CPU_AND_GPU.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
gsobala (Contributor) commented Mar 16, 2026

On a MacBook Pro M5 Max this PR gives a substantial improvement in nps using multiplexing compared to simple onnx-coreml.

Results for nets 792013, 11248, 771473 and BT4-1024x15x32h-swa-6147500.pb.gz :

       _
|   _ | |
|_ |_ |_| v0.33.0-dev+git.cb97caf built Mar 16 2026
Loading weights file from: /Users/george/nets/792013
size, mean nps, mean ms,   sdev,     cv, max nps,  median, min nps,
  16,    10648,   1.503, 0.1334, 0.0888,   11449,   11138,    8433
  32,    17628,   1.815, 0.0348, 0.0192,   18145,   17641,   16285
  48,    22268,   2.156, 0.0303, 0.0141,   22661,   22360,   21211
  64,    24411,   2.622, 0.0522, 0.0199,   25129,   24600,   23583
(venv) george@MacBook-Pro-M5 lc0 % build/release/lc0 backendbench -w ~/nets/792013 \
  -b multiplexing \
  -o "gpu(backend=onnx-coreml,compute_units=CPUAndGPU,threads=2,max_batch=64),npu(backend=onnx-coreml,compute_units=CPUAndNeuralEngine,threads=2,max_batch=64)" \
  --batch-step=16 --start-batch-size=16 --max-batch-size=64 -t 4
       _
|   _ | |
|_ |_ |_| v0.33.0-dev+git.cb97caf built Mar 16 2026
Loading weights file from: /Users/george/nets/792013
Creating backend [onnx-coreml]...
Creating backend [onnx-coreml]...
size, mean nps, mean ms,   sdev,     cv, max nps,  median, min nps,
  16,    16133,  0.9917, 0.9198, 0.9275,   31425,   21211,    1816
  32,    24198,   1.322, 0.6204, 0.4691,   36078,   35202,   14128
  48,    29909,   1.605, 0.9525, 0.5935,   45265,   44699,   14475
  64,    31363,   2.041, 1.4780, 0.7243,   50340,   49783,   12861

===============

(venv) george@MacBook-Pro-M5 lc0 % build/release/lc0 backendbench -w ~/nets/11248 -b onnx-coreml --batch-step=16 --start-batch-size=16 --max-batch-size=64
       _
|   _ | |
|_ |_ |_| v0.33.0-dev+git.cb97caf built Mar 16 2026
Loading weights file from: /Users/george/nets/11248
size, mean nps, mean ms,   sdev,     cv, max nps,  median, min nps,
  16,     9847,   1.625, 0.1290, 0.0794,   10421,   10238,    7198
  32,    12766,   2.507, 0.0372, 0.0148,   12996,   12838,   12095
  48,    17450,   2.751, 0.0312, 0.0114,   17883,   17465,   16786
  64,    17649,   3.626, 0.0264, 0.0073,   17941,   17633,   17239
(venv) george@MacBook-Pro-M5 lc0 % build/release/lc0 backendbench -w ~/nets/11248 \                                                                       
  -b multiplexing \
  -o "gpu(backend=onnx-coreml,compute_units=CPUAndGPU,threads=2,max_batch=64),npu(backend=onnx-coreml,compute_units=CPUAndNeuralEngine,threads=2,max_batch=64)" \
  --batch-step=16 --start-batch-size=16 --max-batch-size=64 -t 4
       _
|   _ | |
|_ |_ |_| v0.33.0-dev+git.cb97caf built Mar 16 2026
Loading weights file from: /Users/george/nets/11248
Creating backend [onnx-coreml]...
Creating backend [onnx-coreml]...
size, mean nps, mean ms,   sdev,     cv, max nps,  median, min nps,
  16,    15351,   1.042, 0.4065, 0.3900,   34930,   20180,    4857
  32,    18855,   1.697, 0.6864, 0.4044,   26197,   25834,   11683
  48,    23482,   2.044, 1.1654, 0.5701,   35785,   35317,   11934
  64,    23723,   2.698, 1.5842, 0.5872,   36116,   35769,   11721

==================

(venv) george@MacBook-Pro-M5 lc0 % build/release/lc0 backendbench -w ~/nets/771473 -b onnx-coreml --batch-step=16 --start-batch-size=16 --max-batch-size=64 
       _
|   _ | |
|_ |_ |_| v0.33.0-dev+git.cb97caf built Mar 16 2026
Loading weights file from: /Users/george/nets/771473
size, mean nps, mean ms,   sdev,     cv, max nps,  median, min nps,
  16,    11455,   1.397, 0.1726, 0.1235,   12496,   12255,    8396
  32,    19503,   1.641, 0.0287, 0.0175,   19862,   19597,   17893
  48,    24508,   1.959, 0.0357, 0.0182,   24920,   24621,   21452
  64,    26660,   2.401, 0.0547, 0.0228,   27665,   26388,   25113
(venv) george@MacBook-Pro-M5 lc0 % build/release/lc0 backendbench -w ~/nets/771473 \                                                                       
  -b multiplexing \
  -o "gpu(backend=onnx-coreml,compute_units=CPUAndGPU,threads=2,max_batch=64),npu(backend=onnx-coreml,compute_units=CPUAndNeuralEngine,threads=2,max_batch=64)" \
  --batch-step=16 --start-batch-size=16 --max-batch-size=64 -t 4
       _
|   _ | |
|_ |_ |_| v0.33.0-dev+git.cb97caf built Mar 16 2026
Loading weights file from: /Users/george/nets/771473
Creating backend [onnx-coreml]...
Creating backend [onnx-coreml]...
size, mean nps, mean ms,   sdev,     cv, max nps,  median, min nps,
  16,    20916,   0.765, 0.3384, 0.4424,   39735,   20115,    4313
  32,    28857,   1.109, 0.3778, 0.3407,   69628,   35032,   14486
  48,    34207,   1.403, 0.6503, 0.4634,   49860,   49017,   19835
  64,    37409,   1.711, 0.8957, 0.5236,   55543,   54635,   19939

==================

(venv) george@MacBook-Pro-M5 lc0 % build/release/lc0 backendbench -w BT4 -b onnx-coreml --batch-step=16 --start-batch-size=16 --max-batch-size=64
       _
|   _ | |
|_ |_ |_| v0.33.0-dev+git.cb97caf built Mar 16 2026
Loading weights file from: BT4
Weights file has multihead format, updating format flag
size, mean nps, mean ms,   sdev,     cv, max nps,  median, min nps,
  16,     1979,   8.083, 0.0379, 0.0047,    1992,    1982,    1918
  32,     2252,   14.21, 0.0314, 0.0022,    2259,    2254,    2227
  48,     2421,   19.83, 0.0561, 0.0028,    2431,    2422,    2385
  64,     2463,   25.99, 0.0942, 0.0036,    2475,    2464,    2413
(venv) george@MacBook-Pro-M5 lc0 % build/release/lc0 backendbench -w BT4 \                                                                       
  -b multiplexing \
  -o "gpu(backend=onnx-coreml,compute_units=CPUAndGPU,threads=2,max_batch=64),npu(backend=onnx-coreml,compute_units=CPUAndNeuralEngine,threads=2,max_batch=64)" \
  --batch-step=16 --start-batch-size=16 --max-batch-size=64 -t 4
       _
|   _ | |
|_ |_ |_| v0.33.0-dev+git.cb97caf built Mar 16 2026
Loading weights file from: BT4
Weights file has multihead format, updating format flag
Creating backend [onnx-coreml]...
Creating backend [onnx-coreml]...
size, mean nps, mean ms,   sdev,     cv, max nps,  median, min nps,
  16,     2249,   7.115,  6.9841, 0.9816,    7927,    3985,     535
  32,     2624,    12.2, 11.4768, 0.9410,    8893,    4472,     841
  48,     2763,   17.37, 16.9928, 0.9781,    4779,    4752,     846
  64,     2857,    22.4, 22.0272, 0.9832,    4816,    4799,     848

@ChinChangYang ChinChangYang marked this pull request as ready for review March 16, 2026 11:46
batch_size_ = opts.GetOrDefault<int>("batch", default_batch);
steps_ = opts.GetOrDefault<int>("steps", default_steps);
min_batch_size_ = opts.GetOrDefault<int>("min_batch", default_min_batch);
compute_units_ = opts.GetOrDefault<std::string>("compute_units", "ALL");
Contributor

Backend can expose different compute units as separate GPUs. GPU should get id 0 and neural engine id 1. This would use an existing option in a logical way.

Contributor Author

Fixed in 634ad6e.

Replace the custom `compute_units` string option with the standard `gpu`
integer option used by all other GPU backends. gpu=0 (default) selects
CPUAndGPU, gpu=1 selects CPUAndNeuralEngine, and any other value uses ALL.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ChinChangYang ChinChangYang changed the title Add configurable MLComputeUnits option to onnx-coreml backend Expose CoreML compute units as gpu device IDs in onnx-coreml backend Mar 22, 2026
@borg323 borg323 merged commit 5d8b306 into LeelaChessZero:master Mar 29, 2026
4 checks passed
4 participants