[QDP Runtime] Add v1 State-Partitioned Distributed Runtime Draft by 400Ping · Pull Request #1226 · apache/mahout

400Ping · 2026-03-29T07:47:36Z

Summary

This PR introduces a v1 draft of qdp-runtime, a state-partitioned distributed runtime skeleton for Mahout QDP.

The goal of this PR is to establish the core abstractions and control-plane scaffolding needed for multi-GPU and future multi-node execution. This is not intended to be a production-ready distributed runtime yet. Instead, it defines the v1 execution model, task lifecycle, placement policies, gather/reduce semantics, and runtime object tracking needed to iterate safely.

What This PR Adds

New crate

qdp-runtime

Core runtime model

state-partitioned distributed state metadata
local/global qubit layout via PartitionLayout
DistributedStateHandle and StatePartitionRef
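
To make the local/global qubit split concrete, here is a minimal sketch of one way such a layout could work. It assumes the high index bits select the partition and the low bits index within it, which matches contiguous amplitude blocks. `LayoutSketch` and its methods are hypothetical names for illustration, not the actual `PartitionLayout` API in this PR.

```rust
use std::ops::Range;

/// Hypothetical layout (not the crate's PartitionLayout): the top
/// `global_qubits` index bits pick the partition, the remaining
/// local bits index an amplitude inside that partition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LayoutSketch {
    pub total_qubits: u32,
    pub global_qubits: u32,
}

impl LayoutSketch {
    /// Returns None unless `num_partitions` is a power of two that fits
    /// inside a state of `total_qubits` qubits.
    pub fn new(total_qubits: u32, num_partitions: u64) -> Option<Self> {
        if !num_partitions.is_power_of_two() || total_qubits >= 64 {
            return None;
        }
        let global_qubits = num_partitions.trailing_zeros();
        if global_qubits > total_qubits {
            return None;
        }
        Some(Self { total_qubits, global_qubits })
    }

    pub fn local_qubits(&self) -> u32 {
        self.total_qubits - self.global_qubits
    }

    /// Number of amplitudes held by each partition.
    pub fn partition_len(&self) -> u64 {
        1u64 << self.local_qubits()
    }

    /// Maps a global amplitude index to (partition, offset within partition).
    pub fn locate(&self, amplitude_index: u64) -> (u64, u64) {
        let partition = amplitude_index >> self.local_qubits();
        let offset = amplitude_index & (self.partition_len() - 1);
        (partition, offset)
    }

    /// Contiguous range of global amplitude indices owned by `partition`.
    pub fn range(&self, partition: u64) -> Range<u64> {
        let len = self.partition_len();
        partition * len..(partition + 1) * len
    }
}
```

Under these assumptions, a 4-qubit state split into 4 partitions has 2 global and 2 local qubits, and amplitude index 9 lands at offset 1 of partition 2.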

Placement and topology

RoundRobin, Weighted, and TopologyAware placement policies
heterogeneous GPU-aware device capability model
NVLink-aware topology metadata
ClusterInventory and DeviceTopology
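
As an illustration of how round-robin and weighted placement might differ, the sketch below assigns partition indices to device indices. The enum, the function, and the greedy weighting rule are all assumptions made for this example; the PR's actual policies (including TopologyAware, which is not modeled here) may work differently.

```rust
/// Hypothetical placement policies for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PlacementSketch {
    RoundRobin,
    Weighted,
}

/// Returns, for each partition, the index of the device it is placed on.
/// `weights[d]` is an assumed relative capacity for device `d`.
/// Returns None if there are no devices, or if Weighted placement has
/// partitions to place but every weight is zero.
pub fn place(
    policy: PlacementSketch,
    num_partitions: usize,
    weights: &[u64],
) -> Option<Vec<usize>> {
    if weights.is_empty() {
        return None;
    }
    match policy {
        PlacementSketch::RoundRobin => {
            Some((0..num_partitions).map(|p| p % weights.len()).collect())
        }
        PlacementSketch::Weighted => {
            let mut counts = vec![0u64; weights.len()];
            let mut placement = Vec::with_capacity(num_partitions);
            for _ in 0..num_partitions {
                // Greedy rule: pick the device whose load-to-weight ratio
                // would be lowest after taking one more partition.
                let mut best: Option<usize> = None;
                for (d, &w) in weights.iter().enumerate() {
                    if w == 0 {
                        continue; // zero-weight devices receive nothing
                    }
                    best = match best {
                        None => Some(d),
                        Some(b) => {
                            // Compare (counts[d]+1)/w with (counts[b]+1)/weights[b]
                            // by cross-multiplying to stay in integers.
                            let lhs = (counts[d] + 1) as u128 * weights[b] as u128;
                            let rhs = (counts[b] + 1) as u128 * w as u128;
                            if lhs < rhs { Some(d) } else { Some(b) }
                        }
                    };
                }
                let chosen = best?;
                counts[chosen] += 1;
                placement.push(chosen);
            }
            Some(placement)
        }
    }
}
```

With weights `[3, 1]` and four partitions, this rule places three partitions on device 0 and one on device 1; ties go to the lower device index.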

Coordinator / worker scaffolding

worker registration
in-process worker model
coordinator job planning
partition task generation
partition-to-worker mapping

Task lifecycle

Pending / Assigned / Running / Completed / Failed
retry policy
lease timeout skeleton
task result reporting
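
The lifecycle above can be read as a small state machine. The following sketch shows one plausible set of transitions, assuming a failed or lease-expired attempt returns the task to Pending until a maximum attempt count is reached. The type names, fields, and transition rules are hypothetical and are not taken from the PR's code.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskState {
    Pending,
    Assigned,
    Running,
    Completed,
    Failed,
}

/// Hypothetical task record: tracks state, attempts used, and lease deadline.
#[derive(Debug)]
pub struct TaskSketch {
    pub state: TaskState,
    pub attempts: u32,
    pub max_attempts: u32,
    pub lease_deadline: Option<Instant>,
}

impl TaskSketch {
    pub fn new(max_attempts: u32) -> Self {
        Self { state: TaskState::Pending, attempts: 0, max_attempts, lease_deadline: None }
    }

    /// Pending -> Assigned; consumes one attempt and starts the lease.
    pub fn assign(&mut self, now: Instant, lease: Duration) -> bool {
        if self.state != TaskState::Pending {
            return false;
        }
        self.state = TaskState::Assigned;
        self.attempts += 1;
        self.lease_deadline = Some(now + lease);
        true
    }

    /// Assigned -> Running.
    pub fn start(&mut self) -> bool {
        if self.state != TaskState::Assigned {
            return false;
        }
        self.state = TaskState::Running;
        true
    }

    /// Running -> Completed.
    pub fn complete(&mut self) -> bool {
        if self.state != TaskState::Running {
            return false;
        }
        self.state = TaskState::Completed;
        self.lease_deadline = None;
        true
    }

    /// Assigned/Running -> Pending if attempts remain, otherwise Failed.
    pub fn fail(&mut self) -> bool {
        if !matches!(self.state, TaskState::Assigned | TaskState::Running) {
            return false;
        }
        self.lease_deadline = None;
        self.state = if self.attempts < self.max_attempts {
            TaskState::Pending
        } else {
            TaskState::Failed
        };
        true
    }

    /// Treats an expired lease as a failed attempt.
    pub fn expire_if_overdue(&mut self, now: Instant) -> bool {
        match self.lease_deadline {
            Some(deadline) if now >= deadline => self.fail(),
            _ => false,
        }
    }
}
```

Passing `now` in explicitly, rather than reading the clock inside the methods, keeps the transitions deterministic and easy to unit test.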

Output handling

runtime object/output registry
gather planning via GatherPlan
metric reduction planning via ReducePlan
host-side metric aggregation for Sum / Mean / Min / Max / Concat
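
For reference, host-side aggregation over per-partition metric values could look roughly like the sketch below. It assumes each partition reports a list of `f64` values and that Mean is an unweighted mean over all reported values; `ReduceOpSketch` and `reduce` are hypothetical names, and the PR's `ReducePlan` may define these semantics differently.

```rust
/// Hypothetical reduction operators mirroring the names listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReduceOpSketch {
    Sum,
    Mean,
    Min,
    Max,
    Concat,
}

/// Reduces per-partition metric values on the host.
/// `partials[i]` holds the values reported by partition `i`.
/// Concat keeps partition order; the scalar operators return a
/// one-element vector, or None when no values were reported.
pub fn reduce(op: ReduceOpSketch, partials: &[Vec<f64>]) -> Option<Vec<f64>> {
    let flat: Vec<f64> = partials.iter().flatten().copied().collect();
    match op {
        ReduceOpSketch::Concat => Some(flat),
        _ if flat.is_empty() => None,
        ReduceOpSketch::Sum => Some(vec![flat.iter().sum::<f64>()]),
        ReduceOpSketch::Mean => {
            // Assumption: unweighted mean over every reported value.
            Some(vec![flat.iter().sum::<f64>() / flat.len() as f64])
        }
        ReduceOpSketch::Min => {
            Some(vec![flat.iter().copied().fold(f64::INFINITY, f64::min)])
        }
        ReduceOpSketch::Max => {
            Some(vec![flat.iter().copied().fold(f64::NEG_INFINITY, f64::max)])
        }
    }
}
```

If partitions instead reported pre-averaged values over different sample counts, a correct Mean would need each partial to carry its count as well; this sketch sidesteps that by reducing raw values.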

Examples and docs

qdp-runtime/examples/local_runtime_smoke.rs
qdp-runtime/examples/local_runtime_benchmark.rs
qdp/docs/runtime/RUNTIME_V1.md

Scope of This PR

This PR focuses on the v1 runtime draft and control-plane design.

It intentionally does not attempt to provide:

full multi-node transport
persistent GPU object store
partition migration
dynamic repartitioning
production-ready fault tolerance
complete benchmark suite integration

Design Notes

The current v1 draft is centered around a state-partitioned execution model.

That means:

a logical state may be partitioned across devices
partition layout is currently contiguous amplitude blocks
placement can be weighted and topology-aware
gather and metric reduction are explicit runtime operations
runtime outputs are tracked via a coordinator-side object registry
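
Because partitions are contiguous amplitude blocks, an explicit gather can in principle be a concatenation in partition order. The sketch below illustrates that idea on the host, with amplitudes modeled as `(re, im)` pairs; `BlockSketch` and `gather` are hypothetical names and do not reflect the PR's `GatherPlan` implementation.

```rust
/// Hypothetical partition output: the partition index plus its amplitudes,
/// stored as (real, imaginary) pairs.
#[derive(Debug, Clone)]
pub struct BlockSketch {
    pub partition: usize,
    pub amplitudes: Vec<(f64, f64)>,
}

/// Reassembles the full state by concatenating blocks in partition order.
/// Returns None if a partition is missing or duplicated, or if any block
/// does not have exactly `block_len` amplitudes.
pub fn gather(mut blocks: Vec<BlockSketch>, block_len: usize) -> Option<Vec<(f64, f64)>> {
    // Workers may report in any order, so sort by partition index first.
    blocks.sort_by_key(|b| b.partition);
    let mut state = Vec::with_capacity(blocks.len() * block_len);
    for (expected, block) in blocks.iter().enumerate() {
        if block.partition != expected || block.amplitudes.len() != block_len {
            return None;
        }
        state.extend_from_slice(&block.amplitudes);
    }
    Some(state)
}
```

Validating the partition indices before concatenating means a missing or duplicated block surfaces as an error instead of a silently misaligned state vector.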

This PR also keeps NVTX instrumentation hooks in place so that future runtime bottlenecks can be profiled more easily.

Current Limitations

transport is still in-process / metadata-oriented
local executor support is minimal
examples are smoke/benchmark scaffolding, not full workflow integration
topology support is advisory metadata in v1
object tracking exists, but a persistent GPU-resident object store is future work

Follow-Up Work

wire the runtime into a real single-node local execution path end-to-end
add persistent runtime object storage semantics
improve retry/reassignment behavior
extend topology-aware gather/reduce logic
add benchmark integration with existing QDP benchmark workflows
add multi-node transport and worker communication

Related Issues

Related to #1210

Checklist

Added or updated unit tests for all changes
Added or updated documentation for all changes

Signed-off-by: 400Ping <jiekaichang@apache.org>

400Ping · 2026-03-29T07:50:57Z

cc @viiccwen

Signed-off-by: 400Ping <jiekaichang@apache.org>

[QDP Runtime] Add v1 State-Partitioned Distributed Runtime Draft

11fbbe0

Signed-off-by: 400Ping <jiekaichang@apache.org>

400Ping requested review from guan404ming, rich7420 and ryankert01 March 29, 2026 07:50

split lib.rs

a220c43

Signed-off-by: 400Ping <jiekaichang@apache.org>

400Ping wants to merge 2 commits into apache:dev-distributed-qdp from 400Ping:dev-distributed-qdp
