
Commit 131de2a

Refactor: unify Worker.run(callable, args, config), delete Task (#578)
- Delete Task dataclass; Worker.run now takes (callable, args, config) directly instead of Task(orch=fn, args=...)
- Orch fn signature changes from (o, args) to (o, args, cfg) — receives the ChipCallConfig passed to Worker.run
- Sub callable signature changes from fn() to fn(args) — receives TaskArgs decoded from the mailbox blob via _read_args_from_mailbox
- Add 3 new tests verifying args pass-through: tensor metadata, scalar values, and empty args
- Update scene_test, all L3 unit tests, and L3 scene tests
- Update docs: task-flow.md, distributed_level_runtime.md, worker-manager.md, roadmap.md, orchestrator.py docstring
1 parent b130c22 commit 131de2a

12 files changed

Lines changed: 261 additions & 124 deletions
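
At a glance, the call-site migration this commit makes (a minimal sketch assembled from the commit message and the updated `worker.py` docstring; `chip_callable`, `postprocess`, `sub_args`, `my_args`, and `my_config` are illustrative placeholders):

```python
from simpler.worker import Worker

w = Worker(level=3, device_ids=[8, 9], num_sub_workers=2,
           platform="a2a3", runtime="tensormap_and_ringbuffer")
cid = w.register(lambda args: postprocess(args))  # sub callable is now fn(args), not fn()
w.init()

# Before: w.run(Task(orch=my_orch, args=my_args))   -- Task is deleted
# After:  callable, args, and config are passed to run() directly,
#         and the orch function gains a third parameter, cfg.
def my_orch(o, args, cfg):
    o.submit_next_level(chip_callable, args, cfg)
    o.submit_sub(cid, sub_args)

w.run(my_orch, my_args, my_config)  # config defaults to ChipCallConfig() if omitted
w.close()
```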


docs/distributed_level_runtime.md

Lines changed: 7 additions & 2 deletions
@@ -108,13 +108,18 @@ See [scheduler.md](scheduler.md) for the dispatch loop and coordination.
 The **execution layer**. `WorkerManager` holds two pools of `WorkerThread`s
 (next-level pool and sub pool). Each `WorkerThread`:

-- owns one `IWorker` (`ChipWorker`, `SubWorker`, or nested `Worker`)
+- owns one `IWorker` (`ChipWorker` or nested `Worker`) for next-level workers
 - has its own `std::thread`
 - runs in one of two modes:
   - `THREAD`: calls `worker->run(callable, view, config)` directly in-process
   - `PROCESS`: forks a child at init; each dispatch writes task data to a shm
     mailbox, the child decodes and runs

+SUB workers are Python-only: the forked child process runs a Python loop
+(``_sub_worker_loop``) that reads the args blob from the mailbox, decodes it
+into a ``TaskArgs``, and calls the registered callable as ``fn(args)``.
+There is no C++ ``SubWorker`` class.
+
 See [worker-manager.md](worker-manager.md) for thread/process mechanics, fork
 ordering, and mailbox layout. See [task-flow.md](task-flow.md) for what flows
 through `IWorker::run`.
@@ -177,7 +182,7 @@ w4 = Worker(level=4, child_mode=WorkerManager.Mode.THREAD)
 w4.add_worker(NEXT_LEVEL, w3)  # w3 is an IWorker
 w4.init()

-w4.run(Task(orch=my_l4_orch, task_args=..., config=...))
+w4.run(my_l4_orch, my_args, my_config)
 ```

 When L4's `WorkerThread` dispatches to L3, L3's `Worker::run` opens its own
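
Condensed, the Python-only SUB path described above looks like the following (a sketch mirroring ``_sub_worker_loop`` from `python/simpler/worker.py`; `wait_for_dispatch`, `read_callable_id`, and `mark_task_done` are hypothetical stand-ins for the mailbox state handling, which the real loop does inline):

```python
# Hypothetical condensation of the forked SUB child's loop; the real
# _sub_worker_loop polls _OFF_STATE and reports failures via _OFF_ERROR.
while True:
    wait_for_dispatch(buf)               # stand-in: block until the parent marks the mailbox ready
    cid = read_callable_id(buf)          # stand-in: uint64 callable_id from the mailbox
    args = _read_args_from_mailbox(buf)  # real helper: decode the args blob into TaskArgs
    registry[cid](args)                  # the fn(args) contract
    mark_task_done(buf)                  # stand-in: write TASK_DONE for the parent to observe
```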

docs/roadmap.md

Lines changed: 15 additions & 8 deletions
@@ -77,6 +77,21 @@ get if I pip install `main` today", this page.
 `DIST_MAX_SCOPE_DEPTH = 64`; scopes deeper than the ring depth share
 the innermost ring.

+### Uniform `Worker.run(callable, args, config)`
+
+- **`Task` dataclass deleted**: `Worker.run` now takes
+  `(callable, args=None, config=None)` directly. For L3+, `callable` is
+  the orch function; for L2, it is a `ChipCallable`. `config` defaults
+  to `ChipCallConfig()` if omitted.
+- **Orch fn signature is 3-param**: `def orch(o, args, cfg)` — receives
+  the `Orchestrator`, `TaskArgs`, and `ChipCallConfig` passed to
+  `Worker.run`.
+- **Sub callable signature is `fn(args)`** — registered callables now
+  receive the `TaskArgs` decoded from the mailbox blob. The Python child
+  loop (`_sub_worker_loop`) reads the blob at `_OFF_ARGS` and constructs
+  a `TaskArgs` via `_read_args_from_mailbox`. Callable registry stays
+  Python-only (`dict[int, Callable]`).
+
 ### Dispatch internals

 - `IWorker::run(uint64_t callable, TaskArgsView args, ChipCallConfig cfg)`
@@ -119,14 +134,6 @@ get if I pip install `main` today", this page.
 `Mode = THREAD | PROCESS` (no separate fork-proxy classes). Strict-4
 per-worker-type ready queues already landed in PR-D-1.

-### PR-E: uniform `Worker.run` + callable registry unification
-
-- Python `Worker.run` drops the `if level==2` branch.
-- Callable registry moves fully into C++
-  (`unordered_map<uint64_t, nb::object>` owned by `Worker`) so
-  `ChipCallable` and Python `sub` callables share one lookup path.
-  This unblocks L4+ recursion.
-
 ### PR-F: C++ `Worker::run(Task)` for L4+ recursion

 - C++ `Task { OrchFn orch; TaskArgs task_args; CallConfig config; }`
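
For the "Uniform `Worker.run`" entry above, the new contract in code (an illustrative sketch; `chip_callable`, `cid`, `sub_args`, and `my_args` are placeholders):

```python
def my_orch(o, args, cfg):       # 3-param orch: Orchestrator, TaskArgs, ChipCallConfig
    o.submit_next_level(chip_callable, args, cfg)
    o.submit_sub(cid, sub_args)  # the registered callable runs as fn(args)

w.run(my_orch, my_args)          # config omitted: defaults to ChipCallConfig()
```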

docs/task-flow.md

Lines changed: 22 additions & 9 deletions
@@ -244,13 +244,20 @@ void ChipWorker::run(Callable cb, TaskArgsView view, const CallConfig &config) override {

 ### `SubWorker` (Python callable leaf)

-```cpp
-void SubWorker::run(Callable cb, TaskArgsView view, const CallConfig &config) override {
-    uint64_t cid = cb;        // no cast
-    py_registry_[cid](view);  // invoke Python callable with view
-}
+SubWorker execution is handled entirely in Python. The forked child process
+runs ``_sub_worker_loop`` which reads the args blob from the shared-memory
+mailbox, decodes it into a ``TaskArgs`` object, and passes it to the
+registered callable:
+
+```python
+fn(args)  # args: TaskArgs decoded from the mailbox blob
 ```

+The callable receives the same `TaskArgs` that was submitted via
+`orch.submit_sub(cid, args)`, with tags stripped (tags are consumed by the
+Orchestrator at submit time). There is no C++ `SubWorker` class — the
+Python child loop and callable registry are the entire implementation.
+
 Child inherits the Python registry through fork COW; the registry lookup works
 with no IPC.
@@ -333,6 +340,12 @@ slot.config ─┼─► memcpy into shm mailbox ─► child reads view ─
 slot.task_args ─┘   (write_blob)              (read_blob)
 ```

+For SUB workers in PROCESS mode, the child is a Python process running
+``_sub_worker_loop``. The mailbox carries the same blob format, but the
+Python child decodes it via ``_read_args_from_mailbox`` into a ``TaskArgs``
+object and calls ``fn(args)`` directly — the dispatch path bypasses
+``IWorker`` entirely.
+
 The mailbox layout, fork ordering, and child loop are in
 [worker-manager.md](worker-manager.md) §4.

@@ -418,7 +431,7 @@ def my_l3_orch(orch3, args, cfg):
 def my_l4_orch(orch4, args, cfg):
     orch4.submit_next_level(my_l3_orch_handle, args, cfg)

-w4.run(Task(orch=my_l4_orch, task_args=..., config=...))
+w4.run(my_l4_orch, my_args, my_config)
 ```

 L4's WorkerThread dispatches to `w3` via `IWorker::run`. `Worker::run`
@@ -463,14 +476,14 @@ w3 = Worker(level=3, child_mode=PROCESS)
 w3.add_worker(NEXT_LEVEL, chip_worker_0)
 w3.init()  # fork chip_0 here

-w3.run(Task(orch=my_orch, task_args=args, config=CallConfig(block_dim=3)))
+w3.run(my_orch, args, CallConfig(block_dim=3))
 ```

 Step-by-step (PROCESS mode, one chip worker):

 | Step | Where | What happens |
 | ---- | ----- | ------------ |
-| 1 | parent Python | user builds `args: TaskArgs`, calls `w3.run(Task)` |
+| 1 | parent Python | user builds `args: TaskArgs`, calls `w3.run(my_orch, args, config)` |
 | 2 | `Worker::run` | `scope_begin` → call `my_orch(&orch_, args.view(), cfg)` |
 | 3 | `Orchestrator::submit_next_level` | `slot = ring.alloc()`; move `chip_args` into `slot.task_args`; walk tags → `tensormap.lookup(a.data)`, `tensormap.lookup(b.data)`, `tensormap.insert(c.data, slot)`; push ready |
 | 4 | Scheduler thread | pop `slot`; `wt = manager.pick_idle(NEXT_LEVEL)` (WT_chip_0); `wt->dispatch(slot)` |
@@ -481,7 +494,7 @@ Step-by-step (PROCESS mode, one chip worker):
 | 9 | chip_0 child | `run` returns; write `TASK_DONE` |
 | 10 | WT_chip_0 parent | see `TASK_DONE`; call `on_complete_(slot)` |
 | 11 | Scheduler | mark slot COMPLETED; fanout release (none in this DAG); scope_end will release scope ref |
-| 12 | `Worker::run` returns | user's `w3.run(Task)` returns; `c` contains result in shm, visible to user |
+| 12 | `Worker::run` returns | user's `w3.run(...)` returns; `c` contains result in shm, visible to user |

 ---

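Putting the `SubWorker` pieces above together, the submit-to-callable round trip looks like this (a sketch using only the API shown in these docs; `output_tensor` and `w` are placeholders):

```python
def on_chip_done(args):
    # Runs in the forked Python child. `args` is the TaskArgs rebuilt from
    # the mailbox blob: the same tensors submitted below, with the INPUT
    # tag already stripped at submit time.
    ...

cid = w.register(on_chip_done)
w.init()

def my_orch(orch, args, cfg):
    sub_args = TaskArgs()
    sub_args.add_tensor(make_tensor_arg(output_tensor), TensorArgType.INPUT)
    orch.submit_sub(cid, sub_args)  # child later invokes on_chip_done(args)
```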

docs/worker-manager.md

Lines changed: 8 additions & 3 deletions
@@ -156,15 +156,20 @@ void WorkerThread::dispatch_thread(WorkerDispatch d) {
   `IWorker` casts back appropriately.
 - `IWorker::run` dispatches polymorphically based on the actual worker type.

+Note: SUB workers in PROCESS mode bypass `IWorker` entirely — the Python
+child loop (``_sub_worker_loop``) reads the args blob from the mailbox,
+decodes it into a ``TaskArgs``, and calls the registered callable as
+``fn(args)``. The C++ dispatch path writes the same mailbox format for
+both worker types.
+
 **When is THREAD mode safe?**

 - The IWorker implementation must be thread-safe relative to other concurrent
   calls and other system state
 - `ChipWorker` (dlsym'd runtime.so) is safe when the runtime `.so` and its
   device driver support concurrent use
-- `SubWorker` in THREAD mode is constrained by Python's GIL (all SubWorkers
-  in the pool effectively serialize), but this is often fine for light
-  Python callables
+- SUB workers run in Python child processes (PROCESS mode) where the
+  callable receives ``TaskArgs`` as its sole argument

 ---


python/simpler/orchestrator.py

Lines changed: 3 additions & 3 deletions
@@ -12,18 +12,18 @@
 Orchestrator handle at init, retrieves the C++ object via ``DistWorker.get_orchestrator()``,
 and passes the handle to the user's orch function::

-    def my_orch(orch, args):
+    def my_orch(orch, args, cfg):
         # build the args object yourself; tags drive dependency inference
         a = TaskArgs()
         a.add_tensor(make_tensor_arg(input_tensor), TensorArgType.INPUT)
         a.add_tensor(make_tensor_arg(output_tensor), TensorArgType.OUTPUT)
-        orch.submit_next_level(chip_callable, a, config)
+        orch.submit_next_level(chip_callable, a, cfg)

         sub_args = TaskArgs()
         sub_args.add_tensor(make_tensor_arg(output_tensor), TensorArgType.INPUT)
         orch.submit_sub(cid, sub_args)

-    w.run(Task(orch=my_orch, args=my_args))
+    w.run(my_orch, my_args, my_config)

 Scope/drain lifecycle is managed by ``Worker.run()``; users never call those
 directly.

python/simpler/worker.py

Lines changed: 51 additions & 35 deletions
@@ -13,64 +13,49 @@
 # L2: one NPU chip
 w = Worker(level=2, device_id=8, platform="a2a3", runtime="tensormap_and_ringbuffer")
 w.init()
-w.run(chip_callable, chip_args, block_dim=24)
+w.run(chip_callable, chip_args, config)
 w.close()

 # L3: multiple chips + SubWorkers, auto-discovery in init()
 w = Worker(level=3, device_ids=[8, 9], num_sub_workers=2,
            platform="a2a3", runtime="tensormap_and_ringbuffer")
-cid = w.register(lambda: postprocess())
+cid = w.register(lambda args: postprocess())
 w.init()

-def my_orch(orch, args):
-    r = orch.submit_next_level(chip_callable, chip_args_ptr, config, outputs=[64])
-    orch.submit_sub(cid, inputs=[r.outputs[0].ptr])
+def my_orch(orch, args, cfg):
+    r = orch.submit_next_level(chip_callable, chip_args_ptr, cfg)
+    orch.submit_sub(cid, sub_args)

-w.run(Task(orch=my_orch, args=my_args))
+w.run(my_orch, my_args, my_config)
 w.close()
 """

 import ctypes
 import os
 import struct
 import sys
-from dataclasses import dataclass, field
 from multiprocessing.shared_memory import SharedMemory
 from typing import Any, Callable, Optional

 from .orchestrator import Orchestrator
 from .task_interface import (
     DIST_MAILBOX_SIZE,
+    ChipCallConfig,
     ChipWorker,
+    ContinuousTensor,
+    DataType,
     DistWorker,
+    TaskArgs,
     _ChipWorker,
 )

-# ---------------------------------------------------------------------------
-# Task
-# ---------------------------------------------------------------------------
-
-
-@dataclass
-class Task:
-    """Execution unit for Worker.run() at any level.
-
-    For L2: call ``Worker.run(chip_callable, chip_args, config)`` directly.
-    For L3+: provide an orch function ``fn(orchestrator, args)`` that builds
-    the DAG via ``orchestrator.submit_*``.
-    """
-
-    orch: Callable
-    args: Any = field(default=None)
-
-
 # ---------------------------------------------------------------------------
 # Unified mailbox layout (must match dist_worker_manager.h MAILBOX_OFF_*)
 # ---------------------------------------------------------------------------
 #
 # One layout for both NEXT_LEVEL (chip) and SUB workers. SUB children
-# read `callable` as a uint64 encoding the callable_id and ignore the
-# config + args_blob region.
+# read `callable` as a uint64 encoding the callable_id and decode the
+# args_blob region to pass TaskArgs to the registered callable.

 _OFF_STATE = 0
 _OFF_ERROR = 4
@@ -93,6 +78,35 @@ def _mailbox_addr(shm: SharedMemory) -> int:
     return ctypes.addressof(ctypes.c_char.from_buffer(buf))


+def _read_args_from_mailbox(buf) -> TaskArgs:
+    """Decode the TaskArgs blob written by C++ write_blob from the mailbox.
+
+    Blob layout at _OFF_ARGS:
+        int32 tensor_count (T), int32 scalar_count (S),
+        ContinuousTensor[T] (40 B each), uint64_t[S] (8 B each).
+    """
+    base = _OFF_ARGS
+    t_count = struct.unpack_from("i", buf, base)[0]
+    s_count = struct.unpack_from("i", buf, base + 4)[0]
+
+    args = TaskArgs()
+    ct_off = base + 8
+    for i in range(t_count):
+        off = ct_off + i * 40
+        data = struct.unpack_from("Q", buf, off)[0]
+        shapes = struct.unpack_from("5I", buf, off + 8)
+        ndims = struct.unpack_from("I", buf, off + 28)[0]
+        dtype_val = struct.unpack_from("B", buf, off + 32)[0]
+        ct = ContinuousTensor.make(data, tuple(shapes[:ndims]), DataType(dtype_val))
+        args.add_tensor(ct)
+
+    sc_off = ct_off + t_count * 40
+    for i in range(s_count):
+        args.add_scalar(struct.unpack_from("Q", buf, sc_off + i * 8)[0])
+
+    return args
+
+
 def _sub_worker_loop(buf, registry: dict) -> None:
     """Runs in forked child process. Reads unified mailbox layout."""
     while True:
@@ -105,7 +119,8 @@ def _sub_worker_loop(buf, registry: dict) -> None:
             error = 1
         else:
             try:
-                fn()
+                args = _read_args_from_mailbox(buf)
+                fn(args)
             except Exception:  # noqa: BLE001
                 error = 2
         struct.pack_into("i", buf, _OFF_ERROR, error)
@@ -351,25 +366,26 @@ def _start_level3(self) -> None:
     # run — uniform entry point
     # ------------------------------------------------------------------

-    def run(self, task_or_callable, args=None, **kwargs) -> None:
-        """Execute one task synchronously.
+    def run(self, callable, args=None, config=None) -> None:
+        """Execute one task (L2) or one DAG (L3+) synchronously.

-        L2: run(chip_callable, chip_args, block_dim=N)
-        L3: run(Task(orch=fn, args=...))
+        callable: ChipCallable (L2) or Python orch fn (L3+)
+        args: TaskArgs (optional)
+        config: ChipCallConfig (optional, default-constructed if None)
         """
         assert self._initialized, "Worker not initialized; call init() first"
+        cfg = config if config is not None else ChipCallConfig()

         if self.level == 2:
             assert self._chip_worker is not None
-            self._chip_worker.run(task_or_callable, args, **kwargs)
+            self._chip_worker.run(callable, args, cfg)
         else:
             self._start_level3()
             assert self._orch is not None
             assert self._dist_worker is not None
-            task = task_or_callable
             self._orch._scope_begin()
             try:
-                task.orch(self._orch, task.args)
+                callable(self._orch, args, cfg)
             finally:
                 # Always release scope refs and drain so ring slots aren't
                 # stranded when the orch fn raises mid-DAG.
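
For reference, a pack-side counterpart to `_read_args_from_mailbox` (a hypothetical Python stand-in for the C++ `write_blob`, handy for unit-testing the decoder; it assumes `DataType(1)` is a valid enum member and that native struct layout matches the 40-byte record described in the docstring):

```python
import struct

def _write_args_blob(buf, base, tensors, scalars):
    """Hypothetical mirror of C++ write_blob: int32 T, int32 S, then
    T 40-byte records (uint64 data, 5x uint32 shape, uint32 ndims,
    uint8 dtype, padded to 40 B), then S uint64 scalars."""
    struct.pack_into("ii", buf, base, len(tensors), len(scalars))
    off = base + 8
    for data, shape, dtype_val in tensors:
        shape5 = tuple(shape) + (0,) * (5 - len(shape))
        struct.pack_into("Q5IIB", buf, off, data, *shape5, len(shape), dtype_val)
        off += 40  # 33 packed bytes + 7 bytes of padding per record
    for s in scalars:
        struct.pack_into("Q", buf, off, s)
        off += 8

# Round trip against the decoder added in this commit:
buf = bytearray(DIST_MAILBOX_SIZE)
_write_args_blob(buf, _OFF_ARGS, [(0x1000, (4, 8), 1)], [7, 42])
args = _read_args_from_mailbox(buf)  # one 4x8 tensor, scalars 7 and 42
```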

simpler_setup/scene_test.py

Lines changed: 4 additions & 5 deletions
@@ -815,8 +815,6 @@ def _run_and_validate_l3(
         enable_profiling=False,
         enable_dump_tensor=False,
     ):
-        from simpler.worker import Task  # noqa: PLC0415
-
         params = case.get("params", {})
         config_dict = case.get("config", {})

@@ -854,12 +852,13 @@ def _run_and_validate_l3(
             enable_dump_tensor=enable_dump_tensor,
         )

-        # Wrap in Task — user orch signature: (orch, callables, task_args, config)
-        def task_orch(orch, _unused, _ns=ns, _test_args=test_args, _config=config):
+        # Orch fn signature: (orch, args, cfg) — inner fn forwards to
+        # the user's scene orch which takes (orch, callables, task_args, config).
+        def task_orch(orch, _args, _cfg, _ns=ns, _test_args=test_args, _config=config):
             orch_fn(orch, _ns, _test_args, _config)

         with _temporary_env(self._resolve_env()):
-            worker.run(Task(orch=task_orch))
+            worker.run(task_orch)

         if not skip_golden:
             _compare_outputs(test_args, golden_args, all_tensor_names, self.RTOL, self.ATOL)

tests/st/a2a3/tensormap_and_ringbuffer/test_l3_dependency.py

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@
 KERNELS_BASE = "../../../../examples/a2a3/tensormap_and_ringbuffer/vector_example/kernels"


-def verify():
+def verify(args):
     """SubCallable — dependency target, runs after ChipTask completes."""


tests/st/a2a3/tensormap_and_ringbuffer/test_l3_group.py

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@
 KERNELS_BASE = "../../../../examples/a2a3/tensormap_and_ringbuffer/vector_example/kernels"


-def verify():
+def verify(args):
     """SubCallable — runs after group completes."""

