| 1 | +# Devsper Runtime Core |
| 2 | + |
| 3 | +The Devsper runtime is a modular orchestration core that provides bounded concurrency, dynamic DAG mutation, and observability driven by an event stream. |
| 4 | + |
| 5 | +## Runtime Architecture Diagram |
| 6 | + |
| 7 | +```text |
| 8 | +User Task |
| 9 | + | |
| 10 | + v |
| 11 | +Swarm.run() |
| 12 | + | |
| 13 | + v |
| 14 | +Planner -> Scheduler (DAG) |
| 15 | + | |
| 16 | + v |
| 17 | +Runtime Executor ------------------------------+ |
| 18 | + | | |
| 19 | + +-> RuntimeStateManager | |
| 20 | + +-> ExecutionGraph | |
| 21 | + +-> RuntimeEventStream -> EventLog -> SSE | |
| 22 | + +-> TaskRunner | |
| 23 | + +-> AgentRunner (optional stream-tool loop) |
| 24 | + +-> Agent |
| 25 | + +-> ToolRunner / tools.tool_runner |
| 26 | +``` |
| 27 | + |
| 28 | +## Core Components |
| 29 | + |
| 30 | +- `swarm/swarm.py` |
| 31 | + - Public entrypoint (`Swarm`), config/wiring, planning and execution bootstrap. |
| 32 | +- `runtime/executor.py` |
| 33 | + - Event-driven scheduler loop, bounded parallel execution, cancellation propagation, dynamic task injection. |
| 34 | +- `runtime/state_manager.py` |
| 35 | + - Concurrency-safe task state transitions and runtime DAG mutation entrypoint. |
| 36 | +- `runtime/execution_graph.py` |
| 37 | + - Execution graph with lineage, edges, attempts, and status transitions. |
| 38 | +- `runtime/planner.py` |
| 39 | + - Runtime wrapper for dynamic planner expansion and parent-child lineage. |
| 40 | +- `runtime/task_runner.py` |
| 41 | + - Task lifecycle orchestration with scoped retries and fallback model handling. |
| 42 | +- `runtime/agent_runner.py` |
| 43 | + - Async agent wrapper with optional streaming tool invocation loop. |
| 44 | +- `runtime/tool_runner.py` |
| 45 | + - Parallel tool scheduler with bounded concurrency, dependency-aware batching, timeout, and cancellation. |
| 46 | +- `runtime/event_stream.py` |
| 47 | + - In-process stream with queue backpressure policy (drop-oldest on overflow). |
| 48 | +- `runtime/retry.py` |
| 49 | + - Retry scopes (`tool`, `agent`, `task`, `model_fallback`) and backoff policies. |
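The retry scopes listed for `runtime/retry.py` can be pictured as one backoff policy per scope. The sketch below is illustrative only: `BackoffPolicy`, `POLICIES`, `retry_schedule`, and every numeric value are assumptions made for this example, not the module's actual API or defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackoffPolicy:
    """Hypothetical exponential backoff policy (not the real retry.py API)."""
    max_attempts: int = 3      # total attempts, including the first one
    base_delay: float = 0.5    # seconds before the first retry
    factor: float = 2.0        # multiplier applied per retry
    max_delay: float = 8.0     # upper bound on any single delay

    def delay(self, retry_number: int) -> float:
        """Delay in seconds before retry `retry_number` (1-based)."""
        raw = self.base_delay * self.factor ** (retry_number - 1)
        return min(self.max_delay, raw)


# One policy per scope named in the component list; values are made up.
POLICIES = {
    "tool": BackoffPolicy(max_attempts=3, base_delay=0.25),
    "agent": BackoffPolicy(max_attempts=2),
    "task": BackoffPolicy(max_attempts=2, base_delay=1.0),
    "model_fallback": BackoffPolicy(max_attempts=1),
}


def retry_schedule(scope: str) -> list[float]:
    """Delays between attempts for a scope (max_attempts - 1 retries)."""
    policy = POLICIES[scope]
    return [policy.delay(n) for n in range(1, policy.max_attempts)]
```

Keeping the policy separate from the code that sleeps and retries makes it straightforward to give each scope its own budget.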
| 50 | + |
| 51 | +## Execution Lifecycle |
| 52 | + |
| 53 | +1. `Swarm.run()` creates the root task, plans subtasks, and builds the scheduler DAG. |
| 54 | +2. Executor emits `EXECUTOR_STARTED` and begins bounded task dispatch. |
| 55 | +3. Ready tasks transition to running through `RuntimeStateManager`. |
| 56 | +4. `TaskRunner` executes with scoped retries. |
| 57 | +5. `AgentRunner` may invoke tools (normal or streaming tool loop). |
| 58 | +6. Task completion or failure updates the scheduler and the execution graph. |
| 59 | +7. Adaptive mode can inject follow-up tasks at runtime. |
| 60 | +8. Executor emits `EXECUTOR_FINISHED` and `RUN_COMPLETED`. |
| 61 | + |
| 62 | +## Tool Calling Flow |
| 63 | + |
| 64 | +1. Agent determines tool intent from model output. |
| 65 | +2. Tool calls are parsed and scheduled. |
| 66 | +3. `ToolRunner.run_many(...)` executes calls in parallel with: |
| 67 | + - `max_concurrency` semaphore |
| 68 | + - dependency constraints (`depends_on`) |
| 69 | + - per-call timeouts |
| 70 | + - cancellation checks |
| 71 | +4. Results are isolated per call and fed back into agent loop. |
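The steps above can be sketched with `asyncio`. This is a minimal illustration, not the real `ToolRunner`: the `ToolCall` dataclass, the shape of the result dictionaries, and the free function `run_many` are assumptions. It also assumes every `depends_on` id refers to a call in the same batch and that the dependencies contain no cycles.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable


@dataclass
class ToolCall:
    """Hypothetical call descriptor; field names are illustrative."""
    call_id: str
    func: Callable[[], Awaitable[Any]]
    depends_on: list = field(default_factory=list)


async def run_many(calls, max_concurrency=4, timeout=5.0):
    sem = asyncio.Semaphore(max_concurrency)   # bounds parallel tool calls
    done = {c.call_id: asyncio.Event() for c in calls}
    results = {}

    async def run_one(call):
        # Wait for dependencies before taking a concurrency slot.
        for dep in call.depends_on:
            await done[dep].wait()
        try:
            failed = [d for d in call.depends_on if "error" in results[d]]
            if failed:
                results[call.call_id] = {"error": f"dependency failed: {failed[0]}"}
                return
            async with sem:
                value = await asyncio.wait_for(call.func(), timeout)
            results[call.call_id] = {"ok": value}
        except asyncio.TimeoutError:
            results[call.call_id] = {"error": "timeout"}
        except Exception as exc:
            # Isolate failures per call; cancellation is a BaseException
            # and therefore still propagates to the caller.
            results[call.call_id] = {"error": repr(exc)}
        finally:
            done[call.call_id].set()   # always unblock dependents

    await asyncio.gather(*(run_one(c) for c in calls))
    return results
```

Because each call records its own outcome, one failing or slow tool does not discard the results of the others; dependents of a failed call are skipped rather than executed.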
| 72 | + |
| 73 | +## Planner Flow |
| 74 | + |
| 75 | +1. Initial decomposition by `swarm/planner.py`. |
| 76 | +2. Runtime execution completes a task. |
| 77 | +3. `RuntimePlanner.expand(...)` optionally creates follow-up tasks. |
| 78 | +4. `RuntimeStateManager.add_tasks(...)` mutates DAG safely. |
| 79 | +5. `ExecutionGraph` records lineage from parent task. |
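A lock-guarded DAG mutation with parent lineage, as in steps 4 and 5, might look roughly like the following. `StateManagerSketch` and its method signatures are hypothetical stand-ins, not the actual `RuntimeStateManager` or `ExecutionGraph` interfaces, and cycle detection is omitted for brevity.

```python
import threading


class StateManagerSketch:
    """Illustrative stand-in for a concurrency-safe DAG mutation entrypoint."""

    def __init__(self):
        self._lock = threading.Lock()
        self.deps = {}     # task_id -> set of dependency task ids
        self.parent = {}   # task_id -> id of the task that spawned it

    def add_tasks(self, parent_id, new_tasks):
        """Add tasks atomically; new_tasks maps task_id -> dependency ids."""
        with self._lock:
            known = set(self.deps) | set(new_tasks)
            # Validate everything first so a bad batch changes nothing.
            for task_id, deps in new_tasks.items():
                if task_id in self.deps:
                    raise ValueError(f"duplicate task: {task_id}")
                missing = set(deps) - known
                if missing:
                    raise ValueError(f"unknown dependencies: {sorted(missing)}")
            for task_id, deps in new_tasks.items():
                self.deps[task_id] = set(deps)
                self.parent[task_id] = parent_id

    def lineage(self, task_id):
        """Return ancestors of a task, nearest parent first."""
        with self._lock:
            chain = []
            current = self.parent.get(task_id)
            while current is not None:
                chain.append(current)
                current = self.parent.get(current)
            return chain
```

Validating the whole batch before writing keeps the graph consistent even when a runtime expansion is rejected.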
| 80 | + |
| 81 | +## Concurrency Model |
| 82 | + |
| 83 | +- Bounded executor workers (`worker_count`). |
| 84 | +- Bounded tool concurrency (`ToolRunner` semaphore). |
| 85 | +- Queue backpressure in event streaming (`max_queue_size` with controlled dropping). |
| 86 | +- Cooperative pause/resume and cancellation propagation across runtime loops. |
| 87 | +- Lock-guarded scheduler mutations via `RuntimeStateManager`. |
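The drop-oldest backpressure policy can be illustrated with a bounded `collections.deque`. This is a single-threaded sketch under assumptions: `DropOldestStream`, `publish`, `drain`, and the `dropped` counter are hypothetical names, and a real in-process stream would also need locking or an async queue.

```python
from collections import deque


class DropOldestStream:
    """Bounded buffer that evicts the oldest event when full."""

    def __init__(self, max_queue_size):
        if max_queue_size < 1:
            raise ValueError("max_queue_size must be at least 1")
        self._events = deque(maxlen=max_queue_size)
        self.dropped = 0   # number of events lost to overflow

    def publish(self, event):
        if len(self._events) == self._events.maxlen:
            self.dropped += 1   # append below evicts the oldest entry
        self._events.append(event)

    def drain(self):
        """Return buffered events in arrival order and clear the buffer."""
        items = list(self._events)
        self._events.clear()
        return items
```

Dropping the oldest events favors recent state for observers at the cost of a complete history, so counting the drops helps make the loss visible.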
| 88 | + |
| 89 | +## Distributed Runtime Architecture |
| 90 | + |
| 91 | +```text |
| 92 | +Controller |
| 93 | + | |
| 94 | + v |
| 95 | +Worker Pool |
| 96 | + | |
| 97 | + v |
| 98 | +Worker Runtime |
| 99 | + | |
| 100 | + v |
| 101 | +Runtime Executor |
| 102 | + | |
| 103 | + v |
| 104 | +Agent Pool |
| 105 | + | |
| 106 | + v |
| 107 | +Tool Runner |
| 108 | +``` |
| 109 | + |
| 110 | +- `distributed/controller.py` |
| 111 | + - Worker registry, health state, load-aware assignment, reassignment hooks. |
| 112 | +- `distributed/worker_runtime.py` |
| 113 | + - Worker-local runtime composition: `Executor`, `AgentPool`, `ModelRouter`, `ToolRunner`. |
| 114 | +- Existing multi-node transport/control remains in `nodes/controller.py` and `nodes/worker.py`. |
| 115 | + |
| 116 | +## Worker Lifecycle |
| 117 | + |
| 118 | +1. Worker registers with controller. |
| 119 | +2. Controller assigns tasks based on health and load. |
| 120 | +3. Worker executes task locally through runtime executor stack. |
| 121 | +4. Worker returns completion/failure. |
| 122 | +5. Controller updates worker/task state and reassigns on failure. |
| 123 | + |
| 124 | +## Controller Lifecycle |
| 125 | + |
| 126 | +1. Track worker registration and health. |
| 127 | +2. Assign ready tasks using load-aware strategy. |
| 128 | +3. Handle failures and perform retry/reassignment. |
| 129 | +4. Maintain global execution progress via event/log channels. |
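One plausible reading of "load-aware assignment" is to send each ready task to the healthy worker with the lowest utilization. The sketch below assumes that interpretation; `WorkerInfo`, `pick_worker`, and `assign` are hypothetical names, not the `distributed/controller.py` API.

```python
from dataclasses import dataclass


@dataclass
class WorkerInfo:
    """Hypothetical registry entry for one worker."""
    worker_id: str
    healthy: bool = True
    capacity: int = 4   # max concurrent tasks
    active: int = 0     # tasks currently assigned


def pick_worker(workers):
    """Return the least-utilized healthy worker with free capacity, or None."""
    candidates = [w for w in workers if w.healthy and w.active < w.capacity]
    if not candidates:
        return None
    # Tie-break on worker_id so the choice is deterministic.
    return min(candidates, key=lambda w: (w.active / w.capacity, w.worker_id))


def assign(task_ids, workers):
    """Assign tasks in order until no worker has capacity left."""
    assignment = {}
    for task_id in task_ids:
        worker = pick_worker(workers)
        if worker is None:
            break   # remaining tasks stay ready for a later pass
        worker.active += 1
        assignment[task_id] = worker.worker_id
    return assignment
```

Reassignment on failure could reuse the same function after marking the failed worker as unhealthy.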
| 130 | + |
| 131 | +## Agent Pool |
| 132 | + |
| 133 | +- `AgentPool` manages reusable agent instances per worker. |
| 134 | +- Supports `acquire_agent`, `release_agent`, `run_agent`, and `run_parallel`. |
| 135 | +- Enables worker-local, concurrent, reuse-based execution. |
| 136 | + |
| 137 | +## Speculative Execution |
| 138 | + |
| 139 | +- `SpeculativePlanner` predicts likely successor tasks. |
| 140 | +- Executor marks/schedules speculative tasks early. |
| 141 | +- Unused speculative branches can be cancelled on dependency failure. |
| 142 | + |
| 143 | +## HITL Flow |
| 144 | + |
| 145 | +1. Agent indicates human input is required. |
| 146 | +2. Runtime emits HITL/clarification event. |
| 147 | +3. Task pauses until response or timeout. |
| 148 | +4. Execution resumes or fails based on response policy. |
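The pause-until-response-or-timeout step can be sketched with an `asyncio.Event`. Everything named here (`ClarificationRequest`, `run_with_hitl`, the `on_timeout` values, the returned tuples) is an assumption made for illustration and not the runtime's actual HITL interface. The request object should be created inside a running event loop.

```python
import asyncio


class ClarificationRequest:
    """Hypothetical handle for one pending human-input request."""

    def __init__(self, question):
        self.question = question
        self._answered = asyncio.Event()
        self._answer = None

    def respond(self, answer):
        self._answer = answer
        self._answered.set()

    async def wait(self, timeout):
        """Return the answer, or None if no response arrives in time."""
        try:
            await asyncio.wait_for(self._answered.wait(), timeout)
        except asyncio.TimeoutError:
            return None
        return self._answer


async def run_with_hitl(request, timeout, on_timeout="fail", default_answer=""):
    """Pause on the request, then resume or fail per the timeout policy."""
    answer = await request.wait(timeout)
    if answer is not None:
        return ("resumed", answer)
    if on_timeout == "resume_with_default":
        return ("resumed", default_answer)
    return ("failed", "timeout")
```

Only the waiting task is suspended, so other tasks in the DAG can keep running while a clarification is pending.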