This document explains the runtime model behind MirrorNeuron and why the project now looks the way it does.
MirrorNeuron is not trying to replicate Airflow as a general-purpose data scheduler. It is a multi-agent runtime with a narrower job:
- keep orchestration and collaboration in BEAM
- let worker code run in safe isolated sandboxes
- support cross-node execution
- keep inter-agent communication event-driven and observable
- avoid turning the BEAM control plane into a pile of heavyweight OS processes
That last point drives most of the recent changes.
The project intentionally borrows a few control-plane ideas from Airflow while avoiding Airflow's product shape.
- Small built-in primitive set instead of domain-specific agents
- Clear separation between author-time workflow definition and runtime execution
- Pools and slot accounting for heavyweight execution capacity
- Strong message and artifact boundaries between scheduler concerns and task payload concerns
Airflow's big lesson for this runtime is not "copy operators." It is "treat heavyweight execution capacity as scarce and schedule it explicitly."
MirrorNeuron now separates the user-facing problem workflow from the lower-level agent runtime. For mn.workflow/v1 blueprints, workflow is the problem DAG: it names the source, sink, step dependencies, branch requirements, join behavior, and accepted outcomes. agents.nodes and agents.edges are the runtime agent communication topology used by the BEAM runtime; they are not the problem workflow. Top-level nodes, edges, and entrypoints are runtime-submission compatibility fields and should not be authored in OtterDesk manifests.
The current DAG executor is a static DAG scheduler with per-step lifecycle state. It is not a full GOAP or PDDL planner yet.
workflow.edges + workflow.steps
-> graph compiler
-> static DAG scheduler
-> step lifecycle state machine
-> runtime.bindings
-> agents, workers, validators, reducers
-> workflow events
-> CLI and Web UI progress snapshots
Execution works in layers:
- The graph compiler validates source/sink, reachability, acyclicity, edge ids, join modes, retry bounds, timeout values, and parallel output path conflicts.
- The scheduler keeps a map of step outcomes such as
done,partial,skipped,failed, andblocked. - A step becomes ready only when its parent edges and join rule are satisfied. Ready steps in the same graph layer can run concurrently.
- Each step follows a small lifecycle:
pending -> ready -> running -> done, with terminal alternatives such aspartial,skipped,failed, orblocked. runtime.bindingsmaps each workflow step to one or more internal workers. A tax step, for example, may run a preparer and a validator while still appearing as one problem-workflow node.- The executor emits generic workflow events such as
workflow_graph_compiled,workflow_step_started,workflow_edge_satisfied,workflow_join_waiting, andworkflow_finished. The shared SDK turns these into the progress snapshots used by CLI and Web UI.
GOAP/PDDL-like planning is intentionally a future layer. The manifest already has planner-friendly fields such as requires, provides, control, and disabled dynamic metadata, but the runtime currently executes the declared graph. A future planner can generate or patch workflow; the static DAG scheduler remains the executor for the currently accepted graph version.
MirrorNeuron now has a sharper two-layer model.
The control plane lives in BEAM:
- job orchestration
- message routing
- supervision
- persistence
- placement scheduling and allocation metadata
- retries and backoff
- event history
- aggregation
- cluster coordination
- service registry health state
- deployment and schedule dispatch state
These are all cheap, stateful, highly concurrent tasks that BEAM is excellent at.
The execution plane lives in OpenShell:
- sandbox creation
- external process startup
- filesystem staging
- shell or Python command execution
- stdout and stderr capture
- sandbox isolation
This work is comparatively expensive. It should be bounded and scheduled, not launched without limit.
This is the most important runtime distinction.
A logical worker is a BEAM process representing workflow state:
- it can receive messages
- it can wait cheaply
- it can be retried
- it can emit events
- it may request external execution
A physical execution lease is permission to consume sandbox capacity on the current node:
- one or more executor slots from a pool
- typically one OpenShell sandbox run
- expensive compared to a BEAM process
The runtime should be able to hold thousands of logical workers without starting thousands of sandboxes at once.
That is why executor nodes now go through MirrorNeuron.Execution.LeaseManager.
Before this change, each executor agent launched OpenShell directly from its own process. That worked for small demos, but it broke down under large fan-out:
- gateway resets under launch pressure
- duplicate cleanup races
- OS subprocess pressure
- too many expensive execution attempts starting at once
The lease manager addresses that by making executor capacity explicit.
- Every executor requests a lease before running OpenShell.
- The lease manager grants or queues the request per node.
- Executors emit events when they request, acquire, and release leases.
- Capacity is configured with
MN_EXECUTOR_MAX_CONCURRENCYor per-pool overrides.
This keeps BEAM lightweight while still allowing large logical graphs.
MirrorNeuron now supports local executor pools.
- default pool capacity comes from
MN_EXECUTOR_MAX_CONCURRENCY - named pools can be configured with
MN_EXECUTOR_POOL_CAPACITIES
Examples:
export MN_EXECUTOR_MAX_CONCURRENCY=4
export MN_EXECUTOR_POOL_CAPACITIES="default=4,gpu=1,io=8"Executor node config can request a pool and slot count:
{
"agent_type": "executor",
"config": {
"pool": "default",
"pool_slots": 1
}
}At the moment, pools are enforced per runtime node. That means the cluster scales by adding nodes, each with its own bounded execution capacity.
The message system is intentionally split into control-plane and payload-plane sections.
envelope: runtime-owned routing and trace metadataheaders: extensible routing and schema metadatabody: application-owned payloadartifacts: references to large externalized datastream: optional stream framing metadata
This keeps the runtime generic.
The runtime should understand enough to route and observe messages, but it should not need to parse every application payload to work correctly.
That also enables multi-language workers: Python and shell code consume the payload contract inside the sandbox, while BEAM only needs the stable envelope.
MirrorNeuron now has a small-cluster orchestration layer above the message graph:
- desired-vs-actual reconciliation for failed nodes and orphaned jobs
- full
service,batch,system, andsysbatchjob type behavior - restart and reschedule policy with sliding windows and backoff
- node maintenance and drain
- Redis-backed service registry and health checks
- CUDA, Metal, port, volume, and runtime-driver aware placement
- rolling, canary, promote, rollback, and version history for long-running deployments
- periodic, delayed, and event-triggered job dispatch
The detailed map is in Nomad-Inspired Runtime Features.
Passing large blobs directly between agents is a bad fit for a clustered runtime.
Instead, messages should stay small and carry:
- ids
- routing metadata
- schema references
- artifact references
This is the same core operational lesson many schedulers learn: small control messages scale much better than inline large payloads.
Cluster BlobRef sharing uses a simple peer HTTP content-addressed store. Blob URLs are not individually bearer-token authenticated; operators should expose the artifact port only on trusted loopback, Docker, VPN, or LAN interfaces and use firewall rules or bind/publish host settings to keep it inside the intended cluster boundary.
MirrorNeuron keeps only a small runtime primitive set in core:
routerexecutoraggregatorsensor
Why so small:
- runtime primitives stay reusable
- domain-specific agents belong in user code or examples
- the project stays focused on collaboration mechanics instead of shipping business personas
The project is not BEAM-only in the sense of "all useful code must be Elixir." That would make the runtime less practical.
Instead:
- BEAM owns coordination
- OpenShell owns isolated execution
- worker payloads can be shell, Python, or other supported runtimes
That means a logical worker does not "become Python." It requests external execution when needed.
A healthy deployment usually looks like this:
- many logical workers
- modest executor concurrency per node
- more nodes added when more real execution capacity is needed
Example:
- 1000 logical workers
- 4 nodes
- 8 executor slots per node
- real concurrent sandbox count: 32
That is still a large multi-agent workload, but it does not overload a single OpenShell gateway.
The current implementation improves the runtime boundary a lot, but a few things are still intentionally simple:
- executor pools are local to each node, not globally brokered
- there is no cluster-wide lease balancer yet
- sensor and deferred waiting primitives can still grow richer
- artifacts are modeled in messages, but there is not yet a full external artifact store abstraction
- resource allocation is scheduling metadata and environment hints, not OS-level isolation
Those are good next steps, but the current runtime is already much closer to the intended design:
- BEAM as the lightweight orchestrator
- OpenShell as bounded worker execution
- message-driven collaboration across supervised logical workers