
Integrate the new async scheduler with the base FlowCfg #138

Draft
AlexJones0 wants to merge 15 commits into lowRISC:master from AlexJones0:async_scheduler_integration


AlexJones0 commented Apr 1, 2026

Note: this PR is currently a draft because it depends on #134, #135 and #137, which have not yet been merged. The first 2 commits are from #134, the 3rd-5th commits are from #135, and the 6th-9th commits are from #137; these can be safely ignored. Only the last 6 commits are relevant. The PR is otherwise ready for review.

This PR is the sixteenth of a series of PRs to rewrite DVSim's core scheduling functionality (Scheduler, status display, launchers / runtime backends) to use an async design, with key goals of long term maintainability and extensibility.

This PR implements the logic to integrate the async scheduler with the base FlowCfg so that it can actually be used to run jobs in regular DVSim operation. It also hooks up all the relevant scheduler observers - the async status printers, instrumentation, and a log manager (implementing logic previously in the Launcher objects) - so that the full functionality of the previous scheduler is implemented, without unnecessary coupling of code.

Since the intention is to migrate soon, for now the use of this scheduler is placed behind an EXPERIMENTAL_ENABLE_ASYNC_SCHEDULER=1 environment variable. This should let it be tested locally before the final change is made. After the change is made, the old launchers (the ones that have been rewritten), scheduler and status printer can all be removed.

See the commit messages for more information.

This commit fleshes out the abstract `RuntimeBackend` base class with
much of the core functionality needed to implement new runtime backends
(beyond the legacy launcher adapter). This mostly takes the form of
protected methods that backends can optionally call, comprising logic
taken from the `Launcher` base class or previously duplicated across
its various subclasses.

Some key changes to note from the launchers:
- A new `DVSIM_RUN_INTERACTIVE` env var is introduced, intended to
  replace the `RUN_INTERACTIVE` env var long term and avoid potential
  name collisions.
- Errors are raised if an interactive job tries to run on a backend that
  doesn't support running jobs interactively.
- Log parsing functionality is extracted to a separate object; logs
  are always lazily loaded so that for jobs that don't need them
  (passing jobs without any fail or pass patterns), we don't waste time.
- Efficiency of the log contents pass/fail regex pattern parsing is
  improved. Fail patterns are combined into a single regex check, and
  all regexes are compiled once instead of per-line.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
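
The lazy log loading and combined fail-pattern check described in the commit above could look roughly like the sketch below. `LazyLogChecker` and its method names are hypothetical and chosen for illustration; they are not DVSim's actual API.

```python
import re
from functools import cached_property
from pathlib import Path


class LazyLogChecker:
    """Hypothetical helper: reads the log only when first needed."""

    def __init__(self, log_path, fail_patterns=(), pass_patterns=()):
        self.log_path = Path(log_path)
        # Combine all fail patterns into one alternation, compiled once.
        self._fail_re = (
            re.compile("|".join(f"(?:{p})" for p in fail_patterns))
            if fail_patterns
            else None
        )
        # Every pass pattern must match somewhere, so keep them separate.
        self._pass_res = [re.compile(p) for p in pass_patterns]

    @cached_property
    def lines(self):
        # Deferred until a pattern check actually needs the contents.
        return self.log_path.read_text(errors="replace").splitlines()

    def first_failure(self):
        """Return (line_number, line) of the first fail match, or None."""
        if self._fail_re is None:
            return None  # No fail patterns: the log is never read.
        for number, line in enumerate(self.lines, start=1):
            if self._fail_re.search(line):
                return number, line
        return None

    def missing_pass_patterns(self):
        """Return the pass patterns that match no line in the log."""
        return [
            regex.pattern
            for regex in self._pass_res
            if not any(regex.search(line) for line in self.lines)
        ]
```

Joining patterns into one alternation assumes none of them rely on numbered backreferences or global inline flags; patterns that do would need to be checked individually.
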
This is the async `RuntimeBackend` replacement of the `LocalLauncher`,
which will eventually be removed in favour of this new backend.

Some behavioural differences to note:
- We now await the process after a SIGKILL to be sure it has ended,
  bounded by a short timeout in case it is blocked at the kernel level.
- We now use psutil to enumerate and kill descendant processes in
  addition to the created subprocess. This won't catch orphaned
  processes (needs e.g. cgroups), but should cover most sane usage.
- The backend does _not_ link the output directories based on status
  (the `JobSpec.links`, e.g. "passing/", "failed/", "killed/"). The
  intention is that this detail is not core functionality for either
  the scheduler or the backends - instead, it will be implemented as
  an observer on the new async scheduler callbacks when introduced.

By using async subprocesses and launching/killing jobs in batch, we are
able to more efficiently launch jobs in parallel via async coroutines.
We likewise avoid the need to poll jobs - instead we have an async task
awaiting the subprocess' completion, which we then forward to notify the
(to be added) scheduler of the job's completion.

Note that interactive jobs are still handled essentially synchronously,
as before; it is assumed that only 1 interactive job runs at a time.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
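
As a rough illustration of the behaviour described in the commit above, the sketch below launches a subprocess, awaits its completion in a task instead of polling, and bounds the wait after a SIGKILL with a timeout. It uses only the standard library, so the psutil-based descendant cleanup is left out. The function names `run_job` and `kill_and_reap` and the 2 second default timeout are assumptions for illustration.

```python
import asyncio
import sys


async def kill_and_reap(proc, timeout=2.0):
    """Kill a subprocess, then wait briefly for it to be reaped.

    Returns True if the process exited within `timeout` seconds.
    """
    if proc.returncode is None:
        try:
            proc.kill()  # SIGKILL on POSIX systems.
        except ProcessLookupError:
            pass  # The process ended between the check and the kill.
    try:
        await asyncio.wait_for(proc.wait(), timeout=timeout)
    except asyncio.TimeoutError:
        return False  # Possibly blocked in the kernel; stop waiting.
    return True


async def run_job(cmd, on_done):
    """Launch `cmd` and report its exit code to `on_done` without polling."""
    proc = await asyncio.create_subprocess_exec(*cmd)

    async def waiter():
        # Suspends until the subprocess exits, then forwards the result.
        on_done(await proc.wait())

    return proc, asyncio.create_task(waiter())
```
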
See the explanatory comments added to JobStatus. The intention is that
the new async scheduler will distinguish between jobs that are blocked
due to unfinished dependencies (`SCHEDULED`), and those that are pending
because there is no availability to run them, despite their dependencies
being fulfilled (`QUEUED`). This new state is currently unused.

Also add a short test to prevent potential future bugs from status
shorthand name collisions.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
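
To make the `SCHEDULED` / `QUEUED` distinction and the shorthand collision test in the commit above concrete, here is a small sketch. The enum `ExampleStatus`, its shorthand strings, and both helper functions are hypothetical; the real `JobStatus` members and shorthands may differ.

```python
from enum import Enum


class ExampleStatus(Enum):
    """Illustrative statuses; the shorthand values are assumptions."""

    SCHEDULED = "S"  # Blocked on unfinished dependencies.
    QUEUED = "Q"  # Dependencies fulfilled, waiting for capacity.
    RUNNING = "R"
    PASSED = "P"
    FAILED = "F"
    KILLED = "K"


def pending_status(deps_done, slot_available):
    """Classify a job that has not started yet."""
    if not deps_done:
        return ExampleStatus.SCHEDULED
    return ExampleStatus.RUNNING if slot_available else ExampleStatus.QUEUED


def shorthand_collisions(enum_cls):
    """Return member names whose value duplicates an earlier member.

    Enum silently turns a duplicated value into an alias of the first
    member, so a collision shows up as a name that maps to a member
    with a different name.
    """
    return [
        name
        for name, member in enum_cls.__members__.items()
        if member.name != name
    ]
```

A test can then simply assert that `shorthand_collisions(ExampleStatus)` is empty.
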
This field will be used to inform the new scheduler of which backend it
should use to execute a job. Though the plumbing is not there in the
rest of DVSim, the intent is to make the scheduler such that it could
feasibly be run with multiple backends (e.g. some jobs faked, some jobs
on the local machine, some dispatched to various remote clusters).

To support this design, each job spec can now specify that it should be
run on a certain backend, with some designated string name. To instead
just use the configured default backend (which is the current behaviour,
as the current scheduler only supports one backend / `launcher_cls`),
this can be set to `None`.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
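
A minimal sketch of how a per-job backend name with a `None` fallback might be resolved, following the commit above. `ExampleJobSpec` and `select_backend` are hypothetical stand-ins, not the real `JobSpec` or scheduler code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExampleJobSpec:
    """Hypothetical, cut-down job spec."""

    name: str
    backend: Optional[str] = None  # None means "use the default backend".


def select_backend(spec, backends, default_name):
    """Pick the backend object for a job from a name -> backend mapping."""
    key = spec.backend if spec.backend is not None else default_name
    try:
        return backends[key]
    except KeyError:
        raise ValueError(
            f"job {spec.name!r} requests unknown backend {key!r}"
        ) from None
```
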
For now, this is separated in `async_core.py` - the intention is that
it will eventually replace the scheduler in `core.py` when all
necessary components for it to work are integrated.

This commit contains the fully async scheduler design. Some notes:
- Everything is now async. The scheduler is no longer tied to a Timer
  object, nor does it have to manage its print interval and poll
  frequency. It takes advantage of parallelism via cooperative
  multitasking as much as possible.
- The scheduler is designed to support multiple different backends (new
  async versions of launchers). Jobs are dispatched according to their
  specifications and scheduler parameters.
- The scheduler implements the Observer pattern for various events
  (start, end, job status change, kill signal), allowing consumers that
  want to use this functionality (e.g. instrumentation, status printer)
  to hook into the scheduler, instead of unnecessarily coupling code.
- The previous scheduler only recognized killed jobs when they were
  reached in the queue and their status was updated. The new design
  immediately propagates status updates transitively, so all affected
  jobs reflect new information as soon as it is known.
- Since the scheduler knows _why_ it is killing the jobs, we attach
  JobStatusInfo information to give more info in the failure buckets.
- The job DAG is indexed and validated during initialization;
  dependency cycles are detected and cause an error to be raised.
- Job info is encapsulated by records, keeping state centralized
  (outside of indexes).
- The scheduler now accepts a prioritization function. It keeps ready
  jobs in a heap and dispatches the highest-priority job first. The
  default prioritization is by weight, but this can be customized.
- The scheduler now has its own modifiable parallelism limit, separate
  from each individual backend's parallelism limit.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
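
Several of the ideas in the commit above (up-front DAG validation, a priority heap, a scheduler-level parallelism limit, observers, and awaiting completions instead of polling) can be shown together in a toy scheduler. This is a sketch under stated assumptions: `MiniScheduler`, its string statuses, and its callback signature are invented for illustration, and failure propagation to dependent jobs is deliberately omitted.

```python
import asyncio
import heapq
from graphlib import CycleError, TopologicalSorter


class MiniScheduler:
    """Toy scheduler; every name here is illustrative, not DVSim's API."""

    def __init__(self, deps, run_job, priority=lambda job: 0, max_parallel=2):
        # deps maps each job name to the set of jobs it depends on.
        self.deps = {job: set(d) for job, d in deps.items()}
        unknown = set().union(*self.deps.values()) - self.deps.keys()
        if unknown:
            raise ValueError(f"unknown dependencies: {sorted(unknown)}")
        try:
            # Validate the DAG up front; a cycle raises CycleError.
            tuple(TopologicalSorter(self.deps).static_order())
        except CycleError as err:
            raise ValueError(f"dependency cycle: {err.args[1]}") from err
        self.run_job = run_job
        self.priority = priority
        self.max_parallel = max_parallel  # Assumed to be at least 1.
        self.observers = []  # Callables taking (job, status).
        self.status = {}

    def _notify(self, job, status):
        self.status[job] = status
        for observer in self.observers:
            observer(job, status)

    async def run(self):
        for job in self.deps:
            self._notify(job, "scheduled")
        ready, running, done = [], {}, set()
        while len(done) < len(self.deps):
            # Promote jobs whose dependencies have all finished.
            for job, deps in self.deps.items():
                if self.status[job] == "scheduled" and deps <= done:
                    # heapq is a min-heap, so negate the priority.
                    heapq.heappush(ready, (-self.priority(job), job))
                    self._notify(job, "queued")
            # Dispatch by priority while there is spare capacity.
            while ready and len(running) < self.max_parallel:
                _, job = heapq.heappop(ready)
                running[asyncio.create_task(self.run_job(job))] = job
                self._notify(job, "running")
            # Await completions rather than polling.
            finished, _ = await asyncio.wait(
                list(running), return_when=asyncio.FIRST_COMPLETED
            )
            for task in finished:
                job = running.pop(task)
                done.add(job)
                self._notify(job, "passed" if task.result() else "failed")
        return self.status
```

Observers such as a status printer would be appended to `observers` before calling `run()`, which keeps them decoupled from the scheduling loop.
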
Add the new async `StatusPrinter` abstract base class, intended to
replace the original for use with the new async scheduler. The original
will not be removed until the scheduler has been switched.

This is now an abstract base class, rather than an empty class used for
interactive sessions - since the status printer will live outside the
new scheduler, it becomes much easier to just _not connect_ any status
printer hooks during interactive mode.

Some notable changes and overhauls:
- Status printing now runs entirely independently of the scheduler. If
  a print interval > 0 is configured, then the status printer now runs
  as a loop with async awaits such that the timing logic is entirely
  separate from the scheduler, maintained by cooperative multitasking.
- As new functionality, if a print interval of 0 is configured, we
  instead activate in synchronous "event/update-driven mode" where every
  single status update is printed. This might be useful for e.g. the TTY
  printer where you may want to capture exact times of all updates.
- As a result of observing the scheduler, the status printer maintains
  its own stateful tracking of job information.
- Field alignment is calculated from the initial job information and
  data is appropriately justified to clean up the output tables.
- The ability to pause the status bar is introduced to help (later)
  deal with issues in the EnlightenStatusBar, where its terminal
  interactivity can be broken and cause hangs under heavy load.
- General refactoring: the status header and fields are no longer
  hardcoded and are instead derived from the JobStatus enum.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
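
The two printing modes described in the commit above (a periodic loop for intervals greater than 0, and event-driven output for an interval of 0) might be sketched as follows. `MiniStatusPrinter`, its callback name, and its summary format are assumptions for illustration and do not reflect the real `StatusPrinter` interface.

```python
import asyncio
from collections import Counter


class MiniStatusPrinter:
    """Toy observer; names and output format are illustrative only."""

    def __init__(self, interval, emit=print):
        self.interval = interval
        self.emit = emit
        self.statuses = {}  # The printer's own view of job state.

    def on_status_change(self, job, status):
        self.statuses[job] = status
        if self.interval == 0:
            # Event-driven mode: print on every single update.
            self.emit(self.summary())

    def summary(self):
        counts = Counter(self.statuses.values())
        return ", ".join(f"{name}: {n}" for name, n in sorted(counts.items()))

    async def run(self, stop):
        """Periodic mode: print every `interval` seconds until `stop` is set."""
        if self.interval <= 0:
            return
        while not stop.is_set():
            self.emit(self.summary())
            try:
                # Sleep for one interval, but wake early if asked to stop.
                await asyncio.wait_for(stop.wait(), timeout=self.interval)
            except asyncio.TimeoutError:
                pass
        self.emit(self.summary())  # Final state after the run ends.
```

Because `run()` only ever awaits, the print timing is driven by the event loop and stays independent of whatever the scheduler is doing.
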
Extract time-related utilities (the `hms` functionality of the `Timer`
and the two timestamp formats from `fs.py`) into a new `time.py` utility
module. The intention is to use the `hms` functionality inside the
new async status printers and to eventually remove `timer.py` completely
when the old scheduler is removed.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
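
For readers unfamiliar with the `hms` helper mentioned in the commit above, a function of that kind could look like the sketch below. The zero-padded `HH:MM:SS` output format is an assumption; the real utility in `time.py` may format durations differently.

```python
def hms(seconds):
    """Format a non-negative duration in seconds as HH:MM:SS.

    Fractional seconds are truncated; hours are not wrapped at 24.
    """
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"
```
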
This commit ports the `TtyStatusPrinter` to use the new async interface,
extending the `StatusPrinter` interface introduced previously. The
extended logic remains mostly the same as the original code, with some
small refactors and tweaks for aesthetics.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>

This commit ports the original `EnlightenStatusPrinter` to the new async
interface, extending the `StatusPrinter` abstract base class.

The logic is mostly the same, with a few important caveats to note:
- Since the new interface allows float (and hence sub-second) print
  intervals, a warning is introduced for intervals below Enlighten's
  internal minimum delta value. Such intervals cause updates to be
  coalesced and potentially lost if the bar is not refreshed. We could
  also lower the `min_delta` to match the print interval, but
  experimentation shows that this introduces performance concerns, so
  it is best left as is.
- Because of the above, logic is added to refresh (flush) the status bar
  when a target is done, to ensure it locks the final time correctly.
- An occasional bug was encountered on using Ctrl-C to gracefully exit
  where Enlighten's `StatusBar.update` would hang indefinitely. This
  occurred during the terminal protocol used by the underlying Blessed
  library, which queried the terminal for its size and expected a
  response. Under heavy loads, particularly when a large number of
  processes are killed due to an exit signal, the terminal response
  might not be received, causing Blessed to hang on a `getch` call.
  To prevent this, the `pause` interface was introduced to the base
  `StatusPrinter` which is used for that purpose here.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>

Add a new global runtime backend registry to mirror and extend the
existing launcher factory. This serves as a registry for built-in
backends (and legacy launchers), and provides helper methods for
registering custom backends/launchers to support plugin-like extension.

When DVSim fully moves to use the new async scheduler, the launcher
factory that this replaces will be removed.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
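
A registry of the kind described in the commit above could be sketched as a simple name-to-class mapping with registration helpers. `BackendRegistry`, `register`, and `create` are hypothetical names used for illustration and are not the actual DVSim registry interface.

```python
class BackendRegistry:
    """Hypothetical name -> backend class registry."""

    def __init__(self):
        self._backends = {}

    def register(self, name, backend_cls, *, replace=False):
        """Register a backend class; refuse silent overwrites by default."""
        if name in self._backends and not replace:
            raise ValueError(f"backend {name!r} is already registered")
        self._backends[name] = backend_cls

    def create(self, name, **kwargs):
        """Instantiate the backend registered under `name`."""
        try:
            backend_cls = self._backends[name]
        except KeyError:
            known = ", ".join(sorted(self._backends)) or "none"
            raise ValueError(
                f"unknown backend {name!r} (known: {known})"
            ) from None
        return backend_cls(**kwargs)

    def names(self):
        return sorted(self._backends)


# A single module-level instance plays the role of the global registry.
registry = BackendRegistry()
```

A plugin would then call `registry.register("my_backend", MyBackend)` at import time to make a custom backend selectable by name.
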
In the `CovReport.post_finish` deploy callback, check if the coverage
report actually exists and skip / do nothing if not. This is needed as
a temporary workaround for the fake launcher, which does not generate
this kind of report, but does now call `post_finish` in its backend due
to this being implemented on the base backend.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>

Allow DVSim to be run with the new async scheduler via a switch -
either modify the code to make the conditional true, or set e.g.
    `EXPERIMENTAL_ENABLE_ASYNC_SCHEDULER=1 dvsim ...`
before your dvsim command to enable the new scheduler.

Note: this currently does not have the status directory linking
(JobSpec.links) hooked up, nor the status printing (the StatusPrinter
logic, not the logs), or the instrumentation. It just performs the core
scheduling functionality.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
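
The environment variable switch from the commit above could be checked with something like the sketch below. The helper name `async_scheduler_enabled` is hypothetical, and treating only the exact value `1` as enabled is an assumption; the actual conditional may accept other values.

```python
import os


def async_scheduler_enabled(environ=None):
    """Return True if the experimental scheduler is switched on.

    Assumption: only the exact value "1" enables it.
    """
    env = os.environ if environ is None else environ
    return env.get("EXPERIMENTAL_ENABLE_ASYNC_SCHEDULER", "") == "1"
```
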
This is easily added outside of the scheduler itself via the new async
scheduler's callbacks.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>

This replicates the scratch directory symlinking behaviour that is
implemented independently by each of the launcher classes. It never
made much sense for this to live on the launchers: the scheduler should
be in control of the overall job status, not the launcher, which should
just report pass/fail/killed. It also doesn't make sense for it to live
in the scheduler, since ideally we want to keep the core logic separate.

Instead, make these symlink directories via a new minimal `LogManager`
which acts as an observer to scheduler job status changes and creates
soft links as before.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
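
An observer that maintains per-status symlinks, in the spirit of the `LogManager` described in the commit above, might look like this sketch. `MiniLogManager`, its callback signature, and the use of absolute link targets are assumptions for illustration; the real implementation and the layout of `JobSpec.links` may differ.

```python
from pathlib import Path


class MiniLogManager:
    """Toy observer that links a job's directory under its current status."""

    def __init__(self, links):
        # links maps a status name to the directory collecting its symlinks.
        self.links = {status: Path(d) for status, d in links.items()}

    def on_status_change(self, job_name, workdir, status):
        """Move the job's symlink to the directory for `status`.

        Returns the new link path, or None if the status has no directory.
        """
        # Remove any stale link left over from a previous status.
        for directory in self.links.values():
            stale = directory / job_name
            if stale.is_symlink():
                stale.unlink()
        target_dir = self.links.get(status)
        if target_dir is None:
            return None
        target_dir.mkdir(parents=True, exist_ok=True)
        link = target_dir / job_name
        link.symlink_to(Path(workdir).resolve(), target_is_directory=True)
        return link
```
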

AlexJones0 force-pushed the async_scheduler_integration branch from c3de780 to f9a2a54 on April 1, 2026, 18:35

This enables the new async status printers to be used with the new
scheduler. The connections are via the callbacks defined on the
scheduler, independent from the scheduler itself.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>

AlexJones0 force-pushed the async_scheduler_integration branch from f9a2a54 to 8de6347 on April 1, 2026, 18:45