Add new async scheduler#135

Open
AlexJones0 wants to merge 3 commits into lowRISC:master from AlexJones0:new_async_scheduler

AlexJones0 commented Apr 1, 2026

This PR is the thirteenth in a series of PRs to rewrite DVSim's core scheduling functionality (scheduler, status display, launchers / runtime backends) to use an async design, with the key goals of long-term maintainability and extensibility.

Edit: I've also just opened #136. This PR should be merged later down the line (after more integration PRs), but is intended for now to show that the new scheduler is passing the existing test suite.

This is probably the largest PR, and contains the entire new async scheduler implementation (i.e. the scheduler rewrite). The code to integrate this new async scheduler is not yet included: I considered including it to make the async scheduler usable in this PR, but decided that it would be too much to review at once, so I'm deferring it to a future PR. Note also that the new scheduler is being merged in stages: the old scheduler, launchers, etc. will not be removed until the official transition to the new scheduler.

Some of the main features/differences to note about the new scheduler:

  1. Everything is now async. The scheduler is no longer tied to a Timer object, nor does it have to manage its print interval and poll frequency. It takes advantage of parallelism via cooperative multitasking as much as possible.
  2. The scheduler is designed to support multiple different backends (new async versions of launchers). Jobs are dispatched according to their specifications and scheduler parameters.
  3. The scheduler implements the Observer pattern for various events, allowing consumers that want to use this functionality (e.g. instrumentation, status printer) to hook into the scheduler, instead of unnecessarily coupling code. These consumers are not included in this PR (will come later).
  4. The previous scheduler only recognized killed jobs when they were reached in the queue and their status was updated. The new design propagates status updates transitively to all affected jobs as soon as possible. Also, since the scheduler knows why it is killing jobs, we attach the reason to give more failure bucket info.
  5. The job DAG is indexed and validated during initialization; dependency cycles are detected and cause an error to be raised.
  6. The scheduler now accepts a prioritization function. It keeps jobs in a heap and schedules the highest-priority job first. Default prioritization is by weights, but this can be customized.
  7. The scheduler now has its own separate modifiable parallelism limit.
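
Points 1, 3, 6 and 7 can be illustrated with a minimal sketch. This is not the code from this PR: `TinyScheduler`, `Job`, `subscribe` and the status strings are hypothetical names, and job execution is faked with `asyncio.sleep`. It only shows a priority heap being drained under an `asyncio.Semaphore`, with observers notified on each status change.

```python
import asyncio
import heapq
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Job:
    """Hypothetical stand-in for a job record (not DVSim's actual type)."""
    name: str
    weight: int
    duration: float = 0.01


class TinyScheduler:
    """Illustrative only: priority heap + parallelism limit + observers."""

    def __init__(
        self,
        jobs: Iterable[Job],
        max_parallelism: int,
        priority_fn: Optional[Callable[[Job], int]] = None,
    ) -> None:
        self._max_parallelism = max_parallelism
        self._priority_fn = priority_fn or (lambda job: job.weight)
        self._observers: list[Callable[[str, str], None]] = []
        # heapq is a min-heap, so the priority is negated to pop the highest
        # first. The index breaks ties so Job objects are never compared.
        self._heap = [(-self._priority_fn(j), i, j) for i, j in enumerate(jobs)]
        heapq.heapify(self._heap)

    def subscribe(self, callback: Callable[[str, str], None]) -> None:
        """Register an observer called as callback(job_name, status)."""
        self._observers.append(callback)

    def _notify(self, job: Job, status: str) -> None:
        for callback in self._observers:
            callback(job.name, status)

    async def _run_one(self, job: Job, limit: asyncio.Semaphore) -> None:
        try:
            self._notify(job, "running")
            await asyncio.sleep(job.duration)  # placeholder for real work
            self._notify(job, "passed")
        finally:
            limit.release()  # free the slot for the next job

    async def run(self) -> None:
        limit = asyncio.Semaphore(self._max_parallelism)
        tasks = []
        while self._heap:
            await limit.acquire()  # wait for a free slot
            _, _, job = heapq.heappop(self._heap)
            tasks.append(asyncio.create_task(self._run_one(job, limit)))
        await asyncio.gather(*tasks)


if __name__ == "__main__":
    scheduler = TinyScheduler(
        [Job("a", 1), Job("b", 3), Job("c", 2)], max_parallelism=2
    )
    scheduler.subscribe(lambda name, status: print(f"{name}: {status}"))
    asyncio.run(scheduler.run())
```

The semaphore is acquired before popping from the heap, so the job chosen is always the highest-priority one at the moment a slot becomes free.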

See the commit messages for more information.

See the explanatory comments added to JobStatus. The intention is that
the new async scheduler will distinguish between jobs that are blocked
due to unfinished dependencies (`SCHEDULED`), and those that are pending
because there is no availability to run them, despite their dependencies
being fulfilled (`QUEUED`). This new state is currently unused.
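
A minimal sketch of the distinction described above. Only `SCHEDULED` and `QUEUED` come from this commit message; the other members, the `shorthand` property and both helper functions are hypothetical and do not reflect DVSim's actual `JobStatus` definition.

```python
from enum import Enum, auto


class JobStatus(Enum):
    """Hypothetical status set, for illustration only."""
    SCHEDULED = auto()  # blocked on unfinished dependencies
    QUEUED = auto()     # dependencies fulfilled, waiting for capacity
    RUNNING = auto()
    PASSED = auto()
    FAILED = auto()
    KILLED = auto()

    @property
    def shorthand(self) -> str:
        # Assumed convention: the first letter of the member name.
        return self.name[0]


def pending_status(unfinished_deps: int) -> JobStatus:
    """Pick the pending state for a job that has not started yet."""
    return JobStatus.SCHEDULED if unfinished_deps > 0 else JobStatus.QUEUED


def shorthands_are_unique() -> bool:
    """The kind of check a collision test could make."""
    shorthands = [status.shorthand for status in JobStatus]
    return len(shorthands) == len(set(shorthands))
```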

Also add a short test to prevent potential future bugs from status
shorthand name collisions.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>

This field will be used to inform the new scheduler of which backend it
should use to execute a job. Though the plumbing is not there in the
rest of DVSim, the intent is to make the scheduler such that it could
feasibly be run with multiple backends (e.g. some jobs faked, some jobs
on the local machine, some dispatched to various remote clusters).

To support this design, each job spec can now specify that it should be
run on a certain backend, with some designated string name. To instead
just use the configured default backend (which is the current behaviour,
as the current scheduler only supports one backend / `launcher_cls`),
this can be set to `None`.
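
The fallback described above could look roughly like the following. This is a sketch under assumptions: the `JobSpec` shown here has only the two fields needed for the example, and `resolve_backend` is a hypothetical helper, not a function from DVSim.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class JobSpec:
    """Hypothetical, reduced job spec (the real one has more fields)."""
    name: str
    backend: Optional[str] = None  # None means "use the default backend"


def resolve_backend(
    spec: JobSpec, backends: Mapping[str, Any], default_backend: str
) -> Any:
    """Return the backend a job should run on, by its string name."""
    key = spec.backend if spec.backend is not None else default_backend
    try:
        return backends[key]
    except KeyError:
        raise ValueError(
            f"Job {spec.name!r} requests unknown backend {key!r}"
        ) from None
```

Raising on an unknown name, rather than silently falling back to the default, is a design choice made for this sketch only.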

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>

For now, this is separated in `async_core.py` - the intention is that
it will eventually replace the scheduler in `core.py` when all
necessary components for it to work are integrated.

This commit contains the fully async scheduler design. Some notes:
- Everything is now async. The scheduler is no longer tied to a Timer
  object, nor does it have to manage its print interval and poll
  frequency. It takes advantage of parallelism via cooperative
  multitasking as much as possible.
- The scheduler is designed to support multiple different backends (new
  async versions of launchers). Jobs are dispatched according to their
  specifications and scheduler parameters.
- The scheduler implements the Observer pattern for various events
  (start, end, job status change, kill signal), allowing consumers that
  want to use this functionality (e.g. instrumentation, status printer)
  to hook into the scheduler, instead of unnecessarily coupling code.
- The previous scheduler only recognized killed jobs when they were
  reached in the queue and their status was updated. The new design
  propagates status updates transitively to all affected jobs as soon
  as the information is known.
- Since the scheduler knows _why_ it is killing the jobs, we attach
  JobStatusInfo information to give more info in the failure buckets.
- The job DAG is indexed and validated during initialization;
  dependency cycles are detected and cause an error to be raised.
- Job info is encapsulated by records, keeping state centralized
  (outside of indexes).
- The scheduler now accepts a prioritization function. It keeps jobs in
  a heap and schedules the highest-priority job first. Default
  prioritization is by weights, but this can be customized.
- The scheduler now has its own separate modifiable parallelism limit.
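
As a rough illustration of the DAG validation and transitive kill notes above, the sketch below uses Kahn's algorithm to detect cycles and a breadth-first walk to kill dependents. All names and the string statuses are hypothetical, the reason is a plain string rather than a `JobStatusInfo`, and it assumes that dependents of a job that did not pass are killed; the scheduler in this commit may do all of this differently.

```python
from collections import defaultdict, deque


def build_dependents(deps: dict[str, set[str]]) -> dict[str, set[str]]:
    """Invert `deps` (job -> jobs it depends on) into job -> dependents."""
    dependents: dict[str, set[str]] = defaultdict(set)
    for job, needs in deps.items():
        for dep in needs:
            if dep not in deps:
                raise ValueError(f"{job!r} depends on unknown job {dep!r}")
            dependents[dep].add(job)
    return dependents


def validate_acyclic(deps: dict[str, set[str]]) -> None:
    """Raise ValueError if the dependency graph contains a cycle."""
    dependents = build_dependents(deps)
    remaining = {job: len(needs) for job, needs in deps.items()}
    ready = deque(job for job, count in remaining.items() if count == 0)
    visited = 0
    while ready:
        job = ready.popleft()
        visited += 1
        for child in dependents.get(job, ()):
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)
    if visited != len(deps):
        # Jobs never reached are on a cycle or downstream of one.
        stuck = sorted(job for job, count in remaining.items() if count > 0)
        raise ValueError(f"Dependency cycle involving or upstream of: {stuck}")


def kill_transitively(
    failed: str,
    deps: dict[str, set[str]],
    status: dict[str, str],
    reasons: dict[str, str],
) -> None:
    """Mark every not-yet-started transitive dependent of `failed` as killed."""
    dependents = build_dependents(deps)
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, ()):
            if status.get(child) in ("scheduled", "queued"):
                status[child] = "killed"
                reasons[child] = f"dependency {current!r} did not pass"
                queue.append(child)
```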

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>

AlexJones0 force-pushed the new_async_scheduler branch from 1adbdb1 to 1dd6d12 Compare April 1, 2026 16:10
AlexJones0 marked this pull request as ready for review April 1, 2026 16:17
machshev left a comment

Nice work @AlexJones0!
Just a small nit.

Comment on lines +128 to +132
self.backends = dict(backends)
self.default_backend = default_backend
self.max_parallelism = max_parallelism
self.priority_fn = priority_fn or self._default_priority
self.coalesce_window = coalesce_window

Do we need public attributes? This implies they can be mutated externally while the scheduler is running?

"""Prioritizes jobs according to their weight. The default prioritization method."""
return job.spec.weight

def _build_graph(self, specs: Iterable[JobSpec]) -> None:

For later, but it might be nice to be able to build the graph as a debug thing. Sort of a dry run mode, but more for checking the scheduler... maybe. Just a thought, we can add it in later if needed.
