See the explanatory comments added to `JobStatus`. The intention is that the new async scheduler will distinguish between jobs that are blocked due to unfinished dependencies (`SCHEDULED`), and those that are pending because there is no availability to run them, despite their dependencies being fulfilled (`QUEUED`). This new state is currently unused.

Also add a short test to prevent potential future bugs from status shorthand name collisions.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
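To make the `SCHEDULED` / `QUEUED` distinction and the shorthand collision check concrete, here is a minimal sketch. It is not the DVSim `JobStatus` definition: the other member names, the first-letter `shorthand` property, and the helpers `waiting_status` and `shorthand_collisions` are assumptions made for illustration.

```python
from collections import defaultdict
from enum import Enum, auto


class JobStatus(Enum):
    """Hypothetical status enum; only SCHEDULED and QUEUED come from the commit message."""

    SCHEDULED = auto()   # blocked: some dependencies have not finished yet
    QUEUED = auto()      # dependencies fulfilled, waiting for a free slot
    DISPATCHED = auto()  # assumed name for a running job
    PASSED = auto()
    FAILED = auto()
    KILLED = auto()

    @property
    def shorthand(self) -> str:
        # Assumption: the shorthand is the first letter of the member name.
        return self.name[0]


def waiting_status(deps_fulfilled: bool) -> JobStatus:
    """Pick the status of a job that has not been dispatched yet."""
    return JobStatus.QUEUED if deps_fulfilled else JobStatus.SCHEDULED


def shorthand_collisions() -> dict:
    """Return every shorthand that is shared by more than one status."""
    seen = defaultdict(list)
    for status in JobStatus:
        seen[status.shorthand].append(status.name)
    return {short: names for short, names in seen.items() if len(names) > 1}
```

A test in the spirit of the one described above would simply assert that `shorthand_collisions()` is empty, so that adding a new status with a clashing first letter fails early.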
This field will be used to inform the new scheduler of which backend it should use to execute a job. Though the plumbing is not there in the rest of DVSim, the intent is to make the scheduler such that it could feasibly be run with multiple backends (e.g. some jobs faked, some jobs on the local machine, some dispatched to various remote clusters).

To support this design, each job spec can now specify that it should be run on a certain backend, with some designated string name. To instead just use the configured default backend (which is the current behaviour, as the current scheduler only supports one backend / `launcher_cls`), this can be set to `None`.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
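The backend selection rule described above could look roughly like the following sketch. The reduced `JobSpec` dataclass, the `resolve_backend` helper and the backend names are hypothetical; only the idea of an optional string backend name that falls back to a default when `None` comes from the commit message.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class JobSpec:
    """Heavily reduced stand-in for the real job spec (assumption)."""

    name: str
    backend: Optional[str] = None  # None means "use the default backend"


def resolve_backend(spec: JobSpec, backends: Mapping[str, object], default_backend: str) -> object:
    """Return the backend object a job should run on."""
    name = spec.backend if spec.backend is not None else default_backend
    try:
        return backends[name]
    except KeyError:
        raise ValueError(f"job {spec.name!r} requests unknown backend {name!r}") from None
```

Failing loudly on an unknown name is a design choice of this sketch; silently falling back to the default would hide typos in job specs.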
For now, this is separated in `async_core.py` - the intention is that it will eventually replace the scheduler in `core.py` when all necessary components for it to work are integrated.

This commit contains the fully async scheduler design. Some notes:
- Everything is now async. The scheduler is no longer tied to a Timer object, nor does it have to manage its print interval and poll frequency. It takes advantage of parallelism via cooperative multitasking as much as possible.
- The scheduler is designed to support multiple different backends (new async versions of launchers). Jobs are dispatched according to their specifications and scheduler parameters.
- The scheduler implements the Observer pattern for various events (start, end, job status change, kill signal), allowing consumers that want to use this functionality (e.g. instrumentation, status printer) to hook into the scheduler, instead of unnecessarily coupling code.
- The previous scheduler only recognized killed jobs when they were reached in the queue and their status was updated. The new design immediately transitively updates jobs to instantly reflect status updates of all jobs when information is known.
- Since the scheduler knows _why_ it is killing the jobs, we attach JobStatusInfo information to give more info in the failure buckets.
- The job DAG is indexed and validated during initialization; dependency cycles are detected and cause an error to be raised.
- Job info is encapsulated by records, keeping state centralized (outside of indexes).
- The scheduler now accepts a prioritization function. It schedules jobs in a heap and schedules according to highest priority. Default prioritization is by weights, but this can be customized.
- The scheduler now has its own separate modifiable parallelism limit.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
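Several of the notes above (cycle detection at initialization, a priority heap, a parallelism limit, observers, and transitive kills) can be illustrated together in a small toy scheduler. This is a sketch under stated assumptions, not the code in `async_core.py`: statuses are plain strings, `deps` must list every job as a key, `run_job` is a hypothetical coroutine returning `True` on success, and `max_parallelism` is assumed to be at least 1.

```python
import asyncio
import heapq
from typing import Awaitable, Callable, Dict, List, Set

Observer = Callable[[str, str], None]  # called with (job name, new status)


def check_acyclic(deps: Dict[str, Set[str]]) -> None:
    """Raise ValueError if the dependency graph has a cycle (Kahn's algorithm)."""
    remaining = {job: set(d) for job, d in deps.items()}
    ready = [job for job, d in remaining.items() if not d]
    visited = 0
    while ready:
        done = ready.pop()
        visited += 1
        for job, d in remaining.items():
            if done in d:
                d.remove(done)
                if not d:
                    ready.append(job)
    if visited != len(remaining):
        raise ValueError("dependency cycle detected")


async def run_all(
    deps: Dict[str, Set[str]],
    weights: Dict[str, int],
    run_job: Callable[[str], Awaitable[bool]],
    max_parallelism: int,
    observers: List[Observer],
) -> Dict[str, str]:
    check_acyclic(deps)  # validate the DAG before running anything
    status = {job: "SCHEDULED" for job in deps}
    heap: list = []  # entries are (-weight, name): highest weight pops first
    running: Dict[asyncio.Future, str] = {}

    def set_status(job: str, new: str) -> None:
        status[job] = new
        for notify in observers:  # observer pattern: consumers hook in here
            notify(job, new)

    def kill_dependents(failed: str) -> None:
        # Transitively kill everything that can no longer run.
        stack = [failed]
        while stack:
            current = stack.pop()
            for job, d in deps.items():
                if current in d and status[job] == "SCHEDULED":
                    set_status(job, "KILLED")
                    stack.append(job)

    def enqueue_ready() -> None:
        for job, d in deps.items():
            if status[job] == "SCHEDULED" and all(status[x] == "PASSED" for x in d):
                set_status(job, "QUEUED")
                heapq.heappush(heap, (-weights.get(job, 1), job))

    enqueue_ready()
    while heap or running:
        # Dispatch queued jobs while there is free capacity.
        while heap and len(running) < max_parallelism:
            _, job = heapq.heappop(heap)
            set_status(job, "DISPATCHED")
            running[asyncio.ensure_future(run_job(job))] = job
        finished, _ = await asyncio.wait(set(running), return_when=asyncio.FIRST_COMPLETED)
        for task in finished:
            job = running.pop(task)
            if task.result():
                set_status(job, "PASSED")
            else:
                set_status(job, "FAILED")
                kill_dependents(job)
        enqueue_ready()
    return status
```

Because a job only becomes `QUEUED` once all of its dependencies have passed, a failure can only ever kill `SCHEDULED` jobs, so the heap never contains stale entries in this sketch. The real scheduler additionally attaches a reason to killed jobs and supports multiple backends, which is left out here.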
machshev
approved these changes
Apr 1, 2026
Collaborator
machshev
left a comment
Nice work @AlexJones0!
Just a small nit.
Comment on lines
+128
to
+132
    self.backends = dict(backends)
    self.default_backend = default_backend
    self.max_parallelism = max_parallelism
    self.priority_fn = priority_fn or self._default_priority
    self.coalesce_window = coalesce_window
Collaborator
Do we need public attributes? This implies they can be mutated externally while the scheduler is running?
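One possible way to address this, sketched under assumptions: keep the state in underscore-prefixed attributes and expose read-only views, with a validated setter only for the value that is meant to be adjustable (the commit message mentions a modifiable parallelism limit). The class name `SchedulerConfigSketch` is hypothetical and this is not what the PR implements.

```python
from types import MappingProxyType
from typing import Mapping


class SchedulerConfigSketch:
    """Hypothetical illustration of private state with read-only accessors."""

    def __init__(self, backends: Mapping[str, object], default_backend: str, max_parallelism: int) -> None:
        self._backends = dict(backends)
        self._default_backend = default_backend
        self._max_parallelism = max_parallelism

    @property
    def backends(self) -> Mapping[str, object]:
        # Read-only view: callers can look up backends but not mutate the dict.
        return MappingProxyType(self._backends)

    @property
    def default_backend(self) -> str:
        return self._default_backend  # no setter, so assignment raises AttributeError

    @property
    def max_parallelism(self) -> int:
        return self._max_parallelism

    @max_parallelism.setter
    def max_parallelism(self, value: int) -> None:
        # The one knob that may change at runtime is validated on assignment.
        if value < 1:
            raise ValueError("max_parallelism must be at least 1")
        self._max_parallelism = value
```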
        """Prioritizes jobs according to their weight. The default prioritization method."""
        return job.spec.weight

    def _build_graph(self, specs: Iterable[JobSpec]) -> None:
Collaborator
For later, but it might be nice to be able to build the graph as a debug thing. Sort of a dry run mode, but more for checking the scheduler... maybe. Just a thought, we can add it in later if needed.
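As a rough idea of what such a dry run could report, the standard library's `graphlib` can already produce a valid execution order from a dependency mapping without running anything. The helper name `dry_run_order` and the job names are made up for this sketch; it does not use the scheduler's own `_build_graph`.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def dry_run_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return one dependency-respecting job order, or raise ValueError on a cycle."""
    try:
        # TopologicalSorter expects a mapping of node -> set of predecessors.
        return list(TopologicalSorter(deps).static_order())
    except CycleError as exc:
        # The detected cycle is the second element of the exception's args.
        raise ValueError(f"dependency cycle: {exc.args[1]}") from exc
```

For example, `dry_run_order({"build": set(), "sim": {"build"}, "cov": {"sim"}})` lists `build` before `sim` and `sim` before `cov`.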
This was referenced Apr 1, 2026
This PR is the thirteenth of a series of PRs to rewrite DVSim's core scheduling functionality (Scheduler, status display, launchers / runtime backends) to use an async design, with key goals of long term maintainability and extensibility.
Edit: I've also just opened #136. This PR should be merged later down the line (after more integration PRs), but is intended for now to show that the new scheduler is passing the existing test suite.
This is probably the largest PR, and contains the entire new async scheduler implementation (i.e. the scheduler rewrite). Note that the code to integrate this new async scheduler is not yet included - I thought about including it to make the async scheduler usable in this PR, but decided that it would probably be a bit too much to review at once, so I'm deferring it to a future PR. Note also that the new scheduler is being merged in stages - the old scheduler, launchers etc. will not be removed until the official transition to use the new scheduler.
Some of the main features/differences to note about the new scheduler:
- Everything is now async. The scheduler is no longer tied to a `Timer` object, nor does it have to manage its print interval and poll frequency. It takes advantage of parallelism via cooperative multitasking as much as possible.

I recommend reading the commit messages for more information.