Integrate the new async scheduler with the base FlowCfg #138
Draft
AlexJones0 wants to merge 15 commits into lowRISC:master
This commit fleshes out the abstract `RuntimeBackend` base class with much of the core functionality needed to implement new runtime backends (beyond the legacy launcher adapter). This mostly takes the form of protected methods that backends can optionally call, built from logic that lived on the `Launcher` base class or was previously duplicated across its various subclasses.

Some key changes to note from the launchers:
- A new `DVSIM_RUN_INTERACTIVE` env var is introduced, intended to replace the `RUN_INTERACTIVE` env var long term, to avoid potential name collisions.
- Errors are raised if an interactive job tries to run on a backend that doesn't support running jobs interactively.
- Log parsing functionality is extracted to a separate object; logs are always lazily loaded so that we don't waste time on jobs that don't need them (passing jobs without any fail or pass patterns).
- Efficiency of the log contents pass/fail regex pattern parsing is improved. Fail patterns are combined into a single regex check, and all regexes are compiled once instead of per-line.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
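The lazy loading and single combined fail regex described above could look roughly like the following. This is a minimal sketch, not the code in this PR: the class name `LazyLogChecker` and its methods are hypothetical, and it assumes the fail patterns do not rely on numbered backreferences (which would be renumbered once the patterns are joined).

```python
import re
from functools import cached_property
from pathlib import Path


class LazyLogChecker:
    """Hypothetical helper: the log is only read when a check needs it."""

    def __init__(self, log_path, fail_patterns, pass_patterns):
        self._log_path = Path(log_path)
        # All fail patterns become one alternation, compiled once up front.
        self._fail_re = (
            re.compile("|".join(f"(?:{p})" for p in fail_patterns))
            if fail_patterns
            else None
        )
        # Pass patterns must each match somewhere, so they stay separate.
        self._pass_res = [re.compile(p) for p in pass_patterns]

    @cached_property
    def lines(self):
        # Evaluated on first access only; later accesses reuse the result.
        return self._log_path.read_text(errors="replace").splitlines()

    def first_failure(self):
        """Return (line number, line) of the first fail match, or None."""
        if self._fail_re is None:
            return None  # no fail patterns: the log is never opened
        for number, line in enumerate(self.lines, start=1):
            if self._fail_re.search(line):
                return number, line
        return None

    def missing_pass_patterns(self):
        """Return the pass patterns that matched no line of the log."""
        return [
            regex.pattern
            for regex in self._pass_res
            if not any(regex.search(line) for line in self.lines)
        ]
```

With empty pattern lists neither method touches `lines`, so a passing job without patterns never pays for reading its log.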
This is the async `RuntimeBackend` replacement of the `LocalLauncher`, which will eventually be removed in favour of this new backend.

Some behavioural differences to note:
- We now await the process after a SIGKILL to be sure it has ended, bounded by a short timeout in case it is blocked at the kernel level.
- We now use psutil to enumerate and kill descendant processes in addition to the created subprocess. This won't catch orphaned processes (that needs e.g. cgroups), but should cover most sane usage.
- The backend does _not_ link the output directories based on status (the `JobSpec.links`, e.g. "passing/", "failed/", "killed/"). The intention is that this detail is not core functionality for either the scheduler or the backends - instead, it will be implemented as an observer on the new async scheduler callbacks when introduced.

By using async subprocesses and launching/killing jobs in batch, we are able to launch jobs in parallel more efficiently via async coroutines. We likewise avoid the need to poll jobs - instead we have an async task awaiting the subprocess' completion, which we then forward to notify the (to be added) scheduler of the job's completion.

Note that interactive jobs are still basically handled synchronously as before - it is assumed that there is only 1 interactive job running at a time.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
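The bounded wait after a SIGKILL can be illustrated with plain `asyncio`. This is a sketch under assumptions: `kill_and_reap` and `demo` are hypothetical names, the 2 second timeout is arbitrary, and the psutil-based descendant handling mentioned above is deliberately left out.

```python
import asyncio
import sys


async def kill_and_reap(proc, timeout=2.0):
    """Kill a subprocess, then wait (bounded) for it to actually exit.

    Returns True if the process was reaped within `timeout` seconds.
    """
    if proc.returncode is None:
        try:
            proc.kill()  # SIGKILL on POSIX
        except ProcessLookupError:
            pass  # the process exited between the check and the kill
    try:
        # Bound the wait: a process stuck in the kernel may never be reaped.
        await asyncio.wait_for(proc.wait(), timeout=timeout)
    except asyncio.TimeoutError:
        return False
    return True


async def demo():
    # A long-running child stands in for a real job.
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "import time; time.sleep(60)"
    )
    reaped = await kill_and_reap(proc)
    return reaped, proc.returncode
```

Because the wait is itself a coroutine, many jobs can be killed in one batch with `asyncio.gather` rather than one after another.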
See the explanatory comments added to JobStatus. The intention is that the new async scheduler will distinguish between jobs that are blocked due to unfinished dependencies (`SCHEDULED`), and those that are pending because there is no availability to run them, despite their dependencies being fulfilled (`QUEUED`). This new state is currently unused. Also add a short test to prevent potential future bugs from status shorthand name collisions. Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
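To make the `SCHEDULED` / `QUEUED` distinction and the collision test concrete, here is a small sketch. Only those two state names come from the commit message; the other members, the first-letter `shorthand` rule, and the helper functions are assumptions for illustration and do not mirror the real `JobStatus` definition.

```python
from enum import Enum, auto


class SketchJobStatus(Enum):
    """Illustrative stand-in for a job status enum."""

    SCHEDULED = auto()   # blocked on unfinished dependencies
    QUEUED = auto()      # dependencies done, waiting for a free slot
    DISPATCHED = auto()  # assumed name for "currently running"
    PASSED = auto()
    FAILED = auto()
    KILLED = auto()

    @property
    def shorthand(self):
        # Assumption: the shorthand is simply the first letter of the name.
        return self.name[0]


def shorthand_collisions(statuses):
    """Map each shorthand used by more than one status to those statuses."""
    seen = {}
    for status in statuses:
        seen.setdefault(status.shorthand, []).append(status.name)
    return {key: names for key, names in seen.items() if len(names) > 1}


def waiting_status(deps_finished, slot_available):
    """Pick the state of a job that has not finished yet."""
    if not deps_finished:
        return SketchJobStatus.SCHEDULED
    if not slot_available:
        return SketchJobStatus.QUEUED
    return SketchJobStatus.DISPATCHED
```

A regression test in this style only needs to assert that `shorthand_collisions` returns an empty mapping for the enum.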
This field will be used to inform the new scheduler of which backend it should use to execute a job. Though the plumbing is not there in the rest of DVSim, the intent is to make the scheduler such that it could feasibly be run with multiple backends (e.g. some jobs faked, some jobs on the local machine, some dispatched to various remote clusters). To support this design, each job spec can now specify that it should be run on a certain backend, with some designated string name. To instead just use the configured default backend (which is the current behaviour, as the current scheduler only supports one backend / `launcher_cls`), this can be set to `None`. Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
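The fallback to the default backend could be resolved along these lines. The field name `backend`, the `SketchJobSpec` class and `resolve_backend` are all hypothetical; the commit message only states that a string name selects a backend and `None` selects the configured default.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchJobSpec:
    """Illustrative job spec; `backend` is an assumed field name."""

    name: str
    backend: Optional[str] = None  # None means "use the default backend"


def resolve_backend(spec, backends, default):
    """Return the backend object a job should run on.

    `backends` maps backend names to backend objects and `default` is the
    name of the configured default backend.
    """
    key = default if spec.backend is None else spec.backend
    try:
        return backends[key]
    except KeyError:
        raise ValueError(
            f"Job {spec.name!r} requests unknown backend {key!r}"
        ) from None
```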
For now, this is separated in `async_core.py` - the intention is that it will eventually replace the scheduler in `core.py` when all necessary components for it to work are integrated. This commit contains the fully async scheduler design.

Some notes:
- Everything is now async. The scheduler is no longer tied to a Timer object, nor does it have to manage its print interval and poll frequency. It takes advantage of parallelism via cooperative multitasking as much as possible.
- The scheduler is designed to support multiple different backends (new async versions of launchers). Jobs are dispatched according to their specifications and scheduler parameters.
- The scheduler implements the Observer pattern for various events (start, end, job status change, kill signal), allowing consumers that want to use this functionality (e.g. instrumentation, status printer) to hook into the scheduler, instead of unnecessarily coupling code.
- The previous scheduler only recognized killed jobs when they were reached in the queue and their status was updated. The new design transitively updates all affected jobs as soon as the information is known.
- Since the scheduler knows _why_ it is killing the jobs, we attach JobStatusInfo information to give more info in the failure buckets.
- The job DAG is indexed and validated during initialization; dependency cycles are detected and cause an error to be raised.
- Job info is encapsulated by records, keeping state centralized (outside of indexes).
- The scheduler now accepts a prioritization function. It keeps jobs in a heap and dispatches them in order of highest priority. Default prioritization is by weights, but this can be customized.
- The scheduler now has its own modifiable parallelism limit, separate from each individual backend's parallelism limit.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
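Two of the points above, DAG validation with cycle detection and heap-based prioritization, can be sketched as follows. This is not the scheduler's implementation: it uses Kahn's algorithm as one common way to detect cycles, and the names `validate_dag` and `run_order` are hypothetical.

```python
import heapq
from collections import defaultdict


def validate_dag(deps):
    """Index and validate a job graph.

    `deps` maps each job name to the names of the jobs it depends on.
    Returns a mapping from each job to its dependents. Raises ValueError
    for unknown dependencies or dependency cycles.
    """
    dependents = defaultdict(list)
    remaining = {}
    for job, needs in deps.items():
        needs = set(needs)
        unknown = needs - deps.keys()
        if unknown:
            raise ValueError(f"{job!r} depends on unknown jobs {sorted(unknown)}")
        remaining[job] = len(needs)
        for dep in needs:
            dependents[dep].append(job)

    # Kahn's algorithm: repeatedly retire jobs with no unmet dependencies.
    ready = [job for job, count in remaining.items() if count == 0]
    visited = 0
    while ready:
        job = ready.pop()
        visited += 1
        for child in dependents[job]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)
    if visited != len(deps):
        raise ValueError("dependency cycle detected")
    return dict(dependents)


def run_order(ready_jobs, priority):
    """Yield ready jobs from highest to lowest priority using a heap."""
    # heapq is a min-heap, so negate the priority; the index breaks ties
    # in submission order and keeps job objects out of the comparison.
    heap = [(-priority(job), index, job) for index, job in enumerate(ready_jobs)]
    heapq.heapify(heap)
    while heap:
        yield heapq.heappop(heap)[2]
```

Passing a weight lookup as `priority` gives weight-based ordering, and any other callable gives a custom policy.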
Add the new async `StatusPrinter` abstract base class, intended to replace the original for use with the new async scheduler. The original will not be removed until the scheduler has been switched. This is now an abstract base class, rather than an empty class used for interactive sessions - since the status printer will live outside the new scheduler, it becomes much easier to just _not connect_ any status printer hooks during interactive mode.

Some notable changes and overhauls:
- Status printing now runs entirely independently of the scheduler. If a print interval > 0 is configured, then the status printer now runs as a loop with async awaits such that the timing logic is entirely separate from the scheduler, maintained by cooperative multitasking.
- As new functionality, if a print interval of 0 is configured, we instead activate a synchronous "event/update-driven mode" where every single status update is printed. This might be useful for e.g. the TTY printer where you may want to capture exact times of all updates.
- As a result of observing the scheduler, the status printer maintains its own stateful tracking of job information.
- Field alignment is calculated from the initial job information and data is appropriately justified to clean up the output tables.
- The ability to pause the status bar is introduced to help (later) deal with issues in the EnlightenStatusBar, where its terminal interactivity can be broken and cause hangs under heavy load.
- General refactoring: the status header and fields are no longer hardcoded and are instead derived from the JobStatus enum.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
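The two printing modes described above (interval loop versus event-driven at interval 0) can be sketched like this. The class `SketchStatusPrinter`, its method names and its output format are hypothetical and much simpler than the real `StatusPrinter` interface.

```python
import asyncio
from collections import Counter


class SketchStatusPrinter:
    """Illustrative printer that keeps its own view of job statuses."""

    def __init__(self, interval, emit=print):
        self.interval = interval
        self.emit = emit
        self._status_of = {}  # job name -> latest known status
        self._task = None

    def render(self):
        counts = Counter(self._status_of.values())
        return ", ".join(f"{name}: {n}" for name, n in sorted(counts.items()))

    def on_status_change(self, job, status):
        """Observer hook: called for every job status update."""
        self._status_of[job] = status
        if self.interval == 0:
            # Event-driven mode: print every single update immediately.
            self.emit(self.render())

    async def _loop(self):
        # Interval mode: timing lives here, not in the scheduler.
        while True:
            await asyncio.sleep(self.interval)
            self.emit(self.render())

    def start(self):
        """Must be called from within a running event loop."""
        if self.interval > 0:
            self._task = asyncio.create_task(self._loop())

    async def stop(self):
        if self._task is not None:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass
        self.emit(self.render())  # always print the final state once
```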
Extract time-related utilities (the `hms` functionality of the `Timer` and the two timestamp formats from `fs.py`) into a new `time.py` utility module. The intention is to use the `hms` functionality inside the new async status printers and to eventually remove `timer.py` completely when the old scheduler is removed. Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
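As an illustration of the kind of helper being moved, an hours/minutes/seconds formatter might look like the sketch below. The exact output format of the real `hms` function is not given here, so the zero-padded `HH:MM:SS` layout and the truncation of fractional seconds are assumptions.

```python
def hms(seconds):
    """Format a non-negative duration in seconds as HH:MM:SS (assumed format)."""
    total = int(seconds)  # fractional seconds are dropped
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"
```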
This commit ports the `TtyStatusPrinter` to use the new async interface, extending the `StatusPrinter` interface introduced previously. The extended logic remains mostly the same as the original code, with some small refactors and tweaks for aesthetics. Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
This commit ports the original `EnlightenStatusPrinter` to the new async interface, extending the `StatusPrinter` abstract base class.

The logic is mostly the same, with a few important caveats to note:
- Since the new interface allows float (and hence sub-second) print intervals, a warning is introduced for intervals less than Enlighten's internal minimum delta value, which will cause updates to be coalesced and potentially lost at points if not refreshed. We could also lower the `min_delta` to match the print interval, but experimentation shows that this introduces performance concerns and is best left as is.
- Because of the above, logic is added to refresh (flush) the status bar when a target is done, to ensure it locks the final time correctly.
- An occasional bug was encountered on using Ctrl-C to gracefully exit where Enlighten's `StatusBar.update` would hang indefinitely. This occurred during the terminal protocol used by the underlying Blessed library, which queried the terminal for its size and expected a response. Under heavy loads, particularly when a large number of processes are killed due to an exit signal, the terminal response might not be received, causing Blessed to hang on a `getch` call. To prevent this, the `pause` interface was introduced to the base `StatusPrinter` which is used for that purpose here.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
Add a new global runtime backend registry to mirror and extend the existing launcher factory. This serves as a registry for built-in backends (and legacy launchers), and provides helper methods for registering custom backends/launchers to support plugin-like extension. When DVSim fully moves to use the new async scheduler, the launcher factory that this replaces will be removed. Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
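A plugin-style registry of this kind is often a thin wrapper around a name-to-factory mapping. The sketch below is an assumption about the general shape only: `SketchBackendRegistry` and its `register`, `create` and `names` methods are hypothetical and are not the helper methods added in this commit.

```python
class SketchBackendRegistry:
    """Illustrative registry mapping backend names to factories."""

    def __init__(self):
        self._factories = {}

    def register(self, name, factory=None, *, replace=False):
        """Register a factory directly, or use as a class decorator."""

        def _add(target):
            if name in self._factories and not replace:
                raise ValueError(f"backend {name!r} is already registered")
            self._factories[name] = target
            return target

        return _add if factory is None else _add(factory)

    def create(self, name, **kwargs):
        """Instantiate the backend registered under `name`."""
        try:
            factory = self._factories[name]
        except KeyError:
            raise ValueError(f"unknown backend {name!r}") from None
        return factory(**kwargs)

    def names(self):
        return sorted(self._factories)
```

A custom backend could then be added with `@registry.register("my_backend")` above its class definition, without touching the built-in entries.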
In the `CovReport.post_finish` deploy callback, check if the coverage report actually exists and skip / do nothing if not. This is needed as a temporary workaround for the fake launcher, which does not generate this kind of report, but does now call `post_finish` in its backend due to this being implemented on the base backend. Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
Allow you to run DVSim using the new async scheduler by using a switch -
either modify the code to make the conditional true, or set e.g.
`EXPERIMENTAL_ENABLE_ASYNC_SCHEDULER=1 dvsim ...`
before your dvsim command to enable the new scheduler.
Note: this currently does not have the status directory linking
(JobSpec.links) hooked up, nor the status printing (the StatusPrinter
logic, not the logs), or the instrumentation. It just performs the core
scheduling functionality.
Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
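An environment variable switch like the one above can be checked as in this sketch. Only the variable name and the value `1` come from the commit message; the function name and the choice to treat every other value as disabled are assumptions.

```python
import os


def async_scheduler_enabled(environ=None):
    """Return True when the experimental scheduler switch is set to "1"."""
    env = os.environ if environ is None else environ
    # Assumption: only the exact value "1" enables the new scheduler.
    return env.get("EXPERIMENTAL_ENABLE_ASYNC_SCHEDULER", "").strip() == "1"
```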
This is easily added outside of the scheduler itself via the new async scheduler's callbacks. Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
This replicates the scratch directory symlinking behaviour that is implemented independently by each of the launcher classes. It didn't make much sense for this to be on the launchers - the scheduler should be in control of the overall job status, not the launcher, which should just report pass/fail/killed. It also doesn't make sense for it to live in the scheduler - ideally we want to keep the core logic separate. Instead, make these symlink directories via a new minimal `LogManager` which acts as an observer to scheduler job status changes and creates soft links as before. Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
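An observer that maintains per-status symlinks could be structured as in the sketch below. It is not the `LogManager` from this commit: the class name, the callback signature and the directory layout are assumptions, and it presumes a platform where creating symlinks is permitted.

```python
from pathlib import Path


class SketchLogManager:
    """Illustrative observer: keeps one symlink per job under a status dir."""

    def __init__(self, links):
        # links: status name -> directory collecting links for that status
        self._links = {status: Path(path) for status, path in links.items()}
        self._current = {}  # job name -> symlink currently pointing at it

    def on_status_change(self, job_name, job_dir, status):
        """Scheduler callback: move the job's link to the new status dir."""
        old = self._current.pop(job_name, None)
        if old is not None and old.is_symlink():
            old.unlink()  # the job has left its previous status
        target_dir = self._links.get(status)
        if target_dir is None:
            return  # no link directory configured for this status
        target_dir.mkdir(parents=True, exist_ok=True)
        link = target_dir / job_name
        if link.is_symlink():
            link.unlink()  # replace a stale link from an earlier run
        link.symlink_to(Path(job_dir).resolve(), target_is_directory=True)
        self._current[job_name] = link
```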
c3de780 to f9a2a54
This enables the new async status printers to be used with the new scheduler. The connections are via the callbacks defined on the scheduler, independent from the scheduler itself. Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
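Connecting consumers through scheduler callbacks follows the usual observer shape. The sketch below shows that shape only: `SketchScheduler`, its event names and its trivial `run` loop are hypothetical and stand in for the real scheduler's callback interface.

```python
import asyncio
import inspect
from collections import defaultdict


class SketchScheduler:
    """Illustrative scheduler exposing observer hooks."""

    def __init__(self):
        self._observers = defaultdict(list)

    def subscribe(self, event, callback):
        """Register a sync or async callback for an event name."""
        self._observers[event].append(callback)

    async def _notify(self, event, *args):
        for callback in self._observers[event]:
            result = callback(*args)
            if inspect.isawaitable(result):
                await result  # async observers are awaited in turn

    async def run(self, jobs):
        # Stand-in for real scheduling: every job is dispatched and passes.
        await self._notify("start", list(jobs))
        for job in jobs:
            await self._notify("status", job, "DISPATCHED")
            await asyncio.sleep(0)  # yield control, as real work would
            await self._notify("status", job, "PASSED")
        await self._notify("end")
```

A status printer, instrumentation or a log manager would each call `subscribe` for the events they care about, so the scheduler never needs to import them.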
f9a2a54 to 8de6347
Note: this PR is currently a draft as it depends on #134, #135 and #137, which have not yet been merged. The first 2 commits are from #134, the 3rd-5th commits are from #135, and the 6th-9th commits are from #137; these can be safely ignored. Only the last 6 commits are relevant. It is otherwise ready to review.
This PR is the sixteenth of a series of PRs to rewrite DVSim's core scheduling functionality (Scheduler, status display, launchers / runtime backends) to use an async design, with key goals of long term maintainability and extensibility.
This PR implements the logic to integrate the async scheduler with the base `FlowCfg` so that it can actually be used to run jobs in regular DVSim operation. It also hooks up all the relevant scheduler observers - the async status printers, instrumentation, and a log manager (implementing logic previously in the `Launcher` objects) - so that the full functionality of the previous scheduler is implemented, without unnecessary coupling of code.

Since the intention is to migrate soon, for now the use of this scheduler is placed behind an `EXPERIMENTAL_ENABLE_ASYNC_SCHEDULER=1` environment variable. This should let it be tested locally before the final change is made. After the change is made, the old launchers (the ones that have been rewritten), scheduler and status printer can all be removed.

See the commit messages for more information.