Separate building/scheduling from storage#15542
Open
lisanna-dettwyler wants to merge 1 commit intoNixOS:masterfrom
Open
Separate building/scheduling from storage#15542lisanna-dettwyler wants to merge 1 commit intoNixOS:masterfrom
lisanna-dettwyler wants to merge 1 commit intoNixOS:masterfrom
Conversation
2d06439 to
5f79cd4
Compare
Ericson2314
reviewed
Mar 23, 2026
xokdvium
reviewed
Mar 23, 2026
5f79cd4 to
73a5876
Compare
Ericson2314
reviewed
Mar 23, 2026
Ericson2314
reviewed
Mar 23, 2026
73a5876 to
7e6a21a
Compare
7e6a21a to
2c4565d
Compare
This is the major first step of NixOS#5025. Motivation ---------- Right now, there is a bit of conceptual tension between `--store` and `--builders`: - With `--store`, it is very convenient to think that the store knows how to build. One specifies a store, and gets a different method of building (local, some sort of remote) and scheduling (the remote store can take an entire derivation graph, multiple jobs) accordingly. - With `--builders` one has a local scheduler. Stores either act as a passive "workbench" for building (the local case) or we give them a single job (a single ready-to-build derivation) at a time. For historical reasons, this doesn't even use the store interface, except on the "other side" of the build hook. Issue NixOS#5025 is about using the store interface for `--builders`. In this case, we want to invert the relationship between `Store` and `Worker`. We have no use in this case for various `Store` methods creating a `Worker` behind the scenes, because we already have our `Worker`: - `LocalStore`s don't need any build method at all, we can just have our `Worker` directly use the local store, as it does with the default `worker.store` today. - remote stores supporting building (`ssh://` and `ssh-ng://`) are only fed a single job at a time, their remote-side scheduling being overkill for the task at hand. But we can't just delete the building methods of `--store` that we don't need anymore, because that would break `--store` building. We need to support both cases, where some stores effectively build/schedule and `Worker` can also own/borrow stores to be a single, unified scheduler. This change ----------- The way we satisfy both goals is by: - Pulling the building methods out of `Store` into a new `Builder` class - Having some stores also give/implement `Builder` The separation of `Store` vs `Builder` works for the `--builders` use-case, and the project of making that leverage `Store` and other C++ interfaces directly without indirecting through build hook or other ad-hoc implementation swapping methods. Here's how: as opposed to default `Store::` method implementations creating a `Worker` on the fly, `Worker` will implement `Builder`, and those methods, now on `Builder`, will become `Worker`'s own implementation. Local building To implement this conceptual switch, the methods that directly delegated to the worker are now instead ripped off `Store` and put in the new `Builder` class. (For example, `build/entry-points.cc` now contains all `Worker` methods (virtual method impls of `Builder`) and not `Store` method impls.) `Worker` also now owns `ref<...> srcStore, destStore`, directly out of issue NixOS#5025. (`Store & store, evalStore` are kept as aliasing references to reduce churn.) (`Worker` should be renamed to `LocalBuilder`, since it is the local build scheduler, and additionally knows how to build in local stores.) Remote building What about the `--store` case? The remote stores have a new method to provide a `Builder` of their choice given an `evalStore`. (This reflects the fact that `Builder` no longer has `evalStore` parameters on its methods.) That new method is a new interface `BuildStore` which `RemoteStore` and `LegacySSHStore` implement. Each one has an unexposed `Builder` implementation which will just do everything over RPC, like today. Putting it all together Introduce: - `getLocalBuilder`, a freestanding function to make a `Worker`. (Really, the details of `Worker` should be considered private to the `build/*.cc` files) - `getDefaultBuilder`, a freestanding function which will call `BuildStore::getBuilder` for `BuildStore`s, and use `getLocalBuilder` otherwise. Future work ----------- Issue NixOS#1221 The next step of the NixOS#5025 saga is issue NixOS#1221. To solve that issue, `Worker` will not use the build hook, but instead work via C++. In particular, it will do this: - if the builder is a `BuildStore`, use an appropriate method (possibly yet to be created) on the remote building store's (remote) `Builder`. - if the builder is a `LocalStore` `Worker` should *not* create another `Worker` (as the `build-remote` program would do today) but instead directly manage building in that local store, so we avoid **n** `Worker` instances scheduling independently (which is stupid discoordination). - (Otherwise fail, which matches what happens today, actually, just in fewer steps.) Simplifying the RPC case I also suspect that longer term, those stores will just implement `Builder` directly, as `evalStore` doesn't really make sense for RPC endpoints when the remote side has no idea what the caller is doing with other stores. We won't need `BuildStore` anymore then. Other improvements ------------------ Recursive Nix Speaking of avoiding redundant schedulers: `RestrictedStore`, when it used to override the store building methods, would spin up a new `Worker` for each recursive Nix build call. This is again bad --- we should have a central scheduler that takes in dynamic jobs, same for recursive Nix and dynamic derivations. Now this is *almost*, but not quite, fixed. Three changes were made: - `RestrictedBuilder` was split out from `RestrictedStore` to wrap the build methods. - `processConnection` takes an optional `Builder` parameter, using it directly rather than spinning one up with `getDefaultBuilder`. - `DerivationBuilder` took a callback to process the connection for recursive Nix, so the caller could provide the `processConnection` call with the ambient worker in order to reuse it. This would have solved the redundant scheduler problem very nicely! This unfortunately deadlocked, so instead the caller explicitly creates a fresh worker (as before, but not hidden beneath a gazillion abstractions) with a TODO saying the deadlock should be fixed and this should not be done. `LegacySSHStore` fix As a final note, the old `LegacySSHStore` did not override `buildPathsWithResults`, which meant that when specifying an `ssh://` store, the local scheduler was being erroneously used for some commands. Now, `LegacySSHBuilder::buildPathsWithResults` uses a single `buildPathsRaw` call (which sends the serve protocol `BuildPaths` command and returns `std::variant<BuildResultSuccessStatus, BuildError>` with the error message already read from the wire), and then queries realisations to reconstruct the `BuildResult`s --- code similar to the old fallback code for `ssh-ng://`. Use std::shared_ptr for processConnection's builder This avoids the need to pass a raw pointer. Signed-off-by: Lisanna Dettwyler <lisanna.dettwyler@gmail.com> Rename BuildStore to BuildStore Signed-off-by: Lisanna Dettwyler <lisanna.dettwyler@gmail.com> Co-authored-by: John Ericson <John.Ericson@Obsidian.Systems>
2c4565d to
e18f92f
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Separate building/scheduling from storage
This is the major first step of #5025.
Motivation
Right now, there is a bit of conceptual tension between
--storeand--builders:With
--store, it is very convenient to think that the store knowshow to build. One specifies a store, and gets a different method of
building (local, some sort of remote) and scheduling (the remote store
can take an entire derivation graph, multiple jobs) accordingly.
With
--buildersone has a local scheduler. Stores either act as apassive "workbench" for building (the local case) or we give them a
single job (a single ready-to-build derivation) at a time.
For historical reasons, this doesn't even use the store interface,
except on the "other side" of the build hook.
Issue #5025 is about using the store interface for
--builders. In thiscase, we want to invert the relationship between
StoreandWorker.We have no use in this case for various
Storemethods creating aWorkerbehind the scenes, because we already have ourWorker:LocalStores don't need any build method at all, we can just have ourWorkerdirectly use the local store, as it does with the defaultworker.storetoday.remote stores supporting building (
ssh://andssh-ng://) are onlyfed a single job at a time, their remote-side scheduling being
overkill for the task at hand.
But we can't just delete the building methods of
--storethat we don'tneed anymore, because that would break
--storebuilding. We need tosupport both cases, where some stores effectively build/schedule and
Workercan also own/borrow stores to be a single, unified scheduler.This change
The way we satisfy both goals is by:
Pulling the building methods out of
Storeinto a newBuilderclassHaving some stores also give/implement
BuilderThe separation of
StorevsBuilderworks for the--buildersuse-case, and the project of making that leverage
Storeand other C++interfaces directly without indirecting through build hook or other
ad-hoc implementation swapping methods. Here's how: as opposed to
default
Store::method implementations creating aWorkeron the fly,Workerwill implementBuilder, and those methods, now onBuilder,will become
Worker's own implementation.Local building
To implement this conceptual switch, the methods that directly delegated
to the worker are now instead ripped off
Storeand put in the newBuilderclass. (For example,build/entry-points.ccnow contains allWorkermethods (virtual method impls ofBuilder) and notStoremethod impls.)
Workeralso now ownsref<...> srcStore, destStore, directly out ofissue #5025. (
Store & store, evalStoreare kept as aliasing referencesto reduce churn.)
(
Workershould be renamed toLocalBuilder, since it is the localbuild scheduler, and additionally knows how to build in local stores.)
Remote building
What about the
--storecase? The remote stores have a new method toprovide a
Builderof their choice given anevalStore. (This reflectsthe fact that
Builderno longer hasevalStoreparameters on itsmethods.) That new method is a new interface
RemoteBuildStorewhichRemoteStoreandLegacySSHStoreimplement. Each one has an unexposedBuilderimplementation which will just do everything over RPC, liketoday.
Putting it all together
Introduce:
getLocalBuilder, a freestanding function to make aWorker.(Really, the details of
Workershould be considered private to thebuild/*.ccfiles)getDefaultBuilder, a freestanding function which will callRemoteBuildStore::getBuilderforRemoteBuildStores, and usegetLocalBuilderotherwise.Future work
Issue #1221
The next step of the #5025 saga is issue #1221. To solve that issue,
Workerwill not use the build hook, but instead work via C++. In particular,it will do this:
if the builder is a
RemoteBuildStore, use an appropriate method (possiblyyet to be created) on the remote building store's (remote)
Builder.if the builder is a
LocalStoreWorkershould not create anotherWorker(as thebuild-remoteprogram would do today) but instead directlymanage building in that local store, so we avoid n
Workerinstancesscheduling independently (which is stupid discoordination).
(Otherwise fail, which matches what happens today, actually, just in fewer
steps.)
Simplifying the RPC case
I also suspect that longer term, those stores will just implement
Builderdirectly, asevalStoredoesn't really make sense for RPCendpoints when the remote side has no idea what the caller is doing with
other stores. We won't need
RemoteBuildStoreanymore then.Other improvements
Recursive Nix
Speaking of avoiding redundant schedulers:
RestrictedStore, when itused to override the store building methods, would spin up a new
Workerfor each recursive Nix build call. This is again bad --- weshould have a central scheduler that takes in dynamic jobs, same for
recursive Nix and dynamic derivations. Now this is almost, but not
quite, fixed. Three changes were made:
RestrictedBuilderwas split out fromRestrictedStoreto wrap thebuild methods.
processConnectiontakes an optionalBuilderparameter, using itdirectly rather than spinning one up with
getDefaultBuilder.DerivationBuildertook a callback to process the connection forrecursive Nix, so the caller could provide the
processConnectioncall with the ambient worker in order to reuse it.
This would have solved the redundant scheduler problem very nicely!
This unfortunately deadlocked, so instead the caller explicitly creates
a fresh worker (as before, but not hidden beneath a gazillion
abstractions) with a TODO saying the deadlock should be fixed and this
should not be done.
LegacySSHStorefixAs a final note, the old
LegacySSHStoredid not overridebuildPathsWithResults, which meant that when specifying anssh://store, the local scheduler was being erroneously used for some commands.
Now,
LegacySSHBuilder::buildPathsWithResultsuses a singlebuildPathsRawcall (which sends the serve protocolBuildPathscommand and returns
std::variant<BuildResultSuccessStatus, BuildError>with the error message already read from the wire), and then queries
realisations to reconstruct the
BuildResults --- code similar to theold fallback code for
ssh-ng://.Add 👍 to pull requests you find important.
The Nix maintainer team uses a GitHub project board to schedule and track reviews.