Commit e2fc94c

picnoir authored and jfroche committed
feat: nix/doc document Nix CI
- Document how the various components (github_matrix.py / GitHub) fit together.
- Document workarounds for the recurring incidents.
1 parent 70368d2 commit e2fc94c

4 files changed

Lines changed: 98 additions & 3 deletions


nix/docs/README.md

Lines changed: 4 additions & 0 deletions
@@ -41,6 +41,10 @@ learn how to play with `postgres` in the [build guide](./build-postgres.md).
 - **[Testing PG Upgrade Scripts](./testing-pg-upgrade-scripts.md)** - Testing PostgreSQL upgrades
 - **[Docker Image testing](./docker-testing.md)** - How to test the docker images against the pg_regress test suite.
 
+## CI
+
+- **[Nix Build Matrix](./nix-build-matrix-ci.md)** - Understand how the CI Nix build matrix works
+
 ## Reference
 
 - **[References](./references.md)** - Useful links and resources

nix/docs/nix-build-matrix-ci.md

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
# Nix Matrix CI

## Incident Q&A

Something broke the Nix CI and you need a quick-and-dirty fix to get unblocked as fast as possible? Follow these guides.

### **Q:** A test is failing; how do I ignore it and generate an AMI image anyway?

**A:** You can adopt the nuclear approach and generate the AMI image regardless of the test outcome. To do that, remove the three conditions checking the `nix-build-checks` result in the `if` clause of the `run-testinfra` step in the `.github/workflows/nix-build.yml` file.

That is, remove the following conditions:

```
(needs.nix-build-checks-aarch64-linux.result == 'skipped' || needs.nix-build-checks-aarch64-linux.result == 'success')
(needs.nix-build-checks-aarch64-darwin.result == 'skipped' || needs.nix-build-checks-aarch64-darwin.result == 'success')
(needs.nix-build-checks-x86_64-linux.result == 'skipped' || needs.nix-build-checks-x86_64-linux.result == 'success')
```

Note: the merge queue check will still block the PR from getting merged to develop.

### **Q:** A hosted runner is down; how do I reschedule a job somewhere else?

**A:** Edit the `BUILD_RUNNER_MAP` dictionary in the `github_matrix.py` script and change the `labels` entry to match one of the still-functional GitHub runners.

You can see the available runners and their associated labels on [this page](https://github.com/supabase/postgres/actions/runners?tab=self-hosted). Note: the Blacksmith runners are considered "self-hosted" by GitHub.
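The edit above can be sketched as follows. This is an illustration only: the dictionary shape, keys, and label values here are assumptions, not the actual contents of `github_matrix.py`.

```python
# Hypothetical sketch of BUILD_RUNNER_MAP in github_matrix.py.
# Keys, nested fields, and label values are illustrative assumptions.
BUILD_RUNNER_MAP = {
    "aarch64-linux": {"labels": ["blacksmith-32vcpu-ubuntu-2404-arm"]},
    "x86_64-linux": {"labels": ["blacksmith-32vcpu-ubuntu-2404"]},
    "aarch64-darwin": {"labels": ["self-hosted", "macOS"]},
}

# If, say, the aarch64-linux runner is down, point its `labels` entry at a
# still-functional runner picked from the runners page linked above:
BUILD_RUNNER_MAP["aarch64-linux"]["labels"] = ["self-hosted", "linux", "arm64"]

print(BUILD_RUNNER_MAP["aarch64-linux"]["labels"])
```

Any job whose matrix entry resolves through that architecture will then be scheduled on the replacement runner.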
### **Q:** The eval step is OOM-ing; what should I do?

**A:** The evaluation can be quite costly memory-wise. [nix-eval-jobs](https://github.com/NixOS/nix-eval-jobs) spins up multiple Nix evaluations in parallel to speed things up. The tradeoff is increased memory consumption compared to a single-process eval.

An easy way to work around an eval OOM issue is to reduce the number of parallel evals by overriding the `--nb-eval-jobs-workers` flag in the `github_matrix` call in the `.github/workflows/nix-eval.yml` file.

That is, replace

```terminal
nix run --accept-flake-config .\#github-matrix -- checks legacyPackages
```

with:

```terminal
nix run --accept-flake-config .\#github-matrix -- --nb-eval-jobs-workers 30 checks legacyPackages
```

Note: by default, `github_matrix.py` will spin up one eval instance per CPU. For a `blacksmith-32vcpu-ubuntu-2404` worker, that means 32 Nix eval instances.

## CI Walkthrough

The Nix artifacts are built by the `Nix CI` workflow defined in the `.github/workflows/nix-build.yml` file.

It is performed in four steps, each depending on the previous one.

### Step 1: Eval

Conceptually, this workflow evaluates the `legacyPackages` and `checks` flake outputs using [nix-eval-jobs](https://github.com/NixOS/nix-eval-jobs). This step produces a JSON map containing the jobs to build/check for each architecture. That JSON map is later consumed by the subsequent build and check steps.
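The exact schema of that JSON map is defined by the script; as a rough illustration (attribute paths and field names here are assumptions, not the actual output format), its shape might look like:

```python
# Illustrative shape of the per-architecture job map emitted by the eval
# step. Attribute paths and field names are assumptions for illustration.
matrix = {
    "x86_64-linux": [
        {"attr": "checks.x86_64-linux.postgres", "runner": "blacksmith-32vcpu-ubuntu-2404"},
    ],
    "aarch64-darwin": [
        {"attr": "legacyPackages.aarch64-darwin.psql", "runner": "self-hosted"},
    ],
}

# Downstream build/check steps iterate one architecture at a time:
for system, jobs in matrix.items():
    for job in jobs:
        print(f"{system}: build {job['attr']} on {job['runner']}")
```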
Implementation-wise, most of the code lives in the `/nix/packages/github-matrix/github_matrix.py` Python script. The script starts an instance of `nix-eval-jobs` and parses its output. Each parsed job is associated with a builder tag using the following precedence:

1. KVM packages -> self-hosted runners
2. Large packages on Linux -> 32vcpu ephemeral runners
3. Darwin packages -> self-hosted runners
4. Default -> ephemeral runners

KVM packages and large packages are identified by the `kvm` and `big-parallel` Nix attributes, respectively.
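The precedence above could be sketched like this; the function, field names, and returned labels are illustrative, not the script's actual code:

```python
def pick_runner(system: str, features: list[str]) -> str:
    """Illustrative version of the builder-tag precedence; the real logic
    lives in github_matrix.py and may differ in names and details."""
    if "kvm" in features:
        return "self-hosted-kvm"      # 1. KVM packages
    if "big-parallel" in features and system.endswith("-linux"):
        return "32vcpu-ephemeral"     # 2. large packages on Linux
    if system.endswith("-darwin"):
        return "self-hosted-darwin"   # 3. Darwin packages
    return "ephemeral"                # 4. default

print(pick_runner("x86_64-linux", ["big-parallel"]))  # → 32vcpu-ephemeral
```

Note that the ordering matters: a KVM-requiring package that is also large still goes to a self-hosted runner.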
GHA-wise, `.github/workflows/nix-eval.yml` is called by the `nix-build.yml` workflow. `github_matrix.py` is instantiated in the `Generate Nix Matrix` step through a `nix run` call. The resulting JSON map is stored in the workflow output and later used by the subsequent steps.

### Step 2: Build

This step is in charge of building the various Nix packages. Build matrices are instantiated for each system architecture.

Implementation-wise, this step is less complex than the eval one. Most of the magic lies in the machine selection: the previous step attached an instance label to each job indicating which kind of GitHub runner it should be executed on.

The actual build step is a simple `nix build ${job}` invocation. The result of this build is pushed to the `nix-postgres-artifacts` S3 cache. This step is instantiated three times, once for each of the supported architectures: aarch64-darwin, aarch64-linux, and x86_64-linux.
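Conceptually, each build job boils down to two commands. A minimal sketch, where the exact flag set and the attribute name are assumptions beyond what the text states:

```python
# Hedged sketch of a per-job build step: build one flake attribute, then
# push the result to the nix-postgres-artifacts S3 cache. The flag set and
# attribute names are assumptions, not the workflow's exact invocation.
def build_command(job_attr: str) -> list[str]:
    return ["nix", "build", f".#{job_attr}", "--accept-flake-config"]

def push_command(job_attr: str) -> list[str]:
    return ["nix", "copy", "--to", "s3://nix-postgres-artifacts", f".#{job_attr}"]

print(" ".join(build_command("legacyPackages.x86_64-linux.psql")))
```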
### Step 3: Check

This step again uses the JSON generated by the evaluation step to run various automated tests. Some of those require virtualization and are run on the self-hosted runners able to perform KVM virtualization.

Implementation-wise, this step is very similar to the previous one: a matrix job instantiated once per target architecture, "just" running on a different set of Nix jobs. These tests assume the various plugins have been built and are part of the Nix cache.

### Step 4: Image Build

The last step builds AMI images from the artifacts generated during step 2, using the `nix/packages/build-ami.nix` script to generate an AMI image based on Ubuntu Noble. The generation of the image is done in two steps.

nix/mkdocs.yml

Lines changed: 2 additions & 0 deletions
@@ -38,6 +38,8 @@ nav:
   - Testing PG Upgrade Scripts: testing-pg-upgrade-scripts.md
   - Test Docker Images: docker-testing.md
   - NixOS Tests on macOS: nixos-tests-on-macos.md
+  - CI:
+    - Nix Build Matrix: nix-build-matrix-ci.md
   - References: references.md
 
 validation:

nix/packages/github-matrix/github_matrix.py

Lines changed: 8 additions & 3 deletions
@@ -290,15 +290,20 @@ def main() -> None:
         type=int,
         help="Maximum memory per eval worker in MiB. Defaults to 3072 (3 GiB).",
     )
+    parser.add_argument(
+        "-j",
+        "--nb-eval-jobs-workers",
+        default=os.cpu_count() or 1,
+        type=int,
+        help="Number of parallel eval jobs. Defaults to ncpus.",
+    )
     parser.add_argument(
         "flake_outputs", nargs="+", help="Nix flake outputs to evaluate"
     )
 
     args = parser.parse_args()
 
-    max_workers: int = os.cpu_count() or 1
-
-    cmd = build_nix_eval_command(max_workers, args.max_memory_size, args.flake_outputs)
+    cmd = build_nix_eval_command(args.nb_eval_jobs_workers, args.max_memory_size, args.flake_outputs)
 
     # Run evaluation and collect packages, warnings, and errors
     packages, warnings_list, errors_list = run_nix_eval_jobs(cmd)
