Commit e2fc94c

picnoir authored and jfroche committed
feat: nix/doc document Nix CI
- Document how the various components (github_matrix.py / GitHub) fit together.
- Document workarounds for the recurring incidents.
1 parent 70368d2 commit e2fc94c

4 files changed

Lines changed: 98 additions & 3 deletions


nix/docs/README.md

Lines changed: 4 additions & 0 deletions
@@ -41,6 +41,10 @@ learn how to play with `postgres` in the [build guide](./build-postgres.md).
 - **[Testing PG Upgrade Scripts](./testing-pg-upgrade-scripts.md)** - Testing PostgreSQL upgrades
 - **[Docker Image testing](./docker-testing.md)** - How to test the docker images against the pg_regress test suite.
 
+## CI
+
+- **[Nix Build Matrix](./nix-build-matrix-ci.md)** - Understand how the CI Nix build matrix works
+
 ## Reference
 
 - **[References](./references.md)** - Useful links and resources

nix/docs/nix-build-matrix-ci.md

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
# Nix Matrix CI

## Incident Q&A

Something broke the Nix CI and you need a quick-and-dirty fix to get unblocked as fast as possible? Follow these guides.

### **Q:** A test is failing; how do I ignore it and generate an AMI image anyway?

**A:** You can adopt the nuclear approach and generate the AMI image regardless of the test outcome. To do that, remove the three conditions checking the `nix-build-checks` result in the `if` clause of the `run-testinfra` step in the `.github/workflows/nix-build.yml` file.

That is, remove the following conditions:

```
(needs.nix-build-checks-aarch64-linux.result == 'skipped' || needs.nix-build-checks-aarch64-linux.result == 'success')
(needs.nix-build-checks-aarch64-darwin.result == 'skipped' || needs.nix-build-checks-aarch64-darwin.result == 'success')
(needs.nix-build-checks-x86_64-linux.result == 'skipped' || needs.nix-build-checks-x86_64-linux.result == 'success')
```

Note: the merge queue check will still block the PR from getting merged to develop.

### **Q:** A hosted runner is down; how do I reschedule a job somewhere else?

**A:** Edit the `BUILD_RUNNER_MAP` dictionary in the `github_matrix.py` script and change the `labels` entry to match one of the still-functional GitHub runners.

You can see the available runners and their associated labels on [this page](https://github.com/supabase/postgres/actions/runners?tab=self-hosted). Note: the Blacksmith runners are considered "self-hosted" by GitHub.
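The edit above can be sketched as follows. This is an illustration only: the dictionary shape, keys, and label values here are assumptions, not the actual contents of `github_matrix.py`.

```python
# Hypothetical sketch of BUILD_RUNNER_MAP in github_matrix.py.
# Keys, nested fields, and label values are illustrative assumptions.
BUILD_RUNNER_MAP = {
    "aarch64-linux": {"labels": ["blacksmith-32vcpu-ubuntu-2404-arm"]},
    "x86_64-linux": {"labels": ["blacksmith-32vcpu-ubuntu-2404"]},
    "aarch64-darwin": {"labels": ["self-hosted", "macOS"]},
}

# If, say, the aarch64-linux runner is down, point its `labels` entry at a
# still-functional runner picked from the runners page linked above:
BUILD_RUNNER_MAP["aarch64-linux"]["labels"] = ["self-hosted", "linux", "arm64"]

print(BUILD_RUNNER_MAP["aarch64-linux"]["labels"])
```

Any job whose matrix entry resolves through that architecture will then be scheduled on the replacement runner.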
### **Q:** The eval step is OOM-ing; what should I do?

**A:** The evaluation can be quite costly memory-wise. [nix-eval-jobs](https://github.com/NixOS/nix-eval-jobs) spins up multiple Nix evaluations in parallel to speed things up. The tradeoff is increased memory consumption compared to a single-process eval.

An easy way to work around an eval OOM issue is to reduce the number of parallel evals by overriding the `--nb-eval-jobs-workers` flag in the `github_matrix` call in the `.github/workflows/nix-eval.yml` file.

That is, replace

```terminal
nix run --accept-flake-config .\#github-matrix -- checks legacyPackages
```

with:

```terminal
nix run --accept-flake-config .\#github-matrix -- --nb-eval-jobs-workers 30 checks legacyPackages
```

Note: by default, `github_matrix.py` will spin up one eval instance per CPU. For a `blacksmith-32vcpu-ubuntu-2404` worker, that means 32 Nix eval instances.

## CI Walkthrough

The Nix artifacts are built by the `Nix CI` workflow defined in the `.github/workflows/nix-build.yml` file.

It is performed in four steps, each depending on the previous one.

### Step 1: Eval

Conceptually, this workflow evaluates the `legacyPackages` and `checks` flake outputs using [nix-eval-jobs](https://github.com/NixOS/nix-eval-jobs). This step produces a JSON map containing the jobs to build/check for each architecture. That JSON map is later consumed by the subsequent build and check steps.
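The exact schema of that JSON map is defined by the script; as a rough illustration (attribute paths and field names here are assumptions, not the actual output format), its shape might look like:

```python
# Illustrative shape of the per-architecture job map emitted by the eval
# step. Attribute paths and field names are assumptions for illustration.
matrix = {
    "x86_64-linux": [
        {"attr": "checks.x86_64-linux.postgres", "runner": "blacksmith-32vcpu-ubuntu-2404"},
    ],
    "aarch64-darwin": [
        {"attr": "legacyPackages.aarch64-darwin.psql", "runner": "self-hosted"},
    ],
}

# Downstream build/check steps iterate one architecture at a time:
for system, jobs in matrix.items():
    for job in jobs:
        print(f"{system}: build {job['attr']} on {job['runner']}")
```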
Implementation-wise, most of the code lives in the `/nix/packages/github-matrix/github_matrix.py` Python script. The script starts an instance of `nix-eval-jobs` and parses its output. Each parsed job is associated with a builder tag using the following precedence:

1. KVM packages -> self-hosted runners
2. Large packages on Linux -> 32vcpu ephemeral runners
3. Darwin packages -> self-hosted runners
4. Default -> ephemeral runners

KVM packages and large packages are identified by the `kvm` and `big-parallel` Nix attributes, respectively.
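The precedence above could be sketched like this; the function, field names, and returned labels are illustrative, not the script's actual code:

```python
def pick_runner(system: str, features: list[str]) -> str:
    """Illustrative version of the builder-tag precedence; the real logic
    lives in github_matrix.py and may differ in names and details."""
    if "kvm" in features:
        return "self-hosted-kvm"      # 1. KVM packages
    if "big-parallel" in features and system.endswith("-linux"):
        return "32vcpu-ephemeral"     # 2. large packages on Linux
    if system.endswith("-darwin"):
        return "self-hosted-darwin"   # 3. Darwin packages
    return "ephemeral"                # 4. default

print(pick_runner("x86_64-linux", ["big-parallel"]))  # → 32vcpu-ephemeral
```

Note that the ordering matters: a KVM-requiring package that is also large still goes to a self-hosted runner.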
GHA-wise, `.github/workflows/nix-eval.yml` is called by the `nix-build.yml` workflow. `github_matrix.py` is instantiated in the `Generate Nix Matrix` step through a `nix run` call. The resulting JSON map is stored in the workflow output and later used by the subsequent steps.

### Step 2: Build

This step is in charge of building the various Nix packages. Build matrices are instantiated for each system architecture.

Implementation-wise, this step is less complex than the eval one. Most of the magic lies in the machine selection: the previous step attached an instance label to each job indicating which kind of GitHub runner it should be executed on.

The actual build step is a simple `nix build ${job}` invocation. The result of this build is pushed to the `nix-postgres-artifacts` S3 cache. This step is instantiated three times, once for each of the supported architectures: aarch64-darwin, aarch64-linux, and x86_64-linux.
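Conceptually, each build job boils down to two commands. A minimal sketch, where the exact flag set and the attribute name are assumptions beyond what the text states:

```python
# Hedged sketch of a per-job build step: build one flake attribute, then
# push the result to the nix-postgres-artifacts S3 cache. The flag set and
# attribute names are assumptions, not the workflow's exact invocation.
def build_command(job_attr: str) -> list[str]:
    return ["nix", "build", f".#{job_attr}", "--accept-flake-config"]

def push_command(job_attr: str) -> list[str]:
    return ["nix", "copy", "--to", "s3://nix-postgres-artifacts", f".#{job_attr}"]

print(" ".join(build_command("legacyPackages.x86_64-linux.psql")))
```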
### Step 3: Check

This step again uses the JSON generated by the evaluation step to run various automated tests. Some of those require virtualization and are run on the self-hosted runners able to perform KVM virtualization.

Implementation-wise, this step is very similar to the previous one: a matrix job instantiated once per target architecture, "just" running on a different set of Nix jobs. These tests assume the various plugins have been built and are part of the Nix cache.

### Step 4: Image Build

The last step builds AMI images from the artifacts generated during step 2, using the `nix/packages/build-ami.nix` script to generate an AMI image based on Ubuntu Noble. The generation of the image is done in two steps.

nix/mkdocs.yml

Lines changed: 2 additions & 0 deletions
@@ -38,6 +38,8 @@ nav:
   - Testing PG Upgrade Scripts: testing-pg-upgrade-scripts.md
   - Test Docker Images: docker-testing.md
   - NixOS Tests on macOS: nixos-tests-on-macos.md
+  - CI:
+    - Nix Build Matrix: nix-build-matrix-ci.md
   - References: references.md
 
 validation:

nix/packages/github-matrix/github_matrix.py

Lines changed: 8 additions & 3 deletions
@@ -290,15 +290,20 @@ def main() -> None:
         type=int,
         help="Maximum memory per eval worker in MiB. Defaults to 3072 (3 GiB).",
     )
+    parser.add_argument(
+        "-j",
+        "--nb-eval-jobs-workers",
+        default=os.cpu_count() or 1,
+        type=int,
+        help="Number of parallel eval jobs. Defaults to ncpus.",
+    )
     parser.add_argument(
         "flake_outputs", nargs="+", help="Nix flake outputs to evaluate"
     )
 
     args = parser.parse_args()
 
-    max_workers: int = os.cpu_count() or 1
-
-    cmd = build_nix_eval_command(max_workers, args.max_memory_size, args.flake_outputs)
+    cmd = build_nix_eval_command(args.nb_eval_jobs_workers, args.max_memory_size, args.flake_outputs)
 
     # Run evaluation and collect packages, warnings, and errors
     packages, warnings_list, errors_list = run_nix_eval_jobs(cmd)
