Commit 8d444e4

helm(cronjobs): schedule mongo backup/restore jobs on non-preemptible nodes
GKE preemptible VMs can be reclaimed by Google mid-run, which kills long-running mongo jobs (sync, backup, restore) with no error; the Job then reports Failed because the Pod disappeared.

Add nodeAffinity (requiredDuringSchedulingIgnoredDuringExecution, cloud.google.com/gke-preemptible DoesNotExist) to all four mongo workloads so they are never placed on preemptible nodes:

- cronjob/sync-mongo-production-data
- cronjob/mongo-backup
- cronjob/mongo-backup-extra
- jobs/mongo-restore

The nodeAffinity is values-driven (backup.mongo.nodeAffinity, restore.nodeAffinity, cronJobs.syncMongoProductionData.nodeAffinity) so it can be overridden per environment (set to {} to opt out).

Made-with: Cursor
1 parent 4b37264 commit 8d444e4

4 files changed

Lines changed: 27 additions & 0 deletions

helm-chart/sefaria/templates/cronjob/mongo-backup-extra.yaml

Lines changed: 4 additions & 0 deletions
@@ -27,6 +27,10 @@ spec:
                     values:
                     - mongo
                 topologyKey: kubernetes.io/hostname
+            {{- if not (empty .Values.backup.mongo.nodeAffinity) }}
+            nodeAffinity:
+              {{- toYaml .Values.backup.mongo.nodeAffinity | nindent 14 }}
+            {{- end }}
           tolerations:
           - key: schedule-on-database-vm
             operator: "Equal"

helm-chart/sefaria/templates/cronjob/mongo-backup.yaml

Lines changed: 4 additions & 0 deletions
@@ -27,6 +27,10 @@ spec:
                     values:
                     - mongo
                 topologyKey: kubernetes.io/hostname
+            {{- if not (empty .Values.backup.mongo.nodeAffinity) }}
+            nodeAffinity:
+              {{- toYaml .Values.backup.mongo.nodeAffinity | nindent 14 }}
+            {{- end }}
           tolerations:
           - key: schedule-on-database-vm
             operator: "Equal"

helm-chart/sefaria/templates/jobs/mongo-restore.yaml

Lines changed: 5 additions & 0 deletions
@@ -17,6 +17,11 @@ spec:
   template:
     spec:
       serviceAccount: {{ .Values.restore.serviceAccount }}
+      {{- if not (empty .Values.restore.nodeAffinity) }}
+      affinity:
+        nodeAffinity:
+          {{- toYaml .Values.restore.nodeAffinity | nindent 10 }}
+      {{- end }}
       volumes:
         - name: shared-volume
           emptyDir:
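
For reference, with the default restore.nodeAffinity from values.yaml, the block added to mongo-restore.yaml should render roughly as shown here. This is a hand-rendered sketch rather than captured `helm template` output. It assumes restore.serviceAccount keeps its default and that `toYaml` emits list items at the same indentation as their parent key, with map keys in sorted order.

```yaml
# Sketch of the rendered pod spec fragment (hand-rendered, not helm output).
spec:
  template:
    spec:
      serviceAccount: database-backup-read
      affinity:
        nodeAffinity:
          # Hard scheduling rule: only nodes WITHOUT the GKE preemptible label qualify.
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-preemptible
                operator: DoesNotExist
```

Because the rule is `requiredDuringScheduling...`, the Pod would stay Pending if the cluster has no non-preemptible node with enough capacity, instead of falling back to a preemptible one.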

helm-chart/sefaria/values.yaml

Lines changed: 14 additions & 0 deletions
@@ -45,6 +45,13 @@ restore:
   bucket: sefaria-mongo-backup
   # tarball:
   serviceAccount: database-backup-read
+  # Avoid GKE preemptible nodes during restore. Set nodeAffinity: {} to allow any pool.
+  nodeAffinity:
+    requiredDuringSchedulingIgnoredDuringExecution:
+      nodeSelectorTerms:
+        - matchExpressions:
+            - key: cloud.google.com/gke-preemptible
+              operator: DoesNotExist
 # config to backup environment DB
 backup:
   mongo:
@@ -56,6 +63,13 @@ backup:
     archiveBucket: sefaria-mongo-archive
     serviceAccount: database-backup-write
     version: 4.4
+    # Avoid GKE preemptible nodes mid-dump. Set nodeAffinity: {} to allow any pool.
+    nodeAffinity:
+      requiredDuringSchedulingIgnoredDuringExecution:
+        nodeSelectorTerms:
+          - matchExpressions:
+              - key: cloud.google.com/gke-preemptible
+                operator: DoesNotExist
   postgres:
     enabled: false
     version: 10.3
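
A per-environment opt-out could look like the sketch below. The file name values-dev.yaml is hypothetical. One caveat worth verifying before relying on the `{}` form mentioned in the comments above: Helm deep-merges maps across values files, so an override of `nodeAffinity: {}` may leave the chart default in place, whereas setting the key to `null` is the documented way to drop a default key. With the key removed, the templates' `empty` check should skip the nodeAffinity block.

```yaml
# values-dev.yaml (hypothetical per-environment override file)
restore:
  # null removes the chart default so the restore Job may land on any node pool
  nodeAffinity: null
backup:
  mongo:
    # same opt-out for the mongo-backup and mongo-backup-extra CronJobs
    nodeAffinity: null
```

Running `helm template` with this file and checking that no nodeAffinity block appears in the rendered Job and CronJobs would confirm the behavior for the Helm version in use.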
