#
# DEPRECATED: In Aug 2023, we moved the test/dev version of the portal to the 'braincerebellumdata' namespace on the
# Duke Azure cluster 'dkstest'. When we're ready to spin up a production version, it will be deployed to the same
# namespace but on a different Azure cluster, 'dks'. The 'braincerebellumdata-test' namespace will be deleted.
#
# ----------------------
# This YAML defines the deployment of the Lisberger lab portal application to the Duke Azure beta-test cluster in the
# braincerebellumdata-test namespace. This namespace is used solely to test and evaluate the application whenever the
# implementation changes in any way (bug fixes, new features, etc.).
#
# Duke is in the process of transitioning from Red Hat OpenShift to Microsoft Azure. The first step is to verify that
# customer apps can be deployed successfully on the beta-test Azure cluster, and to make whatever changes the
# deployment specs require because of differences between the Azure and OpenShift Kubernetes implementations.
#
# The application now consists of 3 interdependent deployments:
# 1) data-store: A single-pod deployment containing the application's two persistence mechanisms: the MariaDB server
#    and its underlying database files, plus a Redis instance that backs the Redis Queue (RQ) workers and provides
#    temporary storage of information during a long-running experiment commit workflow.
# 2) web-server: A 2-replica deployment of the Gunicorn-based Python/Flask/Dash web backend.
# 3) web-worker: A 2-replica deployment of the Redis Queue (RQ) workers that handle background tasks for the
#    experiment session commit workflow.
# The web-server backend replicas use Redis to store information on in-progress session commit jobs (so that the backend
# itself remains stateless), and the RQ workers handle various background tasks that are queued by the web server
# because they require extensive processing time (preprocessing a session archive; committing a session to the database
# and saving it to the backup repository; backing up the various operational logs).
#
# => NOTES.
# 1) The application uses two PVCs: pvc-db is for the exclusive use of the MariaDB container, while pvc-backend
# provides persistent cluster storage for the portal server's "local workspace". The portal's backup repository
# is hosted on an AWS S3 bucket external to the Azure cluster. The associated persistent volumes are dynamically
# provisioned by Azure. Note that, initially, we used `private-azfile-standard-lrs` as the storage class for both
#    PVCs, but this did not work with the MariaDB container, which would fail to start because the MariaDB daemon --
#    running as USER 1001 as defined in the Dockerfile -- was denied access to the storage volume. Once we switched
# the storage class for pvc-db to `default`, the MariaDB container spun up successfully.
# 2) The Ingress resource named `sgl-portal-dkstest-in` provides TLS edge termination for `portal-server`, a ClusterIP
#    service that routes to the Gunicorn-served Dash/Flask backend listening on container port 8050. With this
# configuration, you access the portal at https://braincerebellumdata-test.dkstest.dhe.duke.edu/ -- BUT you must be
# on the Duke VPN. We'll want to configure this differently for production. Note that the OpenShift deployment
# used a Route resource to do this. Note the annotation `nginx.ingress.kubernetes.io/proxy-body-size: 10g` in the
# Ingress spec. This allows for large file uploads to the portal backend through the NGINX-mediated ingress.
# 3) Take note of the image names for the backend, mariadb, and redis containers. These images are built in Gitlab
#    whenever we push changes to the 'devel' branch. The image tags end in 'devel-<sha>', where <sha> is a shortened
#    version of the commit SHA for that image. This is important. Previously, there was just the 'devel' tag, which
#    was overwritten each time a commit to the devel branch was pushed. If a pod failed or was evicted from the
#    cluster, Azure would pull the latest 'devel' image from Gitlab when recreating the pod, so we lost control of
#    what was running on the cluster. THIS MEANS THAT EVERY TIME WE WANT TO DEPLOY A NEW VERSION, WE MUST UPDATE THE
#    <sha> TAG EVERYWHERE IN THIS FILE.
# 4) Secrets. Several environment variables are defined in an opaque Kubernetes Secret, sgl-secret, which must be
# deployed prior to deploying the application. The YAML that defines the secret is excluded from the Git repo to
# protect those secrets.
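#
#    For reference, here is a minimal sketch of how such a Secret could be created with kubectl (the key names match
#    the secretKeyRef entries in the deployment specs below; the values shown are placeholders, not the real secrets):
#
#      kubectl -n braincerebellumdata-test create secret generic sgl-secret \
#        --from-literal=mariadb_root_password=<password> \
#        --from-literal=flask_secret_key=<key> \
#        --from-literal=jwt_secret_key=<key> \
#        --from-literal=aws_access_key_id=<id> \
#        --from-literal=aws_access_key_secret=<secret>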
#
# => Procedure to update the already-deployed application IF there are no changes to the database schema or the portal repository.
#
# Scenario: We're merely updating the container images for the backend, MariaDB, and Redis instance. These images
# are kept in a container registry in the portal application's Gitlab project, and this deployment YAML is configured
# to pull those images from Gitlab. When we update the container images, they'll have a new SHA (see discussion above).
# Basically, we have to update the spec of each of the 3 deployments -- `data-store`, `web-server`, and `web-worker` --
# to use the updated images.
# 1) Connect to the Duke VPN.
# 2) Open the Lens Desktop app (already set up to access the braincerebellumdata-test namespace on Duke's Azure
#    cluster), and select "Browse clusters in catalog".
# 3) Connect to cluster "dhd-dhts-dks-dev01-cl". At this point, you should see -- under "Overview" -- the pods,
#    deployments, and replica sets comprising the currently running portal application.
# 4) Go to "Deployments" and edit each of the deployments -- data-store, web-server, and web-worker -- updating the
# container image (basically, you're updating the SHA) in the deployment spec for the redis, mariadb, and backend
# images.
# Azure will automatically begin a rolling update of each deployment once you save the changes.
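#
#    Equivalently, the images could be updated from the command line (a sketch, not the procedure we've used --
#    assumes kubectl is configured for this cluster and namespace, and that you substitute the new <sha> throughout):
#
#      kubectl -n braincerebellumdata-test set image deployment/data-store \
#          sgl-db=gitlab.dhe.duke.edu:4567/sar52/sgl-portal/mariadb:devel-<sha> \
#          sgl-redis=gitlab.dhe.duke.edu:4567/sar52/sgl-portal/redis:devel-<sha>
#      kubectl -n braincerebellumdata-test set image deployment/web-server \
#          sgl-server=gitlab.dhe.duke.edu:4567/sar52/sgl-portal/backend:devel-<sha>
#      kubectl -n braincerebellumdata-test set image deployment/web-worker \
#          sgl-worker=gitlab.dhe.duke.edu:4567/sar52/sgl-portal/backend:devel-<sha>
#      kubectl -n braincerebellumdata-test rollout status deployment/data-store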
#
# This approach could cause issues if users are actively accessing the portal, or if any background tasks are running
# in any of the web-worker replica pods. So it is probably NOT a good idea to do things this way in production. Better
# to just bring the application down for a short while, after making sure no background tasks are running.
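#
# One way to check for active workers and queued jobs before taking the application down (a sketch; 'rq' ships in the
# backend image, since the workers run it):
#
#   kubectl -n braincerebellumdata-test exec deploy/web-worker -- \
#       rq info --url redis://data-store.braincerebellumdata-test.svc.cluster.local:6379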
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: sgl-portal-dkstest-in
namespace: braincerebellumdata-test
labels:
app.kubernetes.io/instance: sgl-portal-dkstest
app.kubernetes.io/name: sgl-portal
annotations:
nginx.ingress.kubernetes.io/proxy-body-size: 10g
spec:
ingressClassName: dhe-nginx
tls:
- hosts:
- braincerebellumdata-test.dkstest.dhe.duke.edu
rules:
- host: braincerebellumdata-test.dkstest.dhe.duke.edu
http:
paths:
- path: /
pathType: ImplementationSpecific
backend:
service:
name: portal-server
port:
number: 8050
---
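# Persistent "local workspace" shared by the web-server and web-worker pods (mounted at /portal_ws).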
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: pvc-backend
spec:
accessModes:
- ReadWriteMany
storageClassName: private-azfile-standard-lrs
resources:
requests:
storage: 100Gi
---
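# Persistent storage for the exclusive use of the MariaDB container (see note 1 in the header re: the storage class).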
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: pvc-db
spec:
accessModes:
- ReadWriteOnce
storageClassName: default
resources:
requests:
storage: 200Gi
---
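# ClusterIP service that routes Ingress traffic to the Gunicorn-served backend pods on port 8050.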
apiVersion: v1
kind: Service
metadata:
name: portal-server
labels:
app: sgl-portal
tier: web
spec:
type: ClusterIP
selector:
app: sgl-portal
tier: web
ports:
- protocol: TCP
port: 8050
targetPort: 8050
---
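# ClusterIP service exposing the Redis (6379) and MariaDB (3306) containers in the data-store pod.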
apiVersion: v1
kind: Service
metadata:
name: data-store
labels:
app: sgl-portal
tier: data
spec:
type: ClusterIP
selector:
app: sgl-portal
tier: data
ports:
- protocol: TCP
name: redis
port: 6379
targetPort: 6379
- protocol: TCP
name: mariadb
port: 3306
targetPort: 3306
---
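# The data-store deployment: a single pod hosting both the MariaDB server and the Redis instance.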
apiVersion: apps/v1
kind: Deployment
metadata:
name: data-store
labels:
app: sgl-portal
tier: data
spec:
replicas: 1
selector:
matchLabels:
app: sgl-portal
tier: data
strategy:
type: Recreate # we must NOT have two MariaDB containers running at once, accessing the same PVC!
template:
metadata:
labels:
app: sgl-portal
tier: data
spec:
imagePullSecrets:
- name: sgl-portal-pull-secret
initContainers:
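        # Grant group write access up front so the non-root MariaDB daemon (USER 1001; see note 1 in the header) can
        # access the dynamically provisioned volume.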
- name: fix-permissions
image: busybox
command: [ 'sh', '-c', 'chmod -R g+rwX /bitnami/mariadb' ]
volumeMounts:
- name: db-data
mountPath: /bitnami/mariadb
containers:
- name: sgl-db
image: gitlab.dhe.duke.edu:4567/sar52/sgl-portal/mariadb:devel-686e3711
imagePullPolicy: Always
volumeMounts:
- name: db-data
mountPath: /bitnami/mariadb
env:
- name: MARIADB_ROOT_PASSWORD
valueFrom:
secretKeyRef:
name: sgl-secret
key: mariadb_root_password
ports:
            - containerPort: 3306
startupProbe:
exec:
command: [ "sh", "-c", "exec mariadb-admin status -uroot -p$MARIADB_ROOT_PASSWORD" ]
initialDelaySeconds: 20
periodSeconds: 10
failureThreshold: 60 # Give container up to 10 minutes to start up (for crash recovery, etc)
livenessProbe:
exec:
command: ["sh", "-c", "exec mariadb-admin status -uroot -p$MARIADB_ROOT_PASSWORD"]
initialDelaySeconds: 20
periodSeconds: 5
readinessProbe:
exec:
command: ["sh", "-c", "exec mariadb-admin status -uroot -p$MARIADB_ROOT_PASSWORD"]
initialDelaySeconds: 20
periodSeconds: 5
- name: sgl-redis
image: gitlab.dhe.duke.edu:4567/sar52/sgl-portal/redis:devel-686e3711
imagePullPolicy: Always
ports:
- containerPort: 6379
livenessProbe:
exec:
command: ["sh", "-c", 'RES=$(redis-cli ping); if [ "$RES" != "PONG" ]; then echo "$RES"; exit 1; fi']
initialDelaySeconds: 20
periodSeconds: 5
readinessProbe:
exec:
command: ["sh", "-c", 'RES=$(redis-cli ping); if [ "$RES" != "PONG" ]; then echo "$RES"; exit 1; fi']
initialDelaySeconds: 5
periodSeconds: 5
volumes:
- name: db-data
persistentVolumeClaim:
claimName: pvc-db
---
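# The web-worker deployment: 2 replicas of the RQ worker that handles queued background tasks.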
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-worker
labels:
app: sgl-portal
tier: worker
spec:
replicas: 2
selector:
matchLabels:
app: sgl-portal
tier: worker
template:
metadata:
labels:
app: sgl-portal
tier: worker
spec:
imagePullSecrets:
- name: sgl-portal-pull-secret
initContainers:
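        # Block until the Redis instance in the data-store pod is reachable; the RQ workers connect to it at startup.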
- name: wait-for-redis
image: busybox
command: ['sh', '-c', 'echo Waiting for Redis; until nc -z $(REDIS_HOST) $(REDIS_PORT); do printf "."; sleep 1; done; echo Ready!']
env:
- name: REDIS_HOST
value: data-store.braincerebellumdata-test.svc.cluster.local
- name: REDIS_PORT
value: "6379"
containers:
- name: sgl-worker
image: gitlab.dhe.duke.edu:4567/sar52/sgl-portal/backend:devel-686e3711
imagePullPolicy: Always
command: ["rq"]
args: ["worker", "--with-scheduler", "--url", "redis://$(REDIS_HOST):$(REDIS_PORT)"]
volumeMounts:
- name: backend-workspace
mountPath: /portal_ws
env:
- name: REDIS_HOST
value: data-store.braincerebellumdata-test.svc.cluster.local
- name: REDIS_PORT
value: "6379"
- name: MARIADB_HOSTNAME
value: data-store.braincerebellumdata-test.svc.cluster.local
- name: WORKSPACE_DIR
value: /portal_ws
- name: APP_CONTAINER
value: sgl-worker
- name: AWS_REGION_NAME
value: us-east-1
- name: REPO_S3_BUCKET_NAME
value: dh-braincerebellumdata-test
- name: MARIADB_ROOT_PASSWORD
valueFrom:
secretKeyRef:
name: sgl-secret
key: mariadb_root_password
- name: FLASK_SECRET_KEY
valueFrom:
secretKeyRef:
name: sgl-secret
key: flask_secret_key
- name: JWT_SECRET_KEY
valueFrom:
secretKeyRef:
name: sgl-secret
key: jwt_secret_key
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: sgl-secret
key: aws_access_key_id
- name: AWS_ACCESS_KEY_SECRET
valueFrom:
secretKeyRef:
name: sgl-secret
key: aws_access_key_secret
livenessProbe:
exec:
command: [ "sh", "-c", 'H=$(hostname); rq info -W | grep -c $H' ]
initialDelaySeconds: 20
periodSeconds: 5
readinessProbe:
exec:
command: [ "sh", "-c", 'H=$(hostname); rq info -W | grep -c $H' ]
initialDelaySeconds: 5
periodSeconds: 5
volumes:
- name: backend-workspace
persistentVolumeClaim:
claimName: pvc-backend
---
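# The web-server deployment: 2 replicas of the Gunicorn-served Flask/Dash backend behind the portal-server service.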
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-server
labels:
app: sgl-portal
tier: web
spec:
replicas: 2
selector:
matchLabels:
app: sgl-portal
tier: web
template:
metadata:
labels:
app: sgl-portal
tier: web
spec:
imagePullSecrets:
- name: sgl-portal-pull-secret
initContainers:
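        # Block until both the Redis and MariaDB containers in the data-store pod are reachable before starting the
        # Gunicorn backend.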
- name: wait-for-redis
image: busybox
command: ['sh', '-c', 'echo Waiting for Redis; until nc -z $(DATA_HOST) 6379; do printf "."; sleep 1; done; echo Ready!']
env:
- name: DATA_HOST
value: data-store.braincerebellumdata-test.svc.cluster.local
- name: wait-for-mariadb
image: busybox
command: ['sh', '-c', 'echo Waiting for MariaDB; until nc -z $(DATA_HOST) 3306; do printf "."; sleep 1; done; echo Ready!']
env:
- name: DATA_HOST
value: data-store.braincerebellumdata-test.svc.cluster.local
containers:
- name: sgl-server
image: gitlab.dhe.duke.edu:4567/sar52/sgl-portal/backend:devel-686e3711
imagePullPolicy: Always
ports:
- containerPort: 8050
volumeMounts:
- name: backend-workspace
mountPath: /portal_ws
env:
- name: REDIS_HOST
value: data-store.braincerebellumdata-test.svc.cluster.local
- name: REDIS_PORT
value: "6379"
- name: MARIADB_HOSTNAME
value: data-store.braincerebellumdata-test.svc.cluster.local
- name: WORKSPACE_DIR
value: /portal_ws
- name: APP_CONTAINER
value: sgl-server
- name: AWS_REGION_NAME
value: us-east-1
- name: REPO_S3_BUCKET_NAME
value: dh-braincerebellumdata-test
- name: MARIADB_ROOT_PASSWORD
valueFrom:
secretKeyRef:
name: sgl-secret
key: mariadb_root_password
- name: FLASK_SECRET_KEY
valueFrom:
secretKeyRef:
name: sgl-secret
key: flask_secret_key
- name: JWT_SECRET_KEY
valueFrom:
secretKeyRef:
name: sgl-secret
key: jwt_secret_key
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: sgl-secret
key: aws_access_key_id
- name: AWS_ACCESS_KEY_SECRET
valueFrom:
secretKeyRef:
name: sgl-secret
key: aws_access_key_secret
livenessProbe:
exec:
command: [ "sh", "-c", 'pidof -x gunicorn' ]
initialDelaySeconds: 20
periodSeconds: 5
readinessProbe:
httpGet:
port: 8050
path: /
initialDelaySeconds: 20
periodSeconds: 5
volumes:
- name: backend-workspace
persistentVolumeClaim:
claimName: pvc-backend