djdev/deploy-dkstest-braincerebellumdata.yaml at devel · sruffner/djdev · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
# This YAML defines the deployment of the Lisberger lab portal application to the Duke Azure cluster 'dkstest' in the
# 'braincerebellumdata' namespace. Deployment to this test cluster is intended for testing and evaluating the portal
# application whenever the implementation changes in any way (bug fixes, new features, etc).
#
# Duke is in the process of replacing its RedHat OpenShift cluster with Microsoft Azure. Two distinct clusters will be
# maintained: 'dkstest' for testing/debugging apps and 'dks' for the production versions.
#
# The application now consists of 3 interdependent deployments:
#   1) data-store: A single-pod deployment with the application's two persistence mechanisms: the MariaDB server and its
#      underlying database files; and a Redis instance for Redis worker queueing and temporary storage of information
#      during a long-running experiment commit workflow.
#   2) web-server: A 2-replica deployment of the GUnicorn-based Python/Flask/Dash web backend.
#   3) web-worker: A 2-replica deployment of the Redis Queue workers that handle background tasks for the experiment
#      session commit workflow.
# The web-server backend replicas use Redis to store information on in-progress session commit jobs (so that the backend
# itself remains stateless), and the RQ workers handle various background tasks that are queued by the web server
# because they require extensive processing time (preprocessing a session archive; committing a session to the database
# and saving it to the backup repository; backing up the various operational logs).
#
# => NOTES.
#  1) The application uses two PVCs: pvc-db is for the exclusive use of the MariaDB container, while pvc-backend
#     provides persistent cluster storage for the portal server's "local workspace". The associated persistent volumes
#     are dynamically provisioned by Azure. Note that, initially, we used `private-azfile-standard-lrs` as the storage
#     class for both PVCs, but this did not work with the MariaDB container, which would fail to start b/c the MariaDB
#     daemon -- acting as USER 1001 as defined in the Docker file -- was denied access to the storage volume. Once we
#     switched the storage class for pvc-db to `default`, the MariaDB container spun up successfully.
#  2) The portal's backup repository is hosted on an AWS S3 bucket external to the Azure cluster. For the test/dev
#     version of the portal app deployed to the 'dkstest' cluster, the S3 bucket name is 'dh-braincerebellumdata-test'.
#  3) The Ingress resource named `sgl-portal-dkstest-in` provides TLS edge termination for `portal-server`, a ClusterIP
#     service that routes to the GUnicorn-served Dash/Flask backend listening on the node's 8050 port. With this
#     configuration, you access the portal at https://braincerebellumdata.dkstest.dhe.duke.edu/ -- BUT you must be
#     on the Duke VPN. We'll want to configure this differently for production. Note that the OpenShift deployment
#     used a Route resource to do this. Note the annotation `nginx.ingress.kubernetes.io/proxy-body-size: 10g` in the
#     Ingress spec. This allows for large file uploads to the portal backend through the NGINX-mediated ingress.
#  4) Take note of the image names for the backend, mariadb, and redis containers. These images are built in Gitlab
#     whenever we push changes to the 'devel' branch. The image tags end in 'devel-<sha>', where <sha> is a shortened
#     version of the commit SHA for that image. This is important. Previously, there was just the 'devel' tag, which
#     was overwritten each time a commit to the devel branch was pushed. If a pod failed or was kicked out on the
#     cluster, Azure will pull images from Gitlab when recreating the pod. This meant loss of control of what is running
#     on the cluster. THIS MEANS THAT EVERY TIME WE WANT TO DEPLOY A NEW VERSION, WE MUST UPDATE THE <sha> TAG
#     EVERYHWERE IN THIS FILE.
#  5) Secrets. Several environment variables are defined in an opaque Kubernetes Secret, sgl-secret. Another secret,
#     sgl-portal-pull-secret, contains the access token required to pull container images from the Gitlab repository for
#     the portal app. Both of these secrets must be deployed to the namespace prior to deploying any of the pods. The
#     YAMLs that define these secrets (deploy-sgl-secret.yaml and gitlab-sgl-portal-pull-secret.yaml) are excluded from
#     the Git repo.
#
# => Procedure to update the already deployed application IF no changes in database schema or the portal repository.
#
# Scenario: We're merely updating the container images for the backend, MariaDB, and Redis instance. These images
# are kept in a container registry in the portal application's Gitlab project, and this deployment YAML is configured
# to pull those images from Gitlab. When we update the container images, they'll have a new SHA (see discussion above).
# Basically, we have to update the spec of each of the 3 deployments -- `data-store`, `web-server`, and `web-worker` --
# to use the updated images.
#  1) Connect to the Duke VPN.
#  2) Head to portal.azure.com and log in with Duke credentials (email @duke.edu). You should see the production
#     (dhp-dhts-dks-cl01-aks) and test/dev clusters (dhd-dhts-dks-dev01-cl). If you click on the production cluster,
#     under "overview", there is a tab called "getting started". There instructions for setting up the Azure CLI on your
#     local machine and logging in to either cluster so you can make changes to the app using 'kubectl' from the command
#     line.
#  3) Login to the test/dev cluster "dhd-dhts-dks-dev01-cl". Type 'az login' at the terminal, and a browser tab opens to
#     complete the login - just click the @duke.edu Microsoft account to which you're already signed in.
#  4) Use kubectl to make 'braincerebellumdata' the current namespace and verify everything is running normally:
#     'kubectl config set-context --current --namespace=braincerebellumdata'.
#  5) Make sure this YAML is up-to-date with the container image names as described above, and ensure the current
#     directoy is the folder containing this YAML.
#  6) Then: 'kubectl apply -f deploy-dkstest-braincerebellumdata.yaml'. If successful, Azure/Kubernetes will
#     automatically begin a rolling update of each deployment comprising the portal app.
#
# This approach could cause issues if users are actively accessing the portal, or if any background tasks are running
# in any of the web-worker replica pods. So it is probably NOT a good idea to do things this way in production. Better
# to just bring the application down for a short while, after making sure no background tasks are running.
#
# => Using the 'deploy' selector label to selectively deploy resources
#
# The metadata:labels section of each resource includes a label key 'deploy' that lets you selectively deploy the
# resources using: kubectl apply -f <this file> -l 'deploy in (...)'. This is useful, for example, if you're deploying
# from scratch and want to do it in an orderly manner:
#     1) First, deploy the secrets sgl-secret and sgl-portal-pull-secret, which are defined in separate YAML files.
#     2) Deploy the PVCs and services: kubectl apply -f <this file> -l 'deploy in (pvc, service)'
#     3) Deploy the database and Redis: kubectl apply -f <this file> -l 'deploy in (data-pods)'
#     4) At this point, assuming step 3 succeeds, the portal database 'sgl' will not exist. If you have backed up the
#        entire contents of that database to a file sgl_bup.sql using mysqldump, you could 'kubectl cp' the file into
#        the /bitnami/mariadb directory and restore it by opening a terminal in that container with 'kubectl exec',
#        navigating to /bitnami/mariadb, and running 'mysql -uroot -p<pwd> < sgl_bup.sql.
#     5) Deploy the Redis worker pods: kubectl apply -f <this file> -l 'deploy in (worker-pods)'
#     6) At this point, the PVC for the web server and Redis workers is allocated and ready. On that volume is stored
#        some key files in the directory /portal_ws/logs: database_ops.log, appmessages.log, and api_requests.log. If
#        you have saved these files from a previously running version of the portal and need to restore them as well,
#        'kubectl exec' into the sgl-worker container of one of the web-worker pods and make sure the logs/ folder
#        exists under /portal_ws. Then do:
#           kubectl cp <local path to database_ops.log> <namespace>/<worker pod name>:/portal_ws/logs/database_ops.log
#     7) Deploy the web server pods: kubectl apply -f <this file> -l 'deploy in (web-pods)'
#     8) Finally, 'kubectl apply -f <this file> -l 'deploy in (ingress)' configures the ingress, exposing the portal
#        app to the outside world.
#

---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sgl-portal-dkstest-in
  namespace: braincerebellumdata
  labels:
    app.kubernetes.io/instance: sgl-portal-dkstest
    app.kubernetes.io/name: sgl-portal
    deploy: ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: 10g
spec:
  ingressClassName: dhe-nginx
  tls:
    - hosts:
        - braincerebellumdata.dkstest.dhe.duke.edu
  rules:
    - host: braincerebellumdata.dkstest.dhe.duke.edu
      http:
        paths:
          - path: /
            pathType: ImplementationSpecific
            backend:
              service:
                name: portal-server
                port:
                  number: 8050
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-backend
  labels:
    deploy: pvc
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: private-azfile-standard-lrs
  resources:
    requests:
      storage: 100Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-db
  labels:
    deploy: pvc
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: default
  resources:
    requests:
      storage: 200Gi
---
apiVersion: v1
kind: Service
metadata:
  name: portal-server
  labels:
    app: sgl-portal
    tier: web
    deploy: service
spec:
  type: ClusterIP
  selector:
    app: sgl-portal
    tier: web
  ports:
    - protocol: TCP
      port: 8050
      targetPort: 8050
---
apiVersion: v1
kind: Service
metadata:
  name: data-store
  labels:
    app: sgl-portal
    tier: data
    deploy: service
spec:
  type: ClusterIP
  selector:
    app: sgl-portal
    tier: data
  ports:
    - protocol: TCP
      name: redis
      port: 6379
      targetPort: 6379
    - protocol: TCP
      name: mariadb
      port: 3306
      targetPort: 3306
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: data-store
  labels:
    app: sgl-portal
    tier: data
    deploy: data-pods
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sgl-portal
      tier: data
  strategy:
    type: Recreate   # we must NOT have two MariaDB containers running at once, accessing the same PVC!
  template:
    metadata:
      labels:
        app: sgl-portal
        tier: data
    spec:
      imagePullSecrets:
      - name: sgl-portal-pull-secret
      initContainers:
      - name: fix-permissions
        image: busybox
        command: [ 'sh', '-c', 'chmod -R g+rwX /bitnami/mariadb' ]
        volumeMounts:
          - name: db-data
            mountPath: /bitnami/mariadb
      containers:
      - name: sgl-db
        image: gitlab.dhe.duke.edu:4567/sar52/sgl-portal/mariadb:devel-411291c6
        imagePullPolicy: Always
        volumeMounts:
          - name: db-data
            mountPath: /bitnami/mariadb
        env:
          - name: MARIADB_ROOT_PASSWORD
            valueFrom:
              secretKeyRef:
                name: sgl-secret
                key: mariadb_root_password
        ports:
          - containerPort: 3360
        startupProbe:
          exec:
            command: [ "sh", "-c", "exec mariadb-admin status -uroot -p$MARIADB_ROOT_PASSWORD" ]
          initialDelaySeconds: 20
          periodSeconds: 10
          failureThreshold: 60   # Give container up to 10 minutes to start up (for crash recovery, etc)
        livenessProbe:
          exec:
            command: ["sh", "-c", "exec mariadb-admin status -uroot -p$MARIADB_ROOT_PASSWORD"]
          initialDelaySeconds: 20
          periodSeconds: 5
        readinessProbe:
          exec:
            command: ["sh", "-c", "exec mariadb-admin status -uroot -p$MARIADB_ROOT_PASSWORD"]
          initialDelaySeconds: 20
          periodSeconds: 5
      - name: sgl-redis
        image: gitlab.dhe.duke.edu:4567/sar52/sgl-portal/redis:devel-411291c6
        imagePullPolicy: Always
        ports:
          - containerPort: 6379
        livenessProbe:
          exec:
            command: ["sh", "-c", 'RES=$(redis-cli ping); if [ "$RES" != "PONG" ]; then echo "$RES"; exit 1; fi']
          initialDelaySeconds: 20
          periodSeconds: 5
        readinessProbe:
          exec:
            command: ["sh", "-c", 'RES=$(redis-cli ping); if [ "$RES" != "PONG" ]; then echo "$RES"; exit 1; fi']
          initialDelaySeconds: 5
          periodSeconds: 5
      volumes:
        - name: db-data
          persistentVolumeClaim:
            claimName: pvc-db
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-worker
  labels:
    app: sgl-portal
    tier: worker
    deploy: worker-pods
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sgl-portal
      tier: worker
  template:
    metadata:
      labels:
        app: sgl-portal
        tier: worker
    spec:
      imagePullSecrets:
      - name: sgl-portal-pull-secret
      initContainers:
      - name: wait-for-redis
        image: busybox
        command: ['sh', '-c', 'echo Waiting for Redis; until nc -z $(REDIS_HOST) $(REDIS_PORT); do printf "."; sleep 1; done; echo Ready!']
        env:
          - name: REDIS_HOST
            value: data-store.braincerebellumdata.svc.cluster.local
          - name: REDIS_PORT
            value: "6379"
      containers:
      - name: sgl-worker
        image: gitlab.dhe.duke.edu:4567/sar52/sgl-portal/backend:devel-411291c6
        imagePullPolicy: Always
        command: ["rq"]
        args: ["worker", "--with-scheduler", "--url", "redis://$(REDIS_HOST):$(REDIS_PORT)"]
        volumeMounts:
          - name: backend-workspace
            mountPath: /portal_ws
        env:
          - name: REDIS_HOST
            value: data-store.braincerebellumdata.svc.cluster.local
          - name: REDIS_PORT
            value: "6379"
          - name: MARIADB_HOSTNAME
            value: data-store.braincerebellumdata.svc.cluster.local
          - name: WORKSPACE_DIR
            value: /portal_ws
          - name: APP_CONTAINER
            value: sgl-worker
          - name: AWS_REGION_NAME
            value: us-east-1
          - name: REPO_S3_BUCKET_NAME
            value: dh-braincerebellumdata-test
          - name: MARIADB_ROOT_PASSWORD
            valueFrom:
              secretKeyRef:
                name: sgl-secret
                key: mariadb_root_password
          - name: FLASK_SECRET_KEY
            valueFrom:
              secretKeyRef:
                name: sgl-secret
                key: flask_secret_key
          - name: JWT_SECRET_KEY
            valueFrom:
              secretKeyRef:
                name: sgl-secret
                key: jwt_secret_key
          - name: AWS_ACCESS_KEY_ID
            valueFrom:
              secretKeyRef:
                name: sgl-secret
                key: aws_access_key_id
          - name: AWS_ACCESS_KEY_SECRET
            valueFrom:
              secretKeyRef:
                name: sgl-secret
                key: aws_access_key_secret
        livenessProbe:
          exec:
            command: [ "sh", "-c", 'H=$(hostname); rq info -W | grep -c $H' ]
          initialDelaySeconds: 20
          periodSeconds: 5
        readinessProbe:
          exec:
            command: [ "sh", "-c", 'H=$(hostname); rq info -W | grep -c $H' ]
          initialDelaySeconds: 5
          periodSeconds: 5
      volumes:
        - name: backend-workspace
          persistentVolumeClaim:
            claimName: pvc-backend
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
  labels:
    app: sgl-portal
    tier: web
    deploy: web-pods
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sgl-portal
      tier: web
  template:
    metadata:
      labels:
        app: sgl-portal
        tier: web
    spec:
      imagePullSecrets:
      - name: sgl-portal-pull-secret
      initContainers:
      - name: wait-for-redis
        image: busybox
        command: ['sh', '-c', 'echo Waiting for Redis; until nc -z $(DATA_HOST) 6379; do printf "."; sleep 1; done; echo Ready!']
        env:
          - name: DATA_HOST
            value: data-store.braincerebellumdata.svc.cluster.local
      - name: wait-for-mariadb
        image: busybox
        command: ['sh', '-c', 'echo Waiting for MariaDB; until nc -z $(DATA_HOST) 3306; do printf "."; sleep 1; done; echo Ready!']
        env:
          - name: DATA_HOST
            value: data-store.braincerebellumdata.svc.cluster.local
      containers:
      - name: sgl-server
        image: gitlab.dhe.duke.edu:4567/sar52/sgl-portal/backend:devel-411291c6
        imagePullPolicy: Always
        ports:
          - containerPort: 8050
        volumeMounts:
          - name: backend-workspace
            mountPath: /portal_ws
        env:
          - name: REDIS_HOST
            value: data-store.braincerebellumdata.svc.cluster.local
          - name: REDIS_PORT
            value: "6379"
          - name: MARIADB_HOSTNAME
            value: data-store.braincerebellumdata.svc.cluster.local
          - name: WORKSPACE_DIR
            value: /portal_ws
          - name: APP_CONTAINER
            value: sgl-server
          - name: AWS_REGION_NAME
            value: us-east-1
          - name: REPO_S3_BUCKET_NAME
            value: dh-braincerebellumdata-test
          - name: MARIADB_ROOT_PASSWORD
            valueFrom:
              secretKeyRef:
                name: sgl-secret
                key: mariadb_root_password
          - name: FLASK_SECRET_KEY
            valueFrom:
              secretKeyRef:
                name: sgl-secret
                key: flask_secret_key
          - name: JWT_SECRET_KEY
            valueFrom:
              secretKeyRef:
                name: sgl-secret
                key: jwt_secret_key
          - name: AWS_ACCESS_KEY_ID
            valueFrom:
              secretKeyRef:
                name: sgl-secret
                key: aws_access_key_id
          - name: AWS_ACCESS_KEY_SECRET
            valueFrom:
              secretKeyRef:
                name: sgl-secret
                key: aws_access_key_secret
        livenessProbe:
          exec:
            command: [ "sh", "-c", 'pidof -x gunicorn' ]
          initialDelaySeconds: 20
          periodSeconds: 5
        readinessProbe:
          httpGet:
            port: 8050
            path: /
          initialDelaySeconds: 20
          periodSeconds: 5
      volumes:
        - name: backend-workspace
          persistentVolumeClaim:
            claimName: pvc-backend