Change the configuration in `eks.conf` to use a YAML cluster specification and point it to `eks-dl1.yaml`:
```
export CONFIG=yaml
export EKS_YAML=./eks-dl1.yaml
```
The file `eks-dl1.yaml` describes the node groups that will be created for the cluster and which EC2 key pair will be used for the instances. [Create an EC2 key pair](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html) or use one that you already have.
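If you need a new key pair, one way to create it is with the AWS CLI (a sketch; it assumes the AWS CLI is configured with sufficient permissions, and the key name `DL1_Key` is only an example that should match what you put in `eks-dl1.yaml`):

```shell
# Create the key pair and save the private key material locally
aws ec2 create-key-pair --key-name DL1_Key \
    --query 'KeyMaterial' --output text > DL1_Key.pem

# Restrict permissions so ssh will accept the key file
chmod 400 DL1_Key.pem
```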
Set `publicKeyName` in `eks-dl1.yaml` to the name of your key pair:

```yaml
publicKeyName: DL1_Key
```
### 0.2.3. Build and run aws-do-eks container
### 0.2.4. Create cluster

Within the aws-do-eks container run:

```
echo "AWS_PROFILE=$AWS_PROFILE"
./eks-create.sh
```
```
cd /eks/deployment/csi/efs/
./deploy.sh
kubectl apply -f ./efs-pvc.yaml
```

### 0.2.6. Deploy plugins and operators

```
cd /eks/deployment/efa-device-plugin
```

#### 0.2.6.3. Deploy Kubeflow mpi-operator

```
cd /eks/deployment/kubeflow/mpi-operator
./deploy.sh
```

```
cd /eks/deployment/distributed-training/pytorch/habana/deepspeed-bert
```

In the file `deepspeed-bert.yaml.template`, set the desired number of workers:

```yaml
Worker:
  replicas: 2
```

In this case, the number of workers is the number of instances (nodes) on which you want to run the training.
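If you prefer to script this step instead of editing the template by hand, the replica count can be set with a one-liner. A minimal sketch (it operates on a stand-in copy of the template under `/tmp` so the commands are self-contained; point `sed` at the real `deepspeed-bert.yaml.template` in practice):

```shell
# Stand-in for deepspeed-bert.yaml.template (the real file has more fields)
cat > /tmp/deepspeed-bert.yaml.template <<'EOF'
Worker:
  replicas: 2
EOF

# Set the desired number of workers (here: 4)
sed -E -i 's/(replicas:)[[:space:]]*[0-9]+/\1 4/' /tmp/deepspeed-bert.yaml.template

# Show the updated count
grep 'replicas:' /tmp/deepspeed-bert.yaml.template
```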
#### Adjust training hyperparameters (Optional)
You can change the DeepSpeed parameters (such as `train_batch_size` and `train_micro_batch_size_per_gpu`) in the file `scripts/deepspeed_config_bert_1.5b.json`.
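For orientation, the relevant entries in that JSON file look something like the following (the values shown here are placeholders for illustration, not the defaults shipped with this example):

```json
{
  "train_batch_size": 1024,
  "train_micro_batch_size_per_gpu": 8
}
```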

Other training parameters can be adjusted in the launch script `scripts/launch_train.sh`:

```
MAX_SEQ_LENGTH=128
MAX_STEPS=155000
LR=0.0015
```
## 1. Build and push deep learning container

## 2. Download data

Before running the training, you first have to download and pre-process the dataset:

```
./2-1-data-download.sh
./2-2-data-status.sh
```

Downloading and pre-processing the data takes a long time (could be more than 24 hours).

### 3.1. Scale up DL1 nodes

Once you have downloaded and pre-processed the data, you can prepare the environment for the training task.
Here, choose the same number of workers that you set in the `deepspeed-bert.yaml.template` file.
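Once the scale-up completes, you can check that the expected number of nodes is in the `Ready` state. The one-liner below is shown against sample output so the snippet is self-contained; in the cluster, pipe `kubectl get nodes --no-headers` into the same `grep` instead:

```shell
# Count nodes in Ready state; this should equal the Worker replica
# count set earlier. (Sample output stands in for kubectl here.)
sample='ip-192-168-1-10   Ready    <none>   5m   v1.25.9
ip-192-168-1-11   Ready    <none>   5m   v1.25.9'
echo "$sample" | grep -c ' Ready '    # → 2
```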