Commit b1fd3e5

Update doc with validated instructions
1 parent 2ca671e commit b1fd3e5

1 file changed

Lines changed: 40 additions & 17 deletions


Container-Root/eks/deployment/distributed-training/pytorch/habana/deepspeed-bert/README.md

@@ -18,15 +18,17 @@ git clone https://github.com/aws-samples/aws-do-eks.git
 
 ### 0.2.2. Configure cluster
 
-change config values in .env
+Change the configuration in `eks.conf` to use yaml and set it to `eks-dl1.yaml`:
+```
+export CONFIG=yaml
+export EKS_YAML=./eks-dl1.yaml
+```
 
-change eks.conf
-export CONFIG=yaml
-export EKS_YAML=./eks-dl1.yaml
+The file `eks-dl1.yaml` describes the node groups that will be created for the cluster and the EC2 key pair that will be used on the instances. [Create an EC2 key pair](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html) or use one that you already have.
 
-change eks-dl1.yaml
-change or create ssh key:
-publicKeyName: DL1_Key
+```yaml
+publicKeyName: DL1_Key
+```
 
 ### 0.2.3. Build and run aws-do-eks container
 
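The configuration step can be sanity-checked by sourcing `eks.conf` and confirming the two exported values. This is a hedged sketch: it writes a stand-in `eks.conf` in a temp directory so it runs anywhere; in practice you would source the real file from the aws-do-eks repository root.

```shell
# Sketch: verify the two settings this commit asks for in eks.conf.
# The eks.conf written here is a stand-in; source the real one in practice.
tmpdir=$(mktemp -d)
cat > "$tmpdir/eks.conf" <<'EOF'
export CONFIG=yaml
export EKS_YAML=./eks-dl1.yaml
EOF
. "$tmpdir/eks.conf"
if [ "$CONFIG" = "yaml" ] && [ "$EKS_YAML" = "./eks-dl1.yaml" ]; then
  echo "eks.conf OK: CONFIG=$CONFIG EKS_YAML=$EKS_YAML"
else
  echo "eks.conf misconfigured" >&2
fi
rm -rf "$tmpdir"
```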
@@ -38,7 +40,7 @@ change eks-dl1.yaml
 ```
 
 ### 0.2.4. Create cluster
-
+Within the aws-do-eks container run:
 ```
 echo "AWS_PROFILE=$AWS_PROFILE"
 ./eks-create.sh
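The `echo "AWS_PROFILE=$AWS_PROFILE"` line exists to catch an unset profile before cluster creation starts. A sketch of that guard (the `default` fallback here is illustrative, not part of the repo):

```shell
# Sketch: fail fast if AWS_PROFILE is unset before running ./eks-create.sh.
# "default" is only an illustrative fallback for this example.
AWS_PROFILE="${AWS_PROFILE:-default}"
if [ -z "$AWS_PROFILE" ]; then
  echo "AWS_PROFILE is not set; export it before running ./eks-create.sh" >&2
else
  echo "AWS_PROFILE=$AWS_PROFILE"
fi
```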
@@ -59,8 +61,7 @@ cd /eks/deployment/csi/efs/
 ./deploy.sh
 kubectl apply -f ./efs-pvc.yaml
 ```
-** TODO: Improve deploy script to detect existing efs without mounting target in the eks cluster subnets.
-Consequently remove explicit call to ./efs-create.sh.
+
 
 ### 0.2.6. Deploy plugins and operators
 
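After `kubectl apply -f ./efs-pvc.yaml`, the claim should generally reach `Bound` before training pods are scheduled. The sketch below shows the check; since it cannot assume a live cluster, it parses a hypothetical sample of `kubectl get pvc` output, and the claim name `efs-pvc` is an assumption.

```shell
# Sketch: check that the EFS PVC is Bound. Against a live cluster you would use:
#   pvc_status=$(kubectl get pvc efs-pvc -o jsonpath='{.status.phase}')
# Here we parse a hypothetical captured output instead (claim name assumed).
sample_output="NAME      STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
efs-pvc   Bound    efs-pv   1Gi        RWX            efs-sc         2m"
pvc_status=$(printf '%s\n' "$sample_output" | awk '$1 == "efs-pvc" {print $2}')
if [ "$pvc_status" = "Bound" ]; then
  echo "PVC efs-pvc is Bound"
else
  echo "PVC efs-pvc not ready: $pvc_status" >&2
fi
```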

@@ -77,6 +78,7 @@ cd /eks/deployment/efa-device-plugin
 ```
 
 #### 0.2.6.3. Deploy Kubeflow mpi-operator
+```
 cd /eks/deployment/kubeflow/mpi-operator
 ./deploy.sh
 ```
@@ -86,14 +88,25 @@ cd /eks/deployment/kubeflow/mpi-operator
 ```
 cd /eks/deployment/distributed-training/pytorch/habana/deepspeed-bert
 ```
-adjust settings
+In the file `deepspeed-bert.yaml.template`, set the number of desired workers:
+
+```yaml
+Worker:
+  replicas: 2
+```
+In this case, the number of workers is the number of instances (nodes) you want to use for the training.
+
+#### Adjust training hyperparameters (Optional)
 
-Set the number of desired workers in
-deepspeed-bert.yaml.template
+You can change the DeepSpeed parameters (such as `train_batch_size` and `train_micro_batch_size_per_gpu`) in the file `scripts/deepspeed_config_bert_1.5b.json`.
 
-Set the training hyperparametrs (LR, batch sizes, etc.) in
-scripts/deepspeed_config_bert_1.5b.json
-scripts/launch_train.sh
+Other training parameters can be adjusted in the launch script `scripts/launch_train.sh`:
+
+```
+MAX_SEQ_LENGTH=128
+MAX_STEPS=155000
+LR=0.0015
+```
 
 ## 1. Build and push deep learning container
 
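The replica count in `deepspeed-bert.yaml.template` can also be set from a script instead of by hand. A minimal sketch, operating on a stand-in fragment of the template (the real file has many more fields), bumping `replicas` to 4:

```shell
# Sketch: set the worker replica count in a stand-in fragment of
# deepspeed-bert.yaml.template (the real template has more fields).
tmpl=$(mktemp)
cat > "$tmpl" <<'EOF'
Worker:
  replicas: 2
EOF
desired=4
# -i.bak keeps a backup and works with both GNU and BSD sed
sed -i.bak "s/^\(  replicas:\) .*/\1 $desired/" "$tmpl"
echo "Worker replicas now: $(awk '/replicas:/ {print $2}' "$tmpl")"
```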

@@ -104,6 +117,8 @@ scripts/launch_train.sh
 
 ## 2. Download data
 
+Before running the training, you first have to download and pre-process the dataset:
+
 ```
 ./2-1-data-download.sh
 ./2-2-data-status.sh
@@ -116,6 +131,8 @@ Downloading and pre-processing the data takes a long time (could be more than 24
 
 ### 3.1. Scale up DL1 nodes
 
+Once you have the data downloaded and pre-processed, you can prepare the environment for the training task.
+Choose the same number of nodes as the number of workers you set in the `deepspeed-bert.yaml.template` file.
 ```
 eksctl scale nodegroup --cluster=do-eks --nodes=2 --name=dl1
 ```
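Since `--nodes` must match the worker `replicas` value, the eksctl command can be derived from the template so the two cannot drift. A sketch (the template fragment is a stand-in; the derived command is printed, not executed):

```shell
# Sketch: derive the eksctl scale command from the worker replica count,
# so --nodes always matches deepspeed-bert.yaml.template.
tmpl=$(mktemp)
cat > "$tmpl" <<'EOF'
Worker:
  replicas: 2
EOF
nodes=$(awk '/replicas:/ {print $2}' "$tmpl")
cmd="eksctl scale nodegroup --cluster=do-eks --nodes=$nodes --name=dl1"
echo "$cmd"  # run this command against the live cluster
```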
@@ -135,10 +152,15 @@ ip-192-168-83-89.ec2.internal Ready <none> 1m23s v1.21.12-eks-5308cf7
 
 ### 3.2. Run training
 
+After the nodes are ready, you can run the training task:
+
 ```
 ./3-1-training-launch.sh
 watch ./3-2-training-status.sh
-# When pods are running, check logs:
+```
+
+When pods are running, you can check the logs:
+```
 ./3-3-training-logs.sh
 ```
 
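The "when pods are running" condition can be scripted rather than watched by eye. A hedged sketch: against a live cluster the status would come from `./3-2-training-status.sh` or `kubectl get pods`; here a hypothetical captured status line stands in, and the pod name is illustrative.

```shell
# Sketch: decide when to start tailing training logs.
# status_line is a hypothetical captured line; in practice it would come
# from ./3-2-training-status.sh or `kubectl get pods`.
status_line="deepspeed-bert-worker-0   1/1   Running   0   2m"
phase=$(printf '%s\n' "$status_line" | awk '{print $3}')
if [ "$phase" = "Running" ]; then
  echo "pods running; now run ./3-3-training-logs.sh"
fi
```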
@@ -180,3 +202,4 @@ cd /eks
 ./eks-delete.sh
 ```
 
+
