Commit b1fd3e5

Update doc with validated instructions
1 parent 2ca671e commit b1fd3e5

1 file changed

Lines changed: 40 additions & 17 deletions


Container-Root/eks/deployment/distributed-training/pytorch/habana/deepspeed-bert/README.md

@@ -18,15 +18,17 @@ git clone https://github.com/aws-samples/aws-do-eks.git
 
 ### 0.2.2. Configure cluster
 
-change config values in .env
+Change the configuration in `eks.conf` to use yaml and set it to `eks-dl1.yaml`:
+```
+export CONFIG=yaml
+export EKS_YAML=./eks-dl1.yaml
+```
 
-change eks.conf
-export CONFIG=yaml
-export EKS_YAML=./eks-dl1.yaml
+The file `eks-dl1.yaml` describes the node groups that will be created for the cluster and the EC2 key pair that will be used on the instances. [Create an EC2 key pair](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html) or use one that you already have.
 
-change eks-dl1.yaml
-change or create ssh key:
-publicKeyName: DL1_Key
+```yaml
+publicKeyName: DL1_Key
+```
 
 ### 0.2.3. Build and run aws-do-eks container
 
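The configuration step can be sanity-checked by sourcing `eks.conf` and confirming the two exported values. This is a hedged sketch: it writes a stand-in `eks.conf` in a temp directory so it runs anywhere; in practice you would source the real file from the aws-do-eks repository root.

```shell
# Sketch: verify the two settings this commit asks for in eks.conf.
# The eks.conf written here is a stand-in; source the real one in practice.
tmpdir=$(mktemp -d)
cat > "$tmpdir/eks.conf" <<'EOF'
export CONFIG=yaml
export EKS_YAML=./eks-dl1.yaml
EOF
. "$tmpdir/eks.conf"
if [ "$CONFIG" = "yaml" ] && [ "$EKS_YAML" = "./eks-dl1.yaml" ]; then
  echo "eks.conf OK: CONFIG=$CONFIG EKS_YAML=$EKS_YAML"
else
  echo "eks.conf misconfigured" >&2
fi
rm -rf "$tmpdir"
```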
@@ -38,7 +40,7 @@ change eks-dl1.yaml
 ```
 
 ### 0.2.4. Create cluster
-
+Within the aws-do-eks container run:
 ```
 echo "AWS_PROFILE=$AWS_PROFILE"
 ./eks-create.sh
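The `echo "AWS_PROFILE=$AWS_PROFILE"` line exists to catch an unset profile before cluster creation starts. A sketch of that guard (the `default` fallback here is illustrative, not part of the repo):

```shell
# Sketch: fail fast if AWS_PROFILE is unset before running ./eks-create.sh.
# "default" is only an illustrative fallback for this example.
AWS_PROFILE="${AWS_PROFILE:-default}"
if [ -z "$AWS_PROFILE" ]; then
  echo "AWS_PROFILE is not set; export it before running ./eks-create.sh" >&2
else
  echo "AWS_PROFILE=$AWS_PROFILE"
fi
```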
@@ -59,8 +61,7 @@ cd /eks/deployment/csi/efs/
 ./deploy.sh
 kubectl apply -f ./efs-pvc.yaml
 ```
-** TODO: Improve deploy script to detect existing efs without mounting target in the eks cluster subnets.
-Consequently remove explicit call to ./efs-create.sh.
+
 
 ### 0.2.6. Deploy plugins and operators
 
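After `kubectl apply -f ./efs-pvc.yaml`, the claim should generally reach `Bound` before training pods are scheduled. The sketch below shows the check; since it cannot assume a live cluster, it parses a hypothetical sample of `kubectl get pvc` output, and the claim name `efs-pvc` is an assumption.

```shell
# Sketch: check that the EFS PVC is Bound. Against a live cluster you would use:
#   pvc_status=$(kubectl get pvc efs-pvc -o jsonpath='{.status.phase}')
# Here we parse a hypothetical captured output instead (claim name assumed).
sample_output="NAME      STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
efs-pvc   Bound    efs-pv   1Gi        RWX            efs-sc         2m"
pvc_status=$(printf '%s\n' "$sample_output" | awk '$1 == "efs-pvc" {print $2}')
if [ "$pvc_status" = "Bound" ]; then
  echo "PVC efs-pvc is Bound"
else
  echo "PVC efs-pvc not ready: $pvc_status" >&2
fi
```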

@@ -77,6 +78,7 @@ cd /eks/deployment/efa-device-plugin
 ```
 
 #### 0.2.6.3. Deploy Kubeflow mpi-operator
+```
 cd /eks/deployment/kubeflow/mpi-operator
 ./deploy.sh
 ```
@@ -86,14 +88,25 @@ cd /eks/deployment/kubeflow/mpi-operator
 ```
 cd /eks/deployment/distributed-training/pytorch/habana/deepspeed-bert
 ```
-adjust settings
+In the file `deepspeed-bert.yaml.template`, set the number of desired workers:
+
+```yaml
+Worker:
+  replicas: 2
+```
+In this case, the number of workers is the number of instances (nodes) you want to use for the training.
+
+#### Adjust training hyperparameters (Optional)
 
-Set the number of desired workers in
-deepspeed-bert.yaml.template
+You can change the DeepSpeed parameters (such as `train_batch_size` and `train_micro_batch_size_per_gpu`) in the file `scripts/deepspeed_config_bert_1.5b.json`.
 
-Set the training hyperparametrs (LR, batch sizes, etc.) in
-scripts/deepspeed_config_bert_1.5b.json
-scripts/launch_train.sh
+Other training parameters can be adjusted in the launch script `scripts/launch_train.sh`:
+
+```
+MAX_SEQ_LENGTH=128
+MAX_STEPS=155000
+LR=0.0015
+```
 
 ## 1. Build and push deep learning container
 
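The replica count in `deepspeed-bert.yaml.template` can also be set from a script instead of by hand. A minimal sketch, operating on a stand-in fragment of the template (the real file has many more fields), bumping `replicas` to 4:

```shell
# Sketch: set the worker replica count in a stand-in fragment of
# deepspeed-bert.yaml.template (the real template has more fields).
tmpl=$(mktemp)
cat > "$tmpl" <<'EOF'
Worker:
  replicas: 2
EOF
desired=4
# -i.bak keeps a backup and works with both GNU and BSD sed
sed -i.bak "s/^\(  replicas:\) .*/\1 $desired/" "$tmpl"
echo "Worker replicas now: $(awk '/replicas:/ {print $2}' "$tmpl")"
```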

@@ -104,6 +117,8 @@ scripts/launch_train.sh
 
 ## 2. Download data
 
+Before running the training, you first have to download and pre-process the dataset:
+
 ```
 ./2-1-data-download.sh
 ./2-2-data-status.sh
@@ -116,6 +131,8 @@ Downloading and pre-processing the data takes a long time (could be more than 24
 
 ### 3.1. Scale up DL1 nodes
 
+Once you have the data downloaded and pre-processed, you can prepare the environment for the training task.
+Choose the same number of nodes as the number of workers you set in the `deepspeed-bert.yaml.template` file.
 ```
 eksctl scale nodegroup --cluster=do-eks --nodes=2 --name=dl1
 ```
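Since `--nodes` must match the worker `replicas` value, the eksctl command can be derived from the template so the two cannot drift. A sketch (the template fragment is a stand-in; the derived command is printed, not executed):

```shell
# Sketch: derive the eksctl scale command from the worker replica count,
# so --nodes always matches deepspeed-bert.yaml.template.
tmpl=$(mktemp)
cat > "$tmpl" <<'EOF'
Worker:
  replicas: 2
EOF
nodes=$(awk '/replicas:/ {print $2}' "$tmpl")
cmd="eksctl scale nodegroup --cluster=do-eks --nodes=$nodes --name=dl1"
echo "$cmd"  # run this command against the live cluster
```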
@@ -135,10 +152,15 @@ ip-192-168-83-89.ec2.internal Ready <none> 1m23s v1.21.12-eks-5308cf7
 
 ### 3.2. Run training
 
+After the nodes are ready, you can run the training task:
+
 ```
 ./3-1-training-launch.sh
 watch ./3-2-training-status.sh
-# When pods are running, check logs:
+```
+
+When pods are running, you can check the logs:
+```
 ./3-3-training-logs.sh
 ```
 
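The "when pods are running" condition can be scripted rather than watched by eye. A hedged sketch: against a live cluster the status would come from `./3-2-training-status.sh` or `kubectl get pods`; here a hypothetical captured status line stands in, and the pod name is illustrative.

```shell
# Sketch: decide when to start tailing training logs.
# status_line is a hypothetical captured line; in practice it would come
# from ./3-2-training-status.sh or `kubectl get pods`.
status_line="deepspeed-bert-worker-0   1/1   Running   0   2m"
phase=$(printf '%s\n' "$status_line" | awk '{print $3}')
if [ "$phase" = "Running" ]; then
  echo "pods running; now run ./3-3-training-logs.sh"
fi
```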
@@ -180,3 +202,4 @@ cd /eks
 ./eks-delete.sh
 ```
 
+
