- The Python environment in the image includes almost all of the required packages. To install any additional required packages, run

```bash
pip install -r requirements.txt
```
- Install the InfiniBand prerequisites:

```bash
bash prerequisite.sh
```
- Build the `shm_tensor_new_rdma` (for multinode) and `shm_tensor_new_rdma_pre_alloc` modules.

```bash
cd megatron/shm_tensor_new_rdma
pip install -e .
```
```bash
cd megatron/shm_tensor_new_rdma_pre_alloc
pip install -e .
```
### Run

#### Dataset Preparation

The dataset preparation step largely follows the Megatron framework.

First, prepare your dataset in the following `.json` format, with one sample per line:

```json
{"src": "bloomberg", "text": "BRIEF-Coach Inc launches tender offer to acquire Kate Spade & Co for $18.50 per share in cash. May 26 (Reuters) - Coach Inc: * Coach Inc launches tender offer to acquire Kate Spade & Company for $18.50 per share in cash * Coach Inc launches tender offer to acquire kate spade & company for $18.50 per share in cash * Coach Inc - tender offer will expire at 11:59 P.M. Edt on June 23, 2017, unless extended * Coach Inc - Chelsea Merger Sub Inc, has commenced a tender offer for all of outstanding shares of common stock, par value $1.00 per share, of Kate Spade & Company Source text for Eikon: Further company coverage: May 26 (Reuters) - Coach Inc: * Coach Inc launches tender offer to acquire Kate Spade & Company for $18.50 per share in cash * Coach Inc launches tender offer to acquire kate spade & company for $18.50 per share in cash * Coach Inc - tender offer will expire at 11:59 P.M. Edt on June 23, 2017, unless extended * Coach Inc - Chelsea Merger Sub Inc, has commenced a tender offer for all of outstanding shares of common stock, par value $1.00 per share, of Kate Spade & Company Source text for Eikon: Further company coverage:", "type": "Eng", "id": "0", "title": "BRIEF-Coach Inc launches tender offer to acquire Kate Spade & Co for $18.50 per share in cash. "}
{"src": "bloomberg", "text": "Var Energi agrees to buy Exxonmobil's Norway assets for$4.5 bln. MILAN, Sept 26 (Reuters) - Var Energi AS, the Norwegian oil and gas group 69.6% owned by Italian major Eni, has agreed to buy the Norwegian upstream assets of ExxonMobil for $4.5 billion. The deal is expected to be completedin the final quarter of this year, Var Energi said on Thursday. Reporting by Stephen Jewkes; editing by Francesca Landini MILAN, Sept 26 (Reuters) - Var Energi AS, the Norwegian oil and gas group 69.6% owned by Italian major Eni, has agreed to buy the Norwegian upstream assets of ExxonMobil for$4.5 billion. The deal is expected to be completedin the final quarter of this year, Var Energi said on Thursday. Reporting by Stephen Jewkes; editing by Francesca Landini", "type": "Eng", "id": "1", "title": "Var Energi agrees to buy Exxonmobil's Norway assets for $4.5 bln. "}
{"src": "bloomberg", "text": "Trump says 'incorrect' he is willing to meet Iran with 'no conditions'. WASHINGTON (Reuters) - U.S. President Donald Trump on Sunday appeared to play down the chances that he might be willing to meet with Iranian officials, saying reports that he would do so without conditions were not accurate. \u201cThe Fake News is saying that I am willing to meet with Iran, \u2018No Conditions.\u2019 That is an incorrect statement (as usual!),\u201d Trump said on Twitter. In fact, as recently as on Sept. 10, U.S. Secretary of State Mike Pompeo said \u201cHe (Trump) is prepared to meet with no preconditions.\u201d Reporting By Arshad Mohammed; Editing by Shri Navaratnam WASHINGTON (Reuters) - U.S. President Donald Trump on Sunday appeared to play down the chances that he might be willing to meet with Iranian officials, saying reports that he would do so without conditions were not accurate. \u201cThe Fake News is saying that I am willing to meet with Iran, \u2018No Conditions.\u2019 That is an incorrect statement (as usual!),\u201d Trump said on Twitter. In fact, as recently as on Sept. 10, U.S. Secretary of State Mike Pompeo said \u201cHe (Trump) is prepared to meet with no preconditions.\u201d Reporting By Arshad Mohammed; Editing by Shri Navaratnam", "type": "Eng", "id": "2", "title": "Trump says 'incorrect' he is willing to meet Iran with 'no conditions'. "}
```
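As a sanity check before preprocessing, a short script can verify that every line parses as JSON and carries the keys shown above (`validate_jsonl` is a hypothetical helper for illustration, not part of the repository):

```python
import json

# Keys present in each sample of the example dataset above
REQUIRED_KEYS = {"src", "text", "type", "id", "title"}

def validate_jsonl(path):
    """Raise ValueError if any line is not a JSON object with the expected keys."""
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            sample = json.loads(line)
            missing = REQUIRED_KEYS - sample.keys()
            if missing:
                raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
```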
Note that we provide a sample dataset under `datasets_gpt/` and `datasets_bert/`.
Then, prepare the vocab file (GPT and BERT) and the merges file (GPT only). These are provided in the respective directories.
For BERT, run the following:

```bash
cd datasets
python ../tools/preprocess_data.py \
    --input ../datasets_bert/dataset.json \
    --output-prefix bert \
    --vocab-file ../datasets_bert/vocab.txt \
    --tokenizer-type BertWordPieceLowerCase \
    --split-sentences \
    --workers $(nproc)
```

where the paths can be changed according to the location of your files and where you want the generated files to be placed.
For GPT, run the following:

```bash
cd datasets
python ../tools/preprocess_data.py \
    --input ../datasets_gpt/dataset.json \
    --output-prefix gpt \
    --vocab-file ../datasets_gpt/vocab.json \
    --tokenizer-type GPT2BPETokenizer \
    --merge-file ../datasets_gpt/merges.txt \
    --append-eod \
    --workers $(nproc)
```
For other models, please refer to `nvidia/megatron` for the corresponding datasets.

#### Single Node Distributed Training

To run distributed training on a single node, go to the project root directory and run

```bash
bash run_single_gpt.sh
```
for GPT, and

```bash
bash run_single_bert.sh
```
for BERT.

The `run_single_<model>.sh` scripts have the following structure:

- Parameters include `pipeline_parallel`, `model_chunks` and `tensor_parallel`.
- The `virtual_stage_layer` parameter sets how many layers there are in a single virtual pipeline stage. It is calculated as
$$
\frac{\text{total layers of the model}}{\text{pipeline parallel}\times\text{model chunks}}
$$
where the total layer count is set under `examples/` for the corresponding model.
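For instance (illustrative numbers, not taken from the scripts), a 16-layer model split with `pipeline_parallel=4` and `model_chunks=2` gives 16 / (4 × 2) = 2 layers per virtual stage:

```python
def virtual_stage_layers(total_layers, pipeline_parallel, model_chunks):
    """Layers per virtual pipeline stage; the division must be exact."""
    assert total_layers % (pipeline_parallel * model_chunks) == 0
    return total_layers // (pipeline_parallel * model_chunks)

print(virtual_stage_layers(16, 4, 2))  # → 2
```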
- It gets the IP address of the pod and writes it to the shell script.
- Finally, it runs the shell script for the corresponding model under `examples/`.

There are also several critical parameters in `examples/gpt3/train_gpt3_175b_distributed.sh` (for BERT, in the corresponding `bert/` directory):
- `--use-dpp` switches to the DPP algorithm.
- `--workload` specifies the workload of each single thread, and hence determines the number of threads used in P2P communication.
- `--num-gpus` specifies the number of GPUs on the current node (single-node training).
- Other critical parameters include the number of layers of the model (note that the value is currently 16 and hard-coded in `run_single_<model>.sh`, so you must modify `run_single_<model>.sh` as well when adjusting the layer count), the global batch size and the sequence length.
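Put together, the flags discussed above might appear in the training script roughly as follows (a sketch with illustrative values only; the authoritative argument list lives in `examples/gpt3/train_gpt3_175b_distributed.sh`):

```shell
# Illustrative excerpt, not the exact script contents
TRAINING_ARGS="
    --use-dpp \
    --workload 4 \
    --num-gpus 8 \
    --num-layers 16 \
    --global-batch-size 32 \
    --seq-length 2048
"
```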
For the remaining models, you can either directly run

```bash
bash examples/<model>/<train_file>.sh
```

or write a file similar to `run_{single,master,worker}_<model>.sh` that sets up the configuration and runs the shell script under `examples/`.
#### Multinode Distributed Training

To run distributed training on multiple nodes, go to the root directory. First run

```bash
bash run_master_<model>.sh
```
and then start another pod and run
```bash
bash run_worker_<model>.sh
```
The `run_master_<model>.sh` script does the following:

- Similar to `run_single_<model>.sh`, it takes the `pipeline_parallel`, `model_chunks` and `tensor_parallel` parameters.
- It writes the master pod IP to `examples/gpt3/train_gpt3_175b_distributed_master.sh` and to `train_gpt3_175b_distributed_worker.sh` (for BERT, in the corresponding directory).
- It sets the number of nodes to 2, with the master node having rank 0.
- It starts the shell script under `examples/`.
and `run_worker_<model>.sh` does the following:

- It sets the number of nodes to 2, with the worker node having rank 1.
- It starts the shell script under `examples/`.
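The node-count and rank setup in these two scripts typically reduces to a pair of launcher variables (a sketch assuming a torchrun-style launcher; the variable names are illustrative, not taken from the repository):

```shell
# run_master_<model>.sh (sketch): 2 nodes, master has rank 0
NNODES=2
NODE_RANK=0   # run_worker_<model>.sh would set NODE_RANK=1
```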
The `examples/gpt3/train_gpt3_175b_distributed_master.sh` and `examples/gpt3/train_gpt3_175b_distributed_worker.sh` scripts are similar to the single-node version, except that `--node-ips` is mandatory: it lists the InfiniBand IPs of the pods in the order of their GPU ranks. The `--multi-node` flag should also be turned on.
### Profiling

Each run generates a trace directory in `benchmark/`. Go to the `profiling` directory and run