- The Python environment in the image already includes most of the required packages. To install any additional packages, run
```bash
pip install -r requirements.txt
```
- Install InfiniBand prerequisites
```bash
bash prerequisite.sh
```
- Build the `shm_tensor_new_rdma` (for multinode) and `shm_tensor_new_rdma_pre_alloc` modules.
```bash
cd megatron/shm_tensor_new_rdma
pip install -e .
```
```bash
cd megatron/shm_tensor_new_rdma_pre_alloc
pip install -e .
```
## Run
### Dataset Preparation
The dataset preparation step largely follows the Megatron framework.
First, prepare your dataset in the following `.json` format, with one sample per line:
```json
{"src": "bloomberg", "text": "BRIEF-Coach Inc launches tender offer to acquire Kate Spade & Co for $18.50 per share in cash. May 26 (Reuters) - Coach Inc: * Coach Inc launches tender offer to acquire Kate Spade & Company for $18.50 per share in cash * Coach Inc launches tender offer to acquire kate spade & company for $18.50 per share in cash * Coach Inc - tender offer will expire at 11:59 P.M. Edt on June 23, 2017, unless extended * Coach Inc - Chelsea Merger Sub Inc, has commenced a tender offer for all of outstanding shares of common stock, par value $1.00 per share, of Kate Spade & Company Source text for Eikon: Further company coverage: May 26 (Reuters) - Coach Inc: * Coach Inc launches tender offer to acquire Kate Spade & Company for $18.50 per share in cash * Coach Inc launches tender offer to acquire kate spade & company for $18.50 per share in cash * Coach Inc - tender offer will expire at 11:59 P.M. Edt on June 23, 2017, unless extended * Coach Inc - Chelsea Merger Sub Inc, has commenced a tender offer for all of outstanding shares of common stock, par value $1.00 per share, of Kate Spade & Company Source text for Eikon: Further company coverage:", "type": "Eng", "id": "0", "title": "BRIEF-Coach Inc launches tender offer to acquire Kate Spade & Co for $18.50 per share in cash. "}
{"src": "bloomberg", "text": "Var Energi agrees to buy Exxonmobil's Norway assets for$4.5 bln. MILAN, Sept 26 (Reuters) - Var Energi AS, the Norwegian oil and gas group 69.6% owned by Italian major Eni, has agreed to buy the Norwegian upstream assets of ExxonMobil for $4.5 billion. The deal is expected to be completedin the final quarter of this year, Var Energi said on Thursday. Reporting by Stephen Jewkes; editing by Francesca Landini MILAN, Sept 26 (Reuters) - Var Energi AS, the Norwegian oil and gas group 69.6% owned by Italian major Eni, has agreed to buy the Norwegian upstream assets of ExxonMobil for$4.5 billion. The deal is expected to be completedin the final quarter of this year, Var Energi said on Thursday. Reporting by Stephen Jewkes; editing by Francesca Landini", "type": "Eng", "id": "1", "title": "Var Energi agrees to buy Exxonmobil's Norway assets for $4.5 bln. "}
{"src": "bloomberg", "text": "Trump says 'incorrect' he is willing to meet Iran with 'no conditions'. WASHINGTON (Reuters) - U.S. President Donald Trump on Sunday appeared to play down the chances that he might be willing to meet with Iranian officials, saying reports that he would do so without conditions were not accurate. \u201cThe Fake News is saying that I am willing to meet with Iran, \u2018No Conditions.\u2019 That is an incorrect statement (as usual!),\u201d Trump said on Twitter. In fact, as recently as on Sept. 10, U.S. Secretary of State Mike Pompeo said \u201cHe (Trump) is prepared to meet with no preconditions.\u201d Reporting By Arshad Mohammed; Editing by Shri Navaratnam WASHINGTON (Reuters) - U.S. President Donald Trump on Sunday appeared to play down the chances that he might be willing to meet with Iranian officials, saying reports that he would do so without conditions were not accurate. \u201cThe Fake News is saying that I am willing to meet with Iran, \u2018No Conditions.\u2019 That is an incorrect statement (as usual!),\u201d Trump said on Twitter. In fact, as recently as on Sept. 10, U.S. Secretary of State Mike Pompeo said \u201cHe (Trump) is prepared to meet with no preconditions.\u201d Reporting By Arshad Mohammed; Editing by Shri Navaratnam", "type": "Eng", "id": "2", "title": "Trump says 'incorrect' he is willing to meet Iran with 'no conditions'. "}
```
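For a quick smoke test, you can also generate a tiny dataset in this format yourself. The `datasets_demo/` path and field values below are placeholders for illustration, not files shipped with this repo:

```shell
# Create a minimal one-sample-per-line JSON dataset (placeholder content).
mkdir -p datasets_demo
cat > datasets_demo/dataset.json <<'EOF'
{"src": "demo", "text": "A tiny sample document for smoke testing.", "type": "Eng", "id": "0", "title": "Sample 0"}
{"src": "demo", "text": "A second sample document.", "type": "Eng", "id": "1", "title": "Sample 1"}
EOF

# Every line must parse as a standalone JSON object.
python3 -c "import json; [json.loads(l) for l in open('datasets_demo/dataset.json')]" && echo "valid JSONL"
```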
Note that we have provided a sample dataset under `datasets_gpt/` and `datasets_bert/`.
Then, prepare the vocab file (GPT and BERT) and the merges file (GPT only); these are provided in the respective directories.
For BERT, run the following:
```bash
cd datasets
python ../tools/preprocess_data.py \
--input ../datasets_bert/dataset.json \
--output-prefix bert \
--vocab-file ../datasets_bert/vocab.txt \
--tokenizer-type BertWordPieceLowerCase \
--split-sentences \
--workers $(nproc)
```
where the paths can be changed according to the location of your files and where you want the generated files to go.
For GPT, run the following:
```bash
cd datasets
python ../tools/preprocess_data.py \
--input ../datasets_gpt/dataset.json \
--output-prefix gpt \
--vocab-file ../datasets_gpt/vocab.json \
--tokenizer-type GPT2BPETokenizer \
--merge-file ../datasets_gpt/merges.txt \
--append-eod \
--workers $(nproc)
```
For other models, please refer to `nvidia/megatron` for the corresponding datasets.
### Single Node Distributed Training
To run distributed training on a single node, go to the project root directory and run
```bash
bash run_single_gpt.sh
```
for GPT and
```bash
bash run_single_bert.sh
```
for BERT.
The `run_single_<model>.sh` files have the following structure:
- Parameters include `pipeline_parallel`, `model_chunks`, and `tensor_parallel`
- The `virtual_stage_layer` parameter sets how many layers there are in a single virtual pipeline stage. It is calculated as
$$
\frac{\text{total layers of the model}}{\text{pipeline parallel}\times\text{model chunks}}
$$
where the total number of layers is set in the corresponding model's script under `examples/`.
- It gets the IP address of the pod and writes it to the shell script.
- Finally, it runs the shell script for the corresponding model under `examples/`
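The `virtual_stage_layer` formula above can be checked with quick shell arithmetic. The 16-layer total matches the static default mentioned below; the parallelism values here are purely illustrative:

```shell
# virtual_stage_layer = total_layers / (pipeline_parallel * model_chunks)
total_layers=16        # static default in run_single_<model>.sh
pipeline_parallel=2    # illustrative value
model_chunks=2         # illustrative value
echo $(( total_layers / (pipeline_parallel * model_chunks) ))   # -> 4
```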
There are also several critical parameters in `examples/gpt3/train_gpt3_175b_distributed.sh` (for BERT, in the corresponding `bert/` directory):
- `--use-dpp` switches to the DPP algorithm (remove it to use the original pipeline algorithm)
- `--workload` specifies the workload of each thread, and hence determines the number of threads used in P2P communication
- `--num-gpus` specifies the number of GPUs on the current node (single-node training)
- Other critical parameters include the number of layers of the model (note that the value is currently hard-coded to 16 in `run_single_<model>.sh`, so that script must also be modified when adjusting the number of layers), the global batch size, and the sequence length
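For orientation, these flags might be combined in the training script roughly as follows. The `TRAINING_ARGS` variable name comes from the script, but the specific values below are placeholders, not the repo's defaults:

```shell
# Hypothetical excerpt of TRAINING_ARGS; check
# examples/gpt3/train_gpt3_175b_distributed.sh for the actual defaults.
TRAINING_ARGS="
    --use-dpp
    --workload 4
    --num-gpus 8
    --num-layers 16
    --global-batch-size 32
    --seq-length 1024
"
echo "$TRAINING_ARGS"
```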
For the remaining models, you can either directly run
```bash
bash examples/<model>/<train_file>.sh
```
or write a file similar to `run_{single,master,worker}_<model>.sh` that sets up the configuration and runs the script under `examples/`.
### Multinode Distributed Training
To run distributed training on multiple nodes, go to the root directory. First run
```bash
386
+
bash run_master_<model>.sh
```
and then start another pod and run
```bash
bash run_worker_<model>.sh
```
The `run_master_<model>.sh` script does the following:
- Similar to `run_single_<model>.sh`, it takes the `pipeline_parallel`, `model_chunks`, and `tensor_parallel` parameters
- It writes the master pod IP to `examples/gpt3/train_gpt3_175b_distributed_master.sh` and `train_gpt3_175b_distributed_worker.sh` (for BERT, the scripts in the corresponding directory)
- It sets the number of nodes to 2, with the master node having rank 0
- It starts the script under `examples/`
and `run_worker_<model>.sh` does the following:
- It sets the number of nodes to 2, with the worker node having rank 1
- It starts the script under `examples/`
The `examples/gpt3/train_gpt3_175b_distributed_master.sh` and `examples/gpt3/train_gpt3_175b_distributed_worker.sh` scripts are similar to the single-node version, except that `--node-ips` is mandatory: it takes the InfiniBand IPs of the pods in the order of their GPU ranks. The `--multi-node` flag must also be turned on.
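For example, with the master pod's InfiniBand interface at a hypothetical 10.0.0.1 and the worker's at 10.0.0.2, the extra flags would look roughly like this (the comma-separated list format is an assumption; check the script's argument parsing):

```shell
# Hypothetical InfiniBand IPs, listed in GPU-rank order (master first).
EXTRA_ARGS="
    --multi-node
    --node-ips 10.0.0.1,10.0.0.2
"
echo "$EXTRA_ARGS"
```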
### Profiling
Each run will generate a trace directory in `benchmark/`. Go to the `profiling` directory and run