Commit a13f13d

Merge pull request #32 from OpenSQZ/DPP-Merged
Fix README.md
2 parents e41c1ee + 635305e

README.md: 1 file changed, 127 additions & 13 deletions
Similar support for visualization during the training process is provided as well.

## MegaDPP

### Environment Configuration

- The pod configuration is as follows:

```yaml
ContainerImage: ngc.nju.edu.cn/nvidia/pytorch:25.03-py3
GPU: RTX4090
NVMEStorage: 50G
Limits:
  CPU: 28
  memory: 100Gi
  GPU: 4
UseShm: true
ShmSize: 16Gi
UseIB: true
```

- The Python environment in the image already includes almost all of the required packages. To install the remaining ones, run

```bash
pip install -r requirements.txt
```

- Install the InfiniBand prerequisites:

```bash
bash prerequisite.sh
```
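
To verify that an InfiniBand device is visible inside the pod, a quick check with the standard `ibv_devinfo` utility (assuming the prerequisites installed the libibverbs tools) might look like:

```bash
# List the InfiniBand devices and their state (requires libibverbs utilities)
ibv_devinfo | head -n 20
```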

- Build the `shm_tensor_new_rdma` (for multinode) and `shm_tensor_new_rdma_pre_alloc` modules:

```bash
cd megatron/shm_tensor_new_rdma
pip install -e .
```

```bash
cd megatron/shm_tensor_new_rdma_pre_alloc
pip install -e .
```
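
As a quick sanity check that the extensions built correctly, you can try importing them; the module names below are assumed to match the package directories:

```bash
# Hypothetical import check; adjust the names if the packages export different modules
python -c "import shm_tensor_new_rdma, shm_tensor_new_rdma_pre_alloc; print('RDMA extensions OK')"
```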

### Run

#### Dataset Preparation

The dataset preparation step largely follows the Megatron framework.

First, prepare your dataset in the following `.json` format, with one sample per line:

```json
{"src": "bloomberg", "text": "BRIEF-Coach Inc launches tender offer to acquire Kate Spade & Co for $18.50 per share in cash. May 26 (Reuters) - Coach Inc: * Coach Inc launches tender offer to acquire Kate Spade & Company for $18.50 per share in cash * Coach Inc launches tender offer to acquire kate spade & company for $18.50 per share in cash * Coach Inc - tender offer will expire at 11:59 P.M. Edt on June 23, 2017, unless extended * Coach Inc - Chelsea Merger Sub Inc, has commenced a tender offer for all of outstanding shares of common stock, par value $1.00 per share, of Kate Spade & Company Source text for Eikon: Further company coverage: May 26 (Reuters) - Coach Inc: * Coach Inc launches tender offer to acquire Kate Spade & Company for $18.50 per share in cash * Coach Inc launches tender offer to acquire kate spade & company for $18.50 per share in cash * Coach Inc - tender offer will expire at 11:59 P.M. Edt on June 23, 2017, unless extended * Coach Inc - Chelsea Merger Sub Inc, has commenced a tender offer for all of outstanding shares of common stock, par value $1.00 per share, of Kate Spade & Company Source text for Eikon: Further company coverage:", "type": "Eng", "id": "0", "title": "BRIEF-Coach Inc launches tender offer to acquire Kate Spade & Co for $18.50 per share in cash. "}
{"src": "bloomberg", "text": "Var Energi agrees to buy Exxonmobil's Norway assets for $4.5 bln. MILAN, Sept 26 (Reuters) - Var Energi AS, the Norwegian oil and gas group 69.6% owned by Italian major Eni, has agreed to buy the Norwegian upstream assets of ExxonMobil for $4.5 billion. The deal is expected to be completed in the final quarter of this year, Var Energi said on Thursday. Reporting by Stephen Jewkes; editing by Francesca Landini MILAN, Sept 26 (Reuters) - Var Energi AS, the Norwegian oil and gas group 69.6% owned by Italian major Eni, has agreed to buy the Norwegian upstream assets of ExxonMobil for $4.5 billion. The deal is expected to be completed in the final quarter of this year, Var Energi said on Thursday. Reporting by Stephen Jewkes; editing by Francesca Landini", "type": "Eng", "id": "1", "title": "Var Energi agrees to buy Exxonmobil's Norway assets for $4.5 bln. "}
{"src": "bloomberg", "text": "Trump says 'incorrect' he is willing to meet Iran with 'no conditions'. WASHINGTON (Reuters) - U.S. President Donald Trump on Sunday appeared to play down the chances that he might be willing to meet with Iranian officials, saying reports that he would do so without conditions were not accurate. \u201cThe Fake News is saying that I am willing to meet with Iran, \u2018No Conditions.\u2019 That is an incorrect statement (as usual!),\u201d Trump said on Twitter. In fact, as recently as on Sept. 10, U.S. Secretary of State Mike Pompeo said \u201cHe (Trump) is prepared to meet with no preconditions.\u201d Reporting By Arshad Mohammed; Editing by Shri Navaratnam WASHINGTON (Reuters) - U.S. President Donald Trump on Sunday appeared to play down the chances that he might be willing to meet with Iranian officials, saying reports that he would do so without conditions were not accurate. \u201cThe Fake News is saying that I am willing to meet with Iran, \u2018No Conditions.\u2019 That is an incorrect statement (as usual!),\u201d Trump said on Twitter. In fact, as recently as on Sept. 10, U.S. Secretary of State Mike Pompeo said \u201cHe (Trump) is prepared to meet with no preconditions.\u201d Reporting By Arshad Mohammed; Editing by Shri Navaratnam", "type": "Eng", "id": "2", "title": "Trump says 'incorrect' he is willing to meet Iran with 'no conditions'. "}
```

Note that we have provided a sample dataset under `datasets_gpt/` and `datasets_bert/`.
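
For instance, a toy file in the same one-object-per-line format (with made-up content) can be created like so:

```bash
# Create a tiny toy dataset; every line is a standalone JSON object
cat > my_dataset.json <<'EOF'
{"src": "toy", "text": "Hello world. This is a tiny sample document.", "type": "Eng", "id": "0", "title": "Hello world"}
{"src": "toy", "text": "A second sample document used to test preprocessing.", "type": "Eng", "id": "1", "title": "Second sample"}
EOF
```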

Then, prepare the vocab file (GPT and BERT) and the merges file (GPT only). We have provided them in the respective directories.

For BERT, run the following:

```bash
cd datasets
python ../tools/preprocess_data.py \
    --input ../datasets_bert/dataset.json \
    --output-prefix bert \
    --vocab-file ../datasets_bert/vocab.txt \
    --tokenizer-type BertWordPieceLowerCase \
    --split-sentences \
    --workers $(nproc)
```

where the paths can be adjusted to match the location of your files and the directory where you want the generated files to be placed.

For GPT, run the following:

```bash
cd datasets
python ../tools/preprocess_data.py \
    --input ../datasets_gpt/dataset.json \
    --output-prefix gpt \
    --vocab-file ../datasets_gpt/vocab.json \
    --tokenizer-type GPT2BPETokenizer \
    --merge-file ../datasets_gpt/merges.txt \
    --append-eod \
    --workers $(nproc)
```
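
If preprocessing succeeds, it emits a binary data/index pair named after the output prefix. A quick check might look like this (the exact filenames follow Megatron's usual naming convention and are an assumption here):

```bash
# Inspect the generated files; with these settings Megatron typically produces
# gpt_text_document.bin/.idx (and bert_text_sentence.bin/.idx for the BERT run)
ls -lh *.bin *.idx
```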

For other models, please refer to `nvidia/megatron` for the corresponding datasets.

#### Single Node Distributed Training

To run distributed training on a single node, go to the project root directory and run

```bash
bash run_single_gpt.sh
```

for GPT, and

```bash
bash run_single_bert.sh
```

for BERT.

The `run_single_<model>.sh` files have the following structure:

- Parameters include `pipeline_parallel`, `model_chunks`, and `tensor_parallel`.
- The `virtual_stage_layer` parameter sets how many layers there are in a single virtual pipeline stage; see the arithmetic sketch after this list. It is calculated as
$$
\frac{\text{total layers of the model}}{\text{pipeline parallel}\times\text{model chunks}}
$$
where the total number of layers is set in the corresponding model's script under `examples/`.
- It gets the IP address of the pod and writes it into the shell script.
- Finally, it runs the shell script for the corresponding model under `examples/`.
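
As a concrete check of the formula (a sketch using the 16-layer default mentioned below and assumed parallelism settings):

```bash
# Layers per virtual pipeline stage: total_layers / (pipeline_parallel * model_chunks)
total_layers=16
pipeline_parallel=2
model_chunks=2
virtual_stage_layer=$(( total_layers / (pipeline_parallel * model_chunks) ))
echo "virtual_stage_layer = ${virtual_stage_layer}"   # prints 4
```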

There are also several critical parameters in `examples/gpt3/train_gpt3_175b_distributed.sh` (for BERT, the script under the corresponding `bert/` directory):

- `--use-dpp` switches to the DPP algorithm.
- `--workload` specifies the workload of each single thread, and hence determines the number of threads used in P2P communication.
- `--num-gpus` specifies the number of GPUs on the current node (single-node training).
- Other critical parameters include the number of layers of the model (currently hard-coded to 16 in `run_single_<model>.sh`; if you adjust the layer count, update `run_single_<model>.sh` accordingly), the global batch size, and the sequence length. An illustrative sketch of these flags follows this list.
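
For illustration only, the DPP-related flags might be grouped in the training script roughly as follows (the variable name and flag values are placeholders; only the flags themselves come from the list above):

```bash
# Hypothetical grouping of the DPP-related flags; values are placeholders
DPP_ARGS="--use-dpp --workload 4 --num-gpus 4"
```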

For the remaining models, you can either directly run

```bash
bash examples/<model>/<train_file>.sh
```

or write a file similar to `run_{single,master,worker}_<model>.sh` that sets up the configuration and runs the script under `examples/`; a skeleton of such a launcher is sketched below.
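
A minimal skeleton of such a custom launcher, assuming the parameters described above (the `sed` pattern, the IP detection, and the script paths are illustrative assumptions, not the repository's actual mechanism):

```bash
#!/bin/bash
# Hypothetical skeleton of a run_single_<model>.sh-style launcher
pipeline_parallel=2
model_chunks=2
tensor_parallel=1

# Obtain this pod's IP address and write it into the example training script
# (the variable being patched is an assumption)
NODE_IP=$(hostname -I | awk '{print $1}')
sed -i "s/^MASTER_ADDR=.*/MASTER_ADDR=${NODE_IP}/" examples/my_model/train_my_model.sh

bash examples/my_model/train_my_model.sh
```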

#### Multinode Distributed Training

To run distributed training on multiple nodes, go to the root directory. First run

```bash
bash run_master_<model>.sh
```

and then start another pod and run

```bash
bash run_worker_<model>.sh
```

The `run_master_<model>.sh` script does the following:

- Similar to `run_single_<model>.sh`, it takes the `pipeline_parallel`, `model_chunks`, and `tensor_parallel` parameters.
- It writes the master pod IP to `examples/gpt3/train_gpt3_175b_distributed_master.sh` and to `train_gpt3_175b_distributed_worker.sh` (for BERT, the scripts in the corresponding directory).
- It sets the number of nodes to 2, with the master node having rank 0.
- It starts the script under `examples/`.

and `run_worker_<model>.sh` does the following:

- It sets the number of nodes to 2, with the worker node having rank 1.
- It starts the script under `examples/`.

The `examples/gpt3/train_gpt3_175b_distributed_master.sh` and `examples/gpt3/train_gpt3_175b_distributed_worker.sh` scripts are similar to the single-node version, except that `--node-ips` is mandatory: it must list the InfiniBand IPs of the pods in the order of their GPU ranks. The `--multi-node` flag should also be turned on; an example of these flags is sketched below.
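
For example, the multinode-specific flags might look like this (a sketch; the IP addresses are placeholders and the list separator is an assumption):

```bash
# Placeholder InfiniBand IPs, listed in GPU-rank order (master first)
MULTI_NODE_ARGS="--multi-node --node-ips 10.0.0.1,10.0.0.2"
```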

### Profiling

Each run will generate a trace directory in `benchmark`. Go to the `profiling` directory and run

```bash
python aggregate.py --benchmark_dir benchmark/your-benchmark-dir
```