Commit 5e2925b: Update README.md
1 parent c3241be · 1 file changed: README.md (128 additions, 14 deletions)

## MegaDPP

## Environment Configuration

- The following is the pod configuration:

```yaml
ContainerImage: ngc.nju.edu.cn/nvidia/pytorch:25.03-py3
GPU: RTX4090
NVMEStorage: 50G
Limits:
  CPU: 28
  memory: 100Gi
  GPU: 4
UseShm: true
ShmSize: 16Gi
UseIB: true
```

- The Python environment in the image already includes almost all of the required packages. To install the remaining ones, run

```bash
pip install -r requirements.txt
```

- Install the InfiniBand prerequisites:

```bash
bash prerequisite.sh
```

- Build the `shm_tensor_new_rdma` (for multinode) and `shm_tensor_new_rdma_pre_alloc` modules:

```bash
cd megatron/shm_tensor_new_rdma
pip install -e .
```

```bash
cd megatron/shm_tensor_new_rdma_pre_alloc
pip install -e .
```

## Run

### Dataset Preparation

Dataset preparation largely follows the Megatron framework.

First, prepare your dataset in the following `.json` format, with one sample per line:

```json
{"src": "bloomberg", "text": "BRIEF-Coach Inc launches tender offer to acquire Kate Spade & Co for $18.50 per share in cash. May 26 (Reuters) - Coach Inc: * Coach Inc launches tender offer to acquire Kate Spade & Company for $18.50 per share in cash * Coach Inc launches tender offer to acquire kate spade & company for $18.50 per share in cash * Coach Inc - tender offer will expire at 11:59 P.M. Edt on June 23, 2017, unless extended * Coach Inc - Chelsea Merger Sub Inc, has commenced a tender offer for all of outstanding shares of common stock, par value $1.00 per share, of Kate Spade & Company Source text for Eikon: Further company coverage: May 26 (Reuters) - Coach Inc: * Coach Inc launches tender offer to acquire Kate Spade & Company for $18.50 per share in cash * Coach Inc launches tender offer to acquire kate spade & company for $18.50 per share in cash * Coach Inc - tender offer will expire at 11:59 P.M. Edt on June 23, 2017, unless extended * Coach Inc - Chelsea Merger Sub Inc, has commenced a tender offer for all of outstanding shares of common stock, par value $1.00 per share, of Kate Spade & Company Source text for Eikon: Further company coverage:", "type": "Eng", "id": "0", "title": "BRIEF-Coach Inc launches tender offer to acquire Kate Spade & Co for $18.50 per share in cash. "}
{"src": "bloomberg", "text": "Var Energi agrees to buy Exxonmobil's Norway assets for $4.5 bln. MILAN, Sept 26 (Reuters) - Var Energi AS, the Norwegian oil and gas group 69.6% owned by Italian major Eni, has agreed to buy the Norwegian upstream assets of ExxonMobil for $4.5 billion. The deal is expected to be completed in the final quarter of this year, Var Energi said on Thursday. Reporting by Stephen Jewkes; editing by Francesca Landini MILAN, Sept 26 (Reuters) - Var Energi AS, the Norwegian oil and gas group 69.6% owned by Italian major Eni, has agreed to buy the Norwegian upstream assets of ExxonMobil for $4.5 billion. The deal is expected to be completed in the final quarter of this year, Var Energi said on Thursday. Reporting by Stephen Jewkes; editing by Francesca Landini", "type": "Eng", "id": "1", "title": "Var Energi agrees to buy Exxonmobil's Norway assets for $4.5 bln. "}
{"src": "bloomberg", "text": "Trump says 'incorrect' he is willing to meet Iran with 'no conditions'. WASHINGTON (Reuters) - U.S. President Donald Trump on Sunday appeared to play down the chances that he might be willing to meet with Iranian officials, saying reports that he would do so without conditions were not accurate. \u201cThe Fake News is saying that I am willing to meet with Iran, \u2018No Conditions.\u2019 That is an incorrect statement (as usual!),\u201d Trump said on Twitter. In fact, as recently as on Sept. 10, U.S. Secretary of State Mike Pompeo said \u201cHe (Trump) is prepared to meet with no preconditions.\u201d Reporting By Arshad Mohammed; Editing by Shri Navaratnam WASHINGTON (Reuters) - U.S. President Donald Trump on Sunday appeared to play down the chances that he might be willing to meet with Iranian officials, saying reports that he would do so without conditions were not accurate. \u201cThe Fake News is saying that I am willing to meet with Iran, \u2018No Conditions.\u2019 That is an incorrect statement (as usual!),\u201d Trump said on Twitter. In fact, as recently as on Sept. 10, U.S. Secretary of State Mike Pompeo said \u201cHe (Trump) is prepared to meet with no preconditions.\u201d Reporting By Arshad Mohammed; Editing by Shri Navaratnam", "type": "Eng", "id": "2", "title": "Trump says 'incorrect' he is willing to meet Iran with 'no conditions'. "}
```

Note that we have provided a sample dataset under `datasets_gpt/` and `datasets_bert/`.
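
As a quick sanity check before preprocessing (a minimal sketch, not part of the repo), you can verify that each line of your dataset parses as JSON and carries the fields shown in the sample above. The exact set of required keys depends on what `preprocess_data.py` is configured to read, so treat `REQUIRED_KEYS` here as an assumption mirroring the sample:

```python
import json

# Assumed keys, taken from the sample records above; adjust to match
# the keys your preprocessing configuration actually reads.
REQUIRED_KEYS = {"src", "text", "type", "id", "title"}

def validate_jsonl(lines):
    """Return the number of valid samples; raise ValueError on a bad line."""
    count = 0
    for i, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        sample = json.loads(line)  # raises on malformed JSON
        missing = REQUIRED_KEYS - sample.keys()
        if missing:
            raise ValueError(f"line {i}: missing keys {sorted(missing)}")
        count += 1
    return count

demo = ['{"src": "bloomberg", "text": "sample", "type": "Eng", "id": "0", "title": "t"}']
print(validate_jsonl(demo))  # 1
```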

Then, prepare the vocab file (GPT and BERT) and the merges file (GPT only). We have provided these in the respective directories.

For BERT, run the following:

```bash
cd datasets
python ../tools/preprocess_data.py \
    --input ../datasets_bert/dataset.json \
    --output-prefix bert \
    --vocab-file ../datasets_bert/vocab.txt \
    --tokenizer-type BertWordPieceLowerCase \
    --split-sentences \
    --workers $(nproc)
```

where the paths can be adjusted to match the location of your files and where you want the generated files to be placed.

For GPT, run the following:

```bash
cd datasets
python ../tools/preprocess_data.py \
    --input ../datasets_gpt/dataset.json \
    --output-prefix gpt \
    --vocab-file ../datasets_gpt/vocab.json \
    --tokenizer-type GPT2BPETokenizer \
    --merge-file ../datasets_gpt/merges.txt \
    --append-eod \
    --workers $(nproc)
```

For other models, please refer to `nvidia/megatron` for the corresponding datasets.

### Single Node Distributed Training

To run distributed training on a single node, go to the project root directory and run

```bash
bash run_single_gpt.sh
```

for GPT, and

```bash
bash run_single_bert.sh
```

for BERT.

The `run_single_<model>.sh` files have the following structure:

- Parameters include `pipeline_parallel`, `model_chunks`, and `tensor_parallel`.
- The `virtual_stage_layer` parameter sets how many layers there are in a single virtual pipeline stage. It is calculated as
$$
\frac{\text{total layers of model}}{\text{pipeline parallel}\times\text{model chunks}}
$$
where the total layer count is set under `examples/` for the corresponding model.
- It gets the IP address of the pod and writes it into the shell script.
- Finally, it runs the shell script for the corresponding model under `examples/`.
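
As a worked example of the formula above (the parallelism values here are illustrative; the real ones live in `run_single_<model>.sh`):

```python
def virtual_stage_layers(total_layers: int, pipeline_parallel: int, model_chunks: int) -> int:
    """virtual_stage_layer = total_layers / (pipeline_parallel * model_chunks).

    The division must be exact: every virtual pipeline stage holds the
    same number of transformer layers.
    """
    stages = pipeline_parallel * model_chunks
    if total_layers % stages != 0:
        raise ValueError(
            f"total_layers={total_layers} is not divisible by "
            f"pipeline_parallel * model_chunks = {stages}"
        )
    return total_layers // stages

# With the 16-layer default mentioned below and, say, pipeline_parallel=2,
# model_chunks=2 (illustrative values), each virtual stage gets 4 layers.
print(virtual_stage_layers(16, 2, 2))  # 4
```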

There are also several critical parameters in `examples/gpt3/train_gpt3_175b_distributed.sh` (for BERT, in the corresponding `bert/` directory):

- `--use-dpp` switches to the DPP algorithm.
- `--workload` specifies the workload of each single thread, and hence determines the number of threads used in P2P communication.
- `--num-gpus` specifies the number of GPUs on the current node (single-node training).
- Other critical parameters include the number of layers of the model (note that this value is currently hard-coded to 16 in `run_single_<model>.sh`, so `run_single_<model>.sh` must be modified as well when adjusting the layer count), the global batch size, and the sequence length.
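
The exact mapping from `--workload` to the thread count is not spelled out here; purely as an illustration of the per-thread-workload idea (hypothetical helper, not repo code), if each thread handles a fixed share of the total communication volume, the thread count would follow a ceiling division:

```python
import math

def p2p_threads(total_workload: int, workload_per_thread: int) -> int:
    """Hypothetical: number of P2P communication threads if each thread
    is assigned at most `workload_per_thread` units of `total_workload`."""
    return math.ceil(total_workload / workload_per_thread)

# E.g. 100 units of work split into chunks of 32 needs 4 threads.
print(p2p_threads(100, 32))  # 4
```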

For the remaining models, you can either directly run

```bash
bash examples/<model>/<train_file>.sh
```

or write a file similar to `run_{single,master,worker}_<model>.sh` that sets up the configuration and runs the shell script under `examples/`.

### Multinode Distributed Training

To run distributed training on multiple nodes, go to the root directory. First run

```bash
bash run_master_<model>.sh
```

and then start another pod and run

```bash
bash run_worker_<model>.sh
```

The `run_master_<model>.sh` script has the following parameters:

- Similar to `run_single_<model>.sh`, we have `pipeline_parallel`, `model_chunks`, and `tensor_parallel`.
- It writes the master pod IP to `examples/gpt3/train_gpt3_175b_distributed_master.sh` and to `train_gpt3_175b_distributed_worker.sh` (for BERT, in the corresponding directory).
- It sets the number of nodes to 2, with the master node having rank 0.
- It starts the shell script under `examples/`.

and `run_worker_<model>.sh` does the following:

- It sets the number of nodes to 2, with the worker node having rank 1.
- It starts the shell script under `examples/`.

`examples/gpt3/train_gpt3_175b_distributed_master.sh` and `examples/gpt3/train_gpt3_175b_distributed_worker.sh` are similar to the single-node version, except that the `--node-ips` flag is mandatory; it lists the InfiniBand IPs of the pods in the order of their GPU ranks. The `--multi-node` flag must also be turned on.
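
For illustration only, assuming `--node-ips` takes a comma-separated list (check the training script for the exact separator; the addresses below are hypothetical), the argument can be assembled from the rank-ordered InfiniBand IPs like this:

```python
# InfiniBand IPs ordered by GPU rank: master (rank 0) first, then workers.
# These addresses are made up for the example.
ib_ips = ["10.10.0.1", "10.10.0.2"]

node_ips_arg = ",".join(ib_ips)
print(f"--node-ips {node_ips_arg} --multi-node")
# --node-ips 10.10.0.1,10.10.0.2 --multi-node
```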

### Profiling

Each run generates a trace directory under `benchmark/`. Go to the `profiling` directory and run

```bash
python aggregate.py --benchmark_dir benchmark/your-benchmark-dir
```

Provide contact information, including

- Email (user/dev email addresses, with self-subscribe service)
- Discord / Slack
- WeChat / DingTalk
- Twitter / Zhihu...
