Commit 81e3e39

Merge pull request #35 from OpenSQZ/mvp

Add Docker Users Guidance

2 parents a13f13d + 5fa5dd2 commit 81e3e39

3 files changed

Lines changed: 408 additions & 0 deletions

File tree

DockerUsage.md

Lines changed: 258 additions & 0 deletions

# Quick Start for MegatronApp Docker Usage

This guide gives you a minimal, end-to-end path to run MegatronApp with Docker—just enough to get training and visualization up and running smoothly.

## Docker Installation

We strongly recommend using the official [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch). It bundles compatible dependencies and tuned configurations for NVIDIA GPUs.

Our custom environment is based on **nvcr.io/nvidia/pytorch:25.04-py3**.

```bash
# Run container with mounted directories
docker run --runtime=nvidia --gpus all -it --rm \
  -v /path/to/megatron:/workspace/megatron \
  -v /path/to/dataset:/workspace/dataset \
  -v /path/to/checkpoints:/workspace/checkpoints \
  nvcr.io/nvidia/pytorch:25.04-py3
```

Install any additional Python packages:

```bash
pip install -r requirements.txt
```
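
A quick sanity check inside the container (a minimal sketch; `torch` ships with the NGC image) confirms that the GPUs are actually visible before you launch anything heavy:

```bash
# List the GPUs the container can see, then confirm PyTorch agrees
nvidia-smi -L
python -c "import torch; print('CUDA available:', torch.cuda.is_available(), '| GPU count:', torch.cuda.device_count())"
```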

We plan to publish a prebuilt MegatronApp image to a public registry (e.g., Docker Hub) soon.

Note:

- **Default hardware assumption**: one machine with **4 GPUs**.

- **RDMA C++ extensions**: `shm_tensor_new_rdma` and `shm_tensor_new_rdma_pre_alloc` are only invoked when DPP (Dynamic Pipeline Planning) is enabled. They are primarily used in the `MegaDPP` module (see the references in `megatron/training/training.py`).

- **MegaFBD compatibility**: Although `MegaFBD` doesn’t execute DPP code paths, its installation may prompt you to build the RDMA extensions so imports resolve cleanly. Regular training—including `MegaFBD`—works **without** those extensions. Just ensure no script enables `--use-dpp` or other flags that trigger DPP; otherwise you’ll get runtime errors.

- **MegaScope / MegaScan**: These focus on visualization and slow-node detection rather than core training. You may comment out the RDMA extension lines referenced in [training.py](https://github.com/OpenSQZ/MegatronApp/blob/main/megatron/training/training.py#L120) and still run these components successfully.

### Data Preparation

Below is a minimal example using the GPT samples provided in the repository.

```bash
set -euo pipefail
cd /workspace/megatronapp

# Prepare shared directories (for inputs, outputs, and traces)
mkdir -p /workspace/shared/datasets /workspace/shared/outputs /workspace/shared/traces

# Preprocessed binaries from Megatron’s scripts will be produced here
mkdir -p datasets

# Example: preprocess GPT sample data (datasets_gpt/ and datasets_bert/ provided)
cd /workspace/megatronapp/datasets
python ../tools/preprocess_data.py \
  --input ../datasets_gpt/dataset.json \
  --output-prefix gpt \
  --vocab-file ../datasets_gpt/vocab.json \
  --tokenizer-type GPT2BPETokenizer \
  --merge-file ../datasets_gpt/merges.txt \
  --append-eod \
  --workers "$(nproc)"
```
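
If preprocessing succeeds, you should see the binary/index pair that later steps reference (a quick check; the filenames derive from `--output-prefix gpt` plus the default `text` JSON key of `preprocess_data.py`):

```bash
# Still inside /workspace/megatronapp/datasets
ls
# Expected: gpt_text_document.bin  gpt_text_document.idx
```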

To use **your own large dataset**, prepare a `.jsonl` file with **one sample per line** and point `--input` to your file path.
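
For example, a minimal input file might look like this (a sketch; `my_dataset.jsonl` is a hypothetical name, and it assumes the default `text` JSON key read by `tools/preprocess_data.py`):

```bash
# Each line is one standalone JSON object holding a sample under "text"
cat > my_dataset.jsonl <<'EOF'
{"text": "First training sample as plain text."}
{"text": "Second sample; exactly one JSON object per line."}
EOF
```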

Please refer to [README_Megatron.md](https://github.com/OpenSQZ/MegatronApp/blob/main/README_Megatron.md) for more details.

## MegaScan

MegaScan requires enabling trace-related flags during training. Start with the single-node GPT example (easiest to verify).

```bash
cd /workspace/megatronapp

# MegaScan-related flags, shown here for reference
# (DockerUsage_MegaScan.sh already passes the trace flags to training)
TRACE_FLAGS="\
--trace \
--trace-dir trace_output \
--trace-interval 5 \
--continuous-trace-iterations 2 \
--trace-granularity full \
--transformer-impl local"

bash ./DockerUsage_MegaScan.sh
```

Note:
- **Single machine, multi-GPU**: the script as shipped launches four processes via `torchrun --nproc_per_node=4` (see `DockerUsage_MegaScan.sh`). If your node has a different number of GPUs, edit `--nproc_per_node` to match.

- **Multi-node**: Use `run_master_<model>.sh` / `run_worker_<model>.sh` and set `--multi-node` and `--node-ips` (in InfiniBand order) in `examples/.../train_*_master/worker.sh`.

You can also consider **elastic training** (see the `torchrun` documentation).

After training, per-rank trace files will be produced in the trace directory (`trace_output`, per `--trace-dir`) with names like:

```
benchmark-data-{}-pipeline-{}-tensor-{}.json
```

Aggregate them into one file:

```bash
python scripts/aggregate.py -b trace_output --output benchmark.json
```

To visualize, open the JSON trace with Chrome Tracing (chrome://tracing) or [Perfetto UI](https://ui.perfetto.dev/). You can zoom, filter, and inspect timelines token-by-token to analyze distributed performance.

<p align="center">
  <img src="images/trace1.png" alt="trace1" width="49%">
  <img src="images/trace2.png" alt="trace2" width="49%">
</p>

### Fault Injection (for demonstration)

You can simulate GPU downclocking with `scripts/gpu_control.sh` to illustrate the detection algorithm:

```bash
# Downclock GPU 0 to 900 MHz
bash scripts/gpu_control.sh limit 0 900
```
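
When the demonstration is over, you can restore the default clocks. If `scripts/gpu_control.sh` offers no reset subcommand, plain `nvidia-smi` can do it (an assumption: your driver permits clock management, and root is required):

```bash
# Restore default clocks on GPU 0 after the demo
nvidia-smi -i 0 --reset-gpu-clocks
```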

Re-run training, then aggregate with detection enabled:

```bash
# -b is equivalent to --bench-dir; -d enables detection (equivalent to --detect)
python scripts/aggregate.py -b . -d
```

You should see output indicating a potential anomaly on GPU 0:

![1](images/result.png)

## MegaScope

First, we use the existing sample data to launch this example. Move `gpt_text_document.bin` and `gpt_text_document.idx` from `/workspace/megatronapp/datasets` to `/workspace/megatronapp/datasets_gpt`.
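
For example (assuming the preprocessing step above produced both files):

```bash
mv /workspace/megatronapp/datasets/gpt_text_document.{bin,idx} \
   /workspace/megatronapp/datasets_gpt/
```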

MegaScope requires a backend (Megatron) and a frontend (Vue) service.

### Backend (Megatron) Training Mode

```bash
TP=1 PP=2 NNODES=1 NCCL_DEBUG=INFO MASTER_ADDR=127.0.0.1 MASTER_PORT=29500 bash DockerUsage_MegaScope.sh
```

Important: The tutorial defaults to 1 node × 4 GPUs. On your server, choose a consistent combination of `TP` (tensor parallel size), `PP` (pipeline parallel size), and world size: `TP × PP` must evenly divide the number of GPUs, and the quotient becomes the data-parallel size (e.g., 4 GPUs with `TP=1 PP=2` gives a data-parallel size of 2).

After training, list saved checkpoints:

```bash
ls -lah ngc_models/release_gpt_base
```

Expected output (example):

```
total 16K
drwxr-xr-x 3 root root 4.0K Oct 13 12:25 .
drwxr-xr-x 3 root root 4.0K Oct 13 12:05 ..
drwxr-xr-x 4 root root 4.0K Oct 13 12:25 iter_0000020
-rw-r--r-- 1 root root 2 Oct 13 12:25 latest_checkpointed_iteration.txt
```

### Backend (Megatron) Inference Mode

For inference mode, run the text generation server script, pointing it to your model and tokenizer paths, **and make sure to turn on the switch `--enable-ws-server` in the arguments**.

```bash
bash examples/inference/a_text_generation_server_bash_script.sh /path/to/model /path/to/tokenizer
```

For example, you can apply for access to and download [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).

```bash
mkdir -p /workspace/models/llama3_hf
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
  --local-dir /workspace/models/llama3_hf \
  --local-dir-use-symlinks False
```

Downloading can take a while; ensure you have roughly `40 GB` of free disk space.
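
A quick way to check the available space before starting (the 8B weights alone are roughly 16 GB, plus the converted Megatron copy later):

```bash
df -h /workspace
```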

Convert the HF checkpoint to `Megatron` format:

```bash
python tools/checkpoint/convert.py \
  --model-type GPT \
  --loader llama_mistral \
  --saver core \
  --checkpoint-type hf \
  --model-size llama3 \
  --load-dir /path/to/Meta-Llama-3-8B-Instruct \
  --save-dir /path/to/Meta-Llama-3-8B-Instruct-megatron \
  --tokenizer-model /path/to/Meta-Llama-3-8B-Instruct/tokenizer.model \
  --bf16
```

Here:
- `--loader llama_mistral`: use the built-in LLaMA/Mistral conversion logic.

- `--checkpoint-type hf`: the input is a Hugging Face checkpoint.

- `--model-size`: choose according to the model (e.g., `llama2-7B`, `llama3`, `mistral`).

- Output goes to `--save-dir` and is directly loadable by Megatron inference/training.

During conversion, per-shard read/write progress is printed. When finished, your Megatron checkpoint directory (e.g., `/workspace/models/llama3_megatron`) is ready.
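
A quick way to verify the result (a sketch; the exact layout depends on the converter version and your TP/PP settings):

```bash
ls /path/to/Meta-Llama-3-8B-Instruct-megatron
# Expect a checkpoint directory (e.g., release/) plus latest_checkpointed_iteration.txt
cat /path/to/Meta-Llama-3-8B-Instruct-megatron/latest_checkpointed_iteration.txt
```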

Start the MegaScope inference service:

```bash
bash examples/inference/llama_mistral/run_text_generation_llama3.sh /gfshome/llama3-ckpts/Meta-Llama-3-8B-Instruct-megatron-core-v0.12.0-TP1PP1 /root/llama3-ckpts/Meta-Llama-3-8B-Instruct
```

When the terminal shows **“MegatronServer started”** and a listening **PORT**, the backend is ready.
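
Before wiring up the frontend, you can optionally poke the backend directly. This assumes the server exposes Megatron's standard text-generation REST endpoint (`PUT /api`) on the printed port; adjust both if your build differs:

```bash
# Hypothetical smoke test; replace 5000 with the PORT shown at startup
curl -X PUT http://localhost:5000/api \
  -H 'Content-Type: application/json' \
  -d '{"prompts": ["Hello, MegaScope!"], "tokens_to_generate": 16}'
```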

### Frontend (Vue)

Navigate to the frontend directory and start the development server:

```bash
cd transformer-visualize
npm install   # first run only: install the frontend dependencies
npm run dev
```

After launching both, open your browser to the specified address (usually http://localhost:5173). You will see the main interface.

#### Generating Text and Visualizing Intermediate States

In the input prompts area, enter one or more prompts. Each text box represents a separate batch, allowing for parallel processing and comparison.
![](images/prompts.jpg)

In the control panel, set the desired number of tokens to generate. You can also enable or disable the real-time display of specific internal states, such as QKV vectors and MLP outputs; this helps manage performance and focus on relevant data. Vector filter expressions can be customized via the input box below.
![](images/controls.jpg)

After starting generation, the visualization results update token by token. The first tab displays the intermediate vector heatmaps, with the output probabilities shown in the expandable sections.
![](images/visualization.jpg)

The second tab contains attention matrices. Use the dropdown menus to select the layer and attention head you wish to inspect.
![](images/attention.jpg)

The third tab provides PCA dimensionality reduction, where you can visually inspect the clustering of tokens and understand how the model groups similar concepts. The displayed layer can also be selected.
![](images/pca.jpg)

#### Injecting Model Perturbations

The expandable perturbation control panel can introduce controlled noise into the model's forward pass. Each kind of perturbation has an independent switch that controls the noise type and intensity.

The currently supported noise types include:
- Additive Gaussian noise (noise1): `output = input + N(0, coef²)`, where `N(0, coef²)` is a random value drawn from a Gaussian (normal) distribution with mean 0 and standard deviation `coef`.
- Multiplicative uniform noise (noise2): `output = input * U(1 - val, 1 + val)`, where `U(1 - val, 1 + val)` is a random value drawn from a uniform distribution on that interval.
![](images/perturbation.jpg)

#### Support for the training process

Similar visualization support is provided for the training process as well. The overall controls are the same, and training is driven from the frontend page. Critical intermediate results and perturbations are supported during training too.
![](images/training.jpg)

## MegaDPP

TBD

## MegaFBD

TBD

DockerUsage_MegaScan.sh

Lines changed: 44 additions & 0 deletions

#!/usr/bin/env bash
set -euo pipefail

export CUDA_DEVICE_MAX_CONNECTIONS=1
export OMP_NUM_THREADS=1
# Use PyTorch's recommended variable names to avoid deprecation warnings in your logs
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_DEBUG=WARN
export NCCL_IB_DISABLE=1

# The primary network interface in a standalone container is usually eth0. If yours differs, replace it.
export GLOO_SOCKET_IFNAME=eth0
export NCCL_SOCKET_IFNAME=eth0

nvidia-smi -L || true
mkdir -p trace_output /workspace/shared/ckpts_gpt /workspace/shared/tensorboard_gpt

torchrun --standalone --nproc_per_node=4 pretrain_gpt.py \
  --num-layers 16 \
  --hidden-size 2048 \
  --num-attention-heads 32 \
  --seq-length 2048 \
  --max-position-embeddings 2048 \
  --micro-batch-size 2 \
  --global-batch-size 16 \
  --train-iters 5 \
  --tensor-model-parallel-size 2 \
  --pipeline-model-parallel-size 2 \
  --num-layers-per-virtual-pipeline-stage 2 \
  --untie-embeddings-and-output-weights \
  --no-ckpt-fully-parallel-save \
  --tokenizer-type GPT2BPETokenizer \
  --vocab-file datasets_gpt/vocab.json \
  --merge-file datasets_gpt/merges.txt \
  --data-path datasets/gpt_text_document \
  --split 949,50,1 \
  --fp16 \
  --save /workspace/shared/ckpts_gpt \
  --save-interval 50 \
  --tensorboard-dir /workspace/shared/tensorboard_gpt \
  --transformer-impl transformer_engine \
  --lr 3e-4 --min-lr 3e-4 --lr-decay-style constant --lr-warmup-iters 0 \
  --trace --trace-dir trace_output --trace-interval 5 \
  --continuous-trace-iterations 2 --trace-granularity full
