This repository is a step-by-step breakdown of how distributed deep learning actually works under the hood using PyTorch.
It starts from the lowest-level communication primitives (send/recv) and builds up to full-scale Distributed Data Parallel (DDP) training with multiple GPUs.
The goal is not just to run distributed training, but to understand:
how independent processes coordinate to behave like a single training system.
In single-GPU training:
- one process
- one model
- one dataset
In distributed training:
- multiple processes (one per GPU)
- each process runs the same model code
- each process handles different data
- communication keeps everything synchronized
The system only works because of explicit communication primitives.
Distributed training is NOT:
multiple models learning independently
It IS:
multiple workers computing independently but staying synchronized through communication
So:
- computation is parallel
- learning is unified
Send/recv (point-to-point):
- simplest form of distributed communication
- one process sends data to another
- fully explicit, blocking communication
Conceptual meaning:
one worker directly copies a tensor's contents to another worker
This teaches:
- no shared memory exists between processes
- all coordination is explicit
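A minimal sketch of the send/recv pattern described above, assuming two CPU processes, the `gloo` backend, and a temporary file for rendezvous. The names `worker` and `launch` are hypothetical helpers for this illustration and are not taken from `send_receive.py`. The `fork` start method is also an assumption (Linux, CPU only); GPU code would normally use `spawn` or `torchrun`.

```python
import multiprocessing as mp
import os
import tempfile

import torch
import torch.distributed as dist


def worker(rank, world_size, init_file, queue):
    # Every process joins the same group; the file is only used for rendezvous.
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{init_file}",
        rank=rank,
        world_size=world_size,
    )
    tensor = torch.zeros(3)
    if rank == 0:
        tensor += 42.0            # only rank 0 holds the real data
        dist.send(tensor, dst=1)  # blocking point-to-point send
    elif rank == 1:
        dist.recv(tensor, src=0)  # blocking receive, overwrites `tensor` in place
    queue.put((rank, tensor.tolist()))
    dist.destroy_process_group()


def launch(world_size=2):
    ctx = mp.get_context("fork")  # assumption: Linux, CPU only
    queue = ctx.Queue()
    with tempfile.TemporaryDirectory() as tmp:
        init_file = os.path.join(tmp, "rendezvous")
        procs = [
            ctx.Process(target=worker, args=(rank, world_size, init_file, queue))
            for rank in range(world_size)
        ]
        for p in procs:
            p.start()
        # Collect one (rank, values) pair per process before joining.
        results = dict(queue.get(timeout=120) for _ in procs)
        for p in procs:
            p.join()
    return results


if __name__ == "__main__":
    print(launch())
```

Rank 1 only sees rank 0's values because they were explicitly sent; nothing is shared implicitly between the two processes.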
Broadcast:
- rank 0 becomes source of truth
- all other processes copy its tensor
Used for:
- model initialization
- parameter synchronization before training
Conceptual meaning:
ensure all workers start from identical state
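The broadcast step can be sketched as follows, assuming two CPU processes on the `gloo` backend. Each rank deliberately initializes its model with a different seed, then copies rank 0's parameters with `dist.broadcast`. The helpers `worker` and `launch`, the file-based rendezvous, and the `fork` start method are assumptions of this sketch, not necessarily what `broadcast.py` does.

```python
import multiprocessing as mp
import os
import tempfile

import torch
import torch.distributed as dist


def flat_params(model):
    # Flatten all parameters into one plain Python list for easy comparison.
    return torch.cat([p.detach().flatten() for p in model.parameters()]).tolist()


def worker(rank, world_size, init_file, queue):
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{init_file}",
        rank=rank,
        world_size=world_size,
    )
    torch.manual_seed(rank)  # different seed per rank, so initial weights differ
    model = torch.nn.Linear(4, 2)
    before = flat_params(model)
    for p in model.parameters():
        dist.broadcast(p.data, src=0)  # rank 0's values overwrite every other rank's
    after = flat_params(model)
    queue.put((rank, (before, after)))
    dist.destroy_process_group()


def launch(world_size=2):
    ctx = mp.get_context("fork")  # assumption: Linux, CPU only
    queue = ctx.Queue()
    with tempfile.TemporaryDirectory() as tmp:
        init_file = os.path.join(tmp, "rendezvous")
        procs = [
            ctx.Process(target=worker, args=(rank, world_size, init_file, queue))
            for rank in range(world_size)
        ]
        for p in procs:
            p.start()
        results = dict(queue.get(timeout=120) for _ in procs)
        for p in procs:
            p.join()
    return results


if __name__ == "__main__":
    for rank, (before, after) in sorted(launch().items()):
        print(rank, before[:2], after[:2])
```

After the broadcast, every rank holds rank 0's weights, which is the identical starting state that training relies on.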
All-reduce:
- all processes compute local gradients
- gradients are summed (or averaged)
- result is distributed back to all processes
Used for:
- gradient synchronization in training
Conceptual meaning:
convert multiple independent gradients into one shared update signal
This is the core operation behind DDP.
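As a rough illustration of gradient all-reduce, the sketch below has each rank compute gradients on its own shard of a small synthetic batch, sum them with `dist.all_reduce`, and divide by the world size. With equal shard sizes and a mean loss, the averaged gradient should match the gradient of the full batch. The helper names (`make_model_and_data`, `local_grads`, `worker`, `launch`), the synthetic data, and the `gloo`/`fork` setup are assumptions of this sketch.

```python
import multiprocessing as mp
import os
import tempfile

import torch
import torch.distributed as dist


def make_model_and_data():
    torch.manual_seed(0)  # same seed everywhere: identical model and data on all ranks
    model = torch.nn.Linear(4, 1)
    x, y = torch.randn(8, 4), torch.randn(8, 1)
    return model, x, y


def local_grads(model, x, y):
    model.zero_grad()
    torch.nn.functional.mse_loss(model(x), y).backward()
    return [p.grad for p in model.parameters()]


def worker(rank, world_size, init_file, queue):
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{init_file}",
        rank=rank,
        world_size=world_size,
    )
    model, x, y = make_model_and_data()
    # Each rank only looks at its own slice of the batch.
    grads = local_grads(model, x[rank::world_size], y[rank::world_size])
    for g in grads:
        dist.all_reduce(g, op=dist.ReduceOp.SUM)  # every rank receives the sum
        g /= world_size                           # turn the sum into an average
    queue.put((rank, torch.cat([g.flatten() for g in grads]).tolist()))
    dist.destroy_process_group()


def launch(world_size=2):
    ctx = mp.get_context("fork")  # assumption: Linux, CPU only
    queue = ctx.Queue()
    with tempfile.TemporaryDirectory() as tmp:
        init_file = os.path.join(tmp, "rendezvous")
        procs = [
            ctx.Process(target=worker, args=(rank, world_size, init_file, queue))
            for rank in range(world_size)
        ]
        for p in procs:
            p.start()
        results = dict(queue.get(timeout=120) for _ in procs)
        for p in procs:
            p.join()
    return results


if __name__ == "__main__":
    print(launch())
```

Because every rank ends up with the same averaged gradient, applying the same optimizer step on each rank keeps the replicas identical.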
Each GPU must see different data.
Without sampling:
- every GPU processes the full dataset → wasted compute
With DistributedSampler:
- dataset is partitioned across GPUs
- each GPU sees a unique subset
Conceptual meaning:
parallelism comes from splitting data, not splitting the model
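The partitioning can be inspected without launching any processes. This sketch assumes a toy dataset of eight items and two ranks, and passes `num_replicas` and `rank` to `DistributedSampler` explicitly so that no process group is required. The helper `indices_for` is a hypothetical name for this illustration.

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset: eight samples whose values equal their indices.
dataset = TensorDataset(torch.arange(8))


def indices_for(rank, world_size=2, epoch=0):
    # Explicit num_replicas and rank mean no process group is needed here.
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=True, seed=0
    )
    sampler.set_epoch(epoch)  # all ranks must use the same epoch to stay disjoint
    return list(sampler)


if __name__ == "__main__":
    for rank in range(2):
        print(rank, indices_for(rank))
```

In a real training loop the sampler is passed to `DataLoader(..., sampler=sampler)`, and `set_epoch` is called at the start of each epoch so the shuffle changes between epochs while the shards stay disjoint.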
DDP combines everything:
- Model replication
  - same model on every GPU
- Data partitioning
  - each GPU gets different mini-batches
- Independent computation
  - forward + backward pass per GPU
- Automatic synchronization
  - gradients are all-reduced internally
  - all models stay identical
The full workflow, in order:
- processes launched (one per GPU)
- model created on each GPU
- weights broadcast from rank 0
- DistributedSampler ensures unique batches per GPU
- each GPU computes outputs independently
- each GPU computes gradients locally
- gradients are all-reduced across GPUs
- every GPU receives identical averaged gradients
- each GPU applies identical update
- models remain perfectly synchronized
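The workflow above can be sketched end to end on CPU. This is an illustration under stated assumptions rather than the contents of `ddp_training.py`: two processes, the `gloo` backend, a synthetic regression dataset, a single `Linear` layer, and hypothetical `worker` and `launch` helpers using the `fork` start method. On GPUs one would typically use the `nccl` backend, move the model to the local device, and launch with `torchrun`.

```python
import multiprocessing as mp
import os
import tempfile

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def worker(rank, world_size, init_file, queue):
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{init_file}",
        rank=rank,
        world_size=world_size,
    )
    torch.manual_seed(0)  # same synthetic dataset on every rank
    dataset = TensorDataset(torch.randn(32, 4), torch.randn(32, 1))
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=4, sampler=sampler)

    torch.manual_seed(100 + rank)       # different init per rank on purpose
    model = DDP(torch.nn.Linear(4, 1))  # DDP broadcasts rank 0's weights at construction
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards consistently across ranks
        for x, y in loader:
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x), y)
            loss.backward()   # DDP all-reduces (averages) gradients during backward
            optimizer.step()  # identical gradients give identical updates

    params = torch.cat([p.detach().flatten() for p in model.parameters()])
    queue.put((rank, params.tolist()))
    dist.destroy_process_group()


def launch(world_size=2):
    ctx = mp.get_context("fork")  # assumption: Linux, CPU only
    queue = ctx.Queue()
    with tempfile.TemporaryDirectory() as tmp:
        init_file = os.path.join(tmp, "rendezvous")
        procs = [
            ctx.Process(target=worker, args=(rank, world_size, init_file, queue))
            for rank in range(world_size)
        ]
        for p in procs:
            p.start()
        results = dict(queue.get(timeout=120) for _ in procs)
        for p in procs:
            p.join()
    return results


if __name__ == "__main__":
    print(launch())
```

Comparing the returned parameter lists across ranks is a quick way to confirm that the replicas have stayed in sync after training.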
Even though computation is distributed:
the model state is always globally consistent
This is achieved by synchronizing gradients in every training step, before the optimizer update is applied.
Think of distributed training as:
- multiple workers
- each doing partial computation
- but forced to agree after every step
So the system behaves like:
one large GPU, split across many machines
This architecture enables:
- faster training (parallel compute)
- larger batch sizes
- scalable deep learning systems
Without communication primitives (broadcast + all_reduce), distributed training would not be possible.
This repo builds understanding in order:
- send_receive.py → raw process communication
- broadcast.py → state synchronization
- all_reduce.py → gradient aggregation
- distributed_sampler.py → data parallelism
- ddp_training.py → full system
Distributed training is not about running multiple models.
It is about:
coordinating multiple independent processes so they behave like a single coherent optimization system through structured communication.