
Types of distributed training:

Data Parallel

| Feature | DataParallel (DP) | DistributedDataParallel (DDP) |
| --- | --- | --- |
| Process Model | Single process, multiple threads | Multiple processes, each handling one or more GPUs |
| Model Replication | Replicated on each GPU at every forward pass | Replicated once per process |
| Input Data Handling | Splits input data across GPUs | Splits input data across processes |
| Gradient Aggregation | Gradients averaged on the CPU / a single GPU | Gradients synchronized across processes using NCCL |
| Performance | Better for smaller models | More efficient; better scaling across multiple GPUs and nodes |
| Scalability | Best for single-node, multi-GPU setups | Scales well across multiple nodes and GPUs |
| Synchronization | Implicit, handled by the framework | Explicit; requires setting up distributed process groups |
| Code Example | `model = nn.DataParallel(model).cuda()` | `dist.init_process_group(backend='nccl'); model = DDP(model, device_ids=[local_rank])` |

Conclusion:

  • The only per-batch communication DDP performs is sending gradients, whereas DP does five different data exchanges per batch.
  • Under DP, GPU 0 performs far more work than the other GPUs, resulting in GPU under-utilization.

By default, PyTorch recommends DDP over DP even for single-node, multi-GPU setups, because DP's single-process multi-threading is constrained by Python's GIL.
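
A minimal DDP training sketch, assuming launch via `torchrun --nproc_per_node=N train.py` (which sets `LOCAL_RANK` in the environment); the toy linear model and random data are placeholders:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # one process per GPU; torchrun provides RANK / WORLD_SIZE / LOCAL_RANK
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(1024, 10).cuda(local_rank)  # toy model, placeholder for a real network
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        # each rank draws its own shard of the data
        inputs = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        targets = torch.randint(0, 10, (32,), device=f"cuda:{local_rank}")
        loss = nn.functional.cross_entropy(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()   # DDP all-reduces gradients across processes here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```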

DeepSpeed ZeRO Data Parallel

ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters - Microsoft Research

The core training loop maintains three pieces of state for any model, apart from the inputs:

  1. Parameters
  2. Gradients
  3. Optimizer state

DeepSpeed ZeRO is accordingly divided into three stages (a config sketch follows the list):

  • Stage 1 - Optimizer State Partitioning (P_os): shards optimizer states across GPUs.
  • Stage 2 - adds Gradient Partitioning (P_os+g): shards gradients across GPUs as well.
  • Stage 3 - adds Parameter Partitioning (P_os+g+p): shards parameters across GPUs as well.
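
A minimal sketch of selecting a ZeRO stage through a DeepSpeed config; the toy `nn.Linear` model, learning rate, and micro-batch size are placeholder assumptions, and the script would be launched with the `deepspeed` CLI:

```python
import torch.nn as nn
import deepspeed

model = nn.Linear(1024, 10)  # toy model, placeholder for a real network

ds_config = {
    "train_micro_batch_size_per_gpu": 8,                      # placeholder batch size
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},    # placeholder optimizer
    "zero_optimization": {"stage": 2},  # 1 = P_os, 2 = P_os+g, 3 = P_os+g+p
}

# returns (engine, optimizer, dataloader, lr_scheduler); the engine wraps
# forward/backward/step and applies the ZeRO sharding configured above
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```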

Speed vs Memory:

  • Speed: ZeRO-1 > ZeRO-2 > ZeRO-3
  • Memory usage: ZeRO-1 > ZeRO-2 > ZeRO-3 (higher stages shard more state, so they use the least memory)
  • Communication overhead: ZeRO-3 > ZeRO-2 > ZeRO-1
  • Bandwidth requirement: ZeRO-3 > ZeRO-2 > ZeRO-1

Pipeline Parallel (Model Parallel)

Splits the model across GPUs; useful when the model is larger than a single GPU's memory.

  • Good for loading very large models.
  • Higher GPU idle time (the pipeline "bubble"); it can be reduced by splitting each batch into micro-batches, but GPU utilization is still poor compared to other techniques (see the sketch after this list).
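
A minimal sketch of naive model/pipeline parallelism, assuming two GPUs; the two-stage linear network and device names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """First half of the network lives on cuda:0, second half on cuda:1."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.stage1(x.to("cuda:0"))
        # move activations to the second GPU; while stage2 runs, stage1 sits
        # idle -- this is the "bubble" that micro-batching reduces
        return self.stage2(x.to("cuda:1"))

model = TwoStageModel()
out = model(torch.randn(32, 1024))  # output lands on cuda:1
```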

Tensor Parallel

Splits a weight tensor into N chunks, so each GPU computes on its own shard in parallel, and the partial results are aggregated via an all-reduce.
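
A single-process sketch of the idea; the shapes and shard count are illustrative assumptions, and the final sum stands in for the all-reduce that would run across GPUs:

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 1024)      # batch of inputs
W = torch.randn(4096, 1024)   # full linear-layer weight (out_features x in_features)

# shard the weight along the input dimension; each "GPU" gets one chunk
W_shards = torch.chunk(W, 2, dim=1)
x_shards = torch.chunk(x, 2, dim=1)   # matching slices of the input

# each shard computes a partial output in parallel
partial = [xs @ ws.T for xs, ws in zip(x_shards, W_shards)]

# summing the partial results plays the role of the all-reduce
y = partial[0] + partial[1]
assert torch.allclose(y, x @ W.T, atol=1e-3)
```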

2D & 3D Parallelism

3D parallelism combines data, pipeline, and tensor parallelism. A few libraries support it out of the box:

  • DeepSpeed
  • NVIDIA NeMo