
  • [[#Resources|Resources]]
  • [[#Data Parallel|Data Parallel]]
  • [[#Deepspeed ZeRO Data Parallel|Deepspeed ZeRO Data Parallel]]
  • [[#Pipeline Parallel (Model Parallel)|Pipeline Parallel (Model Parallel)]]
  • [[#Tensor Parallel|Tensor Parallel]]
  • [[#2D & 3D Parallelism|2D & 3D Parallelism]]

Resources

Types of distributed training:

![[Pasted image 20240703184727.png]]

Data Parallel

| Feature | DataParallel (DP) | DistributedDataParallel (DDP) |
| --- | --- | --- |
| Process model | Single process, multiple threads | Multiple processes, each handling one or more GPUs |
| Model replication | Replicated on each GPU at every forward pass | Replicated once per process |
| Input data handling | Splits input data across GPUs | Splits input data across processes |
| Gradient aggregation | Gradients averaged on the CPU / a single GPU | Gradients synchronized across processes using NCCL |
| Performance | Better for smaller models | More efficient; better scaling across multiple GPUs and nodes |
| Scalability | Best for single-node, multi-GPU setups | Scales well across multiple nodes and GPUs |
| Synchronization | Implicit, handled by the framework | Explicit; requires setting up distributed process groups |
| Code example | `model = nn.DataParallel(model).cuda()` | `dist.init_process_group(backend='nccl'); model = DDP(model, device_ids=[local_rank])` |

![[Pasted image 20240703173906.png]]

![[Pasted image 20240703171020.png]]

![[Pasted image 20240703174204.png]]

Conclusion:

  • The only per-batch communication DDP performs is the gradient all-reduce, whereas DP performs 5 different data exchanges per batch.
  • Under DP, GPU 0 does far more work than the other GPUs, leaving the rest under-utilized.

By default, PyTorch recommends DDP over DP, even for single-node multi-GPU setups, because Python's GIL limits DP's single-process, multi-threaded design.

^82e76e
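Below is a minimal runnable DDP sketch expanding the one-liners in the table above; the model, data, and hyperparameters are toy placeholders.

```python
# Launch with: torchrun --nproc_per_node=<num_gpus> ddp_example.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun per process
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 10).cuda(local_rank)  # toy model
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(10):
        x = torch.randn(32, 10, device=f"cuda:{local_rank}")
        loss = model(x).sum()  # dummy loss
        opt.zero_grad()
        loss.backward()  # DDP all-reduces gradients across processes here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```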

Deepspeed ZeRO Data Parallel

ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters - Microsoft Research

Apart from its inputs, the core training loop of any model involves three components:

  1. Parameters
  2. Gradients
  3. Optimizer state

DeepSpeed ZeRO is therefore divided into three stages:

    • Stage 1 - Optimizer State Partitioning (P_os): shards optimizer states across GPUs.
    • Stage 2 - adds Gradient Partitioning (P_os+g): shards gradients across GPUs as well.
    • Stage 3 - adds Parameter Partitioning (P_os+g+p): shards parameters across GPUs as well.

![[Pasted image 20240703171554.png]]

Speed vs Memory:

  • Speed - ZeRO-1 > ZeRO-2 > ZeRO-3 (ZeRO-1 is fastest)
  • Memory usage - ZeRO-1 > ZeRO-2 > ZeRO-3 (ZeRO-3 saves the most memory)
  • Communication overhead - ZeRO-3 > ZeRO-2 > ZeRO-1
  • Bandwidth requirement - ZeRO-3 > ZeRO-2 > ZeRO-1
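
A minimal sketch of selecting a ZeRO stage through the DeepSpeed config; the toy model, optimizer settings, and batch size are illustrative placeholders, not recommendations.

```python
# Launch with: deepspeed zero_example.py
import torch.nn as nn
import deepspeed

ds_config = {
    "train_batch_size": 32,
    # stage 1 = optimizer states, 2 = + gradients, 3 = + parameters
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

model = nn.Linear(10, 10)  # toy model
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
# Training step: loss = engine(x).sum(); engine.backward(loss); engine.step()
```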

Pipeline Parallel (Model Parallel)

Splits the model layer-wise across GPUs; useful when the model is larger than a single GPU's memory.

![[Pasted image 20240703183650.png]]

![[Pasted image 20240703183721.png]]

  • Good for loading very large models that don't fit on one GPU (a naive sketch follows this list).
  • Higher GPU idle time (the pipeline "bubble"); micro-batches, as shown above, reduce it, but GPU utilization remains poor compared to other techniques.
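
A minimal sketch of the naive version (no micro-batching), placing half of a toy model on each of two GPUs and moving activations across the boundary; the device names and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

# Naive model parallelism: stage 0 lives on cuda:0, stage 1 on cuda:1.
class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(10, 10), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Linear(10, 10).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))
        # Activations cross the GPU boundary here; without micro-batches,
        # cuda:1 idles while cuda:0 computes, and vice versa.
        return self.stage1(x.to("cuda:1"))

model = TwoStageModel()
out = model(torch.randn(32, 10))  # output lives on cuda:1
```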

Tensor Parallel

Splits a weight tensor into N chunks so that each GPU computes with its shard in parallel; the partial results are then aggregated via all-reduce, as sketched below.

![[Pasted image 20240703184517.png]]
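
A single-process sketch of the row-parallel idea: the weight is split into N shards, each "rank" computes a partial product with its shard, and summing the partials stands in for the all-reduce; N and the shapes are illustrative.

```python
import torch

N = 2  # number of shards ("GPUs")
x = torch.randn(32, 10)   # input batch
w = torch.randn(10, 20)   # full weight of a linear layer

# Row-parallel linear: shard w along its input dimension; each rank also
# takes the matching slice of the input features.
x_shards = x.chunk(N, dim=1)  # each (32, 5)
w_shards = w.chunk(N, dim=0)  # each (5, 20)

# Each rank computes a partial product with its local shard...
partials = [xi @ wi for xi, wi in zip(x_shards, w_shards)]
# ...and an all-reduce (here: a plain sum) combines them into the full output.
y = sum(partials)

assert torch.allclose(y, x @ w, atol=1e-5)
```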

2D & 3D Parallelism

![[Pasted image 20240703184944.png]] ![[Pasted image 20240703185009.png]] ![[Pasted image 20240703185030.png]]

Several libraries support 3D parallelism out of the box, for example:

  • DeepSpeed
  • NVIDIA NeMo
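
As one example, a minimal sketch of combining pipeline parallelism with ZeRO data parallelism in DeepSpeed; the layer list, stage count, and config values are illustrative placeholders.

```python
# Launch with: deepspeed pipeline_example.py
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

deepspeed.init_distributed()

# Express the model as a flat list of layers so DeepSpeed can cut it
# into pipeline stages.
layers = [nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 10)]
model = PipelineModule(layers=layers, num_stages=2)

ds_config = {
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 4,  # micro-batches keep the pipeline full
    "zero_optimization": {"stage": 1},    # ZeRO-1 composes with pipeline parallelism
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
# Training step: loss = engine.train_batch(data_iter=train_iter)
```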
