[[#Resources | Resources]] |
[[#Data Parallel | Data Parallel]] |
[[#Deepspeed ZeRO Data Parallel | Deepspeed ZeRO Data Parallel]] |
[[#Pipeline Parallel (Model Parallel) | Pipeline Parallel (Model Parallel)]] |
[[#Tensor Parallel | Tensor Parallel]] |
[[#2D & 3D Parallelism | 2D & 3D Parallelism]] |
Types of distributed training:
![[Pasted image 20240703184727.png]]
| Feature | DataParallel (DP) | DistributedDataParallel (DDP) |
|---|---|---|
| Process Model | Single process, multiple threads | Multiple processes, each handling one or more GPUs |
| Model Replication | Replicated on each GPU at each forward pass | Replicated once per process |
| Input Data Handling | Splits input data across GPUs | Splits input data across processes |
| Gradient Aggregation | Gradients averaged on the CPU/single GPU | Gradients synchronized across processes using NCCL |
| Performance | Better for smaller models | More efficient, better scaling across multiple GPUs and nodes |
| Scalability | Best for single-node, multi-GPU setups | Scales well across multiple nodes and GPUs |
| Synchronization | Implicit, handled by the framework | Explicit, requires setting up distributed process groups |
| Code Example | `model = nn.DataParallel(model).cuda()` | `dist.init_process_group(backend='nccl'); model = DDP(model, device_ids=[local_rank])` |
![[Pasted image 20240703173906.png]]
![[Pasted image 20240703171020.png]]
![[Pasted image 20240703174204.png]]
Conclusion:
By default, PyTorch recommends DDP over DP, even for a single-node, multi-GPU setup, because DP's multi-threading is constrained by the Python GIL.
^82e76e
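A minimal DDP training sketch expanding the code-example row above, assuming a launch via `torchrun --nproc_per_node=<num_gpus>`; the toy model, dummy dataset, and hyperparameters are placeholders for illustration.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # torchrun sets LOCAL_RANK for each spawned process
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 1).cuda(local_rank)            # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))  # dummy data
    sampler = DistributedSampler(dataset)                # shards data across processes
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)                         # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = nn.functional.mse_loss(model(x), y)
            optimizer.zero_grad()
            loss.backward()                              # DDP all-reduces gradients here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```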
Apart from the inputs, the core deep learning training algorithm involves three components for any model:
![[Pasted image 20240703171554.png]]
Speed vs Memory:
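ZeRO trades extra communication for memory savings by sharding the optimizer states, gradients, and (at stage 3) the parameters themselves across data-parallel ranks. Below is a minimal sketch of enabling it via DeepSpeed; the stage, batch size, offload, and optimizer settings are illustrative assumptions, not recommendations.

```python
import torch.nn as nn
import deepspeed

# Illustrative ZeRO config; tune stage/offload for your model and hardware.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "zero_optimization": {
        "stage": 2,                               # 1: optimizer states, 2: + gradients, 3: + parameters
        "offload_optimizer": {"device": "cpu"},   # optional CPU offload to save GPU memory
    },
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
}

model = nn.Linear(1024, 1024)                     # placeholder model
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

# A training step then looks like:
#   loss = model_engine(batch)
#   model_engine.backward(loss)
#   model_engine.step()
```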
Splits the model across GPUs; useful when the model is too large to fit in a single GPU's memory (see the sketch below the figures).
![[Pasted image 20240703183650.png]]
![[Pasted image 20240703183721.png]]
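A naive model-parallel sketch, assuming two GPUs (`cuda:0`, `cuda:1`); the layer sizes are placeholders. True pipeline parallelism additionally splits each batch into micro-batches so that both stages stay busy instead of idling while the other runs.

```python
import torch
import torch.nn as nn

# Naive model parallelism: the first half lives on cuda:0, the second on cuda:1.
class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))
        # Activations are copied between GPUs at the stage boundary.
        return self.stage1(x.to("cuda:1"))

model = TwoStageModel()
out = model(torch.randn(32, 1024))   # output lives on cuda:1
loss = out.sum()
loss.backward()                      # autograd routes gradients back across devices
```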
Split a weight tensor into N chunks, parallelize the computation across GPUs, and aggregate the partial results via an all-reduce.
![[Pasted image 20240703184517.png]]
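A single-process sketch of the row-parallel idea (no real multi-GPU communication): the weight matrix is chunked along its input dimension, each chunk produces a partial output, and summing the partials stands in for the all-reduce a real tensor-parallel layer would issue across GPUs. Shapes and chunk count are illustrative.

```python
import torch

torch.manual_seed(0)
N = 2                                   # number of tensor-parallel "ranks" (illustrative)
x = torch.randn(8, 1024)                # a batch of activations
W = torch.randn(1024, 4096)             # full weight matrix of a linear layer

# Reference: the full matmul on one device.
y_full = x @ W

# Row-parallel split: chunk W along its input dimension and x along its feature dimension.
W_chunks = W.chunk(N, dim=0)            # each "rank" holds 1024/N rows of W
x_chunks = x.chunk(N, dim=1)            # each "rank" holds the matching slice of x

# Each rank computes a partial output; the sum plays the role of the all-reduce.
partials = [xc @ wc for xc, wc in zip(x_chunks, W_chunks)]
y_parallel = sum(partials)

print(torch.allclose(y_full, y_parallel, atol=1e-4))   # True: results match
```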
![[Pasted image 20240703184944.png]] ![[Pasted image 20240703185009.png]] ![[Pasted image 20240703185030.png]]
There are libraries that support 3D parallelism out of the box; below are a few.