To prevent out-of-memory (OOM) errors when running the Transformer models, change from distributed data parallel (DDP) to a data+model parallel strategy.
Current State
As of 979358749bdbca1587322ff4cd14504e882aa3b3, we have been using distributed data parallel (DDP) to split the data batch-wise across multiple GPUs. However, when running on a full-size Sentinel-2 image (batch_size=1) during the test phase (#1), this can already cause out-of-memory issues for our Super-Resolution Segmentation task.
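For context, a minimal sketch of what the current DDP setup looks like in PyTorch Lightning (the `model`/`datamodule` names here are placeholders, not the actual project code):

```python
import pytorch_lightning as pl

# Plain DDP: the batch is split across GPUs, but every GPU still holds a full
# copy of the model, its gradients and the optimizer states.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="ddp",
)
# trainer.test(model, datamodule=datamodule)  # full Sentinel-2 image at batch_size=1 -> OOM
```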
Future State
One possible solution is to shard the neural network model itself across multiple GPUs. This reduces the per-GPU memory requirements and allows larger models and/or bigger inputs to be used for training/inference.
Specifically, we'll be switching to DeepSpeed (https://github.com/microsoft/DeepSpeed), which offers several 'levels' (stages) of model sharding. See https://devblog.pytorchlightning.ai/experiment-with-billion-parameter-models-faster-using-deepspeed-and-meta-tensors-2e9c255edd71 and https://huggingface.co/blog/zero-deepspeed-fairscale for good explainers.
Main DeepSpeed stages (from https://pytorch-lightning.readthedocs.io/en/1.6.3/advanced/model_parallel.html#deepspeed):
- DeepSpeed ZeRO Stage 1 - Shard optimizer states, remains at speed parity with DDP whilst providing memory improvement
- DeepSpeed ZeRO Stage 2 - Shard optimizer states and gradients, remains at speed parity with DDP whilst providing even more memory improvement
- DeepSpeed ZeRO Stage 3 - Shard optimizer states, gradients, parameters and optionally activations. Increases distributed communication volume, but provides even more memory improvement
:bulb: Suggestion: use Stage 2 rather than Stage 3, because while Stage 3 saves more memory, it comes with increased latency from the extra distributed communication.
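A minimal sketch of the switch, assuming we keep the existing LightningModule unchanged and only swap the `strategy` argument ("deepspeed_stage_2" is one of the strategy aliases registered by pytorch-lightning 1.6):

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="deepspeed_stage_2",  # shard optimizer states + gradients across GPUs
    precision=16,                  # DeepSpeed is typically run with mixed precision
)
```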
Other benefits of using DeepSpeed:
- Stage 2 and Stage 3 also have a CPU 'Offload' feature to save even more memory, for cases when GPU memory is simply not enough (see the sketch after this list)
- Allows me to train the model on just 16GB of GPU RAM on my workstation :exploding_head:
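Sketch of the CPU offload variant, for when even Stage 2 sharding is not enough on a 16GB card (parameter names follow the pytorch-lightning 1.6 `DeepSpeedStrategy` API; exact settings would still need tuning):

```python
import pytorch_lightning as pl
from pytorch_lightning.strategies import DeepSpeedStrategy

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    # Keep ZeRO Stage 2 sharding, but push the optimizer states out to CPU RAM
    strategy=DeepSpeedStrategy(stage=2, offload_optimizer=True),
    precision=16,
)
# The string alias "deepspeed_stage_2_offload" should be equivalent to the above.
```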
Alternative strategies (and why they were not chosen)
PyTorch Lightning offers several other advanced training strategies. These might work well for other cases, but probably not for our specific project.
- Bagua (https://github.com/BaguaSys/bagua)
- Distributed data parallel training
- Why not use this? Because it is data parallel only, not model parallel (though they may be working on it)
- https://pytorch-lightning.readthedocs.io/en/1.6.3/accelerators/gpu.html#bagua
- https://devblog.pytorchlightning.ai/bagua-a-new-efficient-distributed-training-strategy-available-in-pytorch-lightning-1-6-d6392633b15
- Fairscale (https://github.com/facebookresearch/fairscale)
- Model parallel training, a close competitor to DeepSpeed
- Why not use this? I did try it, but the conda-forge package didn't work because of an ABI compatibility issue.
- https://pytorch-lightning.readthedocs.io/en/1.6.3/advanced/model_parallel.html#fully-sharded-training
TODO:
- [x] Add deepspeed dependency (0a666012ffd523f8fffed541612d85ed0e151e7d)
- [x] Switch model to use DeepSpeed ZeRO Stage 2 (6394c12fc92a4837f8f453b5192d038928dffd46)
- [ ] ~~Use Meta Tensors, c.f. https://devblog.pytorchlightning.ai/experiment-with-billion-parameter-models-faster-using-deepspeed-and-meta-tensors-2e9c255edd71~~
- Nope, doesn't work (see the standalone repro after this list). The error given is `NotImplementedError: Could not run 'aten::_local_scalar_dense' with arguments from the 'Meta' backend`. See also https://github.com/pytorch/pytorch/issues/77764
- [ ] Decide whether to remove the Super-Resolution branch :thinking:
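For the record, my understanding of the Meta Tensor failure above (an assumption based on the error message, not traced through our exact code path) is that meta tensors hold only shape/dtype metadata, so any op that needs a concrete value, e.g. `.item()`, has no 'Meta' kernel to run:

```python
import torch

x = torch.empty(1, device="meta")  # meta tensor: shape/dtype only, no actual data
x.item()  # NotImplementedError: Could not run 'aten::_local_scalar_dense' ... 'Meta' backend
```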
enhancement