The Ultimate PyTorch Source-Build Template
Translations: 한국어
TL;DR
PyTorch built from source can be up to 4x faster than a naïve PyTorch install. This repository provides a template for building PyTorch pip wheel binaries from source for any PyTorch version, on any CUDA version, in any environment. The wheels can be used in any project environment, including local conda environments, on any CUDA GPU.
This repository also proposes a new MLOps paradigm for deep learning development using Docker Compose. Hopefully, this method will become best practice in both academia and industry.
Preamble
Recent years have seen tremendous academic effort go into the design and implementation of efficient neural networks to cope with the ever-increasing amount of data on ever-smaller and more efficient devices. Yet, as of the time of writing, most deep learning practitioners are unaware of even the most basic GPU acceleration techniques. Especially in academia, many do not even use Automatic Mixed Precision (AMP), which can reduce memory requirements to 1/4 and increase speeds by 4-5x. This is the case even though AMP can be enabled without much hassle using the HuggingFace Accelerate or PyTorch Lightning libraries. The Accelerate library in particular can be integrated into any pre-existing PyTorch project with only a few lines of code.
Even the novice who has only just dipped their toes into the mysteries of deep learning knows that more compute is a key ingredient for success. No matter how brilliant the scientist, outperforming a rival with 10x more compute is no mean feat. This template was created with the aim of enabling researchers and engineers without much knowledge of GPUs, CUDA, Docker, etc. to squeeze every last drop of performance from their GPUs using the same hardware and neural networks.
Although Docker images with PyTorch source builds are already available in the official PyTorch Docker Hub repository and the NVIDIA NGC repository, these images have a multitude of other packages installed with them, making it difficult to integrate them into pre-existing projects. Moreover, many practitioners prefer local environments over Docker images.
Unlike those images, this project installs no libraries other than those chosen by the user. Even better, the generated wheels can be extracted for use in any environment with no need for Docker, though the second part of this project provides a docker-compose.yaml file to make using Docker much easier.
If you have long yearned for a quicker end to the hours spent staring at TensorBoard as your models inch past the epochs, this project may be the answer to your woes. A source build of PyTorch on the latest version of CUDA, combined with AMP, may achieve compute times up to 10x faster than a naïve PyTorch environment.
I sincerely hope that my project will be of service to practitioners in both academia and industry. Users who find my work beneficial are more than welcome to show their appreciation by starring this repository.
Warning
Before using this template, first check whether you are actually using your GPU!
In most scenarios, slow training is caused by an inefficient Extract, Transform, Load (ETL) pipeline. Training is slow because the data is not getting to the GPU fast enough, not because the GPU is running slowly. Run watch nvidia-smi to check whether GPU utilization is high enough to justify compute optimizations. If GPU utilization is low or peaks sporadically, first design an efficient ETL pipeline. Otherwise, faster compute will not help very much as it will not be the bottleneck.
See https://www.tensorflow.org/guide/data_performance for a guide on designing an efficient ETL pipeline.
The NVIDIA DALI library may also be helpful. The DALI PyTorch plugin provides an API for efficient ETL pipelines in PyTorch.
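As a rough way to judge whether the GPU is the bottleneck, utilization can be sampled over a short window and averaged. The following is a minimal sketch, assuming nvidia-smi is on the PATH; the helper names sample_gpu_util and mean_util are hypothetical and not part of this template.

```shell
# Print one GPU-utilization percentage per line, sampled once per second.
# Assumes nvidia-smi is installed; stops after N lines (default 30).
sample_gpu_util() {
    nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits -l 1 \
        | head -n "${1:-30}"
}

# Read integer percentages from stdin and print their mean (truncated).
mean_util() {
    awk '{ sum += $1; n++ }
         END { if (n > 0) printf "%d\n", sum / n; else print "no samples" }'
}
```

While training is running, sample_gpu_util 30 | mean_util prints the average over 30 lines of output. On a multi-GPU host each sample produces one line per GPU, so the mean covers all devices. A low average suggests fixing the data pipeline before optimizing compute.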
Introduction
To use this template for a new project, press the green Use this template button on the top. This is more convenient than forking or cloning this repository. Delete any unnecessary files and start making your project.
The first part of the README will explain the purpose of the Dockerfile and the advantages of using a custom source build of PyTorch. The second part proposes a new paradigm for deep learning development using Docker Compose.
PyTorch built from source can be much faster than PyTorch installed from pip/conda, but building from source is an arduous and bug-prone process.
This repository is a highly modular template to build any version of PyTorch from source on any version of CUDA. It provides an easy-to-use Dockerfile that can be integrated into any Linux-based image or project.
For researchers unfamiliar with Docker, the generated wheel files, located in /tmp/dist/, can be extracted to install PyTorch on their local environments. Windows users may also use this project via WSL.
A Makefile is provided both as an interface for easy use and as a tutorial for building custom images. A docker-compose.yaml file is also provided as a simple MLOps system. It provides a convenient interactive development experience using Docker. See the Docker Compose documentation to get started with Docker Compose on your system.
The speed gains from this template come from the following factors:
- Using the latest version of CUDA and associated libraries (cuDNN, cuBLAS, etc.).
- Using a source build made specifically for the target machine with the latest software customizations instead of a build that must be compatible with different hardware and software environments.
- Using the latest version of PyTorch and subsidiary libraries. Many users do not update their PyTorch version because of compatibility issues with their pre-existing environment.
- Informing users on where to look for solutions to their speed problems (this may be the most important factor).
Combined with techniques such as AMP and cuDNN benchmarking, computational throughput can be increased dramatically (possibly 10x) on the same hardware.
Even if you do not wish to use Docker in your project, you may still find this template useful.
The wheel files generated by the build can be used in any Python environment with no dependency on Docker.
This project can thus be used to generate custom wheel files, providing dramatic compute speedups for any environment (conda, pip, etc.).
Quickstart
This project is a template, and users are expected to customize it to fit their needs. Users are free to customize the train stage of the Dockerfile as they please. However, do not change the build stages unless absolutely necessary as this will cause a build cache miss. If a new package must be built, add a new build layer.
The code is assumed to be running on a Linux host with the necessary NVIDIA Drivers and a recent version of Docker & Docker Compose V2 pre-installed. If this is not the case, install these first. Older versions may not be compatible with this project. The NVIDIA drivers are especially prone to error. Please check the compatibility matrix to verify that your driver version is compatible with your GPU hardware and the CUDA version of the image.
To build a training image, first edit the Dockerfile train stage and requirements.txt file to include desired packages from apt/conda/pip.
Then, visit https://developer.nvidia.com/cuda-gpus to find the Compute Capability (CC) of the target GPU device.
Finally, run make all CC=TARGET_CC(s).
Examples
(1) make all CC="8.6" for RTX 3090, (2) make all CC="7.5 8.6" for both RTX 2080Ti and RTX 3090 (building for many GPU CCs will lengthen build times).
This will result in an image, pytorch_source:train, which can be used for training. Note that the image can be built for the CCs of devices that are not available during the build. For example, if the image must be used on an RTX 2080Ti machine but the user only has an RTX 3090, the user can set CC="7.5" to enable the image to operate on the RTX 2080Ti GPU. See https://pytorch.org/docs/stable/cpp_extension.html for an in-depth guide on how to set TORCH_CUDA_ARCH_LIST, which is specified by CC in the Makefile.
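Because a mistyped CC value is only discovered after a long build, it may help to check the list before calling make. The sketch below validates each value and prints a sorted, de-duplicated list. The helper name cc_list is hypothetical, and the accepted format (one or two digits, a dot, one digit) is an assumption that covers values such as 7.5 and 8.6 but not suffixes such as +PTX.

```shell
# Validate compute capabilities and print them as a sorted,
# de-duplicated, space-separated list suitable for CC="...".
cc_list() {
    [ "$#" -gt 0 ] || { echo "usage: cc_list CC [CC ...]" >&2; return 1; }
    for cc in "$@"; do
        case "$cc" in
            [0-9].[0-9]|[0-9][0-9].[0-9]) ;;  # e.g. 7.5 or 8.6
            *) echo "invalid compute capability: $cc" >&2; return 1 ;;
        esac
    done
    # LC_ALL=C keeps '.' as the decimal point for the numeric sort.
    printf '%s\n' "$@" | LC_ALL=C sort -n -u | paste -s -d ' ' -
}
```

For example, make all CC="$(cc_list 8.6 7.5)" would pass the two values from the examples above in ascending order.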
Makefile Explanation
The provided Makefile is designed to simplify the user experience. Many practitioners use custom shell scripts for their environment setup. However, this often leads to a clutter of script files that only the author knows how to use, and which even the author forgets after a while. The Makefile gathers all instructions and environment variables into a single file, making project management much simpler.
The first image to be created is pytorch_source:build_install, which contains all packages necessary for the build. The installation image is created separately to cache downloads.
The second image is pytorch_source:build_torch-$(PYTORCH_VERSION_TAG) (by default), which contains the wheels for PyTorch, TorchVision, TorchText, and TorchAudio. The second image exists merely to cache the build process artifacts. No programs or environment variables will be available, only the artifacts. Any attempt to run python or any other program in this image will therefore fail.
If you do not wish to use Docker and would like to only extract the .whl wheel files for a pip install on your environment, the generated wheel files can be found in the /tmp/dist directory.
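One possible way to copy the wheels out of the build image is sketched below. It is an illustration rather than part of this template: the helper name extract_wheels is hypothetical, and /bin/true is only a placeholder command, passed on the assumption that an image holding nothing but artifacts has no default command. The container is created but never started.

```shell
# Copy the contents of /tmp/dist from a build image into a host directory.
# $1: image name, $2: destination directory (default: ./dist).
extract_wheels() {
    image="$1"
    dest="${2:-./dist}"
    mkdir -p "$dest" || return 1
    # Create (but do not start) a container so its filesystem can be read.
    cid="$(docker create "$image" /bin/true)" || return 1
    docker cp "$cid:/tmp/dist/." "$dest"
    status=$?
    # Always remove the temporary container, even if the copy failed.
    docker rm "$cid" > /dev/null
    return "$status"
}
```

For instance, extract_wheels pytorch_source:build_torch-v1.9.1 ./dist followed by pip install ./dist/*.whl would install the wheels into the active environment, provided its Python version and platform match those used for the build.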
The final image is pytorch_source:train, which is the image to be used for actual training. It relies on the previous stages only for the build artifacts (wheels, etc.) and nothing else. This makes it very simple to create separate training images optimized for different environments and GPU devices.
Because PyTorch has already been built, the training image only needs to download the remaining apt/conda/pip packages. Caching is also implemented to speed up even this process.
Timezone Settings
International users may find this section helpful.
The train image has its timezone set by the TZ variable using the tzdata package. The default timezone is Asia/Seoul but this can be changed by specifying the TZ variable when calling make. Use IANA timezone names to specify the desired timezone.
Example: make all CC="8.6" TZ=America/Los_Angeles uses L.A. time on the training image.
N.B. Only the training image has timezone settings. The installation and build images do not use timezone information.
In addition, the training image has apt and pip installation URLs updated for Korean users. If you wish to speed up your installs, please find URLs optimized for your location, though the installation caches may make this unnecessary.
Specific PyTorch Version
PyTorch subsidiary libraries only work with matching versions of PyTorch.
To change the version of PyTorch, set the PYTORCH_VERSION_TAG, TORCHVISION_VERSION_TAG, TORCHTEXT_VERSION_TAG, and TORCHAUDIO_VERSION_TAG variables to matching versions.
The *_TAG variables must be GitHub tags or branch names of those repositories. Visit the GitHub repositories of each library to find the appropriate tags.
Example: To build on an RTX 3090 GPU with PyTorch 1.9.1, use the following command:
make all CC="8.6" PYTORCH_VERSION_TAG=v1.9.1 TORCHVISION_VERSION_TAG=v0.10.1 TORCHTEXT_VERSION_TAG=v0.10.1 TORCHAUDIO_VERSION_TAG=v0.9.1
The resulting image, pytorch_source:train, can be used for training with PyTorch 1.9.1 on GPUs with Compute Capability 8.6.
Multiple Training Images
To use multiple training images on the same host, give a different name to TRAIN_NAME, which has a default value of train.
New training images can be created without having to rebuild PyTorch if the same build image is used for different training images. Creating new training images takes only a few minutes.
This is useful for the following use cases.
- Allowing different users with different UID/GIDs to use separate training images.
- Using different versions of the final training image with different library installations and configurations.
- Using this template for multiple PyTorch projects, each with different libraries and settings.
For example, if pytorch_source:build_torch-v1.9.1 has already been built, Alice and Bob would use the following commands to create separate images.
Alice: make build-train CC="8.6" TORCH_NAME=build_torch-v1.9.1 PYTORCH_VERSION_TAG=v1.9.1 TORCHVISION_VERSION_TAG=v0.10.1 TORCHTEXT_VERSION_TAG=v0.10.1 TORCHAUDIO_VERSION_TAG=v0.9.1 TRAIN_NAME=train_alice
Bob: make build-train CC="8.6" TORCH_NAME=build_torch-v1.9.1 PYTORCH_VERSION_TAG=v1.9.1 TORCHVISION_VERSION_TAG=v0.10.1 TORCHTEXT_VERSION_TAG=v0.10.1 TORCHAUDIO_VERSION_TAG=v0.9.1 TRAIN_NAME=train_bob
This way, Alice's image would have her UID/GID while Bob's image would have his UID/GID. This procedure is necessary because training images have their users set during the build. Also, different users may install different libraries in their training images. Their environment variables and other settings may also be different.
Word of Caution
When using build images such as pytorch_source:build_torch-v1.9.1 as a build cache for creating new training images, the user must re-specify all build arguments (variables specified by ARG and ENV using --build-arg) of all previous layers.
Otherwise, the default values for these arguments will be given to the Dockerfile and a cache miss will occur because of the different input values. This will both waste time rebuilding previous layers and, more importantly, cause inconsistency in the training images due to environment mismatch.
This includes the docker-compose.yaml file as well. All arguments given to the Dockerfile during the build must be respecified. This includes default values present in the Makefile but not present in the Dockerfile, such as the version tags.
If Docker starts to rebuild layers that you have already built, suspect that build arguments have been specified incorrectly.
See https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#leverage-build-cache for more information.
Users must set BUILDKIT_INLINE_CACHE=1 during the image build to use it as a cache later. See https://docs.docker.com/engine/reference/commandline/build/#specifying-external-cache-sources for more information.
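The requirement to repeat every build argument can be made less error-prone with a small helper. The following is only a sketch under stated assumptions: build_arg_flags and rebuild_train are hypothetical names, the train stage name is taken from the text above, and the ARG names passed in are assumed to mirror the Makefile variables, which the actual Dockerfile may not do. The Makefile remains the supported interface.

```shell
# Turn KEY=VALUE pairs into repeated --build-arg flags.
build_arg_flags() {
    flags=""
    for kv in "$@"; do
        flags="$flags --build-arg $kv"
    done
    printf '%s\n' "${flags# }"
}

# Rebuild only the train stage, reusing a previously built image as cache.
# $1: cache image; remaining arguments: KEY=VALUE build arguments
# (values must not contain spaces in this sketch).
rebuild_train() {
    cache_image="$1"
    shift
    # The command substitution is intentionally unquoted so that
    # each flag and value becomes a separate word.
    DOCKER_BUILDKIT=1 docker build \
        --target train \
        --cache-from "$cache_image" \
        --build-arg BUILDKIT_INLINE_CACHE=1 \
        $(build_arg_flags "$@") \
        --tag pytorch_source:train .
}
```

A call such as rebuild_train pytorch_source:build_torch-v1.9.1 PYTORCH_VERSION_TAG=v1.9.1 TORCHVISION_VERSION_TAG=v0.10.1 shows the idea. If Docker still rebuilds earlier layers, an argument is probably missing or differs from the original build.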
Advanced Usage
The Makefile provides the *-full commands for advanced usage.
make all-full CC=YOUR_GPU_CC TRAIN_NAME=full will create pytorch_source:build_install-ubuntu18.04-cuda10.2-cudnn8-py3.9, pytorch_source:build_torch-$(PYTORCH_VERSION_TAG)-ubuntu18.04-cuda10.2-cudnn8-py3.9, and pytorch_source:full by default.
The default images shown above can be used for training/deployment on CUDA 10 devices such as the GTX 1080Ti.
Also, the *-clean commands are provided to check for cache reliance on previous builds.
Specific CUDA Version
Set CUDA_VERSION, CUDNN_VERSION, and MAGMA_VERSION to change CUDA versions. PYTHON_VERSION may also be changed if necessary.
This will create a build image that can be used as a cache to create training images with the build-train command.
Also, the extensive use of caching in the Dockerfile means that the second build is much faster than the first build. This may be advantageous if many images must be created for multiple PyTorch/CUDA versions.
Specific Linux Distro
CentOS and UBI images can be created with only minor edits to the Dockerfile. Read the Dockerfile for full instructions.
Set the LINUX_DISTRO and DISTRO_VERSION arguments afterwards.
Windows
Windows users may use this template by updating to Windows 11 and installing Windows Subsystem for Linux (WSL). WSL on Windows 11 gives a similar experience to using native Linux.
This project has been tested on Windows 11 WSL with the Windows CUDA driver and Docker Desktop for Windows. There is no need to install a separate WSL CUDA driver or Docker for Linux inside WSL.
N.B. Windows Security real-time protection causes significant slowdown if enabled. Disable any active antivirus programs on Windows for best performance. However, this will create obvious security risks.
Interactive Development & MLOps with Docker Compose
Raison d'Être
The purpose of this section is to introduce a new paradigm for deep learning development. I hope that, eventually, using Docker Compose for deep learning development will become best practice.
Developing in local environments with conda or pip is commonplace in the deep learning community. However, this risks rendering the development environment, and the code meant to run on it, unreproducible. This is a serious detriment to scientific progress that many readers of this article will have experienced first-hand.
Docker containers are the standard method for providing reproducible programs across different computing environments. They create isolated environments where programs can run without interference from the host or from one another. See https://www.docker.com/resources/what-container for details.
But in practice, Docker containers are often misused. Containers are meant to be transient. Best practice dictates that a new container be created for each run. This, however, is very inconvenient for development, especially for deep learning applications, where new libraries must constantly be installed and bugs are often only evident at runtime. This leads many researchers to develop inside interactive containers. Docker users often have run.sh files with commands such as docker run -v my_data:/mnt/data -p 8080:22 -it --name my_container my_image:latest /bin/bash (look familiar, anyone?) and use SSH to connect to running containers. VSCode also provides a remote development mode to code inside containers.
The problem with this approach is that these interactive containers become just as unreproducible as local development environments. A running container cannot connect to a new port or attach a new volume. But if the computing environment within the container was created over several months of installs and builds, the only way to keep it is to save the container as an image and create a new container from the saved image. After a few iterations of this process, the resulting images become bloated and no less scrambled than the local environments that they were meant to replace.
Problems become even more evident when preparing for deployment. MLOps, defined as a set of practices that aims to deploy and maintain machine learning models reliably and efficiently, has gained enormous popularity of late as many practitioners have come to realize the importance of continuously maintaining ML systems long after the initial development phase ends.
However, bad practices such as those mentioned above mean that much coffee has been spilled turning research code into anything resembling a production-ready product. Often, even the original developers cannot retrain the same model after a few months. Many firms thus have entire teams dedicated to model translation, a huge expenditure.
To alleviate these problems, I propose the use of Docker Compose as a simple MLOps solution. Using Docker and Docker Compose, the entire training environment can be reproduced. Compose has not yet caught on in the deep learning community, possibly because it is usually advertised as a multi-container solution. This is a misunderstanding as it can be used for single-container development just as well.
A docker-compose.yaml file is provided for easy management of containers. Using the provided docker-compose.yaml file will create an interactive environment, providing a programming experience very similar to using a terminal on a remote server. Integrations with popular IDEs (PyCharm, VSCode) are also available. Moreover, it allows the user to specify settings for both build and run, removing the need to manage the environment with custom shell scripts. Connecting a new volume is as simple as removing the current container, adding a line in the docker-compose.yaml/Dockerfile file, then creating a new container from the same image. Build caches allow new images to be built very quickly, removing another barrier to Docker adoption, the long initial build time. For more information on Compose, visit the documentation.
Docker Compose can also be used directly for deployment with swarm mode, which is useful for small-scale deployments. See https://docs.docker.com/engine/swarm for documentation. If and when large-scale deployments using Kubernetes become necessary, having used Docker from the very beginning will accelerate the development process and smooth the path to MLOps adoption. Accelerating time-to-market by streamlining the development process is a competitive edge for any firm, whether lean startup or tech titan.
With luck, the techniques I propose here will enable the deep learning community to "write once, train anywhere". But even if I fail to persuade the majority of users of the merits of my method, I may still spare many a hapless grad student from the Sisyphean labor of setting up their conda environment, only to have it crash and burn right before their paper submission is due.
Usage
Docker images created by the Makefile are fully compatible with the docker-compose.yaml file. Do not erase them when using Docker Compose.
Initial Setup
If this is your first time using this project, follow these steps:
- Install Docker Compose V2 for Linux as described in https://docs.docker.com/compose/cli-command/#install-on-linux. Visit the website for the latest installation information. Installation does not require root permissions. Please check the version and architecture tags in the URL before installing. The following commands will install Docker Compose V2 (v2.1.0, Linux x86_64) for a single user.
mkdir -p ~/.docker/cli-plugins/
curl -SL https://github.com/docker/compose/releases/download/v2.1.0/docker-compose-linux-x86_64 -o ~/.docker/cli-plugins/docker-compose
chmod +x ~/.docker/cli-plugins/docker-compose
The instructions above are for Linux hosts. WSL users should instead enable "Use Docker Compose V2" on Docker Desktop for Windows.
- Run make env on the terminal to create a basic .env file containing the UID and GID. Environment variables can be saved in a .env file placed on the project root, allowing different projects and different users to set their own variables as required. Then read the docker-compose.yaml file to fill in extra variables. Also edit docker-compose.yaml as necessary for your project. Feel free to change session names, hostnames, etc. for different projects and configurations.
- Run docker compose up -d --build train or docker compose up -d --build full. The train service corresponds to the default make all ... build while the full service corresponds to the make all-full ... build. If you have already run make all ... or make all-full ..., check that the docker-compose.yaml file has the same configurations as the make command used to create the Docker images. Otherwise, a cache miss will occur, rebuilding the image with the new configurations.
- After docker compose up -d --build SERVICE_NAME has finished, and if you have not yet run make all(-full) ..., run the make build with the same settings as the docker-compose.yaml and .env file settings. This will save the build cache as images, preventing them from being cleared by the system later on. If no cache miss occurs, this will only take a few minutes.
- Run docker compose exec SERVICE_NAME zsh and start coding.
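For readers curious what a basic .env file holds, the sketch below writes the current user's UID and GID in the KEY=VALUE form shown in this README. It is an illustration under the assumption that make env records these two values; the actual Makefile target may do more, and write_env is a hypothetical name.

```shell
# Write the current user's UID and GID to an env file (default: .env).
write_env() {
    env_file="${1:-.env}"
    printf 'UID=%s\nGID=%s\n' "$(id -u)" "$(id -g)" > "$env_file"
}
```

Running write_env .env in the project root overwrites any existing .env file, so extra variables such as CC should be added afterwards.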
General Usage
Using Docker Compose V2 (see https://docs.docker.com/compose/cli-command), run the following two commands, where train is the default service name in the provided docker-compose.yaml file.
- Read docker-compose.yaml and set variables in the .env file (first time only).
- docker compose up -d train
- docker compose exec train zsh
This will open an interactive shell with settings specified by the train service in the docker-compose.yaml file.
Example .env file for RTX 3090 GPUs:
UID=1000
GID=1000
CC=8.6
This is extremely convenient for managing reproducible development environments. For example, if a new pip or apt package must be installed for the project, users can simply edit the train layer of the Dockerfile by adding the package to the apt-get install or pip install commands, then run the following command:
docker compose up -d --build train
This will remove the current train session, rebuild the image, and start a new train session. It will not, however, rebuild PyTorch (assuming no cache miss occurs). Users thus need only wait a few minutes for the additional downloads, which are accelerated by caching and fast mirror URLs.
To stop and restart a service after editing the Dockerfile or docker-compose.yaml file, simply run docker compose up -d --build train again.
To stop services and remove containers, use the following command:
docker compose down
Users with remote servers may use Docker contexts (see https://docs.docker.com/engine/context/working-with-contexts) to access their containers from their local environments. For more information on Docker Compose, see https://github.com/compose-spec/compose-spec/blob/master/spec.md. For more information on Docker Compose CLI commands, see https://docs.docker.com/compose/reference.
Also, if an error occurs because BuildKit is not available, add COMPOSE_DOCKER_CLI_BUILD=1 DOCKER_BUILDKIT=1 to any docker compose commands being used.
Tip
The .env file does not work with the Makefile by default. However, typing in the configurations for each run can be tedious. To use the .env file for the make commands, use the following technique to give all the variables in the .env file to the make command.
make COMMAND $(tr '\n' ' ' < .env)
Example: make all-full $(tr '\n' ' ' < .env)
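If the .env file contains comments or blank lines, passing it through tr alone would hand those lines to make as well. A slightly more defensive variant is sketched below; env_args is a hypothetical helper, and the sketch assumes one KEY=VALUE pair per line with no spaces inside values.

```shell
# Print the KEY=VALUE lines of an env file on a single line,
# skipping blank lines and lines that start with '#'.
env_args() {
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "${1:-.env}" | tr '\n' ' '
}
```

Usage would then be make all-full $(env_args .env). A value containing spaces, such as a CC list for two GPUs, would be split into separate words by the shell and should be passed on the command line instead.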
Compose as Best Practice
Docker Compose is a far superior option to using custom shell scripts for each environment. Not only does it gather all variables and commands for both build and run into a single file, but its native integration with Docker means that it makes complicated Docker build/run setups simple to implement.
I wish to emphasize that using Docker Compose this way is a general-purpose technique that does not depend on anything about this project. As an example, an image from the NVIDIA NGC PyTorch repository has been used as the base image in ngc.Dockerfile. The NVIDIA NGC PyTorch images contain many optimizations for the latest GPU architectures and provide a multitude of pre-installed machine learning libraries. For those starting new projects, and thus with no dependencies, using the latest NGC image is recommended.
To use the NGC images, use the following commands:
docker compose up -d ngc
docker compose exec ngc zsh
The only difference from the previous example is the session name.
Using Compose with PyCharm and VSCode
The Docker Compose container environment can be used with popular Python IDEs, not just in the terminal. PyCharm and Visual Studio Code, both very popular in the deep learning community, are compatible with Docker Compose.
- If you are using a remote server, first create a Docker context to connect your local Docker with the remote Docker.
- PyCharm (Professional only): Both Docker and Docker Compose are natively available as Python interpreters. See tutorials for Docker and Compose for details. JetBrains Gateway can also be used to connect to running containers. JetBrains Fleet IDE, with much more advanced features, will become available in early 2022. N.B. PyCharm Professional and other JetBrains IDEs are available free of charge to anyone with a valid university e-mail address.
- VSCode: Install the Remote Development extension pack. See the tutorial for details.
Known Issues
- Connecting to a running container by ssh will remove all variables set by ENV. This is because sshd starts a new environment, wiping out all previous variables. Using docker/docker compose to enter containers is strongly recommended.
- Building on CUDA 11.4.x is not available as of December 2021 because magma-cuda114 has not been released on the pytorch channel of Anaconda. Bizarrely, magma-cuda115 is available. Users may attempt building with older versions of magma-cuda or try the version available on conda-forge. A source build of magma would be welcome as a pull request. The NVIDIA NGC images use NVIDIA's in-house build of magma.
- The Ubuntu 16.04 build fails because the default git installed by apt on Ubuntu 16.04 does not support the --jobs flag. Remove the --jobs 0 argument from the git clone commands to make it work. Also, PyTorch v1.9+ may not build on Ubuntu 16.04. Lower the version tag to v1.8.2 to build. This project will not be modified to accommodate Ubuntu 16.04 builds as Xenial Xerus has already reached EOL.
- If the Docker Compose build fails with an error message that BuildKit is required, add COMPOSE_DOCKER_CLI_BUILD=1 DOCKER_BUILDKIT=1 in front of the command. This issue occurs because Docker Compose V2 is not configured to use BuildKit on the host by default. One can tell if BuildKit is enabled by checking if the terminal outputs are in color. BuildKit outputs are colored blue, whereas the old Docker has no color. Example command: COMPOSE_DOCKER_CLI_BUILD=1 DOCKER_BUILDKIT=1 docker compose up -d --build train
- WSL users using Compose must disable ipc: host. WSL cannot use this option.
- torch.cuda.is_available() will emit a ... UserWarning: CUDA initialization: ... message, or the image will simply not start, if the CUDA driver on the host is incompatible with the CUDA version on the Docker image. Either upgrade the host CUDA driver or downgrade the CUDA version of the image. Check the compatibility matrix to see if the host CUDA driver is compatible with the desired version of CUDA. Also check if the CUDA driver has been configured correctly on the host. The CUDA driver version can be found using the nvidia-smi command.
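The driver check in the last item can be scripted before starting a long build. The sketch below compares the host driver version against a minimum that the reader looks up in the compatibility matrix. The helper names are hypothetical, nvidia-smi is assumed to be on the PATH, and sort -V is assumed to be available, as it is in GNU coreutils.

```shell
# Succeed if version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort) to order dotted version strings.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Print the driver version reported for the first GPU.
host_driver_version() {
    nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1
}
```

With MIN_DRIVER set to the value from the compatibility matrix for the image's CUDA version, version_ge "$(host_driver_version)" "$MIN_DRIVER" || echo "host driver is too old for this image" gives an early warning. MIN_DRIVER is a placeholder; no specific minimum is implied here.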
Desiderata
- MORE STARS. If you are reading this, please star this repository immediately. No Contribution Without Appreciation!
- CentOS and UBI images have not been implemented yet. As they require only simple modifications, pull requests implementing them are very much welcome.
- Translations into other languages and updates to existing translations are welcome. Please make a separate LANG.README.md file and create a PR.
- A method to build magma from source would be greatly appreciated. Although the code for building the magma package is available at https://github.com/pytorch/builder/tree/main/magma, it is updated several months after a new CUDA version is released. A source build as a layer on the image would be welcome.
- Please feel free to share this project! I wish you good luck and happy coding!