A complete, self-contained example for training ImageNet at state-of-the-art speed with FFCV

FFCV

Last update: Dec 31, 2022


Overview

`ffcv` ImageNet Training

A minimal, single-file PyTorch ImageNet training script designed for hackability. Run train_imagenet.py to get...

...high accuracies on ImageNet
...with as many lines of code as the PyTorch ImageNet example
...in 1/10th the time.

Results

Train models more efficiently, either with 8 GPUs in parallel or by training 8 ResNet-18s at once.

See benchmark setup here: https://docs.ffcv.io/benchmarks.html.

Citation

If you use this setup in your research, cite:

@misc{leclerc2022ffcv,
    author = {Guillaume Leclerc and Andrew Ilyas and Logan Engstrom and Sung Min Park and Hadi Salman and Aleksander Madry},
    title = {ffcv},
    year = {2022},
    howpublished = {\url{https://github.com/libffcv/ffcv/}},
    note = {commit xxxxxxx}
}

Configurations

The configuration files corresponding to the above results are:

Link to Config	top_1	top_5	# Epochs	Time (mins)	Architecture	Setup
Link	0.784	0.941	88	77.2	ResNet-50	8 x A100
Link	0.780	0.937	56	49.4	ResNet-50	8 x A100
Link	0.772	0.932	40	35.6	ResNet-50	8 x A100
Link	0.766	0.927	32	28.7	ResNet-50	8 x A100
Link	0.756	0.921	24	21.7	ResNet-50	8 x A100
Link	0.738	0.908	16	14.9	ResNet-50	8 x A100
Link	0.724	0.903	88	187.3	ResNet-18	1 x A100
Link	0.713	0.899	56	119.4	ResNet-18	1 x A100
Link	0.706	0.894	40	85.5	ResNet-18	1 x A100
Link	0.700	0.889	32	68.9	ResNet-18	1 x A100
Link	0.688	0.881	24	51.6	ResNet-18	1 x A100
Link	0.669	0.868	16	35.0	ResNet-18	1 x A100

Training Models

First pip install the requirements file in this directory:

pip install -r requirements.txt

Then, generate an ImageNet dataset. To build the dataset used for the results above, run the following commands (IMAGENET_DIR should point to a PyTorch-style ImageNet dataset):

# Required environment variables for the script:
export IMAGENET_DIR=/path/to/pytorch/format/imagenet/directory/
export WRITE_DIR=/your/path/here/

# Starting in the root of the Git repo:
cd examples;

# Serialize images with:
# - 500px side length maximum
# - 50% JPEG encoded, 50% raw pixel values
# - quality=90 JPEGs
./write_dataset.sh 500 0.50 90

Then, choose a configuration from the configuration table. With the config file path in hand, train as follows:

# 8 GPU training (use only 1 for ResNet-18 training)
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# Set the visible GPUs according to the `world_size` configuration parameter
# Modify `data.in_memory` and `data.num_workers` based on your machine
python train_imagenet.py --config-file rn50_configs/<your config file>.yaml \
    --data.train_dataset=/path/to/train/dataset.ffcv \
    --data.val_dataset=/path/to/val/dataset.ffcv \
    --data.num_workers=12 --data.in_memory=1 \
    --logging.folder=/your/path/here

Adjust the configuration either by editing the YAML file you pass in or by specifying arguments on the command line via fastargs (this is how the dataset paths were passed above).
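To make the override behaviour concrete, here is a toy sketch of how dotted command-line arguments such as `--data.num_workers=12` can be layered on top of values loaded from a YAML file. It is not fastargs's implementation (fastargs does its own parsing and type checking); `apply_overrides` and the starting values are hypothetical.

```python
import copy


def apply_overrides(config, cli_args):
    """Return a copy of `config` with `--section.key=value` overrides applied.

    Toy illustration only; values are kept as strings.
    """
    merged = copy.deepcopy(config)
    for arg in cli_args:
        if not arg.startswith("--") or "=" not in arg:
            raise ValueError(f"expected --section.key=value, got {arg!r}")
        dotted, value = arg[2:].split("=", 1)
        section, key = dotted.split(".", 1)
        merged.setdefault(section, {})[key] = value
    return merged


# Values as they might come from a YAML config (hypothetical numbers).
yaml_config = {"data": {"num_workers": "4", "in_memory": "0"}}
merged = apply_overrides(
    yaml_config,
    ["--data.num_workers=12", "--data.in_memory=1", "--logging.folder=/your/path/here"],
)
print(merged)
```

Command-line values win over the file, and sections that the file does not mention are created on demand.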

Training Details

System setup. We trained on p4.24xlarge EC2 instances (8 A100s).

Dataset setup. A larger side length generally improves accuracy but decreases throughput:

ResNet-50 training: 50% JPEG 500px side length
ResNet-18 training: 10% JPEG 400px side length

Algorithmic details. We use a standard ImageNet training pipeline (à la the PyTorch ImageNet example) with only the following differences/highlights:

SGD optimizer with momentum and weight decay on all non-batchnorm parameters
Test-time augmentation over left/right flips
Progressive resizing from 160px to 192px: 160px training until 75% of the way through training (by epochs), then 192px until the end of training.
Validation set sizing according to "Fixing the train-test resolution discrepancy": 224px at test time.
Label smoothing
Cyclic learning rate schedule
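The progressive resizing item above can be sketched as a function from epoch to training resolution. This follows the wording of the list (160px for the first 75% of epochs, then 192px); the training script may instead ramp between the two sizes, so treat `resolution_for_epoch` and its parameter names as hypothetical and check the configuration files for the exact behaviour.

```python
def resolution_for_epoch(epoch, total_epochs, low_res=160, high_res=192, switch_frac=0.75):
    """Training resolution for a 0-indexed epoch under a two-stage schedule."""
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch must be in [0, total_epochs)")
    # First epoch that trains at the higher resolution.
    switch_epoch = int(switch_frac * total_epochs)
    return low_res if epoch < switch_epoch else high_res


schedule = [resolution_for_epoch(e, 16) for e in range(16)]
print(schedule)
```

For a 16-epoch run this gives twelve epochs at 160px followed by four at 192px.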

Refer to the code and configuration files for a more exact specification. To obtain the configurations, we first ran a grid search over hyperparameters on a 30-epoch schedule. Holding these parameters fixed, we then varied only the number of epochs (stretching the learning rate schedule across the number of epochs, as motivated by Budgeted Training) and plotted the results above.
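As an illustration of stretching a schedule across different epoch budgets, the sketch below uses a single triangular cycle: a linear ramp up to a peak, then a linear decay to zero at the final epoch. The function name, the peak value of 1.7, and the choice to keep the peak at epoch 2 for every budget are assumptions made for the example, not values taken from the configuration files.

```python
def cyclic_lr(epoch, total_epochs, peak_lr, peak_epoch):
    """Piecewise-linear schedule: 0 -> peak_lr at peak_epoch -> 0 at total_epochs."""
    if not 0 < peak_epoch < total_epochs:
        raise ValueError("peak_epoch must lie strictly inside the schedule")
    if epoch <= peak_epoch:
        return peak_lr * epoch / peak_epoch
    return peak_lr * (total_epochs - epoch) / (total_epochs - peak_epoch)


# Same peak, different budgets: only the decay phase is stretched.
for total in (16, 88):
    lrs = [cyclic_lr(e, total, peak_lr=1.7, peak_epoch=2) for e in range(total + 1)]
    print(total, [round(lr, 3) for lr in lrs[:4]], "...", lrs[-1])
```

Fractional epochs are accepted, so the same function can be evaluated once per iteration rather than once per epoch.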

FAQ

Why is the first epoch slow?

The first epoch can be slow if the dataset hasn't been cached in memory yet.

What if I can't fit my dataset in memory?

See the parameter tuning guide at https://docs.ffcv.io/parameter_tuning.html (the "Large scale datasets" scenario).
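As a rough aid for choosing the `data.in_memory` setting, the sketch below compares a dataset file size against physical RAM. The helper name and the 80% headroom factor are assumptions for illustration; the sizes are example figures, not a recommendation.

```python
def should_cache_in_memory(dataset_bytes, ram_bytes, headroom=0.8):
    """True if the dataset fits within `headroom` of physical RAM."""
    return dataset_bytes <= headroom * ram_bytes


GB = 10**9
# Illustrative sizes: a 337 GB and a 15 GB dataset file on a machine with 46 GB of RAM.
print(should_cache_in_memory(337 * GB, 46 * GB))  # False
print(should_cache_in_memory(15 * GB, 46 * GB))   # True
```

When the check fails, passing `--data.in_memory=0` is the corresponding choice on the command line.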

Other questions

Please open up a GitHub discussion for non-bug related questions; if you find a bug please report it on GitHub issues.

Comments

Imagenet dataset preparation size
In an attempt to replicate results as a sanity test, I ran the data preparation script as ./write_imagenet.sh 500 0.50 90 in its default configuration on Imagenet dataset. I can see from the documentation provided at https://docs.ffcv.io/benchmarks.html that initializing the writer with RGBImageField(write_mode=proportion, compress_probability=0.5, max_resolution= 512, jpeg_quality=90) should generate a dataset of size 202.04 GB. However when I ran this myself, I got a train dataset size 337 GB and val 15 GB.

I am wondering if the compress_probability value used in the documentation at https://docs.ffcv.io/benchmarks.html was higher than 0.5, which leads to a smaller dataset size than I got? It's a little unclear why I have a 40% larger dataset using similar configuration values.

I'm also a bit confused with the comment below, as per my understanding using prob=0.5 means that you use JPEG encoding for 50% of the images, and raw pixel values for 50% of the images (not 90%?)

# Serialize images with:
# - 500px side length maximum
# - 50% JPEG encoded, 90% raw pixel values
# - quality=90 JPEGs
./write_imagenet.sh 500 0.50 90
opened by aniketrege 8
ImportError: libopencv_imgproc.so
Hi, After downgrading torch (and torchvision) version from 1.10 to 1.9, the import ffcv command raises ImportError:

File "/my_home/miniconda3/envs/test/lib/python3.9/site-packages/ffcv/libffcv.py", line 5, in <module>
    import ffcv._libffcv
ImportError: libopencv_imgproc.so.405: cannot open shared object file: No such file or directory

To reproduce:

conda create -y -n test python=3.9 cupy pkg-config compilers libjpeg-turbo opencv pytorch torchvision cudatoolkit=11.3 numba -c pytorch -c conda-forge
conda activate test
pip install ffcv

Running python and importing ffcv works fine at this point. But if we try to reinstall pytorch for 1.9 version with:

conda install pytorch==1.9.0 torchvision==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge

the import ffcv command breaks. Reinstalling ffcv doesn't help.

Do you have an idea how to resolve this? (installing 1.9 from the beginning works fine and doesn't break ffcv, but I guess that shouldn't be the preferred way of solving this issue)
opened by eldarkurtic 4
Question about the parameter of the `write_imagenet.py`
@lengstrom Thanks for your wonderful work!

I have a question about the parameters of the write_imagenet.py.

From the repo of ffcv, we can see https://github.com/libffcv/ffcv/blob/bfd9b3d85e31360fada2ecf63bea5602e4774ba3/ffcv/fields/rgb_image.py#L337

write_mode = self.write_mode
as_jpg = None

if write_mode == 'smart':
    as_jpg = encode_jpeg(image, self.jpeg_quality)
    write_mode = 'raw'
    if self.smart_threshold is not None:
        if image.nbytes > self.smart_threshold:
            write_mode = 'jpg'
elif write_mode == 'proportion':
    if np.random.rand() < self.proportion:
        write_mode = 'jpg'
    else:
        write_mode = 'raw'

The default write mode in https://github.com/libffcv/ffcv-imagenet/blob/main/write_imagenet.py is smart, and the smart_threshold is None. So the script is actually running in RAW write mode?
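The branch logic quoted above can be isolated into a small pure function to see what each mode does. This sketch mirrors only the quoted snippet: `choose_write_mode` is a hypothetical name, the byte count is arbitrary, and whether the script really passes a threshold of `None` has to be checked in `write_imagenet.py` itself.

```python
import random


def choose_write_mode(write_mode, image_nbytes, smart_threshold=None, proportion=0.5, rng=random):
    """Return 'jpg' or 'raw' following the quoted branch logic."""
    if write_mode == "smart":
        # Raw by default; JPEG only if a threshold is set and the image exceeds it.
        if smart_threshold is not None and image_nbytes > smart_threshold:
            return "jpg"
        return "raw"
    if write_mode == "proportion":
        # Each image is independently JPEG-encoded with probability `proportion`.
        return "jpg" if rng.random() < proportion else "raw"
    return write_mode  # 'raw' or 'jpg' pass through unchanged


print(choose_write_mode("smart", 750_000, smart_threshold=None))

rng = random.Random(0)
modes = [choose_write_mode("proportion", 750_000, proportion=0.5, rng=rng) for _ in range(10_000)]
jpg_fraction = modes.count("jpg") / len(modes)
print(f"fraction written as JPEG: {jpg_fraction:.3f}")
```

Under this logic, `smart` mode without a threshold always yields raw pixels, and `proportion` mode with 0.5 splits images roughly evenly between JPEG and raw.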

Related issues are https://github.com/libffcv/ffcv-imagenet/issues/1
opened by mzhaoshuai 2
Reproducing Validation Numbers

In an attempt to replicate your numbers, we trained for 40 epochs on a single A100 GPU with the ffcv dataset files generated from the provided bash script, using the config specified in rn50_40_epochs.yaml.

After training for ~5 hours, we observed top1=0.729 and top5=0.915, in contrast to your quoted numbers of 0.772 and 0.932 from the configuration table in the README. The primary difference was that we used 1xA100 instead of the 8xA100 you used, and we observed a total training time roughly 8x what you quote (35.6 minutes for 8xA100).

I don't believe that using a single GPU instead of 8 should impact validation accuracy to this extent (4.3 percentage points for top-1 and 1.7 for top-5). Could you suggest why this might be happening, or whether it is indeed due to using a single A100 GPU instead of 8?

opened by aniketrege 1

A complete example for imagenet data loading

I've been trying to use your FFCV data loader for imagenet training. I find the provided example hard to follow as you use progressive resizing. I wonder if you could provide a complete example with the most commonly used resolution 224.

I have also coded it up myself, but I found the validation accuracy is significantly lower than the training accuracy in my case (see attached code snippet below). For example, after 3 epochs, the training ACC is around 40%, but the validation is only 15%.

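# Imports the snippet relies on (assumed module paths; verify against your FFCV version).
from pathlib import Path
from typing import List

import numpy as np
from ffcv.fields.basics import IntDecoder
from ffcv.fields.rgb_image import CenterCropRGBImageDecoder, RandomResizedCropRGBImageDecoder
from ffcv.loader import Loader, OrderOption
from ffcv.pipeline.operation import Operation
from ffcv.transforms import (NormalizeImage, RandomHorizontalFlip, Squeeze,
                             ToDevice, ToTensor, ToTorchImage)

# IMAGENET_MEAN and IMAGENET_STD are assumed to be defined elsewhere by the poster.

def get_ffcv_trainloader(train_dataset, device, batch_size, num_workers=12, in_memory=True):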
    train_path = Path(train_dataset)
    assert train_path.is_file()

    decoder = RandomResizedCropRGBImageDecoder((224, 224))
    image_pipeline: List[Operation] = [
        decoder,
        RandomHorizontalFlip(),
        ToTensor(),
        ToDevice(device, non_blocking=True),
        ToTorchImage(),
        NormalizeImage(IMAGENET_MEAN, IMAGENET_STD, np.float32)
    ]

    label_pipeline: List[Operation] = [
        IntDecoder(),
        ToTensor(),
        Squeeze(),
        ToDevice(device, non_blocking=True)
    ]

    order = OrderOption.QUASI_RANDOM
    loader = Loader(train_dataset,
                    batch_size=batch_size,
                    num_workers=num_workers,
                    order=order,
                    os_cache=in_memory,
                    drop_last=True,
                    pipelines={
                        'image': image_pipeline,
                        'label': label_pipeline
                    })

    return loader


def get_ffcv_valloader(val_dataset, device, batch_size, num_workers=12):
    val_path = Path(val_dataset)
    assert val_path.is_file()
    cropper = CenterCropRGBImageDecoder((224, 224), ratio=224/256)
    image_pipeline = [
        cropper,
        ToTensor(),
        ToDevice(device, non_blocking=True),
        ToTorchImage(),
        NormalizeImage(IMAGENET_MEAN, IMAGENET_STD, np.float32)
    ]

    label_pipeline = [
        IntDecoder(),
        ToTensor(),
        Squeeze(),
        ToDevice(device, non_blocking=True)
    ]

    loader = Loader(val_dataset,
                    batch_size=batch_size,
                    num_workers=num_workers,
                    order=OrderOption.SEQUENTIAL,
                    drop_last=False,
                    pipelines={
                        'image': image_pipeline,
                        'label': label_pipeline
                    })
    return loader

opened by gd-zhang 1

Potential minor speedup: Gaussian blur is a separable 2d convolution
https://github.com/libffcv/ffcv-imagenet/blob/e97289fdacb4b049de8dfefefb250cc35abb6550/train_imagenet.py#L124

Not to be nitpicky, but this could actually be replaced with two "1d" convolutions, one for width and one for height, which would use ~2K operations instead of ~K^2:

import torch.nn.functional as F
from torch import Tensor

def separable_conv2d(inputs: Tensor, k_h: Tensor, k_w: Tensor) -> Tensor:
    kernel_size = max(k_h.shape[-2:])
    pad_amount = kernel_size // 2  # 'same' padding.
    # Gaussian filter is separable: pad and filter along the width, then along the height.
    out_1 = F.conv2d(inputs, k_h, padding=(0, pad_amount))
    out_2 = F.conv2d(out_1, k_w, padding=(pad_amount, 0))
    return out_2
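The separability claim can be checked without any deep learning library. The sketch below, in plain Python on a small synthetic image, compares a direct 2D filter (up to K*K multiplies per pixel) with a row pass followed by a column pass (up to 2*K multiplies per pixel). All names, the kernel size, and the test image are made up for the illustration.

```python
import math


def gaussian_1d(size, sigma):
    """Normalized 1D Gaussian kernel of odd length `size`."""
    center = size // 2
    weights = [math.exp(-((i - center) ** 2) / (2 * sigma ** 2)) for i in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def conv2d_full(image, kernel):
    """Direct 2D filtering with zero 'same' padding."""
    h, w, k = len(image), len(image[0]), len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy, xx = y + i - pad, x + j - pad
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


def conv2d_separable(image, g):
    """Row pass, then column pass, with the same 1D kernel and zero padding."""
    h, w, k = len(image), len(image[0]), len(g)
    pad = k // 2
    rows = [[sum(g[j] * image[y][x + j - pad] for j in range(k) if 0 <= x + j - pad < w)
             for x in range(w)] for y in range(h)]
    return [[sum(g[i] * rows[y + i - pad][x] for i in range(k) if 0 <= y + i - pad < h)
             for x in range(w)] for y in range(h)]


g = gaussian_1d(5, sigma=1.0)
kernel_2d = [[a * b for b in g] for a in g]  # outer product of the 1D kernel with itself
image = [[float((3 * y + 7 * x) % 11) for x in range(8)] for y in range(6)]

full = conv2d_full(image, kernel_2d)
sep = conv2d_separable(image, g)
max_diff = max(abs(a - b) for ra, rb in zip(full, sep) for a, b in zip(ra, rb))
print(f"max difference: {max_diff:.2e}")
```

The two results agree up to floating-point rounding, which is what makes the two-pass replacement safe for a Gaussian kernel.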
opened by lebrice 1

batch_size=1 causes error when Squeeze() is in the "label" pipeline

Exception in thread Thread-6:
Traceback (most recent call last):
  File "/home/cbotos/miniconda3/envs/ffcv/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/cbotos/miniconda3/envs/ffcv/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 79, in run
    result = self.run_pipeline(b_ix, ixes, slot, events[slot])
  File "/home/cbotos/miniconda3/envs/ffcv/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 138, in run_pipeline
    return tuple(x[:len(batch_indices)] for x in args)
  File "/home/cbotos/miniconda3/envs/ffcv/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 138, in <genexpr>
    return tuple(x[:len(batch_indices)] for x in args)
IndexError: slice() cannot be applied to a 0-dim tensor.

I would say that this error is sorta unexpected, but I could have anticipated it since the Squeeze is also squishing the batch dimension in this case (if I understood the situation correctly)
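The shape arithmetic behind this can be sketched without tensors. The helper below models only how an unrestricted squeeze (as in NumPy or PyTorch) treats shapes, assuming the label batch has shape `(batch_size, 1)`; whether FFCV's `Squeeze` accepts a dimension argument is not stated here and should be checked in the library.

```python
def squeeze_shape(shape, dim=None):
    """Shape after a squeeze: drop every size-1 axis, or only `dim` if given."""
    if dim is None:
        return tuple(s for s in shape if s != 1)
    dim = dim % len(shape)
    return tuple(s for i, s in enumerate(shape) if not (i == dim and s == 1))


for batch_size in (4, 1):
    label_shape = (batch_size, 1)  # one integer label per sample
    print(batch_size, squeeze_shape(label_shape), squeeze_shape(label_shape, dim=-1))
```

With a batch of one, the unrestricted squeeze leaves a 0-dimensional result, which cannot be sliced; squeezing only the last axis keeps the batch dimension.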

opened by botcs 1

ValueError(“total size of new array must be changed”)

Hi, I'm trying to train a model by following the guide, but I get a ValueError and a SystemError when I run the following command:

python train_imagenet.py --config-file rn50_configs/<your config file>.yaml \
    --data.train_dataset=/path/to/train/dataset.ffcv \
    --data.val_dataset=/path/to/val/dataset.ffcv \
    --data.num_workers=12 --data.in_memory=1 \
    --logging.folder=/your/path/here

How can I solve this? Thanks in advance for your replies.

opened by cindy-cheng0214 0

Training extremely slow

Hello,

I followed closely the README and launched a training using the following command on a server with 8 V100 GPUs:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python train_imagenet.py --config-file rn50_configs/rn50_88_epochs.yaml \
    --data.train_dataset=$HOME/data/imagenet_ffcv/train_500_0.50_90.ffcv \
    --data.val_dataset=$HOME/data/imagenet_ffcv/val_500_0.50_90.ffcv \
    --data.num_workers=3 --data.in_memory=1 \
    --logging.folder=$HOME/experiments/ffcv/rn50_88_epochs

Training took almost an hour per epoch, and the second epoch is almost as slow as the first one. The output of the log file is as follows:

cat ~/experiments/ffcv/rn50_88_epochs/d9ef0d7f-17a3-4e57-8d93-5e7c9a110d66/log 
{"timestamp": 1650641704.0822473, "relative_time": 2853.3256430625916, "current_lr": 0.8473609134615385, "top_1": 0.07225999981164932, "top_5": 0.19789999723434448, "val_time": 103.72948884963989, "train_loss": null, "epoch": 0}
{"timestamp": 1650644358.3394542, "relative_time": 5507.582849979401, "current_lr": 1.6972759134615385, "top_1": 0.16143999993801117, "top_5": 0.3677400052547455, "val_time": 92.9171462059021, "train_loss": null, "epoch": 1}
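A quick way to read such a log is to turn the cumulative `relative_time` values into per-epoch durations. The sketch below does this for the two entries above, keeping only the fields it uses; it assumes `relative_time` is in seconds since the start of the run and includes validation, which the log itself does not state.

```python
import json


def epoch_durations(lines):
    """(epoch, seconds elapsed, validation seconds) from cumulative relative_time."""
    records = sorted((json.loads(line) for line in lines if line.strip()),
                     key=lambda rec: rec["epoch"])
    durations, previous = [], 0.0
    for rec in records:
        durations.append((rec["epoch"], rec["relative_time"] - previous, rec["val_time"]))
        previous = rec["relative_time"]
    return durations


log_lines = [
    '{"relative_time": 2853.3256430625916, "val_time": 103.72948884963989, "epoch": 0}',
    '{"relative_time": 5507.582849979401, "val_time": 92.9171462059021, "epoch": 1}',
]
durations = epoch_durations(log_lines)
for epoch, total_s, val_s in durations:
    print(f"epoch {epoch}: {total_s / 60:.1f} min total, {val_s / 60:.1f} min validation")
```

Under that assumption, each epoch takes roughly 45 minutes, of which under two minutes is validation.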

Is there anything I should check?

Thank you in advance for your response.

opened by netw0rkf10w 4

How to enable Multi-GPU training (1 model, multiple GPUs) under the server with limited memory?
Description

Hi, @lengstrom . Thanks for your wonderful work!

My goal is to run a ResNet18 under ImageNet on my server using a multi-GPU training strategy to speed up the training process. The server has 4 RTX 2080 Ti GPUs with a 46G memory, which is not large enough to load ImageNet into the memory.

I have read the instructions on https://docs.ffcv.io/parameter_tuning.html (Scenario: Large scale datasets and Scenario: Multi-GPU training (1 model, multiple GPUs)).

Right now, I can run a ResNet18 on a single card by using os_cache=False. However, if I use in_memory=0 and distributed = 1 to run the provided train_imagenet.py code as follows, some errors are reported, which are listed at the bottom. Would you please tell me how to solve this issue?

Command

python train_imagenet.py --config-file rn18_configs/rn18_16_epochs.yaml \
    ... \
    --data.in_memory=0 \
    --training.distributed=1

Message

Warning: no ordering seed was specified with distributed=True. Setting seed to 0 to match PyTorch distributed sampler.

=> Logging in ...

Not enough memory; try setting quasi-random ordering (OrderOption.QUASI_RANDOM) in the dataloader constructor's order argument.

Full error below:

0%| | 0/1251 [00:01<?, ?it/s]
Exception ignored in: <function EpochIterator.__del__ at 0x7f528d4f04c0>
Traceback (most recent call last):
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 161, in __del__
    self.close()
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 158, in close
    self.memory_context.__exit__(None, None, None)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/memory_managers/process_cache/context.py", line 59, in __exit__
    self.executor.__exit__(*args)
AttributeError: 'ProcessCacheContext' object has no attribute 'executor'
Traceback (most recent call last):
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 510, in <module>
    ImageNetTrainer.launch_from_args()
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
    return func(*args, **kwargs)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in __call__
    return self.func(*args, **filled_args)
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 461, in launch_from_args
    ch.multiprocessing.spawn(cls._exec_wrapper, nprocs=world_size, join=True)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 468, in _exec_wrapper
    cls.exec(*args, **kwargs)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
    return func(*args, **kwargs)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in __call__
    return self.func(*args, **filled_args)
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 478, in exec
    trainer.train()
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
    return func(*args, **kwargs)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in __call__
    return self.func(*args, **filled_args)
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 300, in train
    train_loss = self.train_loop(epoch)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
    return func(*args, **kwargs)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in __call__
    return self.func(*args, **filled_args)
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 361, in train_loop
    for ix, (images, target) in enumerate(iterator):
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/loader.py", line 214, in __iter__
    return EpochIterator(self, selected_order)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 43, in __init__
    raise e
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 37, in __init__
    self.memory_context.__enter__()
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/memory_managers/process_cache/context.py", line 32, in __enter__
    self.memory = np.zeros((self.schedule.num_slots, self.page_size),
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 229. GiB for an array with shape (29251, 8388608) and data type uint8
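The allocation size in the final error line can be checked by hand from the array shape it reports: one byte per `uint8` element, so slots times page size gives the buffer size. The helper name below is made up for the illustration.

```python
def buffer_gib(num_slots, page_size_bytes):
    """Size in GiB of a uint8 array of shape (num_slots, page_size_bytes)."""
    return num_slots * page_size_bytes / 2**30


page_mib = 8388608 / 2**20
requested = buffer_gib(29251, 8388608)
print(f"page size: {page_mib:.0f} MiB, requested buffer: {requested:.1f} GiB")
```

That is about 228.5 GiB, which matches the rounded figure in the error and is several times the 46G of memory available on the machine described above.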
opened by fantasysee 3
