[ICLR 2021 Spotlight] Pytorch implementation for "Long-tailed Recognition by Routing Diverse Distribution-Aware Experts."

Overview

RIDE: Long-tailed Recognition by Routing Diverse Distribution-Aware Experts.

by Xudong Wang, Long Lian, Zhongqi Miao, Ziwei Liu and Stella X. Yu at UC Berkeley/ICSI and NTU

International Conference on Learning Representations (ICLR), 2021. Spotlight Presentation

Project Page | PDF | Preprint | OpenReview | Slides | Citation

This repository contains an official re-implementation of RIDE from the authors, and also plans to support other works on long-tailed recognition. For further information, please contact Xudong Wang and Long Lian.

Citation

If you find our work inspiring or use our codebase in your research, please consider giving a star and a citation.

@inproceedings{wang2021longtailed,
  title={Long-tailed Recognition by Routing Diverse Distribution-Aware Experts},
  author={Xudong Wang and Long Lian and Zhongqi Miao and Ziwei Liu and Stella Yu},
  booktitle={International Conference on Learning Representations},
  year={2021},
  url={https://openreview.net/forum?id=D9I3drBz4UC}
}

Supported Methods for Long-tailed Recognition:

  • RIDE
  • Cross-Entropy (CE) Loss
  • Focal Loss
  • LDAM Loss
  • Decouple: cRT (limited support for now)
  • Decouple: tau-normalization (limited support for now)

Updates

[04/2021] Pre-trained models are available in the model zoo.

[12/2020] We added an approximate GFLops counter. See usage below. We also refactored the code and fixed a few errors.

[12/2020] We have limited support for cRT and tau-norm via the load_stage1 option and t-normalization.py. Please look at the code comments for instructions while we are still working on it.

[12/2020] Initial commit. We re-implemented RIDE in this repo. The LDAM, Focal, and Cross-Entropy losses are also re-implemented (instructions below).

Table of contents

  • Requirements
  • Dataset Preparation
  • How to get pretrained checkpoints
  • Training and Evaluation Instructions
  • Test
  • GFLops calculation
  • FAQ
  • How to get support from us?
  • Pytorch template
  • License
  • Acknowledgements

Requirements

Packages

  • Python >= 3.7, < 3.9
  • PyTorch >= 1.6
  • tqdm (Used in test.py)
  • tensorboard >= 1.14 (for visualization)
  • pandas
  • numpy

Hardware requirements

8 GPUs with >= 11 GB of GPU RAM are recommended. Otherwise, models with more experts may not fit in GPU memory, especially on datasets with more classes (the FC layers will be large). We do not support CPU training, but CPU inference could be supported with slight modifications.
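For example, remapping a checkpoint's tensors onto the CPU at load time is usually the main change needed. Below is a generic PyTorch sketch, not a supported code path in this repo; the stand-in model and the "state_dict" key (taken from the pytorch template's convention) are assumptions:

import torch
import torchvision

# Generic sketch: map_location remaps the checkpoint's CUDA tensors onto the CPU.
model = torchvision.models.resnet50()  # stand-in; build your RIDE model from its config instead
checkpoint = torch.load("path_to_checkpoint.pth", map_location=torch.device("cpu"))
model.load_state_dict(checkpoint["state_dict"])  # "state_dict" key assumed from the pytorch template
model.eval()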

Dataset Preparation

CIFAR code will download the data automatically through the dataloader. We use the data in the same way as classifier-balancing. For ImageNet-LT and iNaturalist, please prepare the data in the data directory. ImageNet-LT can be found at this link. The iNaturalist data should be the 2018 version from this repo (note that downloading it now requires payment). The annotations can be found here. Please arrange them as shown below:

data
├── cifar-100-python
│   ├── file.txt~
│   ├── meta
│   ├── test
│   └── train
├── cifar-100-python.tar.gz
├── ImageNet_LT
│   ├── ImageNet_LT_open.txt
│   ├── ImageNet_LT_test.txt
│   ├── ImageNet_LT_train.txt
│   ├── ImageNet_LT_val.txt
│   ├── test
│   ├── train
│   └── val
└── iNaturalist18
    ├── iNaturalist18_train.txt
    ├── iNaturalist18_val.txt
    └── train_val2018
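To the best of our knowledge, the .txt annotation files follow the classifier-balancing convention of one relative image path and one integer class label per line, e.g.:

train/n01440764/n01440764_10026.JPEG 0
train/n01440764/n01440764_10027.JPEG 0

If your copies differ, follow whatever format the dataloader in this repo expects.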

How to get pretrained checkpoints

We have a model zoo available.

Training and Evaluation Instructions

Imbalanced CIFAR 100/CIFAR100-LT

RIDE Without Distill (Stage 1)
python train.py -c "configs/config_imbalance_cifar100_ride.json" --reduce_dimension 1 --num_experts 3

Note: --reduce_dimension 1 sets reduce dimension to True. The template has an issue with bool arguments, so an int argument is used here; any non-zero value is equivalent to True.
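The pitfall can be reproduced in a few lines (an illustration of the standard argparse behavior, not the template's actual code; the flag names are hypothetical):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--buggy_flag", type=bool)       # pitfall: bool("0") == True
parser.add_argument("--reduce_dimension", type=int)  # an int is unambiguous
args = parser.parse_args(["--buggy_flag", "0", "--reduce_dimension", "1"])
print(args.buggy_flag)              # True -- any non-empty string parses as True
print(bool(args.reduce_dimension))  # True, as intended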

RIDE With Distill (Stage 1)
python train.py -c "configs/config_imbalance_cifar100_distill_ride.json" --reduce_dimension 1 --num_experts 3 --distill_checkpoint path_to_checkpoint

Distillation is not required, but can be performed if you'd like further improvements.
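The distillation objective is the usual teacher-student setup; below is a minimal sketch of a temperature-scaled KD term. The repo's actual implementation takes references from CRD/RepDistiller and may differ in details:

import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=2.0):
    # Soften both distributions with temperature T and match them with KL
    # divergence; the T*T factor keeps gradients comparable across temperatures.
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

print(kd_loss(torch.randn(4, 100), torch.randn(4, 100)))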

RIDE Expert Assignment Module Training (Stage 2)
python train.py -c "configs/config_imbalance_cifar100_ride_ea.json" -r path_to_stage1_checkpoint --reduce_dimension 1 --num_experts 3

Note: different runs will result in different EA modules with different trade-offs: some give higher accuracy but require more FLOps. The difference is not in the underlying ability to classify, but in how easily the module is satisfied and stops assigning further experts. You can tune pos_weight if you think the EA module consumes too much compute or uses too few experts.
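Conceptually, the EA module runs experts one at a time and stops once a small router is "satisfied". Here is a toy per-sample sketch of that control flow, with hypothetical names; it is not the repo's actual module, and batched routing is more involved:

import torch

def route_one_sample(x, experts, routers, threshold=0.5):
    # Run experts sequentially; after each one, a router looks at the
    # averaged logits and decides whether another expert is needed.
    total, used = experts[0](x), 1
    for router, expert in zip(routers, experts[1:]):
        # router outputs the probability that more experts are needed
        if torch.sigmoid(router(total / used)).item() < threshold:
            break
        total, used = total + expert(x), used + 1
    return total / used

experts = [torch.nn.Linear(8, 100) for _ in range(3)]
routers = [torch.nn.Linear(100, 1) for _ in range(2)]
print(route_one_sample(torch.randn(1, 8), experts, routers).shape)  # torch.Size([1, 100])

In this reading, tuning pos_weight shifts the trade-off by reweighting the router's training loss.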

ImageNet-LT

RIDE Without Distill (Stage 1)

ResNet 10
python train.py -c "configs/config_imagenet_lt_resnet10_ride.json" --reduce_dimension 1 --num_experts 3
ResNet 50
python train.py -c "configs/config_imagenet_lt_resnet50_ride.json" --reduce_dimension 1 --num_experts 3
ResNeXt 50
python train.py -c "configs/config_imagenet_lt_resnext50_ride.json" --reduce_dimension 1 --num_experts 3

RIDE With Distill (Stage 1)

ResNet 10
python train.py -c "configs/config_imagenet_lt_resnet10_distill_ride.json" --reduce_dimension 1 --num_experts 3 --distill_checkpoint path_to_checkpoint
ResNet 50
python train.py -c "configs/config_imagenet_lt_resnet50_distill_ride.json" --reduce_dimension 1 --num_experts 3 --distill_checkpoint path_to_checkpoint
ResNeXt 50
python train.py -c "configs/config_imagenet_lt_resnext50_distill_ride.json" --reduce_dimension 1 --num_experts 3 --distill_checkpoint path_to_checkpoint

RIDE Expert Assignment Module Training (Stage 2)

ResNet 10
python train.py -c "configs/config_imagenet_lt_resnet10_ride_ea.json" -r path_to_stage1_checkpoint --reduce_dimension 1 --num_experts 3
ResNet 50
python train.py -c "configs/config_imagenet_lt_resnet50_ride_ea.json" -r path_to_stage1_checkpoint --reduce_dimension 1 --num_experts 3
ResNeXt 50
python train.py -c "configs/config_imagenet_lt_resnext50_ride_ea.json" -r path_to_stage1_checkpoint --reduce_dimension 1 --num_experts 3

iNaturalist

RIDE Without Distill (Stage 1)

python train.py -c "configs/config_iNaturalist_resnet50_ride.json" --reduce_dimension 1 --num_experts 3

RIDE With Distill (Stage 1)

python train.py -c "configs/config_iNaturalist_resnet50_distill_ride.json" --reduce_dimension 1 --num_experts 3 --distill_checkpoint path_to_checkpoint

RIDE Expert Assignment Module Training (Stage 2)

python train.py -c "configs/config_iNaturalist_resnet50_ride_ea.json" -r path_to_stage1_checkpoint --reduce_dimension 1 --num_experts 3

Using Other Methods with RIDE

  • Focal Loss: switch the loss in the config to Focal Loss
  • Cross Entropy: switch the loss in the config to Cross Entropy Loss

Test

To test a checkpoint, please place it with its corresponding config file.

python test.py -r path_to_checkpoint

Please see the pytorch template that we use for additional, more general usage of this project (e.g., loading from a checkpoint, etc.).

GFLops calculation

We provide experimental support for approximate GFLops calculation. Please open an issue if you encounter any problems or observe inconsistencies in the GFLops numbers.

You need to install the thop package first. Then, depending on your model, run python -m utils.gflops (args) in the project directory.
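thop can be installed with pip install thop. Since the counter relies on thop's profiler, the bare thop API looks like this (a generic example with a stand-in torchvision model, not the utils.gflops wrapper):

import torch
import torchvision
from thop import profile

model = torchvision.models.resnet18()       # stand-in model, not a RIDE model
x = torch.randn(1, 3, 224, 224)
macs, params = profile(model, inputs=(x,))  # thop counts multiply-accumulates
print(macs / 1e9, "GMACs,", params / 1e6, "M params")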

Examples and explanations

Use python -m utils.gflops to see the documentation as well as explanations for this calculator.

ImageNet-LT
python -m utils.gflops ResNeXt50Model 0 --num_experts 3 --reduce_dim True --use_norm False

To change the model, switch ResNeXt50Model to the one used in your config. use_norm comes with LDAM-based methods (including RIDE), and reduce_dim is used in the default RIDE models. The 0 in the command line indicates the dataset.

All supported datasets:

  • 0: ImageNet-LT
  • 1: iNaturalist
  • 2: Imbalance CIFAR 100

iNaturalist
python -m utils.gflops ResNet50Model 1 --num_experts 3 --reduce_dim True --use_norm True

Imbalance CIFAR 100
python -m utils.gflops ResNet32Model 2 --num_experts 3 --reduce_dim True --use_norm True

Special case: calculating the approximate GFLops for models with an expert assignment module

We provide an ea_percentage option for specifying the percentage of data that passes through each expert. Note that you also need to switch to the EA model class, since the EA model (not the original model) is what is actually used in training and inference.

An example:

python -m utils.gflops ResNet32EAModel 2 --num_experts 3 --reduce_dim True --use_norm True --ea_percentage 40.99,9.47,49.54
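One plausible reading of how these percentages enter the estimate (our assumption, not verified against utils.gflops) is as weights on the cumulative cost of the experts actually run:

# Back-of-the-envelope sketch: if each expert costs roughly c GFLops and
# p_i percent of samples stop after expert i, the expected per-sample cost
# is about sum_i (p_i / 100) * i * c, plus shared-backbone/router overhead.
p = [40.99, 9.47, 49.54]  # percentages from the example command above
c = 0.05                  # assumed per-expert GFLops (illustrative only)
expected = sum(pi / 100 * (i + 1) * c for i, pi in enumerate(p))
print(f"~{expected:.4f} GFLops")  # ~0.1043 with these made-up numbers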

FAQ

See FAQ.

How to get support from us?

If you have any general questions, feel free to email us at longlian at berkeley.edu and xdwang at eecs.berkeley.edu. If you have code or implementation-related questions, please feel free to email us or open an issue in this codebase (we recommend opening an issue, because your question may help others).

Pytorch template

This is a project based on this pytorch template. The template's readme explains its functionality, although we try to list the most frequently used features in this readme.

License

This project is licensed under the MIT License. See LICENSE for more details. The parts described below follow their original licenses.

Acknowledgements

This is a project based on this pytorch template. The pytorch template is inspired by the project Tensorflow-Project-Template by Mahmoud Gemy.

The ResNet and ResNeXt in fb_resnets are based on Classifier-Balancing/Decouple. The ResNet in ldam_drw_resnets, the LDAM loss, and the CIFAR-LT setup are based on LDAM-DRW. The KD implementation takes references from CRD/RepDistiller.

Comments
  • About the change of network structure

    I observed that in your experiments you use more experts (e.g., 3 or 4), and in this case the parameter count and computation of the base network increase. I think you should add a comparative experiment to show whether the overall performance gain is attributable more to your proposed training method or to the increase in network parameters. When comparing with other methods, the computation, parameter count, and network structure should be kept consistent; I noticed that your ResNet structure differs from the original one, and although you claim the computation is consistent, I suspect the change in network structure may itself have a significant impact on overall performance. My point is that the baseline methods should use the same network structure as yours, so that the effect of the structural change can be excluded and we can conclude that the loss functions proposed in your paper are meaningful and valid.

    question 
    opened by abababa-ai 7
  • [Error] A Python exception is encountered at stage 2

    Hi, thanks for your great work.

    After training stage 1 and obtaining the checkpoint, I started running the code for stage 2. However, I get the error "RuntimeError: grad can be implicitly created only for scalar outputs" in self._train_epoch(epoch). Google says this happens because the loss is not a scalar. I have no idea what is wrong with this code, since it is packaged very well. Looking forward to your reply.

    opened by Vanint 5
  • The diversity loss has no effect?

    Hi, authors! Thank you for doing such an inspiring job and opening the source code! I ran into a problem when using your code: the diversity loss does not seem to have much effect.

    I ran your "RIDE Without Distill (Stage 1)" code with 3 experts on CIFAR100-LT using your config and got 47.8% validation accuracy. As an ablation, I set additional_diversity_factor to 0.45 (the original setting is -0.45) and got 48.0% validation accuracy, which is even 0.2% higher than 47.8%. I didn't change anything else in your code. Could you help me figure out the problem?

    Thanks a lot!

    opened by yypurpose 5
  • [GPU Utilization] DataLoader iteration speed quite low at the start of every epoch

    Hi there, thanks for your great job! During my training (ResNeXt50 on ImageNet-LT), I used 8 A100s (total batch size: 1024). I found that at the start of every epoch the dataloader gets stuck for around 30 seconds. Even after changing the code to DDP mode, it still gets stuck for 30 seconds. I wonder whether there is some problem related to ImageNetLTDataLoader?

    Best,

    opened by CiaoHe 2
  • How to understand Formula 13 and 14 in your paper?

    Thank you for your work on the long-tailed recognition problem. Your work is excellent, and I want to use RIDE in my research. However, I am confused about formulas 13 and 14. For example, what does γ mean and how is it computed? Likewise, what does α mean and how is it computed?

    question 
    opened by MRZHANG-1997 2
  • Question about Diversity loss

    Thanks for your great work! But I am confused about why the diversity loss works. The paper says it is a regularization term to encourage complementary decisions. Why do we need complementary decisions from experts? Don't we want the same correct answer? If the experts' decisions are diverse, how can we be sure the final output is correct?

    question 
    opened by singing4you 2
  • [Error] A Python exception is encountered: RuntimeError: invalid argument 5: k not in range for dimension at C:/w/b/windows/pytorch/aten/src\THC/generic/THCTensorTopK.cu:26

    This error is caused by top_choices_num in stage 2 of training, when I use my own data with 37 categories. The error is resolved when top_choices_num is set to 30. I tried to understand this parameter from the code but failed. I would be grateful if you could explain it. Your work is very meaningful for solving the long-tailed problem; thank you for sharing.

    opened by MRZHANG-1997 2
  • [Error] n_gpu model encounters errors

    When I use n_gpu=1, everything is OK. When I use n_gpu=4 in training, the procedure errors out as below:

    Traceback (most recent call last):
      File "train.py", line 110, in <module>
        main(config)
      File "train.py", line 75, in main
        trainer.train()
      File "/home/xxx/workspace/Oracle/RIDE_IR/base/base_trainer.py", line 76, in train
        result = self._train_epoch(epoch)
      File "/home/xxx/workspace/Oracle/RIDE_IR/trainer/trainer.py", line 133, in _train_epoch
        "logits": self.real_model.backbone.logits
      File "/home/xxx/.conda/envs/pytorch_jhon/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1131, in __getattr__
        type(self).__name__, name))
    AttributeError: 'ResNet_s' object has no attribute 'logits'

    When I try to fix this by adding "self.logits = []" in the ResNet_s __init__ of ride_resnet_cifar.py, another error occurs:

    Traceback (most recent call last):
      File "train.py", line 110, in <module>
        main(config)
      File "train.py", line 75, in main
        trainer.train()
      File "/home/xxx/workspace/Oracle/RIDE_IR/base/base_trainer.py", line 76, in train
        result = self._train_epoch(epoch)
      File "/home/xxx/workspace/Oracle/RIDE_IR/trainer/trainer.py", line 148, in _train_epoch
        loss.backward()
    AttributeError: 'int' object has no attribute 'backward'

    Have you ever encountered errors like this, and could you offer any help? Thanks so much. My conda torch-related package versions are:

    ffmpeg 4.3 hf484d3e_0 pytorch
    pytorch 1.9.0 py3.7_cuda10.2_cudnn7.6.5_0 pytorch
    torchaudio 0.9.0 py37 pytorch
    torchvision 0.10.0 py37_cu102 pytorch
    python 3.7.11 h12debd9_0 defaults

    opened by wujunnan0929 2
  • Mismatched hyper-parameter settings for

    Hi @TonyLianLong,

    I noticed that the hyper-parameters reported in your paper are the same as LDAM's:

    1. weight decay is 2e-4
    2. lr decay steps are 120 and 160

    but in your config files they are changed to:

    1. weight decay is 5e-4
    2. lr decay steps are 160 and 180

    I am quite confused, could you please explain it?

    opened by mitming 2
  • [Error] RIDE Expert Assignment Module Training (Stage 2)

    Thanks for your great work. Here is an error I get when trying to use a model from the model zoo on the iNaturalist dataset. The model is based on a ResNet50 backbone with 4 experts and distillation. When I use it as the pretrained model to initialize stage 2, it produces the following errors; it seems that the parameters of some layers are not loaded correctly. My command line is: python train.py -c "configs/config_iNaturalist_resnet50_ride_ea.json" -r afs/RIDE/RIDE_model/imagenet_4experts_distill/checkpoint-epoch5.pth --reduce_dimension 1 --num_experts 4

    The reported errors are as follows:

    Loading checkpoint: ./RIDE/RIDE_model/imagenet_4experts_distill/checkpoint-epoch5.pth ...
    Traceback (most recent call last):
      File "/root/workspace/env_run/utils/util.py", line 59, in load_state_dict
        own_state["module."+name].copy_(param)
    RuntimeError: The size of tensor a (64) must match the size of tensor b (128) at non-singleton dimension 0

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "train.py", line 110, in <module>
        main(config)
      File "train.py", line 73, in main
        lr_scheduler=lr_scheduler)
      File "/root/workspace/env_run/trainer/trainer.py", line 14, in __init__
        super().__init__(model, criterion, metric_ftns, optimizer, config)
      File "/root/workspace/env_run/base/base_trainer.py", line 59, in __init__
        self._resume_checkpoint(config.resume, state_dict_only=state_dict_only)
      File "/root/workspace/env_run/base/base_trainer.py", line 211, in _resume_checkpoint
        load_state_dict(self.model, state_dict)
      File "/root/workspace/env_run/utils/util.py", line 63, in load_state_dict
        print("Error in copying parameter {}, source shape: {}, destination shape: {}".format(name, param.shape, own_state[name].shape))
    KeyError: 'backbone.layer1.0.conv1.weight'

    opened by abababa-ai 2
  • How to implement cRT and τ-norm with this framework?

    Hi @TonyLianLong,

    Thanks for your contribution. Is there any instruction for the implementations of cRT and τ-norm? Or could you please kindly provide the hyperparameters of cRT and τ-norm used in your paper?

    question 
    opened by mitming 2