Delving into Deep Imbalanced Regression
This repository contains the implementation code for the paper:
Delving into Deep Imbalanced Regression
Yuzhe Yang, Kaiwen Zha, Ying-Cong Chen, Hao Wang, Dina Katabi
38th International Conference on Machine Learning (ICML 2021), Long Oral
[Project Page] [Paper] [Video] [Blog Post]
Deep Imbalanced Regression (DIR) aims to learn from imbalanced data with continuous targets,
tackle potential missing data for certain target regions, and generalize to the entire target range.
Beyond Imbalanced Classification: Brief Introduction for DIR
Existing techniques for learning from imbalanced data focus on targets with categorical indices, i.e., the targets are different classes. However, many real-world tasks involve continuous and even infinite target values. We systematically investigate Deep Imbalanced Regression (DIR), which aims to learn continuous targets from naturally imbalanced data, deal with potential missing data for certain target values, and generalize to the entire target range.
We curate and benchmark large-scale DIR datasets for common real-world tasks in the computer vision, natural language processing, and healthcare domains, ranging from single-value prediction (e.g., age, text similarity score, health condition score) to dense-value prediction (e.g., depth).
Usage
We separate the codebase for different datasets into different subfolders. Please go into the subfolders for more information (e.g., installation, dataset preparation, training, evaluation & models).
IMDB-WIKI-DIR | AgeDB-DIR | NYUD2-DIR | STS-B-DIR
Highlights
(1) New task: Deep Imbalanced Regression (DIR)

(2) New techniques: Label distribution smoothing (LDS) and Feature distribution smoothing (FDS) (a minimal LDS sketch follows this section)

(3) New benchmarks:
- Computer Vision: 💡 IMDB-WIKI-DIR (age) / AgeDB-DIR (age) / NYUD2-DIR (depth)
- Natural Language Processing: 📋 STS-B-DIR (text similarity score)
- Healthcare: 🏥 SHHS-DIR (health condition score)

Overview figures for the five curated DIR datasets: IMDB-WIKI-DIR | AgeDB-DIR | NYUD2-DIR | STS-B-DIR | SHHS-DIR
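The repository implements LDS and FDS inside each dataset subfolder; the snippet below is only a minimal, self-contained sketch of the LDS idea under assumed choices (a Gaussian smoothing kernel, inverse-density reweighting, and the helper name `lds_weights` are illustrative, not the repository's API). FDS follows the same intuition but smooths per-bin feature statistics (means and covariances) rather than label counts.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d


def lds_weights(labels, num_bins=100, sigma=2.0, reweight="inverse"):
    """Toy LDS: smooth the empirical label histogram with a Gaussian kernel
    and turn the smoothed ("effective") density into per-sample loss weights."""
    labels = np.asarray(labels, dtype=float)

    # 1) Bin the continuous targets into an empirical histogram.
    bin_edges = np.linspace(labels.min(), labels.max(), num_bins + 1)
    bin_idx = np.clip(np.digitize(labels, bin_edges) - 1, 0, num_bins - 1)
    empirical = np.bincount(bin_idx, minlength=num_bins).astype(float)

    # 2) Convolve with a symmetric kernel: nearby targets share information,
    #    so the effective density of a rare label also reflects its neighbors.
    effective = gaussian_filter1d(empirical, sigma=sigma)

    # 3) Reweight samples by the inverse (or inverse-sqrt) effective density,
    #    then rescale so the weights average to 1.
    density = effective[bin_idx]
    weights = 1.0 / density if reweight == "inverse" else 1.0 / np.sqrt(density)
    return weights * len(weights) / weights.sum()


# Example: rare ages receive larger weights, which can then be plugged into a
# weighted regression loss (e.g., weighted L1).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ages = rng.normal(35, 10, size=5000).clip(1, 100)  # imbalanced continuous targets
    w = lds_weights(ages, num_bins=100, sigma=2.0)
    print(w.min(), w.max())
```

The key design point is that smoothing the label histogram, rather than using raw counts, accounts for the continuity of the target space: a target value is not effectively rare if its close neighbors are well represented.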
Updates
- [06/2021] We provide a hands-on tutorial on DIR. Check it out!
- [05/2021] We have created a blog post for this work (a Chinese version is also available here). Check it out for more details!
- [05/2021] Paper accepted to ICML 2021 as a Long Talk. We have released the code and models. You can find all reproduced checkpoints via this link, or go into each dataset subfolder for its models.
- [02/2021] arXiv version posted. Please stay tuned for updates.
Citation
If you find this code or idea useful, please cite our work:
@inproceedings{yang2021delving,
  title={Delving into Deep Imbalanced Regression},
  author={Yang, Yuzhe and Zha, Kaiwen and Chen, Ying-Cong and Wang, Hao and Katabi, Dina},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2021}
}
Contact
If you have any questions, feel free to reach out via email ([email protected] & [email protected]) or GitHub issues. Enjoy!