DABS: A Domain-Agnostic Benchmark for Self-Supervised Learning

This repository contains the code for DABS, a benchmark for domain-agnostic self-supervised learning algorithms. The benchmark's basic components are its datasets, encoders, and pretraining algorithms, described in the sections below. Training is implemented with the PyTorch Lightning framework, logging with Weights & Biases, and configuration management with Hydra.

Usage

We provide support for Python >= 3.7. Install requirements with

python -m pip install -r requirements.txt

For instructions on installing a PyTorch version compatible with your CUDA version, see pytorch.org.
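
To quickly confirm that the installed PyTorch build can see your GPU, you can run the following check (it relies only on the standard PyTorch API):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"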

Datasets

We provide a set of dataset implementations (in src/datasets) from the natural image, English text, multilingual text, speech, sensor, medical imaging, and image-text domains. Preprocessing on these datasets is minimal and hard-coded: simple resizing (e.g. of images) and truncation (e.g. of text and audio). These operations should not be changed, so that comparisons remain fair across users of the benchmark.

See conf/datasets/*.yaml for all dataset configs, including the loss, metrics, and batch size used for each dataset.

Almost all datasets will download automatically when the dataset class is instantiated. The exceptions are the CheXpert, ImageNet, and CU Birds datasets, where manual registration or download is required. See the respective dataset files for specific instructions.
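
For a quick look at a resolved configuration without launching training, you can use Hydra's compose API. The snippet below is an illustrative sketch rather than part of the benchmark: it assumes Hydra >= 1.1, is run from the repository root, and uses dataset=cifar10 as an example override (substitute any dataset name from the configs).

from hydra import compose, initialize
from omegaconf import OmegaConf

# Compose the pretraining config with a dataset override and print the
# resulting configuration (loss, metrics, batch size, etc.).
with initialize(config_path="conf"):
    cfg = compose(config_name="pretrain", overrides=["dataset=cifar10"])
print(OmegaConf.to_yaml(cfg))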

| Pretraining Dataset (unlabeled) | Transfer Dataset (labeled) |
| --- | --- |
| CIFAR10 | Aircraft, CIFAR10, CU Birds, DTD, Traffic Sign, VGG Flower |
| PAMAP2 | PAMAP2 |
| MSCOCO | MSCOCO (mismatched detection), VQA (binary classification) |
| WikiText-103 | GLUE (10 tasks) |
| mC4 | PAWS-X (7 tasks) |
| CheXpert | CheXpert (atelectasis, cardiomegaly, consolidation, edema, and pleural effusion), ChestX-ray8 (atelectasis, cardiomegaly, effusion, infiltration, mass, nodule, pneumonia, pneumothorax) |
| LibriSpeech | Audio MNIST, Fluent Speech (Action, Object, Location), Google Speech Commands, LibriSpeech, VoxCeleb1 |

Pretraining

During the pretraining phase, self-supervised encoders are trained to learn good representations from unlabeled data. We currently support seven datasets for pretraining, one per domain: MS COCO, ImageNet, CheXpert, PAMAP2, mC4, WikiText-103, and LibriSpeech. If the pretraining dataset has associated labels, an online linear evaluator is trained jointly with the encoder to provide a heuristic estimate of transfer performance.
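
The online evaluator follows the usual linear-probing recipe; the sketch below illustrates the general idea (it is our own simplification, not the repository's implementation): a linear head is trained on detached encoder features, so its gradients never reach the encoder, and its accuracy serves as a running estimate of transfer quality.

import torch
import torch.nn as nn
import torch.nn.functional as F

class OnlineLinearProbe(nn.Module):
    """Linear head trained alongside the encoder on detached features."""

    def __init__(self, feature_dim: int, num_classes: int):
        super().__init__()
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # detach() blocks gradients so the probe never updates the encoder
        logits = self.head(features.detach())
        return F.cross_entropy(logits, labels)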

Run pretraining with commands like

python pretrain.py exp.name=<experiment-name> dataset=<dataset> algorithm=<algorithm>

Each dataset and encoder has its own config file. For example, to train a Transformer on the CheXpert dataset with the e-mix algorithm, run

python pretrain.py exp.name=emix-chexpert encoder=transformer dataset=chexpert algorithm=emix

See conf/pretrain.yaml for all pretraining configuration fields.

For more information on the datasets, encoders, and algorithms, see the following sections.

| Pretraining Dataset | Modality | Label type (unused) | Input Type |
| --- | --- | --- | --- |
| CIFAR10 | Natural images | Single label | 2d |
| PAMAP2 | Sensor | Single label | 2d |
| MSCOCO | Captioned images | Single label | 2d + tokens |
| WikiText-103 | English Text | No label | tokens |
| mC4 | Multilingual Text | No label | tokens |
| CheXpert | Medical images | Multi label | 2d |
| LibriSpeech | Speech | No label | 2d |

Transfer Learning

After pretraining, a small linear classifier is trained on top of the frozen encoder. Run transfer learning from a randomly initialized encoder with

python transfer.py exp.name=<experiment-name> dataset=<dataset> ckpt=null 

To transfer from a pretrained model instead, replace null with the path to your pretrained encoder checkpoint. See conf/transfer.yaml for all transfer learning configuration fields.
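
For example, to transfer from a pretrained checkpoint (the experiment name and checkpoint path below are placeholders):

python transfer.py exp.name=chexpert-transfer dataset=chexpert ckpt=/path/to/pretrained/checkpoint.ckpt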

| Dataset | Modality | Label type | Evaluation metric | Input Type |
| --- | --- | --- | --- | --- |
| Aircraft | Natural images | Single label | Accuracy | 2d |
| CU Birds | Natural images | Single label | Accuracy | 2d |
| DTD | Natural images | Single label | Accuracy | 2d |
| Traffic Sign | Natural images | Single label | Accuracy | 2d |
| VGG Flower | Natural images | Single label | Accuracy | 2d |
| PAMAP2 | Sensor | Single label | Accuracy | 2d |
| MS COCO | Captioned images | Binary label | Accuracy | 2d + tokens |
| VQA | Captioned images | Binary label | Accuracy | 2d + tokens |
| CheXpert | Medical images | Multi label | AUROC | 2d |
| ChestX-ray8 | Medical images | Multi label | AUROC | 2d |
| PAWS-X | Multilingual Text | Binary label | Accuracy | tokens |
| COLA | English Text | Binary label | Pearson correlation | tokens |
| MNLI Matched | English Text | Single label | Accuracy | tokens |
| MNLI Mismatched | English Text | Single label | Accuracy | tokens |
| MRPC | English Text | Binary label | Accuracy | tokens |
| QNLI | English Text | Binary label | Accuracy | tokens |
| QQP | English Text | Binary label | Accuracy | tokens |
| RTE | English Text | Binary label | Accuracy | tokens |
| SST2 | English Text | Binary label | Accuracy | tokens |
| STSB | English Text | Regression | Spearman correlation | tokens |
| WNLI | English Text | Binary label | Accuracy | tokens |
| Audio MNIST | Speech | Single label | Accuracy | 2d |
| Fluent Speech | Speech | Single label | Accuracy | 2d |
| Google Speech Commands | Speech | Single label | Accuracy | 2d |
| LibriSpeech | Speech | Single label | Accuracy | 2d |
| VoxCeleb1 | Speech | Single label | Accuracy | 2d |

Encoders

A domain-agnostic SSL method should use an encoder that remains as constant as possible across domains. We provide a general Transformer encoder baseline (in src/encoders). The Transformer operates on a sequence of vectors produced by a small set of domain-specific embedding modules (e.g. patch or token embeddings).
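
As a rough illustration of this design, the sketch below (a conceptual example, not the encoder in src/encoders; the class names, dimensions, and patch-embedding choice are our own) shows a small domain-specific embedding module mapping raw inputs to a sequence of vectors that a shared Transformer then processes identically regardless of domain.

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Illustrative 2d embedding: split an image into patches and project them."""

    def __init__(self, patch_size: int = 16, in_channels: int = 3, dim: int = 256):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # (batch, channels, height, width) -> (batch, num_patches, dim)
        return self.proj(images).flatten(2).transpose(1, 2)

class DomainAgnosticEncoder(nn.Module):
    """Shared Transformer that consumes any sequence of embedding vectors."""

    def __init__(self, dim: int = 256, depth: int = 4, heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        return self.transformer(embeddings)  # (batch, seq_len, dim)

# The same encoder accepts patch embeddings, token embeddings, etc.
encoder = DomainAgnosticEncoder()
patches = PatchEmbedding()(torch.randn(2, 3, 224, 224))
features = encoder(patches)  # (2, 196, 256)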

Pretraining Algorithms

The pretraining algorithm is the framework and objective with which the encoder is trained. Examples of domain-specific algorithms include SimCLR, BYOL, and MoCo, but these are not domain-agnostic because they rely on vision-specific augmentations. We provide our own domain-agnostic implementations of recent algorithms, including e-mix (a generalization of i-mix) and Shuffled Embedding Detection (ShED; a generalization of ELECTRA), which randomly permutes a subset of the input embeddings and trains the model to identify which embeddings were permuted.
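
To make the ShED objective concrete, the following sketch illustrates the general idea (our own simplification, not the repository's implementation; the function and argument names are ours): a random subset of embedding positions is permuted among themselves, and a per-position binary classifier on the encoder output learns to detect which positions were shuffled.

import torch
import torch.nn.functional as F

def shed_style_loss(encoder, classifier, embeddings, shuffle_prob=0.15):
    """Shuffle a random subset of embedding positions and train a detector.

    encoder:    maps (batch, seq_len, dim) -> (batch, seq_len, dim)
    classifier: maps (batch, seq_len, dim) -> (batch, seq_len, 1), e.g. nn.Linear(dim, 1)
    """
    batch, seq_len, _ = embeddings.shape
    mask = torch.rand(batch, seq_len, device=embeddings.device) < shuffle_prob

    # Permute the selected embeddings among themselves.
    corrupted = embeddings.clone()
    idx = mask.nonzero(as_tuple=False)                        # (K, 2) selected positions
    perm = torch.randperm(idx.size(0), device=embeddings.device)
    corrupted[idx[:, 0], idx[:, 1]] = embeddings[idx[perm, 0], idx[perm, 1]]

    # Per-position binary classification: was this position shuffled?
    logits = classifier(encoder(corrupted)).squeeze(-1)       # (batch, seq_len)
    return F.binary_cross_entropy_with_logits(logits, mask.float())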

Results

Below are results for algorithms trained on each dataset in DABS. The baseline performance is obtained via a randomly initialized encoder.

| Pretrain Dataset | Transfer Dataset | Encoder | Baseline Performance | e-mix Performance | ShED Performance |
| --- | --- | --- | --- | --- | --- |
| ImageNet | CIFAR10 | Transformer | 24.20% | 39.43% | 39.63% |
| ImageNet | CU Birds | Transformer | 1.62% | 3.86% | 2.95% |
| ImageNet | VGG Flowers | Transformer | 9.03% | 25.96% | 13.03% |
| ImageNet | DTD | Transformer | 7.39% | 8.83% | 18.35% |
| ImageNet | Traffic Sign | Transformer | 14.33% | 65.07% | 27.51% |
| ImageNet | Aircraft | Transformer | 2.70% | 10.15% | 5.60% |
| PAMAP2 | PAMAP2 | Transformer | 69.81% | 79.48% | 88.69% |
| MSCOCO | VQA | Transformer | 57.50% | 48.90% | 54.30% |
| CheXpert | CheXpert | Transformer | 68.14% | 72.40% | 72.40% |
| CheXpert | ChestX-ray8 | Transformer | 57.00% | 63.00% | 63.70% |
| WikiText-103 | GLUE (average) | Transformer | 42.29% | 44.08% | 48.37% |
| mC4 | PAWS-X (average) | Transformer | 58.11% | 56.16% | 59.91% |
| LibriSpeech | Audio MNIST | Transformer | 33.13% | 80.35% | 67.33% |
| LibriSpeech | Fluent Locations | Transformer | 62.09% | 60.93% | 60.24% |
| LibriSpeech | Fluent Actions | Transformer | 26.15% | 29.87% | 30.53% |
| LibriSpeech | Fluent Objects | Transformer | 30.13% | 39.89% | 39.36% |
| LibriSpeech | Google Speech Commands | Transformer | 4.87% | 19.22% | 20.73% |
| LibriSpeech | LibriSpeech | Transformer | 17.12% | 60.18% | 34.77% |
| LibriSpeech | VoxCeleb1 | Transformer | 0.59% | 2.43% | 2.81% |