Fisher Induced Sparse uncHanging (FISH) Mask
This repo contains the code for Fisher Induced Sparse uncHanging (FISH) Mask training, from "Training Neural Networks with Fixed Sparse Masks" by Yi-Lin Sung, Varun Nair, and Colin Raffel. To appear in Neural Information Processing Systems (NeurIPS) 2021.
Abstract: During typical gradient-based training of deep neural networks, all of the model's parameters are updated at each iteration. Recent work has shown that it is possible to update only a small subset of the model's parameters during training, which can alleviate storage and communication requirements. In this paper, we show that it is possible to induce a fixed sparse mask on the model's parameters that selects a subset to update over many iterations. Our method constructs the mask out of the parameters with the largest Fisher information as a simple approximation of which parameters are most important for the task at hand. In experiments on parameter-efficient transfer learning and distributed training, we show that our approach matches or exceeds the performance of other methods for training with sparse updates while being more efficient in terms of memory usage and communication costs.
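As a rough illustration of the idea (a minimal sketch, not the code in this repository), the mask can be built by accumulating squared gradients over a small number of samples as an empirical estimate of the diagonal Fisher information, then keeping the top fraction of parameters globally. `model`, `data_loader`, and `loss_fn` below are placeholder names.

```python
import torch

def compute_fish_mask(model, data_loader, loss_fn, num_samples=1024, keep_ratio=0.005):
    # Accumulate squared gradients as an empirical estimate of the diagonal Fisher.
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    seen = 0
    for inputs, labels in data_loader:
        model.zero_grad()
        loss_fn(model(inputs), labels).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
        seen += inputs.size(0)
        if seen >= num_samples:
            break

    # Keep the keep_ratio fraction of parameters with the largest Fisher values,
    # selected globally across the whole model.
    scores = torch.cat([f.flatten() for f in fisher.values()])
    k = max(1, int(keep_ratio * scores.numel()))
    threshold = torch.topk(scores, k).values.min()
    return {n: (f >= threshold).float() for n, f in fisher.items()}
```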
Setup
pip install transformers/.
pip install datasets torch==1.8.0 tqdm torchvision==0.9.0
FISH Mask: GLUE Experiments
Parameter-Efficient Transfer Learning
To run the FISH Mask on a GLUE dataset, use a command of the following form:
$ bash transformers/examples/text-classification/scripts/run_sparse_updates.sh <dataset-name> <seed> <top_k_percentage> <num_samples_for_fisher>
An example command used to generate Table 1 in the paper is given below, where all GLUE tasks are run with a seed of 0 and a FISH mask sparsity of 0.5%.
$ bash transformers/examples/text-classification/scripts/run_sparse_updates.sh "qqp mnli rte cola stsb sst2 mrpc qnli" 0 0.005 1024
Distributed Training
To use the FISH mask on the GLUE tasks in a distributed setting, one can use the following command.
$ bash transformers/examples/text-classification/scripts/distributed_training.sh <dataset-name> <seed> <num_workers> <training_epochs> <gpu_id>
Note that <dataset-name> here can contain only one task, so an example command could be
$ bash transformers/examples/text-classification/scripts/distributed_training.sh "mnli" 0 2 3.5 0
FISH Mask: CIFAR10 Experiments
To run the FISH mask on CIFAR10, use a command of the following form:
Distributed Training
$ bash cifar10-fast/scripts/distributed_training_fish.sh <num_samples_for_fisher> <top_k_percentage> <training_epochs> <worker_updates> <learning_rate> <num_workers>
For example, in the paper we compute the FISH mask at the 0.5% sparsity level using 256 samples and distribute the job across 2 workers for a total of 50 training epochs. The corresponding command is
$ bash cifar10-fast/scripts/distributed_training_fish.sh 256 0.005 50 2 0.4 2
Efficient Checkpointing
$ bash cifar10-fast/scripts/small_checkpoints_fish.sh <num_samples_for_fisher> <top_k_percentage> <training_epochs> <learning_rate> <fix_mask>
The hyperparameters are almost the same as in distributed training. The additional <fix_mask> argument indicates whether to fix the mask; valid values are 0 and 1 (1 means the mask is fixed throughout training).
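The intuition behind the small checkpoints, sketched below with hypothetical helpers: since only the masked parameters ever change, a checkpoint only needs to store those values alongside the shared mask and the pretrained weights. These functions are illustrative and not part of the repository.

```python
import torch

def save_sparse_checkpoint(model, masks, path):
    # masks: dict mapping parameter name -> binary mask tensor
    sparse_state = {
        name: param.detach()[masks[name].bool()]
        for name, param in model.named_parameters()
        if name in masks
    }
    torch.save(sparse_state, path)  # stores only the masked entries

def load_sparse_checkpoint(model, masks, path):
    # Restore the masked entries on top of the shared pretrained weights.
    sparse_state = torch.load(path)
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in sparse_state:
                param[masks[name].bool()] = sparse_state[name]
```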
Replicating Results
Replicating each of the tables and figures present in the original paper can be done by running the following:
# Table 1 - Parameter Efficient Fine-Tuning on GLUE
$ bash transformers/examples/text-classification/scripts/run_table_1.sh
# Figure 2 - Mask Sparsity Ablation and Sample Ablation
$ bash transformers/examples/text-classification/scripts/run_figure_2.sh
# Table 2 - Distributed Training on GLUE
$ bash transformers/examples/text-classification/scripts/run_table_2.sh
# Table 3 - Distributed Training on CIFAR10
$ bash cifar10-fast/scripts/distributed_training.sh
# Table 4 - Efficient Checkpointing
$ bash cifar10-fast/scripts/small_checkpoints.sh
Notes
- For reproduction of Diff Pruning results from Table 1, see code here.
Acknowledgements
We thank Yoon Kim, Michael Matena, and Demi Guo for helpful discussions.