A machine learning benchmark of in-the-wild distribution shifts, with data loaders, evaluators, and default models.

Overview

WILDS is a benchmark of in-the-wild distribution shifts spanning diverse data modalities and applications, from tumor identification to wildlife monitoring to poverty mapping.

The WILDS package contains:

  1. Data loaders that automatically handle data downloading, processing, and splitting, and
  2. Dataset evaluators that standardize model evaluation for each dataset.

In addition, the example scripts contain default models, allowing new algorithms to be easily added and run on all of the WILDS datasets.

For more information, please read our paper or visit our website. For questions and feedback, please post on the discussion board.

Installation

We recommend using pip to install WILDS:

pip install wilds

If you have already installed it, please check that you have the latest version:

python -c "import wilds; print(wilds.__version__)"
# This should print "1.0.0". If it doesn't, update by running:
pip install -U wilds

If you plan to edit or contribute to WILDS, you should install from source:

git clone git@github.com:p-lambda/wilds.git
cd wilds
pip install -e .

Requirements

  • numpy>=1.19.1
  • pandas>=1.1.0
  • pillow>=7.2.0
  • torch>=1.7.0
  • tqdm>=4.53.0
  • pytz>=2020.4
  • outdated>=0.2.0
  • ogb>=1.2.3
  • torch-scatter>=2.0.5
  • torch-geometric>=1.6.1

Running pip install wilds will check for all of these requirements except for the torch-scatter and torch-geometric packages, which require a quick manual install.
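As a quick sanity check after installation, a short script along the following lines can report whether the two manually installed packages are importable. This is a minimal sketch using only the Python standard library. The import names torch_scatter and torch_geometric are taken from the traceback and version listing further down this page, and the helper name missing_modules is made up for illustration.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Import names of the two packages that pip does not install automatically.
manual_installs = ["torch_scatter", "torch_geometric"]
missing = missing_modules(manual_installs)
if missing:
    print("Still need a manual install:", ", ".join(missing))
else:
    print("torch-scatter and torch-geometric are both importable.")
```

The check only looks for the modules without importing them, so it runs quickly even when the packages are large.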

Default models

After installing the WILDS package, you can use the scripts in examples/ to train default models on the WILDS datasets. These scripts are not part of the installed WILDS package. To use them, you should clone the repo (assuming you did not install from source):

git clone git@github.com:p-lambda/wilds.git

To run these scripts, you will need to install these additional dependencies:

  • torchvision>=0.8.1
  • transformers>=3.5.0

All baseline experiments in the paper were run on Python 3.8.5 and CUDA 10.1.

Usage

Default models

In the examples/ folder, we provide a set of scripts that we used to train models on the WILDS datasets. These scripts are configured with the default models and hyperparameters that we used for all of the baselines described in our paper. All baseline results in the paper can be replicated with commands like:

cd examples
python run_expt.py --dataset iwildcam --algorithm ERM --root_dir data
python run_expt.py --dataset civilcomments --algorithm groupDRO --root_dir data

The scripts are set up to facilitate general-purpose algorithm development: new algorithms can be added to examples/algorithms and then run on all of the WILDS datasets using the default models.

The first time you run these scripts, you might need to download the datasets. You can do so with the --download argument, for example:

python run_expt.py --dataset civilcomments --algorithm groupDRO --root_dir data --download
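When several dataset and algorithm combinations need to be run, the command lines can be generated rather than typed by hand. The sketch below only builds and prints the commands, using the flags shown above (--dataset, --algorithm, --root_dir, --download). The helper name build_commands is hypothetical, and it is an assumption that every listed algorithm is meant to be paired with every listed dataset.

```python
import shlex
from itertools import product


def build_commands(datasets, algorithms, root_dir="data", download=False):
    """Build one run_expt.py argument list per (dataset, algorithm) pair."""
    commands = []
    for dataset, algorithm in product(datasets, algorithms):
        cmd = ["python", "run_expt.py",
               "--dataset", dataset,
               "--algorithm", algorithm,
               "--root_dir", root_dir]
        if download:
            cmd.append("--download")
        commands.append(cmd)
    return commands


# Print the commands so they can be inspected or pasted into a job script.
for cmd in build_commands(["iwildcam", "civilcomments"], ["ERM", "groupDRO"],
                          download=True):
    print(shlex.join(cmd))
```

Each argument list could also be passed to subprocess.run from inside the examples directory, which avoids shell quoting issues.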

Data loading

The WILDS package provides a simple, standardized interface for all datasets in the benchmark. This short Python snippet covers all of the steps of getting started with a WILDS dataset, including dataset download and initialization, accessing various splits, and preparing a user-customizable data loader.

>>> from wilds.datasets.iwildcam_dataset import IWildCamDataset
>>> from wilds.common.data_loaders import get_train_loader
>>> import torchvision.transforms as transforms

# Load the full dataset, and download it if necessary
>>> dataset = IWildCamDataset(download=True)

# Get the training set
>>> train_data = dataset.get_subset('train',
...                                 transform=transforms.Compose([transforms.Resize((224,224)),
...                                                               transforms.ToTensor()]))

# Prepare the standard data loader
>>> train_loader = get_train_loader('standard', train_data, batch_size=16)

# Train loop
>>> for x, y_true, metadata in train_loader:
...   ...

The metadata contains information like the domain identity, e.g., which camera a photo was taken from, or which hospital the patient's data came from, etc.

Domain information

To allow algorithms to leverage domain annotations as well as other groupings over the available metadata, the WILDS package provides Grouper objects. These Grouper objects extract group annotations from metadata, allowing users to specify the grouping scheme in a flexible fashion.

>>> from wilds.common.grouper import CombinatorialGrouper

# Initialize grouper, which extracts domain information
# In this example, we form domains based on location
>>> grouper = CombinatorialGrouper(dataset, ['location'])

# Train loop
>>> for x, y_true, metadata in train_loader:
...   z = grouper.metadata_to_group(metadata)
...   ...
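To make the idea of a grouping scheme concrete, here is a library-independent sketch of how combining one or more metadata fields can yield a single group index per example. It is an illustration in plain Python and not the CombinatorialGrouper implementation. The field names, the integer encoding of the metadata, and the cardinalities are all hypothetical.

```python
def metadata_to_group(rows, field_names, groupby_fields, cardinalities):
    """Map each metadata row to a single integer group index.

    rows: list of tuples of non-negative ints, one entry per name in field_names.
    groupby_fields: the fields whose value combinations define the groups.
    cardinalities: dict mapping field name -> number of distinct values.
    """
    columns = [field_names.index(f) for f in groupby_fields]
    groups = []
    for row in rows:
        index = 0
        for col, field in zip(columns, groupby_fields):
            # Mixed-radix encoding: every value combination gets its own index.
            index = index * cardinalities[field] + row[col]
        groups.append(index)
    return groups


field_names = ["location", "hour"]
rows = [(0, 3), (1, 3), (1, 5)]

# Grouping by location only.
print(metadata_to_group(rows, field_names, ["location"], {"location": 2}))
# prints [0, 1, 1]

# Grouping by every (location, hour) combination.
print(metadata_to_group(rows, field_names, ["location", "hour"],
                        {"location": 2, "hour": 24}))
# prints [3, 27, 29]
```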

The Grouper can be used to prepare a group-aware data loader that, for each minibatch, first samples a specified number of groups, then samples examples from those groups. This allows our data loaders to accommodate a wide array of training algorithms, some of which require specific data loading schemes.

# Prepare a group data loader that samples from user-specified groups
>>> train_loader = get_train_loader('group', train_data,
...                                 grouper=grouper,
...                                 n_groups_per_batch=2,
...                                 batch_size=16)
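The two-stage sampling described above (first groups, then examples within those groups) can be sketched in plain Python. This is not the sampler used by the WILDS data loaders. Choosing groups uniformly, splitting the batch evenly across groups, and sampling with replacement inside each group are assumptions made for this illustration, and the function name sample_group_batch is hypothetical.

```python
import random
from collections import defaultdict


def sample_group_batch(group_ids, n_groups_per_batch, batch_size, rng):
    """Return example indices for one minibatch drawn from a few sampled groups."""
    if batch_size % n_groups_per_batch != 0:
        raise ValueError("batch_size must be divisible by n_groups_per_batch")

    # Index the examples by the group they belong to.
    by_group = defaultdict(list)
    for idx, group in enumerate(group_ids):
        by_group[group].append(idx)

    # Step 1: sample which groups appear in this batch (without replacement).
    chosen = rng.sample(sorted(by_group), n_groups_per_batch)

    # Step 2: sample examples from each chosen group. Sampling with
    # replacement lets small groups still fill their share of the batch.
    per_group = batch_size // n_groups_per_batch
    batch = []
    for group in chosen:
        batch.extend(rng.choices(by_group[group], k=per_group))
    return batch


group_ids = [0, 0, 0, 1, 1, 2, 2, 2, 2]  # one group id per training example
batch = sample_group_batch(group_ids, n_groups_per_batch=2, batch_size=4,
                           rng=random.Random(0))
print(batch, [group_ids[i] for i in batch])
```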

Evaluators

The WILDS package standardizes and automates evaluation for each dataset. Invoking the eval method of each dataset yields all metrics reported in the paper and on the leaderboard.

>>> from wilds.common.data_loaders import get_eval_loader

# Get the test set
>>> test_data = dataset.get_subset('test',
...                                 transform=transforms.Compose([transforms.Resize((224,224)),
...                                                               transforms.ToTensor()]))

# Prepare the data loader
>>> test_loader = get_eval_loader('standard', test_data, batch_size=16)

# Get predictions for the full test set
>>> for x, y_true, metadata in test_loader:
...   y_pred = model(x)
...   # accumulate y_true, y_pred, and metadata into all_y_true, all_y_pred, all_metadata

# Evaluate
>>> dataset.eval(all_y_pred, all_y_true, all_metadata)
{'recall_macro_all': 0.66, ...}
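The accumulation step and the kind of metric involved can be sketched without any dependencies. The example below assumes predictions and labels are plain integer class indices and computes macro-averaged recall by hand. It is only an illustration of what a key such as recall_macro_all presumably refers to, not the code that dataset.eval runs, and the toy batches stand in for a real data loader and model.

```python
from collections import defaultdict


def macro_recall(y_true, y_pred):
    """Average the per-class recall over the classes that occur in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Stand-in for iterating over a data loader: (y_pred, y_true) per batch.
batches = [([0, 1, 1], [0, 1, 0]), ([2, 2], [2, 1])]

all_y_pred, all_y_true = [], []
for y_pred, y_true in batches:
    # Accumulate the per-batch results so the metric sees the full split.
    all_y_pred.extend(y_pred)
    all_y_true.extend(y_true)

print(round(macro_recall(all_y_true, all_y_pred), 3))
```

Computing the metric once over the accumulated results, rather than averaging per-batch values, keeps classes that are rare within a batch from being weighted incorrectly.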

Citing WILDS

If you use WILDS datasets in your work, please cite our paper (Bibtex):

  • WILDS: A Benchmark of in-the-Wild Distribution Shifts (2020). Pang Wei Koh*, Shiori Sagawa*, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang.

Please also cite the original papers that introduce the datasets, as listed on the datasets page.

Acknowledgements

The design of the WILDS benchmark was inspired by the Open Graph Benchmark, and we are grateful to the Open Graph Benchmark team for their advice and help in setting up WILDS.

Comments
  • Installing via pip seems to miss `torch_scatter` dependency

    Hey,

    I noticed that the installation via pip install wilds seems to miss the torch_scatter dependency that is also listed in the README. When e.g. trying to do from wilds.datasets.amazon_dataset import AmazonDataset I got

    from wilds.datasets.amazon_dataset import AmazonDataset
      File "/Users/deul/Desktop/wilds/wilds/datasets/amazon_dataset.py", line 6, in <module>
        from wilds.common.utils import map_to_id_array
      File "/Users/deul/Desktop/wilds/wilds/common/utils.py", line 1, in <module>
        import torch, torch_scatter
    ModuleNotFoundError: No module named 'torch_scatter'
    

    As far as I can see, the solution should be as easy as adding torch_scatter>=2.0.5 to the install_requires attribute in setup.py. In my case, the error was resolved after installing torch_scatter separately.

    opened by Kaleidophon 6
  • Data loader for PovertyMap is very slow


    Hi -

    I ran into a bit of an issue when loading data from the PovertyMap dataset: loading a single minibatch with 128 examples takes about 5-6 seconds. This is not a huge deal, but it is slow enough to make me curious whether there's a faster way of doing this.

    Digging into the code a bit, it looks like the slowdown is mostly due to the array copy on line 239 of poverty_dataset.py https://github.com/p-lambda/wilds/blob/f984047af654eed6be51a7f770804a1c1b1ad0a0/wilds/datasets/poverty_dataset.py#L239 FWIW it looks like this is a known issue for memory-mapped numpy arrays on Linux systems (https://stackoverflow.com/questions/42864320/numpy-memmap-performance-issues).

    I'm not sure if there are any recommendations for getting around this, or if there's another way the data could be loaded in? Or let me know if I'm totally off-base here. Thanks!

    opened by dmadras 6
  • `assert` error in new wilds version with FMoW


    Hello, I am using the new version of WILDS and getting the error:

    ... wilds/common/utils.py" line 86, in avg_over_groups
        assert v.numel()==g.numel()
    

    any ideas? It may be a bug on my end and if I catch it I'll update here.

    opened by mitchellnw 4
  • Unable to Train ERM model with civilcomments


    Hi,

    I am having trouble running the code with the command python3 wilds/examples/run_expt.py --dataset civilcomments --algorithm ERM --root_dir data --download. Everything gets stuck, no error is reported, and neither the GPU nor the CPU is being used.

    If I press Ctrl+C, I get the output shown in the attached image.

    The same thing didn't happen when I tried to run the same script but with groupDRO.

    It would be very helpful if you have any clue on this, and thank you a lot for your amazing, well developed code!

    opened by Bluepossibility 3
  • Issue in OOD data distribution when Grouper is set to "regions" for FMoW

    Hi,

    I am trying to change the groupby from "year" to "region". I have followed the instructions in the README page and am currently using the following command: python3 wilds/examples/run_expt.py --dataset fmow --algorithm ERM --groupby_fields region --root_dir wilds_fmow/

    However, the issue is that the training dataset is not being split into distinct regions for ID and OOD. That is, all regions are included in ID as well as OOD. The output was attached as a screenshot.

    Therefore, I was wondering if that is a bug in the code or am I missing something?

    Thanks Sara A. Al-Emadi

    opened by saraalemadi 3
  • Minor issue: `pip install wilds` changes pytorch version


    A really minor issue, but the pip install wilds changed my pytorch version which then caused some prior evals on non-wilds datasets to change slightly. Is it possible for this to not occur? No worries if not.

    opened by mitchellnw 3
  • Understanding the prediction_dir format for leaderboard submission


    I wonder if the log folder used during training is the prediction_dir described in Get Started: Evaluating trained models.

    I tried to reproduce the ERM result on a subset of camelyon with the following command:

    python examples/run_expt.py --dataset camelyon17 --algorithm ERM --root_dir data --frac 0.1 --log_dir log_erm_01

    Training goes well.

    But my file camelyon17_split:id_val_seed:0_epoch is empty.

    Then I ran the following command: python examples/evaluate.py log_erm_01 erm_01_output --root-dir data --dataset camelyon17

    And I got this:

    Traceback (most recent call last):
      File "examples/evaluate.py", line 282, in <module>
        main()
      File "examples/evaluate.py", line 244, in main
        evaluate_benchmark(
      File "examples/evaluate.py", line 136, in evaluate_benchmark
        predictions_file = get_prediction_file(
      File "examples/evaluate.py", line 89, in get_prediction_file
        raise FileNotFoundError(
    FileNotFoundError: Could not find CSV or pth prediction file that starts with camelyon17_split:id_val_seed:0.
    

    So my question is whether the log file is the prediction_dir described in Get Started ?

    opened by jmamath 3
  • How do I access data from only one group?


    Hello, Thanks for the fantastic library!

    I have two questions:

    1. Is there any way I can get a per-group dataloader in wilds? This will help with, for instance, training a separate model for each group of data.
    2. Can I change the split of data for each dataset? My application requires 50% of the data for each group/domain for testing.

    Thanks!

    opened by krishnap25 3
  • Model loaded from a .pth predicts only zeros


    Hello !

    I downloaded for the Camelyon17 dataset your trained model from CodaLab (ERM and seed0). I have installed all packages correctly according to your readme and load the model as follows:

    path = "/best_model.pth"
    state = torch.load(path)['algorithm']
    
    state_dict = {}
     
    for key in list(state.keys()):
        state_dict[key.replace('model.', '')] = state[key]
    
    model.load_state_dict(state_dict)
    
    model.eval()
    

    I initialize the dataset I use for testing the model as follows:

    import datasets_load  # from wilds package
    dataset = datasets_load.Dataset('camelyon17', 32, '/data', 0.75, False)
    

    For the prediction I used the following piece of code:

    from wilds.common.data_loaders import get_eval_loader
    
    test_data = dataset.test_set
    test_loader = get_eval_loader('standard', test_data, batch_size=32)
    
    with torch.no_grad():
        for x, y_true, metadata in test_loader:
              y_pred = model(x)
              labels = y_true
              _, predicted = torch.max(y_pred, 1)
              # print statements to check the output
              print("Labels: ", labels)
              print("Predicted: ", predicted)
              print("Correct: ", (predicted == labels).sum().item())
    
    

    So far so good. When I run the code, the labels are printed (which always consist of 1 at the beginning, because shuffle=False) and the prediction which always consists of 0 values.

    Labels:  tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
    Predicted:  tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    Correct:  0
    

    I would appreciate any advice or assistance. Many thanks in advance. Tim

    opened by tim1188 3
  • Cannot fetch 'ogb-molpcba' dataset due to missing arg


    dataset = get_dataset(dataset='ogb-molpcba', download=True, root_dir='../data/')
    

    Results in the following error:

    --------------------------------------------------------------------
    TypeError                          Traceback (most recent call last)
    <ipython-input-2-c369817b9157> in <module>
    ----> 1 dataset = get_dataset(dataset='ogb-molpcba', download=True, root_dir='../data/')
    
    ~/anaconda3/envs/benchmark/lib/python3.7/site-packages/wilds/get_dataset.py in get_dataset(dataset, version, **dataset_kwargs)
         51     elif dataset == 'ogb-molpcba':
         52         from wilds.datasets.ogbmolpcba_dataset import OGBPCBADataset
    ---> 53         return OGBPCBADataset(version=version, **dataset_kwargs)
         54 
         55     elif dataset == 'poverty':
    
    ~/anaconda3/envs/benchmark/lib/python3.7/site-packages/wilds/datasets/ogbmolpcba_dataset.py in __init__(self, version, root_dir, download, split_scheme)
         88             download_url('https://snap.stanford.edu/ogb/data/misc/ogbg_molpcba/scaffold_group.npy', os.path.join(self.ogb_dataset.root, 'raw'))
         89         self._metadata_array = torch.from_numpy(np.load(metadata_file_path)).reshape(-1,1).long()
    ---> 90         self._collate = PyGCollater(follow_batch=[])
         91 
         92         self._metric = Evaluator('ogbg-molpcba')
    
    TypeError: __init__() missing 1 required positional argument: 'exclude_keys'
    

    Versions:

    wilds 1.1.0, torch_geometric 1.7.0

    opened by arnaudvl 3
  • Could you provide the trained weights?


    Hello,

    I am training BERT+ERM on the Amazon dataset, but it is very time-consuming. Is it possible to provide the best trained parameters to users? (Just as BERT provides pretrained weights, maybe you could have another folder under examples that contains all the weights.) It would save users about a week (and computation). Thank you!

    opened by yachuan 3
  • camelyon17 split scheme: in-dist


    I am not able to run Camelyon17 with --split_scheme in-dist (I'm assuming this corresponds to the setting with ID val data).

    Any pointers on how to run this, or in general how to run camelyon with the ID val data?

    Thank you for the help!

    opened by thomaspzollo 1
  • run_expt.py: --device argument doesn't set the device


    Hey, I'm running Wilds on a p2.8xlarge AWS EC2 instance with 8 K80 GPUs. I noticed that when I try to run run_expt.py and use the --device argument to divide the jobs I'm trying to run between the GPUs, they all end up running on GPU 0. I verified this by the memory usage in nvidia-smi as well as printing the device used by torch using torch.cuda.current_device(). My guess is that the CUDA_VISIBLE_DEVICES environment variable, set here, is set too late and PyTorch just defaults to device 0.

    I've worked around this by setting the CUDA_VISIBLE_DEVICES variable manually, before running the script. I just thought I'd let you know I encountered this issue.
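The same workaround can be applied from inside a launcher script, as long as the variable is set before the first import of torch. The following is a minimal sketch under that assumption. The helper name pin_visible_gpu is hypothetical, and the integer --device flag is modeled on the argument mentioned above rather than copied from run_expt.py.

```python
import argparse
import os


def pin_visible_gpu(argv):
    """Parse --device and expose only that GPU to libraries imported later."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--device", type=int, default=0)
    args, remaining = parser.parse_known_args(argv)
    # Must happen before CUDA is initialized, i.e. before importing torch.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(args.device)
    return args.device, remaining


device, remaining = pin_visible_gpu(["--device", "3", "--dataset", "iwildcam"])
# Any "import torch" belongs after this point.
print(os.environ["CUDA_VISIBLE_DEVICES"], remaining)
# prints 3 ['--dataset', 'iwildcam']
```

Once the variable is set this way, the selected GPU is the only one the process can see, so it appears as device 0 inside the process.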

    Really appreciate the project by the way! Being able to access multiple datasets for domain generalization with the same interface is really useful, and I managed to use run_expt pretty easily to run my own experiments.

    opened by vvolhejn 0
Releases(v2.0.0)
  • v2.0.0(Dec 13, 2021)

    The v2.0.0 release adds unlabeled data to 8 datasets and several new algorithms for taking advantage of the unlabeled data. It also updates the standard data augmentations used for the image datasets.

    All labeled (training, validation, and test) datasets are exactly the same. Evaluation metrics are also exactly the same. All results on the datasets in v1.x are therefore still current and directly comparable to results obtained on v2.

    For more information, please read our paper on the unlabeled data.

    New datasets with unlabeled data

    We have added unlabeled data to the following datasets:

    1. iwildcam
    2. camelyon17
    3. ogb-molpcba
    4. globalwheat
    5. civilcomments
    6. fmow
    7. poverty
    8. amazon

    The following datasets have no unlabeled data and have not been changed:

    1. rxrx1
    2. py150

    The labeled training, validation, and test data in all datasets have been kept exactly the same.

    The unlabeled data comes from the same underlying sources as the original labeled data, and they can be from the source, validation, extra, or target domains. We describe each dataset in detail in our paper.

    Each unlabeled dataset has its own corresponding data loader defined in wilds/datasets/unlabeled. Please see the README for more details on how to use them.

    New algorithms for using unlabeled data

    In our scripts in the examples folder, we have updated and/or added new algorithms that make use of the unlabeled data:

    1. CORAL (Sun and Saenko, 2016)
    2. DANN (Ganin et al., 2016)
    3. AFN (Xu et al., 2019)
    4. Pseudo-Label (Lee, 2013)
    5. FixMatch (Sohn et al., 2020)
    6. Noisy Student (Xie et al., 2020)
    7. SwAV pre-training (Caron et al., 2020)
    8. Masked language model pre-training (Devlin et al., 2019)

    Other changes

    GlobalWheat v1.0 -> v1.1

    We have corrected some errors in the metadata for the previous version of the GlobalWheat (labeled) dataset. Users who did not explicitly make use of the location or stage metadata (which should be most users) will not be affected. All baseline results are unchanged.

    DomainNet support

    We have included data loaders for the DomainNet dataset (Peng et al., 2019) as a means of benchmarking the algorithms we implemented on existing datasets.

    Data augmentation

    We have added support for RandAugment (Cubuk et al., 2019) for RGB images, and we have also implemented a set of data augmentations for the multi-spectral Poverty dataset. These augmentations are used in all of the algorithms for unlabeled data listed above.

    Hyperparameters

    In our experiments to benchmark the algorithms for using unlabeled data, we tuned hyperparameters by random search instead of grid search. The default hyperparameters in examples/configs/datasets.py still work well but do not reflect the exact hyperparameters we used for our experiments. To see those, please view our CodaLab worksheet.

    Miscellaneous

    • In our example scripts, we have added support for gradient accumulation by specifying the gradient_accumulation_steps parameter.
    • We have also added support for logging using Weights and Biases.
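Gradient accumulation in general means summing gradients over several small batches before taking a single optimizer step. The plain-Python sketch below illustrates the idea on a one-parameter least-squares problem. It is not the code from the example scripts, and apart from the name gradient_accumulation_steps everything in it is hypothetical.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def sgd_step_with_accumulation(w, micro_batches, lr):
    """One parameter update from gradients accumulated over micro-batches."""
    gradient_accumulation_steps = len(micro_batches)
    accumulated = 0.0
    for batch in micro_batches:
        # Scale each micro-batch gradient so that the sum is an average.
        accumulated += grad_mse(w, batch) / gradient_accumulation_steps
    return w - lr * accumulated


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

# One step on the full batch versus one step accumulated over two halves.
full_batch_w = 0.0 - 0.01 * grad_mse(0.0, data)
accumulated_w = sgd_step_with_accumulation(0.0, [data[:2], data[2:]], lr=0.01)
print(full_batch_w, accumulated_w)
```

With equally sized micro-batches the two updates coincide, which is why accumulation is a common way to emulate a larger batch size when memory is limited.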
  • v1.2.2(Aug 4, 2021)

    v1.2.2 contains several minor changes:

    • Added a check to make sure that a group data loader is used whenever n_groups_per_batch or distinct_groups are passed in as arguments to examples/run_expt.py. (https://github.com/p-lambda/wilds/issues/79)
    • Data augmentations now only transform x by default. Set do_transform_y when initializing the WILDSSubset to modify both x and y. (https://github.com/p-lambda/wilds/issues/77)
    • For FasterRCNN, we now use the PyTorch implementation of smooth_l1_loss instead of the custom torchvision implementation, which was removed in torchvision v0.10.
    • Updated the requirements to include torchvision, scipy, and scikit-learn. Previously, torchvision was only needed for the example scripts. However, it is now also used for computing metrics in the GlobalWheat-WILDS dataset, so we have moved it into the core set of requirements.
  • v1.2.1(Jul 19, 2021)

    v1.2.1 adds two new benchmark datasets: the GlobalWheat wheat head detection dataset, and the RxRx1 cellular microscopy dataset. Please see our paper for more details on these datasets.

    It also simplifies saving and evaluating predictions made across different replicates and datasets.

    New datasets

    New benchmark dataset: GlobalWheat-WILDS v1.0

    • The Global Wheat Head detection dataset comprises images of wheat fields collected from 12 countries around the world. The task is to draw bounding boxes around instances of wheat heads in each image, and the distribution shift is over images taken in different locations.
    • Model performance is measured by the proportion of the predicted bounding boxes that sufficiently overlap with the ground truth bounding boxes (IoU > 0.5). The example script implements a FasterRCNN baseline.
    • This dataset is adapted from the Global Wheat Head Dataset 2021, which was recently used in a public competition held in conjunction with the Computer Vision in Plant Phenotyping and Agriculture Workshop at ICCV 2021.
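The overlap criterion mentioned above (IoU > 0.5) can be made concrete with a small worked example. The sketch assumes boxes are given as (x_min, y_min, x_max, y_max) tuples, which is an assumption about the format, and it only computes the overlap of a single pair of boxes. How predictions are matched to ground truth boxes in the actual metric is not shown here.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping region, clamped at zero.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


predicted = (10, 10, 50, 50)
ground_truth = (20, 20, 60, 60)
iou = box_iou(predicted, ground_truth)
print(round(iou, 3), iou > 0.5)
# prints 0.391 False
```

In this example the boxes share a 30 by 30 region out of a combined area of 2300, so the prediction would not count as a sufficient overlap.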

    New benchmark dataset: RxRx1-WILDS v1.0

    • The RxRx1 dataset comprises images of genetically-perturbed cells taken with fluorescent microscopy and collected across 51 experimental batches. The task is to classify the identity of the genetic perturbation applied to each cell, and the distribution shift is over different experimental batches.
    • Model performance is measured by average classification accuracy. The example script implements a ResNet-50 baseline.
    • This dataset is adapted from the RxRx1 dataset released by Recursion.

    Additional dataset: ENCODE

    • The ENCODE dataset is based on the ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge. The task is to classify if a given genomic location will be bound by a particular transcription factor, and the distribution shift is over different cell types.
    • We did not include this dataset in the official benchmark as we were unable to learn a model that could generalize across all the cell types simultaneously, even in an in-distribution setting, which suggested that the model family and/or feature set might not be rich enough.

    Other changes

    Saving and evaluating predictions

    To ease evaluation and leaderboard submission, we have made the following changes:

    • Predictions are now automatically saved in the format described in our submission guidelines.
    • We have added an evaluation script that evaluates these saved predictions across multiple replicates and datasets. See the updated README and examples/evaluate.py for more details.

    Code changes to support detection tasks

    To support detection tasks, we have modified the example scripts as well as made slight changes to the WILDS data loaders. All interfaces should be backwards-compatible.

    • The labels y and the model outputs no longer need to be a Tensor. For example, for detection tasks, a model might return a dictionary containing bounding box coordinates as well as class predictions for each bounding box. Accordingly, several helper functions have been rewritten to be more flexible.
    • Models can now optionally take in y in the forward call. For example, during training, a model might use ground truth bounding boxes to train a bounding box classifier.
    • Data transforms can now transform both x and y. We have also merged train_transform and eval_transform functions into a single function that takes an is_training parameter.

    Miscellaneous changes

    • We have changed the names of the in-distribution split_scheme options to match the terminology in Section 5 of the updated paper.
    • The FMoW-WILDS and PovertyMap-WILDS constructors now no longer use the oracle_training_set parameter to use an in-distribution split. This is now controlled through split_scheme to be consistent with the other datasets.
    • We fixed a minor bug in the PovertyMap-WILDS in-distribution baseline. The Val (ID) and Test (ID) splits are slightly changed.
    • The FMoW-WILDS constructor now sets use_ood_val=True by default. This change has no effect for users using the example scripts, as use_ood_val is already set in config/datasets.py.
    • Users who are only using the data loaders and not the evaluation metrics or example scripts will no longer need to install torch_scatter (thanks Ke Alexander Wang).
    • The Waterbirds dataset now computes the adjusted average accuracy on the validation and test sets, as described in Appendix C.1 of the corresponding paper.
    • The behavior of algorithm.eval() is now consistent with algorithm.model.eval() in that both preserve the grad_fn attribute (thanks Divya Shanmugam). See https://github.com/p-lambda/wilds/issues/45.
    • The dataset name for OGB-MolPCBA has been changed from ogbg-molpcba to ogb-molpcba for consistency.
    • We have updated the OGB-MolPCBA data loader to be compatible with v1.7 of the pytorch_geometric dependency (thanks arnaudvl). See https://github.com/p-lambda/wilds/issues/52.
  • v1.1.0(Mar 10, 2021)

    The v1.1.0 release contains a new Py150 benchmark dataset for code completion, as well as updates to several existing datasets and default models to make them significantly faster and easier to use.

    Some of these changes are breaking changes that will impact users who are currently running experiments with WILDS. We sincerely apologize for the inconvenience. We ask all users to update their package to v1.1.0, which will automatically update your datasets. In addition, please update your default models, for example by using the latest example scripts in this repo. These changes were primarily made to accelerate model training, which was a bottleneck for many users; at this time, we do not expect to have to make further changes to the existing datasets or default models.

    New datasets

    New benchmark dataset: Py150

    • The Py150-WILDS dataset is a code completion dataset, where the distribution shift is over code from different Github repositories.
    • We focus on accuracy on the subpopulation of class and method tokens, as prior work has shown that those are the most frequent queries in real-world code completion settings.
    • It is a variant of the Py150 dataset from Raychev et al., 2016.
    • See our paper for more details.

    Additional dataset: SQF

    • The SQF dataset is based on the stop-question-and-frisk dataset released by the New York Police Department. We adapt the version processed by Goel et al., 2016. The task is to predict criminal possession of a weapon.
    • We use this dataset to study distribution shifts in an algorithmic fairness context. Specifically, we consider subpopulation shifts across locations and race groups. However, while there are large performance gaps, we did not find that they were caused by the distribution shift. We therefore did not include this dataset as part of the official benchmark.

    Major updates to existing datasets

    Note that datasets are versioned separately from the main WILDS version. We have two major updates (i.e., breaking, non-backwards-compatible changes) to datasets.

    Amazon v1.0 -> v2.0

    • To speed up model training, we have subsampled the number of reviewers in this dataset to 25% of its original size, while keeping the same number of reviews per reviewer.
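Subsampling at the level of reviewers, rather than individual reviews, is what keeps the number of reviews per reviewer unchanged. A minimal sketch of that idea follows. It is not the procedure used to build the dataset, and the (reviewer_id, review_text) data layout, the function name, and the seed are hypothetical.

```python
import random


def subsample_reviewers(reviews, fraction, seed=0):
    """Keep every review from a randomly chosen fraction of the reviewers.

    reviews: list of (reviewer_id, review_text) pairs.
    """
    reviewers = sorted({reviewer for reviewer, _ in reviews})
    n_keep = max(1, round(len(reviewers) * fraction))
    kept = set(random.Random(seed).sample(reviewers, n_keep))
    # Reviews are filtered by reviewer, so kept reviewers lose nothing.
    return [(reviewer, text) for reviewer, text in reviews if reviewer in kept]


# Eight reviewers with three reviews each.
reviews = [(f"user{u}", f"review {k}") for u in range(8) for k in range(3)]
subset = subsample_reviewers(reviews, fraction=0.25)
print(len({reviewer for reviewer, _ in subset}), len(subset))
# prints 2 6
```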

    iWildCam v1.0 -> v2.0

    • Previously, the ID split was done uniformly at random, meaning that images from the same sequence (i.e., taken within a few seconds of each other by the same camera) could be found across all of the training / validation (ID) / test (ID) sets.
    • In v2.0, we have redone the ID split so that all images taken on the same day by the same camera are in only one of the training, validation (ID), or test (ID) sets. In other words, these sets still comprise images from the same cameras, but taken on different days.
    • In line with the new iWildCam 2021 challenge on Kaggle, we have also removed the following images:
      • images that include humans or pictures taken indoors.
      • images with non-animal categories such as start and unidentifiable.
      • images in categories such as unknown, unknown raptor and unknown rat.
    • We added back in location 537 that was previously removed as we mistakenly believed those images were corrupted.
    • We have re-split the data into training, validation (ID), test (ID), validation (OOD), and test (OOD) sets. This is a different random split from v1.0.
    • Since we remove any classes that do not end up in the train split, removing those images and redoing the split gave us a different set of species. There are now 182 classes instead of 186. Specifically, the following classes have been removed: ['unknown', 'macaca fascicularis', 'proechimys sp', 'unidentifiable', 'turtur calcospilos', 'streptopilia senegalensis', 'equus africanus', 'macaca nemestrina', 'start', 'paleosuchus sp', 'unknown raptor', 'unknown rat', 'misfire', 'mustela lutreolina', 'canis latrans', 'myoprocta pratti', 'xerus rutilus', 'end', 'psophia crepitans', 'ictonyx striatus']. The following classes have been added: ['praomys tullbergi', 'polyplectron chalcurum', 'ardeotis kori', 'phaetornis sp', 'mus minutoides', 'raphicerus campestris', 'tigrisoma mexicanum', 'leptailurus serval', 'malacomys longipes', 'oenomys hypoxanthus', 'turdus olivaceus', 'macaca sp', 'leiothrix argentauris', 'lophura sp', 'mazama temama', 'hippopotamus amphibius']. For convenience, we have also added a categories.csv that maps from label IDs to species names.
    • To speed up downloading and model training (by reducing the I/O bottleneck), we have also resized all images to have a height of 448px while keeping the original aspect ratio. All images are wider than they are tall, so their minimum dimension is now 448px. Note that because JPEG compression is lossy, the stored images differ slightly from what you would get by loading the full-sized image and resizing it in code.
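The day-level ID split described above can be sketched as a grouped split: images are first bucketed by (camera, date), and whole buckets are then assigned to a split. This is only an illustration of the idea, not the code used to build iWildCam v2.0. The record keys `id`, `camera`, and `date`, the split fractions, and the function name are all assumptions made for the example.

```python
import random
from collections import defaultdict


def split_by_camera_day(images, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign every (camera, date) group to exactly one ID split.

    `images` is a list of dicts with the hypothetical keys
    'id', 'camera', and 'date'. Returns {split_name: [image ids]}.
    """
    # Bucket image ids by the (camera, date) pair they belong to.
    groups = defaultdict(list)
    for img in images:
        groups[(img["camera"], img["date"])].append(img["id"])

    # Sort before shuffling so the result depends only on the seed.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    # The fractions apply to groups, not to individual images.
    n_train = int(fractions[0] * len(keys))
    n_val = int(fractions[1] * len(keys))
    assignment = {
        "train": keys[:n_train],
        "val_id": keys[n_train:n_train + n_val],
        "test_id": keys[n_train + n_val:],
    }
    return {
        name: [image_id for key in split_keys for image_id in groups[key]]
        for name, split_keys in assignment.items()
    }
```

Because whole groups are assigned at once, near-duplicate frames from one burst cannot straddle two splits, while every camera can still contribute images to all three.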

    Minor updates to existing datasets

    We made two backwards-compatible changes to existing datasets. We encourage all users to update these datasets; these updates should leave results unchanged (modulo training randomness). In future versions of the WILDS package, we will deprecate the older versions of these datasets.

    FMoW v1.0 -> v1.1

    • Previously, the images were stored as chunks in .npy files and read in using NumPy memmapping.
    • Now, we have converted them (losslessly) into individual PNG images. This should help with disk I/O and memory usage, and make them more convenient to visualize and use in other pipelines.

    PovertyMap v1.0 -> v1.1

    • Previously, the images were stored in a single .npy file and read in using NumPy memmapping.
    • Now, we have converted them (losslessly) into individual compressed .npz files. This should help with disk I/O and memory usage.
    • We have correspondingly updated the default number of workers for the data loader from 1 to 4.
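As a rough illustration of this kind of conversion, the sketch below writes each example from a stacked array into its own compressed .npz file and reads one back. It is not the conversion script used for PovertyMap. The file naming scheme, the archive key `x`, and both function names are assumptions chosen for the example.

```python
from pathlib import Path

import numpy as np


def export_individual_npz(stacked, out_dir):
    """Write each example in `stacked` to its own compressed .npz file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for idx, image in enumerate(stacked):
        path = out_dir / f"image_{idx}.npz"  # hypothetical naming scheme
        np.savez_compressed(path, x=image)   # 'x' is an assumed key name
        paths.append(path)
    return paths


def load_image(path):
    """Load a single example back from its .npz file."""
    with np.load(path) as archive:
        return archive["x"]
```

With one file per example, each data loader worker reads only the examples it needs, which fits the move to several workers noted above.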

    Default model updates

    We have updated the default models for several datasets. Please take note of these changes if you are currently running experiments with these datasets.

    Amazon and CivilComments

    • To speed up model training, we have switched from BERT-base-uncased to DistilBERT-base-uncased. This obtains roughly similar accuracy but at twice the speed.
    • For CivilComments, we have also increased the number of replicates from 3 to 5, to reduce variability in the reported performance.

    Camelyon17

    • Previously, we were upsizing each image to 224x224 before passing it into the model.
    • We now leave the images at their original resolution of 96x96, which significantly speeds up model training.

    iWildCam

    • Previously, we were resizing each image to 224x224 before passing it into the model. However, this limited model accuracy, as the animals in the images can sometimes be quite small.
    • We now resize each image to 448x448 before passing it into the model, which improves accuracy and macro F1 across the board.

    FMoW

    • For consistency with the other datasets, we have changed the early stopping validation criterion (val_metric) from acc_avg to acc_worst_region.

    PovertyMap

    • For consistency with the other datasets, we have changed the early stopping validation criterion (val_metric) from r_all to r_wg.
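To see why the choice of early stopping criterion matters, the sketch below compares average accuracy with worst-group accuracy on toy predictions. It is a simplified illustration and not the WILDS evaluator. The function names and the group labels are made up for the example.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} for the examples in each group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for target, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(target == pred)
    return {group: correct[group] / total[group] for group in total}


def average_and_worst(y_true, y_pred, groups):
    """Return (overall accuracy, accuracy of the worst group)."""
    overall = sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)
    worst = min(accuracy_by_group(y_true, y_pred, groups).values())
    return overall, worst


# Toy validation set: 8 examples from region r1 and 2 from region r2.
y_true = [1] * 10
groups = ["r1"] * 8 + ["r2"] * 2
checkpoint_a = [1] * 8 + [0, 0]              # perfect on r1, fails on r2
checkpoint_b = [1] * 6 + [0, 0] + [1, 0]     # weaker on r1, better on r2

for name, preds in [("A", checkpoint_a), ("B", checkpoint_b)]:
    print(name, average_and_worst(y_true, preds, groups))
```

Selecting on average accuracy would keep checkpoint A, while selecting on the worst group would keep checkpoint B. The same selection logic would apply to a worst-group correlation metric in place of accuracy.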

    Other changes

    • We have uploaded an executable version of our paper to CodaLab. This contains the exact commands, code, and data used for each experiment reported in our paper. The trained model weights for every experiment can also be found there.
    • To ease downloading, we have added wilds/download_datasets.py, which allows users to download all (or a subset of) datasets at once. Please see the README for instructions.
    • We have added a convenience function for getting the appropriate constructor for each dataset in wilds/get_dataset.py. This function allows you to specify a version argument. If this is not specified, it defaults to the latest available version for that dataset. If that version is not downloaded and the download argument is also set, then it will automatically download that version.
    • The example script examples/run_expt.py now also takes in a version argument.
    • We have added download sizes and expected training times to the README.
    • We have updated the default inputs for WILDSDatasets.eval methods for various datasets. For example, eval for most classification datasets now takes in predicted labels by default, whereas predictions were previously passed in as logits. The default inputs vary across datasets, and we document this in the docstring of each eval method.
    • We made a few updates to the code in examples/ to interface better with language modeling tasks (for Py150). None of these changes affect the results or the interface with algorithms.
    • We updated the code in examples/ to save model predictions in an appropriate format for submissions to the leaderboard.
    • Finally, we have also updated our paper to streamline the writing and include these new numbers and datasets.
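Regarding the change to the default eval inputs noted above: code that used to pass logits for a classification dataset may need to convert them to predicted labels first. The helper below is a minimal, framework-free sketch of that conversion. The function name is hypothetical, and the docstring of each dataset's eval method remains the authority on what it expects.

```python
def logits_to_labels(logits):
    """Return the index of the largest logit in each row.

    `logits` is a sequence of per-example score lists. With PyTorch
    tensors, the equivalent is `logits.argmax(dim=-1)`.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


batch_logits = [[0.1, 2.0, -1.0], [3.0, 0.5, 0.2]]
predicted_labels = logits_to_labels(batch_logits)
```

The resulting labels, rather than the raw logits, would then be passed as the predictions for datasets whose eval method expects labels.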