Simple data balancing baselines for worst-group-accuracy benchmarks.

Overview

BalancingGroups

Code to replicate the experimental results from "Simple data balancing baselines achieve competitive worst-group-accuracy" (Idrissi et al., 2022).

Replicating the main results

Set environment variables

export DATASETS_PATH=/path/to/datasets
export SLURM_PATH=/path/to/slurm/logs

Download and extract datasets

Generate dataset metadata

cd metadata/
python generate_metadata_waterbirds.py
python generate_metadata_celeba.py
python generate_metadata_civilcomments.py
python generate_metadata_multinli.py
cd ..

Launch jobs

# Launching 1400 combo seeds = 50 hyperparameters x 4 datasets x 7 algorithms.
# Each combo seed is run 5 times to compute error bars, totalling 7000 jobs.
./train.py --output_dir main_sweep --num_hparams_seeds 1400

Parse results

./parse.py main_sweep

License

This source code is released under the CC-BY-NC license, included in this repository.

Comments
  • Best Model Parameters

    Do you plan to release the best models for all the algorithms and datasets? Based on the paper, it is not clear what the best hyperparameter values were, since it reports only the mean and std over the top 5. Releasing the models would also make the work more accessible to researchers with less compute. Thanks :)

    opened by pratyushmaini 4
  • File Not Found for civilcomments train

    Are you using a different file for the train set? It does not seem to get downloaded automatically.

    FileNotFoundError: [Errno 2] No such file or directory: 'tr/civilcomments/civilcomments_fine.csv'


    File "train.py", line 57, in run_experiment loaders = get_loaders(args["data_path"], args["dataset"], args["batch_size"], args["method"]) File "datasets.py", line 347, in get_loaders dataset_tr = Dataset(data_path, "tr", subsample_what, duplicates) File "datasets.py", line 235, in init super().init(split, data_path, subsample_what, duplicates, "fine") File "datasets.py", line 195, in init text = pd.read_csv(

    opened by pratyushmaini 3
  • An issue for the JTT code.

    Sorry to disturb you. I wonder why the upweighting code reads "self.weights[i] += predictions.detach() * (self.hparams["up"] - 1)". I think this will upweight the correctly-classified samples (see the sketch after this item).

    opened by LJSthu 2
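
    For context, here is a minimal sketch of the JTT reweighting rule described in the JTT paper (Liu et al., 2021), assuming `wrong` is a 0/1 indicator of examples the identification model misclassified; the variable names are hypothetical, not taken from the repo. If `predictions` in the quoted line flags errors rather than correct predictions, the repo's code matches this rule.

    import torch

    # JTT: upweight phase-1 errors by a factor `up`, leave the rest at 1.
    # w_i = 1 + wrong_i * (up - 1)  ->  `up` for errors, 1 otherwise.
    up = 20.0                                # hypothetical upweighting factor
    wrong = torch.tensor([0., 1., 0., 1.])   # 1 = misclassified in phase 1
    weights = torch.ones_like(wrong)
    weights += wrong * (up - 1)              # tensor([ 1., 20.,  1., 20.])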
  • Questions regarding the experiment on civilcomments

    I have some questions regarding the group assignments for the civilcomments dataset. The paper says coarse grouping is used, so there should only be two groups (whether or not the example mentions one of the identities).

    I have generated the metadata (metadata_civilcomments_coarse.csv) with setup_datasets.py. In this metadata file, there appear to be 8 different groups in the column named "a" (values 0-7), which seems inconsistent with the two groups mentioned in the paper. From the code, it appears that only the training set uses the coarse grouping while the validation and test sets use the fine grouping. I am wondering why it is designed this way, or whether I have misunderstood something (see the sketch after this item for the collapsing I would have expected).

    Thank you for your time in advance.

    opened by yangarbiter 2
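
    For reference, a sketch of how one could collapse a fine-grained identity column into the binary coarse grouping the paper describes. The file and column names follow the question above; the exact semantics of column "a" are an assumption, not confirmed by the repo.

    import pandas as pd

    # Collapse fine groups (assumed: 0 = no identity, 1-7 = specific identities)
    # into a binary coarse group: 0 = no identity mentioned, 1 = any identity.
    df = pd.read_csv("metadata_civilcomments_fine.csv")
    df["a"] = (df["a"] > 0).astype(int)
    df.to_csv("metadata_civilcomments_coarse.csv", index=False)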
  • Weight decay for BERT models

    Hi! I noticed that in your code for the BERT AdamW optimizer you only apply weight decay to parameters whose names contain the strings bias or LayerNorm.weight:

    https://github.com/facebookresearch/BalancingGroups/blob/72d31e56e168b8ab03348810d4c5bac0f8a90a7a/models.py#L41-L45

    The original group DRO code seems to do the opposite: it excludes exactly those parameters from weight decay (see the sketch after this item):

    https://github.com/kohpangwei/group_DRO/blob/master/train.py#L111-L114

    opened by izmailovpavel 0
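
    For context, a minimal sketch of the conventional BERT fine-tuning setup (the behavior the original group DRO code follows), which excludes biases and LayerNorm weights from weight decay and decays everything else. The tiny module and hyperparameter values are placeholders for illustration, not the repo's actual model.

    import torch
    from torch.optim import AdamW

    class TinyEncoder(torch.nn.Module):
        # Toy stand-in for a BERT layer; attribute names mimic transformers,
        # so named_parameters() yields e.g. "LayerNorm.weight", "dense.bias".
        def __init__(self):
            super().__init__()
            self.dense = torch.nn.Linear(768, 768)
            self.LayerNorm = torch.nn.LayerNorm(768)

    model = TinyEncoder()
    no_decay = ["bias", "LayerNorm.weight"]
    grouped_params = [
        {   # decayed: everything except biases and LayerNorm weights
            "params": [p for n, p in model.named_parameters()
                       if not any(nd in n for nd in no_decay)],
            "weight_decay": 1e-2,
        },
        {   # not decayed: biases and LayerNorm weights
            "params": [p for n, p in model.named_parameters()
                       if any(nd in n for nd in no_decay)],
            "weight_decay": 0.0,
        },
    ]
    optimizer = AdamW(grouped_params, lr=1e-5)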
Owner
Facebook Research