COResets and Data Subset selection
Reduce end-to-end training time from days to hours (or hours to minutes), and energy requirements and costs by an order of magnitude, using coresets and data selection.
What is CORDS?
CORDS is a COReset and Data Selection library for making machine learning time, energy, cost, and compute efficient. CORDS is built on top of PyTorch. Today's deep learning systems are extremely compute intensive, with long turnaround times, energy inefficiencies, high costs, and heavy resource requirements [1,2]. CORDS is an effort to make deep learning more energy, cost, resource, and time efficient without sacrificing accuracy. CORDS aims to achieve the following goals:
- Data Efficiency
- Reducing End-to-End Training Time
- Reducing Energy Requirement
- Faster Hyper-parameter tuning
- Reducing Resource (GPU) Requirement and Costs
The primary purpose of CORDS is to select representative data subsets from massive datasets, and it does so iteratively. CORDS draws on recent advances in data subset selection, particularly the ideas of coresets and submodularity, to select such subsets. CORDS implements a number of state-of-the-art data subset selection and coreset algorithms. The algorithms currently implemented in CORDS include:
- GLISTER [3]
- GradMatch [4]
- CRAIG [4,5]
- SubmodularSelection [6,7,8] (Facility Location, Feature Based Functions, Coverage, Diversity)
- RandomSelection
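To give a feel for what submodular subset selection does, here is a minimal sketch of greedy Facility Location selection in plain Python. It is not the CORDS implementation or API: the function name `facility_location_greedy`, the Gaussian-style similarity, and the brute-force greedy loop are all illustrative assumptions chosen to keep the example self-contained.

```python
import math


def facility_location_greedy(points, budget):
    """Greedily pick up to `budget` indices that maximize the facility-location value
    f(S) = sum_i max_{j in S} sim(i, j).

    Illustrative sketch only; this is not the CORDS API.
    """
    n = len(points)
    # Assumed similarity: exp(-squared Euclidean distance). Any non-negative
    # similarity could be substituted here.
    sim = [
        [math.exp(-sum((a - b) ** 2 for a, b in zip(p, q))) for q in points]
        for p in points
    ]
    selected = []
    best_cover = [0.0] * n  # best similarity of each point to the selected set so far
    for _ in range(min(budget, n)):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding j: how much it improves each point's coverage.
            gain = sum(max(sim[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], sim[i][best_j]) for i in range(n)]
    return selected


if __name__ == "__main__":
    # Two well-separated clusters; a budget of 2 picks one representative from each.
    data = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
    print(facility_location_greedy(data, budget=2))
```

The greedy loop works because each candidate's marginal gain can only shrink as the selected set grows, which is the diminishing-returns property that defines submodular functions. Practical implementations typically avoid the full pairwise similarity matrix and use lazy evaluation of the gains.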
We are continuously incorporating newer and better algorithms into CORDS. Some of the features of CORDS include:
- Reproducibility of SOTA in Data Selection and Coresets: CORDS makes it easy to reproduce the SOTA algorithms listed above. We are also working to add more algorithms, so if you have an algorithm you would like us to include, please let us know.
- Benchmarking: We have benchmarked CORDS (and the algorithms present right now) on several datasets including CIFAR-10, CIFAR-100, MNIST, SVHN and ImageNet.
- Ease of Use: One of the main goals of CORDS is to be easy to use and easy to extend. Feel free to contribute to CORDS!
- Modular design: The data selection algorithms are separate from the training loop, which keeps the design modular and lets the same algorithms be used in varied scenarios.
- Broad range of use cases: CORDS is currently implemented for simple image classification tasks and hyperparameter tuning, but we are working on integrating a number of additional use cases such as object detection, speech recognition, semi-supervised learning, Auto-ML, etc.
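The modular design point above can be pictured with a small sketch in which the selection strategy and the training loop only communicate through a list of indices. Everything here is hypothetical and stdlib-only: `RandomStrategy`, `train_with_subsets`, the `select_every` parameter, and the `train_one_epoch` callback are illustrative names, not part of the CORDS API, and re-selecting every few epochs is one assumed reading of "iteratively".

```python
import random


class RandomStrategy:
    """Hypothetical strategy that returns a fresh random subset of indices."""

    def __init__(self, num_samples, seed=0):
        self.num_samples = num_samples
        self.rng = random.Random(seed)

    def select(self, budget, model_state=None):
        # A model-aware strategy (for example, a gradient-based one) would use
        # model_state here; random selection ignores it.
        return sorted(self.rng.sample(range(self.num_samples), budget))


def train_with_subsets(strategy, num_epochs, budget, select_every, train_one_epoch):
    """Train on a subset, asking the strategy for a new one every `select_every` epochs.

    Returns the subset used in each epoch so the schedule can be inspected.
    """
    subset, model_state, history = [], None, []
    for epoch in range(num_epochs):
        if epoch % select_every == 0:
            subset = strategy.select(budget, model_state)
        # The training step only sees indices; it knows nothing about the strategy.
        model_state = train_one_epoch(subset, model_state)
        history.append(list(subset))
    return history


if __name__ == "__main__":
    # Stand-in training step: counts how many samples have been processed.
    def fake_epoch(subset, state):
        return (state or 0) + len(subset)

    schedule = train_with_subsets(RandomStrategy(100), 6, 10, 2, fake_epoch)
    for epoch, indices in enumerate(schedule):
        print(epoch, indices)
```

Because the loop depends only on a `select(budget, model_state)` method, swapping random selection for a different strategy would not require touching the training code, which is the kind of separation the feature list describes.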
Installation
- To install the latest version of the CORDS package with pip (from the TestPyPI index):
  pip install -i https://test.pypi.org/simple/ cords
- To install from source:
  git clone https://github.com/decile-team/cords.git
  cd cords
  pip install -r requirements/requirements.txt
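After installing, a quick way to confirm that the package is visible to the current Python environment is to look up its import spec. This is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical, and treating `cords` as the import name is an assumption based on the package name above.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Assumption: the import name matches the package name "cords".
    print("cords importable:", is_installed("cords"))
```

If this reports that the module cannot be found, the install most likely went into a different environment than the one running the script.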
Next Steps
Tutorials
Documentation
The documentation for the latest version of CORDS can always be found here.