Python Auto-ML Package for Tabular Datasets

Overview
Tabular-AutoML

AutoML Package for tabular datasets

Tuning on tabular datasets is now hassle-free!

Run a single one-line command to get the best tuned model and the processed dataset in one go.

Used Python Libraries :
lightgbm numpy

Installation & Usage


  1. Create a virtual environment : Tutorial
  2. Clone the repository.
  3. Open the repository directory in a terminal (e.g. cmd).
  4. Run this command in the terminal to install the dependencies.
pip install -r requirements.txt
  5. Installing from requirements.txt may fail with an error if the MS Visual C++ Build Tools are outdated. Updating them should resolve the problem.
  6. First check the parser arguments that have to be passed for any customization.
>>> python -m tab_automl.main --help
usage: main.py [-h] -d  -t  -tf  [-p] [-f] [-spd] [-sfd] [-sm]

automl hyper parameters

optional arguments:
  -h, --help            show this help message and exit
  -d , --data-source    File path
  -t , --problem-type   Problem Type , currently supporting *regression* or *classification*
  -tf , --target-feature
                        Target feature inside the data
  -p , --pre-proc       If data processing is required
  -f , --fet-eng        If feature engineering is required
  -spd , --save-proc-data
                        Save the processed data
  -sfd , --save-fet-data
                        Save the feature engineered data
  -sm , --save-model    Save the best trained model
  7. Now run the command with your custom data, problem type and target feature.
>>> # For Regression Problem
>>> python -m tab_automl.main -d "your custom data source\custom_data.csv" -t "regression" -tf "your_custom_target_feature" -spd "true" -sfd "true" -sm "true"

>>> # For Classification Problem
>>> python -m tab_automl.main -d "your custom data source\custom_data.csv" -t "classification" -tf "your_custom_target_feature" -spd "true" -sfd "true" -sm "true"
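For readers who want to see how such a command line can be wired up, here is a minimal sketch of an argparse parser that accepts the flags listed in the help output. It is reconstructed from that output only: the function name `build_parser`, the `"true"` defaults and the use of plain strings are assumptions, not the package's actual code.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags shown in the --help output."""
    parser = argparse.ArgumentParser(prog="main.py", description="automl hyper parameters")
    # Required arguments, as indicated by the usage line.
    parser.add_argument("-d", "--data-source", required=True, metavar="", help="File path")
    parser.add_argument("-t", "--problem-type", required=True, metavar="",
                        help="Problem type: regression or classification")
    parser.add_argument("-tf", "--target-feature", required=True, metavar="",
                        help="Target feature inside the data")
    # Optional switches; the "true" default is an assumption made for this sketch.
    optional_flags = [
        ("-p", "--pre-proc", "If data processing is required"),
        ("-f", "--fet-eng", "If feature engineering is required"),
        ("-spd", "--save-proc-data", "Save the processed data"),
        ("-sfd", "--save-fet-data", "Save the feature engineered data"),
        ("-sm", "--save-model", "Save the best trained model"),
    ]
    for short_name, long_name, text in optional_flags:
        parser.add_argument(short_name, long_name, default="true", metavar="", help=text)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["-d", "custom_data.csv", "-t", "classification", "-tf", "label", "-sm", "true"]
)
print(args.data_source, args.problem_type, args.target_feature, args.save_model)
```

Dashes in the long option names become underscores on the parsed namespace, so `--data-source` is read back as `args.data_source`.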

Contributing Guidelines


  1. Comment on the issue that you want to work on.
  2. If you get assigned, fork the repository.
  3. Create a new branch named after your GitHub user ID, e.g. sagnik1511.
  4. Commit your changes on that branch.
  5. Create a PR (pull request) to the main branch of the parent repository.
  6. The PR title should be named like this: [Issue Number] Heading of the issue.
  7. Describe the changes you have made, with proper reasons.

Contributors


  1. Sagnik Roy : sagnik1511

If you like the project, do

Also follow me on GitHub, Kaggle, LinkedIn

Thank You for Visiting :)

Comments
  • Add new dataset for clustering problems inside datasets folder

    Add a new tabular dataset for clustering problems inside tab_automl/datasets and also add the dataset class inside tab_automl/automl/datasets.py.

    Add proper comments and keep the code quality high.

    Follow contributing guidelines on README.md

    enhancement JWOC medium 
    opened by sagnik1511 7
  • Adding KNN classifier

    This PR fixes issue #7.

    Added a KNN classifier to models.py. Its advantage is that it takes almost no time to train, because fitting only stores the training data, so it trains faster than all the other models in models.py. It is also a non-parametric model whose main hyperparameter is the number of neighbours. Because KNN does not undergo a separate training phase, new data can be added without retraining. It is also easy to implement and interpret.

    opened by VishnuBhaarath 5
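    To illustrate why k-nearest neighbours has essentially no training cost, here is a minimal pure-Python sketch. The class name `TinyKNNClassifier` and the toy data are made up for illustration; this is not the code added in the PR.

```python
from collections import Counter
import math


class TinyKNNClassifier:
    """Illustrative k-NN classifier; not the implementation added in the PR."""

    def __init__(self, n_neighbors=3):
        self.n_neighbors = n_neighbors
        self._X, self._y = [], []

    def fit(self, X, y):
        # "Training" only stores the samples; no parameters are learned.
        self._X, self._y = list(X), list(y)
        return self

    def predict(self, X):
        predictions = []
        for query in X:
            # Sort the stored samples by Euclidean distance to the query point.
            nearest = sorted(
                zip(self._X, self._y), key=lambda pair: math.dist(pair[0], query)
            )[: self.n_neighbors]
            # Majority vote among the k nearest labels.
            votes = Counter(label for _, label in nearest)
            predictions.append(votes.most_common(1)[0][0])
        return predictions


model = TinyKNNClassifier(n_neighbors=3).fit(
    [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)],
    ["a", "a", "a", "b", "b", "b"],
)
print(model.predict([(0.2, 0.3), (5.5, 5.2)]))
```

    All the work happens at prediction time, so training is nearly free while the cost of each prediction grows with the number of stored samples.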
  • Adding clustering 5 models inside single_model_dict #32

    PR tagged with #32

    I have added 5 clustering models in automl -> models.py: AffinityPropagation, AgglomerativeClustering, Birch, DBSCAN and KMeans.

    Description

    • AffinityPropagation: finds a set of exemplars that best summarize the data. It takes as input measures of similarity between pairs of data points. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges.
    • AgglomerativeClustering: merges examples until the desired number of clusters is reached. It is implemented via the class AgglomerativeClustering, and the main configuration to tune is "n_clusters", an estimate of the number of clusters in the data, e.g. 2.
    • BIRCH: constructs a tree structure from which cluster centroids are extracted. The main configuration to tune is the "threshold" and "n_clusters" hyperparameters, the latter of which provides an estimate of the number of clusters.
    • DBSCAN: finds high-density areas in the domain and expands those areas of the feature space around them as clusters. The main configuration to tune is the "eps" and "min_samples" hyperparameters.
    • KMeans: the main configuration to tune is the "n_clusters" hyperparameter, set to the estimated number of clusters in the data.

    JWOC easy 
    opened by Tihsrah 3
  • KNN regressor

    This PR fixes issue #6.

    Added a KNN regressor to models.py. Its advantage is that it takes almost no time to train, because fitting only stores the training data, so it trains faster than all the other models in models.py. It is also a non-parametric model whose main hyperparameter is the number of neighbours. Because KNN does not undergo a separate training phase, new data can be added without retraining. It is also easy to implement and interpret, and it is versatile: the same method can be used as a regressor as well as a classifier.

    JWOC easy 
    opened by VishnuBhaarath 2
  • Update all Print statements to f-string

    opened by kunalchhabra37 2
  • [ Taking different formats as input data {#4}] Load data from different file formats

    The newly supported file formats are: .txt, .json, .xlsx and .sqlite.

    The changes are made to ClassificationDataset in datasets.py.

    One extra input is added to get the table name when loading the .sqlite file format.

    JWOC medium 
    opened by Tihsrah 2
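    As a rough illustration of loading by file extension, the sketch below dispatches on the suffix using only the standard library. The function name `load_rows`, the tab delimiter for .txt files and the assumption that .json files hold an array of objects are all choices made for this example; .xlsx is left out because it needs a third-party reader. It is not the code from the PR.

```python
import csv
import json
import os
import sqlite3
import tempfile
from pathlib import Path


def load_rows(path, table=None):
    """Hypothetical loader returning a list of dict rows, chosen by file extension."""
    suffix = Path(path).suffix.lower()
    if suffix in {".csv", ".txt"}:
        # Assumption: .csv is comma-delimited and .txt is tab-delimited, both with a header.
        delimiter = "," if suffix == ".csv" else "\t"
        with open(path, newline="") as handle:
            return list(csv.DictReader(handle, delimiter=delimiter))
    if suffix == ".json":
        # Assumption: the file holds a JSON array with one object per row.
        with open(path) as handle:
            return json.load(handle)
    if suffix == ".sqlite":
        if table is None:
            raise ValueError("a table name is required for .sqlite files")
        connection = sqlite3.connect(path)
        try:
            connection.row_factory = sqlite3.Row
            # Table names cannot be bound as parameters, so quote the identifier.
            quoted = '"' + table.replace('"', '""') + '"'
            cursor = connection.execute(f"SELECT * FROM {quoted}")
            return [dict(row) for row in cursor]
        finally:
            connection.close()
    raise ValueError(f"unsupported file format: {suffix}")


# Small demonstration with a throwaway CSV file.
with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "demo.csv")
    with open(csv_path, "w", newline="") as handle:
        handle.write("x,y\n1,2\n3,4\n")
    demo_rows = load_rows(csv_path)
print(demo_rows)
```

    Values read from delimited text arrive as strings, so a real loader would still need a type-conversion step before training.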
  • Load data from different file formats.

    Datasets are currently loaded from .csv files only in the codebase; example : see here.

    Add different data loading techniques for other formats like .txt , .sqlite, etc. Add required comments in the code.

    Follow contributing guidelines on README.md

    enhancement JWOC medium 
    opened by sagnik1511 2
  • [Issue {#32}]Add clustering models inside single_model_dict

    This PR fixes issue #32

    Changes made

    Added 5 new clustering models in models.py

    Reason

    • Mini-Batch K-Means: a modified version of k-means that updates the cluster centroids using mini-batches of samples rather than the entire dataset, which can make it faster for large datasets and perhaps more robust to statistical noise.
    • Mean Shift: finds and adapts centroids based on the density of examples in the feature space.
    • OPTICS: OPTICS clustering (short for Ordering Points To Identify the Clustering Structure) is a modified version of DBSCAN.
    • Spectral Clustering: a general class of clustering methods drawn from linear algebra. The main configuration to tune is the n_clusters hyperparameter, used to specify the estimated number of clusters in the data.
    • Gaussian Mixture Model: summarizes a multivariate probability density function with a mixture of Gaussian probability distributions.

    JWOC easy 
    opened by snega16 1
  • [Issue {#25}] Add new dataset for clustering problems inside datasets folder

    This PR fixes issue #25

    Changes made:

    • Added a new clustering dataset, Credit Card Customer Data, to datasets
    • Updated datasets.py with the clustering class clustering() and a clustering dataset class called Credit_Card_Customer_Data().
    JWOC medium 
    opened by snega16 1
  • [Issue {#7}] add new models for classification training

    This PR fixes issue #7.

    Added a KNN classifier to models.py. Its advantage is that it takes almost no time to train, because fitting only stores the training data, so it trains faster than all the other models in models.py. It is also a non-parametric model whose main hyperparameter is the number of neighbours. Because KNN does not undergo a separate training phase, new data can be added without retraining. It is also easy to implement and interpret.

    JWOC easy 
    opened by VishnuBhaarath 1
  • [Issue {#7}] Add new models for classification training

    This PR fixes issue #7

    Changes made

    • Added new classification model XGBoost Classifier in models.py
    • Added new requirements for the model in requirements.txt

    Reason

    The XGBoost Classifier handles missing data efficiently and has a built-in cross-validation capability. It is also regularized, which helps reduce overfitting. In addition, it uses gradient boosting to minimize the loss, so this model can give good accuracy.

    JWOC easy 
    opened by snega16 1
  • Add clustering models inside single_model_dict

    What you have to do:
    1. Add 5 new clustering models inside the single_model_dict object of the tab_automl.automl.models file.
    2. Follow the same code representation.

    Follow contributing guidelines on README.md

    JWOC easy 
    opened by sagnik1511 10
  • Add a new single_model_trainer function for training clustering problems.

    What you have to do:
    1. Add a new function inside the tab_automl.automl.training.Trainer class for training clustering problems.
    2. Inside the main function of tab_automl.main, some implementation will be needed so that the clustering problem type fits.

    Follow contributing guidelines on README.md

    JWOC medium 
    opened by sagnik1511 0
  • Update the parser with the new problem type "Clustering"

    What you have to do:
    1. Update the parser's problem type definitions.
    2. Update tab_automl.utils.misc.validate_parse_variable, as it was written to check only the classification and regression problem types.
    3. The target variable parser argument should have a default value of None, as a clustering problem does not take a target variable. Keep in mind that if the problem type is a supervised technique, the target_feature should still be checked inside the tab_automl.utils.misc.validate_parse_variable function.
    4. Also update the README.md where it specifies the problem types.

    Follow contributing guidelines on README.md

    help wanted JWOC medium 
    opened by sagnik1511 0
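    The validation rule described in this issue could look roughly like the sketch below. The function name `validate_problem_args` and the error messages are hypothetical; the real validate_parse_variable function may be structured quite differently.

```python
SUPERVISED = {"classification", "regression"}
UNSUPERVISED = {"clustering"}


def validate_problem_args(problem_type, target_feature=None):
    """Hypothetical check; not the package's validate_parse_variable."""
    problem_type = problem_type.lower()
    if problem_type not in SUPERVISED | UNSUPERVISED:
        raise ValueError(f"unknown problem type: {problem_type!r}")
    # Supervised problems must name the column to predict.
    if problem_type in SUPERVISED and not target_feature:
        raise ValueError("supervised problems need a target feature")
    # Clustering has no target, so the default of None must be left alone.
    if problem_type in UNSUPERVISED and target_feature:
        raise ValueError("clustering does not take a target feature")
    return problem_type, target_feature


print(validate_problem_args("clustering"))
print(validate_problem_args("regression", "price"))
```

    Keeping the supervised and unsupervised names in two sets means a new problem type only has to be added in one place.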
  • Add a parameter of k-fold validation inside training

    1. Add k-fold validation for the chosen datasets.
    2. Add appropriate print statements and comments inside the code.
    3. Add all utilities in tab_automl.utils.training
    4. If possible, also update the parser with an argument named -kf --k-fold which takes the number of folds. (Optional)

    Follow contributing guidelines on README.md

    help wanted hard JWOC 
    opened by sagnik1511 0
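    For orientation, a k-fold split can be sketched in plain Python as below. The helper name `k_fold_indices` and its `seed` parameter are assumptions for this example and are not part of the package.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_indices, validation_indices) pairs for k-fold validation."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Shuffle once so that each fold is a random subset of the rows.
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation
        start = stop


folds = list(k_fold_indices(n_samples=10, n_folds=3))
for train, validation in folds:
    print(len(train), len(validation))
```

    Every row lands in exactly one validation fold, so averaging the per-fold scores uses each sample for validation once.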
  • Add a new class "OutlierProcessing" under processing

    1. Prepare a new class under the processing module.
    2. Prepare the functions with a proper idea and also add appropriate comments.
    3. Add a function "run" inside the "OutlierProcessing" which will go through every feature, e.g. link.
    4. Add the function under the class Preprocessing.

    Follow contributing guidelines on README.md

    enhancement JWOC medium 
    opened by sagnik1511 4
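    One possible shape for such a class is sketched below, using the common 1.5 × IQR rule to clip extreme values. The issue does not specify the method, so the IQR rule, the choice to clip rather than drop rows, and the dict-of-lists input format are all assumptions made for illustration.

```python
import statistics


class OutlierProcessing:
    """Hypothetical sketch: clip each numeric feature to its IQR fences."""

    def __init__(self, factor=1.5):
        self.factor = factor

    def _bounds(self, values):
        # Quartiles from the standard library; factor * IQR sets the fences.
        q1, _, q3 = statistics.quantiles(values, n=4)
        spread = q3 - q1
        return q1 - self.factor * spread, q3 + self.factor * spread

    def run(self, table):
        """Go through every feature (column) and clip values outside the fences."""
        cleaned = {}
        for name, values in table.items():
            low, high = self._bounds(values)
            cleaned[name] = [min(max(value, low), high) for value in values]
        return cleaned


data = {"age": [21, 22, 23, 24, 25, 26, 27, 28, 300]}
result = OutlierProcessing().run(data)
print(result["age"])
```

    Here the value 300 is pulled back to the upper fence while the other values are left unchanged.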
  • Add new loss functions on training

    1. Add 3 loss functions for both the regression and classification problem types.
    2. Add them in the same way the model scores are stored. See here
    3. Add proper comments.
    4. If new helper functions are needed for the loss functions, store them in tab_automl.utils.training .
    5. Update the requirements if new libraries are being used.

    Follow contributing guidelines on README.md

    enhancement hard JWOC 
    opened by sagnik1511 8
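    As a reference point, three widely used loss functions can be written in plain Python as follows. These are hand-written illustrations (two for regression, one for binary classification); which losses the project adopts and how it stores them is not stated in the issue.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


print(mean_squared_error([1, 2, 3], [1, 2, 5]))
print(mean_absolute_error([1, 2, 3], [1, 2, 5]))
print(binary_log_loss([1, 0], [0.5, 0.5]))
```

    Lower is better for all three, which is the opposite direction from accuracy-style scores, so they need to be compared accordingly when ranking models.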
Owner
Sagnik Roy
Data Science Intern @ Argoid • Video Games & Machine Vision attracts me!