Mixing the Invariant Information Clustering (IIC) architecture with self-supervised concepts from the SimCLR and MoCo approaches

Bendidi Ihab

Last update: Feb 13, 2022


Overview

Self Supervised clusterer

This project combines the IIC and MoCo architectures with some ideas from SimCLR. The goal is state-of-the-art unsupervised clustering that, through contrastive learning, still retains informative latent image representations in the feature space.
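To make the clustering side of this more concrete, here is a minimal pure-Python sketch of the IIC objective as it is usually described: the mutual information between the cluster assignments of an image and of its augmented view. It is not taken from this repository; the function name iic_loss and the list-based inputs are assumptions chosen for illustration, and a real training loop would compute this on PyTorch tensors.

```python
import math

def iic_loss(probs_a, probs_b, eps=1e-12):
    """Negative mutual information between paired soft cluster assignments.

    probs_a[n] and probs_b[n] are the cluster probabilities predicted for
    sample n and for its augmented view (hypothetical inputs for this sketch).
    """
    n = len(probs_a)
    k = len(probs_a[0])

    # Joint distribution over (cluster of view A, cluster of view B).
    joint = [[0.0] * k for _ in range(k)]
    for pa, pb in zip(probs_a, probs_b):
        for i in range(k):
            for j in range(k):
                joint[i][j] += pa[i] * pb[j] / n

    # Symmetrize, since the order of the two views is arbitrary.
    joint = [[(joint[i][j] + joint[j][i]) / 2 for j in range(k)] for i in range(k)]

    row = [sum(joint[i]) for i in range(k)]
    col = [sum(joint[i][j] for i in range(k)) for j in range(k)]

    mutual_info = 0.0
    for i in range(k):
        for j in range(k):
            p = joint[i][j]
            if p > eps:  # skip empty cells to avoid log(0)
                mutual_info += p * math.log(p / (row[i] * col[j]))

    # Training maximizes mutual information, so the loss is its negative.
    return -mutual_info

confident = [[1.0, 0.0], [0.0, 1.0]]
uninformative = [[0.5, 0.5], [0.5, 0.5]]
loss_confident = iic_loss(confident, confident)
loss_uninformative = iic_loss(uninformative, uninformative)
```

In this toy case, consistent and balanced assignments reach the lowest possible loss for two clusters (minus log 2), while uniform predictions give a loss of zero.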

Installation

Tested successfully on Ubuntu 18.04 and Ubuntu 20.04, with Python 3.6 and 3.8.

Works with PyTorch versions >= 1.4. Run the following command to install the required packages:

pip3 install -r requirements.txt

Logs

All information is logged to TensorBoard. If you pass the --neptune flag, logs are also sent to Neptune.ai.

Tensorboard

To check the logs of your training runs with TensorBoard, use the command:

tensorboard --logdir=./logs/NAME_OF_TEST/events

NAME_OF_TEST is generated automatically for each training run you launch. It is composed of the model name you provided (see the Commands section below) and the exact date and time at which the run was launched, for example test_on_nocadozole_20210518-153531.
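The naming scheme can be sketched as follows. This is an illustration inferred from the example name above, not the repository's code; make_run_name and tensorboard_logdir are hypothetical helpers, and the timestamp format %Y%m%d-%H%M%S is an assumption that matches the example.

```python
from datetime import datetime

def make_run_name(model_name, launched_at=None):
    """Build a run name of the form <model_name>_<YYYYMMDD-HHMMSS>."""
    launched_at = launched_at or datetime.now()
    return f"{model_name}_{launched_at.strftime('%Y%m%d-%H%M%S')}"

def tensorboard_logdir(run_name, root="./logs"):
    """Return the directory to pass to `tensorboard --logdir=`."""
    return f"{root}/{run_name}/events"

run_name = make_run_name("test_on_nocadozole", datetime(2021, 5, 18, 15, 35, 31))
logdir = tensorboard_logdir(run_name)
print(run_name)  # test_on_nocadozole_20210518-153531
print(logdir)    # ./logs/test_on_nocadozole_20210518-153531/events
```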

Neptune

Before using Neptune as a logging and output tracking tool, create a Neptune account and get your developer token. Then create a neptune_token.txt file and store the token in it.
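For illustration, the token file could be read along these lines. This is a sketch under assumptions: read_neptune_token is a hypothetical helper, and the actual loading code in main.py may differ. The demo writes a fake token to a temporary directory so it does not touch any real credentials.

```python
import tempfile
from pathlib import Path

def read_neptune_token(path="neptune_token.txt"):
    """Return the token stored in the given file, without surrounding whitespace."""
    token_file = Path(path)
    if not token_file.is_file():
        raise FileNotFoundError(f"Token file not found: {token_file}")
    token = token_file.read_text(encoding="utf-8").strip()
    if not token:
        raise ValueError(f"Token file is empty: {token_file}")
    return token

# Demo with a placeholder token written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "neptune_token.txt"
    demo_file.write_text("FAKE_TOKEN_FOR_DEMO\n", encoding="utf-8")
    token = read_neptune_token(demo_file)
```

Stripping whitespace avoids authentication failures caused by a trailing newline in the file.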

In Neptune, create a project for your outputs with a name of your choice, then open main.py and modify the code starting at line 129:

if args.offline:
    CONNECTION_MODE = "offline"
    run = neptune.init(
        project='USERNAME/PROJECT_NAME',  # Replace with your Neptune username and project name
        api_token=token,
        mode=CONNECTION_MODE,
    )
else:
    run = neptune.init(
        project='USERNAME/PROJECT_NAME',  # Replace with your Neptune username and project name
        api_token=token,
    )

Preparing your own data

All datasets go in the ./data folder. Because you may need several different datasets, create one subfolder per dataset and give each a Linux-friendly name.
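As a rough sketch of that layout, the snippet below creates one subfolder per dataset. Both helpers are hypothetical, and "Linux-friendly" is interpreted here as an assumption: only letters, digits, dots, dashes, and underscores, with everything else replaced by an underscore. The demo uses a temporary directory in place of ./data.

```python
import re
import tempfile
from pathlib import Path

def linux_friendly(name):
    """Replace runs of characters outside [A-Za-z0-9._-] with an underscore."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", name.strip()).strip("_")
    if not cleaned:
        raise ValueError("Dataset name has no usable characters")
    return cleaned

def make_dataset_dir(name, root="./data"):
    """Create <root>/<linux-friendly name> and return its path."""
    path = Path(root) / linux_friendly(name)
    path.mkdir(parents=True, exist_ok=True)
    return path

# Demo in a temporary directory standing in for ./data.
with tempfile.TemporaryDirectory() as tmp:
    created = make_dataset_dir("My Dataset (v2)", root=tmp)
    created_name = created.name
    created_exists = created.is_dir()
```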

To be completed

Commands

Adding the --labels flag means you have ground-truth class labels and want to use them in evaluation.
Adding the --neptune flag means you want to log your data to Neptune (see the Logs section).
output_k is the number of clusters.
model_name is the name used to keep track of this specific model. The training launch date is appended to it.
augmentation is the set of augmentations used for the contrastive loss. The available sets can be viewed and modified in datasets/datasetgetter.py.
epochs is the maximum number of epochs. The default is 1000.
batch_size is the training batch size. The default is 32.
val_batch is the validation batch size. The default is 10.
sty_dim is the size of the style vector. The default is 128.
img_size is the size of the input images.
--debug is a flag that activates debug mode, in which training is very fast, just to check that everything works.
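The options above could be declared with argparse roughly as follows. This is a hypothetical reconstruction based only on the list above, not a copy of main.py: the exact flag spellings for epochs, batch_size, val_batch, and sty_dim, the argument types, and the absence of defaults for the other options are assumptions.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(description="Sketch of the documented options")
    parser.add_argument("--output_k", type=int, help="number of clusters")
    parser.add_argument("--model_name", type=str, help="name used to track this model")
    parser.add_argument("--augmentation", type=str, help="augmentation set for the contrastive loss")
    parser.add_argument("--epochs", type=int, default=1000, help="maximum number of epochs")
    parser.add_argument("--batch_size", type=int, default=32, help="training batch size")
    parser.add_argument("--val_batch", type=int, default=10, help="validation batch size")
    parser.add_argument("--sty_dim", type=int, default=128, help="size of the style vector")
    parser.add_argument("--img_size", type=int, help="size of the input images")
    parser.add_argument("--labels", action="store_true", help="use ground-truth labels in evaluation")
    parser.add_argument("--neptune", action="store_true", help="also log to Neptune")
    parser.add_argument("--debug", action="store_true", help="very fast run to check the pipeline")
    return parser

# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(["--output_k", "9", "--model_name", "demo", "--labels"])
print(args.epochs, args.batch_size, args.val_batch, args.sty_dim)  # 1000 32 10 128
```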

Training from scratch

python main.py --gpu 2  --output_k 9  --model_name=validating_best_image_transfer --augmentation BBC --data_type BBBC021_196  --data_folder N1 --neptune --img_size 196

Training using a pretrained model

python main.py --gpu 2  --output_k 9  --model_name=validating_best_image_transfer --augmentation improved_v2 --data_type BBBC021_196  --data_folder ND8D --labels --neptune --load_model testing_high_cluster_number_20210604-024131_

Validation using a pretrained model

python main.py --gpu 2  --output_k 9  --model_name=validating_best_image_transfer --augmentation improved_v2 --data_type BBBC021_196  --data_folder ND8D --labels --validation --neptune --load_model testing_high_cluster_number_20210604-024131_


