Hyperbolic Hierarchical Clustering.

HazyResearch

Last update: Dec 15, 2022

Overview

Hyperbolic Hierarchical Clustering (HypHC)

This code is the official PyTorch implementation of the NeurIPS 2020 paper:

From Trees to Continuous Embeddings and Back: Hyperbolic Hierarchical Clustering
Ines Chami, Albert Gu, Vaggos Chatziafratis and Christopher Ré
Stanford University
Paper: https://arxiv.org/abs/2010.00402

Abstract. Similarity-based Hierarchical Clustering (HC) is a classical unsupervised machine learning algorithm that has traditionally been solved with heuristic algorithms like Average-Linkage. Recently, Dasgupta reframed HC as a discrete optimization problem by introducing a global cost function measuring the quality of a given tree. In this work, we provide the first continuous relaxation of Dasgupta's discrete optimization problem with provable quality guarantees. The key idea of our method, HypHC, is showing a direct correspondence from discrete trees to continuous representations (via the hyperbolic embeddings of their leaf nodes) and back (via a decoding algorithm that maps leaf embeddings to a dendrogram), allowing us to search the space of discrete binary trees with continuous optimization. Building on analogies between trees and hyperbolic space, we derive a continuous analogue for the notion of lowest common ancestor, which leads to a continuous relaxation of Dasgupta's discrete objective. We can show that after decoding, the global minimizer of our continuous relaxation yields a discrete tree with a (1+epsilon)-factor approximation for Dasgupta's optimal tree, where epsilon can be made arbitrarily small and controls optimization challenges. We experimentally evaluate HypHC on a variety of HC benchmarks and find that even approximate solutions found with gradient descent have superior clustering quality than agglomerative heuristics or other gradient based algorithms. Finally, we highlight the flexibility of HypHC using end-to-end training in a downstream classification task.
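For concreteness, Dasgupta's cost of a binary tree sums, over every pair of leaves, their similarity multiplied by the number of leaves under their lowest common ancestor. The following is a minimal sketch of that discrete cost, not the repository's implementation; the nested-tuple tree encoding and the name `dasgupta_cost` are assumptions made for illustration.

```python
def dasgupta_cost(tree, sim):
    """Return Dasgupta's cost of `tree` under the similarity matrix `sim`.

    `tree` is either a leaf index (int) or a pair (left, right) of subtrees.
    `sim[i][j]` is the similarity between leaves i and j.
    """

    def visit(node):
        # Returns (leaves under node, cost accumulated inside node).
        if isinstance(node, int):
            return [node], 0.0
        left_leaves, left_cost = visit(node[0])
        right_leaves, right_cost = visit(node[1])
        size = len(left_leaves) + len(right_leaves)
        # Pairs separated at this node have it as their lowest common ancestor.
        split = sum(sim[i][j] for i in left_leaves for j in right_leaves)
        return left_leaves + right_leaves, left_cost + right_cost + size * split

    return visit(tree)[1]


# Toy similarities: leaves 0 and 1 are close, leaf 2 is far from both.
sim = [
    [0.0, 1.0, 0.1],
    [1.0, 0.0, 0.1],
    [0.1, 0.1, 0.0],
]
good = dasgupta_cost(((0, 1), 2), sim)  # merges the similar pair first
bad = dasgupta_cost(((0, 2), 1), sim)   # merges a dissimilar pair first
```

A tree that merges similar leaves low in the hierarchy gets a lower cost, which is why `good` is smaller than `bad` here.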

Installation

This code has been tested with Python 3.7. First, create a virtual environment (or conda environment) and install the dependencies:

python3 -m venv hyphc_env

source hyphc_env/bin/activate

pip install -r requirements.txt

Then install the mst and unionfind packages, which are used to decode embeddings into trees and to compute the discrete Dasgupta cost efficiently. Run each of the following commands from the repository root:

cd mst; python setup.py build_ext --inplace

cd unionfind; python setup.py build_ext --inplace

Datasets

source download_data.sh

This will download the zoo, iris and glass datasets from the UCI machine learning repository. Please refer to the paper for the download links of the other datasets used in the paper.

Code Usage

Train script

To use the code, first set environment variables in each shell session:

source set_env.sh

To train the HypHC model, use the train script:

python train.py
    optional arguments:
      -h, --help            show this help message and exit
      --seed SEED
      --epochs EPOCHS
      --batch_size BATCH_SIZE
      --learning_rate LEARNING_RATE
      --eval_every EVAL_EVERY
      --patience PATIENCE
      --optimizer OPTIMIZER
      --save SAVE
      --fast_decoding FAST_DECODING
      --num_samples NUM_SAMPLES
      --dtype DTYPE
      --rank RANK
      --temperature TEMPERATURE
      --init_size INIT_SIZE
      --anneal_every ANNEAL_EVERY
      --anneal_factor ANNEAL_FACTOR
      --max_scale MAX_SCALE
      --dataset DATASET

Examples

We provide examples of training commands for the zoo, iris and glass datasets. For instance, to train HypHC on zoo, run:

source examples/run_zoo.sh

This will create an embedding directory and save training logs, embeddings and the configuration parameters in embedding/zoo/[unique_id], where the unique id is based on the configuration parameters used to train the model.
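To see which runs have been saved and with which parameters, one could walk the output directory and load each run's configuration. This is a minimal sketch under assumptions: the helper name `list_runs` is hypothetical, and the `config.json` file name is taken from a user's directory listing rather than from the README itself, so adjust both the root path and the file name to your checkout.

```python
import json
from pathlib import Path


def list_runs(root):
    """Map each run directory under `root` to its parsed config.json."""
    runs = {}
    # Assumed layout: root/<dataset>/<unique_id>/config.json
    for config_path in sorted(Path(root).glob("*/*/config.json")):
        with config_path.open() as f:
            runs[config_path.parent] = json.load(f)
    return runs


if __name__ == "__main__":
    # Prints nothing if the directory does not exist or holds no runs.
    for run_dir, config in list_runs("embedding").items():
        print(run_dir, sorted(config))
```

The sketch only lists the configuration keys; it makes no assumption about which keys are present.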

Citation

If you find this code useful, please cite the following paper:

@inproceedings{NEURIPS2020_ac10ec1a,
 author = {Chami, Ines and Gu, Albert and Chatziafratis, Vaggos and R\'{e}, Christopher},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin},
 pages = {15065--15076},
 publisher = {Curran Associates, Inc.},
 title = {From Trees to Continuous Embeddings and Back: Hyperbolic Hierarchical Clustering},
 url = {https://proceedings.neurips.cc/paper/2020/file/ac10ec1ace51b2d973cd87973a98d3ab-Paper.pdf},
 volume = {33},
 year = {2020}
}

Comments

How can I visualize the hyperbolic embedding

Hi, all,

After I run the scripts examples/run_zoo.sh and examples/run_glass.sh, the trained models and log files are saved in ./embeddings as follows:

root@Lab-PC:/data/code14/HypHC/embeddings# tree .
.
├── glass
│   └── bc24ee553d3178d956bbb7a68d6058f5a48edb408c82a49e613a344d71602abf
│       ├── config.json
│       ├── model_0.pkl
│       └── train_0.log
└── zoo
    └── ca286289368cbe66abdfe32406cffed3ff5e6102252c0fc6502519713db04b83
        ├── config.json
        ├── model_0.pkl
        └── train_0.log

How can I visualize the hyperbolic embedding similar to demo file HypHC.gif?

Thanks~

opened by amiltonwong 2

Changing the curvature

Hi,

Thanks for the great work and well documented code.

Currently, it seems like the hyperbolic embedding lives in a Poincare disk/ball with curvature -1. In my own application, I would like to work with arbitrary curvature -c, c>0.

So my question is: can I do so by just modifying utils/poincare.py, or do I also need to make changes in optim and elsewhere?

Thanks in advance for your valuable time.

Best, Eli

opened by elichienxD 0
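
For reference, the distance on a Poincaré ball of curvature -c follows from the curvature -1 formula by rescaling points by sqrt(c). The sketch below illustrates that formula only. It is not the repository's utils/poincare.py, the function name is hypothetical, and it leaves open whether the optimizer or other modules would also need changes.

```python
import math


def poincare_distance(x, y, c=1.0):
    """Distance between x and y in the Poincare ball of curvature -c (c > 0).

    Both points must lie strictly inside the ball of radius 1 / sqrt(c).
    """
    sq_norm_x = sum(a * a for a in x)
    sq_norm_y = sum(b * b for b in y)
    if c * sq_norm_x >= 1.0 or c * sq_norm_y >= 1.0:
        raise ValueError("points must lie strictly inside the ball")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    denom = (1.0 - c * sq_norm_x) * (1.0 - c * sq_norm_y)
    return math.acosh(1.0 + 2.0 * c * sq_dist / denom) / math.sqrt(c)


# With c = 1 the distance from the origin to a point at radius r is 2 * atanh(r).
d_unit = poincare_distance([0.0, 0.0], [0.5, 0.0])
# With c = 4 the ball has radius 0.5, and distances shrink by a factor of 2.
d_curved = poincare_distance([0.0, 0.0], [0.25, 0.0], c=4.0)
```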

Support for Large Datasets

Our paper was recently accepted at WI-IAT and will be published soon. Here is the arXiv version: https://arxiv.org/abs/2110.15923

We leverage HypHC in our work to reduce dimensionality. Our dataset has 59260 data points, each of dimension 600. The current version of the code runs out of memory in the pre-training stage itself. By moving some lines around and rewriting sections of the code, we were able to keep the same functionality and train the model on our dataset.

I have included two additional arguments in the config file:

"large_dataset": should be set to 1 if the dataset is large, otherwise 0. For large datasets, some matrix multiplications are replaced with loops, which makes the code a bit slower (hence the argument).

"data_points": the number of data points in the dataset, would be the number of lines in the .data file in the data directory

opened by Tanvi141 0

Any plan for a sklearn-like API?

Thank you for sharing your fantastic work! I am wondering whether you have any plan to develop a sklearn-like interface.

For example:

alg = HypHC(*args)
alg.fit(X)
label = alg.predict(X)

Thank you!

opened by AtlantixJJ 0

decoding trees?

The paper says that a top-down greedy approach is used to decode the binary tree from the embedding. However, from the code it seems that a bottom-up approach based on the angle similarity matrix is still used. Is that correct?

Say I want to obtain a hierarchical clustering of the nodes given the final Poincare embeddings. Is single-linkage agglomerative clustering on the angle similarity matrix between normalized embedding points equivalent to the tree decoding algorithm implemented in this code?

opened by xinyue96 1
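
The procedure described in this question, single-linkage agglomerative clustering on cosine similarities between normalized embeddings, can be sketched as follows. It illustrates the question's proposal only; whether it matches the repository's decoding is exactly what the question asks and is not settled here. The function name and the nested-tuple output format are assumptions.

```python
import math
from itertools import combinations


def single_linkage_by_angle(points):
    """Build a binary tree over `points` by single linkage on cosine similarity.

    `points` must be non-empty and contain no zero vectors.
    Returns nested tuples whose leaves are indices into `points`.
    """
    unit = []
    for p in points:
        norm = math.sqrt(sum(a * a for a in p))
        unit.append([a / norm for a in p])

    def cosine(i, j):
        return sum(a * b for a, b in zip(unit[i], unit[j]))

    # All pairs, most similar (smallest angle) first.
    pairs = sorted(
        combinations(range(len(points)), 2),
        key=lambda ij: -cosine(ij[0], ij[1]),
    )

    parent = list(range(len(points)))             # union-find forest
    subtree = {i: i for i in range(len(points))}  # component root -> tree so far

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            # Merge the two components and record the new internal node.
            parent[rj] = ri
            subtree[ri] = (subtree.pop(ri), subtree.pop(rj))
    return subtree[find(0)]


# Two points near angle 0 and two near angle pi.
tree = single_linkage_by_angle([(1.0, 0.0), (0.9, 0.1), (-1.0, 0.0), (-0.9, -0.1)])
```

Processing pairs in order of decreasing similarity and merging components with union-find is the standard spanning-tree formulation of single linkage.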

Other examples of parameter configuration

Hi,

Could you please share examples of your parameter configurations on the mid- and large-scale datasets? For example, one on Segmentation and one on CIFAR-100.

Thank you!

opened by MorganeAyle 0

Owner

HazyResearch

We are a CS research group led by Prof. Chris Ré.

