Implementation of SiameseXML (ICML 2021)

Overview

Code for SiameseXML: Siamese networks meet extreme classifiers with 100M labels [1].


Best practices for feature creation


  • Adding sub-words on top of unigrams to the vocabulary can help in training more accurate embeddings and classifiers; see the sketch below.
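For illustration, a minimal Python sketch of one way to combine unigrams with FastText-style character sub-words (the helper and its n-gram range are assumptions for illustration, not part of the SiameseXML code):

# Illustrative helper: unigram tokens plus character n-gram sub-words
# (FastText-style '<' and '>' boundary markers; the n-gram range is an assumption)
def tokens_with_subwords(text, min_n=3, max_n=5):
    tokens = text.lower().split()              # unigrams
    subwords = []
    for tok in tokens:
        padded = f"<{tok}>"
        for n in range(min_n, max_n + 1):
            subwords += [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return tokens + subwords

print(tokens_with_subwords("extreme classification"))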

Setting up


Expected directory structure

+-- <work_dir>
|  +-- programs
|  |  +-- siamesexml
|  |  |  +-- siamesexml
|  +-- data
|  |  +-- <dataset>
|  +-- models
|  +-- results
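The structure can be created up front, e.g. with a short Python snippet (paths mirror the tree above; the placeholder values are yours to replace):

import os

work_dir = "/path/to/work_dir"    # your <work_dir>
dataset = "LF-AmazonTitles-131K"  # your <dataset>
for sub in ["programs/siamesexml/siamesexml",
            f"data/{dataset}", "models", "results"]:
    os.makedirs(os.path.join(work_dir, sub), exist_ok=True)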

Download data for SiameseXML

* Download the zipped BoW features file from the XML repository.
* Extract the zipped file into the data directory.
* The following files should be available in <work_dir>/data/<dataset> for datasets already in the new format (the conversion step below can be skipped; a loading sketch appears at the end of this section)
    - trn_X_Xf.txt
    - trn_X_Y.txt
    - tst_X_Xf.txt
    - lbl_X_Xf.txt
    - tst_X_Y.txt
    - fasttextB_embeddings_300d.npy or fasttextB_embeddings_512d.npy
* The following files should be available in <work_dir>/data/<dataset> if the dataset is in the old format (please refer to the next step to convert the data to the new format)
    - train.txt
    - test.txt
    - fasttextB_embeddings_300d.npy or fasttextB_embeddings_512d.npy 
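Once the files are in place, they can be sanity-checked with pyxclib [3] (a sketch; read_sparse_file is pyxclib's reader for this text format, and the paths must be adjusted to your setup):

import numpy as np
from xclib.data import data_utils

data_dir = "/path/to/work_dir/data/LF-AmazonTitles-131K"  # adjust to your setup
features = data_utils.read_sparse_file(f"{data_dir}/trn_X_Xf.txt")  # documents x features
labels = data_utils.read_sparse_file(f"{data_dir}/trn_X_Y.txt")     # documents x labels
embeddings = np.load(f"{data_dir}/fasttextB_embeddings_300d.npy")   # vocabulary x 300
print(features.shape, labels.shape, embeddings.shape)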

Convert to new data format

# A perl script is provided (in siamesexml/tools) to convert the data into the new format
# Either set the $data_dir variable to the data directory of a particular dataset or replace it with the path
perl convert_format.pl $data_dir/train.txt $data_dir/trn_X_Xf.txt $data_dir/trn_X_Y.txt
perl convert_format.pl $data_dir/test.txt $data_dir/tst_X_Xf.txt $data_dir/tst_X_Y.txt
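If perl is unavailable, the same split can be done in Python. A sketch, assuming the standard XML-repository formats (old files start with a "points features labels" header; each line is "l1,l2 f1:v1 f2:v2", with an empty label field when a point has no labels):

# Hypothetical Python equivalent of convert_format.pl (a sketch, not the
# provided tool); writes sparse feature and label files with "rows cols" headers.
def convert(old_file, feat_file, label_file):
    with open(old_file) as fin, \
         open(feat_file, "w") as ff, open(label_file, "w") as lf:
        n_pts, n_feat, n_lbl = fin.readline().split()
        ff.write(f"{n_pts} {n_feat}\n")
        lf.write(f"{n_pts} {n_lbl}\n")
        for line in fin:
            labels, _, feats = line.rstrip("\n").partition(" ")
            ff.write(feats + "\n")
            lf.write(" ".join(f"{l}:1" for l in labels.split(",") if l) + "\n")

convert("train.txt", "trn_X_Xf.txt", "trn_X_Y.txt")
convert("test.txt", "tst_X_Xf.txt", "tst_X_Y.txt")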

Example use cases


A single learner

The given code can be used as follows. A json file is used to specify the architecture and other arguments; please refer to the full documentation below for more details. A sketch of editing the config programmatically follows the command.

./run_main.sh 0 SiameseXML LF-AmazonTitles-131K 0 108
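Since the config is plain JSON, it can be inspected or tweaked from Python. A sketch (the path below and every key except 'embedding_dims' are hypothetical):

import json

# hypothetical config path following the siamesexml/configs/<framework>/<method> layout
cfg_path = "siamesexml/configs/DeepXML/SiameseXML/LF-AmazonTitles-131K.json"
with open(cfg_path) as f:
    config = json.load(f)
config["embedding_dims"] = 512  # switch to the 512d FastText embeddings
with open(cfg_path, "w") as f:
    json.dump(config, f, indent=2)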

Full Documentation

./run_main.sh <gpu_id> <type> <dataset> <version> <seed>

* gpu_id: Run the program on this GPU.

* type
  SiameseXML uses the DeepXML [2] framework for training. The classifier is trained in Module IV (M-IV).
  - SiameseXML: The intermediate representation is not fine-tuned while training the classifier (more scalable; suitable for large datasets).
  - SiameseXML++: The intermediate representation is fine-tuned while training the classifier (leads to better accuracy on some datasets).

* dataset
  - Name of the dataset.
  - SiameseXML expects the following files in <work_dir>/data/<dataset>
    - trn_X_Xf.txt
    - trn_X_Y.txt
    - tst_X_Xf.txt
    - lbl_X_Xf.txt
    - tst_X_Y.txt
    - fasttextB_embeddings_300d.npy or fasttextB_embeddings_512d.npy
  - You can set 'embedding_dims' in the config file to switch between the 300d and 512d embeddings.

* version
  - Different runs can be managed via version and seed.
  - Models and results are stored under this argument.

* seed
  - Seed value used by numpy and PyTorch; a seeding sketch follows.
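For reproducibility, the seed is presumably applied along these lines (a sketch, not the repository's exact code):

import numpy as np
import torch

seed = 108  # the value passed to run_main.sh
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)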

Notes

* Other file formats such as npy, npz, and pickle are also supported (see the conversion sketch after these notes).
* Initializing with token embeddings (computed from FastText) leads to noticeable accuracy gains. Please ensure the token embedding file is available in the data directory when 'init=token_embeddings'; otherwise an error will be thrown.
* Config files for datasets in the XC repository are available in siamesexml/configs/<framework>/<method>. You can use them as starting points when trying the given code on new datasets.
* We conducted our experiments on a 24-core Intel Xeon 2.6 GHz machine with 440GB RAM and a single Nvidia P40 GPU. 128GB memory should suffice for most datasets.
* The code makes use of the CPU (mainly for hnswlib) as well as the GPU.
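For example, a text feature file can be converted to npz once and reloaded quickly afterwards (a sketch using pyxclib [3] and SciPy; file names follow the convention above):

import scipy.sparse as sp
from xclib.data import data_utils

features = data_utils.read_sparse_file("trn_X_Xf.txt")  # parse the text format once
sp.save_npz("trn_X_Xf.npz", features)  # later: features = sp.load_npz("trn_X_Xf.npz")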

Cite as

@InProceedings{Dahiya21b,
    author = "Dahiya, K. and Agarwal, A. and Saini, D. and Gururaj, K. and Jiao, J. and Singh, A. and Agarwal, S. and Kar, P. and Varma, M",
    title = "SiameseXML: Siamese Networks meet Extreme Classifiers with 100M Labels",
    booktitle = "Proceedings of the International Conference on Machine Learning",
    month = "July",
    year = "2021"
}


References


[1] K. Dahiya, A. Agarwal, D. Saini, K. Gururaj, J. Jiao, A. Singh, S. Agarwal, P. Kar and M. Varma. SiameseXML: Siamese networks meet extreme classifiers with 100M labels. In ICML, July 2021.

[2] K. Dahiya, D. Saini, A. Mittal, A. Shaw, K. Dave, A. Soni, H. Jain, S. Agarwal, and M. Varma. DeepXML: A deep extreme multi-label learning framework applied to short text documents. In WSDM, 2021.

[3] pyxclib: https://github.com/kunaldahiya/pyxclib

Comments
  • Possible bug?

    Hi,

    I think there is a bug near the line in the link below. Specifically, should there not be a filter_map=None right below this line? It seems that this filter_map is always used in _fit, but it is only set if validate is true. Maybe I am missing something.

    https://github.com/Extreme-classification/siamesexml/blob/af053cf57cf3d5bbe3be6de4f7bd2e04ba589ecf/siamesexml/libs/model.py#L545

    opened by drei34 4
  • When I run 'features = data_utils.read_gen_sparse(feat_fname)', I get an error

    D:\soft\anaconda\envs\nlp_pytorch\python.exe E:/code/MLTC/siamesexml-master/siamesexml/test_code.py
    C:\Users\xjliu16\AppData\Roaming\Python\Python37\site-packages\xclib-0.97-py3.7-win-amd64.egg\xclib\data\data_utils.py:263: UserWarning: Header mis-match from inferred shape!
      warnings.warn("Header mis-match from inferred shape!")
    Traceback (most recent call last):
      File "E:/code/MLTC/siamesexml-master/siamesexml/test_code.py", line 32, in <module>
        features = data_utils.read_gen_sparse(feat_fname)
      File "C:\Users\xjliu16\AppData\Roaming\Python\Python37\site-packages\xclib-0.97-py3.7-win-amd64.egg\xclib\data\data_utils.py", line 154, in read_gen_sparse
        header, force_header, safe_read)
      File "C:\Users\xjliu16\AppData\Roaming\Python\Python37\site-packages\xclib-0.97-py3.7-win-amd64.egg\xclib\data\data_utils.py", line 269, in read_sparse_file
        assert shape[1] <= _header_shape[1], "num_cols_inferred > num_cols_header"
    AssertionError: num_cols_inferred > num_cols_header
    shape: (147402, 171794396884325) _header_shape: (294805, 40000)

    Process finished with exit code 1

    I don't quite understand the structure of the 'LF-AmazonTitles-131K' dataset, and I don't know what to do about this error. Can you help me check this? Thanks.

    opened by LLLLLLoki 2
  • Questions related to tokenization and R^V - are you mixing Bert Tokens and FastText embeddings?

    Hello,

    As I understand it, the model needs to tokenize text to get the dimension V, and then for each token we need its fastText embedding. Some questions:

    1. For the sample dataset LF-AmazonTitles-131K, is the BERT tokenizer used? It seems so, as tokens look like '##abcd', which is the same format as BERT. If this is the case, is it wrong to mix fastText embeddings with these? For example, tokens in fastText do not look like '##ab'; they look like 'ab' or '<ab', so to get the embedding for '##ab' you'd need to (at a high level) average (2*embedding('#') + embedding('ab')) / 3 or something like this. But '##ab' is not a word, it's an artificial token, so the meaning of '##ab' is lost if we do this?

    2. For inference, how do you make sure that any new query can be expressed in R^V? Do you basically need to make sure that all uni-grams 'a', '$', etc. are in the token space V?

    Thank you!

    opened by drei34 2
  • OVA classifiers using only refinement vectors not label projection?

    Hi Kunal,

    Many thanks for sharing the code of SiameseXML. I hope you can help me with one doubt. While debugging the OVA classifier training code, I can't see where the re-parametrization of the classifiers happens. As far as I can see, the following happens in the classifier forward (forward snapshot in the image below):

    1. Embed are the data point embeddings (after the projection with the residual block) that are L2-normalized in line 209.
    2. The refinement vectors are an embedding matrix, and 500 of them are extracted using the shortlist. This happens in line 211, and the refinement vectors are stored in short_weigths.
    3. The refinement vectors are L2-normalized in line 215.
    4. The L2-normalized refinement vectors are used as classifiers in the matrix multiplication with Embed in line 216.

    Therefore, the refinement vectors are directly used as classifier vectors (i.e. as in a standard linear layer). In the paper, the wl classifiers are defined as wl = norm(projection(z) + refinement vector). However, in the code what I see is wl = norm(refinement vector).

    Am I missing something?

    Thanks in advance!

    [image: snapshot of the classifier forward]

    opened by DiegoOrtego 2
  • Embeddings

    Hi,

    I'm a little bit confused by the embeddings part. In the article, the authors refer to TF-IDF features, while in the GitHub repo I realised that fastText embeddings are used as well, but I can't understand in which step of the algorithm. Could I have further explanation, please?

    Many thanks,

    opened by hdm30 1
  • Filter_labels_tst.txt file not found

    Hello,

    I tried the model on a custom dataset but ran into the following issue: OSError: /XX/XX/filter_labels_test.txt not found. My question is how I should proceed to prepare filter_labels_test.txt; I can't find any explanation in the GitHub repo.

    Many thanks in advance.

    opened by hdm30 1
  • Supplemental material missing, and two other questions (surrogate_mapping.txt and SiameseXML vs SiameseXML++)

    Hi,

    Is the supplemental material for the paper missing? https://www.cse.iitd.ac.in/~kunal/siamesexml_main.pdf Where is it?

    Also, two other questions. What is surrogate_mapping.txt? Is it related to the "filter_" files, which seem to leave out some labels? Finally, is SiameseXML = Module 1 and SiameseXML++ = Modules 1-4 in the paper? I wonder if this is the right mental map.

    Thank you!

    opened by drei34 1
  • filter_labels_text.txt and filter_labels_train.txt files ... What are they and how do you get them?

    Hi,

    I see that when you run SiameseXML with validate = true in the config, these files are used. How are they generated? Can you add this to the README, since there is no mention of these files but they seem important? Also, what is the purpose of these files? Thank you!

    opened by drei34 3