Self-Supervised Document-to-Document Similarity Ranking via Contextualized Language Models and Hierarchical Inference

Microsoft

Last update: Nov 28, 2022

Related tags

Deep Learning SDR

Overview

Self-Supervised Document Similarity Ranking (SDR) via Contextualized Language Models and Hierarchical Inference

This repo is the implementation for SDR.

Tested environment

Python 3.7
PyTorch 1.7
CUDA 11.0

Lower CUDA and PyTorch versions should work as well.

Installation
Datasets
Train with our datasets
Hierarchical Inference
Cite

License, Security, support and code of conduct specifications are under the Instructions directory.

Installation

Run

bash instructions/installation.sh

Datasets

The published datasets are:

Video games
- 21,935 articles
- Expert annotated test set. 90 articles with 12 ground-truth recommendations.
- Examples:
  - Grand Theft Auto - Mafia
  - Burnout Paradise - Forza Horizon 3
Wines
- 1635 articles
- Crafted by a human sommelier, 92 articles with ~10 ground-truth recommendations.
- Examples:
  - Pinot Meunier - Chardonnay
  - Dom Pérignon - Moët & Chandon

For more details and direct download see Wines and Video Games.

Training

The training process downloads the datasets automatically.

python sdr_main.py --dataset_name video_games

The code is based on PyTorch-Lightning, all PL hyperparameters are supported. (limit_train/val/test_batches, check_val_every_n_epoch etc.)

Tensorboard support

All metrics are being logged automatically and stored in

SDR/output/document_similarity/SDR/arch_SDR/dataset_name_<dataset>/<time_of_run>

Run tesnroboard --logdir=<path> to see the the logs.

Inference

The hierarchical inference described in the paper is implemented as a stand-alone service and can be used with any backbone algorithm (models/reco/hierarchical_reco.py).

python sdr_main.py --dataset_name <name> --resume_from_checkpoint <checkpoint> --test_only

Results

Citing & Authors

If you find this repository or the annotated datasets helpful, feel free to cite our publication -

SDR: Self-Supervised Document-to-Document Similarity Ranking viaContextualized Language Models and Hierarchical Inference

 @misc{ginzburg2021selfsupervised,
     title={Self-Supervised Document Similarity Ranking via Contextualized Language Models and Hierarchical Inference}, 
     author={Dvir Ginzburg and Itzik Malkiel and Oren Barkan and Avi Caciularu and Noam Koenigstein},
     year={2021},
     eprint={2106.01186},
     archivePrefix={arXiv},
     primaryClass={cs.CL}
}

Contact: Dvir Ginzburg, Itzik Malkiel.

Comments

This repo is missing important files

There are important files that Microsoft projects should all have that are not present in this repository. A pull request has been opened to add the missing file(s). When the pr is merged this issue will be closed automatically.

Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.

Merge this pull request

opened by microsoft-github-policy-service[bot] 2
Adding Microsoft SECURITY.MD

Please accept this contribution adding the standard Microsoft SECURITY.MD :lock: file to help the community understand the security policy and how to safely report security issues. GitHub uses the presence of this file to light-up security reminders and a link to the file. This pull request commits the latest official SECURITY.MD file from https://github.com/microsoft/repo-templates/blob/main/shared/SECURITY.md.

Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.

opened by microsoft-github-policy-service[bot] 0
Training with custom dataset?

@dvirginz What part of the code should I refer to if I were to train the model on my custom dataset? Also, is it necessary to perform the MLM training along with the contrastive loss? (would using the contrastive loss alone degrade performance by a lot?)

opened by puzzlecollector 0
how to use proposed dataset

Hi team, thank you for the great work, I want to use your proposed dataset (wines) for my study but I found no ground truth, all in that file wines.txt is only the title and sections. I want to know how ground truth is arranged in this file. Thank you team! hope you reply soon!

opened by duongnv0499 0
This repo is missing a LICENSE file

This repository is currently missing a LICENSE file.

A license helps users understand how to use your project in a compliant manner. You can find the standard MIT license Microsoft uses at: https://github.com/microsoft/repo-templates/blob/main/shared/LICENSE.

If you would like to learn more about open source licenses, please visit the document at https://aka.ms/license (Microsoft-internal guidance).

opened by microsoft-github-policy-service[bot] 0

Using padded tokens when creating averaged sentence embeddings

When calculating the similarity loss between two sentences, it looks like we are using the averaged word embeddings per sentence. Within models.SDR.similarity_modeling.SimilarityModeling we have the following:

...
non_masked_outputs = self.roberta(
    non_masked_input_ids,
    attention_mask=attention_mask,
    token_type_ids=token_type_ids,
    position_ids=position_ids,
    head_mask=head_mask,
    inputs_embeds=inputs_embeds,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
non_masked_seq_out = non_masked_outputs[0]

meaned_sentences = non_masked_seq_out.mean(1)
miner_output = list(self.miner_func(meaned_sentences, sample_labels))

sim_loss = self.similarity_loss_func(meaned_sentences, sample_labels, miner_output)
...

It appears using the embeddings for the padded tokens since we aren't taking into account any sentence lengths. Was this done by design perhaps?

opened by AndrewLim1990 0

RuntimeError: CUDA out of memory.

Train command

%cd /home/ec2-user/SageMaker/SDR
!python sdr_main.py --dataset_name wines

Stacktrace:

Traceback (most recent call last):
  File "sdr_main.py", line 80, in <module>
    main()
  File "sdr_main.py", line 28, in main
    main_train(model_class_pointer, hyperparams,parser)
  File "sdr_main.py", line 72, in main_train
    trainer.fit(model)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 510, in fit
    results = self.accelerator_backend.train()
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 57, in train
    return self.train_or_test()
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in train_or_test
    results = self.trainer.train()
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 561, in train
    self.train_loop.run_training_epoch()
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 550, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 692, in run_training_batch
    self.trainer.hiddens)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 806, in training_step_and_backward
    result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 319, in training_step
    training_step_output = self.trainer.accelerator_backend.training_step(args)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/accelerators/dp_accelerator.py", line 117, in training_step
    return self._step(args)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/accelerators/dp_accelerator.py", line 113, in _step
    output = self.trainer.model(*args)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/overrides/data_parallel.py", line 93, in forward
    return self.module.training_step(*inputs[0], **kwargs[0])
  File "/home/ec2-user/SageMaker/SDR/models/doc_similarity_pl_template.py", line 49, in training_step
    batch = self(batch)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/SageMaker/SDR/models/SDR/SDR.py", line 78, in forward
    eval(f"self.forward_{self.hparams.mode}")(batch)
  File "/home/ec2-user/SageMaker/SDR/models/SDR/SDR.py", line 48, in forward_train
    run_mlm=True,
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/SageMaker/SDR/models/SDR/similarity_modeling.py", line 129, in forward
    return_dict=return_dict,
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/modeling_bert.py", line 835, in forward
    return_dict=return_dict,
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/modeling_bert.py", line 490, in forward
    output_attentions,
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/modeling_bert.py", line 433, in forward
    self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/modeling_utils.py", line 1597, in apply_chunking_to_forward
    return forward_fn(*input_tensors)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/modeling_bert.py", line 439, in feed_forward_chunk
    intermediate_output = self.intermediate(attention_output)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/modeling_bert.py", line 367, in forward
    hidden_states = self.intermediate_act_fn(hidden_states)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/functional.py", line 1556, in gelu
    return torch._C._nn.gelu(input)
RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 14.76 GiB total capacity; 11.17 GiB already allocated; 14.75 MiB free; 11.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

opened by loretoparisi 6

TypeError: init() got an unexpected keyword argument 'filepath'

I executed the training command

%cd /home/ec2-user/SageMaker/SDR
!python sdr_main.py --dataset_name video_games

The error was

/home/ec2-user/SageMaker/SDR
[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
WARNING:PML:The pytorch-metric-learning testing module requires faiss. You can install the GPU version with the command 'conda install faiss-gpu -c pytorch'
                        or the CPU version with 'conda install faiss-cpu -c pytorch'. Learn more at https://github.com/facebookresearch/faiss/blob/master/INSTALL.md
Global seed set to 42
INFO:lightning:Global seed set to 42
/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/cryptography/hazmat/backends/openssl/x509.py:18: CryptographyDeprecationWarning: This version of cryptography contains a temporary pyOpenSSL fallback path. Upgrade pyOpenSSL now.
  utils.DeprecatedIn35,
Some weights of SimilarityModeling were not initialized from the model checkpoint at roberta-large and are newly initialized: ['lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Log directory:
/home/ec2-user/SageMaker/SDR/output/document_similarity/arch_SDR/dataset_name_video_games/test_only_False/22_11_2021-20_22_55

Traceback (most recent call last):
  File "sdr_main.py", line 80, in <module>
    main()
  File "sdr_main.py", line 28, in main
    main_train(model_class_pointer, hyperparams,parser)
  File "sdr_main.py", line 55, in main_train
    verbose=True,
TypeError: __init__() got an unexpected keyword argument 'filepath'

opened by loretoparisi 0

SBERT v performance?

Hi!

Looking at the results, it is written that an experiment on SBERT v has been conducted, and I am curious where I can see the performance of SBERT v.

Thanks.

opened by haven-jeon 0

Self-Supervised Document-to-Document Similarity Ranking via Contextualized Language Models and Hierarchical Inference

Related tags

Overview

Self-Supervised Document Similarity Ranking (SDR) via Contextualized Language Models and Hierarchical Inference

Tested environment

Contents

Installation

Datasets

Training

Tensorboard support

Inference

Results

Citing & Authors

Comments

Owner

Microsoft

An efficient and effective learning to rank algorithm by mining information across ranking candidates. This repository contains the tensorflow implementation of SERank model. The code is developed based on TF-Ranking.

Contextualized Perturbation for Textual Adversarial Attack, NAACL 2021

NAACL2021 - COIL Contextualized Lexical Retriever

TensorFlow Similarity is a python package focused on making similarity learning quick and easy.

Siamese-nn-semantic-text-similarity - A repository containing comprehensive Neural Networks based PyTorch implementations for the semantic text similarity task

Sharpened cosine similarity torch - A Sharpened Cosine Similarity layer for PyTorch

The Self-Supervised Learner can be used to train a classifier with fewer labeled examples needed using self-supervised learning.

PyTorch-LIT is the Lite Inference Toolkit (LIT) for PyTorch which focuses on easy and fast inference of large models on end-devices.

[CVPR 2021] "The Lottery Tickets Hypothesis for Supervised and Self-supervised Pre-training in Computer Vision Models" Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Michael Carbin, Zhangyang Wang

Patch Rotation: A Self-Supervised Auxiliary Task for Robustness and Accuracy of Supervised Models

The self-supervised goal reaching benchmark introduced in Discovering and Achieving Goals via World Models

Code for the paper "Adapting Monolingual Models: Data can be Scarce when Language Similarity is High"

PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models

This repository is the official implementation of Unleashing the Power of Contrastive Self-Supervised Visual Models via Contrast-Regularized Fine-Tuning (NeurIPS21).

Hierarchical Metadata-Aware Document Categorization under Weak Supervision (WSDM'21)

Computationally Efficient Optimization of Plackett-Luce Ranking Models for Relevance and Fairness

Torchserve server using a YoloV5 model running on docker with GPU and static batch inference to perform production ready inference.

Local Similarity Pattern and Cost Self-Reassembling for Deep Stereo Matching Networks

Ranking Models in Unlabeled New Environments （iccv21）