Self-Supervised Document-to-Document Similarity Ranking via Contextualized Language Models and Hierarchical Inference

Overview

Self-Supervised Document Similarity Ranking (SDR) via Contextualized Language Models and Hierarchical Inference

This repository contains the implementation of SDR.

 

Tested environment

  • Python 3.7
  • PyTorch 1.7
  • CUDA 11.0

Lower CUDA and PyTorch versions should work as well.

 

Contents

License, security, support, and code of conduct specifications are under the instructions directory.

Installation

Run

bash instructions/installation.sh 
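
The script sets up the dependencies for you. If you prefer a manual setup matching the tested versions, something along the following lines should work; the requirements.txt filename is an assumption, and instructions/installation.sh remains the authoritative reference.

conda create -n sdr python=3.7
conda activate sdr
pip install torch==1.7.1          # choose the wheel matching your CUDA version
pip install -r requirements.txt   # assumed filename; see instructions/installation.sh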

 

Datasets

The published datasets are:

  • Video games
    • 21,935 articles
    • Expert-annotated test set: 90 articles with 12 ground-truth recommendations.
    • Examples:
      • Grand Theft Auto - Mafia
      • Burnout Paradise - Forza Horizon 3
  • Wines
    • 1,635 articles
    • Test set crafted by a human sommelier: 92 articles with ~10 ground-truth recommendations.
    • Examples:
      • Pinot Meunier - Chardonnay
      • Dom Pérignon - Moët & Chandon

For more details and direct download links, see Wines and Video Games.
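
For intuition, ground-truth pairs like the ones above are typically used to score a ranker with hit-rate and MRR style metrics. The sketch below is a generic illustration; the function name and data layout are assumptions, not the repository's evaluation code.

def mrr_and_hit_at_k(ranked_titles, gold_titles, k=10):
    """ranked_titles: candidate articles ordered by predicted similarity to a seed article.
    gold_titles: set of ground-truth recommendations for that seed article."""
    hit = any(title in gold_titles for title in ranked_titles[:k])
    reciprocal_rank = 0.0
    for rank, title in enumerate(ranked_titles, start=1):
        if title in gold_titles:
            reciprocal_rank = 1.0 / rank
            break
    return reciprocal_rank, hit

# e.g., for the seed article "Grand Theft Auto" with gold set {"Mafia"}:
# mrr_and_hit_at_k(["Mafia", "Forza Horizon 3"], {"Mafia"})  ->  (1.0, True)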

 

Training

The training process downloads the datasets automatically.

python sdr_main.py --dataset_name video_games

The code is based on PyTorch Lightning; all PL Trainer hyperparameters are supported (e.g., limit_train_batches, limit_val_batches, limit_test_batches, check_val_every_n_epoch).
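
For example, assuming the repository exposes the standard Lightning Trainer flags on the command line (the flag names below are Lightning defaults, not options confirmed by this README):

python sdr_main.py --dataset_name video_games --limit_train_batches 100 --check_val_every_n_epoch 2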

Tensorboard support

All metrics are logged automatically and stored in

SDR/output/document_similarity/SDR/arch_SDR/dataset_name_<dataset>/<time_of_run>

Run tensorboard --logdir=<path> to view the logs.

 

Inference

The hierarchical inference described in the paper is implemented as a stand-alone service and can be used with any backbone algorithm (models/reco/hierarchical_reco.py).

 

python sdr_main.py --dataset_name <name> --resume_from_checkpoint <checkpoint> --test_only
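
To illustrate the idea behind the hierarchical inference (documents are compared paragraph by paragraph, and the paragraph similarities are aggregated into a single document-to-document score), here is a simplified sketch using plain cosine similarity. It is illustrative only and does not reproduce the exact aggregation used in models/reco/hierarchical_reco.py.

import numpy as np

def doc_to_doc_score(source_pars, candidate_pars):
    """Simplified hierarchical scoring.
    source_pars, candidate_pars: arrays of paragraph embeddings, shape (num_paragraphs, dim).
    Each source paragraph is matched to its most similar candidate paragraph,
    and the matches are averaged into one document-level score."""
    a = source_pars / np.linalg.norm(source_pars, axis=1, keepdims=True)
    b = candidate_pars / np.linalg.norm(candidate_pars, axis=1, keepdims=True)
    sims = a @ b.T                 # paragraph-to-paragraph cosine similarities
    best_match = sims.max(axis=1)  # best candidate paragraph for each source paragraph
    return float(best_match.mean())

# Ranking candidates for one source document (titles and embeddings are placeholders):
# scores = {title: doc_to_doc_score(src_emb, emb) for title, emb in candidate_embs.items()}
# recommendations = sorted(scores, key=scores.get, reverse=True)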

Results

Citing & Authors

If you find this repository or the annotated datasets helpful, feel free to cite our publication:

SDR: Self-Supervised Document-to-Document Similarity Ranking via Contextualized Language Models and Hierarchical Inference

 @misc{ginzburg2021selfsupervised,
     title={Self-Supervised Document Similarity Ranking via Contextualized Language Models and Hierarchical Inference}, 
     author={Dvir Ginzburg and Itzik Malkiel and Oren Barkan and Avi Caciularu and Noam Koenigstein},
     year={2021},
     eprint={2106.01186},
     archivePrefix={arXiv},
     primaryClass={cs.CL}
}

Contact: Dvir Ginzburg, Itzik Malkiel.

Comments
  • This repo is missing important files

There are important files that Microsoft projects should all have that are not present in this repository. A pull request has been opened to add the missing file(s). When the PR is merged, this issue will be closed automatically.

    Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.

    Merge this pull request

    opened by microsoft-github-policy-service[bot] 2
  • Adding Microsoft SECURITY.MD

    Please accept this contribution adding the standard Microsoft SECURITY.MD :lock: file to help the community understand the security policy and how to safely report security issues. GitHub uses the presence of this file to light-up security reminders and a link to the file. This pull request commits the latest official SECURITY.MD file from https://github.com/microsoft/repo-templates/blob/main/shared/SECURITY.md.

    Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.

    opened by microsoft-github-policy-service[bot] 0
  • Training with custom dataset?

    @dvirginz What part of the code should I refer to if I were to train the model on my custom dataset? Also, is it necessary to perform the MLM training along with the contrastive loss? (would using the contrastive loss alone degrade performance by a lot?)

    opened by puzzlecollector 0
  • how to use proposed dataset

Hi team, thank you for the great work. I want to use your proposed dataset (wines) for my study, but I found no ground truth; wines.txt contains only the titles and sections. I would like to know how the ground truth is arranged in this file. Thank you, team! Hope you reply soon!

    opened by duongnv0499 0
  • This repo is missing a LICENSE file

    This repository is currently missing a LICENSE file.

    A license helps users understand how to use your project in a compliant manner. You can find the standard MIT license Microsoft uses at: https://github.com/microsoft/repo-templates/blob/main/shared/LICENSE.

    If you would like to learn more about open source licenses, please visit the document at https://aka.ms/license (Microsoft-internal guidance).

    opened by microsoft-github-policy-service[bot] 0
  • Using padded tokens when creating averaged sentence embeddings

    When calculating the similarity loss between two sentences, it looks like we are using the averaged word embeddings per sentence. Within models.SDR.similarity_modeling.SimilarityModeling we have the following:

    ...
    non_masked_outputs = self.roberta(
        non_masked_input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
        position_ids=position_ids,
        head_mask=head_mask,
        inputs_embeds=inputs_embeds,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    non_masked_seq_out = non_masked_outputs[0]
    
    meaned_sentences = non_masked_seq_out.mean(1)
    miner_output = list(self.miner_func(meaned_sentences, sample_labels))
    
    sim_loss = self.similarity_loss_func(meaned_sentences, sample_labels, miner_output)
    ...
    

It appears we are also using the embeddings of the padded tokens, since sentence lengths aren't taken into account. Was this done by design, perhaps? (A masked mean pooling sketch that excludes padding is given after this comments list.)

    opened by AndrewLim1990 0
  • RuntimeError: CUDA out of memory.

    Train command

    %cd /home/ec2-user/SageMaker/SDR
    !python sdr_main.py --dataset_name wines
    

    Stacktrace:

    Traceback (most recent call last):
      File "sdr_main.py", line 80, in <module>
        main()
      File "sdr_main.py", line 28, in main
        main_train(model_class_pointer, hyperparams,parser)
      File "sdr_main.py", line 72, in main_train
        trainer.fit(model)
      File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 510, in fit
        results = self.accelerator_backend.train()
      File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 57, in train
        return self.train_or_test()
      File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in train_or_test
        results = self.trainer.train()
      File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 561, in train
        self.train_loop.run_training_epoch()
      File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 550, in run_training_epoch
        batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
      File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 692, in run_training_batch
        self.trainer.hiddens)
      File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 806, in training_step_and_backward
        result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
      File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 319, in training_step
        training_step_output = self.trainer.accelerator_backend.training_step(args)
      File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/accelerators/dp_accelerator.py", line 117, in training_step
        return self._step(args)
      File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/accelerators/dp_accelerator.py", line 113, in _step
        output = self.trainer.model(*args)
      File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pytorch_lightning/overrides/data_parallel.py", line 93, in forward
        return self.module.training_step(*inputs[0], **kwargs[0])
      File "/home/ec2-user/SageMaker/SDR/models/doc_similarity_pl_template.py", line 49, in training_step
        batch = self(batch)
      File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ec2-user/SageMaker/SDR/models/SDR/SDR.py", line 78, in forward
        eval(f"self.forward_{self.hparams.mode}")(batch)
      File "/home/ec2-user/SageMaker/SDR/models/SDR/SDR.py", line 48, in forward_train
        run_mlm=True,
      File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ec2-user/SageMaker/SDR/models/SDR/similarity_modeling.py", line 129, in forward
        return_dict=return_dict,
      File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/modeling_bert.py", line 835, in forward
        return_dict=return_dict,
      File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/modeling_bert.py", line 490, in forward
        output_attentions,
      File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/modeling_bert.py", line 433, in forward
        self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output
      File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/modeling_utils.py", line 1597, in apply_chunking_to_forward
        return forward_fn(*input_tensors)
      File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/modeling_bert.py", line 439, in feed_forward_chunk
        intermediate_output = self.intermediate(attention_output)
      File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/modeling_bert.py", line 367, in forward
        hidden_states = self.intermediate_act_fn(hidden_states)
      File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/functional.py", line 1556, in gelu
        return torch._C._nn.gelu(input)
    RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 14.76 GiB total capacity; 11.17 GiB already allocated; 14.75 MiB free; 11.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    
    opened by loretoparisi 6
  • TypeError: __init__() got an unexpected keyword argument 'filepath'

    I executed the training command

    %cd /home/ec2-user/SageMaker/SDR
    !python sdr_main.py --dataset_name video_games
    

    The error was

    /home/ec2-user/SageMaker/SDR
    [nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
    [nltk_data]   Package punkt is already up-to-date!
    WARNING:PML:The pytorch-metric-learning testing module requires faiss. You can install the GPU version with the command 'conda install faiss-gpu -c pytorch'
                            or the CPU version with 'conda install faiss-cpu -c pytorch'. Learn more at https://github.com/facebookresearch/faiss/blob/master/INSTALL.md
    Global seed set to 42
    INFO:lightning:Global seed set to 42
    /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/cryptography/hazmat/backends/openssl/x509.py:18: CryptographyDeprecationWarning: This version of cryptography contains a temporary pyOpenSSL fallback path. Upgrade pyOpenSSL now.
      utils.DeprecatedIn35,
    Some weights of SimilarityModeling were not initialized from the model checkpoint at roberta-large and are newly initialized: ['lm_head.decoder.bias']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    
    Log directory:
    /home/ec2-user/SageMaker/SDR/output/document_similarity/arch_SDR/dataset_name_video_games/test_only_False/22_11_2021-20_22_55
    
    Traceback (most recent call last):
      File "sdr_main.py", line 80, in <module>
        main()
      File "sdr_main.py", line 28, in main
        main_train(model_class_pointer, hyperparams,parser)
      File "sdr_main.py", line 55, in main_train
        verbose=True,
    TypeError: __init__() got an unexpected keyword argument 'filepath'
    
    opened by loretoparisi 0
  • SBERT v performance?

    Hi!

    Looking at the results, it is written that an experiment on SBERT v has been conducted, and I am curious where I can see the performance of SBERT v.

    Thanks.

    opened by haven-jeon 0
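
Regarding the padded-tokens comment above: a common way to exclude padding from a mean-pooled sentence embedding is to weight the mean by the attention mask. The snippet below is a generic masked mean pooling sketch along those lines, not the repository's code; non_masked_seq_out and attention_mask refer to the tensors in the quoted excerpt.

import torch

def masked_mean_pooling(seq_out, attention_mask):
    """seq_out: (batch, seq_len, hidden) token embeddings.
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding.
    Returns (batch, hidden) sentence embeddings that ignore padded positions."""
    mask = attention_mask.unsqueeze(-1).type_as(seq_out)  # (batch, seq_len, 1)
    summed = (seq_out * mask).sum(dim=1)                  # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)              # number of real tokens per sentence
    return summed / counts

# Possible drop-in replacement for the plain mean in the quoted excerpt:
# meaned_sentences = masked_mean_pooling(non_masked_seq_out, attention_mask)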