Overview

ConSERT

Code for our ACL 2021 paper - ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer

Requirements

torch==1.6.0
cudatoolkit==10.0.103
cudnn==7.6.5
sentence-transformers==0.3.9
transformers==3.4.0
tensorboardX==2.1
pandas==1.1.5
sentencepiece==0.1.85
matplotlib==3.4.1
apex==0.1.0

Get Started

  1. Download a pre-trained language model (e.g. bert-base-uncased) from the HuggingFace library
  2. Download the STS datasets to the ./data folder using the SentEval toolkit
  3. Run the following command to start the unsupervised experiment:
    python3 main.py --no_pair --seed 1 --use_apex_amp --apex_amp_opt_level O1 --batch_size 96 --max_seq_length 64 --evaluation_steps 200 --add_cl --cl_loss_only --cl_rate 0.15 --temperature 0.1 --learning_rate 0.0000005 --train_data stssick --num_epochs 10 --da_final_1 feature_cutoff --da_final_2 shuffle --cutoff_rate_final_1 0.2 --model_name_or_path [PRETRAINED_BERT_FOLDER] --model_save_path ./output/unsup-base-feature_cutoff-shuffle --force_del --no_dropout --patience 10
    where [PRETRAINED_BERT_FOLDER] should be replaced with the path to the folder containing the downloaded pre-trained language model (a sketch for preparing that folder follows).
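
For step 1, a minimal sketch of fetching bert-base-uncased with the pinned transformers==3.4.0 and saving it to a local folder usable as [PRETRAINED_BERT_FOLDER]; the save_dir value here is only an example:

    # Fetch bert-base-uncased and save it locally so the folder can be
    # passed to --model_name_or_path as [PRETRAINED_BERT_FOLDER].
    from transformers import AutoModel, AutoTokenizer

    model_name = "bert-base-uncased"
    save_dir = "./bert-base-uncased"  # example folder name

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    tokenizer.save_pretrained(save_dir)
    model.save_pretrained(save_dir)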

Citation

@article{yan2021consert,
  title={ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer},
  author={Yan, Yuanmeng and Li, Rumei and Wang, Sirui and Zhang, Fuzheng and Wu, Wei and Xu, Weiran},
  journal={arXiv preprint arXiv:2105.11741},
  year={2021}
}
Comments
  • About sentence representations

    About sentence representations

    Q1: In unsupervised training, the two data augmentation strategies produce two representations of a sentence, but at evaluation time the paper says the sentence representation is obtained by averaging the tokens of the last two layers. Which sentence is meant here: the representation from feeding the original sentence into the Transformer, or the one from the augmented sentence? Q2: In the supervised tasks, a downstream-task loss is added. When training in the joint setting, does the data augmentation in Figure 2 still use two augmentation strategies, or is one view the original sentence and the other the augmented sentence? Likewise, which sentence's representation is used at evaluation time? Thanks!

    opened by zhangxiaowei5346 1
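
    For context on Q1, a minimal sketch of the "average the tokens of the last two layers" pooling, assuming plain HuggingFace transformers (with the pinned 3.4.0, pass return_dict=True so the outputs expose hidden_states); an illustration, not the repository's evaluation code:

        # Average all token vectors from the last two Transformer layers
        # to pool one sentence vector (padding ignored for brevity).
        import torch
        from transformers import AutoModel, AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        model = AutoModel.from_pretrained("bert-base-uncased",
                                          output_hidden_states=True)

        inputs = tokenizer("An example sentence.", return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs, return_dict=True)
        hidden_states = outputs.hidden_states       # embeddings + one entry per layer
        last_two = torch.stack(hidden_states[-2:])  # (2, batch, seq_len, hidden)
        sentence_vec = last_two.mean(dim=0).mean(dim=1)  # layers, then tokens
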
  • cpu ram memory leak

    cpu ram memory leak

    I've been re-implementing ConSERT these days.

    Just out of curiosity, I removed early stopping to check whether it makes a difference in scores.

    I found that this code might have a CPU memory leak.

    When I execute this code, the total CPU memory usage keeps increasing until the machine ends up shutting down.

    Have you experienced this kind of situation with this code as well?

    opened by qmin2 0
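
    A small debugging sketch for confirming such a leak, assuming psutil is available (it is not pinned in the Requirements above): log resident memory periodically and watch whether it grows without bound:

        # Log the process's resident set size; call this periodically
        # (e.g. every evaluation step) and watch for unbounded growth.
        import os
        import psutil

        def log_rss_mb():
            rss = psutil.Process(os.getpid()).memory_info().rss
            print(f"RSS: {rss / 1e6:.1f} MB")
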
  • When reproducing with unsup-consert-base.sh, the results fall about 10 points short of those in the paper

    When reproducing with unsup-consert-base.sh, the results fall about 10 points short of those in the paper

    similarity mean: 0.6198720335960388  similarity std: 0.23218630254268646  similarity max: 0.9888665676116943  similarity min: -0.11419621855020523
    labels mean: 0.5215833187103271  labels std: 0.30510348081588745  labels max: 1.0  labels min: 0.0

    I am not sure whether I have overlooked some detail; I would appreciate any advice.

    opened by Xiaoyingzi09 0
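
    For context, STS benchmarks are conventionally scored as the Spearman correlation between predicted cosine similarities and gold labels, so the summary statistics above do not by themselves determine the score. A minimal sketch with illustrative values only:

        # Spearman correlation between predicted similarities and gold labels.
        import numpy as np
        from scipy.stats import spearmanr

        similarities = np.array([0.62, 0.99, -0.11])  # illustrative values only
        labels = np.array([0.52, 1.0, 0.0])           # illustrative values only
        score, _ = spearmanr(similarities, labels)
        print(score)
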
  • AttributeError: module 'torch.distributed' has no attribute '_all_gather_base'

    AttributeError: module 'torch.distributed' has no attribute '_all_gather_base'

    Neither torch 1.6.0 nor torch 1.8.1 works; both raise the error shown in the title.

    Traceback (most recent call last):
      File "main.py", line 14, in <module>
        from sentence_transformers import models, losses
      File "/root/ConSERT/sentence_transformers/__init__.py", line 3, in <module>
        from .datasets import SentencesDataset, SentenceLabelDataset, ParallelSentencesDataset
      File "/root/ConSERT/sentence_transformers/datasets/__init__.py", line 1, in <module>
        from .sampler import *
      File "/root/ConSERT/sentence_transformers/datasets/sampler/__init__.py", line 1, in <module>
        from .LabelSampler import *
      File "/root/ConSERT/sentence_transformers/datasets/sampler/LabelSampler.py", line 6, in <module>
        from ...datasets import SentenceLabelDataset
      File "/root/ConSERT/sentence_transformers/datasets/SentenceLabelDataset.py", line 8, in <module>
        from .. import SentenceTransformer
      File "/root/ConSERT/sentence_transformers/SentenceTransformer.py", line 11, in <module>
        import transformers
      File "/root/ConSERT/transformers/__init__.py", line 22, in <module>
        from .integrations import (  # isort:skip
      File "/root/ConSERT/transformers/integrations.py", line 58, in <module>
        from .file_utils import is_torch_tpu_available
      File "/root/ConSERT/transformers/file_utils.py", line 140, in <module>
        from apex import amp  # noqa: F401
      File "/root/miniconda3/lib/python3.8/site-packages/apex/__init__.py", line 27, in <module>
        from . import transformer
      File "/root/miniconda3/lib/python3.8/site-packages/apex/transformer/__init__.py", line 4, in <module>
        from apex.transformer import pipeline_parallel
      File "/root/miniconda3/lib/python3.8/site-packages/apex/transformer/pipeline_parallel/__init__.py", line 1, in <module>
        from apex.transformer.pipeline_parallel.schedules import get_forward_backward_func
      File "/root/miniconda3/lib/python3.8/site-packages/apex/transformer/pipeline_parallel/schedules/__init__.py", line 3, in <module>
        from apex.transformer.pipeline_parallel.schedules.fwd_bwd_no_pipelining import (
      File "/root/miniconda3/lib/python3.8/site-packages/apex/transformer/pipeline_parallel/schedules/fwd_bwd_no_pipelining.py", line 10, in <module>
        from apex.transformer.pipeline_parallel.schedules.common import Batch
      File "/root/miniconda3/lib/python3.8/site-packages/apex/transformer/pipeline_parallel/schedules/common.py", line 9, in <module>
        from apex.transformer.pipeline_parallel.p2p_communication import FutureTensor
      File "/root/miniconda3/lib/python3.8/site-packages/apex/transformer/pipeline_parallel/p2p_communication.py", line 25, in <module>
        from apex.transformer.utils import split_tensor_into_1d_equal_chunks
      File "/root/miniconda3/lib/python3.8/site-packages/apex/transformer/utils.py", line 11, in <module>
        torch.distributed.all_gather_into_tensor = torch.distributed._all_gather_base
    AttributeError: module 'torch.distributed' has no attribute '_all_gather_base'

    This error is tracked in apex: https://github.com/NVIDIA/apex/issues/1526

    The apex version does not match the torch version. Can you tell me which torch version you used?

    opened by SkullFang 1
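
    A quick check, assuming the diagnosis in the linked apex issue (newer apex aliases the private torch.distributed._all_gather_base, which older torch builds such as the pinned 1.6.0 lack):

        # Check whether the private symbol newer apex expects is present.
        import torch
        import torch.distributed

        print(torch.__version__)
        print(hasattr(torch.distributed, "_all_gather_base"))
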
  • About dropout

    About dropout

    Hello, I have two questions. 1. I see that in the code only unsup-consert-base.sh uses the no_dropout flag, while the other scripts do not set BERT's built-in dropout to 0. Why is that? 2. When BERT's dropout is disabled, is it disabled for both the original sentence and the augmented sentence, or only for the augmented one?

    opened by LBJ6666 2
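
    A minimal sketch of what disabling BERT's built-in dropout amounts to, using plain transformers; the repository itself passes attention_probs_dropout_prob=0.0 and hidden_dropout_prob=0.0 to its Transformer wrapper (visible in the traceback of the next issue):

        # Zero both dropout probabilities in the BERT config before loading.
        from transformers import AutoConfig, AutoModel

        config = AutoConfig.from_pretrained("bert-base-uncased")
        config.hidden_dropout_prob = 0.0
        config.attention_probs_dropout_prob = 0.0
        model = AutoModel.from_pretrained("bert-base-uncased", config=config)
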
  • OSError when running main.py

    OSError when running main.py

    I've been running into this issue when I run bash scripts/unsup-consert-base.sh:

    Traceback (most recent call last):
      File "main.py", line 327, in <module>
        main(args)
      File "main.py", line 185, in main
        word_embedding_model = models.Transformer(args.model_name_or_path, attention_probs_dropout_prob=0.0, hidden_dropout_prob=0.0)
      File "/home/qmin/ConSERT/sentence_transformers/models/Transformer.py", line 36, in __init__
        self.auto_model = AutoModel.from_pretrained(model_name_or_path, config=config, cache_dir=cache_dir)
      File "/home/qmin/ConSERT/transformers/modeling_auto.py", line 629, in from_pretrained
        pretrained_model_name_or_path, *model_args, config=config, **kwargs
      File "/home/qmin/ConSERT/transformers/modeling_utils.py", line 954, in from_pretrained
        "Unable to load weights from pytorch checkpoint file. "
    
    OSError: Unable to load weights from pytorch checkpoint file. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True. 
    

    Is there any workaround?

    opened by qmin2 2
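
    One possible workaround, following the hint in the error message itself; this assumes the folder really holds a TF 2.0 checkpoint (and that TensorFlow is installed), otherwise re-downloading the PyTorch weights (pytorch_model.bin) is the usual fix:

        # Load TF 2.0 checkpoint weights into a PyTorch model, as the
        # error message suggests (requires TensorFlow to be installed).
        from transformers import AutoModel

        model = AutoModel.from_pretrained("./bert-base-uncased", from_tf=True)  # example path
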
  • 'BertModel' object has no attribute 'set_flag'

    'BertModel' object has no attribute 'set_flag'

    The exact error is:

      File "/data2/work2/chenzhihao/NLP/nlp/sentence_transformers/SentenceTransformer.py", line 594, in fit
        loss_value = loss_model(features, labels)
      File "/root/anaconda3/envs/NLP_py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/data2/work2/chenzhihao/NLP/nlp/sentence_transformers/losses/AdvCLSoftmaxLoss.py", line 775, in forward
        rep_a_view1 = self._data_aug(sentence_feature_a, self.data_augmentation_strategy_final_1,
      File "/data2/work2/chenzhihao/NLP/nlp/sentence_transformers/losses/AdvCLSoftmaxLoss.py", line 495, in _data_aug
        self.model[0].auto_model.set_flag("data_aug_cutoff", True)
      File "/root/anaconda3/envs/NLP_py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
        raise AttributeError("'{}' object has no attribute '{}'".format(
    AttributeError: 'BertModel' object has no attribute 'set_flag'

    I loaded the hfl/chinese-roberta-wwm-ext model.

    opened by zhihao-chen 4
  • How to use the model with sentence-transformer for inference?

    How to use the model with sentence-transformer for inference?

    Cannot load the model.

    Code:

        from sentence_transformers import SentenceTransformer

        model = SentenceTransformer("../../models/consbert/unsup-consert-base-atec_ccks")  # the model path

    Error message:

    Traceback (most recent call last):
      File "/home/qhd/PythonProjects/GraduationProject/code/preprocess_unlabeled_second/sentence-bert.py", line 16, in <module>
        model = SentenceTransformer("../../models/cosbert/unsup-consert-base-atec_ccks")
      File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/sentence_transformers/SentenceTransformer.py", line 87, in __init__
        modules = self._load_sbert_model(model_path)
      File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/sentence_transformers/SentenceTransformer.py", line 824, in _load_sbert_model
        module = module_class.load(os.path.join(model_path, module_config['path']))
      File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/sentence_transformers/models/Transformer.py", line 123, in load
        return Transformer(model_name_or_path=input_path, **config)
      File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/sentence_transformers/models/Transformer.py", line 30, in __init__
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path if tokenizer_name_or_path is not None else model_name_or_path, cache_dir=cache_dir, **tokenizer_args)
      File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 445, in from_pretrained
        return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
      File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1719, in from_pretrained
        return cls._from_pretrained(
      File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1791, in _from_pretrained
        tokenizer = cls(*init_inputs, **init_kwargs)
      File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/transformers/models/bert/tokenization_bert_fast.py", line 177, in __init__
        super().__init__(
      File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 96, in __init__
        fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
    Exception: No such file or directory (os error 2)

    opened by qhd1996 2
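
    The final frame fails to open the fast-tokenizer file inside the saved model directory, so one plausible fix is to save the tokenizer of the base model used for training into that folder before loading; a sketch, with bert-base-uncased standing in for whichever base model was actually used:

        # Save the base model's tokenizer files next to the trained weights
        # so SentenceTransformer can reconstruct its Transformer module.
        from transformers import AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # stand-in base model
        tokenizer.save_pretrained("../../models/consbert/unsup-consert-base-atec_ccks")
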
Owner
Yan Yuanmeng
A student at Beijing University of Posts and Telecommunications.
Code for our paper "Transfer Learning for Sequence Generation: from Single-source to Multi-source" in ACL 2021.

TRICE: a task-agnostic transferring framework for multi-source sequence generation This is the source code of our work Transfer Learning for Sequence

THUNLP-MT 9 Jun 27, 2022
This repository describes our reproducible framework for assessing self-supervised representation learning from speech

LeBenchmark: a reproducible framework for assessing SSL from speech Self-Supervised Learning (SSL) using huge unlabeled data has been successfully exp

null 49 Aug 24, 2022
Python Implementation of "Modeling the Influence of Verb Aspect on the Activation of Typical Event Locations with BERT" (Findings of ACL: ACL 2021)

BERT-for-Surprisal Python Implementation of "Modeling the Influence of Verb Aspect on the Activation of Typical Event Locations with BERT" (Findings

null 7 Dec 5, 2022
Code for our ACL 2021 (Findings) Paper - Fingerprinting Fine-tuned Language Models in the wild .

Fingerprinting Fine-tuned Language Models in the wild This is the code and dataset for our ACL 2021 (Findings) Paper - Fingerprinting Fine-tuned La

LCS2-IIITDelhi 5 Sep 13, 2022
🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box.

pySBD: Python Sentence Boundary Disambiguation (SBD) pySBD - python Sentence Boundary Disambiguation (SBD) - is a rule-based sentence boundary detecti

Nipun Sadvilkar 549 Jan 6, 2023
REST API for sentence tokenization and embedding using Multilingual Universal Sentence Encoder.

What is MUSE? MUSE stands for Multilingual Universal Sentence Encoder - multilingual extension (16 languages) of Universal Sentence Encoder (USE). MUS

Dani El-Ayyass 47 Sep 5, 2022
Hostapd-mac-tod-acl - Setup a hostapd AP with MAC ToD ACL

A brief explanation This script provides a quick way to setup a Time-of-day (Tod

null 2 Feb 3, 2022
Code for ACL 2022 main conference paper "STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation".

STEMM: Self-learning with Speech-Text Manifold Mixup for Speech Translation This is a PyTorch implementation for the ACL 2022 main conference paper ST

ICTNLP 29 Oct 16, 2022
Implementation of our ACL 2022 paper Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation

Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation This is the implementation of our paper: Bridging the

hezw.tkcw 20 Dec 12, 2022
SimCSE: Simple Contrastive Learning of Sentence Embeddings

SimCSE: Simple Contrastive Learning of Sentence Embeddings This repository contains the code and pre-trained models for our paper SimCSE: Simple Contr

Princeton Natural Language Processing 2.5k Jan 7, 2023
Korean Simple Contrastive Learning of Sentence Embeddings using SKT KoBERT and kakaobrain KorNLU dataset

KoSimCSE Korean Simple Contrastive Learning of Sentence Embeddings implementation using pytorch SimCSE Installation git clone https://github.com/BM-K/

null 34 Nov 24, 2022
NAACL 2022: MCSE: Multimodal Contrastive Learning of Sentence Embeddings

MCSE: Multimodal Contrastive Learning of Sentence Embeddings This repository contains code and pre-trained models for our NAACL-2022 paper MCSE: Multi

Saarland University Spoken Language Systems Group 39 Nov 15, 2022
Code for ACL 2021 main conference paper "Conversations are not Flat: Modeling the Intrinsic Information Flow between Dialogue Utterances".

Conversations are not Flat: Modeling the Intrinsic Information Flow between Dialogue Utterances This repository contains the code and pre-trained mode

ICTNLP 90 Dec 27, 2022
Code for CVPR 2021 paper: Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning

Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning This is the PyTorch companion code for the paper: A

Amazon 69 Jan 3, 2023
A Structured Self-attentive Sentence Embedding

Structured Self-attentive sentence embeddings Implementation for the paper A Structured Self-Attentive Sentence Embedding, which was published in ICLR

Kaushal Shetty 488 Nov 28, 2022
A PyTorch implementation of paper "Learning Shared Semantic Space for Speech-to-Text Translation", ACL (Findings) 2021

Chimera: Learning Shared Semantic Space for Speech-to-Text Translation This is a Pytorch implementation for the "Chimera" paper Learning Shared Semant

Chi Han 43 Dec 28, 2022