PyTorch implementation of Supporting Clustering with Contrastive Learning, NAACL 2021

Last update: Jan 5, 2023

Overview

Supporting Clustering with Contrastive Learning

SCCL (NAACL 2021) Dejiao Zhang, Feng Nan, Xiaokai Wei, Shangwen Li, Henghui Zhu, Kathleen McKeown, Ramesh Nallapati, Andrew Arnold, and Bing Xiang.

Requirements

Datasets:

In addition to the original data, SCCL requires a pair of augmented data for each instance. See our paper for details.

Dependencies:

python==3.6
pytorch==1.6.0
sentence-transformers==0.3.8
transformers==3.3.0
tensorboardX==2.1

To run the code:

1. Put your dataset in the folder "./datasamples"  # due to licensing issues, we are not able to release the datasets yet; we will release them as soon as possible
2. bash ./scripts/run.sh  # you need to change the dataset info and results path accordingly

Citation:

@inproceedings{zhang-etal-2021-supporting,
title = "Supporting Clustering with Contrastive Learning",
author = "Zhang, Dejiao  and Nan, Feng  and Wei, Xiaokai  and
  Li, Shang-Wen  and Zhu, Henghui  and McKeown, Kathleen  and
  Nallapati, Ramesh  and Arnold, Andrew O.  and Xiang, Bing",
booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = jun,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2021.naacl-main.427",
pages = "5419--5430",
abstract = " ",}

Comments

Can't achieve the scores in the paper

I'm trying to reproduce the paper, but I can't reach 0.85 ACC running this code on GoogleNews-S. All hyper-parameters are set to the same values as in the paper, and the data is augmented with contextual augmentation. The results show that the representation with K-Means performs better (0.75 ACC) than the model, which only reaches 0.62 ACC. When I increase the clustering head lr, the model's result still stays at the 0.62 level. What should I do to improve this?

opened by PaffxAroma 9
Twitter dataset not found?

I am trying to reproduce the original paper, but I am not able to find the Twitter dataset, and the link (Yin and Wang, 2016) in the paper does not lead to the dataset.

opened by urospet 5
Worse results with Banking Dataset

@Dejiao2018 Hi, the idea in the paper is quite good, but I am not able to get good results with a different dataset, for example the Banking Dataset (77 clusters) (https://arxiv.org/pdf/2003.04807.pdf).

Convergence criterion: percentage change in predictions between subsequent iterations < 0.1%, using the same available code.

Initial Results Before Training: Kmeans : ACC: 42.4 NMI: 60.95 ARI: 27.49

After Training: Kmeans : ACC: 13.0 NMI: 30.32 ARI: 5.2
After Training: Cluster Model: ACC: 11.4 NMI: 26.12 ARI: 4.7

Over subsequent iterations, the results get worse.

opened by rajat-tech-002 2
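
The convergence criterion mentioned in this issue (percentage change in predictions between subsequent iterations below 0.1%) can be sketched as follows. This is a minimal illustration and not the repository's code: the function names and the `tol` parameter are hypothetical, and it assumes cluster ids keep the same meaning from one iteration to the next.

```python
def changed_fraction(prev, curr):
    """Fraction of instances whose cluster assignment differs between two iterations."""
    if len(prev) != len(curr):
        raise ValueError("assignment lists must have the same length")
    if not prev:
        return 0.0
    changed = sum(1 for a, b in zip(prev, curr) if a != b)
    return changed / len(prev)


def has_converged(prev, curr, tol=0.001):
    """True when fewer than `tol` (0.1% by default) of the assignments changed."""
    return changed_fraction(prev, curr) < tol


# Toy assignments for eight instances; only the last one changes cluster.
prev = [0, 1, 2, 1, 0, 2, 1, 0]
curr = [0, 1, 2, 1, 0, 2, 1, 1]
print(changed_fraction(prev, curr))
print(has_converged(prev, curr))
```

A check like this only says that the predictions have stopped moving. It does not say that they are good, which is consistent with the scores above degrading while training continues.
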
BUG?: There is an extra log() in the code when calculating cluster_loss?

In training.py: cluster_loss = self.cluster_loss((output+1e-08).log(), target)/output.shape[0]

Why is .log() used here? It seems inconsistent with the paper.

https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html?highlight=kldivloss#torch.nn.KLDivLoss

opened by wulaoshi 1
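
One possible reading of the question above, assuming self.cluster_loss is a torch.nn.KLDivLoss as the linked documentation suggests: that loss is documented to take its input as log-probabilities and its target as probabilities, computing target * (log(target) - input) pointwise. The pure-Python sketch below re-implements only that documented formula (it does not call PyTorch, and the helper name kl_div_sum is hypothetical) to show what happens with and without the log.

```python
import math


def kl_div_sum(log_input, target):
    """Sum of target * (log(target) - log_input); terms with target == 0 contribute 0."""
    total = 0.0
    for lx, t in zip(log_input, target):
        if t > 0:
            total += t * (math.log(t) - lx)
    return total


p = [0.7, 0.2, 0.1]  # target distribution
q = [0.5, 0.3, 0.2]  # predicted cluster probabilities

# Passing log(q), as in the quoted line, yields KL(p || q).
with_log = kl_div_sum([math.log(v + 1e-8) for v in q], p)

# Passing q directly is not a KL divergence and can even be negative.
without_log = kl_div_sum(q, p)

reference = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
print(with_log, reference, without_log)
```

Under this assumption the .log() call would be how the KL term from the paper is obtained through this API, rather than an extra operation.
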
No code found for dataset partitioning

Did you divide the dataset into training dataset and test dataset? I didn't find any content related to dataset partitioning in the paper. In the code corresponding to the paper, I found that there are data files named 'test' and 'train', but it seems that the main program file 'main.py' uses the same file specified by argument 'dataname' when training and testing.

opened by gzpbbd 1
Not sure how the augmented pairs are generated by the Contextual Augmenter

Hi, I am trying to reproduce the paper, but I am not sure how the augmented pairs are generated by the contextual augmenter. The paper says that the data is augmented by Bertbase and Roberta via word substitution. Does it mean that you augment the input sentence once with Bertbase and once with Roberta and combine these two sentences into the augmented pair? Or does it mean that during the experiment you use a single pre-trained model (e.g., Roberta) and use it to augment the sentence twice to get the augmented pair? I tried the latter, i.e., using Roberta to generate the two augmented sentences for one sentence, but it did not work.

opened by zexuanqiu 1
How is the training data generated?

train_data = pd.read_csv(os.path.join(args.data_path, args.dataname))
train_text = train_data['text'].fillna('.').values
train_text1 = train_data['text1'].fillna('.').values
train_text2 = train_data['text2'].fillna('.').values
train_label = train_data['label'].astype(int).values

What is the difference between text, text1 and text2 here?

opened by jx1100370217 1
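
The loading code quoted above expects a CSV with the columns text, text1, text2 and label. Given the README's statement that SCCL requires a pair of augmented data for each instance, a plausible reading (an assumption, not confirmed in this thread) is that text is the original sentence and text1/text2 are its two augmentations. The sketch below builds such a file in memory. toy_augment is a hypothetical stand-in that drops one random word; it is not the contextual augmenter used in the paper.

```python
import csv
import io
import random


def toy_augment(sentence, rng):
    """Hypothetical stand-in for a real augmenter: drops one random word."""
    words = sentence.split()
    if len(words) < 2:
        return sentence
    drop = rng.randrange(len(words))
    return " ".join(w for i, w in enumerate(words) if i != drop)


def build_rows(samples, seed=0):
    """Create one row per instance with the original text and two augmented copies."""
    rng = random.Random(seed)
    rows = []
    for text, label in samples:
        rows.append({
            "text": text,
            "text1": toy_augment(text, rng),
            "text2": toy_augment(text, rng),
            "label": label,
        })
    return rows


samples = [
    ("how do i reset my card pin", 0),
    ("stock markets fall on rate fears", 1),
]
rows = build_rows(samples)

# Write to an in-memory buffer; replace it with an open file to save to disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "text1", "text2", "label"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```
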
Doubt regarding dataloader file

I have a small doubt in the file dataloader.py under the dataloader directory. In the function augment_loader, the train_label makes sense i.e., if train_text1 and train_text2 belong to same parent, then it is 1, otherwise it is -1. But, in the function train_unshuffle_loader, I don't get the use of train_label. Can you please explain that?

opened by Sohanpatnaik106 1
performance with virtual augmentation

Hi, I'm reproducing the paper and having some issues. I used the SCCL with virtual augmentation command below:
'''
python3 main.py \
    --resdir $path-to-store-your-results \
    --use_pretrain SBERT \
    --bert distilbert \
    --datapath $path-to-your-data \
    --dataname searchsnippets \
    --num_classes 8 \
    --text text \
    --label label \
    --objective SCCL \
    --augtype virtual \
    --temperature 0.5 \
    --eta 10 \
    --lr 1e-05 \
    --lr_scale 100 \
    --max_length 32 \
    --batch_size 400 \
    --max_iter 1000 \
    --print_freq 100 \
    --gpuid 1 &
'''
However, the ACC of the representation with K-Means is higher than the ACC of the model. I have no idea what the problem is. Thank you for your help.

opened by hoho5702-jbnu 0
add the bash scripts for data preparation

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

opened by scilearner 0
Contrastive loss is inconsistent between the paper and the code

In Section 3.1 of the paper, contrastive loss = -log(pos/neg), but in the code, contrastive loss = -log(pos/(neg+pos)). Why does the denominator need to include pos in the code?

opened by pickwu 0
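
The two forms named in this issue can be compared numerically. The sketch below is illustrative only: the function names are hypothetical, the similarities are toy values, and the temperature of 0.5 is taken from the command quoted in another issue on this page. Keeping the positive term in the denominator is the form commonly used for NT-Xent-style losses. It keeps the ratio below 1, so the loss stays positive.

```python
import math


def loss_pos_over_neg(pos_sim, neg_sims, temperature=0.5):
    """-log(pos / neg): the form the issue attributes to the paper."""
    pos = math.exp(pos_sim / temperature)
    neg = sum(math.exp(s / temperature) for s in neg_sims)
    return -math.log(pos / neg)


def loss_pos_over_all(pos_sim, neg_sims, temperature=0.5):
    """-log(pos / (pos + neg)): the form the issue attributes to the code."""
    pos = math.exp(pos_sim / temperature)
    neg = sum(math.exp(s / temperature) for s in neg_sims)
    return -math.log(pos / (pos + neg))


# Toy cosine similarities: one positive pair and two negatives.
a = loss_pos_over_neg(0.9, [0.1, -0.2])
b = loss_pos_over_all(0.9, [0.1, -0.2])
print(a, b)
```

Algebraically, -log(pos/(pos+neg)) = log(1 + neg/pos), which is the softplus of the first form. Both are therefore minimized by the same changes in similarity, but the first is unbounded below while the second is bounded below by 0.
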
Did you get the score for searchsnippets in the paper by using the plain distilbert as backbone rather than sentence bert?

Can you do early stopping whenever the loss stops decreasing or the predictions stop changing much?

Originally posted by @Dejiao2018 in https://github.com/amazon-research/sccl/issues/22#issuecomment-1175308185

opened by hmllmh 0
Has anyone reproduced the results of the paper using the hyper-parameters provided in the paper or in the source code?

Sometimes I can get a comparable score at a certain step; however, training always converges to a bad score. I don't know what is wrong. Can somebody help?

opened by hmllmh 4
get_kmeans_centers uses embedder.encode, but SCCLBert uses sentbert.forward() to get the representation

I think both places want to get the representation of sentences, so why not use the same function to get the same representation? Is there any benefit to doing it this way?

opened by jkkl 0
Bug in confusion.add

There seems to be a bug in the add function of sccl.utils.metric.Confusion. In self._conf_flat.index_add_(0, indices, ones), the indices cause an error: IndexError: index out of range in self. Why is the add operation computed as indices = (target*self.conf.stride(0) + pred.squeeze_().type_as(target)).type_as(self.conf)?

opened by MrRace 5
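
The index expression quoted in this issue maps a (target, pred) pair to an offset in a flattened confusion matrix. For a contiguous k x k matrix the row stride equals k, so the offset is target * k + pred. The sketch below mirrors that arithmetic in pure Python. The function names are hypothetical, and the explanation that the IndexError comes from label ids outside [0, k), for example labels that start at 1, is an assumption rather than a confirmed diagnosis.

```python
def flat_index(target, pred, k):
    """Row-major offset of cell (target, pred) in a k x k matrix; k is the row stride."""
    return target * k + pred


def add_to_confusion(conf_flat, targets, preds, k):
    """Increment the flattened confusion matrix once per (target, pred) pair."""
    for t, p in zip(targets, preds):
        if not (0 <= t < k and 0 <= p < k):
            raise IndexError(f"label pair ({t}, {p}) is outside a {k}x{k} matrix")
        conf_flat[flat_index(t, p, k)] += 1
    return conf_flat


k = 3
conf = add_to_confusion([0] * (k * k), [0, 1, 2, 2], [0, 1, 2, 1], k)
print(conf)

# A label equal to k (e.g. 1-based labels with k classes) falls outside the matrix.
try:
    add_to_confusion([0] * (k * k), [3], [0], k)
except IndexError as err:
    print("out of range:", err)
```

If that assumption holds, remapping labels to the range 0..k-1 before calling add would avoid the error.
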

Graph Regularized Residual Subspace Clustering Network for hyperspectral image clustering

5 Jul 18, 2022

Awesome Deep Graph Clustering is a collection of SOTA, novel deep graph clustering methods

ADGC: Awesome Deep Graph Clustering ADGC is a collection of state-of-the-art (SOTA), novel deep graph clustering methods (papers, codes and datasets).

297 Dec 27, 2022

Paddle implementation for "Highly Efficient Knowledge Graph Embedding Learning with Closed-Form Orthogonal Procrustes Analysis" (NAACL 2021)

ProcrustEs-KGE Paddle implementation for Highly Efficient Knowledge Graph Embedding Learning with Orthogonal Procrustes Analysis A more detailed re

4 Jun 9, 2021

This is the code for the paper "Contrastive Clustering" (AAAI 2021)

Contrastive Clustering (CC) This is the code for the paper "Contrastive Clustering" (AAAI 2021) Dependency python>=3.7 pytorch>=1.6.0 torchvision>=0.8

210 Dec 30, 2022

NAACL'2021: Factual Probing Is [MASK]: Learning vs. Learning to Recall

OptiPrompt This is the PyTorch implementation of the paper Factual Probing Is [MASK]: Learning vs. Learning to Recall. We propose OptiPrompt, a simple

150 Dec 20, 2022

Paddle implementation for "Cross-Lingual Word Embedding Refinement by ℓ1 Norm Optimisation" (NAACL 2021)

L1-Refinement Paddle implementation for "Cross-Lingual Word Embedding Refinement by ℓ1 Norm Optimisation" (NAACL 2021) A more detailed readme is co

4 Jun 9, 2021

Re-implementation of the Noise Contrastive Estimation algorithm for pyTorch, following "Noise-contrastive estimation: A new estimation principle for unnormalized statistical models." (Gutmann and Hyvarinen, AISTATS 2010)

Noise Contrastive Estimation for pyTorch Overview This repository contains a re-implementation of the Noise Contrastive Estimation algorithm, implemen

42 Nov 24, 2022

source code and pre-trained/fine-tuned checkpoint for NAACL 2021 paper LightningDOT

LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval This repository contains source code and pre-trained/fine-tun

65 Dec 26, 2022

Codes for NAACL 2021 Paper "Unsupervised Multi-hop Question Answering by Question Generation"

Unsupervised-Multi-hop-QA This repository contains code and models for the paper: Unsupervised Multi-hop Question Answering by Question Generation (NA

70 Nov 27, 2022

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].

PLBART Code pre-release of our work, Unified Pre-training for Program Understanding and Generation accepted at NAACL 2021. Note. A detailed documentat

138 Dec 30, 2022

Designing a Minimal Retrieve-and-Read System for Open-Domain Question Answering (NAACL 2021)

Designing a Minimal Retrieve-and-Read System for Open-Domain Question Answering Abstract In open-domain question answering (QA), retrieve-and-read mec

34 Apr 13, 2022

Contextualized Perturbation for Textual Adversarial Attack, NAACL 2021

Contextualized Perturbation for Textual Adversarial Attack Introduction This is a PyTorch implementation of Contextualized Perturbation for Textual Ad

30 Jan 1, 2023

[NAACL & ACL 2021] SapBERT: Self-alignment pretraining for BERT.

SapBERT: Self-alignment pretraining for BERT This repo holds code for the SapBERT model presented in our NAACL 2021 paper: Self-Alignment Pretraining

104 Dec 7, 2022

Self-training with Weak Supervision (NAACL 2021)

This repo holds the code for our weak supervision framework, ASTRA, described in our NAACL 2021 paper: "Self-Training with Weak Supervision"

148 Nov 20, 2022

Code for NAACL 2021 full paper "Efficient Attentions for Long Document Summarization"

LongDocSum Code for NAACL 2021 paper "Efficient Attentions for Long Document Summarization" This repository contains data and models needed to reprodu

56 Jan 2, 2023

Source code for NAACL 2021 paper "TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference"

TR-BERT Source code and dataset for "TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference". The code is based on huggingface's transformers.

37 Oct 30, 2022

Open-Ended Commonsense Reasoning (NAACL 2021)

Open-Ended Commonsense Reasoning Quick links: [Paper] | [Video] | [Slides] | [Documentation] This is the repository of the paper, Differentiable Open-

31 Oct 19, 2022

✅ How Robust are Fact Checking Systems on Colloquial Claims?. In NAACL-HLT, 2021.

How Robust are Fact Checking Systems on Colloquial Claims? Official PyTorch implementation of our NAACL paper: Byeongchang Kim*, Hyunwoo Kim*, Seokhee

19 Mar 15, 2022

Official repository with code and data accompanying the NAACL 2021 paper "Hurdles to Progress in Long-form Question Answering" (https://arxiv.org/abs/2103.06332).

Hurdles to Progress in Long-form Question Answering This repository contains the official scripts and datasets accompanying our NAACL 2021 paper, "Hur

41 Nov 8, 2022