A PyTorch implementation of unsupervised SimCSE

Last update: Dec 23, 2022

Overview

SimCSE: Simple Contrastive Learning of Sentence Embeddings

1. Usage

Unsupervised training

python train_unsup.py ./data/news_title.txt ./path/to/huggingface_pretrained_model

Full list of options

python train_unsup.py -h

Similar-text retrieval test

python test_unsup.py

query title:
基金亏损路未尽 后市看法仍偏谨慎

sim title:
基金亏损路未尽 后市看法仍偏谨慎
海通证券：私募对后市看法偏谨慎
连塑基本面不容乐观 后市仍有下行空间
基金谨慎看待后市行情
稳健投资者继续保持观望 市场走势还未明朗
下半年基金投资谨慎乐观
华安基金许之彦：下半年谨慎乐观
楼市主导 期指后市不容乐观
基金公司谨慎看多明年市
前期乐观预期被否 基金重归谨慎
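
How test_unsup.py ranks titles is not shown here. The following is a minimal sketch, assuming retrieval is a cosine-similarity nearest-neighbour search over L2-normalized sentence vectors (the encode_file() snippet quoted in the comments normalizes vectors this way). The names `normalize` and `top_k` and the toy 2-d vectors are illustrative and not taken from the repository.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query_vec, corpus_vecs, k=3):
    """Return the k (score, index) pairs with the highest cosine similarity."""
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(corpus_vecs):
        # The dot product of unit vectors equals their cosine similarity.
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append((score, idx))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy stand-ins for sentence embeddings (hypothetical data).
corpus = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
hits = top_k([1.0, 0.1], corpus, k=2)
```

When the query itself is part of the corpus, as in the output above, it comes back as the first hit because its similarity to itself is 1.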

Training and evaluation on the STS-B dataset

The Chinese STS-B dataset is used; see here for details

# Training
python train_unsup.py ./data/STS-B/cnsd-sts-train_unsup.txt

# Evaluation
python eval_unsup.py

Model	STS-B dev	STS-B test
hfl/chinese-bert-wwm-ext	0.3326	0.3209
simcse	0.7499	0.6909

These results are close to Su Jianlin's experiments, which report 0.3465 for BERT-P1 and 0.6904 for SimCSE.
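
The README does not say which metric these scores are. Assuming they are Spearman rank correlations between predicted cosine similarities and gold STS-B labels, which is the usual STS-B metric, the evaluation could be sketched as below. All function names and the toy data are hypothetical; eval_unsup.py may compute this differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy sentence-pair embeddings and gold similarity labels (hypothetical data).
pairs = [([1.0, 0.0], [1.0, 0.1]), ([1.0, 0.0], [0.5, 0.5]), ([1.0, 0.0], [0.0, 1.0])]
gold = [5.0, 3.0, 0.0]
predicted = [cosine(a, b) for a, b in pairs]
score = spearman(predicted, gold)
```

Because only ranks matter, the score rewards a model for ordering sentence pairs correctly, regardless of the absolute scale of its similarity values.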

2. References

Comments

How is the random output implemented?

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertTokenizer,BertModel,BertConfig

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
batch_x = tokenizer(["中国人民公安大学年硕士研究生目录及书目","中国人民公安大学年硕士研究生目录及书目"], return_tensors="pt", padding=True, truncation=True, max_length=128)

class SimCSE(nn.Module):
    def __init__(self, pretrained="bert-base-chinese", pool_type="pooler", dropout_prob=0.3):
        super().__init__()
        conf = BertConfig.from_pretrained(pretrained)
        conf.attention_probs_dropout_prob = dropout_prob
        conf.hidden_dropout_prob = dropout_prob
        self.encoder = BertModel.from_pretrained(pretrained, config=conf)
        assert pool_type in ["cls", "pooler"], "invalid pool_type: %s" % pool_type
        self.pool_type = pool_type

    def forward(self, input_ids, attention_mask, token_type_ids):
        output = self.encoder(input_ids,
                              attention_mask=attention_mask,
                              token_type_ids=token_type_ids)
        if self.pool_type == "cls":
            output = output[0][:, 0]
        elif self.pool_type == "pooler":
            output = output[1]
        return output

model = SimCSE()
pred = model(input_ids = batch_x["input_ids"],attention_mask=batch_x["attention_mask"],token_type_ids=batch_x["token_type_ids"])

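Hello, following this approach I fed the same sentence in twice, but the resulting pred[0] and pred[1] are exactly identical. Could you explain how the random output is implemented?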

opened by duruiting 4
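
A likely explanation, offered as an assumption rather than the author's answer: `from_pretrained()` in transformers returns the model in evaluation mode, so dropout is disabled inside `self.encoder` even though the wrapper module itself reports `training == True`. Calling `model.train()` before the forward pass re-enables dropout in every submodule. The sketch below reproduces the effect with plain PyTorch; `TinyEncoder` and `Wrapper` are hypothetical stand-ins for `BertModel` and the `SimCSE` class, so nothing is downloaded.

```python
import torch
import torch.nn as nn


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for BertModel: one linear layer plus dropout."""

    def __init__(self, dim=64, dropout_prob=0.3):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        self.dropout = nn.Dropout(dropout_prob)

    def forward(self, x):
        return self.dropout(self.linear(x))


class Wrapper(nn.Module):
    """Hypothetical stand-in for the SimCSE wrapper class."""

    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, x):
        return self.encoder(x)


torch.manual_seed(0)
encoder = TinyEncoder().eval()  # mimics from_pretrained() returning an eval-mode model
model = Wrapper(encoder)        # model.training is True, encoder.training stays False

x = torch.randn(1, 64).repeat(2, 1)  # the same input twice, like the duplicated sentence

out_eval = model(x)
same_in_eval = torch.equal(out_eval[0], out_eval[1])    # dropout off: rows identical

model.train()                   # switches every submodule to training mode
out_train = model(x)
same_in_train = torch.equal(out_train[0], out_train[1])  # dropout on: rows differ
```

Applied to the snippet in the question, this would mean adding `model.train()` after `model = SimCSE()`.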

encode_file() in SimCSERetrieval.py has a flaw
Hello, I learned a lot from your code, but encode_file() in SimCSERetrieval.py has the following problem:

if len(texts) >= self.batch_size:
    vecs = self.encode_batch(texts)
    vecs = vecs / vecs.norm(dim=1, keepdim=True)
    all_texts.extend(texts)
    all_ids.extend(idxs)
    all_vecs.append(vecs.cpu())
    texts = []
    idxs = []

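If the file fname contains N samples and N % self.batch_size = d, the last d samples are dropped. In my test with N = 559 and batch_size = 64, I got all_vecs.shape[0] = 512, so the remaining 47 samples were never encoded.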
opened by shencangblue 0
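
One common way to address this kind of bug is to flush the final partial batch after the loop. The sketch below illustrates the pattern in plain Python; `encode_in_batches` and `fake_encode_batch` are hypothetical names, and the real encode_file() works on tensors and also tracks ids, so treat this as an outline rather than a patch.

```python
def encode_in_batches(lines, batch_size, encode_batch):
    """Encode lines in batches, including the last batch if it is not full."""
    all_texts, all_vecs = [], []
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) >= batch_size:
            all_vecs.extend(encode_batch(batch))
            all_texts.extend(batch)
            batch = []
    if batch:
        # Flush the leftover samples (N % batch_size of them) that the
        # in-loop check never reaches.
        all_vecs.extend(encode_batch(batch))
        all_texts.extend(batch)
    return all_texts, all_vecs


def fake_encode_batch(texts):
    """Hypothetical encoder: returns one 1-d vector per input text."""
    return [[float(len(t))] for t in texts]


# Same sizes as in the report: N = 559, batch_size = 64.
texts, vecs = encode_in_batches(
    [f"title {i}" for i in range(559)], 64, fake_encode_batch
)
```

With the final flush in place, all 559 samples are encoded instead of only the first 512.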
