Semi-supervised Learning for Sentiment Analysis

Overview

Neural-Semi-supervised-Learning-for-Text-Classification-Under-Large-Scale-Pretraining

Code, models and Datasets for《Neural Semi-supervised Learning for Text Classification Under Large-Scale Pretraining》.

Download Models and Dataset

Datasets and Models are found in the follwing list.

  • Download 3.4M IMDB movie reviews. Save the data at [REVIEWS_PATH]. You can download the dataset HERE.
  • Download the vanilla RoBERTa-large model released by HuggingFace. Save the model at [VANILLA_ROBERTA_LARGE_PATH]. You can download the model HERE.
  • Download in-domain pretrained models in the paper and save the model at [PRETRAIN_MODELS]. We provide three following models. You can download HERE.
    • init-roberta-base: RoBERTa-base model(U) trained over 3.4M movie reviews from scratch.
    • semi-roberta-base: RoBERTa-base model(Large U + U) trained over 3.4M movie reviews from the open-domain pretrained model RoBERTa-base model.
    • semi-roberta-large: RoBERTa-large model(Large U + U) trained over 3.4M movie reviews from the open-domain pretrained model RoBERTa-large model.
  • Download the 1M (D` + D) training dataset for the student model, save the data at [STUDENT_DATA_PATH]. You can download it HERE.
    • student_data_base: student training data generated by roberta-base teacher model
    • student_data_large: student training data generated by roberta-large teacher model
  • Download the IMDB dataset from Andrew Maas' paper. Save the data at [IMDB_DATA_PATH]. For IMDB, The training data and test data are saved in two separate files, each line in the file corresponds to one IMDB sample. You can download HERE.
  • Download shannon_preprocssor.whl to install a binarize tool. Save the .whl file at [SHANNON_PREPROCESS_WHL_PATH]. You can download HERE
  • Download the teacher model and student model that we trained. Save them at [CHECKPOINTS]. You can download HERE
    • roberta-base: teacher and student model checkpoint for roberta-base
    • roberta-large: teacher and student model checkpoint for roberta-large

Installation

pip install -r requirements.txt
pip install [SHANNON_PREPROCESS_WHL_PATH]

Quick Tour

train the roberta-large teacher model

Use the roberta model we pretrained over 3.4M reviews data to train teacher model.
Our teacher model had an accuracy rate of 96.2% on the test set.

cd sstc/tasks/semi-roberta
python trainer.py \
--mode train_teacher \
roberta_path [PRETRAIN_MODELS]\semi-roberta-large \
--imdb_data_path [IMDB_DATA_PATH]/bin \
--gpus=0,1,2,3 \
--save_path [ROOT_SAVE_PATH] \
--precision 16 \
--batch_size 10 \
--min_epochs 10 \
--patience 3 \
--lr 3e-5  

train the roberta-large student model

Use the roberta model we pretrained over 3.4M reviews data to train student model.
Our student model had an accuracy rate of 96.8% on the test set.

cd sstc/tasks/semi-roberta
python trainer.py \
--mode train_student \
--roberta_path [PRETRAIN_MODELS]\semi-roberta-large \
--imdb_data_path [IMDB_DATA_PATH]/bin \
--student_data_path [STUDENT_DATA_PATH]/student_data_large/bin \
--save_path [ROOT_SAVE_PATH] \
--batch_size=10 \
--precision 16 \
--lr=2e-5 \
--warmup_steps 40000 \
--gpus=0,1,2,3,4,5,6,7 \
--accumulate_grad_batches=50

evaluate the student model on the test set

Load student model checkpoint to evaluate over test set to reproduce our result.

cd sstc/tasks/semi-roberta
python evaluate.py \
--checkpoint_path [CHECKPOINTS]/roberta-large/train_student_checkpoint/***.ckpt \
--roberta_path [PRETRAIN_MODELS]\semi-roberta-large \
--imdb_data_path [IMDB_DATA_PATH]/bin \
--batch_size=10 \
--gpus=0,

Reproduce paper results step by step

1.Train in-domain LM based on RoBERTa

1.1 binarize 3.4M reviews data

You should modify the shell according to your paths. The result binarize data will be saved in [REVIEWS_PATH]/bin

cd sstc/tasks/roberta_lm
bash binarize.sh

1.2 train RoBERTa-large (or small, as you wish) over 3.4M reviews data

cd sstc/tasks/roberta_lm
python trainer.py \
--roberta_path [VANILLA_ROBERTA_LARGE_PATH] \
--data_dir [REVIEWS_PATH]/bin \
--gpus=0,1,2,3 \
--save_path [PRETRAIN_ROBERTA_CK_PATH] \
--val_check_interval 0.1 \
--precision 16 \
--batch_size 10 \
--distributed_backend=ddp \
--accumulate_grad_batches=50 \
--adam_epsilon 1e-6 \
--weight_decay 0.01 \
--warmup_steps 10000 \
--workers 8 \
--lr 2e-5

Training checkpoints will be saved in [PRETRAIN_ROBERTA_CK_PATH], find the best checkpoint and convert it to HuggingFace bin format, The relevant code can be found in sstc/tasks/roberta_lm/trainer.py. Save the pretrain bin model at [PRETRAIN_MODELS]\semi-roberta-large, or you can just download the model we trained.

2.train the teacher model

2.1 binarize IMDB dataset.

cd sstc/tasks/semi_roberta/scripts
bash binarize_imdb.sh

You can run the above code to binarize IMDB data, or you can just use the file we binarized in [IMDB_DATA_PATH]\bin

2.2 train the teacher model

cd sstc/tasks/semi_roberta
python trainer.py \
--mode train_teacher \
--roberta_path [PRETRAIN_MODELS]\semi-roberta-large \
--imdb_data_path [IMDB_DATA_PATH]/bin \
--gpus=0,1,2,3 \
--save_path [ROOT_SAVE_PATH] \
--precision 16 \
--batch_size 10 \
--min_epochs 10 \
--patience 3 \
--lr 3e-5  

After training, teacher model checkpoint will be save in [ROOT_SAVE_PATH]/train_teacher_checkpoint. The teacher model we trained had an accuracy rate of 96.2% on the test set. The download link of teacher model checkpoint can be found in quick tour part.

3.label the unlabeled in-domain data U

3.1 label 3.4M data

Use the teacher model that you trained in previous step to label 3.4M reviews data, notice that [ROOT_SAVE_PATH] should be the same as previous setting. The labeled data will be save in [ROOT_SAVE_PATH]\predictions.

cd sstc/tasks/roberta_lm
python trainer.py \
--mode train_teacher \
--roberta_path [PRETRAIN_ROBERTA_PATH] \
--reviews_data_path [REVIEWS_PATH]/bin \
--best_teacher_checkpoint_path [CHECKPOINTS]/roberta-large/train_teacher_checkpoint/***.ckpt \
--gpus=0,1,2,3 \
--save_path [ROOT_SAVE_PATH] 

3.2 select the top-K data points

Firstly, we random sample 3M data from 3.4M reviews data as U', then we select 1M data from U' with the highest score as D', finally, we concat the IMDB train data(D) and D' as train data for student model. The student train data will be saved in [ROOT_SAVE_PATH]\student_data\train.txt, or you can use the data we provide in [STUDENT_DATA_PATH]/student_data_large

cd sstc/tasks/roberta_lm
python data_selector.py \
--imdb_data_path [IMDB_DATA_PATH] \
--save_path [ROOT_SAVE_PATH] 

4.train the student model

4.1 binarize the dataset

You can use the same script in 3.1 to binarize student train data in [ROOT_SAVE_PATH]\student_data\train.txt

4.1 train the student model

use can use the training data we provide in [STUDENT_DATA_PATH]/student_data_large/bin or use your own training data in [ROOT_SAVE_PATH]\student_data\bin, make sure you set the right student_data_path.

cd sstc/tasks/semi-roberta
python trainer.py \
--mode train_student \
--roberta_path [PRETRAIN_MODELS]\semi-roberta-large \
--imdb_data_path [IMDB_DATA_PATH]/bin \
--student_data_path [STUDENT_DATA_PATH]/student_data_large/bin \
--save_path [ROOT_SAVE_PATH] \
--batch_size=10 \
--precision 16 \
--lr=2e-5 \
--warmup_steps 40000 \
--gpus=0,1,2,3,4,5,6,7 \
--accumulate_grad_batches=50

After training, student model checkpoint will be save in [ROOT_SAVE_PATH]/train_student_checkpoint. The student model we trained had an accuracy rate of 96.6% on the test set. The download link of student model checkpoint can be found in Quick tour part.

You might also like...
Semi-Supervised Learning, Object Detection, ICCV2021
Semi-Supervised Learning, Object Detection, ICCV2021

End-to-End Semi-Supervised Object Detection with Soft Teacher By Mengde Xu*, Zheng Zhang*, Han Hu, Jianfeng Wang, Lijuan Wang, Fangyun Wei, Xiang Bai,

Semi Supervised Learning for Medical Image Segmentation, a collection of literature reviews and code implementations.

Semi-supervised-learning-for-medical-image-segmentation. Recently, semi-supervised image segmentation has become a hot topic in medical image computin

The Rich Get Richer: Disparate Impact of Semi-Supervised Learning

The Rich Get Richer: Disparate Impact of Semi-Supervised Learning Preprocess file of the dataset used in implicit sub-populations: (Demographic groups

Semi-supervised Representation Learning for Remote Sensing Image Classification Based on Generative Adversarial Networks

SSRL-for-image-classification Semi-supervised Representation Learning for Remote Sensing Image Classification Based on Generative Adversarial Networks

This is the repository for the AAAI 21 paper [Contrastive and Generative Graph Convolutional Networks for Graph-based Semi-Supervised Learning].

CG3 This is the repository for the AAAI 21 paper [Contrastive and Generative Graph Convolutional Networks for Graph-based Semi-Supervised Learning]. R

TeST: Temporal-Stable Thresholding for Semi-supervised Learning
TeST: Temporal-Stable Thresholding for Semi-supervised Learning

TeST: Temporal-Stable Thresholding for Semi-supervised Learning TeST Illustration Semi-supervised learning (SSL) offers an effective method for large-

Implementation of
Implementation of "Semi-supervised Domain Adaptive Structure Learning"

Semi-supervised Domain Adaptive Structure Learning - ASDA This repo contains the source code and dataset for our ASDA paper. Illustration of the propo

Tensorflow implementation of Semi-supervised Sequence Learning (https://arxiv.org/abs/1511.01432)
Tensorflow implementation of Semi-supervised Sequence Learning (https://arxiv.org/abs/1511.01432)

Transfer Learning for Text Classification with Tensorflow Tensorflow implementation of Semi-supervised Sequence Learning(https://arxiv.org/abs/1511.01

This is the repository for the NeurIPS-21 paper [Contrastive Graph Poisson Networks: Semi-Supervised Learning with Extremely Limited Labels].

CGPN This is the repository for the NeurIPS-21 paper [Contrastive Graph Poisson Networks: Semi-Supervised Learning with Extremely Limited Labels]. Req

Comments
  • roberta-base not found

    roberta-base not found

    Hi: I got an error when initializing pretrained model by RobertaTokenizer. self.tokenizer = RobertaTokenizer.from_pretrained("roberta-base") self.model = RobertaForMaskedLM.from_pretrained('roberta-base')

    OSError: Model name 'roberta-base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'roberta-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.

    opened by Gavingx 0
Owner
null
Enhancing Aspect-Based Sentiment Analysis with Supervised Contrastive Learning.

Enhancing Aspect-Based Sentiment Analysis with Supervised Contrastive Learning. Enhancing Aspect-Based Sentiment Analysis with Supervised Contrastive

HLT@HIT(SZ) 7 Dec 16, 2021
UniMoCo: Unsupervised, Semi-Supervised and Full-Supervised Visual Representation Learning

UniMoCo: Unsupervised, Semi-Supervised and Full-Supervised Visual Representation Learning This is the official PyTorch implementation for UniMoCo pape

dddzg 49 Jan 2, 2023
Hybrid CenterNet - Hybrid-supervised object detection / Weakly semi-supervised object detection

Hybrid-Supervised Object Detection System Object detection system trained by hybrid-supervision/weakly semi-supervision (HSOD/WSSOD): This project is

null 5 Dec 10, 2022
CoSMA: Convolutional Semi-Regular Mesh Autoencoder. From Paper "Mesh Convolutional Autoencoder for Semi-Regular Meshes of Different Sizes"

Mesh Convolutional Autoencoder for Semi-Regular Meshes of Different Sizes Implementation of CoSMA: Convolutional Semi-Regular Mesh Autoencoder arXiv p

Fraunhofer SCAI 10 Oct 11, 2022
code for paper "Not All Unlabeled Data are Equal: Learning to Weight Data in Semi-supervised Learning" by Zhongzheng Ren*, Raymond A. Yeh*, Alexander G. Schwing.

Not All Unlabeled Data are Equal: Learning to Weight Data in Semi-supervised Learning Overview This code is for paper: Not All Unlabeled Data are Equa

Jason Ren 22 Nov 23, 2022
This repository contains various models targetting multimodal representation learning, multimodal fusion for downstream tasks such as multimodal sentiment analysis.

Multimodal Deep Learning ?? ?? ?? Announcing the multimodal deep learning repository that contains implementation of various deep learning-based model

Deep Cognition and Language Research (DeCLaRe) Lab 398 Dec 30, 2022
PyTorch code for the paper: FeatMatch: Feature-Based Augmentation for Semi-Supervised Learning

FeatMatch: Feature-Based Augmentation for Semi-Supervised Learning This is the PyTorch implementation of our paper: FeatMatch: Feature-Based Augmentat

null 43 Nov 19, 2022
Code for CoMatch: Semi-supervised Learning with Contrastive Graph Regularization

CoMatch: Semi-supervised Learning with Contrastive Graph Regularization (Salesforce Research) This is a PyTorch implementation of the CoMatch paper [B

Salesforce 107 Dec 14, 2022
The implementation of the algorithm in the paper "Safe Deep Semi-Supervised Learning for Unseen-Class Unlabeled Data" published in ICML 2020.

DS3L This is the code for paper "Safe Deep Semi-Supervised Learning for Unseen-Class Unlabeled Data" published in ICML 2020. Setups The code is implem

Guolz 36 Oct 19, 2022
noisy labels; missing labels; semi-supervised learning; entropy; uncertainty; robustness and generalisation.

ProSelfLC: CVPR 2021 ProSelfLC: Progressive Self Label Correction for Training Robust Deep Neural Networks For any specific discussion or potential fu

amos_xwang 57 Dec 4, 2022