ConvBERT: Improving BERT with Span-based Dynamic Convolution

Introduction

In this repo, we introduce ConvBERT, a new architecture for pre-training language models. The code has been tested on a V100 GPU. For a detailed description and experimental results, please refer to our NeurIPS 2020 paper, ConvBERT: Improving BERT with Span-based Dynamic Convolution.

Requirements

  • Python 3
  • tensorflow 1.15
  • numpy
  • scikit-learn

Experiments

Pre-training

These instructions pre-train a medium-small sized ConvBERT model (17M parameters) using the OpenWebText corpus.

To build the tf-records and pre-train the model, download the OpenWebText corpus (12G) and set up your data directory in build_data.sh and pretrain.sh (an illustrative sketch of this setup is given below). Then run

bash build_data.sh
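
For reference, the data directory setup usually amounts to pointing the scripts at the downloaded corpus and an output location. The variable names below are purely illustrative; check build_data.sh and pretrain.sh for the actual ones.

# Illustrative only -- the real variable names live in build_data.sh / pretrain.sh
DATA_DIR=/path/to/convbert_working_dir   # tf-records and checkpoints are written here
CORPUS_DIR=/path/to/openwebtext          # the downloaded OpenWebText corpus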

The processed data require roughly 30G of disk space. Then, to pre-train the model, run

bash pretrain.sh

See configure_pretraining.py for the details of the supported hyperparameters.
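
If this codebase keeps the ELECTRA-style entry point it is built on, individual hyperparameters can be overridden with a JSON --hparams argument. The command below is an assumption carried over from ELECTRA (run_pretraining.py with --data-dir, --model-name and --hparams); verify it against pretrain.sh before relying on it.

# Assumed ELECTRA-style override -- verify against pretrain.sh
python3 run_pretraining.py \
  --data-dir $DATA_DIR \
  --model-name convbert_medium-small \
  --hparams '{"num_train_steps": 1000000, "train_batch_size": 128}'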

Fine-tuning

These instructions fine-tune a pre-trained medium-small sized ConvBERT model (17M parameters) on GLUE. You can refer to the Google Colab notebook for a quick example. See our paper for more details on model performance. The pre-trained model can be found here. (You can also download it from Baidu cloud with extraction code m9d2.)

To evaluate the performance on GLUE, you can download the GLUE data by running

python3 download_glue_data.py

Set up the data by running

mv CoLA cola && mv MNLI mnli && mv MRPC mrpc && mv QNLI qnli && mv QQP qqp && mv RTE rte && mv SST-2 sst && mv STS-B sts && mv diagnostic/diagnostic.tsv mnli && mkdir -p $DATA_DIR/finetuning_data && mv * $DATA_DIR/finetuning_data

After preparing the GLUE data, set up your data directory in finetune.sh and run

bash finetune.sh

You can run different GLUE tasks by changing the configs in finetune.sh; a sketch of what that typically looks like is given below.
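
If finetune.sh follows the ELECTRA interface this repo is based on, the task is selected through a task_names hyperparameter. The call below is an assumption rather than the documented interface; check finetune.sh for the actual flags and task names.

# Assumed ELECTRA-style fine-tuning call -- verify against finetune.sh
python3 run_finetuning.py \
  --data-dir $DATA_DIR \
  --model-name convbert_medium-small \
  --hparams '{"task_names": ["rte"]}'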

If you find this repo helpful, please consider citing

@article{Jiang2020ConvBERT,
  title={ConvBERT: Improving BERT with Span-based Dynamic Convolution},
  author={Zi-Hang Jiang and Weihao Yu and Daquan Zhou and Y. Chen and Jiashi Feng and S. Yan},
  journal={ArXiv},
  year={2020},
  volume={abs/2008.02496}
}

References

Here are some great resources we benefit from:

Codebase: Our codebase is based on ELECTRA.

Dynamic convolution: Implementation from Pay Less Attention with Lightweight and Dynamic Convolutions

Dataset: OpenWebText from Language Models are Unsupervised Multitask Learners

Comments
  • Add toggle to turn off `strip_accents`.

    In some languages like German, the accents are important and change the semantics. Examples:

    1. mochte vs. möchte
    2. musste vs. müsste
    3. etc.

    But when doing lower_case, they are always stripped automatically.

    This PR adds a toggle to make it possible to do lower_case but keep the accents. This conforms to transformers.tokenization_bert.BertTokenizerFast, which also has a boolean parameter called strip_accents.

    With default parameters, this PR introduces no breaking change because we initialize strip_accents with True, which is the default behavior.

    This PR is basically a copy of the corresponding PR from ELECTRA, which got merged a few days ago: https://github.com/google-research/electra/pull/88

    More about accents and semantics can be seen in this video about our German ELECTRA model:
    https://youtu.be/cxgrTd2AQis?t=969

    opened by PhilipMay 4
  • What's the essential difference between ConvBert and LSRA?

    LSRA: Lite Transformer with Long-Short Range Attention.

    LSRA also integrates convolution operations into transformer blocks. I'm just wondering what makes ConvBERT different from LSRA.

    opened by yuanenming 3
  • Training on multiple GPUs for BASE or LARGE Models

    Hi,

    here https://github.com/yitu-opensource/ConvBert/issues/16#issuecomment-814256177 you say

    Our code is only tested on a single V100 GPU.

    But in your paper you write about BASE size ConvBERT models.

    But BASE size models cannot be trained (created) on a single GPU. From my experience you need 8 GPUs.

    Could you please explain this? I would like to create a German BASE or maybe even LARGE language model.

    At https://github.com/yitu-opensource/ConvBert/issues/16#issuecomment-814256177 you say that Hugging Face might be an option for multi-GPU training. From my experience they are good at downstream task training but not good at the initial language model creation.

    I would be super happy about some help to create my new ConvBERT BASE or larger model in different languages.

    Many thanks Philip

    opened by PhilipMay 2
  • Train on GPU instead of TPU - different distribution strategies

    Hi, many thanks for this nice new model type and your research. We would like to train a ConvBERT but on GPU and not TPU. Do you have any experience or tips on how to do this? We have concerns regarding the different distribution strategies between GPUs and TPUs.

    Thanks Philip

    opened by PhilipMay 2
  • Question about the inference speed of mixed-attention

    I noticed that the paper reports FLOPs of 26.5G and 19.3G. How were these numbers obtained? When I measure the 12-layer medium-small model myself, the whole encoder comes to about 1 GFLOPs. Also, under what conditions was the reported inference speed measured? On my side the inference speed is slower than the original self-attention; my guess is that although there are fewer floating-point operations, more time is spent on data movement (reshape, transpose).

    opened by yygle 2
  • Is the pre-trained model you provide Chinese or English? What was it trained on? Could you briefly describe the details?

    I downloaded your pre-trained models convbert_base, convbert_medium and convbert_small from the README. There is no vocabulary file in these three model folders. Based on the vocab.txt in your project (30522 entries), I understand these are English pre-trained models. Is that correct? (Going by ELECTRA, the English model vocabulary has 30522 entries while the Chinese pre-trained model vocabulary has 21128.) Thanks for your answer.

    opened by 652994331 1
  • Question about span light conv

    Hi, I'd like to ask: in the span light conv, since tf.layers.separable_conv1d already produces the matrix key_conv_attn_layer that carries the span information, why is the element-wise multiplication with query_layer still needed, i.e. conv_attn_layer = tf.multiply(key_conv_attn_layer, query_layer)? This multiplication does not seem very necessary here.

    opened by psy2013GitHub 1
  • CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks to see if all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions, you may contact us through this project's lead researcher, Kasimir Schulz.

    opened by TrellixVulnTeam 0