ConvBERT: Improving BERT with Span-based Dynamic Convolution

Introduction

In this repo, we introduce ConvBERT, a new architecture for pre-training language models. The code has been tested on a V100 GPU. For a detailed description and experimental results, please refer to our NeurIPS 2020 paper, ConvBERT: Improving BERT with Span-based Dynamic Convolution.

Requirements

  • Python 3
  • tensorflow 1.15
  • numpy
  • scikit-learn

Experiments

Pre-training

These instructions pre-train a medium-small sized ConvBERT model (17M parameters) using the OpenWebText corpus.

To build the tf-records and pre-train the model, download the OpenWebText corpus (12G) and set up your data directory in build_data.sh and pretrain.sh (an illustrative sketch of this setup is given below). Then run

bash build_data.sh
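
For reference, the data directory setup usually amounts to pointing the scripts at the downloaded corpus and an output location. The variable names below are purely illustrative; check build_data.sh and pretrain.sh for the actual ones.

# Illustrative only -- the real variable names live in build_data.sh / pretrain.sh
DATA_DIR=/path/to/convbert_working_dir   # tf-records and checkpoints are written here
CORPUS_DIR=/path/to/openwebtext          # the downloaded OpenWebText corpus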

The processed data require roughly 30G of disk space. Then, to pre-train the model, run

bash pretrain.sh

See configure_pretraining.py for the details of the supported hyperparameters.
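
If this codebase keeps the ELECTRA-style entry point it is built on, individual hyperparameters can be overridden with a JSON --hparams argument. The command below is an assumption carried over from ELECTRA (run_pretraining.py with --data-dir, --model-name and --hparams); verify it against pretrain.sh before relying on it.

# Assumed ELECTRA-style override -- verify against pretrain.sh
python3 run_pretraining.py \
  --data-dir $DATA_DIR \
  --model-name convbert_medium-small \
  --hparams '{"num_train_steps": 1000000, "train_batch_size": 128}'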

Fine-tuning

These instructions fine-tune a pre-trained medium-small sized ConvBERT model (17M parameters) on GLUE. You can refer to the Google Colab notebook for a quick example. See our paper for more details on model performance. The pre-trained model can be found here. (You can also download it from Baidu cloud with extraction code m9d2.)

To evaluate the performance on GLUE, you can download the GLUE data by running

python3 download_glue_data.py

Set up the data by running

mv CoLA cola && mv MNLI mnli && mv MRPC mrpc && mv QNLI qnli && mv QQP qqp && mv RTE rte && mv SST-2 sst && mv STS-B sts && mv diagnostic/diagnostic.tsv mnli && mkdir -p $DATA_DIR/finetuning_data && mv * $DATA_DIR/finetuning_data

After preparing the GLUE data, set up your data directory in finetune.sh and run

bash finetune.sh

You can run different GLUE tasks by changing the configs in finetune.sh; a sketch of what that typically looks like is given below.
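
If finetune.sh follows the ELECTRA interface this repo is based on, the task is selected through a task_names hyperparameter. The call below is an assumption rather than the documented interface; check finetune.sh for the actual flags and task names.

# Assumed ELECTRA-style fine-tuning call -- verify against finetune.sh
python3 run_finetuning.py \
  --data-dir $DATA_DIR \
  --model-name convbert_medium-small \
  --hparams '{"task_names": ["rte"]}'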

If you find this repo helpful, please consider citing

@article{Jiang2020ConvBERT,
  title={ConvBERT: Improving BERT with Span-based Dynamic Convolution},
  author={Zi-Hang Jiang and Weihao Yu and Daquan Zhou and Y. Chen and Jiashi Feng and S. Yan},
  journal={ArXiv},
  year={2020},
  volume={abs/2008.02496}
}

References

Here are some great resources we benefit from:

Codebase: Our codebase is based on ELECTRA.

Dynamic convolution: Implementation from Pay Less Attention with Lightweight and Dynamic Convolutions

Dataset: OpenWebText from Language Models are Unsupervised Multitask Learners

Comments
  • Add toggle to turn off `strip_accents`.

    In some languages like German, the accents are important and change the semantics. Examples:

    1. mochte vs. möchte
    2. musste vs. müsste
    3. etc.

    But when doing lower_case, they are always stripped automatically.

    This PR adds a toggle to make it possible to do lower_case but keep the accents. This conforms to transformers.tokenization_bert.BertTokenizerFast, which also has a boolean parameter called strip_accents.

    With default parameters, this PR introduces no breaking change because we initialize strip_accents with True, which is the default behavior.

    This PR is basically a copy of the corresponding PR from ELECTRA, which got merged a few days ago: https://github.com/google-research/electra/pull/88

    More about accents and semantics can be seen in this video about our German ELECTRA model:
    https://youtu.be/cxgrTd2AQis?t=969

    opened by PhilipMay 4
  • What's the essential difference between ConvBert and LSRA?

    LSRA: Lite Transformer with Long-Short Range Attention.

    LSRA also integrates convolution operations into transformer blocks. I'm just wondering what makes ConvBERT different from LSRA.

    opened by yuanenming 3
  • Training on multiple GPUs for BASE or LARGE Models

    Hi,

    here https://github.com/yitu-opensource/ConvBert/issues/16#issuecomment-814256177 you say

    Our code is only tested on a single V100 GPU.

    But in your paper you write about BASE size ConvBERT models.

    But BASE size models cannot be trained (created) on a single GPU. From my experience you need 8 GPUs.

    Could you please explain this? I would like to create a German BASE or maybe even LARGE language model.

    At https://github.com/yitu-opensource/ConvBert/issues/16#issuecomment-814256177 you say that Hugging Face might be an option for multi-GPU training. From my experience they are good at downstream task training but not good at the initial language model creation.

    I would be super happy about some help to create my new ConvBERT BASE or larger model in different languages.

    Many thanks Philip

    opened by PhilipMay 2
  • Train on GPU instead of TPU - different distribution strategies

    Hi, many thanks for this nice new model type and your research. We would like to train a ConvBERT but on GPU and not TPU. Do you have any experience or tips on how to do this? We have concerns regarding the different distribution strategies between GPUs and TPUs.

    Thanks Philip

    opened by PhilipMay 2
  • Question about the inference speed of mixed-attention

    I noticed that the paper reports FLOPs of 26.5G and 19.3G. How were these numbers obtained? When I measure the 12-layer medium-small model myself, the whole encoder comes to about 1 GFLOPs. Also, under what conditions was the reported inference speed measured? On my side the inference speed is slower than the original self-attention; my guess is that although there are fewer floating-point operations, more time is spent on data movement (reshape, transpose).

    opened by yygle 2
  • Is the pre-trained model you provide Chinese or English? What was it trained on? Could you briefly describe the details?

    I downloaded your pre-trained models convbert_base, convbert_medium and convbert_small from the README. There is no vocabulary file in these three model folders. Based on the vocab.txt in your project (30522 entries), I understand these are English pre-trained models. Is that correct? (Going by ELECTRA, the English model vocabulary has 30522 entries while the Chinese pre-trained model vocabulary has 21128.) Thanks for your answer.

    opened by 652994331 1
  • Question about span light conv

    Hi, I'd like to ask: in the span light conv, since tf.layers.separable_conv1d already produces the matrix key_conv_attn_layer that carries the span information, why is the element-wise multiplication with query_layer still needed, i.e. conv_attn_layer = tf.multiply(key_conv_attn_layer, query_layer)? This multiplication does not seem very necessary here.

    opened by psy2013GitHub 1
  • CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks to see if all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions, you may contact us through this project's lead researcher, Kasimir Schulz.

    opened by TrellixVulnTeam 0