PyTorch implementations of BERT-based Spelling Error Correction Models.

Heng Cai

Last update: Dec 30, 2022

Overview

BertBasedCorrectionModels

BERT-based text correction models, implemented in PyTorch.

Data preparation

Download the SIGHAN datasets from http://nlp.ee.ncu.edu.tw/resource/csc.html
Extract the archive and copy every ".sgml" file in it to the datasets/csc/ directory
Copy "SIGHAN15_CSC_TestInput.txt" and "SIGHAN15_CSC_TestTruth.txt" to the datasets/csc/ directory
Download https://github.com/wdimmy/Automatic-Corpus-Generation/blob/master/corpus/train.sgml to the datasets/csc/ directory

Make sure the following files are present in datasets/csc:

train.sgml
B1_training.sgml
C1_training.sgml  
SIGHAN15_CSC_A2_Training.sgml  
SIGHAN15_CSC_B2_Training.sgml  
SIGHAN15_CSC_TestInput.txt
SIGHAN15_CSC_TestTruth.txt
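
The file list above can be checked before training with a few lines of Python. This is a minimal sketch and not part of the project: the helper name missing_files is hypothetical, and the default path assumes the script is run from the project root.

```python
from pathlib import Path

# File names copied from the list above.
REQUIRED_FILES = [
    "train.sgml",
    "B1_training.sgml",
    "C1_training.sgml",
    "SIGHAN15_CSC_A2_Training.sgml",
    "SIGHAN15_CSC_B2_Training.sgml",
    "SIGHAN15_CSC_TestInput.txt",
    "SIGHAN15_CSC_TestTruth.txt",
]


def missing_files(data_dir="datasets/csc"):
    """Return the required file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing from datasets/csc:", ", ".join(missing))
    else:
        print("All required files are present.")
```

An empty result means every file named in the list was found.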

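Environment setup

Use an existing Python environment, or create a new one with conda create -n <env_name> python=3.7 (recommended)
Clone this project and enter its root directory
Install the dependencies: pip install -r requirements.txt
If you hit an error saying the GLIBC version is too low, install an older version of OpenCC (for example 1.1.0) instead; upgrading GLIBC itself is error-prone and not recommended
In the current terminal, add the project root to the Python path: export PYTHONPATH=.

Training

Run the following command to train a model. The first run preprocesses the data automatically.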

python tools/train_csc.py --config_file csc/train_SoftMaskedBert.yml

Choose a different config file to train a different model. The following config files are currently supported:

train_bert4csc.yml
train_macbert4csc.yml
train_SoftMaskedBert.yml

Adjust the parameters in the config file as needed.

Results

SoftMaskedBert

component	sentence level acc	p	r	f
Detection	0.5045	0.8252	0.8416	0.8333
Correction	0.8055	0.9395	0.8748	0.9060

BERT-style models

char level

MODEL	p	r	f
BERT4CSC	0.9269	0.8651	0.8949
MACBERT4CSC	0.9380	0.8736	0.9047

sentence level

model	acc	p	r	f
BERT4CSC	0.7990	0.8482	0.7214	0.7797
MACBERT4CSC	0.8027	0.8525	0.7251	0.7836
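
As a sanity check on the tables above, the f column can be recomputed from p and r. The sketch below assumes f is the standard F1 score (the harmonic mean of precision and recall), which the source does not state explicitly. Because the inputs are already rounded to four decimals, the recomputed value can differ from the reported one in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (p, r, reported f) rows copied from the tables above.
REPORTED = {
    "SoftMaskedBert detection": (0.8252, 0.8416, 0.8333),
    "SoftMaskedBert correction": (0.9395, 0.8748, 0.9060),
    "BERT4CSC char level": (0.9269, 0.8651, 0.8949),
    "MACBERT4CSC char level": (0.9380, 0.8736, 0.9047),
    "BERT4CSC sentence level": (0.8482, 0.7214, 0.7797),
    "MACBERT4CSC sentence level": (0.8525, 0.7251, 0.7836),
}

for name, (p, r, f) in REPORTED.items():
    print(f"{name}: recomputed f={f1_score(p, r):.4f}, reported f={f:.4f}")
```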

Inference

Method 1: use the inference script:

python inference.py --ckpt_fn epoch=0-val_loss=0.03.ckpt --texts "我今天很高心"
# or pass the path to a text file with one sentence per line
python inference.py --ckpt_fn epoch=0-val_loss=0.03.ckpt --text_file /ml/data/text.txt

where /ml/data/text.txt contains:

我今天很高心
你这个辣鸡模型只能做错别字纠正
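
A line-by-line file like the one above can be turned into a list of sentences as follows. This is a minimal sketch, not the code in inference.py: the helper name read_texts is hypothetical, and skipping blank lines is an assumption.

```python
from pathlib import Path


def read_texts(text_file, encoding="utf-8"):
    """Read one sentence per line, skipping blank lines."""
    lines = Path(text_file).read_text(encoding=encoding).splitlines()
    return [line.strip() for line in lines if line.strip()]
```

The returned list has the same shape as the texts argument used when calling the model directly.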

Method 2: call the model directly

texts = ['今天我很高心', '测试', '继续测试']
model.predict(texts)

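Method 3: export the BERT weights and load them with transformers or pycorrector

Export the BERT weights with convert_to_pure_state_dict.py
For the remaining steps, see https://github.com/shibing624/pycorrector/blob/master/pycorrector/macbert/README.md

Citation

If you use this project in your research, please cite it as follows: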

@article{cai2020pre,
  title={BERT Based Correction Models},
  author={Cai, Heng and Chen, Dian},
  journal={GitHub. Note: https://github.com/gitabtion/BertBasedCorrectionModels},
  year={2020}
}

License

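The source code is licensed under the Apache License 2.0 and may be used commercially free of charge. Please include a link to this project and the license in your product documentation. This project is protected by copyright law; infringement will be pursued.

Changelog

20210618

Fixed an encoding error in data processing

20210518

Switched the BERT4CSC error detection task to FocalLoss
Updated the experimental results for the modified model
Lowered the probability of keeping the original text during data processing

20210517

Added an error detection task to the BERT4CSC model
Added inference from a line-by-line text file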

Comments

Error when training on CPU

The config file I used:

MODEL:
  BERT_CKPT: "bert-base-chinese"
  DEVICE: "cpu"
  NAME: "SoftMaskedBertModel"
  # [loss_coefficient]
  HYPER_PARAMS: [0.8]

DATASETS:
  TRAIN: "datasets/csc/train.json"
  VALID: "datasets/csc/dev.json"
  TEST: "datasets/csc/test.json"

SOLVER:
  BASE_LR: 0.0001
  WEIGHT_DECAY: 5e-8
  BATCH_SIZE: 32
  MAX_EPOCHS: 10
  ACCUMULATE_GRAD_BATCHES: 4

TEST:
  BATCH_SIZE: 16

TASK:
  NAME: "csc"

OUTPUT_DIR: "checkpoints/SoftMaskedBert"

Command: python tools/train_csc.py --config_file csc/train_SoftMaskedBert.yml

Error:

/Users//opt/anaconda3/envs/sc/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:49: UserWarning: Checkpoint directory /Users///Documents/personal/BertBasedCorrectionModels/checkpoints/SoftMaskedBert exists and is not empty.
  warnings.warn(*args, **kwargs)
GPU available: False, used: False
TPU available: None, using: 0 TPU cores
Traceback (most recent call last):
  File "tools/train_csc.py", line 52, in <module>
    main()
  File "tools/train_csc.py", line 48, in main
    train(cfg, model, loaders, ckpt_callback)
  File "/Users///Documents/personal/BertBasedCorrectionModels/tools/bases.py", line 78, in train
    trainer.fit(model, train_loader, valid_loader)
  File "/Users//opt/anaconda3/envs/sc/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 454, in fit
    self.accelerator_backend.setup(model)
  File "/Users///opt/anaconda3/envs/sc/lib/python3.6/site-packages/pytorch_lightning/accelerators/cpu_accelerator.py", line 49, in setup
    self.setup_optimizers(model)
  File "/Users///opt/anaconda3/envs/sc/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 145, in setup_optimizers
    optimizers, lr_schedulers, optimizer_frequencies = self.trainer.init_optimizers(model)
  File "/Users///opt/anaconda3/envs/sc/lib/python3.6/site-packages/pytorch_lightning/trainer/optimizers.py", line 30, in init_optimizers
    optim_conf = model.configure_optimizers()
  File "/Users///Documents/personal/BertBasedCorrectionModels/bbcm/engine/bases.py", line 21, in configure_optimizers
    scheduler = build_lr_scheduler(self.cfg, optimizer)
  File "/Users///Documents/personal/BertBasedCorrectionModels/bbcm/solver/build.py", line 49, in build_lr_scheduler
    scheduler = getattr(lr_scheduler, cfg.SOLVER.SCHED)(**scheduler_args)
  File "/Users///Documents/personal/BertBasedCorrectionModels/bbcm/solver/lr_scheduler.py", line 73, in __init__
    super().__init__(optimizer, last_epoch, verbose)
TypeError: __init__() takes from 2 to 3 positional arguments but 4 were given

opened by woyijkl1 6
Cannot reproduce the result

I followed the steps but got a different result.

Epoch 9: 100%|█████████████████████████████████████████████████████████████████████████| 199/199 [00:55<00:00, 3.56it/s, loss=0.103, v_num=1]
/home/dell/workspace/jiangbingyu/correction/checkpoints/SoftMaskedBert/epoch=09-val_loss=0.13123.ckpt
Testing: 0it [00:00, ?it/s]
2021-09-08 23:47:58,342 SoftMaskedBertModel INFO: Testing...
Testing: 97%|█████████████████████████████████████████████████████████████████████████████████████████████▏ | 67/69 [00:03<00:00, 18.43it/s]
2021-09-08 23:48:02,103 SoftMaskedBertModel INFO: Test.
2021-09-08 23:48:02,105 SoftMaskedBertModel INFO: loss: 0.08779423662285873
2021-09-08 23:48:02,105 SoftMaskedBertModel INFO: Detection: acc: 0.5000
2021-09-08 23:48:02,106 SoftMaskedBertModel INFO: Correction: acc: 0.6900
2021-09-08 23:48:02,114 SoftMaskedBertModel INFO: The detection result is precision=0.8228782287822878, recall=0.6308345120226309 and F1=0.7141713370696557
2021-09-08 23:48:02,115 SoftMaskedBertModel INFO: The correction result is precision=0.7399103139013453, recall=0.6534653465346535 and F1=0.694006309148265
2021-09-08 23:48:02,116 SoftMaskedBertModel INFO: Sentence Level: acc:0.690000, precision:0.829508, recall:0.466790, f1:0.597403
Testing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 69/69 [00:03<00:00, 18.27it/s]

DATALOADER:0 TEST RESULTS {'val_loss': 0.08779423662285873}

opened by leon2milan 5
Fix: A faulty var name in `pl.Trainer`.

The last commit introduced this bug, and the repo no longer runs.
The callbacks argument must be a list, but ckpt_callback is a single ModelCheckpoint instance.

opened by okcd00 4

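Encoding error during data preprocessing

On the first run, preprocessing B1_training.sgml fails with an encoding error. The file was downloaded from the URL given in the README. I tried adding encoding='utf-8' to the open call, but it made no difference. I also inspected the file by hand and could not spot the problem. The error first seems to occur while processing line 5842.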

<PASSAGE id="B1-0826-1">因為那是我的第一次去北京，我的朋友就是我的導遊。跟他我們一起去了北京特別的地方，必如說長城、故宮、天堂公園什麼的。</PASSAGE>

Traceback (most recent call last):
  File "/home/BertBasedCorrect/tools/train_csc.py", line 51, in <module>
    main()
  File "/home/BertBasedCorrect/tools/train_csc.py", line 28, in main
    preproc()
  File "/home/BertBasedCorrect/bbcm/data/processors/csc.py", line 185, in preproc
    for item in read_data(get_abs_path('datasets', 'csc')):
  File "/home/BertBasedCorrect/bbcm/data/processors/csc.py", line 116, in read_data
    for line in f:
  File "/home/anaconda3/envs/torch/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 4867: invalid start byte

Process finished with exit code 1
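
One way to narrow down an error like this is to decode the file line by line from raw bytes and record which lines fail. The sketch below is a generic diagnostic, not necessarily the fix the maintainers applied; both helper names are hypothetical. The second helper shows a lenient read that substitutes U+FFFD for undecodable bytes, which loses those bytes and is only a stopgap.

```python
def find_undecodable_lines(path, encoding="utf-8"):
    """Return (line_number, error_message) pairs for lines that fail to decode."""
    bad = []
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as exc:
                bad.append((lineno, str(exc)))
    return bad


def read_lines_lenient(path, encoding="utf-8"):
    """Read text lines, replacing undecodable bytes with U+FFFD."""
    with open(path, encoding=encoding, errors="replace") as f:
        return [line.rstrip("\n") for line in f]
```

Note that the position in the UnicodeDecodeError message refers to an offset inside the decoder's current buffer, so it does not map directly to a line number in the file.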

opened by kovnew 3

ValueError: Expected input batch_size (80) to match target batch_size (88)

I can train the model correctly on the public datasets, but when I train on my company's data the following error occurs: ValueError: Expected input batch_size (80) to match target batch_size (88)

Note: my data is in JSON, in the same format as the public datasets above. A sample record: { "id": "--", "original_text": "播放我的世界之梦想大陆", "wrong_ids": [], "correct_text": "播放我的世界之梦想大陆" }
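
One possible cause, which this thread does not confirm, is a record whose original_text and correct_text differ in length, so the input and the target end up with different numbers of tokens. The sketch below scans records for such pairs. The field names come from the sample record above; the helper name is hypothetical, and treating wrong_ids as 0-based character positions where the two texts differ is an assumption.

```python
def find_suspect_records(records):
    """Return (id, reason) pairs for records that look inconsistent."""
    problems = []
    for rec in records:
        src, tgt = rec["original_text"], rec["correct_text"]
        if len(src) != len(tgt):
            problems.append((rec.get("id"), "length mismatch"))
            continue
        # Positions where the two texts disagree, character by character.
        diff = [i for i, (a, b) in enumerate(zip(src, tgt)) if a != b]
        if diff != sorted(rec.get("wrong_ids", [])):
            problems.append((rec.get("id"), "wrong_ids do not match differing positions"))
    return problems
```

Equal character length does not guarantee equal token counts: a WordPiece tokenizer can split runs of Latin letters or digits differently in the two texts, so a clean result here does not rule out a mismatch.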

opened by wshzd 2
Running train_csc.py fails with: AttributeError: Can't pickle local object 'get_csc_loader.<locals>._collate_fn'
All dependencies were installed at exactly the versions specified.

D:\SoftRun\Anaconda3\envs\torch161\python.exe E:/nlpcode/BertBasedCorrectionModels-master/tools/train_csc.py
2021-04-30 13:58:02,496 bert4csc INFO: Namespace(config_file='', opts=[])
2021-04-30 13:58:02,496 bert4csc INFO: Loaded configuration file csc/train_bert4csc.yml
2021-04-30 13:58:02,496 bert4csc INFO:
MODEL:
  BERT_CKPT: "bert-base-chinese"
  DEVICE: "cuda:0"
  NAME: "bert4csc"
  # [loss_coefficient]
  HYPER_PARAMS: [ 1.0 ]
  GPU_IDS: [0]

DATASETS:
  TRAIN: "datasets/csc/train.json"
  VALID: "datasets/csc/dev.json"
  TEST: "datasets/csc/test.json"

SOLVER:
  BASE_LR: 0.001
  WEIGHT_DECAY: 0.00001
  BATCH_SIZE: 16
  WARMUP_EPOCHS: 8
  MAX_EPOCHS: 20
  ACCUMULATE_GRAD_BATCHES: 16

TEST:
  BATCH_SIZE: 16

TASK:
  NAME: "csc"

OUTPUT_DIR: "checkpoints/bert4csc"

2021-04-30 13:58:02,496 bert4csc INFO: Running with config:
DATALOADER:
  NUM_WORKERS: 4
DATASETS:
  TEST: datasets/csc/test.json
  TRAIN: datasets/csc/train.json
  VALID: datasets/csc/dev.json
INPUT:
  MAX_LEN: 512
MODE: ['train', 'test']
MODEL:
  BERT_CKPT: bert-base-chinese
  DEVICE: cuda:0
  GPU_IDS: [0]
  HYPER_PARAMS: [1.0]
  NAME: bert4csc
  NUM_CLASSES: 10
  WEIGHTS:
OUTPUT_DIR: checkpoints/bert4csc
SOLVER:
  ACCUMULATE_GRAD_BATCHES: 16
  BASE_LR: 0.001
  BATCH_SIZE: 16
  BIAS_LR_FACTOR: 2
  CHECKPOINT_PERIOD: 10
  GAMMA: 0.1
  LOG_PERIOD: 100
  MAX_EPOCHS: 20
  MOMENTUM: 0.9
  OPTIMIZER_NAME: AdamW
  STEPS: (30000,)
  WARMUP_EPOCHS: 8
  WARMUP_FACTOR: 0.3333333333333333
  WARMUP_ITERS: 500
  WARMUP_METHOD: linear
  WEIGHT_DECAY: 1e-05
  WEIGHT_DECAY_BIAS: 0
TASK:
  NAME: csc
TEST:
  BATCH_SIZE: 16
  CKPT_FN:
Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\pytorch_lightning\utilities\distributed.py:49: UserWarning: Checkpoint directory E:\nlpcode\BertBasedCorrectionModels-master\checkpoints\bert4csc exists and is not empty.
  warnings.warn(*args, **kwargs)
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name | Type            | Params
0 | bert | BertForMaskedLM | 102 M

102 M     Trainable params
0         Non-trainable params
102 M     Total params
Validation sanity check: 0it [00:00, ?it/s]
2021-04-30 13:58:11,384 bert4csc INFO: Valid.
Traceback (most recent call last):
  File "E:/nlpcode/BertBasedCorrectionModels-master/tools/train_csc.py", line 53, in <module>
    main()
  File "E:/nlpcode/BertBasedCorrectionModels-master/tools/train_csc.py", line 49, in main
    train(cfg, model, loaders, ckpt_callback)
  File "E:\nlpcode\BertBasedCorrectionModels-master\tools\bases.py", line 78, in train
    trainer.fit(model, train_loader, valid_loader)
  File "D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 470, in fit
    results = self.accelerator_backend.train()
  File "D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\pytorch_lightning\accelerators\gpu_accelerator.py", line 68, in train
    results = self.train_or_test()
  File "D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 69, in train_or_test
    results = self.trainer.train()
  File "D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 492, in train
    self.run_sanity_check(self.get_model())
  File "D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 690, in run_sanity_check
    _, eval_results = self.run_evaluation(test_mode=False, max_batches=self.num_sanity_val_batches)
  File "D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 593, in run_evaluation
    for batch_idx, batch in enumerate(dataloader):
  File "D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\torch\utils\data\dataloader.py", line 291, in __iter__
    return _MultiProcessingDataLoaderIter(self)
  File "D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\torch\utils\data\dataloader.py", line 737, in __init__
    w.start()
  File "D:\SoftRun\Anaconda3\envs\torch161\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "D:\SoftRun\Anaconda3\envs\torch161\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "D:\SoftRun\Anaconda3\envs\torch161\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "D:\SoftRun\Anaconda3\envs\torch161\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "D:\SoftRun\Anaconda3\envs\torch161\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'get_csc_loader.<locals>._collate_fn'

Process finished with exit code 1
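
The traceback ends in multiprocessing's popen_spawn_win32 module, which pickles the objects it sends to each worker process, and the standard pickle module cannot serialize a function defined inside another function. The sketch below reproduces that limitation with the standard library only. The factory name make_local_collate is hypothetical and merely mimics a loader factory that defines its collate function inline.

```python
import pickle


def make_local_collate():
    """Mimic a loader factory that defines its collate function inline."""
    def _collate_fn(batch):
        return list(batch)
    return _collate_fn


def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


print("nested function picklable:", can_pickle(make_local_collate()))
print("builtin function picklable:", can_pickle(len))
```

Two workarounds that are commonly tried in this situation, neither confirmed in this thread: set DATALOADER.NUM_WORKERS (4 in the config dump above) to 0 so that no worker processes are spawned, or move the collate function to module level so that it can be pickled by name.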
opened by xubinxin123 2

What det_labels means

Hello, and thank you for open-sourcing this. What is the purpose of det_labels during model training?

class BertForCsc(CscTrainingModel):
    def __init__(self, cfg, tokenizer):
        super().__init__(cfg)
        self.cfg = cfg
        self.bert = BertForMaskedLM.from_pretrained(cfg.MODEL.BERT_CKPT)
        self.tokenizer = tokenizer

    def forward(self, texts, cor_labels=None, det_labels=None):
        # print('text: ', texts)
        # print('cor_labels: ', cor_labels)
        # print('det labels: ', det_labels)
        if cor_labels is not None:
            # the correct (target) texts were provided
            text_labels = self.tokenizer(cor_labels, padding=True, return_tensors='pt')['input_ids']
            text_labels = text_labels.to(self.cfg.MODEL.DEVICE)
            print('text labels: ', text_labels)
            # Tokens with indices set to -100 are ignored (masked)
            text_labels[text_labels == 0] = -100

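I figured it out: det_labels is used for error detection. But the model does not seem to perform detection at all and goes straight to correction, is that right? As a result, the computed det_acc is always 1.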
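
To make the idea of detection labels concrete, the sketch below builds one 0/1 label per character from a list of wrong positions. This is an illustration under assumptions, not the project's preprocessing code: the helper name is hypothetical, and the project may align labels to tokens (including special tokens) rather than to raw characters.

```python
def build_det_labels(text, wrong_ids):
    """Return one 0/1 label per character; 1 marks a position listed in wrong_ids."""
    wrong = set(wrong_ids)
    return [1 if i in wrong else 0 for i in range(len(text))]


# "我今天很高心" has a single wrong character at index 5.
print(build_det_labels("我今天很高心", [5]))
```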

opened by kovnew 2

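Error when training on my own dataset

Training on the author's dataset works, but training on my own dataset fails with: ValueError: Expected input batch_size (304) to match target batch_size (336).

Training also works when I merge my dataset into the author's. Only training on my own data alone fails. Do I need to change some other parameters?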

opened by nvliajia 1

The trained model cannot be loaded and the program downloads the model from HuggingFace automatically. Why?

Hello. When I call load_model_directly() in inference.py, the trained model is not loaded. The code is as follows:

① Code:

def load_model_directly():
    ckpt_file = 'SoftMaskedBert/epoch=05-val_loss=0.03253.ckpt'
    config_file = 'csc/train_SoftMaskedBert.yml'

    from bbcm.config import cfg
    cp = get_abs_path('checkpoints', ckpt_file)
    cfg.merge_from_file(get_abs_path('configs', config_file))
    tokenizer = BertTokenizer.from_pretrained(cfg.MODEL.BERT_CKPT)
    print("### tokenizer loaded")
    print("### tokenizer: ", tokenizer)
    if cfg.MODEL.NAME in ['bert4csc', 'macbert4csc']:
        model = BertForCsc.load_from_checkpoint(cp,
                                                cfg=cfg,
                                                tokenizer=tokenizer)
    else:
        print("### loading model")
        print("### cp : ", cp)
        model = SoftMaskedBertModel.load_from_checkpoint(cp,
                                                         cfg=cfg,
                                                         tokenizer=tokenizer)
    print("### model loaded")
    model.eval()
    model.to(cfg.MODEL.DEVICE)
    return model

② Problem: this code does not seem to take effect. The ckpt file is not loaded and the program still downloads the model from HuggingFace. The call in question is model = SoftMaskedBertModel.load_from_checkpoint(cp, cfg=cfg, tokenizer=tokenizer). I looked up load_from_checkpoint() but could not work out how the cp and cfg arguments are passed through.

opened by TommyTang930 0

About HYPER_PARAMS in train_SoftMaskedBert

Does the HYPER_PARAMS value in train_SoftMaskedBert set the relative weights of the detection loss and the correction loss? Is it the value used in CscTrainingModel.training_step, in loss = self.w * outputs[1] + (1 - self.w) * outputs[0]? That is, does 0.8 mean detection * 0.2 + correction * 0.8? Can I change this value to make the model focus on improving the detection P/R/F?
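
The weighting in the question can be tried out with plain numbers. The sketch below mirrors the quoted expression and follows the asker's reading that outputs[0] is the detection loss and outputs[1] the correction loss; that mapping is an assumption, and the function name is hypothetical.

```python
def combined_loss(w, det_loss, cor_loss):
    """Mirror loss = w * outputs[1] + (1 - w) * outputs[0].

    Assumes outputs[0] is the detection loss and outputs[1] the correction loss.
    """
    return w * cor_loss + (1 - w) * det_loss


# Lowering w shifts weight from the correction loss to the detection loss.
for w in (0.8, 0.5, 0.2):
    print(w, combined_loss(w, det_loss=1.0, cor_loss=2.0))
```

Under this reading, a smaller value puts more weight on detection, but whether that actually improves detection P/R/F would need to be checked experimentally.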

opened by new-cainiao 1

Issue with the evaluation script

I trained the BERT correction model with your config. With your evaluation script I get:

Sentence Level: 
acc:0.793636, precision:0.828810, recall:0.732472, f1:0.777669

With the evaluation script from the ReaLiSe model:

{'sent-detect-acc': 82.18181818181817, 
'sent-detect-p': 72.86689419795222, 
'sent-detect-r': 78.9279112754159, 
'sent-detect-f1': 75.77639751552793, 
'sent-correct-acc': 79.9090909090909, 
'sent-correct-p': 68.60068259385666, 
'sent-correct-r': 74.3068391866913, 
'sent-correct-f1': 71.33984028393967}

You only count an FP when src == tgt, so FP is undercounted. That makes the denominator of precision too small, and the resulting precision too high.
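
The effect described here can be reproduced on a toy example. The sketch below is not the code of either evaluation script; it contrasts two counting conventions as I read them from this comment. With fp_only_when_clean=True, a false positive is counted only when a clean sentence (src == tgt) is changed. With fp_only_when_clean=False, a sentence that contains an error and is changed to something other than the target also counts as a false positive.

```python
def sentence_prf(triples, fp_only_when_clean):
    """Sentence-level precision, recall and F1 over (src, tgt, pred) triples."""
    tp = fp = fn = 0
    for src, tgt, pred in triples:
        if src == tgt:
            if pred != src:
                fp += 1  # a clean sentence was changed
        elif pred == tgt:
            tp += 1      # an erroneous sentence was fully fixed
        else:
            fn += 1      # an erroneous sentence was not fixed
            if pred != src and not fp_only_when_clean:
                fp += 1  # the model also made a wrong change
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


TOY = [
    ("ab", "ac", "ac"),  # error, fixed
    ("ab", "ac", "ad"),  # error, changed to the wrong thing
    ("xy", "xy", "xy"),  # clean, left alone
    ("xy", "xy", "xz"),  # clean, wrongly changed
]

print("FP only on clean sentences:", sentence_prf(TOY, fp_only_when_clean=True))
print("FP on every wrong change:  ", sentence_prf(TOY, fp_only_when_clean=False))
```

On this toy data the recall is identical under both conventions, while the precision is lower when every wrong change counts as a false positive.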

good first issue

opened by FrankWork 1

