PyTorch implementations of BERT-based spelling error correction models.

Overview

BertBasedCorrectionModels

BERT-based text error correction models, implemented in PyTorch.

Data Preparation

  1. Download the SIGHAN datasets from http://nlp.ee.ncu.edu.tw/resource/csc.html
  2. Unzip the datasets and copy all .sgml files into the datasets/csc/ directory
  3. Copy SIGHAN15_CSC_TestInput.txt and SIGHAN15_CSC_TestTruth.txt into the datasets/csc/ directory
  4. Download https://github.com/wdimmy/Automatic-Corpus-Generation/blob/master/corpus/train.sgml into the datasets/csc directory
  5. Make sure the following files are present in datasets/csc (a quick check is sketched after this list):
    train.sgml
    B1_training.sgml
    C1_training.sgml  
    SIGHAN15_CSC_A2_Training.sgml  
    SIGHAN15_CSC_B2_Training.sgml  
    SIGHAN15_CSC_TestInput.txt
    SIGHAN15_CSC_TestTruth.txt
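
A quick way to verify the layout (a hypothetical helper script, not part of the repo):

# Hypothetical sanity check (not part of this repo): confirm the required
# SIGHAN files are all present in datasets/csc before the first training run.
import os

required = [
    'train.sgml', 'B1_training.sgml', 'C1_training.sgml',
    'SIGHAN15_CSC_A2_Training.sgml', 'SIGHAN15_CSC_B2_Training.sgml',
    'SIGHAN15_CSC_TestInput.txt', 'SIGHAN15_CSC_TestTruth.txt',
]
missing = [fn for fn in required if not os.path.isfile(os.path.join('datasets', 'csc', fn))]
print('all files present' if not missing else 'missing: %s' % missing)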
    

Environment Setup

  1. Use an existing Python environment or create a new one with conda create -n <env_name> python=3.7 (recommended)
  2. Clone this project and cd into the project root directory
  3. Install the dependencies: pip install -r requirements.txt
  4. If you get an error about the GLIBC version being too low (upgrading GLIBC is error-prone and not recommended), install an older version of openCC instead (e.g. 1.1.0)
  5. In the current shell, add this directory to the Python path: export PYTHONPATH=.

Training

Run the following command to train a model; the first run will preprocess the data automatically.

python tools/train_csc.py --config_file csc/train_SoftMaskedBert.yml

You can choose a different config file to train a different model; the following config files are currently supported:

  • train_bert4csc.yml
  • train_macbert4csc.yml
  • train_SoftMaskedBert.yml

For anything else, adjust the parameters in the config file as needed; a sketch of overriding values from Python follows.
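
If you prefer to override a few values from code rather than edit the YAML by hand, a minimal sketch using the yacs-style cfg object (cfg and merge_from_file follow the usage visible in inference.py; the override values and the output file name below are illustrative assumptions, not recommendations):

# Minimal sketch: programmatically override a few training parameters.
from bbcm.config import cfg

cfg.merge_from_file('configs/csc/train_SoftMaskedBert.yml')
cfg.SOLVER.BASE_LR = 5e-5        # e.g. a smaller learning rate
cfg.SOLVER.MAX_EPOCHS = 20       # e.g. more epochs
cfg.SOLVER.BATCH_SIZE = 16       # e.g. a smaller batch for limited GPU memory

# yacs configs can be dumped back to YAML and saved as a new config file,
# which can then be passed to tools/train_csc.py via --config_file
with open('configs/csc/train_SoftMaskedBert_custom.yml', 'w') as f:
    f.write(cfg.dump())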

Experimental Results

SoftMaskedBert

Component     acc      precision  recall   F1      (sentence level)
Detection     0.5045   0.8252     0.8416   0.8333
Correction    0.8055   0.9395     0.8748   0.9060

BERT-type models

char level

Model         precision  recall   F1
BERT4CSC      0.9269     0.8651   0.8949
MACBERT4CSC   0.9380     0.8736   0.9047

sentence level

Model         acc      precision  recall   F1
BERT4CSC      0.7990   0.8482     0.7214   0.7797
MACBERT4CSC   0.8027   0.8525     0.7251   0.7836

Inference

Method 1: use the inference script:

python inference.py --ckpt_fn epoch=0-val_loss=0.03.ckpt --texts "我今天很高心"
# or pass the path of a text file with one sentence per line
python inference.py --ckpt_fn epoch=0-val_loss=0.03.ckpt --text_file /ml/data/text.txt

where /ml/data/text.txt contains, for example:

我今天很高心
你这个辣鸡模型只能做错别字纠正

Method 2: call the model directly (a loading sketch follows the snippet)

texts = ['今天我很高心', '测试', '继续测试']
model.predict(texts)
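
The snippet above assumes an already-loaded model object. A minimal sketch of obtaining one, assuming the load_model_directly() helper in inference.py (the checkpoint and config it loads are whatever is configured inside that helper, so they depend on your own training run):

# Minimal sketch, assuming inference.py's load_model_directly() helper.
from inference import load_model_directly

model = load_model_directly()
texts = ['今天我很高心', '测试', '继续测试']
print(model.predict(texts))   # expected to return the corrected sentences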

Method 3: export the BERT weights and use them via transformers or pycorrector

  1. Export the BERT weights with convert_to_pure_state_dict.py
  2. For the remaining steps, follow https://github.com/shibing624/pycorrector/blob/master/pycorrector/macbert/README.md (a minimal transformers-only sketch is given below)
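
A hedged transformers-only sketch (an assumption about how the exported weights can be used, not the repo's documented API): load the exported weights as an ordinary masked language model and take the argmax token at every position as the corrected character. 'path/to/exported_model' is a placeholder for the directory holding the exported state dict plus tokenizer files.

# Hedged sketch: use the exported weights as a plain masked LM with transformers.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('path/to/exported_model')   # placeholder path
model = BertForMaskedLM.from_pretrained('path/to/exported_model')
model.eval()

text = '我今天很高心'
inputs = tokenizer(text, return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits          # (1, seq_len, vocab_size)
pred_ids = logits.argmax(dim=-1)[0][1:-1]    # drop [CLS] and [SEP]
print(tokenizer.decode(pred_ids).replace(' ', ''))   # e.g. 我今天很高兴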

Citation

If you use this project in your research, please cite it as follows:

@article{cai2020pre,
  title={BERT Based Correction Models},
  author={Cai, Heng and Chen, Dian},
  journal={GitHub. Note: https://github.com/gitabtion/BertBasedCorrectionModels},
  year={2020}
}

License

The source code is released under the Apache License 2.0 and may be used free of charge for commercial purposes. Please include a link to this project and the license text in your product documentation. This project is protected by copyright law; infringement will be pursued.

Changelog

20210618

  1. Fixed the encoding error raised during data preprocessing

20210518

  1. Switched the BERT4CSC detection task to focal loss (an illustrative sketch follows this list)
  2. Updated the experimental results for the modified models
  3. Lowered the probability of keeping the original text unchanged during data preprocessing
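
For reference, a minimal focal-loss sketch for binary token-level detection labels (an illustrative PyTorch implementation; the loss actually used in this repo may differ in details such as the alpha/gamma defaults):

# Illustrative binary focal loss for token-level error detection labels.
# Not necessarily identical to the implementation used in this repo.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryFocalLoss(nn.Module):
    def __init__(self, gamma=2.0, alpha=0.25):
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha

    def forward(self, logits, targets):
        # logits, targets: (batch, seq_len); targets are 0/1 detection labels
        targets = targets.float()
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
        p_t = torch.exp(-bce)                                   # probability of the true class
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        return (alpha_t * (1 - p_t) ** self.gamma * bce).mean()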

20210517

  1. Added an error-detection task to the BERT4CSC model
  2. Added inference from line-by-line text files

References

  1. Spelling Error Correction with Soft-Masked BERT
  2. http://ir.itc.ntnu.edu.tw/lre/sighan8csc.html
  3. https://github.com/wdimmy/Automatic-Corpus-Generation
  4. transformers
  5. https://github.com/sunnyqiny/Confusionset-guided-Pointer-Networks-for-Chinese-Spelling-Check
  6. SoftMaskedBert-PyTorch
  7. Deep-Learning-Project-Template
  8. https://github.com/lonePatient/TorchBlocks
  9. https://github.com/shibing624/pycorrector
Comments
  • Training on CPU raises an error

    The config file used is as follows:

    MODEL:
      BERT_CKPT: "bert-base-chinese"
      DEVICE: "cpu"
      NAME: "SoftMaskedBertModel"
      # loss_coefficient
      HYPER_PARAMS: [0.8]

    DATASETS:
      TRAIN: "datasets/csc/train.json"
      VALID: "datasets/csc/dev.json"
      TEST: "datasets/csc/test.json"

    SOLVER:
      BASE_LR: 0.0001
      WEIGHT_DECAY: 5e-8
      BATCH_SIZE: 32
      MAX_EPOCHS: 10
      ACCUMULATE_GRAD_BATCHES: 4

    TEST:
      BATCH_SIZE: 16

    TASK:
      NAME: "csc"

    OUTPUT_DIR: "checkpoints/SoftMaskedBert"

    Command: python tools/train_csc.py --config_file csc/train_SoftMaskedBert.yml

    Error:

    /Users//opt/anaconda3/envs/sc/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:49: UserWarning: Checkpoint directory /Users///Documents/personal/BertBasedCorrectionModels/checkpoints/SoftMaskedBert exists and is not empty.
      warnings.warn(*args, **kwargs)
    GPU available: False, used: False
    TPU available: None, using: 0 TPU cores
    Traceback (most recent call last):
      File "tools/train_csc.py", line 52, in <module>
        main()
      File "tools/train_csc.py", line 48, in main
        train(cfg, model, loaders, ckpt_callback)
      File "/Users///Documents/personal/BertBasedCorrectionModels/tools/bases.py", line 78, in train
        trainer.fit(model, train_loader, valid_loader)
      File "/Users//opt/anaconda3/envs/sc/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 454, in fit
        self.accelerator_backend.setup(model)
      File "/Users///opt/anaconda3/envs/sc/lib/python3.6/site-packages/pytorch_lightning/accelerators/cpu_accelerator.py", line 49, in setup
        self.setup_optimizers(model)
      File "/Users///opt/anaconda3/envs/sc/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 145, in setup_optimizers
        optimizers, lr_schedulers, optimizer_frequencies = self.trainer.init_optimizers(model)
      File "/Users///opt/anaconda3/envs/sc/lib/python3.6/site-packages/pytorch_lightning/trainer/optimizers.py", line 30, in init_optimizers
        optim_conf = model.configure_optimizers()
      File "/Users///Documents/personal/BertBasedCorrectionModels/bbcm/engine/bases.py", line 21, in configure_optimizers
        scheduler = build_lr_scheduler(self.cfg, optimizer)
      File "/Users///Documents/personal/BertBasedCorrectionModels/bbcm/solver/build.py", line 49, in build_lr_scheduler
        scheduler = getattr(lr_scheduler, cfg.SOLVER.SCHED)(scheduler_args)
      File "/Users///Documents/personal/BertBasedCorrectionModels/bbcm/solver/lr_scheduler.py", line 73, in __init__
        super().__init__(optimizer, last_epoch, verbose)
    TypeError: __init__() takes from 2 to 3 positional arguments but 4 were given

    opened by woyijkl1 6
  • Cannot reproduce the result

    I followed the steps and got a different result.

    Epoch 9: 100%|█████████████████████████████████████████████████████████████████████████| 199/199 [00:55<00:00, 3.56it/s, loss=0.103, v_num=1]
    /home/dell/workspace/jiangbingyu/correction/checkpoints/SoftMaskedBert/epoch=09-val_loss=0.13123.ckpt
    Testing: 0it [00:00, ?it/s]
    2021-09-08 23:47:58,342 SoftMaskedBertModel INFO: Testing...
    Testing: 97%|█████████████████████████████████████████████████████████████████████████████████████████████▏ | 67/69 [00:03<00:00, 18.43it/s]
    2021-09-08 23:48:02,103 SoftMaskedBertModel INFO: Test.
    2021-09-08 23:48:02,105 SoftMaskedBertModel INFO: loss: 0.08779423662285873
    2021-09-08 23:48:02,105 SoftMaskedBertModel INFO: Detection: acc: 0.5000
    2021-09-08 23:48:02,106 SoftMaskedBertModel INFO: Correction: acc: 0.6900
    2021-09-08 23:48:02,114 SoftMaskedBertModel INFO: The detection result is precision=0.8228782287822878, recall=0.6308345120226309 and F1=0.7141713370696557
    2021-09-08 23:48:02,115 SoftMaskedBertModel INFO: The correction result is precision=0.7399103139013453, recall=0.6534653465346535 and F1=0.694006309148265
    2021-09-08 23:48:02,116 SoftMaskedBertModel INFO: Sentence Level: acc:0.690000, precision:0.829508, recall:0.466790, f1:0.597403
    Testing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 69/69 [00:03<00:00, 18.27it/s]

    DATALOADER:0 TEST RESULTS {'val_loss': 0.08779423662285873}

    opened by leon2milan 5
  • Fix: A faulty var name in `pl.Trainer`.


    This issue, introduced in the last commit, currently breaks the repo.
    The value fed to `callbacks` should be a list, but ckpt_callback is a single ModelCheckpoint instance.


    opened by okcd00 4
  • Encoding error during data preprocessing

    On the first run, an encoding error is raised while preprocessing B1_training.sgml. The file was downloaded from the URL given in the README. Adding encoding='utf-8' to open() did not help, and inspecting the file manually did not reveal the problem. The error first seems to occur while processing line 5842:

    <PASSAGE id="B1-0826-1">因為那是我的第一次去北京,我的朋友就是我的導遊。跟他我們一起去了北京特別的地方,必如說長城、故宮、天堂公園什麼的。</PASSAGE>

    Traceback (most recent call last):
      File "/home/BertBasedCorrect/tools/train_csc.py", line 51, in <module>
        main()
      File "/home/BertBasedCorrect/tools/train_csc.py", line 28, in main
        preproc()
      File "/home/BertBasedCorrect/bbcm/data/processors/csc.py", line 185, in preproc
        for item in read_data(get_abs_path('datasets', 'csc')):
      File "/home/BertBasedCorrect/bbcm/data/processors/csc.py", line 116, in read_data
        for line in f:
      File "/home/anaconda3/envs/torch/lib/python3.7/codecs.py", line 322, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 4867: invalid start byte
    
    Process finished with exit code 1
    
    opened by kovnew 3
  • ValueError: Expected input batch_size (80) to match target batch_size (88)


    I can train the model correctly with the public datasets, but when I train on my company's data an error occurs: ValueError: Expected input batch_size (80) to match target batch_size (88)

    Note: my data is in JSON format, the same as the public datasets above. A sample record: { "id": "--", "original_text": "播放我的世界之梦想大陆", "wrong_ids": [], "correct_text": "播放我的世界之梦想大陆" }

    opened by wshzd 2
  • Running train_csc.py fails: AttributeError: Can't pickle local object 'get_csc_loader.<locals>._collate_fn'

    Dependencies were installed with exactly the versions given.

    D:\SoftRun\Anaconda3\envs\torch161\python.exe E:/nlpcode/BertBasedCorrectionModels-master/tools/train_csc.py
    2021-04-30 13:58:02,496 bert4csc INFO: Namespace(config_file='', opts=[])
    2021-04-30 13:58:02,496 bert4csc INFO: Loaded configuration file csc/train_bert4csc.yml
    2021-04-30 13:58:02,496 bert4csc INFO:
    MODEL:
      BERT_CKPT: "bert-base-chinese"
      DEVICE: "cuda:0"
      NAME: "bert4csc"
      # loss_coefficient
      HYPER_PARAMS: [ 1.0 ]
      GPU_IDS: [0]

    DATASETS:
      TRAIN: "datasets/csc/train.json"
      VALID: "datasets/csc/dev.json"
      TEST: "datasets/csc/test.json"

    SOLVER:
      BASE_LR: 0.001
      WEIGHT_DECAY: 0.00001
      BATCH_SIZE: 16
      WARMUP_EPOCHS: 8
      MAX_EPOCHS: 20
      ACCUMULATE_GRAD_BATCHES: 16

    TEST:
      BATCH_SIZE: 16

    TASK:
      NAME: "csc"

    OUTPUT_DIR: "checkpoints/bert4csc"

    2021-04-30 13:58:02,496 bert4csc INFO: Running with config:
    DATALOADER: NUM_WORKERS: 4
    DATASETS: TEST: datasets/csc/test.json  TRAIN: datasets/csc/train.json  VALID: datasets/csc/dev.json
    INPUT: MAX_LEN: 512  MODE: ['train', 'test']
    MODEL: BERT_CKPT: bert-base-chinese  DEVICE: cuda:0  GPU_IDS: [0]  HYPER_PARAMS: [1.0]  NAME: bert4csc  NUM_CLASSES: 10  WEIGHTS:
    OUTPUT_DIR: checkpoints/bert4csc
    SOLVER: ACCUMULATE_GRAD_BATCHES: 16  BASE_LR: 0.001  BATCH_SIZE: 16  BIAS_LR_FACTOR: 2  CHECKPOINT_PERIOD: 10  GAMMA: 0.1  LOG_PERIOD: 100  MAX_EPOCHS: 20  MOMENTUM: 0.9  OPTIMIZER_NAME: AdamW  STEPS: (30000,)  WARMUP_EPOCHS: 8  WARMUP_FACTOR: 0.3333333333333333  WARMUP_ITERS: 500  WARMUP_METHOD: linear  WEIGHT_DECAY: 1e-05  WEIGHT_DECAY_BIAS: 0
    TASK: NAME: csc
    TEST: BATCH_SIZE: 16  CKPT_FN:
    Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']

    • This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).

    • This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

    D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\pytorch_lightning\utilities\distributed.py:49: UserWarning: Checkpoint directory E:\nlpcode\BertBasedCorrectionModels-master\checkpoints\bert4csc exists and is not empty.
      warnings.warn(*args, **kwargs)
    GPU available: True, used: True
    TPU available: None, using: 0 TPU cores
    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

      | Name | Type            | Params
    ------------------------------------
    0 | bert | BertForMaskedLM | 102 M

    102 M     Trainable params
    0         Non-trainable params
    102 M     Total params

    Validation sanity check: 0it [00:00, ?it/s]
    2021-04-30 13:58:11,384 bert4csc INFO: Valid.
    Traceback (most recent call last):
      File "E:/nlpcode/BertBasedCorrectionModels-master/tools/train_csc.py", line 53, in <module>
        main()
      File "E:/nlpcode/BertBasedCorrectionModels-master/tools/train_csc.py", line 49, in main
        train(cfg, model, loaders, ckpt_callback)
      File "E:\nlpcode\BertBasedCorrectionModels-master\tools\bases.py", line 78, in train
        trainer.fit(model, train_loader, valid_loader)
      File "D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 470, in fit
        results = self.accelerator_backend.train()
      File "D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\pytorch_lightning\accelerators\gpu_accelerator.py", line 68, in train
        results = self.train_or_test()
      File "D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 69, in train_or_test
        results = self.trainer.train()
      File "D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 492, in train
        self.run_sanity_check(self.get_model())
      File "D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 690, in run_sanity_check
        _, eval_results = self.run_evaluation(test_mode=False, max_batches=self.num_sanity_val_batches)
      File "D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 593, in run_evaluation
        for batch_idx, batch in enumerate(dataloader):
      File "D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\torch\utils\data\dataloader.py", line 291, in __iter__
        return _MultiProcessingDataLoaderIter(self)
      File "D:\SoftRun\Anaconda3\envs\torch161\lib\site-packages\torch\utils\data\dataloader.py", line 737, in __init__
        w.start()
      File "D:\SoftRun\Anaconda3\envs\torch161\lib\multiprocessing\process.py", line 105, in start
        self._popen = self._Popen(self)
      File "D:\SoftRun\Anaconda3\envs\torch161\lib\multiprocessing\context.py", line 223, in _Popen
        return _default_context.get_context().Process._Popen(process_obj)
      File "D:\SoftRun\Anaconda3\envs\torch161\lib\multiprocessing\context.py", line 322, in _Popen
        return Popen(process_obj)
      File "D:\SoftRun\Anaconda3\envs\torch161\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
        reduction.dump(process_obj, to_child)
      File "D:\SoftRun\Anaconda3\envs\torch161\lib\multiprocessing\reduction.py", line 60, in dump
        ForkingPickler(file, protocol).dump(obj)
    AttributeError: Can't pickle local object 'get_csc_loader.<locals>._collate_fn'

    Process finished with exit code 1

    opened by xubinxin123 2
  • The meaning of det_labels

    Hello, thanks for open-sourcing this project. What is the meaning of det_labels during training?

    class BertForCsc(CscTrainingModel):
        def __init__(self, cfg, tokenizer):
            super().__init__(cfg)
            self.cfg = cfg
            self.bert = BertForMaskedLM.from_pretrained(cfg.MODEL.BERT_CKPT)
            self.tokenizer = tokenizer
    
        def forward(self, texts, cor_labels=None, det_labels=None):
            # print('text: ', texts)
            # print('cor_labels: ', cor_labels)
            # print('det labels: ', det_labels)
            if cor_labels is not None:
                # 正确样本不为空
                text_labels = self.tokenizer(cor_labels, padding=True, return_tensors='pt')['input_ids']
                text_labels = text_labels.to(self.cfg.MODEL.DEVICE)
                print('text labels: ', text_labels)
                # Tokens with indices set to -100 are ignored (masked)
                text_labels[text_labels == 0] = -100
    

    I see now: this is for error detection. But it looks like the model does not actually perform detection, right? It seems to do the correction directly, which is why the computed det_acc is always 1.

    opened by kovnew 2
  • Training with my own dataset raises an error

    Training with the author's dataset works, but training with my own dataset fails with: ValueError: Expected input batch_size (304) to match target batch_size (336).

    If I merge my data into the author's dataset it also trains fine; it only fails when I train on my data alone. Should I be changing some other parameters?

    opened by nvliajia 1
  • The trained model cannot be loaded; the program downloads a model from HuggingFace instead. What is the reason?

    Hello, when loading the trained model via load_model_directly() in inference.py, the model is not loaded. The code is as follows:

    ① Code:

    def load_model_directly():
        ckpt_file = 'SoftMaskedBert/epoch=05-val_loss=0.03253.ckpt'
        config_file = 'csc/train_SoftMaskedBert.yml'

        from bbcm.config import cfg
        cp = get_abs_path('checkpoints', ckpt_file)
        cfg.merge_from_file(get_abs_path('configs', config_file))
        tokenizer = BertTokenizer.from_pretrained(cfg.MODEL.BERT_CKPT)
        print("###tokenizer加载完毕")
        print("### tokenizer: ", tokenizer)
        if cfg.MODEL.NAME in ['bert4csc', 'macbert4csc']:
            model = BertForCsc.load_from_checkpoint(cp,
                                                    cfg=cfg,
                                                    tokenizer=tokenizer)
        else:
            print("###加载模型")
            print("###cp : ", cp)
            model = SoftMaskedBertModel.load_from_checkpoint(cp,
                                                             cfg=cfg,
                                                             tokenizer=tokenizer)
        print("###model加载完毕")
        model.eval()
        model.to(cfg.MODEL.DEVICE)
        return model
    

    ② Problem: this code does not seem to take effect; the ckpt file is not loaded and the program still downloads the model from HuggingFace automatically. Regarding model = SoftMaskedBertModel.load_from_checkpoint(cp, cfg=cfg, tokenizer=tokenizer): I looked at the load_from_checkpoint() method but do not understand how the cp and cfg arguments are being passed.

    opened by TommyTang930 0
  • About HYPER_PARAMS in train_SoftMaskedBert

    Is the HYPER_PARAMS value in train_SoftMaskedBert the weight between the detection loss and the correction loss? Is it the value applied in CscTrainingModel.training_step as loss = self.w * outputs[1] + (1 - self.w) * outputs[0], i.e. with 0.8 meaning 0.2 * detection + 0.8 * correction? Can I adjust this value to make the model focus more on improving the detection precision/recall/F1?

    opened by new-cainiao 1
  • Issue with the evaluation script

    With the BERT correction model trained using your config, your evaluation script gives:

    Sentence Level: 
    acc:0.793636, precision:0.828810, recall:0.732472, f1:0.777669
    

    With the evaluation script from the REALISE model:

    {'sent-detect-acc': 82.18181818181817, 
    'sent-detect-p': 72.86689419795222, 
    'sent-detect-r': 78.9279112754159, 
    'sent-detect-f1': 75.77639751552793, 
    'sent-correct-acc': 79.9090909090909, 
    'sent-correct-p': 68.60068259385666, 
    'sent-correct-r': 74.3068391866913, 
    'sent-correct-f1': 71.33984028393967}
    

    You only count false positives when src == tgt, so the FP count is too small; the denominator of precision is therefore too small and the reported precision is inflated.

    good first issue 
    opened by FrankWork 1