Code for the ACL2021 paper "Lexicon Enhanced Chinese Sequence Labeling Using BERT Adapter"

Last update: Dec 6, 2022

Overview

Lexicon Enhanced Chinese Sequence Labeling Using BERT Adapter

Code and checkpoints for the ACL2021 paper "Lexicon Enhanced Chinese Sequence Labeling Using BERT Adapter"

Arxiv link of the paper: https://arxiv.org/abs/2105.07148

Requirements

Python 3.7.0
transformers 3.4.0
Numpy 1.18.5
Packaging 17.1
scikit-learn 0.23.2
torch 1.6.0+cu92
tqdm 4.50.2
multiprocess 0.70.10
tensorflow 2.3.1
tensorboardX 2.1
seqeval 1.2.1

Input Format

CoNLL format (BIOES tag scheme preferred), with one character and its label per line. Sentences are separated by a blank line.

美   B-LOC  
国   E-LOC  
的   O  
华   B-PER  
莱   I-PER  
士   E-PER  

我   O  
跟   O  
他   O  
谈   O  
笑   O  
风   O  
生   O
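
The input format described above can be read with a few lines of Python. This is a minimal sketch, not the repository's own loader: the function name `parse_conll` is hypothetical, and it assumes that the character and its label are separated by whitespace.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]


def parse_conll(text: str) -> List[Sentence]:
    """Split CoNLL-style text into sentences of (character, label) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        if len(parts) < 2:
            raise ValueError(f"expected 'char label', got: {raw_line!r}")
        # First column is the character, last column is the label.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


sample = "美 B-LOC\n国 E-LOC\n的 O\n\n我 O\n跟 O\n"
sentences = parse_conll(sample)
print(len(sentences))
print(sentences[0])
```

Raising on a line with fewer than two columns makes formatting problems visible early instead of silently dropping characters.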

Chinese BERT, Chinese Word Embedding, and Checkpoints

Chinese BERT

Chinese BERT: https://cdn.huggingface.co/bert-base-chinese-pytorch_model.bin

Chinese Word Embedding

Word Embedding: https://ai.tencent.com/ailab/nlp/en/data/Tencent_AILab_ChineseEmbedding.tar.gz

Checkpoints and Shells

Directory Structure of data

berts
- bert
  - config.json
  - vocab.txt
  - pytorch_model.bin
dataset
- NER
  - weibo
  - note4
  - msra
  - resume
- POS
  - ctb5
  - ctb6
  - ud1
  - ud2
- CWS
  - ctb6
  - msr
  - pku
vocab
- tencent_vocab.txt, the vocab of pre-trained word embedding table.
embedding
- word_embedding.txt
result
- NER
  - weibo
  - note4
  - msra
  - resume
- POS
  - ctb5
  - ctb6
  - ud1
  - ud2
- CWS
  - ctb6
  - msr
  - pku
log
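
Before running anything, it can help to confirm that the layout above is in place. The following is a minimal sketch: the helper `missing_paths` is hypothetical, and the root directory name `data` is inferred from the command lines quoted in the comments further down this page.

```python
import os
from typing import List

# Paths taken from the listing above, relative to the data root.
EXPECTED_PATHS = [
    "berts/bert/config.json",
    "berts/bert/vocab.txt",
    "berts/bert/pytorch_model.bin",
    "dataset",
    "vocab/tencent_vocab.txt",
    "embedding/word_embedding.txt",
    "result",
    "log",
]


def missing_paths(data_root: str) -> List[str]:
    """Return the expected files and directories that do not exist under data_root."""
    return [
        rel for rel in EXPECTED_PATHS
        if not os.path.exists(os.path.join(data_root, rel))
    ]


# "data" as the root directory is an assumption; adjust it to your checkout.
print(missing_paths("data"))
```

An empty list means every expected entry was found; anything else names what still has to be downloaded or created.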

Run

1. Convert the .char.bmes file to a .json file: python3 to_json.py
2. Run the shell script: sh run_ner.sh

If you want to load my checkpoints, you need to modify your copy of transformers.

My models were trained in distributed mode, so the checkpoints cannot be loaded directly in single-GPU mode. Follow the steps below to modify transformers before loading my checkpoints.

1. Enter the transformers source directory: cd source/transformers-master
2. Open modeling_utils.py and go to around line 995.
3. Change the code at that location.
4. Compile and install the modified source: python3 setup.py install
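
A common reason a checkpoint saved from distributed training fails to load in single-GPU mode is that DistributedDataParallel prefixes every parameter name with `module.`. Whether the LEBERT checkpoints carry this prefix is an assumption here, not something this page confirms. Under that assumption, the keys can be renamed without patching the library. The helper name `strip_module_prefix` is hypothetical.

```python
from collections import OrderedDict
from typing import Any, Dict, Mapping


def strip_module_prefix(state_dict: Mapping[str, Any], prefix: str = "module.") -> Dict[str, Any]:
    """Return a copy of state_dict with a leading prefix removed from each key."""
    cleaned: Dict[str, Any] = OrderedDict()
    for key, value in state_dict.items():
        # Only a prefix at the very start of the key is removed.
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Toy stand-in for a real checkpoint; the values would normally be tensors.
demo = OrderedDict([("module.bert.embeddings.weight", 1), ("crf.transitions", 2)])
print(list(strip_module_prefix(demo)))
```

With PyTorch, the real state dict would come from `torch.load(path, map_location="cpu")` and the cleaned result would be passed to `model.load_state_dict(...)`.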

Cite

@misc{liu2021lexicon,
      title={Lexicon Enhanced Chinese Sequence Labeling Using BERT Adapter}, 
      author={Wei Liu and Xiyan Fu and Yue Zhang and Wenming Xiao},
      year={2021},
      eprint={2105.07148},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Comments

IndexError: list index out of range

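Hello, and thank you for open-sourcing this. When I run do_evaluate and do_predict, I get an error with both the data you provide and my own data. I don't know what causes it and would like to ask for your advice.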

Traceback (most recent call last):
  File "Trainer.py", line 598, in <module>
    main()
  File "Trainer.py", line 574, in main
    train(model, args, train_dataset, dev_dataset, test_dataset, label_vocab, tb_writer)
  File "Trainer.py", line 377, in train
    metrics, _ = evaluate(model, args, dev_dataset, label_vocab, global_step, description="Dev", write_file=True)
  File "Trainer.py", line 465, in evaluate
    all_label_ids, all_predict_ids, all_attention_mask, label_vocab)
  File "LEBERT/function/metrics.py", line 40, in seq_f1_with_mask
    tmp_pred.append(label_vocab.convert_id_to_item(all_pred_labels[i][j]).replace("M-", "I-"))
  File "LEBERT/feature/vocab.py", line 84, in convert_id_to_item
    return self.idx2item[id]
IndexError: list index out of range

opened by bultiful 21
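Problem running the NER code

Hello, and first of all, thank you for open-sourcing the code! I ran into some problems while trying to reproduce the results in the paper and wonder whether you could find time to help.
1. When running the Weibo NER experiment with the same hyperparameters as in the paper, the training loss behaves abnormally (it oscillates as it decreases, and F1 on both the dev and test sets is 0 for the first few epochs). I have sent the training log by email.
2. Environment and run settings: GPU: A100-SXM4-40GB; torch: 1.8.1+cu111; training mode: single GPU.
Looking forward to your reply and guidance. Thank you!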

opened by JucksonP 18
IndexError: list index out of range
def evaluate(model, args, dataset, label_vocab, global_step, description="dev", write_file=False):
    """
    evaluate the model's performance
    """
    dataloader = get_dataloader(dataset, args, mode='dev')
    if (not args.do_train) and (not args.no_cuda) and args.local_rank != -1:
        model = model.cuda()
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True
        )

    batch_size = dataloader.batch_size
    if args.local_rank == 0 or args.local_rank == -1:
        logger.info("***** Running %s *****", description)
        logger.info(" Num examples = %d", len(dataloader.dataset))
        logger.info(" Batch size = %d", batch_size)
    eval_losses = []
    model.eval()
    all_input_ids = None
    all_label_ids = None
    all_predict_ids = None
    all_attention_mask = None
    for batch in tqdm(dataloader, desc=description):
        # new batch data: [input_ids, token_type_ids, attention_mask, matched_word_ids,
        # matched_word_mask, boundary_ids, labels
        batch_data = (batch[0], batch[2], batch[1], batch[3], batch[4], batch[5], batch[6])
        new_batch = batch_data
        batch = tuple(t.to(args.device) for t in new_batch)
        inputs = {"input_ids": batch[0], "attention_mask": batch[1], "token_type_ids": batch[2],
                  "matched_word_ids": batch[3], "matched_word_mask": batch[4],
                  "boundary_ids": batch[5], "labels": batch[6], "flag": "Predict"}
        batch_data = None
        new_batch = None
        with torch.no_grad():
            outputs = model(**inputs)
            preds = outputs[0]

=========================================================================
Training runs without problems, but weibo/labels.txt has 28 tags; adding the two special tags '' and '' gives 30, yet the value 31 appears in the predictions.

O B-PER.NOM E-PER.NOM B-LOC.NAM E-LOC.NAM B-PER.NAM I-PER.NAM E-PER.NAM S-PER.NOM B-GPE.NAM E-GPE.NAM B-ORG.NAM I-ORG.NAM E-ORG.NAM I-PER.NOM S-GPE.NAM B-ORG.NOM E-ORG.NOM I-LOC.NAM I-ORG.NOM B-LOC.NOM I-LOC.NOM E-LOC.NOM B-GPE.NOM E-GPE.NOM I-GPE.NAM S-PER.NAM S-LOC.NOM

class ItemVocabFile():
    """
    Build vocab from file.
    Note, each line is a item in vocab, or each items[0] is in vocab
    """
    def __init__(self, files, is_word=False, has_default=False, unk_num=0):
        self.files = files
        self.item2idx = {}
        self.idx2item = []
        self.item_size = 0
        self.is_word = is_word
        if not has_default and not self.is_word:
            self.item2idx[''] = self.item_size
            self.idx2item.append('')
            self.item_size += 1
            self.item2idx[''] = self.item_size
            self.idx2item.append('')
            self.item_size += 1
            # for unk words
            for i in range(unk_num):
                self.item2idx['{}'.format(i+1)] = self.item_size
                self.idx2item.append('{}'.format(i+1))
                self.item_size += 1

        self.init_vocab()
        print('=======labels info========')
        print(self.item2idx)
        print(self.idx2item)

=======labels info========
{'': 0, '': 1, 'O': 2, 'B-PER.NOM': 3, 'E-PER.NOM': 4, 'B-LOC.NAM': 5, 'E-LOC.NAM': 6, 'B-PER.NAM': 7, 'I-PER.NAM': 8, 'E-PER.NAM': 9, 'S-PER.NOM': 10, 'B-GPE.NAM': 11, 'E-GPE.NAM': 12, 'B-ORG.NAM': 13, 'I-ORG.NAM': 14, 'E-ORG.NAM': 15, 'I-PER.NOM': 16, 'S-GPE.NAM': 17, 'B-ORG.NOM': 18, 'E-ORG.NOM': 19, 'I-LOC.NAM': 20, 'I-ORG.NOM': 21, 'B-LOC.NOM': 22, 'I-LOC.NOM': 23, 'E-LOC.NOM': 24, 'B-GPE.NOM': 25, 'E-GPE.NOM': 26, 'I-GPE.NAM': 27, 'S-PER.NAM': 28, 'S-LOC.NOM': 29}
['', '', 'O', 'B-PER.NOM', 'E-PER.NOM', 'B-LOC.NAM', 'E-LOC.NAM', 'B-PER.NAM', 'I-PER.NAM', 'E-PER.NAM', 'S-PER.NOM', 'B-GPE.NAM', 'E-GPE.NAM', 'B-ORG.NAM', 'I-ORG.NAM', 'E-ORG.NAM', 'I-PER.NOM', 'S-GPE.NAM', 'B-ORG.NOM', 'E-ORG.NOM', 'I-LOC.NAM', 'I-ORG.NOM', 'B-LOC.NOM', 'I-LOC.NOM', 'E-LOC.NOM', 'B-GPE.NOM', 'E-GPE.NOM', 'I-GPE.NAM', 'S-PER.NAM', 'S-LOC.NOM']
opened by s1162276945 7
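
The traceback in these reports ends in `self.idx2item[id]`, so the error means that a predicted id is at least as large as the label list: with 30 entries the valid ids are 0 to 29. The following is a minimal sketch for locating such ids before decoding. The helper names are hypothetical, and the special-token names `<pad>` and `<unk>` are placeholders, since the page shows them only as empty strings.

```python
from typing import List, Sequence


def build_idx2item(labels: Sequence[str], special_tokens: Sequence[str]) -> List[str]:
    """Special tokens first, then the labels, mirroring the order printed above."""
    return list(special_tokens) + list(labels)


def out_of_range_ids(pred_ids: Sequence[int], vocab_size: int) -> List[int]:
    """Return the distinct predicted ids that cannot be looked up in the vocab."""
    return sorted({i for i in pred_ids if i < 0 or i >= vocab_size})


labels = ["O", "B-PER.NOM", "E-PER.NOM"]
# "<pad>" and "<unk>" are assumed names for the two special entries.
idx2item = build_idx2item(labels, ["<pad>", "<unk>"])
preds = [2, 3, 5]
print(out_of_range_ids(preds, len(idx2item)))  # → [5]
```

If this check reports any ids, the number of classes the model predicts and the number of entries built from labels.txt disagree, which is worth verifying before looking elsewhere.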

IndexError: list index out of range

When I run run_ner.sh with --do_evaluate, I encounter the following error:

Traceback (most recent call last):
  File "Trainer.py", line 598, in <module>
    main()
  File "Trainer.py", line 574, in main
    train(model, args, train_dataset, dev_dataset, test_dataset, label_vocab, tb_writer)
  File "Trainer.py", line 377, in train
    metrics, _ = evaluate(model, args, dev_dataset, label_vocab, global_step, description="Dev", write_file=True)
  File "Trainer.py", line 465, in evaluate
    all_label_ids, all_predict_ids, all_attention_mask, label_vocab)
  File "LEBERT/function/metrics.py", line 40, in seq_f1_with_mask
    tmp_pred.append(label_vocab.convert_id_to_item(all_pred_labels[i][j]).replace("M-", "I-"))
  File "LEBERT/feature/vocab.py", line 84, in convert_id_to_item
    return self.idx2item[id]
IndexError: list index out of range

opened by MisakaMikoto96 7

Running the shell script fails after modifying the transformers source code

/opt/conda/lib/python3.6/site-packages/transformers-4.7.0.dev0-py3.6.egg/transformers/tokenization_utils_base.py:1631: FutureWarning: Calling BertTokenizer.from_pretrained() with the path to a single file or url is deprecated and won't be possible anymore in v5. Use a model identifier or the path to a directory instead.
  FutureWarning,
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'Trainer.py', '--local_rank=0', '--do_train', '--do_eval', '--do_predict', '--evaluate_during_training', '--data_dir=data/dataset/NER/weibo', '--output_dir=data/result/NER/weibo/lebertcrf', '--config_name=data/berts/bert/config.json', '--model_name_or_path=data/berts/bert/pytorch_model.bin', '--vocab_file=data/berts/bert/vocab.txt', '--word_vocab_file=data/vocab/tencent_vocab.txt', '--max_scan_num=100', '--max_word_num=5', '--label_file=data/dataset/NER/weibo/labels.txt', '--word_embedding=data/embedding/word_embedding.txt', '--saved_embedding_dir=data/dataset/NER/weibo', '--model_type=LEBertCRF_Token', '--seed=106524', '--per_gpu_train_batch_size=4', '--per_gpu_eval_batch_size=4', '--learning_rate=1e-5', '--max_steps=-1', '--max_seq_length=256', '--num_train_epochs=2', '--warmup_steps=190', '--save_steps=600', '--logging_steps=600']' died with <Signals.SIGKILL: 9>.

opened by s1162276945 6

Trainer does not support the LEBertCRF_Token model type

After preparing all the datasets and running ./run_ner.sh, I get the following error:

Traceback (most recent call last):
  File "Trainer.py", line 593, in <module>
    main()
  File "Trainer.py", line 552, in main
    model = model.cuda()
UnboundLocalError: local variable 'model' referenced before assignment
Traceback (most recent call last):
  File "/home/human/miniconda3/envs/qzqExp/lib/python3.7/runpy.py", line 193, in _run_module_as_main

After checking the code, the relevant lines are:

https://github.com/liuwei1206/LEBERT/blob/main/Trainer.py#L537-L550

There is no 'LEBertCRF_Token' branch in Trainer.py.

opened by wj-Mcat 5
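
An UnboundLocalError of this kind typically arises when a chain of `if`/`elif` branches selects the model and none of them matches, so the variable is never assigned. The actual selection code in Trainer.py is not reproduced on this page; the following is a minimal sketch of a lookup that fails with a clear message instead. The function `build_model` and the placeholder builder are hypothetical, and the name `WCBertCRF_Token` is taken from other commands quoted on this page.

```python
from typing import Callable, Dict


def build_model(model_type: str, builders: Dict[str, Callable[[], object]]) -> object:
    """Create the model for model_type, or fail with a readable error."""
    try:
        builder = builders[model_type]
    except KeyError:
        known = ", ".join(sorted(builders))
        raise ValueError(
            f"unsupported model_type {model_type!r}; expected one of: {known}"
        ) from None
    return builder()


# Placeholder builder; a real one would construct and return the model object.
builders = {"WCBertCRF_Token": lambda: "WCBertCRF model placeholder"}
print(build_model("WCBertCRF_Token", builders))
try:
    build_model("LEBertCRF_Token", builders)
except ValueError as err:
    print(err)
```

Failing at the lookup points directly at the unsupported `--model_type` value rather than at a later line that merely uses the missing variable.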

Where can I download the checkpoints?

Hello, I have downloaded all the files you provide on GitHub, but I could not find the checkpoints. Could you tell me how to download them? A direct link would be ideal. Thank you.

opened by dghlnvyps 4
ontonote4 checkpoint and experiment replication problems
Hi, I ran into some problems when replicating your experiments.

I used the ontonote4 checkpoint you provided to run prediction. It shows that crf.transitions in the checkpoint has shape torch.Size([20, 20]), but this dataset has only 17 labels, so crf.transitions should be [19, 19]. Could you please check the ontonote4 checkpoint file?

I trained the models on the weibo and ontonote4 datasets on a single GPU without changing any code or parameters. However, the best F1 score is 0.68 on weibo and 0.80 on ontonote4, both lower than your results. If this is because you used distributed training, could you please provide the detailed parameters for distributed training, or single-GPU training parameters that reach your scores?

Many thanks in advance.
opened by founting 4
Results all "O" on weibo

Your work is really amazing! I am currently studying your code. When I train LEBERT on the weibo dataset, the predicted labels are all "O", even though I have not changed your code. However, the checkpoint you provide does give good results. What could be the reason for this? How can I train a model myself that matches the checkpoints you provide? I would really appreciate your help!

opened by Ononoki-Yotsugi 3
Reproducing the note4 results

@liuwei1206 Hello, my script follows the provided note4 shell script and is set up as follows:

CUDA_VISIBLE_DEVICES=0 python3 -m torch.distributed.launch --master_port 13017 --nproc_per_node=1 \
    Trainer.py --do_train --do_eval --do_predict --evaluate_during_training \
    --data_dir="data/dataset/NER/note4" \
    --output_dir="data/result/NER/note4/wcbertcrf" \
    --config_name="data/berts/bert/config.json" \
    --model_name_or_path="/home/root1/lizheng/pretrainModels/torch/chinese/bert-base-chinese/pytorch_model.bin" \
    --vocab_file="/home/root1/lizheng/pretrainModels/torch/chinese/bert-base-chinese/vocab.txt" \
    --word_vocab_file="data/vocab/tencent_vocab.txt" \
    --max_scan_num=1500000 \
    --max_word_num=5 \
    --label_file="data/dataset/NER/note4/labels.txt" \
    --word_embedding="data/embedding/word_embedding.txt" \
    --saved_embedding_dir="data/dataset/NER/note4" \
    --model_type="WCBertCRF_Token" \
    --seed=106524 \
    --per_gpu_train_batch_size=4 \
    --per_gpu_eval_batch_size=32 \
    --learning_rate=1e-5 \
    --max_steps=-1 \
    --max_seq_length=256 \
    --num_train_epochs=20 \
    --warmup_steps=190 \
    --save_steps=600 \
    --logging_steps=300

However, the test F1 is only around 80. Is this because you trained on multiple GPUs while I trained on a single GPU? Could you check whether my script is correct?

opened by 447428054 3
Problems on my own task

Here's the running error.

Calling BertTokenizer.from_pretrained() with the path to a single file or url is deprecated
Traceback (most recent call last):
  File "/home/wen/anaconda3/envs/pytorch/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/wen/anaconda3/envs/pytorch/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/wen/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/home/wen/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/wen/anaconda3/envs/pytorch/bin/python3', '-u', 'Trainer.py', '--local_rank=0', '--do_train', '--do_eval', '--do_predict', '--evaluate_during_training', '--data_dir=data/dataset/COIE/origin', '--output_dir=data/result/COIE/origin/lebertcrf', '--config_name=data/berts/bert/config.json', '--model_name_or_path=data/berts/bert/pytorch_model.bin', '--vocab_file=data/berts/bert/vocab.txt', '--word_vocab_file=data/vocab/tencent_vocab.txt', '--max_scan_num=1000000', '--max_word_num=5', '--label_file=data/dataset/COIE/origin/labels.txt', '--word_embedding=data/embedding/Tencent_AILab_ChineseEmbedding.txt', '--saved_embedding_dir=data/dataset/COIE/origin', '--model_type=WCBertCRF_Token', '--seed=106524', '--per_gpu_train_batch_size=4', '--per_gpu_eval_batch_size=16', '--learning_rate=1e-5', '--max_steps=-1', '--max_seq_length=256', '--num_train_epochs=20', '--warmup_steps=190', '--save_steps=600', '--logging_steps=100']' died with <Signals.SIGKILL: 9>.

And here's the log.

2021-07-07 15:40:29:INFO: Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
2021-07-07 15:40:29:INFO: Training/evaluation parameters Namespace(adam_epsilon=1e-08, config_name='data/berts/bert/config.json', data_dir='data/dataset/COIE/origin', default_label='O', device=device(type='cuda', index=0), do_eval=True, do_predict=True, do_shuffle=True, do_train=True, evaluate_during_training=True, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, label_file='data/dataset/COIE/origin/labels.txt', learning_rate=1e-05, local_rank=0, logging_dir='data/log', logging_steps=100, max_grad_norm=1.0, max_scan_num=1000000, max_seq_length=256, max_steps=-1, max_word_num=5, model_name_or_path='data/berts/bert/pytorch_model.bin', model_type='WCBertCRF_Token', n_gpu=1, no_cuda=False, nodes=1, num_train_epochs=20, output_dir='data/result/COIE/origin/lebertcrf', overwrite_cache=True, per_gpu_eval_batch_size=16, per_gpu_train_batch_size=4, save_steps=600, save_total_limit=50, saved_embedding_dir='data/dataset/COIE/origin', seed=106524, sgd_momentum=0.9, vocab_file='data/berts/bert/vocab.txt', warmup_steps=190, weight_decay=0.0, word_embed_dim=200, word_embedding='data/embedding/Tencent_AILab_ChineseEmbedding.txt', word_vocab_file='data/vocab/tencent_vocab.txt')

Hope you can reply. Thx.

opened by cjwen15 3
