Source code for the paper "PLOME: Pre-training with Misspelled Knowledge for Chinese Spelling Correction" in ACL2021

Last update: Nov 26, 2022

Overview

PLOME: Pre-training with Misspelled Knowledge for Chinese Spelling Correction (ACL2021)

This repository provides the code and data for the ACL2021 paper "PLOME: Pre-training with Misspelled Knowledge for Chinese Spelling Correction".

Requirements:

python3
tensorflow 1.14
horovod

Instructions:

Fine-tune:

Training and evaluation file format: each line holds the original sentence and the golden (corrected) sentence, separated by a tab: original sentence \t golden sentence
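The file format above can be illustrated with a small reader. This is only a sketch, not code from the repository: the helper name `read_pairs` and the two sample sentences are made up for illustration, and the in-memory `io.StringIO` object stands in for a real training or evaluation file.

```python
import io


def read_pairs(lines):
    """Yield (original, golden) sentence pairs from tab-separated lines."""
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 2:
            raise ValueError(
                f"line {line_no}: expected 2 tab-separated fields, got {len(parts)}"
            )
        yield parts[0], parts[1]


# Made-up sample data standing in for a real file (hypothetical sentences).
sample = io.StringIO("我喜欢吃平果\t我喜欢吃苹果\n今天天气很好\t今天天气很好\n")
pairs = list(read_pairs(sample))

for original, golden in pairs:
    # Character positions where the original differs from the golden sentence.
    diffs = [i for i, (a, b) in enumerate(zip(original, golden)) if a != b]
    print(original, golden, diffs)
```

To read an actual file, pass an open file handle (for example `open(path, encoding="utf-8")`) in place of `sample`.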

step1: cd finetune_src ; 
step2: download the pretrained PLOME model and corpus from https://drive.google.com/file/d/1aip_siFdXynxMz6-2iopWvJqr5jtUu3F/view?usp=sharing ;
step3: sh start.sh

Pre-train:
```
step1: cd pre_train_src ;
step2: sh gen_train_tfrecords.sh ;
step3: sh start.sh
```
Our pre-trained model: https://drive.google.com/file/d/1aip_siFdXynxMz6-2iopWvJqr5jtUu3F/view?usp=sharing

Comments

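Test dataset and evaluation method?

Thanks for releasing the source code!

I fine-tuned the released PLOME model on the released training set and evaluated it on the released test set (542 examples). The final results were:

pinyin result:
token check: p=0.933, r=0.874, f=0.902
token correction: p=0.975, r=0.852, f=0.909

fusion result:
token check: p=0.936, r=0.871, f=0.902
token correction: p=0.966, r=0.841, f=0.899
sent check: p=0.867, r=0.800, f=0.832
sent correction: p=0.841, r=0.776, f=0.807

Are these numbers correct?

The paper reports its results on what appears to be a test set of 1,100 examples, however.

Do the two sets of results use the same evaluation method?

I would appreciate a reply.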
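The f values quoted in this report are consistent with the usual F1 score, the harmonic mean of precision and recall. The check below is a sketch under that assumption; the repository's evaluation script is not shown here, and the function name `f1` and the 0.001 tolerance (to absorb rounding of the printed p and r) are choices made for this illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (p, r, f) triples copied from the report above.
reported = {
    "pinyin token check": (0.933, 0.874, 0.902),
    "pinyin token correction": (0.975, 0.852, 0.909),
    "fusion token check": (0.936, 0.871, 0.902),
    "fusion token correction": (0.966, 0.841, 0.899),
    "fusion sent check": (0.867, 0.800, 0.832),
    "fusion sent correction": (0.841, 0.776, 0.807),
}

# Largest gap between the recomputed F1 and the reported f value.
max_gap = max(abs(f1(p, r) - f) for p, r, f in reported.values())

for name, (p, r, f) in reported.items():
    print(f"{name}: recomputed={f1(p, r):.4f} reported={f:.3f}")
```

This only confirms how f relates to p and r. It says nothing about whether the 542-example and 1,100-example test sets were scored the same way.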

opened by wangwang110 8
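Prediction code

Hello, I have some questions about implementing prediction:
(1) I extracted the evaluate() code from train_eval_tagging.py and modified it.
(2) Because test samples have no labels, I removed the label-related code from bert_tagging.py while implementing predict. Calling create_model() then raised some errors. Do I need to add graph=tf.get_default_graph() before create_model()?
Also, why do I see intermediate outputs such as (?, 40, 768) and (?, 4, 32)? Does the ? mean that something is wrong with the output?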

opened by TommyTang930 1
Using cbert

Hello, is cbert the final trained model? Using cbert directly for evaluation raises these errors:

Key loss/output_py_bias not found in checkpoint
Not found: Key py_emb/GRU/rnn/gru_cell/candidate/bias not found in checkpoint

You should use the plome model instead.

opened by joyJZhang 0
Problem running the program

Hello, I ran into the following error when running your program:

Traceback (most recent call last):
  File "G:/PLOME-main/finetune_src/train_eval_tagging.py", line 268, in
    print('\tsk_or_py=', FLAGS.sk_or_py)
  File "G:/PLOME-main/finetune_src/train_eval_tagging.py", line 203, in train
    start_time = time.time()
  File "G:\PLOME-main\venv\lib\site-packages\tensorflow\python\client\session.py", line 950, in run
    run_metadata_ptr)
  File "G:\PLOME-main\venv\lib\site-packages\tensorflow\python\client\session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "G:\PLOME-main\venv\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _do_run
    run_metadata)
  File "G:\PLOME-main\venv\lib\site-packages\tensorflow\python\client\session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Key: stroke_ids. Can't parse serialized Example.
    [[{{node ParseSingleExample/ParseSingleExample}}]]
    [[IteratorGetNext]]
    [[IteratorGetNext/_475]]
  (1) Invalid argument: Key: stroke_ids. Can't parse serialized Example.
    [[{{node ParseSingleExample/ParseSingleExample}}]]
    [[IteratorGetNext]]
0 successful operations.
0 derived errors ignored.

Do you have a solution for this? My local environment matches the one described in the README. Thank you.

opened by zhoukaiwei66 3

About the mask strategy: does it only mask the leading characters?

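```python
import copy
import random

import numpy as np


# Method of the Mask class (excerpt); `self` carries the config and confusion sets.
def mask_process(self, input_sample):
    # Positions that may be masked (tokens listed in self.invalid_ids are skipped).
    valid_ids = [idx for (idx, v) in enumerate(input_sample) if v not in self.invalid_ids]
    masked_sample = copy.deepcopy(list(input_sample))
    seq_len = len(masked_sample)
    masked_flgs = [0] * seq_len
    # Mask a fixed fraction of the valid positions, but at least one.
    n_masked = int(len(valid_ids) * self.config['global_rate'])
    if n_masked < 1:
        n_masked = 1
    # Shuffle in place, then take the first n_masked entries of the shuffled list.
    random.shuffle(valid_ids)
    for pos in valid_ids[:n_masked]:
        method = self.get_mask_method()
        if method == 'pinyin':
            # Replace with a character from the same-pinyin confusion set.
            new_c = self.same_py_confusion.get_confusion_item_by_ids(input_sample[pos])
            if new_c is not None:
                masked_sample[pos] = new_c
                masked_flgs[pos] = 1
        elif method == 'jinyin':
            # Replace with a character from the similar-pinyin confusion set.
            new_c = self.simi_py_confusion.get_confusion_item_by_ids(input_sample[pos])
            if new_c is not None:
                masked_sample[pos] = new_c
                masked_flgs[pos] = 1
        elif method == 'stroke':
            # Replace with a character from the stroke confusion set.
            new_c = self.sk_confusion.get_confusion_item_by_ids(input_sample[pos])
            if new_c is not None:
                masked_sample[pos] = new_c
                masked_flgs[pos] = 1
        elif method == 'random':
            # random.randint includes both endpoints, so this relies on
            # n_all_token_ids being the last valid index (an assumption about the class setup).
            new_c = self.all_token_ids[random.randint(0, self.n_all_token_ids)]
            if new_c is not None:
                masked_sample[pos] = new_c
                masked_flgs[pos] = 1
        elif method == 'keep':
            # Leave the token unchanged but still flag the position.
            masked_flgs[pos] = 1
    # Map the masked tokens to pinyin and stroke ids; unknown tokens fall back to id 1.
    masked_py_ids = [self.tokenid_pyid.get(x, 1) for x in masked_sample]
    masked_sk_ids = [self.tokenid_skid.get(x, 1) for x in masked_sample]
    return np.asarray(masked_sample, dtype=np.int32), np.asarray(masked_flgs, dtype=np.int32), np.asarray(masked_py_ids, dtype=np.int32), np.asarray(masked_sk_ids, dtype=np.int32)
```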

This is the code of the Mask class in pre_train_src/mask.py. It looks as if, for every sample, only the leading characters are masked.
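One way to read the excerpt: `random.shuffle(valid_ids)` runs before the slice `valid_ids[:n_masked]`, so the slice takes the first entries of an already shuffled list, which are random positions in the sentence. The sketch below isolates just that selection step. It is not the repository's code: the function name `pick_mask_positions`, the rate of 0.25 and the 20-token sentence are hypothetical.

```python
import random


def pick_mask_positions(valid_ids, global_rate, rng):
    """Mimic the shuffle-then-slice selection from mask_process."""
    ids = list(valid_ids)
    n_masked = max(1, int(len(ids) * global_rate))
    rng.shuffle(ids)       # randomize the order of the candidate positions
    return ids[:n_masked]  # the first n_masked of the shuffled order


rng = random.Random(0)  # seeded only to make the sketch repeatable
# Hypothetical sentence with 20 maskable positions, numbered 1..20.
positions = pick_mask_positions(range(1, 21), 0.25, rng)
print(sorted(positions))
```

Under this reading the chosen positions are spread across the sentence and change from call to call, rather than always being the first few characters.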

opened by liwenju0 0
