Source code for the paper "PLOME: Pre-training with Misspelled Knowledge for Chinese Spelling Correction" in ACL2021

Overview

PLOME: Pre-training with Misspelled Knowledge for Chinese Spelling Correction (ACL 2021)

This repository provides the code and data for the ACL 2021 paper "PLOME: Pre-training with Misspelled Knowledge for Chinese Spelling Correction".

Requirements:

  • Python 3

  • TensorFlow 1.14

  • Horovod

Instructions:

  • Fine-tune:

    Training and evaluation file format: original sentence \t golden sentence (the two fields are separated by a tab)

    step1: cd finetune_src ;
    step2: download the pre-trained PLOME model and corpus from https://drive.google.com/file/d/1aip_siFdXynxMz6-2iopWvJqr5jtUu3F/view?usp=sharing ;
    step3: sh start.sh
  • Pre-train:

    step1: cd pre_train_src ;
    step2: sh gen_train_tfrecords.sh ;
    step3: sh start.sh

    Our pre-trained model: https://drive.google.com/file/d/1aip_siFdXynxMz6-2iopWvJqr5jtUu3F/view?usp=sharing
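
The fine-tuning file format described above (original sentence, a tab, then the golden sentence) can be checked with a few lines of Python before training. This is a minimal sketch, not code from the repository: the function names `read_pairs` and `differing_positions` are hypothetical, and it assumes one example per line and that both sentences have the same length, which the README does not state.

```python
import io


def read_pairs(lines):
    """Yield (original, golden) pairs from tab-separated lines.

    Assumption: one example per line, exactly two fields per line.
    """
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 2:
            raise ValueError(
                f"line {line_no}: expected 2 tab-separated fields, got {len(parts)}"
            )
        yield parts[0], parts[1]


def differing_positions(original, golden):
    """Return the character indices where the two sentences differ.

    Assumption: both sentences have the same length (substitution errors only).
    """
    if len(original) != len(golden):
        raise ValueError("sentences differ in length")
    return [i for i, (a, b) in enumerate(zip(original, golden)) if a != b]


# Two illustrative lines: one with a misspelled character, one already correct.
sample = io.StringIO("我喜欢吃平果\t我喜欢吃苹果\n今天天气很好\t今天天气很好\n")
pairs = list(read_pairs(sample))
for original, golden in pairs:
    print(original, golden, differing_positions(original, golden))
```

Running a check like this over a training file surfaces lines with a missing tab or mismatched lengths before they reach the data pipeline.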

Comments
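  • Test dataset and evaluation method?

    Thanks for releasing the source code!

    I fine-tuned the released PLOME model on the released training set and evaluated it on the released test set (542 sentences). The final results were:

    pinyin result:
    token check: p=0.933, r=0.874, f=0.902
    token correction: p=0.975, r=0.852, f=0.909
    fusion result:
    token check: p=0.936, r=0.871, f=0.902
    token correction: p=0.966, r=0.841, f=0.899
    sent check: p=0.867, r=0.800, f=0.832
    sent correction: p=0.841, r=0.776, f=0.807

    Are these numbers correct?

    The results given in the paper are shown in an image (not reproduced here).

    However, the test set used there seems to contain 1,100 sentences.

    Are the two sets of results computed with the same evaluation method?

    Any clarification would be appreciated.

    opened by wangwang110 8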
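
As a quick sanity check on the numbers quoted in this issue, the reported f values can be recomputed from the reported p and r. The sketch below assumes f is the standard F1 score (the harmonic mean of precision and recall); that is an assumption about the evaluation script, not something the issue confirms. It only checks internal consistency and says nothing about whether the evaluation protocol matches the paper's.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (p, r, f) triples copied from the issue text above.
reported = {
    "pinyin token check": (0.933, 0.874, 0.902),
    "pinyin token correction": (0.975, 0.852, 0.909),
    "fusion token check": (0.936, 0.871, 0.902),
    "fusion token correction": (0.966, 0.841, 0.899),
    "sent check": (0.867, 0.800, 0.832),
    "sent correction": (0.841, 0.776, 0.807),
}

for name, (p, r, f) in reported.items():
    # p and r are already rounded to three decimals, so small gaps are expected.
    print(f"{name}: reported f={f:.3f}, recomputed f={f1(p, r):.4f}")
```

Because the inputs are rounded, a recomputed value can differ from the reported one in the last digit without indicating a problem.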
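  • Prediction code

    Hello, I have a question about implementing the prediction code:
    ① Starting from train_eval_tagging.py, I extracted the evaluate() code and modified it.
    ② Because test samples have no labels at prediction time, I removed the label-related code from bert_tagging.py. Calling create_model() then raised some errors. Do I need to add graph=tf.get_default_graph() before create_model()?
    Also, why do some of my intermediate outputs look like (?, 40, 768) and (?, 4, 32)? Does the ? mean something is wrong with the output?

    opened by TommyTang930 1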
  • Using cbert

    Hello, is cbert the final trained model? Running evaluation directly with cbert raises these errors:

    Key loss/output_py_bias not found in checkpoint
    Not found: Key py_emb/GRU/rnn/gru_cell/candidate/bias not found in checkpoint

    The plome model is probably the one that should be used.

    opened by joyJZhang 0
  • Problem running the program

    Hello, I ran into the following error when running your program:

    Traceback (most recent call last):
      File "G:/PLOME-main/finetune_src/train_eval_tagging.py", line 268, in <module>
        print('\tsk_or_py=', FLAGS.sk_or_py)
      File "G:/PLOME-main/finetune_src/train_eval_tagging.py", line 203, in train
        start_time = time.time()
      File "G:\PLOME-main\venv\lib\site-packages\tensorflow\python\client\session.py", line 950, in run
        run_metadata_ptr)
      File "G:\PLOME-main\venv\lib\site-packages\tensorflow\python\client\session.py", line 1173, in _run
        feed_dict_tensor, options, run_metadata)
      File "G:\PLOME-main\venv\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _do_run
        run_metadata)
      File "G:\PLOME-main\venv\lib\site-packages\tensorflow\python\client\session.py", line 1370, in _do_call
        raise type(e)(node_def, op, message)
    tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
      (0) Invalid argument: Key: stroke_ids. Can't parse serialized Example.
         [[{{node ParseSingleExample/ParseSingleExample}}]]
         [[IteratorGetNext]]
         [[IteratorGetNext/_475]]
      (1) Invalid argument: Key: stroke_ids. Can't parse serialized Example.
         [[{{node ParseSingleExample/ParseSingleExample}}]]
         [[IteratorGetNext]]
    0 successful operations.
    0 derived errors ignored.

    Do you have a solution for this problem? My local environment matches the one described in the README. Thank you.

    opened by zhoukaiwei66 3
  • About the mask strategy: are only the leading characters masked?

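    # Imports this method relies on (assumed to sit at module level in mask.py).
    import copy
    import random
    import numpy as np

    def mask_process(self, input_sample):
        # Positions eligible for masking: every token id not listed in invalid_ids.
        valid_ids = [idx for (idx, v) in enumerate(input_sample) if v not in self.invalid_ids]
        masked_sample = copy.deepcopy(list(input_sample))
        seq_len = len(masked_sample)
        masked_flgs = [0] * seq_len
        # Mask a fixed fraction of the eligible positions, but at least one.
        n_masked = int(len(valid_ids) * self.config['global_rate'])
        if n_masked < 1:
            n_masked = 1
        # Shuffle in place, so the slice below is a random subset of positions.
        random.shuffle(valid_ids)
        for pos in valid_ids[:n_masked]:
            method = self.get_mask_method()
            if method == 'pinyin':
                # Replace with a character from the same-pinyin confusion set.
                new_c = self.same_py_confusion.get_confusion_item_by_ids(input_sample[pos])
                if new_c is not None:
                    masked_sample[pos] = new_c
                    masked_flgs[pos] = 1
            elif method == 'jinyin':
                # Replace with a character from the similar-pinyin confusion set.
                new_c = self.simi_py_confusion.get_confusion_item_by_ids(input_sample[pos])
                if new_c is not None:
                    masked_sample[pos] = new_c
                    masked_flgs[pos] = 1
            elif method == 'stroke':
                # Replace with a character from the stroke confusion set.
                new_c = self.sk_confusion.get_confusion_item_by_ids(input_sample[pos])
                if new_c is not None:
                    masked_sample[pos] = new_c
                    masked_flgs[pos] = 1
            elif method == 'random':
                # random.randint is inclusive at both ends, so n_all_token_ids
                # must be at most len(all_token_ids) - 1 to stay in range.
                new_c = self.all_token_ids[random.randint(0, self.n_all_token_ids)]
                if new_c is not None:
                    masked_sample[pos] = new_c
                    masked_flgs[pos] = 1
            elif method == 'keep':
                # Leave the token unchanged but still flag the position.
                masked_flgs[pos] = 1
        # Look up pinyin and stroke ids for the masked sequence (1 if missing).
        masked_py_ids = [self.tokenid_pyid.get(x, 1) for x in masked_sample]
        masked_sk_ids = [self.tokenid_skid.get(x, 1) for x in masked_sample]
        return (np.asarray(masked_sample, dtype=np.int32),
                np.asarray(masked_flgs, dtype=np.int32),
                np.asarray(masked_py_ids, dtype=np.int32),
                np.asarray(masked_sk_ids, dtype=np.int32))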
    

    This is the code from the Mask class in pre_train_src/mask.py. It looks as though, for every sample, only the leading characters get masked.

    opened by liwenju0 0
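
In the quoted mask_process, random.shuffle(valid_ids) runs before the valid_ids[:n_masked] slice, so the slice takes a random subset of positions rather than the leading ones. The sketch below isolates that shuffle-then-slice pattern so the effect can be checked empirically. It is not the repository's code: `pick_masked_positions` is a hypothetical helper, the rate of 0.15 stands in for config['global_rate'] (whose real value is not given here), and the token ids are made up.

```python
import random


def pick_masked_positions(tokens, invalid, rate, rng):
    """Select positions to mask by shuffling the eligible indices, then slicing."""
    valid = [i for i, t in enumerate(tokens) if t not in invalid]
    n_masked = max(1, int(len(valid) * rate))  # at least one position
    rng.shuffle(valid)                         # in-place shuffle, as in mask_process
    return sorted(valid[:n_masked])


rng = random.Random(0)      # seeded only to make the run repeatable
tokens = list(range(100))   # made-up token ids 0..99
invalid = {0, 99}           # stand-ins for special tokens that are never masked

# Count how often each position is chosen over many samples.
counts = [0] * len(tokens)
for _ in range(2000):
    for pos in pick_masked_positions(tokens, invalid, 0.15, rng):
        counts[pos] += 1

first_half = sum(counts[:50])
second_half = sum(counts[50:])
print("selections in first half:", first_half)
print("selections in second half:", second_half)
```

If only leading characters were masked, nearly all selections would fall in the first half; with the shuffle in place, the two halves receive roughly equal counts.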