A multi-modal Chinese spell checker, released at ACL 2021.

Overview

ReaLiSe

ReaLiSe is a multi-modal Chinese spell checking model.

This is the official code for the paper Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking.

The paper has been accepted to Findings of ACL 2021.

Environment

  • Python: 3.6
  • Cuda: 10.0
  • Packages: pip install -r requirements.txt
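
If you are setting up from scratch, a minimal sketch is shown below (the conda environment name is only an illustration; the editable install of the repo's modified transformers follows the author's reply in the Comments section):

# Create and activate an isolated Python 3.6 environment (name is arbitrary)
conda create -n realise python=3.6
conda activate realise
# Install the project dependencies
pip install -r requirements.txt
# Install the modified transformers shipped with this repo instead of the
# official package (run from the directory containing its setup.py)
pip install --editable .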

Data

Raw Data

  • SIGHAN Bake-off 2013: http://ir.itc.ntnu.edu.tw/lre/sighan7csc.html
  • SIGHAN Bake-off 2014: http://ir.itc.ntnu.edu.tw/lre/clp14csc.html
  • SIGHAN Bake-off 2015: http://ir.itc.ntnu.edu.tw/lre/sighan8csc.html
  • Wang271K: https://github.com/wdimmy/Automatic-Corpus-Generation

Data Processing

The code and cleaned data are in the data_process directory.

You can also directly download the processed data from this and put them in the data directory. The data directory would look like this:

data
|- trainall.times2.pkl
|- test.sighan15.pkl
|- test.sighan15.lbl.tsv
|- test.sighan14.pkl
|- test.sighan14.lbl.tsv
|- test.sighan13.pkl
|- test.sighan13.lbl.tsv
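
A quick sanity check after downloading, assuming the layout above (the inline Python only reports the container type and size, since the exact record structure is not documented here):

ls data/
# Peek at one processed test set without assuming anything about its fields
python -c "import pickle; d = pickle.load(open('data/test.sighan15.pkl', 'rb')); print(type(d), len(d))"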

Pretrain

  • BERT: chinese-roberta-wwm-ext

    Huggingface hfl/chinese-roberta-wwm-ext: https://huggingface.co/hfl/chinese-roberta-wwm-ext
    Local: /data/dobby_ceph_ir/neutrali/pretrained_models/roberta-base-ch-for-csc/

  • Phonetic Encoder: pretrain_pho.sh

  • Graphic Encoder: pretrain_res.sh

  • Merge: merge.py
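
A rough sketch of running the steps above in order (fetching BERT via git is just one option and requires git-lfs; any script-specific arguments are defined inside the scripts themselves and are not shown here):

# Fetch the BERT weights locally (one possible way; requires git-lfs)
git clone https://huggingface.co/hfl/chinese-roberta-wwm-ext
# Pretrain the phonetic encoder
sh pretrain_pho.sh
# Pretrain the graphic encoder
sh pretrain_res.sh
# Merge BERT, the phonetic encoder, and the graphic encoder into one model
python merge.py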

You can also directly download the pretrained and merged BERT, Phonetic Encoder, and Graphic Encoder from this, and put them in the pretrained directory:

pretrained
|- pytorch_model.bin
|- vocab.txt
|- config.json

Train

After preparing the data and pretrained model, you can train ReaLiSe by executing the train.sh script. Note that you should set up the PRETRAINED_DIR, DATE_DIR, and OUTPUT_DIR in it.

sh train.sh
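
For illustration, the variables at the top of train.sh might be set like this (the variable names come from the script; the values below are placeholders):

# Placeholder paths; adjust to your own setup
PRETRAINED_DIR="pretrained"   # merged BERT + phonetic + graphic encoders
DATE_DIR="data"               # processed .pkl and .lbl.tsv files
OUTPUT_DIR="output"           # where checkpoints will be written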

Test

Test ReaLiSe using the test.sh script. You should set up the DATE_DIR, CKPT_DIR, and OUTPUT_DIR in it. CKPT_DIR is the OUTPUT_DIR of the training process.

sh test.sh
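
Likewise, a placeholder configuration for test.sh (variable names from the text above; values are illustrative):

DATE_DIR="data"            # the same processed data used for training
CKPT_DIR="output"          # the OUTPUT_DIR of the training run
OUTPUT_DIR="test_output"   # where test predictions and metrics are written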

Well-trained Model

You can also download the well-trained model from this for direct use. The performance of ReaLiSe and several baseline models on the SIGHAN13, SIGHAN14, and SIGHAN15 test sets is shown below:

Methods

Metrics

  • "D" means "Detection Level", "C" means "Correction Level".
  • "A", "P", "R", "F" means "Accuracy", "Precision", "Recall", and "F1" respectively.

SIGHAN15

Method            D-A    D-P    D-R    D-F    C-A    C-P    C-R    C-F
FASpell           74.2   67.6   60.0   63.5   73.7   66.6   59.1   62.6
Soft-Masked BERT  80.9   73.7   73.2   73.5   77.4   66.7   66.2   66.4
SpellGCN          -      74.8   80.7   77.7   -      72.1   77.7   75.9
BERT              82.4   74.2   78.0   76.1   81.0   71.6   75.3   73.4
ReaLiSe           84.7   77.3   81.3   79.3   84.0   75.9   79.9   77.8

SIGHAN14

Method           D-A    D-P    D-R    D-F    C-A    C-P    C-R    C-F
Pointer Network  -      63.2   82.5   71.6   -      79.3   68.9   73.7
SpellGCN         -      65.1   69.5   67.2   -      63.1   67.2   65.3
BERT             75.7   64.5   68.6   66.5   74.6   62.4   66.3   64.3
ReaLiSe          78.4   67.8   71.5   69.6   77.7   66.3   70.0   68.1

SIGHAN13

Method    D-A    D-P    D-R    D-F    C-A    C-P    C-R    C-F
FASpell   63.1   76.2   63.2   69.1   60.5   73.1   60.5   66.2
SpellGCN  78.8   85.7   78.8   82.1   77.8   84.6   77.8   81.0
BERT      77.0   85.0   77.0   80.8   77.4   83.0   75.2   78.9
ReaLiSe   82.7   88.6   82.5   85.4   81.4   87.2   81.2   84.1

Citation

@misc{xu2021read,
      title={Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking}, 
      author={Heng-Da Xu and Zhongli Li and Qingyu Zhou and Chao Li and Zizhen Wang and Yunbo Cao and Heyan Huang and Xian-Ling Mao},
      year={2021},
      eprint={2105.12306},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
Comments
  • issues on src/run.py

    Hi! I found that the code at line 432 of src/run.py has a problem: "model = model_class.from_pretrained(args.model_name_or_path, config=config, cache_dir=args.cache_dir if args.cache_dir else None)". The error message is: ValueError: Parameter config in SpellBertPho2ResArch3(config) should be an instance of class PretrainedConfig. To create a model from a pretrained model use model = SpellBertPho2ResArch3.from_pretrained(PRETRAINED_MODEL_NAME).

    opened by Yangyi-Chen 1
  • RuntimeError: storage has wrong size when testing a trained model

    Hi, thanks for open-sourcing the code! I ran into the following problem when testing a trained model. Where might things be going wrong?

    model_type: bert-pho2-res-arch3
    weight_dir: output
    ckpt_name: saved_ckpt-11000
    100%|████████████████████████████████████████████████████████████████| 35/35 [00:03<00:00, 10.99it/s]
    test_batches: 35
    Load model begin...
    Traceback (most recent call last):
      File "src/test.py", line 180, in <module>
        device=args.device,
      File "src/test.py", line 126, in test
        model = model_class.from_pretrained(model_dir)
      File "/home/ReaLiSe/ReaLiSe-master/transformers/modeling_utils.py", line 409, in from_pretrained
        state_dict = torch.load(resolved_archive_file, map_location='cpu')
      File "/root/miniconda3/envs/f36/lib/python3.6/site-packages/torch/serialization.py", line 386, in load
        return _load(f, map_location, pickle_module, **pickle_load_args)
      File "/root/miniconda3/envs/f36/lib/python3.6/site-packages/torch/serialization.py", line 580, in _load
        deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
    RuntimeError: storage has wrong size: expected -4698162557506238965 got 2359296
    
    opened by bestFannie 0
  • A combined reply to some questions about reproduction

    1. First, do not install the official transformers library; we modified its code. Before running experiments, be sure to run pip install --editable .
    2. The supplementary material on the ACL Anthology also contains our released code and models, along with a README explaining how to run the evaluation.
    3. Although a checkpoint is saved every 1000 steps during training, the reported metrics come from the final model after training ends (we simply take the last checkpoint; the test set is not used for selection). Every experimental detail described in the paper is accurate. We have also kept following this task: when an intern was building baselines, ReaLiSe was re-run with the same scripts on V100 GPUs (previously P100), again reaching about 79 detection F1 and ~78 correction F1, and ReaLiSe was further improved on top of that.
    4. Regarding punctuation: it is handled in preprocessing. In addition, some Chinese punctuation marks are recognized as UNK, so we modified the model's vocab.txt (see Google Drive or the supplementary material for details).
    5. The phonetic encoder is pretrained on the same data used for fine-tuning, only with a different objective. The pretrained phonetic and graphic encoders are selected according to downstream-task metrics.
    opened by Neutralzz 4
  • Bert baseline

    Hi, I ran an experiment using only the BERT (semantic) information (model type = bert, with_res=no) and found that correction F1 on SIGHAN15 reaches about 76, almost the same as the ReaLiSe model I trained (~77). Is my BERT experiment configuration correct? If so, your code and data seem to give much higher scores than other released implementations. What optimizations did you make? (So far I can see that you do some special handling of UNK cases in the data.) Thanks.

    opened by sdzhangbo 2
  • Questions about pretraining details

    First, congratulations on this work being accepted. I am very interested in the pretraining of the phonetic and graphic encoders described in the paper. What scale of data did you use to pretrain each of them, e.g. roughly how many sentences and what total file size? Also, did you use a validation set during pretraining, and if so, what was the size ratio between the validation and training sets? Finally, what order of magnitude was the learning rate during pretraining: the lr BERT recommends for fine-tuning on downstream tasks, or a larger one? Many thanks!

    opened by Aopolin-Lv 0