ReaLiSe
ReaLiSe is a multi-modal Chinese spell checking model.
This is the official code for the paper Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking, accepted to Findings of ACL 2021.
Environment
- Python: 3.6
- Cuda: 10.0
- Packages:

```sh
pip install -r requirements.txt
```
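If the GPU setup is in doubt, the following minimal sanity check can help; it assumes `torch` is among the packages pinned in `requirements.txt`:

```python
# env_check.py -- minimal sanity check of the Python/CUDA setup
# (assumes torch is installed via requirements.txt)
import sys
import torch

print("Python:", sys.version.split()[0])    # expect 3.6.x
print("PyTorch:", torch.__version__)
print("CUDA toolkit:", torch.version.cuda)  # expect 10.0
print("GPU available:", torch.cuda.is_available())
```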
Data
Raw Data
- SIGHAN Bake-off 2013: http://ir.itc.ntnu.edu.tw/lre/sighan7csc.html
- SIGHAN Bake-off 2014: http://ir.itc.ntnu.edu.tw/lre/clp14csc.html
- SIGHAN Bake-off 2015: http://ir.itc.ntnu.edu.tw/lre/sighan8csc.html
- Wang271K: https://github.com/wdimmy/Automatic-Corpus-Generation
Data Processing
The code and cleaned data are in the `data_process` directory. You can also directly download the processed data from this link and put them in the `data` directory. The `data` directory would look like this:
```
data
|- trainall.times2.pkl
|- test.sighan15.pkl
|- test.sighan15.lbl.tsv
|- test.sighan14.pkl
|- test.sighan14.lbl.tsv
|- test.sighan13.pkl
|- test.sighan13.lbl.tsv
```
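Each `.pkl` file is a pickled collection of preprocessed examples. A minimal sketch for inspecting one (the per-example schema is an assumption; printing the first item reveals the fields the processing scripts actually store):

```python
# inspect_data.py -- peek inside a processed dataset
# (the per-example schema is an assumption; print to see the real fields)
import pickle

with open("data/test.sighan15.pkl", "rb") as f:
    examples = pickle.load(f)

print("num examples:", len(examples))
print("first example:", examples[0])
```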
Pretrain
- BERT: chinese-roberta-wwm-ext (see the download sketch after this list)
  - Huggingface hfl/chinese-roberta-wwm-ext: https://huggingface.co/hfl/chinese-roberta-wwm-ext
  - Local: /data/dobby_ceph_ir/neutrali/pretrained_models/roberta-base-ch-for-csc/
- Phonetic Encoder: `pretrain_pho.sh`
- Graphic Encoder: `pretrain_res.sh`
- Merge: `merge.py`
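If the local path above is not accessible, the BERT initializer can be fetched from Huggingface instead. A minimal sketch using the `transformers` library (the output directory name `pretrained_bert` is hypothetical):

```python
# fetch_bert.py -- download hfl/chinese-roberta-wwm-ext from Huggingface
# ("pretrained_bert" is a hypothetical output directory)
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
model = BertModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

tokenizer.save_pretrained("pretrained_bert")
model.save_pretrained("pretrained_bert")
```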
You can also directly download the pretrained and merged BERT, Phonetic Encoder, and Graphic Encoder from this link, and put them in the `pretrained` directory:
```
pretrained
|- pytorch_model.bin
|- vocab.txt
|- config.json
```
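A quick way to confirm the directory is well formed is to load the config and vocabulary from it (a sketch; the merged `pytorch_model.bin` holds ReaLiSe's multi-encoder weights, which require this repo's model class to load):

```python
# check_pretrained.py -- verify the merged "pretrained" directory is readable
from transformers import BertConfig, BertTokenizer

config = BertConfig.from_pretrained("pretrained")        # reads config.json
tokenizer = BertTokenizer.from_pretrained("pretrained")  # reads vocab.txt
print("hidden size:", config.hidden_size)
print("vocab size:", tokenizer.vocab_size)
```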
Train
After preparing the data and the pretrained model, you can train ReaLiSe by executing the `train.sh` script. Note that you should set the `PRETRAINED_DIR`, `DATE_DIR`, and `OUTPUT_DIR` variables in it.
```sh
sh train.sh
```
Test
Test ReaLiSe using the `test.sh` script. You should set the `DATE_DIR`, `CKPT_DIR`, and `OUTPUT_DIR` variables in it; `CKPT_DIR` is the `OUTPUT_DIR` of the training process.
```sh
sh test.sh
```
Well-trained Model
You can also download the well-trained model from this link and use it directly. The performance scores of ReaLiSe and some baseline models on the SIGHAN13, SIGHAN14, and SIGHAN15 test sets are listed below.
Methods
- FASpell: FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based On DAE-Decoder Paradigm
- Soft-Masked BERT: Spelling Error Correction with Soft-Masked BERT
- SpellGCN: SpellGCN: Incorporating Phonological and Visual Similarities into Language Models for Chinese Spelling Check
- BERT: Our implementation
Metrics
- "D" means "Detection Level", "C" means "Correction Level".
- "A", "P", "R", "F" means "Accuracy", "Precision", "Recall", and "F1" respectively.
SIGHAN15
Method | D-A | D-P | D-R | D-F | C-A | C-P | C-R | C-F |
---|---|---|---|---|---|---|---|---|
FASpell | 74.2 | 67.6 | 60.0 | 63.5 | 73.7 | 66.6 | 59.1 | 62.6 |
Soft-Masked BERT | 80.9 | 73.7 | 73.2 | 73.5 | 77.4 | 66.7 | 66.2 | 66.4 |
SpellGCN | - | 74.8 | 80.7 | 77.7 | - | 72.1 | 77.7 | 75.9 |
BERT | 82.4 | 74.2 | 78.0 | 76.1 | 81.0 | 71.6 | 75.3 | 73.4 |
ReaLiSe | 84.7 | 77.3 | 81.3 | 79.3 | 84.0 | 75.9 | 79.9 | 77.8 |
SIGHAN14
Method | D-A | D-P | D-R | D-F | C-A | C-P | C-R | C-F |
---|---|---|---|---|---|---|---|---|
Pointer Network | - | 63.2 | 82.5 | 71.6 | - | 79.3 | 68.9 | 73.7 |
SpellGCN | - | 65.1 | 69.5 | 67.2 | - | 63.1 | 67.2 | 65.3 |
BERT | 75.7 | 64.5 | 68.6 | 66.5 | 74.6 | 62.4 | 66.3 | 64.3 |
ReaLiSe | 78.4 | 67.8 | 71.5 | 69.6 | 77.7 | 66.3 | 70.0 | 68.1 |
SIGHAN13
Method | D-A | D-P | D-R | D-F | C-A | C-P | C-R | C-F |
---|---|---|---|---|---|---|---|---|
FASpell | 63.1 | 76.2 | 63.2 | 69.1 | 60.5 | 73.1 | 60.5 | 66.2 |
SpellGCN | 78.8 | 85.7 | 78.8 | 82.1 | 77.8 | 84.6 | 77.8 | 81.0 |
BERT | 77.0 | 85.0 | 77.0 | 80.8 | 77.4 | 83.0 | 75.2 | 78.9 |
ReaLiSe | 82.7 | 88.6 | 82.5 | 85.4 | 81.4 | 87.2 | 81.2 | 84.1 |
Citation
```bibtex
@misc{xu2021read,
    title={Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking},
    author={Heng-Da Xu and Zhongli Li and Qingyu Zhou and Chao Li and Zizhen Wang and Yunbo Cao and Heyan Huang and Xian-Ling Mao},
    year={2021},
    eprint={2105.12306},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```