A multi-modal Chinese spell checker, released at ACL 2021.

Overview

ReaLiSe

ReaLiSe is a multi-modal Chinese spell checking model.

This is the official code for the paper Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking.

The paper has been accepted to Findings of ACL 2021.

Environment

  • Python: 3.6
  • CUDA: 10.0
  • Packages: pip install -r requirements.txt
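  • Modified transformers: pip install --editable . (as the authors note in the Comments section below, this repo ships a modified copy of transformers, so do not install the official package)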

Data

Raw Data

SIGHAN Bake-off 2013: http://ir.itc.ntnu.edu.tw/lre/sighan7csc.html
SIGHAN Bake-off 2014: http://ir.itc.ntnu.edu.tw/lre/clp14csc.html
SIGHAN Bake-off 2015: http://ir.itc.ntnu.edu.tw/lre/sighan8csc.html
Wang271K: https://github.com/wdimmy/Automatic-Corpus-Generation

Data Processing

The code and cleaned data are in the data_process directory.

You can also directly download the processed data from this link and put it in the data directory, which will then look like this:

data
|- trainall.times2.pkl
|- test.sighan15.pkl
|- test.sighan15.lbl.tsv
|- test.sighan14.pkl
|- test.sighan14.lbl.tsv
|- test.sighan13.pkl
|- test.sighan13.lbl.tsv
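To sanity-check the download, you can peek at one of the processed files. A minimal sketch in Python, assuming the .pkl files are standard pickles holding a list of examples (the per-example structure is whatever data_process produced, so the prints below are purely illustrative):

# Inspect one processed data file (sketch; assumes standard pickle format).
import pickle

with open("data/test.sighan15.pkl", "rb") as f:
    examples = pickle.load(f)

print(len(examples))  # number of test examples
print(examples[0])    # structure of a single example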

Pretrain

  • BERT: chinese-roberta-wwm-ext

    Huggingface hfl/chinese-roberta-wwm-ext: https://huggingface.co/hfl/chinese-roberta-wwm-ext (a download sketch follows this list)
    Local: /data/dobby_ceph_ir/neutrali/pretrained_models/roberta-base-ch-for-csc/

  • Phonetic Encoder: pretrain_pho.sh

  • Graphic Encoder: pretrain_res.sh

  • Merge: merge.py
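For the BERT backbone listed above, a minimal download sketch using the standard transformers API (this repo bundles a modified transformers, so treat this only as a way to fetch and cache the weights; the pretrained_bert/ path is an arbitrary example):

# Fetch hfl/chinese-roberta-wwm-ext from the Hugging Face hub (sketch).
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
model = BertModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

# Keep a local copy of the weights and vocab for the pretraining/merge steps.
tokenizer.save_pretrained("pretrained_bert/")
model.save_pretrained("pretrained_bert/")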

You can also directly download the pretrained and merged BERT, Phonetic Encoder, and Graphic Encoder from this link, and put them in the pretrained directory:

pretrained
|- pytorch_model.bin
|- vocab.txt
|- config.json
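A quick sanity check on the merged checkpoint, assuming pytorch_model.bin is an ordinary PyTorch state dict (which is what merge-style scripts typically save; if the file holds something else, adapt accordingly):

# List a few tensors from the merged checkpoint (sketch).
import torch

state_dict = torch.load("pretrained/pytorch_model.bin", map_location="cpu")
print(len(state_dict), "tensors")
for name in list(state_dict)[:5]:
    print(name, tuple(state_dict[name].shape))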

Train

After preparing the data and pretrained model, you can train ReaLiSe by executing the train.sh script. Note that you should set up the PRETRAINED_DIR, DATE_DIR, and OUTPUT_DIR in it.

sh train.sh

Test

Test ReaLiSe using the test.sh script. You should set up the DATE_DIR, CKPT_DIR, and OUTPUT_DIR in it. CKPT_DIR is the OUTPUT_DIR of the training process.

sh test.sh

Well-trained Model

You can also download the well-trained model from this link and use it directly. The performance of ReaLiSe and some baseline models on the SIGHAN13, SIGHAN14, and SIGHAN15 test sets is shown below:

Metrics

  • "D" means "Detection Level", "C" means "Correction Level".
  • "A", "P", "R", "F" means "Accuracy", "Precision", "Recall", and "F1" respectively.

SIGHAN15

Method            D-A   D-P   D-R   D-F   C-A   C-P   C-R   C-F
FASpell           74.2  67.6  60.0  63.5  73.7  66.6  59.1  62.6
Soft-Masked BERT  80.9  73.7  73.2  73.5  77.4  66.7  66.2  66.4
SpellGCN          -     74.8  80.7  77.7  -     72.1  77.7  75.9
BERT              82.4  74.2  78.0  76.1  81.0  71.6  75.3  73.4
ReaLiSe           84.7  77.3  81.3  79.3  84.0  75.9  79.9  77.8

SIGHAN14

Method           D-A   D-P   D-R   D-F   C-A   C-P   C-R   C-F
Pointer Network  -     63.2  82.5  71.6  -     79.3  68.9  73.7
SpellGCN         -     65.1  69.5  67.2  -     63.1  67.2  65.3
BERT             75.7  64.5  68.6  66.5  74.6  62.4  66.3  64.3
ReaLiSe          78.4  67.8  71.5  69.6  77.7  66.3  70.0  68.1

SIGHAN13

Method    D-A   D-P   D-R   D-F   C-A   C-P   C-R   C-F
FASpell   63.1  76.2  63.2  69.1  60.5  73.1  60.5  66.2
SpellGCN  78.8  85.7  78.8  82.1  77.8  84.6  77.8  81.0
BERT      77.0  85.0  77.0  80.8  77.4  83.0  75.2  78.9
ReaLiSe   82.7  88.6  82.5  85.4  81.4  87.2  81.2  84.1

Citation

@misc{xu2021read,
      title={Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking}, 
      author={Heng-Da Xu and Zhongli Li and Qingyu Zhou and Chao Li and Zizhen Wang and Yunbo Cao and Heyan Huang and Xian-Ling Mao},
      year={2021},
      eprint={2105.12306},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
Comments
  • issues on src/run.py

    Hi! I found that line 432 of src/run.py has a problem: model = model_class.from_pretrained(args.model_name_or_path, config=config, cache_dir=args.cache_dir if args.cache_dir else None). The error message is: ValueError: Parameter config in SpellBertPho2ResArch3(config) should be an instance of class PretrainedConfig. To create a model from a pretrained model use model = SpellBertPho2ResArch3.from_pretrained(PRETRAINED_MODEL_NAME).

    opened by Yangyi-Chen 1
  • RuntimeError: storage has wrong size when testing a trained model

    Hi, thanks for open-sourcing this! I hit the following problem when testing the model I trained. What might be going wrong?

    model_type: bert-pho2-res-arch3
    weight_dir: output
    ckpt_name: saved_ckpt-11000
    100%|████████████████████████████████████████████████████████████████| 35/35 [00:03<00:00, 10.99it/s]
    test_batches: 35
    Load model begin...
    Traceback (most recent call last):
      File "src/test.py", line 180, in <module>
        device=args.device,
      File "src/test.py", line 126, in test
        model = model_class.from_pretrained(model_dir)
      File "/home/ReaLiSe/ReaLiSe-master/transformers/modeling_utils.py", line 409, in from_pretrained
        state_dict = torch.load(resolved_archive_file, map_location='cpu')
      File "/root/miniconda3/envs/f36/lib/python3.6/site-packages/torch/serialization.py", line 386, in load
        return _load(f, map_location, pickle_module, **pickle_load_args)
      File "/root/miniconda3/envs/f36/lib/python3.6/site-packages/torch/serialization.py", line 580, in _load
        deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
    RuntimeError: storage has wrong size: expected -4698162557506238965 got 2359296
    
    opened by bestFannie 0
  • A consolidated reply to some reproduction questions

    1. Do not install the official transformers library; we heavily modified its code. Before running experiments, be sure to run pip install --editable .
    2. The supplementary material on the ACL Anthology also contains our open-sourced code and models, plus a README explaining how to evaluate.
    3. Although a checkpoint is saved every 1000 steps during training, the reported metrics come from the final model after training finishes (we simply take the last checkpoint; we did not use the test set to pick one). Every experimental detail described in the paper is accurate. We have also kept following this task: when an intern was building baselines, ReaLiSe was re-run with the same scripts on a V100 (previously a P100), again reproducing about 79 detection F1 and ~78 correction F1, and ReaLiSe was further improved.
    4. Regarding punctuation: the preprocessing does handle it. In addition, some Chinese punctuation marks are recognized as UNK, so we modified the model's vocab.txt (see Google Drive or the supplementary material for details).
    5. The pinyin encoder's pretraining corpus is the same as the fine-tuning data; only the objective differs. The pretrained pinyin and graphic encoders were selected according to downstream-task metrics.
    opened by Neutralzz 4
  • Bert baseline

    Hi, I ran an experiment using only the BERT (semantic) information (model type = bert, with_res=no) and found that correction on SIGHAN15 reaches about 76, almost the same as the ReaLiSe model I trained (~77). Is my BERT experiment configuration correct? If so, your code and data seem to give much higher performance than other released code. What optimizations did you make? (So far I can see you do some special handling of UNK cases in the data.) Thanks!

    opened by sdzhangbo 2
  • Questions about pretraining details

    First, congratulations on this work being accepted. I'm very interested in the pinyin and glyph pretraining in the paper. What scale of data did you use to pretrain each of these two components, e.g., how many sentences and roughly what total file size? Also, did you use a validation set during pretraining, and if so, what was the size ratio of validation to training data? Finally, what order of magnitude was the pretraining learning rate: the lr BERT recommends for fine-tuning on downstream tasks, or a larger one? Many thanks!

    opened by Aopolin-Lv 0