Code for evaluating Japanese pretrained models provided by NTT Ltd.

Overview

japanese-dialog-transformers

The Japanese description is here.

This repository provides the information needed to evaluate, on fairseq, the Japanese Transformer encoder-decoder dialogue models provided by NTT.


Table of contents
Update log
Notice for using the codes
Model download
Quick start
LICENSE

Update log

  • 2021/09/17 Published dialogue models (fairseq version japanese-dialog-transformer-1.6B) and evaluation codes.

Notice for using the codes

The dialogue models provided here are intended for evaluating and verifying model performance. Before downloading them, please read the LICENSE and CAUTION documents. You may download and use these models only if you agree to all of the following three points.

  1. The terms of the LICENSE.
  2. The models are used only to evaluate and verify their performance, not to provide a dialogue service itself.
  3. You take all possible care and measures to prevent damage caused by the generated text, and you take responsibility for any text you generate, whether appropriate or inappropriate.

BibTeX

When publishing results using this model, please cite the following paper.

@misc{sugiyama2021empirical,
      title={Empirical Analysis of Training Strategies of Transformer-based Japanese Chit-chat Systems}, 
      author={Hiroaki Sugiyama and Masahiro Mizukami and Tsunehiro Arimoto and Hiromi Narimatsu and Yuya Chiba and Hideharu Nakajima and Toyomi Meguro},
      year={2021},
      eprint={2109.05217},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Model download


Quick start

The models published on this page can be used for utterance generation and additional fine-tuning using the scripts included in fairseq.

Install dependent libraries

The verification environment is as follows.

  • Python 3.8.10 on miniconda
  • CUDA 11.1/10.2
  • PyTorch 1.8.2 (for the installation commands, be sure to check the official page; we recommend using pip.)
  • fairseq 1.0.0 (validated commit ID: 8adff65ab30dd5f3a3589315bbc1fafad52943e7)
  • sentencepiece 0.1.96

When installing fairseq, check the official page and install the latest version from source; a plain pip install only provides the older 0.10.2 release. If you want to fine-tune on your own data, you also need the standalone (command-line) version of sentencepiece.
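One possible install sequence is sketched below. This is only a sketch, not an official recipe: it assumes building fairseq from source at the validated commit listed above and installing the sentencepiece Python package with pip; follow the official pages if anything differs in your environment.

# Install PyTorch 1.8.2 first, following the official PyTorch page for your CUDA version (11.1 or 10.2).

# fairseq: build from source to get a recent version instead of the 0.10.2 pip release.
git clone https://github.com/pytorch/fairseq
cd fairseq
git checkout 8adff65ab30dd5f3a3589315bbc1fafad52943e7
pip install --editable ./
cd ..

# sentencepiece Python package. The standalone command-line tools (spm_encode, spm_train)
# may need to be built separately from the sentencepiece repository if you fine-tune on your own data.
pip install sentencepiece==0.1.96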

fairseq-interactive

Because fairseq-interactive has no way to keep the dialogue context, it generates responses from the input sentence alone. This differs from the fine-tuning setting and the paper's experiments, which use the context, so it is more likely to generate inappropriate utterances.

In the following command, a small value (10) is used for beam and nbest (the number of output candidates) to keep the results easy to read. In actual use, setting them to 20 or more gives better results.

fairseq-interactive data/sample/bin/ \
 --path checkpoints/persona50k-flat_1.6B_33avog1i_4.16.pt \
 --beam 10 \
 --seed 0 \
 --min-len 10 \
 --source-lang src \
 --target-lang dst \
 --tokenizer space \
 --bpe sentencepiece \
 --sentencepiece-model data/dicts/sp_oall_32k.model \
 --no-repeat-ngram-size 3 \
 --nbest 10 \
 --sampling \
 --sampling-topp 0.9 \
 --temperature 1.0
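Once started, fairseq-interactive reads one utterance per line from standard input and prints the nbest response candidates. For a quick one-off check you can also pipe a single utterance in, as in the sketch below; the example utterance is only illustrative, and any generation options omitted here fall back to fairseq defaults.

echo "今日はいい天気ですね。" | fairseq-interactive data/sample/bin/ \
 --path checkpoints/persona50k-flat_1.6B_33avog1i_4.16.pt \
 --source-lang src \
 --target-lang dst \
 --tokenizer space \
 --bpe sentencepiece \
 --sentencepiece-model data/dicts/sp_oall_32k.model \
 --sampling \
 --sampling-topp 0.9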

dialog.py

dialog.py keeps a context of about four utterances, which matches the settings used for fine-tuning and in the paper's experiments.

python scripts/dialog.py data/sample/bin/ \
 --path checkpoints/dials5_1e-4_1li20zh5_tw5.143_step85.pt \
 --beam 80 \
 --min-len 10 \
 --source-lang src \
 --target-lang dst \
 --tokenizer space \
 --bpe sentencepiece \
 --sentencepiece-model data/dicts/sp_oall_32k.model \
 --no-repeat-ngram-size 3 \
 --nbest 80 \
 --sampling \
 --sampling-topp 0.9 \
 --temperature 1.0 \
 --show-nbest 5

Perplexity calculation on a specific data set

The following command computes the perplexity (ppl) on a particular dataset. The lower the ppl, the better the model represents the dialogues in that dataset.

fairseq-validate $DATA_PATH \
 --path $MODEL_PATH \
 --task translation \
 --source-lang src \
 --target-lang dst \
 --batch-size 2 \
 --ddp-backend no_c10d \
 --valid-subset test \
 --skip-invalid-size-inputs-valid-test
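Here $DATA_PATH is a binarized fairseq data directory that contains a test split, and $MODEL_PATH is a downloaded checkpoint. For example (illustrative values; whether a given directory actually contains a test split depends on how it was preprocessed):

DATA_PATH=data/sample/bin/
MODEL_PATH=checkpoints/dials5_1e-4_1li20zh5_tw5.143_step85.pt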

Finetuning with Persona-chat and EmpatheticDialogues

By fine-tuning the pretrained model on PersonaChat or EmpatheticDialogues, you can create a model that is almost identical to the fine-tuned models provided.

If you have your own dialogue data, you can place it in the same format under data/*/raw and fine-tune on that data. Note, however, that the LICENSE does not permit releasing or distributing fine-tuned models. You may instead release your own data and let a third party fine-tune from this model themselves.

Downloading and converting datasets

Convert the data from Excel to a simple format of input sentences (src) and output sentences (dst), where the same row in src and dst forms the corresponding input/output pair. 50,000 rows are split off and written out as the training set.

python scripts/extract_ed.py japanese_empathetic_dialogues.xlsx data/empdial/raw/
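Before fairseq-train or fairseq-validate can read the converted data, it has to be tokenized with the provided sentencepiece model and binarized with fairseq-preprocess. The sketch below is one possible way to do this; the split file names under data/empdial/ and the dictionary paths are assumptions, so adjust them to the files actually produced, and prefer the repository's own tokenization script if one is provided.

# Tokenize the raw src/dst files with the provided sentencepiece model.
mkdir -p data/empdial/tok data/empdial/bin
for split in train valid test; do
  for lang in src dst; do
    spm_encode --model=data/dicts/sp_oall_32k.model \
      < data/empdial/raw/${split}.${lang} \
      > data/empdial/tok/${split}.${lang}
  done
done

# Binarize for fairseq, reusing the pretrained model's dictionaries
# (assumed to be in data/sample/bin/) so that token IDs match the checkpoint.
fairseq-preprocess \
 --source-lang src --target-lang dst \
 --trainpref data/empdial/tok/train \
 --validpref data/empdial/tok/valid \
 --testpref data/empdial/tok/test \
 --srcdict data/sample/bin/dict.src.txt \
 --tgtdict data/sample/bin/dict.dst.txt \
 --destdir data/empdial/bin/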

License

LICENSE

Comments
  • Minor typos

    Thank you for releasing the models and datasets.

    I found the following minor typos:

    • "Finetuning with Persona-chat and EmpatheticDialogues" section: .. under the LISENCE -> .. under the LICENSE
    • "License" section: LISENSE -> LICENSE
    opened by tomohideshibata 1
  • sentencepiece version

    The "Install dependent libraries" part of Quick start says sentencepiece 0.19.6, but I think the correct version is sentencepiece 0.1.96.

    https://github.com/google/sentencepiece/releases/tag/v0.1.96

    opened by gatakaba 1
  • Could you release your fine-tuning scripts in your experiment?

    Thank you for your excellent work. I have one question about fine-tuning your pre-trained model.

    I am trying to reproduce the same results in your paper but am not able to get the same perplexity on JEmpatheticDialogues and JPersonaChat.

    My fine-tuned model gets 30.87 ppl, which is much worse than 21.32 ppl in your Finetuned model with JPersonaChat. I also tried some seeds, but I could not improve the perplexity.

    Here is my code for fine-tuning JPersonaChat. (Before fine-tuning, I tokenized and processed the dataset by using tokenized.sh and fairseq-preprocess)

    export CUDA_VISIBLE_DEVICES=0,1
    fairseq-train $DATA_DIR \
        --arch transformer \
        --finetune-from-model $PRETRAINED_MODEL \
        --task translation \
        --save-dir $SAVE_DIR \
        --dropout 0.1 \
        --encoder-normalize-before \
        --encoder-embed-dim 1920 \
        --encoder-ffn-embed-dim 7680 \
        --encoder-layers 2 \
        --decoder-normalize-before \
        --decoder-embed-dim 1920 \
        --decoder-ffn-embed-dim 7680 \
        --decoder-layers 24 \
        --criterion cross_entropy \
        --batch-size 4 \
        --lr 0.0001 \
        --max-update 3000 \
        --warmup-updates 3000 \
        --optimizer adafactor \
        --best-checkpoint-metric loss \
        --keep-best-checkpoints 1 \
        --tensorboard-logdir $LOG_DIR \
        --update-freq 32 \
        --keep-interval-updates 5 \
        --seed 0 \
        --ddp-backend no_c10d 
    

    Could you share the fine-tuning recipes from your experiment? (Sorry if I have overlooked them.) I look forward to hearing from you.

    opened by meguruin 0
  • Fine-tuning recipes available?

    Thank you for publishing and sharing your great work!

    I'd like to try to reproduce the fine-tuned models by Persona-chat and EmpatheticDialogues datasets.
    Can I get the fairseq-train recipes to fine-tune the pre-trained model with these datasets?
    (Sorry if I have overlooked the existence)

    opened by SeitaroShinagawa 0
  • Any plans to release other models?

    Thank you for the great chat-bot models and datasets.

    Do you have any plans to release trained models for other sizes (0.3B, 0.7B and 1.1B)? Also, do you have any plans to release models trained on mixed datasets?

    In particular, I think the smaller size models are useful for fine-tuning by people who do not have enough computation resources (like me).

    Thank you in advance.

    opened by tealgreen0503 2
Owner
NTT Communication Science Laboratories