JaQuAD: Japanese Question Answering Dataset

SkelterLabs

Last update: Dec 27, 2022

Overview

The Japanese Question Answering Dataset (JaQuAD), released in 2022, is a human-annotated dataset for Japanese machine reading comprehension. It was developed to provide a SQuAD-like QA dataset in Japanese and contains 39,696 question-answer pairs. Questions and answers were manually curated by human annotators, and contexts were collected from Japanese Wikipedia articles.

For more information on how the dataset was created, refer to our paper, JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension.

Data

JaQuAD consists of three sets: train, validation, and test. They were created from disjoint sets of Wikipedia articles. The following table shows statistics for each set:

Set	Number of Articles	Number of Contexts	Number of Questions
Train	691	9713	31748
Validation	101	1431	3939
Test	109	1479	4009
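
As a quick consistency check, the per-set counts in the table can be summed and compared with the 39,696 question-answer pairs quoted in the overview. The sketch below only restates the table; the names `SPLIT_STATS`, `total`, and `question_share` are illustrative and not part of any JaQuAD API.

```python
# Split statistics copied from the table above.
# SPLIT_STATS is an illustrative name, not part of any JaQuAD API.
SPLIT_STATS = {
    "train": {"articles": 691, "contexts": 9713, "questions": 31748},
    "validation": {"articles": 101, "contexts": 1431, "questions": 3939},
    "test": {"articles": 109, "contexts": 1479, "questions": 4009},
}


def total(field):
    """Sum one column of the table across the three sets."""
    return sum(stats[field] for stats in SPLIT_STATS.values())


def question_share(split):
    """Fraction of all questions that fall into the given set."""
    return SPLIT_STATS[split]["questions"] / total("questions")


if __name__ == "__main__":
    print("total questions:", total("questions"))
    for name in SPLIT_STATS:
        print(f"{name}: {question_share(name):.1%} of questions")
```

The question counts add up to 39,696, which matches the figure given in the overview.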

You can also download our dataset here. (The test set is not publicly released yet.)

from datasets import load_dataset
jaquad_data = load_dataset('SkelterLabsInc/JaQuAD')

Baseline

We also provide a baseline model for JaQuAD for comparison. We created this model by fine-tuning a publicly available Japanese BERT model on JaQuAD. You can see the performance of the baseline model in the table below.

For more information on the model's creation, refer to JaQuAD.ipynb.

Pre-trained LM	Dev F1	Dev EM	Test F1	Test EM
BERT-Japanese	77.35	61.01	78.92	63.38

You can download the baseline model here.
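
For readers unfamiliar with the two metrics in the table, the following is a minimal sketch of SQuAD-style Exact Match (EM) and F1 for a single prediction. It assumes that overlap is counted per character, which is a common choice for Japanese text; the scoring used for the reported numbers may tokenize or normalize differently, so treat the function names `exact_match` and `char_f1` as hypothetical helpers rather than the evaluation code behind the table.

```python
from collections import Counter


def exact_match(prediction, answer):
    """1.0 if the two strings are identical after trimming whitespace."""
    return float(prediction.strip() == answer.strip())


def char_f1(prediction, answer):
    """Character-overlap F1 (assumption: characters are the scoring unit)."""
    pred_chars = list(prediction.strip())
    gold_chars = list(answer.strip())
    if not pred_chars or not gold_chars:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_chars == gold_chars)
    # Multiset intersection counts each shared character at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_chars) & Counter(gold_chars)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(gold_chars)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("奈良", "奈良"))
    print(char_f1("器具メーカー", "光源メーカー"))
```

Dataset-level scores are then the averages of these per-example values over all questions in a set.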

Usage

import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

question = 'アレクサンダー・グラハム・ベルは、どこで生まれたの?'
context = 'アレクサンダー・グラハム・ベルは、スコットランド生まれの科学者、発明家、工学者である。世界初の実用的電話の発明で知られている。'

model = AutoModelForQuestionAnswering.from_pretrained(
    'SkelterLabsInc/bert-base-japanese-jaquad')
tokenizer = AutoTokenizer.from_pretrained(
    'SkelterLabsInc/bert-base-japanese-jaquad')

inputs = tokenizer(
    question, context, add_special_tokens=True, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]
outputs = model(**inputs)
answer_start_scores = outputs.start_logits
answer_end_scores = outputs.end_logits

# Get the most likely start of the answer with the argmax of the score.
answer_start = torch.argmax(answer_start_scores)
# Get the most likely end of the answer with the argmax of the score.
# 1 is added to `answer_end` because the index of the score is inclusive.
answer_end = torch.argmax(answer_end_scores) + 1

answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
# answer = 'スコットランド'
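
The snippet above takes the argmax of the start and end scores independently, so nothing prevents the predicted end from falling before the predicted start. A common alternative is to search for the highest-scoring span with start <= end. The sketch below shows that idea on plain Python lists; `best_span` and `max_answer_len` are hypothetical names introduced here, and this is not the decoding used in JaQuAD.ipynb.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return the (start, end) pair, both inclusive, that maximizes
    start_scores[start] + end_scores[end] subject to start <= end and
    a span length of at most max_answer_len tokens."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length limit.
        last_end = min(start + max_answer_len, len(end_scores))
        for end in range(start, last_end):
            score = start_score + end_scores[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


if __name__ == "__main__":
    # Independent argmax would pick start=2 and end=0, which is not a span.
    print(best_span([0.0, 0.0, 5.0], [5.0, 0.0, 1.0]))
```

With the usage code above, the inputs would be `answer_start_scores[0].tolist()` and `answer_end_scores[0].tolist()`, and the answer tokens would be `input_ids[start:end + 1]` because the returned end index is inclusive. The sketch does not exclude question or special tokens from the search.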

Limitations

This dataset is not yet complete. The social biases of this dataset have not yet been investigated.

If you find any errors in JaQuAD, please contact jaquad@skelterlabs.com.

Reference

If you use our dataset or code, please cite our paper:

@misc{so2022jaquad,
      title={{JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension}},
      author={ByungHoon So and Kyuhong Byun and Kyungwon Kang and Seongjin Cho},
      year={2022},
      eprint={2202.01764},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

LICENSE

The JaQuAD dataset is licensed under the CC BY-SA 3.0 license (https://creativecommons.org/licenses/by-sa/3.0/).

Have Questions?

Ask us at jaquad@skelterlabs.com.


Comments

What does get_offset function do? (Insufficient info in docs)

I ran into this assertion error in get_offset function when using models other than from cl-tohoku in JaQuAD.ipynb:

assert unk_pointer is not None, \
                'Normalized context and tokens are not matched'

I know this is something related to tokenization but I still can't quite figure it out even after going through the docstring:

'''The character-level start/end offsets of a token within a context.
    Algorithm:
    1. Make offsets of normalized context within the original context.
    2. Make offsets of tokens (input_ids) within the normalized context.

    Arguments:
    input_ids -- Token ids of tokenized context (by tokenizer).
    context -- String of context
    tokenizer
    norm_form

    Return:
        List[Tuple[int, int]]: Offsets of tokens within the input context.
        For each token, the offsets are presented as a tuple of (start
        position index, end position index). Both indices are inclusive.
    '''

What is the motivation behind this function and in what circumstance would you need it?

Thanks.

opened by yuenherny 0
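
Step 1 of the docstring quoted in this issue (offsets of the normalized context within the original context) can be illustrated with a small sketch. Unicode normalization such as NFKC can change the length of a string, so character positions in the normalized text no longer line up with the original. The code below is an assumption-laden simplification, not the repository's `get_offset`: it normalizes one character at a time, so it does not handle sequences that only compose when normalized together (for example a kana followed by a combining voiced mark), and `normalized_offsets` is a hypothetical name.

```python
import unicodedata


def normalized_offsets(context, norm_form="NFKC"):
    """Normalize `context` character by character.

    Returns (normalized_text, offsets), where offsets[i] is the index in
    `context` of the original character that produced normalized_text[i].
    """
    normalized_chars = []
    offsets = []
    for index, char in enumerate(context):
        # One original character may expand to several normalized ones.
        for norm_char in unicodedata.normalize(norm_form, char):
            normalized_chars.append(norm_char)
            offsets.append(index)
    return "".join(normalized_chars), offsets


if __name__ == "__main__":
    # Full-width "Ａ" maps to "A"; the ligature "ﬁ" expands to "f" + "i".
    text, offsets = normalized_offsets("Ａﬁ")
    print(text, offsets)
```

A mapping of this kind lets an answer span found in tokenizer-normalized text be translated back to character positions in the original context. If a tokenizer normalizes or marks unknown tokens differently from what such a mapping expects, the two views of the text can fail to align.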

Small modifications to enhance the baseline performance like dev EM = 75%

Thank you for sharing the great Japanese QA dataset!

I would like to share my changes, which improve the baseline performance by 10%+ (EM).

Inference log:

  1/3939 | EM: 1.0000, F1: 1.0000
        (Sample) pred: "奈良", answer: "奈良"
Token indices sequence length is longer than the specified maximum sequence length for this model (538 > 512). Running this sequence through the model will result in indexing errors
  201/3939 | EM: 0.7861, F1: 0.8720
        (Sample) pred: "スティーブンズ・プリンター社", answer: "スティーブンズ・プリンター社"
  401/3939 | EM: 0.7781, F1: 0.8698
        (Sample) pred: "湿潤状態", answer: "湿潤状態"
  601/3939 | EM: 0.7604, F1: 0.8593
        (Sample) pred: "1881年", answer: "1881年"
  801/3939 | EM: 0.7491, F1: 0.8490
        (Sample) pred: "器具メーカー", answer: "光源メーカー"
  1001/3939 | EM: 0.7652, F1: 0.8555
        (Sample) pred: "Graduation", answer: "Graduation"
  1201/3939 | EM: 0.7619, F1: 0.8515
        (Sample) pred: "黒田孝高", answer: "黒田孝高"
  1401/3939 | EM: 0.7523, F1: 0.8445
        (Sample) pred: "煙害", answer: "煙害"
  1601/3939 | EM: 0.7552, F1: 0.8440
        (Sample) pred: "カルロス門", answer: "カルロス門"
  1801/3939 | EM: 0.7601, F1: 0.8479
        (Sample) pred: "「一心寮」", answer: "「一心寮」"
  2001/3939 | EM: 0.7651, F1: 0.8517
        (Sample) pred: "2015年7月18日", answer: "2015年7月18日"
  2201/3939 | EM: 0.7665, F1: 0.8517
        (Sample) pred: "SE車", answer: "SE車"
  2401/3939 | EM: 0.7668, F1: 0.8523
        (Sample) pred: "久原房之助", answer: "久原房之助"
  2601/3939 | EM: 0.7655, F1: 0.8507
        (Sample) pred: "1900年", answer: "地方議員"
  2801/3939 | EM: 0.7608, F1: 0.8491
        (Sample) pred: "藤山一郎", answer: "東海林太郎"
  3001/3939 | EM: 0.7614, F1: 0.8503
        (Sample) pred: "フィリップ・ファンデンベルク", answer: "フィリップ・ファンデンベルク"
  3201/3939 | EM: 0.7619, F1: 0.8500
        (Sample) pred: "大峯奥駈道", answer: "大峯奥駈道"
  3401/3939 | EM: 0.7601, F1: 0.8482
        (Sample) pred: "エラーヒューゼン", answer: "アルリック・エラーヒューゼン"
  3601/3939 | EM: 0.7540, F1: 0.8435
        (Sample) pred: "「道路標示黄色見本」", answer: "「道路標示黄色見本」"
  3801/3939 | EM: 0.7506, F1: 0.8402
        (Sample) pred: "『ヘントの祭壇画』", answer: "『ヘントの祭壇画』"
F1 score: 0.8404927719328006
Exact Match: 0.751967504442752

Performance by types:

[Image: Screenshot 2022-03-01 15 26 33]

opened by akeyhero 5

Plan for releasing test set

Hi, I was searching for a Japanese QA dataset and luckily found this work. I really appreciate your hard work! I wonder whether the test set is planned to be publicly released or kept private like the original SQuAD test set.

Thanks!

opened by asahi417 2

Owner

SkelterLabs

An artificial intelligence technology company developing innovative machine intelligence technology designed to enhance the quality of users' daily lives.
