JaQuAD: Japanese Question Answering Dataset

Overview

Japanese Question Answering Dataset (JaQuAD), released in 2022, is a human-annotated dataset created for Japanese machine reading comprehension. JaQuAD was developed to provide a SQuAD-like QA dataset in Japanese. It contains 39,696 question-answer pairs; questions and answers were manually curated by human annotators, and contexts were collected from Japanese Wikipedia articles.

For more information on how the dataset was created, refer to our paper, JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension.

Data

JaQuAD consists of three sets: train, validation, and test. They were created from disjoint sets of Wikipedia articles. The following table shows statistics for each set:

| Set        | Number of Articles | Number of Contexts | Number of Questions |
|------------|-------------------:|-------------------:|--------------------:|
| Train      |                691 |              9,713 |              31,748 |
| Validation |                101 |              1,431 |               3,939 |
| Test       |                109 |              1,479 |               4,009 |

You can also download our dataset here. (The test set is not publicly released yet.)

```python
from datasets import load_dataset

jaquad_data = load_dataset('SkelterLabsInc/JaQuAD')
```
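Once loaded, each example follows a SQuAD-style schema with character-based answer offsets (an assumption; verify against the dataset card). A minimal sketch of recovering the answer span from such an example:

```python
# Hypothetical JaQuAD-style example following the SQuAD schema. This record is
# fabricated for illustration; real examples come from load_dataset above.
example = {
    'context': 'アレクサンダー・グラハム・ベルは、スコットランド生まれの科学者である。',
    'question': 'ベルはどこで生まれたか?',
    'answers': {'text': ['スコットランド'], 'answer_start': [17]},
}

# Recover the answer from the context using the character-level offset.
start = example['answers']['answer_start'][0]
text = example['answers']['text'][0]
assert example['context'][start:start + len(text)] == text
print(text)  # -> スコットランド
```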

Baseline

We also provide a baseline model for JaQuAD for comparison. We created this model by fine-tuning a publicly available Japanese BERT model on JaQuAD. You can see the performance of the baseline model in the table below.

For more information on the model's creation, refer to JaQuAD.ipynb.

| Pre-trained LM | Dev F1 | Dev EM | Test F1 | Test EM |
|----------------|-------:|-------:|--------:|--------:|
| BERT-Japanese  |  77.35 |  61.01 |   78.92 |   63.38 |
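For reference, EM (exact match) and F1 can be sketched as below. This is an assumption about the metric, not JaQuAD's official evaluation script, which may differ in normalization details; for Japanese, F1 is typically computed at the character level since there is no whitespace tokenization.

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    """1.0 if the prediction matches the gold answer exactly, else 0.0."""
    return float(pred == gold)

def char_f1(pred: str, gold: str) -> float:
    """Character-level F1: harmonic mean of precision and recall over
    the multiset of characters shared by prediction and gold answer."""
    common = Counter(pred) & Counter(gold)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(char_f1('スコットランド', 'スコットランド生まれ'))  # partial credit for a shorter span
```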

You can download the baseline model here.

Usage

```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

question = 'アレクサンダー・グラハム・ベルは、どこで生まれたの?'
context = 'アレクサンダー・グラハム・ベルは、スコットランド生まれの科学者、発明家、工学者である。世界初の実用的電話の発明で知られている。'

model = AutoModelForQuestionAnswering.from_pretrained(
    'SkelterLabsInc/bert-base-japanese-jaquad')
tokenizer = AutoTokenizer.from_pretrained(
    'SkelterLabsInc/bert-base-japanese-jaquad')

inputs = tokenizer(
    question, context, add_special_tokens=True, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]
outputs = model(**inputs)
answer_start_scores = outputs.start_logits
answer_end_scores = outputs.end_logits

# Get the most likely start of the answer with the argmax of the score.
answer_start = torch.argmax(answer_start_scores)
# Get the most likely end of the answer with the argmax of the score.
# 1 is added to `answer_end` because the end index is inclusive.
answer_end = torch.argmax(answer_end_scores) + 1

answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
# answer = 'スコットランド'
```

Limitations

This dataset is not yet complete, and its social biases have not yet been investigated.

If you find any errors in JaQuAD, please contact [email protected].

Reference

If you use our dataset or code, please cite our paper:

@misc{so2022jaquad,
      title={{JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension}},
      author={ByungHoon So and Kyuhong Byun and Kyungwon Kang and Seongjin Cho},
      year={2022},
      eprint={2202.01764},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

LICENSE

The JaQuAD dataset is licensed under the [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.

Have Questions?

Ask us at [email protected].

Comments
  • What does get_offset function do? (Insufficient info in docs)

    I ran into this assertion error in the get_offset function when using models other than those from cl-tohoku in JaQuAD.ipynb:

    assert unk_pointer is not None, \
                    'Normalized context and tokens are not matched'
    

    I know this is something related to tokenization but I still can't quite figure it out even after going through the docstring:

    '''The character-level start/end offsets of a token within a context.
        Algorithm:
        1. Make offsets of normalized context within the original context.
        2. Make offsets of tokens (input_ids) within the normalized context.
    
        Arguments:
        input_ids -- Token ids of tokenized context (by tokenizer).
        context -- String of context
        tokenizer
        norm_form
    
        Return:
            List[Tuple[int, int]]: Offsets of tokens within the input context.
            For each token, the offsets are presented as a tuple of (start
            position index, end position index). Both indices are inclusive.
        '''
    

    What is the motivation behind this function and in what circumstance would you need it?

    Thanks.

    opened by yuenherny 0
  • Small modifications to enhance the baseline performance like dev EM = 75%

    Thank you for sharing the great Japanese QA dataset!

    I would like to share my changes, which improve the baseline performance by 10%+ (EM).

    Inference log:

      1/3939 | EM: 1.0000, F1: 1.0000
            (Sample) pred: "奈良", answer: "奈良"
    Token indices sequence length is longer than the specified maximum sequence length for this model (538 > 512). Running this sequence through the model will result in indexing errors
      201/3939 | EM: 0.7861, F1: 0.8720
            (Sample) pred: "スティーブンズ・プリンター社", answer: "スティーブンズ・プリンター社"
      401/3939 | EM: 0.7781, F1: 0.8698
            (Sample) pred: "湿潤状態", answer: "湿潤状態"
      601/3939 | EM: 0.7604, F1: 0.8593
            (Sample) pred: "1881年", answer: "1881年"
      801/3939 | EM: 0.7491, F1: 0.8490
            (Sample) pred: "器具メーカー", answer: "光源メーカー"
      1001/3939 | EM: 0.7652, F1: 0.8555
            (Sample) pred: "Graduation", answer: "Graduation"
      1201/3939 | EM: 0.7619, F1: 0.8515
            (Sample) pred: "黒田孝高", answer: "黒田孝高"
      1401/3939 | EM: 0.7523, F1: 0.8445
            (Sample) pred: "煙害", answer: "煙害"
      1601/3939 | EM: 0.7552, F1: 0.8440
            (Sample) pred: "カルロス門", answer: "カルロス門"
      1801/3939 | EM: 0.7601, F1: 0.8479
            (Sample) pred: "「一心寮」", answer: "「一心寮」"
      2001/3939 | EM: 0.7651, F1: 0.8517
            (Sample) pred: "2015年7月18日", answer: "2015年7月18日"
      2201/3939 | EM: 0.7665, F1: 0.8517
            (Sample) pred: "SE車", answer: "SE車"
      2401/3939 | EM: 0.7668, F1: 0.8523
            (Sample) pred: "久原房之助", answer: "久原房之助"
      2601/3939 | EM: 0.7655, F1: 0.8507
            (Sample) pred: "1900年", answer: "地方議員"
      2801/3939 | EM: 0.7608, F1: 0.8491
            (Sample) pred: "藤山一郎", answer: "東海林太郎"
      3001/3939 | EM: 0.7614, F1: 0.8503
            (Sample) pred: "フィリップ・ファンデンベルク", answer: "フィリップ・ファンデンベルク"
      3201/3939 | EM: 0.7619, F1: 0.8500
            (Sample) pred: "大峯奥駈道", answer: "大峯奥駈道"
      3401/3939 | EM: 0.7601, F1: 0.8482
            (Sample) pred: "エラーヒューゼン", answer: "アルリック・エラーヒューゼン"
      3601/3939 | EM: 0.7540, F1: 0.8435
            (Sample) pred: "「道路標示黄色見本」", answer: "「道路標示黄色見本」"
      3801/3939 | EM: 0.7506, F1: 0.8402
            (Sample) pred: "『ヘントの祭壇画』", answer: "『ヘントの祭壇画』"
    F1 score: 0.8404927719328006
    Exact Match: 0.751967504442752
    

    Performance by types:

    (Screenshot: performance broken down by question type, 2022-03-01)

    opened by akeyhero 5
  • Plan for releasing test set

    Hi, I was searching for a Japanese QA dataset and was lucky to find this work. I really appreciate your hard work! I wonder whether the test set is planned to be publicly released or kept private like the original SQuAD's.

    Thanks!

    opened by asahi417 2