
XNLI: The Cross-Lingual NLI Corpus

XNLI is an evaluation corpus for language transfer and cross-lingual sentence classification in 15 languages.

New: XLM and Multilingual BERT use XNLI to evaluate the quality of their cross-lingual representations.

Introduction

Many NLP systems (e.g. sentiment analysis, topic classification, feed ranking) rely on training data in one high-resource language, but cannot be directly used to make predictions for other languages at test time. This problem happens in almost any industrial application that involves cross-lingual data.

Machine translation can be used to translate any sample into the high-resource language to alleviate this issue. However, having an MT system for every direction is costly, and translation may not be the best solution for cross-lingual classification. Cross-lingual encoders are a cheaper and more elegant alternative.

To evaluate such cross-lingual sentence understanding methods, we built XNLI, an extension of the SNLI/MultiNLI corpus in 15 languages. XNLI raises the following research question: how can we make predictions in any language at test time, when we only have English training data for that task?

While industrial applications may not include NLI among their routine tasks, we believe that NLI is a good testbed for evaluating cross-lingual sentence representations, and that better approaches for XNLI will result in better cross-lingual understanding (XLU) methods.
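
As a concrete illustration of this zero-shot setup, the sketch below fine-tunes a multilingual encoder on English NLI data and evaluates it directly on a non-English XNLI test set. It assumes the Hugging Face transformers and datasets libraries and their community "xnli" dataset loader; the model name, label mapping, and evaluation loop are illustrative assumptions, not part of the XNLI release.

    # Minimal zero-shot transfer sketch (assumptions noted above).
    import torch
    from datasets import load_dataset
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    MODEL = "bert-base-multilingual-cased"  # illustrative multilingual encoder
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=3)

    # Step 1: fine-tune `model` on English MultiNLI (standard loop, omitted).
    # Step 2: evaluate the same weights directly on, e.g., the French test set.
    test_fr = load_dataset("xnli", "fr", split="test")

    correct = 0
    for ex in test_fr:
        enc = tokenizer(ex["premise"], ex["hypothesis"],
                        truncation=True, return_tensors="pt")
        with torch.no_grad():
            pred = model(**enc).logits.argmax(dim=-1).item()
        correct += int(pred == ex["label"])  # assumed: 0=entailment, 1=neutral, 2=contradiction
    print(f"fr zero-shot accuracy: {correct / len(test_fr):.3f}")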

The XNLI corpus

The Cross-lingual Natural Language Inference (XNLI) corpus is a crowd-sourced collection of 5,000 test and 2,500 dev pairs for the MultiNLI corpus. The pairs are annotated with textual entailment labels and translated into 14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu. This results in 112.5k annotated pairs. Since each premise can be paired with the corresponding hypothesis in any of the 15 languages, the corpus yields more than 1.5M possible sentence combinations.


Download

XNLI is distributed in a single ZIP file containing the corpus as both JSON lines (jsonl) and tab-separated text (txt). The English training data can be found on the MultiNLI website.

Download: XNLI 1.0 (17MB, ZIP)
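
For reference, here is a short sketch of reading the jsonl files with the standard library. The path and field names follow the XNLI 1.0 package layout (sentence1 = premise, sentence2 = hypothesis); adjust them if your local layout differs.

    import json
    from collections import Counter

    pairs = []
    with open("XNLI-1.0/xnli.dev.jsonl", encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)
            pairs.append((ex["language"], ex["sentence1"],
                          ex["sentence2"], ex["gold_label"]))

    # 2,500 dev pairs per language, 15 languages
    print(Counter(lang for lang, *_ in pairs))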

XNLI can also be used as a 15-way parallel corpus of 10,000 sentences, for building or evaluating machine translation systems. It provides additional open parallel data for low-resource languages such as Swahili and Urdu.

Download: XNLI-15way (12MB, ZIP)
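
A sketch of using the 15-way package as parallel data follows. The file name below is an assumption about the zip's layout, and the header row is assumed to hold the language codes; verify both against the extracted archive.

    import csv

    with open("XNLI-15way/xnli.15way.orig.tsv", encoding="utf-8", newline="") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))

    # e.g., English-Swahili sentence pairs for MT evaluation
    en_sw = [(row["en"], row["sw"]) for row in rows]
    print(len(en_sw), en_sw[0])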

Useful XLU resources

Parallel data for aligning sentence encoders: OPUS: the open parallel corpus

Monolingual word embeddings in 157 languages: fastText CommonCrawl vectors

Aligning word embeddings: MUSE

Data description paper and citation

A description of the data can be found here or in the corpus package zip. If you use the corpus in an academic paper, please consider citing:

@InProceedings{conneau2018xnli,
  author = "Conneau, Alexis
        and Rinott, Ruty
        and Lample, Guillaume
        and Williams, Adina
        and Bowman, Samuel R.
        and Schwenk, Holger
        and Stoyanov, Veselin",
  title = "XNLI: Evaluating Cross-lingual Sentence Representations",
  booktitle = "Proceedings of the 2018 Conference on Empirical Methods
               in Natural Language Processing",
  year = "2018",
  publisher = "Association for Computational Linguistics",
  location = "Brussels, Belgium",
}

Baselines and MT data

The XNLI paper presents several baselines for language adaptation.

We also release the machine translated data for reproducing the TRANSLATE-TRAIN and TRANSLATE-TEST baselines (sketched below):

Download: XNLI-MT 1.0 (445MB, ZIP)
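
In these baselines, TRANSLATE-TRAIN machine-translates the English training set into the target language and trains a classifier in that language, while TRANSLATE-TEST translates the target-language test set into English and reuses the English-trained classifier. A schematic sketch follows, where the translate functions and Classifier are hypothetical stand-ins for an MT system and an NLI model, not an API shipped with XNLI.

    def translate_train(train_en, translate_en_to_tgt, Classifier):
        # Machine-translate the English training pairs into the target
        # language, then train a classifier directly in that language.
        train_tgt = [(translate_en_to_tgt(p), translate_en_to_tgt(h), y)
                     for p, h, y in train_en]
        clf = Classifier()
        clf.fit(train_tgt)
        return clf

    def translate_test(test_tgt, translate_tgt_to_en, english_clf):
        # Machine-translate target-language test pairs into English and
        # score them with the classifier trained on English data.
        return [english_clf.predict(translate_tgt_to_en(p), translate_tgt_to_en(h))
                for p, h in test_tgt]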

License

XNLI is licensed under the license found in the LICENSE file. See details in the XNLI paper.

Comments
  • Cannot download XNLI-15way.zip

    When I open the XNLI-15way.zip link, https://s3.amazonaws.com/xnli/XNLI-15way.zip, it returns the error below. Help me, please, many thanks.

    This XML file does not appear to have any style information associated with it. The document tree is shown below.
    <Error>
    <Code>NoSuchBucket</Code>
    <Message>The specified bucket does not exist</Message>
    <BucketName>xnli</BucketName>
    <RequestId>98B2A67BB673D19F</RequestId>
    <HostId>
    h6joAzTWqJSaOLywynelTZTxW8WYTVLsK79Ezdt/sd6mLQeqwQ4Nrni647VHIeZmliBbnXgfKo4=
    </HostId>
    </Error>
    
    opened by yucc2018 3
  • XNLI-15way and XNLI-MT 1.0 (445MB, ZIP) links are broken

    <Error>
    <Code>AccessDenied</Code>
    <Message>Access Denied</Message>
    <RequestId>090FC4747CC264FC</RequestId>
    <HostId>
    acZyB0MmmiFHNDvUp/G+B/Ok0/WdReOvjCICqv4+Dhi2V82CUpeTAh+R0vNy9LbsFpNFlOnhTCA=
    </HostId>
    </Error>
    
    opened by delmaksym 2
  • Updated the broken links

    Updated the broken links of the following anchor texts with their working alternatives.

    • XNLI 1.0
    • XNLI-MT 1.0

    Added comments to the following anchors to denote that their links are broken and that there seem to be no alternatives for them on other websites.

    • Examples
    • XNLI-15way
    opened by e-budur 1
  • Adding Code of Conduct file

    This pull request was created automatically because we noticed your project was missing a Code of Conduct file.

    Code of Conduct files facilitate respectful and constructive communities by establishing expected behaviors for project contributors.

    This PR was crafted with love by Facebook's Open Source Team.

    CLA Signed 
    opened by facebook-github-bot 0
  • Adding Contributing file

    This pull request was created automatically because we noticed your project was missing a Contributing file.

    CONTRIBUTING files explain how a developer can contribute to the project - which you should actively encourage.

    This PR was crafted with love by Facebook's Open Source Team.

    CLA Signed 
    opened by facebook-github-bot 0
  • Label change 'contradictory' to 'contradiction' issue

    In the original MNLI and XNLI papers, the label for two contradicting sentences is named 'contradiction', but in XNLI 1.0.zip the labels are 'contradictory' instead of 'contradiction' (for all languages).

    This can be a problem when training on multiple languages at once, so I think changing the label 'contradictory' to 'contradiction' would be proper.

    opened by raqoon886 0
  • Where are the translate-test files located?

    Hello

    I have managed to download the XNLI-MT files but only the translate-train files are located in the zip. Where can I find the translate-test files?

    Thanks

    opened by nikitacs16 0
  • XNLI Arabic data read problem

    Following this and this script, I tried to load the Arabic data with the script below.

    # data paths
    MAIN_PATH=$PWD
    OUTPATH=$PWD/data/xnli
    XNLI_PATH=$PWD/data/xnli/XNLI-1.0
    
    # tools paths
    TOOLS_PATH=$PWD/tools
    TOKENIZE=$TOOLS_PATH/tokenize.sh
    LOWER_REMOVE_ACCENT=$TOOLS_PATH/lowercase_and_remove_accent.py
    
    # install tools
    ./scripts/install-tools.sh
    
    # create directories
    mkdir -p $OUTPATH
    
    # download data
    if [ ! -d $OUTPATH/XNLI-MT-1.0 ]; then
      if [ ! -f $OUTPATH/XNLI-MT-1.0.zip ]; then
        wget -c https://dl.fbaipublicfiles.com/XNLI/XNLI-MT-1.0.zip -P $OUTPATH
      fi
      unzip $OUTPATH/XNLI-MT-1.0.zip -d $OUTPATH
    fi
    if [ ! -d $OUTPATH/XNLI-1.0 ]; then
      if [ ! -f $OUTPATH/XNLI-1.0.zip ]; then
        wget -c https://dl.fbaipublicfiles.com/XNLI/XNLI-1.0.zip -P $OUTPATH
      fi
      unzip $OUTPATH/XNLI-1.0.zip -d $OUTPATH
    fi
    
    
    # Arabic train set (the loop below runs for lg=ar;
    # columns: 1 = premise, 2 = hypothesis, 3 = label)
    for lg in ar; do
      echo "*** Preparing $lg train set ****"
      echo -e "premise\thypo\tlabel" > $XNLI_PATH/$lg.train
      sed '1d'  $OUTPATH/XNLI-MT-1.0/multinli/multinli.train.$lg.tsv | cut -f1 | python $LOWER_REMOVE_ACCENT > $XNLI_PATH/train.f1
      sed '1d'  $OUTPATH/XNLI-MT-1.0/multinli/multinli.train.$lg.tsv | cut -f2 | python $LOWER_REMOVE_ACCENT > $XNLI_PATH/train.f2
      sed '1d'  $OUTPATH/XNLI-MT-1.0/multinli/multinli.train.$lg.tsv | cut -f3 | sed 's/contradictory/contradiction/g' > $XNLI_PATH/train.f3
      paste $XNLI_PATH/train.f1 $XNLI_PATH/train.f2 $XNLI_PATH/train.f3 >> $XNLI_PATH/$lg.train
    done
    

    Now lines 390702-392701 of $XNLI_PATH/train.f2 are empty, so given the header premise\thypo\tlabel, the hypo field is always empty for lines 390702-392701 of the ar.train file.

    Is this correct behavior?

    @aconneau

    opened by sbmaruf 0