Amazon Multilingual Counterfactual Dataset (AMCD)

Last update: Sep 20, 2022

Overview

This repository contains a dataset described in the paper:

I Wish I Would Have Loved This One, But I Didn’t – A Multilingual Dataset for Counterfactual Detection in Product Reviews. James O’Neill, Polina Rozenshtein, Ryuichi Kiryo, Motoko Kubota, Danushka Bollegala. EMNLP'21. arxiv version

The dataset contains sentences from Amazon customer reviews (sampled from the Amazon product review dataset), annotated for counterfactual detection (CFD) as a binary classification task. Counterfactual statements describe events that did not or cannot take place. They can often be identified as statements of the form "If p was true, then q would be true", i.e. assertions whose antecedent (p) and consequent (q) are known or assumed to be false.

The key features of this dataset are:

The dataset is multilingual and contains sentences in English, German, and Japanese.
The sentences were labeled by professional linguists to ensure high annotation quality.
The dataset is supplemented with the annotation guidelines and definitions, which were worked out by professional linguists. We also provide the clue word lists, which contain words typical of counterfactual sentences and were used for initial data filtering. The clue word lists were also compiled by professional linguists.
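
To make the role of the clue word lists concrete, here is a minimal sketch of clue-word filtering. It is an illustration under assumptions, not the procedure used to build the dataset: the function names, the clue words, and the sentences below are all hypothetical, and the matching rule (case-insensitive whole-word or whole-phrase match) is one plausible choice among several.

```python
import re


def build_clue_pattern(clue_words):
    """Compile one case-insensitive pattern matching any clue word as a whole word or phrase."""
    # Longest clues first, so multi-word clues are tried before shorter ones.
    ordered = sorted(clue_words, key=len, reverse=True)
    alternation = "|".join(re.escape(clue) for clue in ordered)
    # Lookarounds instead of \b so clues containing spaces or apostrophes match as a unit.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)


def filter_candidates(sentences, clue_words):
    """Keep only the sentences that contain at least one clue word."""
    pattern = build_clue_pattern(clue_words)
    return [s for s in sentences if pattern.search(s)]


# Hypothetical clue words and toy sentences, for illustration only.
clue_words = ["wish", "if only", "would have"]
sentences = [
    "I wish the battery lasted longer.",
    "The battery lasts two days.",
    "If only it came in blue.",
]
print(filter_candidates(sentences, clue_words))
```

A filter like this only selects candidate sentences; whether a candidate is actually counterfactual still has to be decided by annotation.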

Please see the paper for the data statistics and a detailed description of data collection and annotation.

For the dataset format please see README.txt.

Cite

If you use this dataset in your research, please cite the paper.

License Summary

The documentation is made available under the Creative Commons Attribution-ShareAlike 4.0 International License. See the LICENSE file.

Comments

English clue word statistics

Hey!

Thanks for the great work and for sharing it with such good documentation.

I wanted to reproduce the statistics in Table 13 using the word_tokenize function from nltk and sklearn's CountVectorizer, but I could not.

The problems I observed are:

the numbers of occurrences of the clue words that I get do not match yours,
the clue word "doesn't" is tokenized as "does" and "n't" by word_tokenize. I suspect that you used a different tokenization method when generating these statistics.

Can you please help me get the same results?

Below is minimal code to reproduce my results.

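import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: path to a local copy of this repository; adjust as needed.
AMAZON_DATA_PATH = "."

# Load the three English splits and combine them into one frame.
ext_train_df = pd.read_csv(f"{AMAZON_DATA_PATH}/data/EN-ext_train.tsv", sep='\t')
ext_eval_df = pd.read_csv(f"{AMAZON_DATA_PATH}/data/EN-ext_valid.tsv", sep='\t')
ext_test_df = pd.read_csv(f"{AMAZON_DATA_PATH}/data/EN-ext_test.tsv", sep='\t')
df = pd.concat([ext_train_df, ext_eval_df, ext_test_df], ignore_index=True)

# Read the clue words, one per line, skipping empty lines.
with open(f"{AMAZON_DATA_PATH}/clue_words/counterfactual_clue_words_en.txt") as f:
    clue_words = [line.strip() for line in f if line.strip()]

# Build a sentence-by-token count matrix using the nltk tokenizer.
corpus = df.sentence.values
vectorizer = CountVectorizer(tokenizer=word_tokenize)
X = vectorizer.fit_transform(corpus)

# For each clue word, count the sentences that contain it as a single token.
for i, clue_word in enumerate(clue_words):
    # CountVectorizer lowercases by default, so look up the lowercased clue word.
    col = vectorizer.vocabulary_.get(clue_word.lower())
    # Clue words that are not single tokens (e.g. "doesn't") are missing from the vocabulary.
    freq = 0 if col is None else int(np.count_nonzero(X[:, col].toarray()))
    print(f"{i}. {clue_word} => {freq}")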

opened by dopc
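
One possible source of the mismatch described in the comment above is the tokenizer: word_tokenize splits contractions, so a clue word such as "doesn't" never appears as a single token. The sketch below shows an alternative, regex-based tokenizer that keeps contractions whole and then counts, for each token, the number of sentences it occurs in. This is an assumption about what might help, not the method used to produce Table 13; `TOKEN_RE`, `tokenize`, and `document_frequency` are hypothetical names, and the sentences are toy examples rather than dataset entries.

```python
import re
from collections import Counter

# Matches a run of word characters, optionally followed by apostrophe parts,
# so "doesn't" stays one token.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*")


def tokenize(text):
    """Lowercase, normalize curly apostrophes, and split into tokens."""
    return TOKEN_RE.findall(text.lower().replace("\u2019", "'"))


def document_frequency(sentences):
    """Count, for every token, the number of sentences it appears in."""
    counts = Counter()
    for sentence in sentences:
        # A set, so repeated tokens within one sentence are counted once.
        counts.update(set(tokenize(sentence)))
    return counts


# Toy sentences, not taken from the dataset.
sentences = [
    "It doesn't fit, and I wish it did.",
    "I wish I had read the reviews first.",
    "It doesn\u2019t matter, it doesn't work anyway.",
]
freq = document_frequency(sentences)
for clue in ["wish", "doesn't", "should"]:
    print(f"{clue} => {freq[clue]}")
```

Multi-word clue words are not covered by a single-token count like this one; they would need phrase matching on the raw sentence instead.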
