This repository contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages" published in Findings of the Association for Computational Linguistics: ACL 2021.

Overview

XL-Sum

Datasets

Disclaimer: You must agree to the license and terms of use before using the dataset.

We are releasing two versions of the dataset: an older version, which is the one reported in the paper, and a newer version with one additional language (Traditional Chinese), more data, better formatting, better extraction, larger evaluation splits, and deduplication. We recommend the newer version, so the data counts and benchmarks in this repository refer to it. The new version contains a total of 1.35 million article-summary pairs, making XL-Sum the largest text summarization dataset publicly available.

All dataset files are in .jsonl format, i.e. one JSON object per line. One example from the English dataset is given below in JSON format. The fields are self-explanatory.

{
  "id": "technology-17657859",
  "url": "https://www.bbc.com/news/technology-17657859",
  "title": "Yahoo files e-book advert system patent applications",
  "summary": "Yahoo has signalled it is investigating e-book adverts as a way to stimulate its earnings.",
  "text": "Yahoo's patents suggest users could weigh the type of ads against the sizes of discount before purchase. It says in two US patent applications that ads for digital book readers have been \"less than optimal\" to date. The filings suggest that users could be offered titles at a variety of prices depending on the ads' prominence They add that the products shown could be determined by the type of book being read, or even the contents of a specific chapter, phrase or word. The paperwork was published by the US Patent and Trademark Office late last week and relates to work carried out at the firm's headquarters in Sunnyvale, California. \"Greater levels of advertising, which may be more valuable to an advertiser and potentially more distracting to an e-book reader, may warrant higher discounts,\" it states. Free books It suggests users could be offered ads as hyperlinks based within the book's text, in-laid text or even \"dynamic content\" such as video. Another idea suggests boxes at the bottom of a page could trail later chapters or quotes saying \"brought to you by Company A\". It adds that the more willing the customer is to see the ads, the greater the potential discount. \"Higher frequencies... may even be great enough to allow the e-book to be obtained for free,\" it states. The authors write that the type of ad could influence the value of the discount, with \"lower class advertising... such as teeth whitener advertisements\" offering a cheaper price than \"high\" or \"middle class\" adverts, for things like pizza. The inventors also suggest that ads could be linked to the mood or emotional state the reader is in as a they progress through a title. For example, they say if characters fall in love or show affection during a chapter, then ads for flowers or entertainment could be triggered. The patents also suggest this could applied to children's books - giving the Tom Hanks animated film Polar Express as an example. It says a scene showing a waiter giving the protagonists hot drinks \"may be an excellent opportunity to show an advertisement for hot cocoa, or a branded chocolate bar\". Another example states: \"If the setting includes young characters, a Coke advertisement could be provided, inviting the reader to enjoy a glass of Coke with his book, and providing a graphic of a cool glass.\" It adds that such targeting could be further enhanced by taking account of previous titles the owner has bought. 'Advertising-free zone' At present, several Amazon and Kobo e-book readers offer full-screen adverts when the device is switched off and show smaller ads on their menu screens, but the main text of the titles remains free of marketing. Yahoo does not currently provide ads to these devices, and a move into the area could boost its shrinking revenues. However, Philip Jones, deputy editor of the Bookseller magazine, said that the internet firm might struggle to get some of its ideas adopted. \"This has been mooted before and was fairly well decried,\" he said. \"Perhaps in a limited context it could work if the merchandise was strongly related to the title and was kept away from the text. \"But readers - particularly parents - like the fact that reading is an advertising-free zone. Authors would also want something to say about ads interrupting their narrative flow.\""
}
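
As a minimal sketch of how such a file might be consumed, the snippet below reads a .jsonl file one record at a time using only the Python standard library. The helper names `read_jsonl` and `summary_to_text_ratio` are illustrative and not part of this repository; the field names are taken from the example record above.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one record (a dict) per non-empty line of a .jsonl file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)


def summary_to_text_ratio(record: dict) -> float:
    """Length of the summary relative to the article body, in characters."""
    return len(record["summary"]) / max(len(record["text"]), 1)
```

Because `read_jsonl` is a generator, a large split can be streamed without loading it into memory, for example `for record in read_jsonl(path): print(record["id"], record["title"])`, where `path` points to whichever split file was downloaded.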

Download the complete dataset. See the legacy section for the older version(s).

We used an 80%-10%-10% train-dev-test split for all languages, with a few exceptions. English was split 93%-3.5%-3.5% so that the evaluation set sizes resemble those of CNN/DM and XSum. Scottish Gaelic, Kyrgyz, and Sinhala had relatively few samples, so their evaluation sets were increased to 500 samples each for more reliable evaluation. The same articles were used for evaluation in the two variants of Chinese and Serbian to prevent data leakage in multilingual training. Individual dataset download links with train-dev-test example counts are given below:

| Language | ISO 639-1 Code | BBC subdomain(s) | Train | Dev | Test | Total | Link |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Amharic | am | https://www.bbc.com/amharic | 5761 | 719 | 719 | 7199 | Download |
| Arabic | ar | https://www.bbc.com/arabic | 37519 | 4689 | 4689 | 46897 | Download |
| Azerbaijani | az | https://www.bbc.com/azeri | 6478 | 809 | 809 | 8096 | Download |
| Bengali | bn | https://www.bbc.com/bengali | 8102 | 1012 | 1012 | 10126 | Download |
| Burmese | my | https://www.bbc.com/burmese | 4569 | 570 | 570 | 5709 | Download |
| Chinese (Simplified) | zh-CN | https://www.bbc.com/ukchina/simp, https://www.bbc.com/zhongwen/simp | 37362 | 4670 | 4670 | 46702 | Download |
| Chinese (Traditional) | zh-TW | https://www.bbc.com/ukchina/trad, https://www.bbc.com/zhongwen/trad | 37373 | 4670 | 4670 | 46713 | Download |
| English | en | https://www.bbc.com/english, https://www.bbc.com/sinhala * | 306522 | 11535 | 11535 | 329592 | Download |
| French | fr | https://www.bbc.com/afrique | 8697 | 1086 | 1086 | 10869 | Download |
| Gujarati | gu | https://www.bbc.com/gujarati | 9119 | 1139 | 1139 | 11397 | Download |
| Hausa | ha | https://www.bbc.com/hausa | 6418 | 802 | 802 | 8022 | Download |
| Hindi | hi | https://www.bbc.com/hindi | 70778 | 8847 | 8847 | 88472 | Download |
| Igbo | ig | https://www.bbc.com/igbo | 4183 | 522 | 522 | 5227 | Download |
| Indonesian | id | https://www.bbc.com/indonesia | 38242 | 4780 | 4780 | 47802 | Download |
| Japanese | ja | https://www.bbc.com/japanese | 7113 | 889 | 889 | 8891 | Download |
| Kirundi | rn | https://www.bbc.com/gahuza | 5746 | 718 | 718 | 7182 | Download |
| Korean | ko | https://www.bbc.com/korean | 4407 | 550 | 550 | 5507 | Download |
| Kyrgyz | ky | https://www.bbc.com/kyrgyz | 2266 | 500 | 500 | 3266 | Download |
| Marathi | mr | https://www.bbc.com/marathi | 10903 | 1362 | 1362 | 13627 | Download |
| Nepali | ne | https://www.bbc.com/nepali | 5808 | 725 | 725 | 7258 | Download |
| Oromo | om | https://www.bbc.com/afaanoromoo | 6063 | 757 | 757 | 7577 | Download |
| Pashto | ps | https://www.bbc.com/pashto | 14353 | 1794 | 1794 | 17941 | Download |
| Persian | fa | https://www.bbc.com/persian | 47251 | 5906 | 5906 | 59063 | Download |
| Pidgin** | n/a | https://www.bbc.com/pidgin | 9208 | 1151 | 1151 | 11510 | Download |
| Portuguese | pt | https://www.bbc.com/portuguese | 57402 | 7175 | 7175 | 71752 | Download |
| Punjabi | pa | https://www.bbc.com/punjabi | 8215 | 1026 | 1026 | 10267 | Download |
| Russian | ru | https://www.bbc.com/russian, https://www.bbc.com/ukrainian * | 62243 | 7780 | 7780 | 77803 | Download |
| Scottish Gaelic | gd | https://www.bbc.com/naidheachdan | 1313 | 500 | 500 | 2313 | Download |
| Serbian (Cyrillic) | sr | https://www.bbc.com/serbian/cyr | 7275 | 909 | 909 | 9093 | Download |
| Serbian (Latin) | sr | https://www.bbc.com/serbian/lat | 7276 | 909 | 909 | 9094 | Download |
| Sinhala | si | https://www.bbc.com/sinhala | 3249 | 500 | 500 | 4249 | Download |
| Somali | so | https://www.bbc.com/somali | 5962 | 745 | 745 | 7452 | Download |
| Spanish | es | https://www.bbc.com/mundo | 38110 | 4763 | 4763 | 47636 | Download |
| Swahili | sw | https://www.bbc.com/swahili | 7898 | 987 | 987 | 9872 | Download |
| Tamil | ta | https://www.bbc.com/tamil | 16222 | 2027 | 2027 | 20276 | Download |
| Telugu | te | https://www.bbc.com/telugu | 10421 | 1302 | 1302 | 13025 | Download |
| Thai | th | https://www.bbc.com/thai | 6616 | 826 | 826 | 8268 | Download |
| Tigrinya | ti | https://www.bbc.com/tigrinya | 5451 | 681 | 681 | 6813 | Download |
| Turkish | tr | https://www.bbc.com/turkce | 27176 | 3397 | 3397 | 33970 | Download |
| Ukrainian | uk | https://www.bbc.com/ukrainian | 43201 | 5399 | 5399 | 53999 | Download |
| Urdu | ur | https://www.bbc.com/urdu | 67665 | 8458 | 8458 | 84581 | Download |
| Uzbek | uz | https://www.bbc.com/uzbek | 4728 | 590 | 590 | 5908 | Download |
| Vietnamese | vi | https://www.bbc.com/vietnamese | 32111 | 4013 | 4013 | 40137 | Download |
| Welsh | cy | https://www.bbc.com/cymrufyw | 9732 | 1216 | 1216 | 12164 | Download |
| Yoruba | yo | https://www.bbc.com/yoruba | 6350 | 793 | 793 | 7936 | Download |

* Many articles in BBC Sinhala and BBC Ukrainian were written in English and Russian, respectively. They were identified using fastText and moved to the corresponding language's dataset.

** West African Pidgin English
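
The split proportions described before the table can be checked against its counts. The sketch below does this for three rows; the counts are copied from the table, while the helper name `split_percentages` is purely illustrative.

```python
from typing import Tuple

# (train, dev, test) counts copied from the table above.
COUNTS = {
    "Amharic": (5761, 719, 719),
    "English": (306522, 11535, 11535),
    "Kyrgyz": (2266, 500, 500),
}


def split_percentages(train: int, dev: int, test: int) -> Tuple[float, float, float]:
    """Return each split's share of the total, as percentages rounded to one decimal."""
    total = train + dev + test
    return tuple(round(100 * n / total, 1) for n in (train, dev, test))


for language, counts in COUNTS.items():
    print(language, split_percentages(*counts))
```

Amharic follows the default 80-10-10 split and English the 93-3.5-3.5 split, whereas Kyrgyz has a visibly smaller training share because its evaluation sets were fixed at 500 samples each.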

Models

We are releasing a multilingual model checkpoint trained for 50k steps on the new data. To use this model for evaluation or inference, refer to Training & Evaluation.
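
For orientation only, here is a rough sketch of what inference with a sequence-to-sequence checkpoint could look like through the Hugging Face `transformers` library. Several things here are assumptions not stated in this section: the checkpoint identifier in `CHECKPOINT`, the input and output length limits, and the beam-search settings. The helper names are hypothetical. The instructions under Training & Evaluation remain the authoritative reference.

```python
import re

# Assumed Hugging Face identifier; replace with the released checkpoint's
# actual name or a local path.
CHECKPOINT = "csebuetnlp/mT5_multilingual_XLSum"


def normalize_whitespace(text: str) -> str:
    """Collapse newlines and runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text.strip())


def summarize(text: str, checkpoint: str = CHECKPOINT, max_summary_tokens: int = 84) -> str:
    """Generate a summary for one article (downloads the model on first use)."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    inputs = tokenizer(
        normalize_whitespace(text),
        return_tensors="pt",
        truncation=True,
        max_length=512,  # assumed input limit
    )
    output_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_summary_tokens,  # assumed summary limit
        num_beams=4,                    # assumed decoding settings
        no_repeat_ngram_size=2,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `summarize(article_text)` returns a single summary string. Loading the model inside the function keeps the sketch short; for batch use, the tokenizer and model would normally be loaded once and reused.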

Benchmarks

Multilingual model scores on test sets are given below. We are also releasing the model-generated outputs for future analysis.

Language ROUGE-1 / ROUGE-2 / ROUGE-L
Amharic 20.0485 / 7.4111 / 18.0753
Arabic 34.9107 / 14.7937 / 29.1623
Azerbaijani 21.4227 / 9.5214 / 19.3331
Bengali 29.5653 / 12.1095 / 25.1315
Burmese 15.9626 / 5.1477 / 14.1819
Chinese (Simplified) 39.4071 / 17.7913 / 33.406
Chinese (Traditional) 37.1866 / 17.1432 / 31.6184
English 37.601 / 15.1536 / 29.8817
French 35.3398 / 16.1739 / 28.2041
Gujarati 21.9619 / 7.7417 / 19.86
Hausa 39.4375 / 17.6786 / 31.6667
Hindi 38.5882 / 16.8802 / 32.0132
Igbo 31.6148 / 10.1605 / 24.5309
Indonesian 37.0049 / 17.0181 / 30.7561
Japanese 48.1544 / 23.8482 / 37.3636
Kirundi 31.9907 / 14.3685 / 25.8305
Korean 23.6745 / 11.4478 / 22.3619
Kyrgyz 18.3751 / 7.9608 / 16.5033
Marathi 22.0141 / 9.5439 / 19.9208
Nepali 26.6547 / 10.2479 / 24.2847
Oromo 18.7025 / 6.1694 / 16.1862
Pashto 38.4743 / 15.5475 / 31.9065
Persian 36.9425 / 16.1934 / 30.0701
Pidgin 37.9574 / 15.1234 / 29.872
Portuguese 37.1676 / 15.9022 / 28.5586
Punjabi 30.6973 / 12.2058 / 25.515
Russian 32.2164 / 13.6386 / 26.1689
Scottish Gaelic 29.0231 / 10.9893 / 22.8814
Serbian (Cyrillic) 23.7841 / 7.9816 / 20.1379
Serbian (Latin) 21.6443 / 6.6573 / 18.2336
Sinhala 27.2901 / 13.3815 / 23.4699
Somali 31.5563 / 11.5818 / 24.2232
Spanish 31.5071 / 11.8767 / 24.0746
Swahili 37.6673 / 17.8534 / 30.9146
Tamil 24.3326 / 11.0553 / 22.0741
Telugu 19.8571 / 7.0337 / 17.6101
Thai 37.3951 / 17.275 / 28.8796
Tigrinya 25.321 / 8.0157 / 21.1729
Turkish 32.9304 / 15.5709 / 29.2622
Ukrainian 23.9908 / 10.1431 / 20.9199
Urdu 39.5579 / 18.3733 / 32.8442
Uzbek 16.8281 / 6.3406 / 15.4055
Vietnamese 32.8826 / 16.2247 / 26.0844
Welsh 32.6599 / 11.596 / 26.1164
Yoruba 31.6595 / 11.6599 / 25.0898
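
To make the three metrics in the table concrete, here is a simplified sketch of ROUGE-1, ROUGE-2, and ROUGE-L F-scores. It is not the scorer used to produce the numbers above: it assumes lowercased whitespace tokenization with no stemming, which is unsuitable for languages written without spaces, and the function names `rouge_n` and `rouge_l` are illustrative only. Scores are returned in the range 0 to 1.

```python
from collections import Counter
from typing import List


def _ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _f1(overlap: int, cand_total: int, ref_total: int) -> float:
    """Harmonic mean of precision and recall, or 0.0 when undefined."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision, recall = overlap / cand_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(reference: str, candidate: str, n: int) -> float:
    """F-score over n-gram overlap between a reference and a candidate summary."""
    ref = _ngrams(reference.lower().split(), n)
    cand = _ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    return _f1(overlap, sum(cand.values()), sum(ref.values()))


def rouge_l(reference: str, candidate: str) -> float:
    """F-score based on the longest common subsequence of tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    # Dynamic program for the longest common subsequence length.
    prev = [0] * (len(cand) + 1)
    for r in ref:
        curr = [0]
        for j, c in enumerate(cand, start=1):
            curr.append(prev[j - 1] + 1 if r == c else max(prev[j], curr[j - 1]))
        prev = curr
    return _f1(prev[-1], len(cand), len(ref))
```

For the reference "the cat sat on the mat" and the candidate "the cat is on the mat", five of six unigrams and three of five bigrams match, and the longest common subsequence has five tokens.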

Multilingual ROUGE

Training & Evaluation

License

Contents of this repository are restricted to non-commercial research purposes only, under the Creative Commons Attribution 4.0 International License. Copyright of the dataset contents belongs to the original copyright holders.

Citation

If you use any of the datasets, models or code modules, please cite the following paper:

@inproceedings{hasan-etal-2021-xlsum,
    title = "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages",
    author = "Hasan, Tahmid and
      Bhattacharjee, Abhik and
      Islam, Md Saiful and
      Samin, Kazi and
      Li, Yuan-Fang and
      Kang, Yong-Bin and
      Rahman, M. Sohel and
      Shahriyar, Rifat",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2021",
    month = "August",
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "http://arxiv.org/abs/2106.13822"
}
Comments
  • mt5 small generating wrong predictions

    I am trying to fine-tune mt5-small on a Telugu corpus, but all the generated summaries include <extra_id_0> tokens. Please suggest how to fix it.

    Example generated output:

    <extra_id_0> హోమియోపతి కళాశాలను న్యూఢిల్లీ సెంట్రల్ కౌన్సిల్ ఆఫ్ ప్రత్యేక బృందం బుధవారం పరిశీలించింది.

    How can I avoid this <extra_id_0> token in the summary? See this issue for a better understanding.

    opened by ashokurlana 3
  • TypeError when using RougeScorer

    When I try to run scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True, lang='danish'), I get the following error: TypeError: __init__() got an unexpected keyword argument 'lang'

    opened by sarakolding 2
  • How to use rouge scorer client correctly

    I followed the README under multilingual_rouge_scoring/ and the setup succeeded, but when I ran python -m rouge.rouge --args I got

    Error while finding module specification for 'rouge.rouge' (ModuleNotFoundError: __path__ attribute not found on 'rouge' while trying to find 'rouge.rouge')
    

    when I used python rouge.py --args I got

    File "/net/papilio/storage2/bowenz/tools/xl-sum/multilingual_rouge_scoring/rouge.py", line 71, in main
        scorer = rouge_scorer.RougeScorer(FLAGS.rouge_types, FLAGS.use_stemmer, lang=FLAGS.lang)
    TypeError: __init__() takes 2 positional arguments but 3 were given
    

    and when I used rouge --args I got

    rougeval.sh: 1: rouge: not found
    

    Any help? thanks.

    bug 
    opened by 18445864529 2
  • when the text is very short

    Hi, first of all, thanks for such a great project. When I tried your inference code on Hugging Face, I found that when the text is short, for example just "weather is fine", it generates a longer, unrelated summary. I am totally new to summarization, so is there anything I should pay attention to in such a case? For example, should the input_ids be generated differently? I would really appreciate your guidance if possible, thanks!

    opened by 08tjlys 2