MassiveSumm: a very large-scale, very multilingual, news summarisation dataset

Daniel Varab

Last update: Dec 16, 2022

This repository contains links to data and code to fetch and reproduce the data described in our EMNLP 2021 paper titled "MassiveSumm: a very large-scale, very multilingual, news summarisation dataset". MassiveSumm is a (massive) multilingual dataset covering 92 diverse languages across 35 writing scripts. With this work we attempt to take the first steps towards providing a diverse data foundation for summarisation in many languages.

Disclaimer: The data is noisy and recall-oriented. We highly recommend reading our analysis of the efficacy of this type of data collection method.

Get the Data

Redistributing data from the web is a tricky matter. We are working on providing efficient access to the entire dataset, as well as expanding it even further. For the time being we only provide links to reproduce subsets of the entire dataset through either Common Crawl or the Wayback Machine. The dataset is also available upon request ([email protected]).
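As a rough illustration of what fetching from the Wayback Machine involves, the sketch below builds a snapshot URL from an original article URL and a capture time. It relies only on the public Wayback Machine URL scheme (`https://web.archive.org/web/<timestamp>/<url>`, with the `id_` modifier requesting the archived page without the Wayback toolbar). The function name `wayback_url` and its inputs are assumptions for illustration; the layout of the URL files in this repository is not described here and may differ.

```python
from datetime import datetime, timezone

WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, captured_at: datetime, raw: bool = True) -> str:
    """Build a Wayback Machine snapshot URL (hypothetical helper, not part of this repo).

    captured_at should be timezone-aware; it is converted to UTC and
    rendered as the 14-digit YYYYMMDDhhmmss timestamp the archive uses.
    """
    timestamp = captured_at.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    # "id_" asks for the archived content as captured, without the injected toolbar.
    modifier = "id_" if raw else ""
    return f"{WAYBACK_PREFIX}/{timestamp}{modifier}/{original_url}"


if __name__ == "__main__":
    # example.com is a placeholder, not a URL taken from the dataset.
    when = datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
    print(wayback_url("https://example.com/news/1", when))
```

The function only constructs the URL; downloading and extracting the article and summary are left out, since those steps depend on the code and metadata linked from this repository.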

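The table below lists, per language, the files containing the URLs and metadata required to fetch the data from the Wayback Machine (wayback) and Common Crawl (cc). A "-" means no file is available for that source.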

lang	wayback	cc
afr	link	-
amh	link	link
ara	link	link
asm	link	-
aym	link	-
aze	link	link
bam	link	link
ben	link	link
bod	link	link
bos	link	link
bul	link	link
cat	link	-
ces	link	link
cym	link	link
dan	link	link
deu	link	link
ell	link	link
eng	link	link
epo	link	-
fas	link	link
fil	link	-
fra	link	link
ful	link	link
gle	link	link
guj	link	link
hat	link	link
hau	link	link
heb	link	-
hin	link	link
hrv	link	-
hun	link	link
hye	link	link
ibo	link	link
ind	link	link
isl	link	link
ita	link	link
jpn	link	link
kan	link	link
kat	link	link
khm	link	link
kin	link	-
kir	link	link
kor	link	link
kur	link	link
lao	link	link
lav	link	link
lin	link	link
lit	link	link
mal	link	link
mar	link	link
mkd	link	link
mlg	link	link
mon	link	link
mya	link	link
nde	link	link
nep	link	link
nld	link	-
ori	link	link
orm	link	link
pan	link	link
pol	link	link
por	link	link
prs	link	link
pus	link	link
ron	link	-
run	link	link
rus	link	link
sin	link	link
slk	link	link
slv	link	link
sna	link	link
som	link	link
spa	link	link
sqi	link	link
srp	link	link
swa	link	link
swe	link	-
tam	link	link
tel	link	link
tet	link	-
tgk	link	-
tha	link	link
tir	link	link
tur	link	link
ukr	link	link
urd	link	link
uzb	link	link
vie	link	link
xho	link	link
yor	link	link
yue	link	link
zho	link	link
bis	-	link
gla	-	link
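For the Common Crawl column, a single archived record is usually retrieved with an HTTP range request against the WARC file that holds it. The sketch below prepares such a request from a WARC filename, byte offset, and record length, which are the fields the Common Crawl index exposes. That the cc files in this repository carry these exact fields is an assumption, and `cc_record_request` is a hypothetical helper, not part of this repository's code.

```python
import urllib.request

CC_DATA_PREFIX = "https://data.commoncrawl.org/"


def cc_record_request(filename: str, offset: int, length: int) -> urllib.request.Request:
    """Prepare (but do not send) a range request for one WARC record.

    filename, offset and length are assumed to come from per-record
    metadata; the Range header is inclusive on both ends.
    """
    last_byte = offset + length - 1
    return urllib.request.Request(
        CC_DATA_PREFIX + filename,
        headers={"Range": f"bytes={offset}-{last_byte}"},
    )


if __name__ == "__main__":
    # The filename below is a made-up placeholder, not a real crawl path.
    req = cc_record_request("crawl-data/EXAMPLE/warc/example.warc.gz", 100, 50)
    print(req.full_url, req.get_header("Range"))
```

Sending the request with `urllib.request.urlopen(req)` should return one gzip-compressed WARC record, which `gzip.decompress` can unpack before the HTML is parsed.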

Cite Us!

Please cite us if you use our data or methodology.

@inproceedings{varab-schluter-2021-massivesumm,
    title = "{M}assive{S}umm: a very large-scale, very multilingual, news summarisation dataset",
    author = "Varab, Daniel  and
      Schluter, Natalie",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.797",
    pages = "10150--10161",
    abstract = "Current research in automatic summarisation is unapologetically anglo-centered{--}a persistent state-of-affairs, which also predates neural net approaches. High-quality automatic summarisation datasets are notoriously expensive to create, posing a challenge for any language. However, with digitalisation, archiving, and social media advertising of newswire articles, recent work has shown how, with careful methodology application, large-scale datasets can now be simply gathered instead of written. In this paper, we present a large-scale multilingual summarisation dataset containing articles in 92 languages, spread across 28.8 million articles, in more than 35 writing scripts. This is both the largest, most inclusive, existing automatic summarisation dataset, as well as one of the largest, most inclusive, ever published datasets for any NLP task. We present the first investigation on the efficacy of resource building from news platforms in the low-resource language setting. Finally, we provide some first insight on how low-resource language settings impact state-of-the-art automatic summarisation system performance.",
}
