Klexikon: A German Dataset for Joint Summarization and Simplification

Dennis Aumiller

Last update: Jan 3, 2023

Dennis Aumiller and Michael Gertz
Heidelberg University

Under submission at LREC 2022
A preprint version of the paper can be found on arXiv!
For easy access, we have also made the dataset available on Huggingface Datasets!

Data Availability

To use the data in your experiments, we suggest the existing training/validation/test split, available in ./data/splits/. This split was generated with a stratified sampling strategy (based on document lengths) and an 80/10/10 ratio, which ensures that samples of different lengths are somewhat evenly distributed.
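
A length-stratified 80/10/10 split can be sketched roughly as follows. This is only a minimal illustration, not the script used to produce ./data/splits/: the function name `stratified_split`, the number of bins, and the seed are assumptions.

```python
import random


def stratified_split(doc_lengths, num_bins=10, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split document ids into train/validation/test, stratified by length.

    doc_lengths maps a document id to its length (e.g. number of sentences).
    num_bins and seed are illustrative defaults, not values from the paper.
    """
    rng = random.Random(seed)
    # Sort ids by length so that consecutive chunks form length strata.
    ordered = sorted(doc_lengths, key=lambda doc_id: (doc_lengths[doc_id], doc_id))
    bin_size = max(1, -(-len(ordered) // num_bins))  # ceiling division
    splits = {"train": [], "validation": [], "test": []}
    for start in range(0, len(ordered), bin_size):
        stratum = ordered[start:start + bin_size]
        rng.shuffle(stratum)
        n_train = round(ratios[0] * len(stratum))
        n_val = round(ratios[1] * len(stratum))
        splits["train"].extend(stratum[:n_train])
        splits["validation"].extend(stratum[n_train:n_train + n_val])
        splits["test"].extend(stratum[n_train + n_val:])
    return splits
```

Because every stratum is split with the same ratio, short and long documents end up in all three sets in similar proportions.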

Alternatively, please refer to our Huggingface Datasets version for easy access to the preprocessed data.

Installation

This repository contains the code to crawl the Klexikon data set presented in our paper, as well as all associated baselines and splits. You can work on the existing code base by simply cloning this repository.

Install all required dependencies with the following command:

python3 -m pip install -r requirements.txt

The experiments were run on Python 3.8.4, but should run fine with any version >3.7. The code relies on relative imports, which forces you to run files as modules, e.g.,

python3 -m klexikon.analysis.compare_offline_stats

instead of

python3 klexikon/analysis/compare_offline_stats.py

Furthermore, the working directory must be the root folder of the repository, to ensure correct referencing of relative data paths. For example, if you cloned this repository into /home/dennis/projects/klexikon, make sure to run scripts directly from this path.

Extended Explanation

Manually Replaced Articles in `articles.json`

Aside from the manual matches, which can be produced by create_matching_url_list.py, some Klexikon articles simply link to an incorrect article in Wikipedia.
We approximate this case by the number of paragraphs in the Wikipedia article: it is generally much longer than the Klexikon article, and should therefore have at least 15 paragraphs. Note that most of the affected pages are disambiguation pages, which unfortunately do not necessarily correspond neatly to a single Wikipedia page. We remove the article if no single Wikipedia article can be found that covers more than 66% of the paragraphs in the Klexikon article. Some examples of manual changes:

"Aal" to "Aale"
"Abendmahl" to "Abendmahl Jesu"
"Achse" to "Längsachse"
"Ader" to "Blutgefäß"
"Albino" to "Albinismus"
"Alkohol" to "Ethanol"
"Android" to "Android (Betriebssystem)"
"Anschrift" to "Postanschrift"
"Apfel" to "Kulturapfel"
"App" to "Mobile App"
"Appenzell" to "Appenzellerland"
"Arabien" to "Arabische Halbinsel"
"Atlas" to "Atlas (Kartografie)"
"Atmosphäre" to "Erdatmosphäre"

Merging sentences that end in a semicolon (`;`)

This applies to any position in the document. The merge rectifies some unwanted sentence splits produced by spaCy.
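
A minimal sketch of such a merge, assuming the document is a list of lines with one sentence per line. The function name `merge_semicolon_lines` is hypothetical and does not refer to a function in this repository.

```python
def merge_semicolon_lines(lines):
    """Join every line that ends in ';' with the line that follows it."""
    merged = []
    carry = ""
    for line in lines:
        text = line.strip()
        if carry:
            # Prepend the fragment that was held back on the previous step.
            text = f"{carry} {text}".strip()
            carry = ""
        if text.endswith(";"):
            carry = text  # hold back until the continuation arrives
        else:
            merged.append(text)
    if carry:  # the document ended on a semicolon, keep the fragment
        merged.append(carry)
    return merged
```

Chains of several semicolon-terminated lines collapse into a single sentence, since the merged text is checked again before it is emitted.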

Merge of short lines in lead 3 baseline

We also check for lines with fewer than 10 characters among the first three sentences and merge them. This helps with fixing the lead-3 baseline, since most issues arise from incorrect sentence splits to begin with.
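
The check could look roughly like the sketch below. Merging the short fragment into the following line (rather than the preceding one) is an assumption, as is the function name `merge_short_lead_lines`.

```python
def merge_short_lead_lines(lines, min_chars=10, lead=3):
    """Merge lines shorter than min_chars among the first `lead` lines
    into the line that follows them (assumed merge direction)."""
    result = list(lines)
    i = 0
    while i < min(lead, len(result) - 1):
        if len(result[i].strip()) < min_chars:
            # Glue the short fragment onto its successor, then re-check index i.
            joined = f"{result[i].strip()} {result[i + 1].strip()}".strip()
            result[i:i + 2] = [joined]
        else:
            i += 1
    return result
```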

Removal of coordinates

Sometimes, the text starts with coordinate information, which seems to be embedded in some Wikipedia articles. We remove any such coordinates with a simple regex.
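
The exact regex is not reproduced here. As an illustration only, the sketch below assumes coordinates in degree notation such as "Koordinaten: 52° 31′ N, 13° 24′ O"; both the pattern and the function name `strip_leading_coordinates` are hypothetical.

```python
import re

# Hypothetical pattern: an optional "Koordinaten:" label followed by
# latitude and longitude in degree/minute/second notation.
COORDINATE_PATTERN = re.compile(
    r"^\s*(?:Koordinaten:\s*)?"
    r"\d{1,3}°\s*(?:\d{1,2}′\s*)?(?:\d{1,2}″\s*)?[NS]\s*,\s*"
    r"\d{1,3}°\s*(?:\d{1,2}′\s*)?(?:\d{1,2}″\s*)?[OEW]\s*"
)


def strip_leading_coordinates(text):
    """Remove a coordinate prefix from the start of the text, if present."""
    return COORDINATE_PATTERN.sub("", text, count=1)
```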

Sentences that do not end in a period

This concerns sentences (in the lead 3) that do not end in a period. They are fixed automatically by merging content, similarly to the semicolon case. Specifically, we only merge if the subsequent line is not just an empty line.
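
A possible shape of this rule, as a sketch under assumptions: the function name `merge_unterminated_lead` is hypothetical, and only "." counts as a valid sentence ending by default.

```python
def merge_unterminated_lead(lines, lead=3, terminators=(".",)):
    """Within the first `lead` lines, merge a line that does not end in a
    period with its successor, unless that successor is empty."""
    result = list(lines)
    i = 0
    while i < min(lead, len(result) - 1):
        current = result[i].strip()
        following = result[i + 1].strip()
        if current and following and not current.endswith(terminators):
            # Merge and re-check the same index, the result may still be open.
            result[i:i + 2] = [f"{current} {following}"]
        else:
            i += 1
    return result
```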

Using your own data

Currently, the systems expect input data to be processed in a line-by-line fashion, where every line represents a sentence, and each file represents an input document. Note that we currently do not support multi-document summarization.
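
For illustration, reading data in this format might look like the following sketch. The helper names `load_document` and `load_corpus`, the `*.txt` pattern, and the choice to skip blank lines are assumptions, not part of the existing code base.

```python
from pathlib import Path


def load_document(path):
    """Read one document: one sentence per line, blank lines skipped."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_corpus(directory, pattern="*.txt"):
    """Map each file stem to its list of sentences; one file is one document."""
    return {p.stem: load_document(p) for p in sorted(Path(directory).glob(pattern))}
```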

Criteria for discarding articles

We discard articles for which the Wikipedia article has fewer than 15 paragraphs. Beyond that, we manually discard articles for which there is no matching article in Wikipedia (see above). Examples of the latter case are "Kiwi" and "Washington".
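
The length criterion is simple enough to state as code. This is a minimal sketch; the function name `keep_article_pair` is hypothetical, and counting only non-empty paragraphs is an assumption.

```python
MIN_WIKIPEDIA_PARAGRAPHS = 15  # threshold stated above


def keep_article_pair(wikipedia_paragraphs, min_paragraphs=MIN_WIKIPEDIA_PARAGRAPHS):
    """Return True if the Wikipedia article is long enough to be kept."""
    non_empty = [p for p in wikipedia_paragraphs if p.strip()]
    return len(non_empty) >= min_paragraphs
```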

Reasons for not using lists

As described in the paper, we discard any element that is not a `<p>` tag in the HTML code. This helps to get rid of actually unwanted information (images, image captions, meta-descriptors, etc.), but also removes list items. After reviewing some examples, we have decided to discard list elements altogether. This also means that some articles (especially disambiguation pages) are easier to detect.
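
As a sketch of keeping only paragraph elements, the example below uses Python's standard-library `html.parser`. This is not necessarily the parser used by the crawler in this repository. The class name `ParagraphExtractor` is hypothetical, and the sketch assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of <p> elements and ignore everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._inside = False  # True while inside a <p> element
        self._buffer = []

    def _flush(self):
        # Normalize whitespace and store the paragraph if it has any text.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed previous <p> ends here
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)


def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Text inside inline tags such as `<b>` within a paragraph is kept, while list items, images, and other elements outside of paragraphs are dropped.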

Final number of valid article pairs: 2898

This means we had to discard around 250 articles from the original list at the time of crawling (April 2021). In the meantime, there have been new articles added to Klexikon, which leaves room for future improvements.

Execution Order of Scripts

TK: I'll include a better reference to the particular scripts in the near future, as well as a script that actually executes everything relevant in order.

Generate JSON file with article URLs
Crawl texts
Fix lead sentences
Remove unused articles (optional)
Generate stratified split

License Information

Both Wikipedia and Klexikon make their textual contents available under the CC BY-SA license. Per recommendation of the Creative Commons, we apply a separate license to the software component of this repository. Data will be re-distributed under the CC BY-SA license.

Contributions

Contributions are very welcome. Please either open an issue or pull request if you have any suggestion on how this data can be improved. Open TODOs:

So far, the data does not have more than a few simplistic baselines, and lacks an actually trained system on top of the data.
The dataset is "out-of-date", since it does not include any of the more recently added articles (~100 since the inception of my version). Potentially, we can increase the size to almost 3000 articles.
Adding a top-level script that runs the different scripts in the correct order to generate baselines/results/etc.
Adding a proper data managing script for the Huggingface Datasets version of this dataset.

How to Cite?

If you use our dataset, or code from this repository, please cite

@article{aumiller-gertz-2022-klexikon,  
  title   = {{Klexikon: A German Dataset for Joint Summarization and Simplification}},  
  author  = {Aumiller, Dennis and Gertz, Michael},  
  year    = {2022},  
  journal = {arXiv preprint arXiv:2201.07198},  
  url     = {https://arxiv.org/abs/2201.07198},  
}

Comments

Remove empty sentences from HF dataset

I just noticed that there are some empty lines in samples of the published version of the dataset, which should have been removed (presumably, these were newlines/spaces, that weren't filtered correctly).

Oracle baseline

Adding oracle baseline by modifying literature's ROUGE-2 oracle. Instead of greedily selecting sentences (in order), we choose 1:1 highest-overlapping sentences for each target sentence, and then sort & aggregate the extracted source sentences to get a more or less coherent summary.
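
The idea can be sketched as follows. This is a simplified illustration and not the implementation from the repository: it uses bigram recall on whitespace tokens instead of a full ROUGE-2 implementation, and it assumes that each source sentence may be selected at most once. All function names are hypothetical.

```python
def bigrams(sentence):
    """Set of lowercased word bigrams, using plain whitespace tokenization."""
    tokens = sentence.lower().split()
    return set(zip(tokens, tokens[1:]))


def overlap(source, target):
    """Fraction of the target's bigrams that also occur in the source."""
    target_bigrams = bigrams(target)
    if not target_bigrams:
        return 0.0
    return len(bigrams(source) & target_bigrams) / len(target_bigrams)


def oracle_summary(source_sentences, target_sentences):
    """Pick the best-overlapping unused source sentence for each target
    sentence, then return the picks in their original document order."""
    selected = set()
    for target in target_sentences:
        candidates = [i for i in range(len(source_sentences)) if i not in selected]
        if not candidates:
            break
        best = max(candidates, key=lambda i: overlap(source_sentences[i], target))
        selected.add(best)
    return [source_sentences[i] for i in sorted(selected)]
```

Sorting the selected indices at the end is what restores the source order and keeps the extracted summary more or less coherent.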

Reviewer feedback
To make this process as transparent as possible, here is some of the (relevant) reviewer feedback from the LREC submission process. Any feedback relevant only to the paper & writing is not listed here, but feel free to comment on any points for clarification.

[ ] (R1) "I do not find their reasons for doing so (“avoid encoding errors”) convincing: They could simply use a character encoding defined in the Unicode standard, e.g. UTF-8." Comment: This is a fair point, but the main problem was that Klexikon does not use non-Latin characters. This means that cities like "Århus" will never appear as such, and instead have "Aarhus" in Klexikon. Unfortunately, Python does not have any sufficient libraries for dealing with this, as it would additionally turn the German Umlauts (Ä, Ö, Ü) into (A, O, U), which is an incorrect transformation that likely would happen more frequently than other non-Latin characters. Further, and I'm not sure if this is explained sufficiently well, I have made sure to replace the topmost-occurring characters in a manual "translation table" to ensure the correct treatment of most of the letter characters at least (or merging " '' etc.)

[ ] (R2) "It is clear why the authors chose to disregard Wikipedia articles with less than 15 paragraphs given their specific goal, however, the dataset would be useful to a much wider audience (e.g. researchers interested in TS only) if all Wikipedia-Klexiko alignments were kept regardless of the posterior case-specific filtering by length" Comment: This is actually a good idea for a raw corpus. The original reasoning is based on the fact that shorter articles were mostly uninformative in my personal opinion. Since we remove the list-like elements, it creates a certain bias towards texts that only contain the descriptor of a subsequent list, which generally is not very applicable. Other examples included biology-related articles, where generally the article would consist only of several explanations of sub-species, without actual content information. However, a raw corpus could be re-crawled either way with the most recent number of articles. Then again, this would require re-matching all ambiguous articles, which is a bit more time-consuming.

[x] (R2) "It would be very informative to give some statistical information in 4.1.1., i.e. what was the starting number of documents in Klexikon and how was it affected by each of the steps 1 to 4 to get to the final 2,898 documents." Comment: I'll have to see if I can produce that information again, but would be a nice addition.

[x] (R3) "I wonder why the authors only consider lead-3, lead-k and the full input text as baseline sets, and do not attempt a simple extractive summarization algorithm such as the classic Luhn algorithm to gain a more reliable set" Comment: Actually a good idea. I have some intermediate results that go towards that direction, so it should be fairly easy to generate these results.
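
To illustrate the point from the reply to R1, the sketch below contrasts generic accent stripping via `unicodedata` with a small manual translation table. The table entries are hypothetical examples; the actual table used for the dataset may differ.

```python
import unicodedata

# Hypothetical entries; German umlauts are deliberately left untouched.
MANUAL_REPLACEMENTS = str.maketrans({
    "Å": "Aa", "å": "aa",
    "„": '"', "“": '"', "”": '"',
    "‘": "'", "’": "'",
})


def normalize_text(text):
    """Apply the manual table only; characters not listed stay as they are."""
    return text.translate(MANUAL_REPLACEMENTS)


def strip_accents(text):
    """Generic accent removal, shown for contrast: it also destroys umlauts."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

With the manual table, "Århus" becomes "Aarhus" and "Köln" is preserved, whereas the generic approach turns "Köln" into "Koln".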

In case any of the reviewers should read this: Thanks for your constructive feedback, I genuinely appreciated the helpful comments!
Compute baselines for test set only

The lead baselines in particular (available for the full dataset in the preprint) could be re-computed for the files of the test set only.

Additionally, I recently found sumy, which seems to offer more promising baselines.

Add sentence alignments

For better use with neural systems, we should provide alignments at the sentence level for the articles. So far, I've tried to create simple alignment strategies with some off-the-shelf embedding models (sentence-transformers).

However, the main problem is that the sentence splitting and merging operations would not be well reflected by this.

