BOOKSUM: A Collection of Datasets for Long-form Narrative Summarization

Overview

Authors: Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, Dragomir Radev

Introduction

The majority of available text summarization datasets include short-form source documents that lack long-range causal and temporal dependencies, and often contain strong layout and stylistic biases. While relevant, such datasets will offer limited challenges for future generations of text summarization systems. We address these issues by introducing BookSum, a collection of datasets for long-form narrative summarization. Our dataset covers source documents from the literature domain, such as novels, plays and stories, and includes highly abstractive, human-written summaries on three levels of granularity of increasing difficulty: paragraph-, chapter-, and book-level. The domain and structure of our dataset pose a unique set of challenges for summarization systems, which include: processing very long documents, non-trivial causal and temporal dependencies, and rich discourse structures. To facilitate future work, we trained and evaluated multiple extractive and abstractive summarization models as baselines for our dataset.

Paper link: https://arxiv.org/abs/2105.08209

Table of Contents

  1. Updates
  2. Citation
  3. Legal Note
  4. License
  5. Usage
  6. Get Involved

Updates

4/15/2021

Initial commit

Citation

@article{kryscinski2021booksum,
      title={BookSum: A Collection of Datasets for Long-form Narrative Summarization}, 
      author={Wojciech Kry{\'s}ci{\'n}ski and Nazneen Rajani and Divyansh Agarwal and Caiming Xiong and Dragomir Radev},
      year={2021},
      eprint={2105.08209},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Legal Note

By downloading or using the resources, including any code or scripts, shared in this code repository, you hereby agree to the following terms, and your use of the resources is conditioned on and subject to these terms.

  1. You may only use the scripts shared in this code repository for research purposes. You may not use or allow others to use the scripts for any other purposes and other uses are expressly prohibited.
  2. You will comply with all terms and conditions, and are responsible for obtaining all rights, related to the services you access and the data you collect.
  3. We do not make any representations or warranties whatsoever regarding the sources from which data is collected. Furthermore, we are not liable for any damage, loss or expense of any kind arising from or relating to your use of the resources shared in this code repository or the data collected, regardless of whether such liability is based in tort, contract or otherwise.

License

The code is released under the BSD-3 License (see LICENSE.txt for details).

Usage

1. Chapterized Project Gutenberg Data

The chapterized book text from Project Gutenberg for the books used in our work has been made available through a public GCP bucket. It can be fetched using:

gsutil cp gs://sfr-books-dataset-chapters-research/all_chapterized_books.zip .

or downloaded directly here.
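
Either way, the archive can then be unpacked locally with a standard tool, e.g.:

unzip all_chapterized_books.zip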

2. Data Collection

Data collection scripts for the summary text are organized by the sources the summaries are collected from. Note: at the time of collecting the data, all links in literature_links.tsv were working for the respective sources.

For each data source, first run get_works.py to fetch the links for each book, then run get_summaries.py to download the summaries from the collected links. For example, for cliffnotes:

python scripts/data_collection/cliffnotes/get_works.py
python scripts/data_collection/cliffnotes/get_summaries.py
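
Note: as reported in the README Issues below, these scripts use hardcoded relative paths, so they may need to be run from inside the respective source directory rather than from the repository root.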

3. Data Cleaning

Data Cleaning is performed through the following steps:

The first script performs basic cleaning operations, such as removing parentheses, links, etc. from the summary text:

python scripts/data_cleaning_scripts/basic_clean.py
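
For illustration, here is a minimal sketch of the kind of regex-based cleanup this step performs; the function name and patterns below are assumptions for the example, not the ones used in basic_clean.py:

import re

def basic_clean(text):
    # Hypothetical patterns: drop parenthetical asides and bare links,
    # then collapse the leftover whitespace.
    text = re.sub(r"\([^)]*\)", "", text)
    text = re.sub(r"https?://\S+", "", text)
    return re.sub(r"\s{2,}", " ", text).strip()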

Next, we use the intermediate alignments in summary_chapter_matched_all_sources.jsonl to identify which summaries are separable and split them, creating new summaries (e.g., a Chapters 1-3 summary is separated into three files: a Chapter 1 summary, a Chapter 2 summary, and a Chapter 3 summary):

python scripts/data_cleaning_scripts/split_aggregate_chaps_all_sources.py
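
Conceptually, this step expands one aggregated summary into per-chapter summaries. A toy sketch of the range expansion (illustrative only, not the script's actual logic):

import re

def expand_chapter_range(title):
    # "Chapters 1-3" -> ["Chapter 1", "Chapter 2", "Chapter 3"]
    match = re.match(r"Chapters (\d+)-(\d+)", title)
    if not match:
        return [title]
    start, end = int(match.group(1)), int(match.group(2))
    return ["Chapter %d" % i for i in range(start, end + 1)]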

Lastly, our final cleaning script uses various regexes to separate out analysis/commentary text and to remove prefixes, suffixes, etc.:

python scripts/data_cleaning_scripts/clean_summaries.py

Data Alignments

Generating paragraph alignments from the chapter-level summary alignments is performed individually for the train/test/val splits:

Gather the data from the summaries and book chapters into a single jsonl file:

python paragraph-level-summary-alignments/gather_data.py
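
The script takes additional arguments selecting the input file and split mode. Judging from the invocations reported in the README Issues below (the exact flag spellings are assumptions; check the script's argparse definitions), a train-split run looks roughly like:

python paragraph-level-summary-alignments/gather_data.py --matched_file ../chapter-level-summary-alignments/chapter_summary_aligned_train_split.jsonl --split_paragraphs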

Generate alignments between the book paragraphs and sentences from the summary using the bi-encoder paraphrase-distilroberta-base-v1:

python paragraph-level-summary-alignments/align_data_bi_encoder_paraphrase.py
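
This script likewise expects one of the mutually exclusive --stable_alignment/--greedy_alignment flags (see the README Issues report below). Assuming a --data_path argument, as suggested by the code excerpt quoted in the issues, a possible invocation is:

python paragraph-level-summary-alignments/align_data_bi_encoder_paraphrase.py --data_path chapter_summary_aligned_train_split.jsonl.gathered --stable_alignment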

Aggregate the generated alignments for cases where multiple sentences from a chapter summary are matched to the same paragraph of the book:

python paragraph-level-summary-alignments/aggregate_paragraph_alignments_bi_encoder_paraphrase.py
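
To make the alignment and aggregation steps concrete, here is a minimal, self-contained sketch (not the repository's script) that matches summary sentences to book paragraphs with the same bi-encoder and groups sentences that land on the same paragraph; the example texts are made up:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-distilroberta-base-v1")

summary_sentences = ["Lucy looks out her window onto a Florence morning."]
paragraphs = [
    "Lucy opened the window and gazed at the morning light over Florence.",
    "The dinner table at the pension buzzed with gossip.",
]

sent_emb = model.encode(summary_sentences, convert_to_tensor=True)
para_emb = model.encode(paragraphs, convert_to_tensor=True)

# Cosine similarity matrix: rows = summary sentences, cols = paragraphs.
# (On older sentence-transformers versions, use util.pytorch_cos_sim.)
scores = util.cos_sim(sent_emb, para_emb)

# Greedy alignment: each summary sentence goes to its best-scoring paragraph;
# sentences mapped to the same paragraph are aggregated together.
alignments = {}
for i, sentence in enumerate(summary_sentences):
    best = int(scores[i].argmax())
    alignments.setdefault(best, []).append(sentence)
print(alignments)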

Troubleshooting

  1. The web archive links we collect the summaries from can be unreliable and slow to load. One mitigation is to use longer sleep timeouts when a link throws an exception, which has been implemented in some of the scripts (a minimal sketch of the pattern follows this list).
  2. Links that repeatedly throw errors are aggregated in a file called 'section_errors.txt'. This is useful for inspecting which links are actually unavailable and for re-running the data collection scripts on those specific links.
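
For reference, a minimal sketch of the retry-with-backoff pattern described in point 1; the function name and parameters are illustrative, not taken from the repository's scripts:

import time
import urllib.request

def fetch_with_retries(url, max_retries=5, base_sleep=5.0):
    # Illustrative only: the collection scripts implement their own variant.
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url, timeout=60) as response:
                return response.read()
        except Exception:
            # Back off progressively before retrying the flaky archive link.
            time.sleep(base_sleep * (attempt + 1))
    raise RuntimeError("Failed to fetch %s after %d retries" % (url, max_retries))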

Get Involved

Please create a GitHub issue if you have any questions, suggestions, requests or bug-reports. We welcome PRs!

Comments
  • NotADirectoryError[WinError 267]

    Hello, when I tried to get the summary of "46. Narrative of the Life of Frederick Douglass: An American Slave", a NotADirectoryError[WinError 267] was raised.

    The referenced directory is like '../../raw_summaries/cliffnotes/summaries\Narrative of the Life of Frederick Douglass: An American Slave'

    (It is easy to modify the tsv.pruned file to solve it.)

    opened by SpaceTime1999 3
  • Incorrect behavior in separate_multiple_summaries function

    I've been trying to diagnose why I have missing data, and part of the problem appears to be in the separate_multiple_summaries function. The end result is that the script doesn't split some books which are expected to be split in the provided chapter-level-summary-alignments.

    An example of this behavior can be seen by stepping through the splitting of A Room With a View from gradesaver. It turns out that the script doesn't account for the <PARAGRAPH> tags, despite a comment in the source which states that it should.

    While stepping through the function, you can see that the regex splits the text into lines like so:

    <PARAGRAPH>Chapter Two In Santa Croce with No Baedeker:<PARAGRAPH>Summary:<PARAGRAPH>Lucy looks out her window onto the beautiful scene of a Florence morning
    

    Then the first preprocessing function in the loop, remove_prefixes_line, simply strips off the leading < because of split_aggregate_chaps_all_sources.py:276, which strips all leading punctuation. The resulting line, which starts with PARAGRAPH>Chapter Two In Santa Croce with No Baedeker:, doesn't match the regex, which expects the chapter marker to be at the beginning of the string.

    This splitting issue (maybe there are more issues with splitting, but this is the one I investigated) causes a number of books to fail to split. Here's the list of books that the data collection script downloaded but failed to properly split for me:

    gradesaver/A Room With a View
    gradesaver/A Tale of Two Cities
    gradesaver/Adam Bede
    gradesaver/Anne of Green Gables
    gradesaver/Antony and Cleopatra
    gradesaver/As You Like It
    gradesaver/Babbitt
    gradesaver/Bleak House
    gradesaver/Dombey and Son
    gradesaver/Dr. Jekyll and Mr. Hyde
    gradesaver/Dracula
    gradesaver/Emma
    gradesaver/Ethan Frome
    gradesaver/Every Man in His Humour
    gradesaver/Frankenstein
    gradesaver/Gulliver's Travels
    gradesaver/Incidents in the Life of a Slave Girl
    gradesaver/Jane Eyre
    gradesaver/Kidnapped
    gradesaver/King Solomon's Mines
    gradesaver/Little Women
    gradesaver/Middlemarch
    gradesaver/My Antonia
    gradesaver/Northanger Abbey
    gradesaver/Regeneration
    gradesaver/Sense and Sensibility
    gradesaver/Tess of the D'Urbervilles
    gradesaver/The Age of Innocence
    gradesaver/The Blithedale Romance
    gradesaver/The House of the Seven Gables
    gradesaver/The Jungle
    gradesaver/The Marrow of Tradition
    gradesaver/The Monkey's Paw
    gradesaver/The Prince
    gradesaver/The Red Badge of Courage
    gradesaver/The Rise of Silas Lapham
    gradesaver/The Rivals
    gradesaver/The School for Scandal
    gradesaver/The Spanish Tragedy
    gradesaver/The Tempest
    gradesaver/The Time Machine
    gradesaver/The Turn of the Screw
    gradesaver/The Valley of Fear
    gradesaver/Troilus and Cressida
    gradesaver/Twelve Years a Slave
    gradesaver/What Maisie Knew
    novelguide/Henry VI Part 1
    novelguide/Madame Bovary
    novelguide/Merry Wives of Windsor
    novelguide/Oliver Twist
    novelguide/Persuasion
    sparknotes/Adam Bede
    sparknotes/Anne of Green Gables
    sparknotes/Anthem
    sparknotes/Candide
    sparknotes/Dr. Jekyll and Mr. Hyde
    sparknotes/Dracula
    sparknotes/Emma
    sparknotes/Far from the Madding Crowd
    sparknotes/Frankenstein
    sparknotes/Hamlet
    sparknotes/Jane Eyre
    sparknotes/Kidnapped
    sparknotes/Northanger Abbey
    sparknotes/Persuasion
    sparknotes/Regeneration
    sparknotes/Romeo and Juliet
    sparknotes/The Brothers Karamazov
    sparknotes/The House of the Seven Gables
    sparknotes/The Jungle
    sparknotes/The Last of the Mohicans
    sparknotes/The Picture of Dorian Gray
    sparknotes/The Prince
    sparknotes/The Red Badge of Courage
    sparknotes/The Secret Garden
    sparknotes/The Turn of the Screw
    
    opened by dojoteef 3
  • No barronbooks directory

    opened by dojoteef 2
  • README Issues

    I've run into a few issues with the README:

    1. In Steps 2 and 3 of the Usage section, the python scripts must be called from the individual directories, rather than specifying the full script path from the base of the repo. That's because the scripts use hardcoded relative paths.
    2. The gather_data.py script under Data Alignments requires a number of parameters to be passed in. It's not entirely clear which are the correct parameters (for example, --join_strings and --split_paragraphs are two mutually exclusive flags, and one must be specified). A similar issue exists with the other two alignment scripts (e.g. --stable_alignment vs --greedy_alignment for align_data_bi_encoder_paraphrase.py). It would be useful to have an example of the exact arguments needed for each of these scripts.
    3. It would also be quite useful to know what the python package requirements are. While it's possible to look at the scripts manually to see which packages were used, it's hard to determine what versions of the packages are needed. Either documenting this in the README, or including a requirements.txt with pinned versions would be useful.
    opened by dojoteef 2
  • Numbers of Books

    Hi, @jigsaw2212 and @muggin

    Thanks for sharing the scripts for reproducing the BookSum dataset. I'm still running get_works.py and get_summaries.py, as they take some time to complete.

    I have a few things to confirm:

    1. Your paper says that BookSum Full contains 436 documents, but the zipped Project Gutenberg file contains only 269 folders (which I assume correspond to the number of books). Can you explain the count mismatch?

    2. Could you explain the relationship among numbers of paragraphs, chapters, and books? Do all the paragraphs belong to chapters, and the chapters belong to books?

    Cheers

    opened by geeraay 2
  • AttributeError: 'NoneType' object has no attribute 'findAll'

    Got an error collecting the cliffnotes summaries (see below). It looks as if there is a problem when some element (here, article) is missing:

    Traceback (most recent call last):
      File "scripts/data_collection/cliffnotes/get_summaries.py", line 95, in <module>
        section_paragraphs = list(filter(None, scrape_section_continuation(soup, section_header)))
      File "scripts/data_collection/cliffnotes/get_summaries.py", line 41, in scrape_section_continuation
        section_paragraphs = [paragraph.text.strip() for paragraph in section_data.findAll("p", recursive=False)]
    AttributeError: 'NoneType' object has no attribute 'findAll'
    
    opened by fotisj 2
  • Which scripts used to fine-tune t5 model after final alignment .jsonl files generated?

    I have generated all the final alignment files and have tested several T5 fine-tuning scripts with varying degrees of success.

    Seeking to reproduce results from:

    Table 8: Examples of decoded summaries of the Chapter 1 of “Sense and Sensibility”, part 2.

    Table 11: Examples of decoded summaries of the full text of “Sense and Sensibility”, part 3.

    Could you please provide information on the scripts and parameters used to generate the above results?

    I have tested the latest version of the transformers run_summarization.py script, but it yields the following error on the .jsonl:

    TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

    I have tested other scripts using the "summarize :" option with the same .jsonl. These work, but they only generate a single-line summary for Chapter 1 above. How were the multi-line summaries for Chapter 1 and for the entire book generated?

    Looking forward to your reply.

    opened by GenTxt 1
  • Request for .gathered Data Alignment files

    Thanks for the interesting repo. Lots of moving parts here and some working better than others.

    I managed to complete all text cleaning tasks (with some missing downloads) but encounter an 'IndexError: list index out of range' each time I run the following command:

    python gather_data.py --matched_file ../chapter-level-summary-alignments/chapter_summary_aligned_train_split.jsonl --split_paragraph

    Running Python 3.7 on Ubuntu 18.04, with all requirements installed.

    Appears to be working until ...

    22%|████████▌ | 2117/9713 [00:38<02:35, 48.90it/s]
    sentence: The Leech summary_content: ['The Leech'] fixed_content: []
    67%|██████████████████████████ | 6496/9713 [01:55<00:57, 56.16it/s]
    Traceback (most recent call last):
      File "gather_data.py", line 254, in <module>
        main(args)
      File "gather_data.py", line 220, in main
        summary_content = fix_prefix_quotations(summary_content)
      File "gather_data.py", line 82, in fix_prefix_quotations
        fixed_content[-1] = fixed_content[-1] + sent_split[0].strip()
    IndexError: list index out of range

    The script exits without a partial .gathered file being generated. I would like to complete the final step with paraphrase-distilroberta-base-v1, but that requires the .gathered files. I'd appreciate an upload of the files or advice on how to solve the issue.

    Cheers,

    opened by GenTxt 1
  • Missing book level alignments, extra chapter level alignments

    The linked arxiv paper reports that there are 436 book-level alignments, but the alignments/book-level-summary-alignments/ directory only contains 405 alignments (314 train + 45 val + 46 test). The paper also reports 12,293 chapter-level alignments, but the alignments/chapter-level-summary-alignments/ directory currently contains 12,630 alignments (9713 train + 1485 val + 1432 test).

    Are there plans to fix these discrepancies? I am trying to reproduce the paper's results with as much fidelity as possible.

    opened by bandrus5 1
  • Wrong File Open Mode in <align_data_bi_encoder_paraphrase.py>

    In align_data_bi_encoder_paraphrase.py at line 226, the file is opened in write mode. This seems like a bug to me, as it is supposed to be in append mode. I spent 3 hours only to find that the output file kept being overwritten by new samples.

    # original
    with open(basename(args.data_path) + ".stable.bi_encoder_paraphrase", "w") as fd:
        for stable_example in stable_examples:
            fd.write(json.dumps(stable_example) + "\n")
    # expected
    with open(basename(args.data_path) + ".stable.bi_encoder_paraphrase", "a") as fd:
        for stable_example in stable_examples:
            fd.write(json.dumps(stable_example) + "\n")
    

    Also, a small suggestion on the usage of tqdm: at line 197, enumerate(tqdm(data)) is probably preferred.

    # original
    for ix, example in tqdm(enumerate(data)):
    # preferred
    for ix, example in enumerate(tqdm(data)):
    
    opened by moutaigua8183 1
  • Release a dataset snapshot

    Hi,

    Is it possible to release a link to the processed dataset used in the paper? Some of the download scripts are super slow (and always raise connection timed-out errors) on my end...

    Thanks!

    opened by chijames 1
  • Licensing Question

    Hi Team,

    Thanks for contributing this amazing dataset and open sourcing it. I had a question regarding usage of the data:

    The Legal note in the README.md file says:

    "You may only use the scripts shared in this code repository for research purposes. You may not use or allow others to use the scripts for any other purposes and other uses are expressly prohibited."

    Whereas I believe the BSD 3-Clause License does allow commercial use. Could you clarify which is correct?

    opened by KMFODA 0
  • Evaluation Metric - rougeL

    Hi,

    May I know which ROUGE-L metric is used in the paper? Specifically, is it rougeL or rougeLSum? The latter adds a newline to every sentence and computes a union-LCS score.

    Thanks.

    opened by chijames 0
  • Bump numpy from 1.19.5 to 1.22.0

    Bumps numpy from 1.19.5 to 1.22.0.

    dependencies 
    opened by dependabot[bot] 0
  • Need more instructions to reproduce the extractive oracle of booksum-chapter

    Hi, thank you for your great work!

    I plan to reproduce some of your baseline results due to #22, but I ran into some problems when reproducing the extractive oracle of booksum-chapter and got a slightly different result from your paper: ROUGE-1/2/L (F1) of 42.38/9.82/20.62, while 42.68/9.66/21.33 is reported in your paper.

    Here are my steps:

    1. Split the text in BOOKSUM-paragraph (lines in chapter_summary_aligned_{}_split.jsonl.gathered.stable) into sentences with spaCy, and compute oracles for each instance as in Section 4.2 of your paper.
    2. Split the text in BOOKSUM-chapter (lines in chapter_summary_aligned_{}_split.jsonl.gathered) into paragraphs with the merge_text_paragraphs() function in align_data_bi_encoder_paraphrase.py, then split the paragraphs into sentences individually as in Step 1.
    3. Map all of the oracle sentences obtained in Step 1 to the chapter sentences of BOOKSUM-chapter obtained in Step 2.
    4. Now I have BOOKSUM-chapter with its texts split into sentences, each sentence marked as oracle or not, and I can compute ROUGE for each chapter instance.

    Is anything wrong in my steps? Can you give more instructions on how you performed this?

    Another question: it seems that the extractive models are not directly available on Hugging Face and require additional effort to reproduce. Did you train and evaluate models such as BertExt and MatchSum using the code from their original repos? Can you also give some instructions about this?

    Thank you very much! @jigsaw2212 @muggin

    opened by lzhou1998 0
  • More instructions to reproduce the baseline model results?

    Hi! Thank you for your amazing efforts.

    I wonder whether you plan to update the repo to include more reproducibility code/instructions? I am (and I guess many others are) interested in reproducing the baseline results for research purposes.

    Thank you very much.

    opened by shi-kejian 2