On Generating Extended Summaries of Long Documents

Overview

ExtendedSumm

This repository contains the implementation details and datasets used in On Generating Extended Summaries of Long Documents paper at the AAAI-21 Workshop on Scientific Document Understanding (SDU 2021).

Conda environment: preliminary setup

To install the required packages, please run conda yml file that you find in the root directory using the following command:

conda env create -f environment.yml

How to run...

IMPORTANT: The following commands should be run under src/ directory.

Dataset

To start with, you first need to download the datasets that are intended to work with the code base. You can download them from following links:

Dataset Download Link
arXiv-Long Download
PubMed-Long Download

After downloading the dataset, you will need to uncompress it using the following command:

tar -xvf pubmedL.tar.gz 

This will uncompress the pubmedL tar file into the current directory. The directory will include the single json files of different sets including training, validation, and test.

FORMAT Each paper file is structured within a a json object with the following keys:

  • "id" (String): the paper ID
  • "abstract" (String): the abstract text of the paper. This field is different from "gold" field for the datasets that have different ground-truth than the abstract.
  • "gold" (List >): the ground-truth summary of the paper, where the inner list is the tokens associated with each gold summary sentence.
  • "sentences" (List >): the source sentences of the full-text. The inner list contains 5 indices, each of which represents different fields of the source sentence:
    • Index [0]: tokens of the sentences (i.e., list of tokens).
    • Index [1]: textual representation of the section that the sentence belongs to.
    • Index [2]: Rouge-L score of the sentence with the gold summary.
    • Index [3]: textual representation of the sentences.
    • Index [4]: oracle label associated with the sentence (0, or 1).
    • Index [5]: the section id assigned by sequential sentence classification package. For more information, please refer to this repository

Preparing Data

Simply run the prep.sh bash script with providing the dataset directory. This script will use two functions to first create aggregated json files, and then preparing them for pretrained language models' usage.

Please note that if you want to use your custom dataset and create torch files, you will need to frame the format of your dataset to the given format in the Dataset section.

Training

The full training scripts are inside train.sh bash file. To run it on your machine, you will need to change the directories to fit in your needs:

...

DATA_PATH=/path/to/dataset/torch-files/
MODEL_PATH=/path/to/saved/model/

# Specifiying GPUs either single GPU, or multi-GPU
export CUDA_VISIBLE_DEVICES=0,1


# You don't need to modify these below 
LOG_DIR=../logs/$(echo $MODEL_PATH | cut -d \/ -f 6).log
mkdir -p ../results/$(echo $MODEL_PATH | cut -d \/ -f 6)
RESULT_PATH_TEST=../results/$(echo $MODEL_PATH | cut -d \/ -f 6)/

MAX_POS=2500

...

Inference

The inference scripts are inside test.sh bash file. To run it on your machine, you will need to modify the file directories:

...
# path to the data directory
BERT_DIR=/path/to/dataset/torch-files/

# path to the trained model directory
MODEL_PATH=/disk1/sajad/sci-trained-models/presum/LSUM-2500-segmented-sectioned-multi50-classi-v1/

# path to the best trained model (or the checkpoint that you want to run inference on)
CHECKPOINT=$MODEL_PATH/Recall_BEST_model_s63000_0.4910.pt

# GPU machines, either multi or single GPU
export CUDA_VISIBLE_DEVICES=0,1

MAX_POS=2500

...

Citation

If you plan to use this work, please cite the following papers:

@inproceedings{Sotudeh2021ExtendedSumm,
  title={On Generating Extended Summaries of Long Documents},
  author={Sajad Sotudeh and Arman Cohan and Nazli Goharian},
  booktitle={The AAAI-21 Workshop on Scientific Document Understanding (SDU 2021)},
  year={2021}
}
@inproceedings{Sotudeh2020LongSumm,
  title={GUIR @ LongSumm 2020: Learning to Generate Long Summaries from Scientific Documents},
  author={Sajad Sotudeh and Arman Cohan and Nazli Goharian},
  booktitle={First Workshop on Scholarly Document Processing (SDP 2020)},
  year={2020}
}
You might also like...
Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting
Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting

Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting This is the origin Pytorch implementation of Informer in the followin

[ECCVW2020] Robust Long-Term Object Tracking via Improved Discriminative Model Prediction (RLT-DiMP)
[ECCVW2020] Robust Long-Term Object Tracking via Improved Discriminative Model Prediction (RLT-DiMP)

Feel free to visit my homepage Robust Long-Term Object Tracking via Improved Discriminative Model Prediction (RLT-DIMP) [ECCVW2020 paper] Presentation

Official repository for "PAIR: Planning and Iterative Refinement in Pre-trained Transformers for Long Text Generation"

pair-emnlp2020 Official repository for the paper: Xinyu Hua and Lu Wang: PAIR: Planning and Iterative Refinement in Pre-trained Transformers for Long

Implementation of DropLoss for Long-Tail Instance Segmentation in Pytorch
Implementation of DropLoss for Long-Tail Instance Segmentation in Pytorch

[AAAI 2021]DropLoss for Long-Tail Instance Segmentation [AAAI 2021] DropLoss for Long-Tail Instance Segmentation Ting-I Hsieh*, Esther Robb*, Hwann-Tz

Implementation of
Implementation of "Distribution Alignment: A Unified Framework for Long-tail Visual Recognition"(CVPR 2021)

Implementation of "Distribution Alignment: A Unified Framework for Long-tail Visual Recognition"(CVPR 2021)

Synthesizing Long-Term 3D Human Motion and Interaction in 3D in CVPR2021

Long-term-Motion-in-3D-Scenes This is an implementation of the CVPR'21 paper "Synthesizing Long-Term 3D Human Motion and Interaction in 3D". Please ch

Improving Calibration for Long-Tailed Recognition (CVPR2021)
Improving Calibration for Long-Tailed Recognition (CVPR2021)

MiSLAS Improving Calibration for Long-Tailed Recognition Authors: Zhisheng Zhong, Jiequan Cui, Shu Liu, Jiaya Jia [arXiv] [slide] [BibTeX] Introductio

Improving Calibration for Long-Tailed Recognition (CVPR2021)
Improving Calibration for Long-Tailed Recognition (CVPR2021)

Improving Calibration for Long-Tailed Recognition (CVPR2021)

Pytorch implementation for
Pytorch implementation for "Adversarial Robustness under Long-Tailed Distribution" (CVPR 2021 Oral)

Adversarial Long-Tail This repository contains the PyTorch implementation of the paper: Adversarial Robustness under Long-Tailed Distribution, CVPR 20

Comments
  • the LongSumm

    the LongSumm

    Hi: can you put the dataset in there or teach me how to get this dataset completely,because i can not use the complete dataset, really you are appreciated!!

    opened by arnold-em 1
  • the Pdb problem

    the Pdb problem

    gpu_rank 0 [2022-11-09 11:32:53,261 INFO] loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json from cache at /root/.cache/torch/transformers/d0c5776499adc1ded22493fae699da0971c1ee4c2587111707a4d177d20257a2.ef00af9e673c7160b4d41cfda1f48c5f4cba57d5142754525572a846a1ab1b9b [2022-11-09 11:32:55,748 INFO] * number of parameters: 159693576 [2022-11-09 11:32:55,748 INFO] Start training... [2022-11-09 11:32:56,667 INFO] Loading train dataset from ./arxivL/bert-files/2500-segmented-test/train.35.bert.pt, number of examples: 802

    /root/workspace/cht/ExtendSumm/ExtendedSumm-master/src/models/data_loader.py(355)preprocess() -> end_id = [src[-1]] (Pdb) the gpu is Tesla a100(40g) when i run train.sh then just (Pdb),could you tell me how to deal this

    opened by arnold-em 0
  • Section id by sequential sentence classification

    Section id by sequential sentence classification

    Hello, Thank you for making this code available. Could you provide more details on how the section id mentioned in the Dataset Format section of the README file is being generated? The linked repo (Sequential Sentence Classification) does not have any documentation regarding generating the section id. I did not find that information in the 2 papers either.

    opened by pranav-s 0
  • CSV FILE

    CSV FILE

    https://github.com/Georgetown-IR-Lab/ExtendedSumm/blob/cf9d74f2b55b8322c476ac89a49f6655fee219f5/src/prepro/data_builder.py#L563

    Can you please tell how to get this csv file ??? Or if we are using json files then we can you use _format_to_bert function instead of format_to_bert function ???

    opened by CJPJ007 6
Owner
Georgetown Information Retrieval Lab
Georgetown Information Retrieval Lab
Automatic voice-synthetised summaries of latest research papers on arXiv

PaperWhisperer PaperWhisperer is a Python application that keeps you up-to-date with research papers. How? It retrieves the latest articles from arXiv

Valerio Velardo 124 Dec 20, 2022
A New Approach to Overgenerating and Scoring Abstractive Summaries

We provide the source code for the paper "A New Approach to Overgenerating and Scoring Abstractive Summaries" accepted at NAACL'21. If you find the code useful, please cite the following paper.

Kaiqiang Song 4 Apr 3, 2022
Abstractive opinion summarization system (SelSum) and the largest dataset of Amazon product summaries (AmaSum). EMNLP 2021 conference paper.

Learning Opinion Summarizers by Selecting Informative Reviews This repository contains the codebase and the dataset for the corresponding EMNLP 2021

Arthur Bražinskas 39 Jan 1, 2023
Annotated notes and summaries of the TensorFlow white paper, along with SVG figures and links to documentation

TensorFlow White Paper Notes Features Notes broken down section by section, as well as subsection by subsection Relevant links to documentation, resou

Sam Abrahams 437 Oct 9, 2022
Demo for Real-time RGBD-based Extended Body Pose Estimation paper

Real-time RGBD-based Extended Body Pose Estimation This repository is a real-time demo for our paper that was published at WACV 2021 conference The ou

Renat Bashirov 118 Dec 26, 2022
Generalized hybrid model for mode-locked laser diodes with an extended passive cavity

GenHybridMLLmodel Generalized hybrid model for mode-locked laser diodes with an extended passive cavity This hybrid simulation strategy combines a tra

Stijn Cuyvers 3 Sep 21, 2022
A deep learning based semantic search platform that computes similarity scores between provided query and documents

semanticsearch This is a deep learning based semantic search platform that computes similarity scores between provided query and documents. Documents

null 1 Nov 30, 2021
An executor that loads ONNX models and embeds documents using the ONNX runtime.

ONNXEncoder An executor that loads ONNX models and embeds documents using the ONNX runtime. Usage via Docker image (recommended) from jina import Flow

Jina AI 2 Mar 15, 2022
Import Python modules from dicts and JSON formatted documents.

Paker Paker is module for importing Python packages/modules from dictionaries and JSON formatted documents. It was inspired by httpimporter. Important

Wojciech Wentland 1 Sep 7, 2022
Pytorch Implementation of Value Retrieval with Arbitrary Queries for Form-like Documents.

Value Retrieval with Arbitrary Queries for Form-like Documents Introduction Pytorch Implementation of Value Retrieval with Arbitrary Queries for Form-

Salesforce 13 Sep 15, 2022