# ExtendedSumm

This repository contains the implementation details and datasets used in the paper *On Generating Extended Summaries of Long Documents*, presented at the AAAI-21 Workshop on Scientific Document Understanding (SDU 2021).
## Conda environment: preliminary setup

To install the required packages, create the conda environment from the `environment.yml` file in the root directory using the following command:

```bash
conda env create -f environment.yml
```
## How to run...

**IMPORTANT:** The following commands should be run from the `src/` directory.
### Dataset

To start, you first need to download the datasets that are intended to work with the code base. You can download them from the following links:

| Dataset | Download Link |
|---|---|
| arXiv-Long | Download |
| PubMed-Long | Download |
After downloading a dataset, you will need to uncompress it using the following command:

```bash
tar -xvf pubmedL.tar.gz
```

This will extract the pubmedL tar file into the current directory. The extracted directory includes the individual json files of the different sets: training, validation, and test.
**FORMAT:** Each paper file is structured as a json object with the following keys:

- `"id"` (String): the paper ID.
- `"abstract"` (String): the abstract text of the paper. This field is different from the `"gold"` field for datasets whose ground truth differs from the abstract.
- `"gold"` (List): the gold (i.e., ground-truth) summary sentences of the paper.
- `"sentences"` (List): the source sentences of the paper, where each sentence is itself a list with the following indices:
  - Index [0]: tokens of the sentence (i.e., list of tokens).
  - Index [1]: textual representation of the section that the sentence belongs to.
  - Index [2]: ROUGE-L score of the sentence with the gold summary.
  - Index [3]: textual representation of the sentence.
  - Index [4]: oracle label associated with the sentence (0 or 1).
  - Index [5]: the section id assigned by the sequential sentence classification package. For more information, please refer to this repository.
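As a quick sanity check, these fields can be inspected from the command line. A minimal sketch, assuming `jq` is installed; the file name is illustrative:

```bash
# Print the paper ID, the section of the first sentence (index 1),
# and its oracle label (index 4) for one paper file.
jq '{id: .id, section: .sentences[0][1], oracle_label: .sentences[0][4]}' train/example-paper.json
```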
### Preparing Data

Simply run the `prep.sh` bash script, providing the dataset directory (see the example below). This script uses two functions: one to create aggregated json files, and one to prepare them for use with pretrained language models.

Please note that if you want to use your own custom dataset and create the torch files, you will need to convert your dataset to the format given in the Dataset section.
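For reference, a typical invocation might look like the following; the exact argument form is an assumption, so check `prep.sh` for the flags it actually expects:

```bash
# Illustrative only: pass the directory containing the extracted json files.
bash prep.sh /path/to/dataset/
```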
### Training

The full training scripts are inside the `train.sh` bash file. To run it on your machine, you will need to change the directories to fit your needs:

```bash
...
DATA_PATH=/path/to/dataset/torch-files/
MODEL_PATH=/path/to/saved/model/

# Specifying GPUs, either single GPU or multi-GPU
export CUDA_VISIBLE_DEVICES=0,1

# You don't need to modify the lines below
LOG_DIR=../logs/$(echo $MODEL_PATH | cut -d \/ -f 6).log
mkdir -p ../results/$(echo $MODEL_PATH | cut -d \/ -f 6)
RESULT_PATH_TEST=../results/$(echo $MODEL_PATH | cut -d \/ -f 6)/

MAX_POS=2500
...
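Once these variables are set, training is launched by running the script itself; this assumes no positional arguments are needed, as suggested by the variable-based configuration above:

```bash
bash train.sh
```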
### Inference

The inference scripts are inside the `test.sh` bash file. To run it on your machine, you will need to modify the file directories:

```bash
...
# Path to the data directory
BERT_DIR=/path/to/dataset/torch-files/

# Path to the trained model directory
MODEL_PATH=/disk1/sajad/sci-trained-models/presum/LSUM-2500-segmented-sectioned-multi50-classi-v1/

# Path to the best trained model (or the checkpoint that you want to run inference on)
CHECKPOINT=$MODEL_PATH/Recall_BEST_model_s63000_0.4910.pt

# GPU machines, either multi or single GPU
export CUDA_VISIBLE_DEVICES=0,1

MAX_POS=2500
...
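As with training, once the paths point at your data and checkpoint, inference is started by running the script (again assuming no positional arguments):

```bash
bash test.sh
```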
## Citation

If you use this work, please cite the following papers:
```bibtex
@inproceedings{Sotudeh2021ExtendedSumm,
  title={On Generating Extended Summaries of Long Documents},
  author={Sajad Sotudeh and Arman Cohan and Nazli Goharian},
  booktitle={The AAAI-21 Workshop on Scientific Document Understanding (SDU 2021)},
  year={2021}
}
```

```bibtex
@inproceedings{Sotudeh2020LongSumm,
  title={GUIR @ LongSumm 2020: Learning to Generate Long Summaries from Scientific Documents},
  author={Sajad Sotudeh and Arman Cohan and Nazli Goharian},
  booktitle={First Workshop on Scholarly Document Processing (SDP 2020)},
  year={2020}
}
```