Improving Factual Consistency of Abstractive Text Summarization

Overview

Improving Factual Consistency of Abstractive Text Summarization

We provide the code for the papers:

  1. "Entity-level Factual Consistency of Abstractive Text Summarization", EACL 2021.
    • We provide a set of new metrics to quantify the entity-level factual consistency of generated summaries. We also provide code for the two methods in our paper:
      • JAENS: joint entity and summary generation, and
      • Summary-worthy entity classification with summarization (multi-task learning)
  2. "Improving Factual Consistency of Abstractive Summarization via Question Answering", ACL-IJCNLP 2021
    • QUALS, a new automatic metric for factual consistency.
    • CONSEQ, a new contrastive learning algorithm for Seq2seq models to optimize sequence level objectives such as QUALS.

Our code is based on the fairseq library and we added support for model training on Sagemaker.

Requirements and setup

  • python==3.6: conda create -n entity_fact python=3.6
  • pytorch==1.4.0: pip install torch==1.4.0 torchvision==0.5.0
  • run pip install --editable ./
  • install file2rouge following instructions here
  • download en_core_web_lg: python -m spacy download en_core_web_lg

Entity-level Factual Consistency of Abstractive Text Summarization

Data preprocessing:

We provide three options to preprocess summarization data through the filter_level option.

  • filter_level=0: no special processing
  • filter_level=1: remove corruption text in source articles and summaries. (Undesirable texts included as a result of imperfect data collection. e.g. "Share this with Email, Facebook, Messenger". Undesirable summaries such as "Collection of all USATODAY.com coverage of People, including articles, videos, photos, and quotes.")
  • filter_level=2: entity hallucination filtering in addition to corruption text removal. A summary sentence is removed if it contains a named entity not in the source document.

XSUM:

  1. Follow the instructions here to download and extract text from HTML files and establish the xsum-extracts-from-downloads directory.
  2. Let be the directory that contains the xsum-extracts-from-downloads directory and XSum-TRAINING-DEV-TEST-SPLIT-90-5-5.json.
  3. Run python preprocess/data_prepro_clean.py --mode preprocess_xsum --input_dir --output_dir /processed-data --filter_level 0 with filter_level set to 0, 1, or 2.

CNNDM:

  1. Download and unzip the stories directories from here for both CNN and Daily Mail. Put all .story files in a directory /raw_stories .
  2. Download the url files mapping_train.txt, mapping_test.txt and mapping_valid.txt from here to .
  3. Run python preprocess/data_prepro_clean.py --mode preprocess_cnndm --input_dir --output_dir /processed-data --filter_level 0 with filter_level set to 0, 1, or 2.

NEWSROOM:

  1. Download the datasets following instructions from here.
  2. Run python preprocess/data_prepro_clean.py --mode preprocess_newsroom --input_dir --output_dir /processed-data --filter_level 0 with filter_level set to 0, 1, or 2.

Tokenize and binarize the data:

Download bpe encoder.json, vocabulary and fairseq dictionary to a directory, say ; then tokenize and binarize the data.

wget -O <bpe-dir>/encoder.json 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json'
wget -O <bpe-dir>/vocab.bpe 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe'
wget -N <bpe-dir>/dict.txt' https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt'

cd preprocess
python data_prepro_clean.py --mode bpe_binarize --input_dir <processed-data-dir> --tokenizer_dir <bpe-dir>

This generates the binary input files as well as the dictionaries under /data_bin for fairseq.

JAENS: Joint Entity and Summary Generation

The idea is to train the seq2seq model to generate .

1. prepare the train/val data by generating entity augmented targets:

python preprocess/create_entity_classification_labels.py --base_dir <processed-data-dir> --type entity_augment --tokenizer_dir <bpe-dir>

Binarize the augmented targets: cd preprocess

python data_prepro_clean.py --mode binarize --input_dir 
   
    /entity_augment --tokenizer_dir 
    

    
   

Since we already binarized the source documents, we just need to create symbolic links to put all binary input files together for fairseq training:

ln -s <processed-data-dir>/data_bin/train.source-target.source.idx <processed-data-dir>/entity_augment/data_bin/train.source-target.source.idx
ln -s <processed-data-dir>/data_bin/train.source-target.source.bin <processed-data-dir>/entity_augment/data_bin/train.source-target.source.bin
ln -s <processed-data-dir>/data_bin/valid.source-target.source.bin <processed-data-dir>/entity_augment/data_bin/valid.source-target.source.bin
ln -s <processed-data-dir>/data_bin/valid.source-target.source.idx <processed-data-dir>/entity_augment/data_bin/valid.source-target.source.idx

2. Fine-tune the BART-large model on the generated data:

Run the launch scripts scripts/launch_xsum.py, scripts/launch_cnndm.py or scripts/launch_newsroom.py to fine-tune the BART-large model. Note you need to modify the following in the scripts:

  • hyperparameters.
  • train_path: location of the binary input files. e.g. /entity_augment/data_bin .
  • init_path: location of the pre-trained BART-large model checkpoint. Please rename the checkpoint to pretrained_model.pt
  • output_path: location for the model outputs.

If training locally, you need to specify ngpus - the number of GPUS in the local machine. Example command:

python scripts/launch_xsum.py --datatype ner_filtered --epoch 8 --exp_type local

If training on Sagemaker, you need to specify the docker image name (image_name) as well as execution role (role). To create Sagemaker docker container and push to ECR:

./build_and_push.sh 
   

   

To launch training job:

python scripts/launch_xsum.py --datatype ner_filtered --epoch 8 --exp_type sagemaker

3. Generate the summaries from the fine-tuned models:

preprocess/multi_gpu_generate.py is used to generate summaries.

Since the JAENS models generates the named entities before the summaries, we need to remove the named entities before evaluating the summaries. Example command:

python evaluate_hypo.py --mode remove_ent_from_hypo --base_dir 
   
     --sub_dir 
    
      --split val --pattern .*.hypo

    
   

4. To evaluate the generated summaries for ROUGE as well as entity level factual scores:

We use the tokenizer from Stanford CoreNLP package. Example command:

export CLASSPATH=path/to/stanford-corenlp-full-2018-10-05/stanford-corenlp-3.9.2.jar
python evaluate_hypo.py --mode evaluate_summary --base_dir <output-dir> --sub_dir <output-sub-dir> --split val --pattern .*.hypo

See preprocess/run_*_eval.sh for examples.

Summary-worthy entity classification with summarization (multi-task learning)

We perform summary-worthy entity classification at a classification head on the encoder while keeping the seq2seq objective at the decoder. For training, we need to preprocess the input document by create B-I-O labels to identify summary-worthy entities:

python create_entity_classification_labels.py --base_dir 
   
     --type cls_labels --tokenizer_dir 
    
     
python data_prepro_clean.py --mode binarize_cls_labels --input_dir 
     
       --output_dir 
      
       /data_bin --tokenizer_dir 
        
       
      
     
    
   

Launch training jobs using scripts scripts/launch_multitask_*.py.

Improving Factual Consistency of Abstractive Summarization via Question Answering

QUALS (QUestion Answering with Language model score for Summarization)

To evaluate the QUALS of summaries (e.g. test.target) given original input (e.g. test.source), we execute the following steps in the preprocess sub-directory.

0. Prepare summaries into jsonl format

python evaluate_hypo.py --mode convert_hypo_to_json --base_dir 
   
     --sub_dir 
    
      --split test --pattern .target

    
   

1. Generating question and answer pairs from summaries

python sm_inference_asum.py --task gen_qa --base_dir 
   
     --source_dir 
    
      --output_dir 
     
       --num_workers 
      
        --bsz 5 --beam 60 --max_len 60 --min_len 8 --checkpoint_dir 
       
         --ckp_file checkpoint2.pt --bin_dir 
        
         /data_bin --diverse_beam_groups 60 --diverse_beam_strength 0.5 --batch_lines True --input_file test.target.hypo --return_token_scores True 
        
       
      
     
    
   

Here, we use diverse beam search to generate 60 question-answer pairs for each summary. The batch_lines option is set to True to batch bsz input summaries together for efficient generation. The QAGen model is trained by fine-tuning BART on the SQuAD and NewsQA datasets by concatenating the question-answer pairs using a separator.

To train the QAGen model, place the dev-v1.1.json and train-v1.1.json of SQuAD and the combined-newsqa-data-v1.json of the NewsQA under . The following command generates the binarized input for fine-tuning BART using Fairseq.

python data_prepro_clean.py --mode newsqa_squad_prepro --input_dir 
   
     --output_dir 
    

    
   

You can also download our trained QAGen model from s3 by running:

aws s3 cp s3://fact-check-summarization/newsqa-squad-qagen-checkpoint/checkpoint2.pt 
   
    /

   

Alternatively, you can download here if you don't have awscli.

2. Filter the generated question and answer for high quality pairs

python evaluate_hypo.py --mode filter_qas_dataset_lm_score --base_dir 
   
     --sub_dir 
    
      --pattern test.target.hypo.beam60.qas

    
   

3. Evaluate the generated question and answer pairs using the source document as input

python sm_inference_asum.py --task qa_eval --base_dir 
   
     --output_dir 
    
      --num_workers 
     
       --bsz 30 --checkpoint_dir 
      
        --ckp_file checkpoint2.pt --bin_dir 
       
        /data_bin --qas_dir 
        
          --source_file test.source --target_file test.target --input_file test.target.qas_filtered --prepend_target False 
        
       
      
     
    
   

4. Compute QUALS scores for each summary

python evaluate_hypo.py --mode compute_hypos_lm_score --base_dir 
   
     --sub_dir 
    
      --pattern test.*.source_eval_noprepend

    
   

CONSEQ (CONtrastive SEQ2seq learning)

To use QUALS to improve factual consistency of the summarization model using the CONSEQ algorithm, we follow the steps:

  1. Obtain the MLE summarization baseline by fine-tuning the BART model. Note that in the ACL paper, we used the corruption-filtered CNNDM and XSUM datasets (filter_level=1).
  2. Use the MLE summarization model to sample summaries on the training data.
  3. Evaluate the QUALS for the generated summaries as well as the ground truth summaries of the training data.
  4. Form the positive and negative sets for contrastive learning.
  5. Fine-tune the MLE summarization model using the positive and negative examples. Example scripts for launching training jobs locally or on Sagemaker are preprocess/run_generate_unlikelihood_train_cnndm.sh and preprocess/run_generate_unlikelihood_train_xsum.sh.

We provide an example script preprocess/run_generate_unlikelihood_train_xsum.sh to illustrate steps 2-4. Note that

  • To avoid running for a long time and encountering OOM errors and then restarting the whole process, we split the input files into smaller ones. We do this by splitting the source file by line (e.g. each sub-file has 10000 lines):
split -l 10000 train.source train.source.split   
  • The script has to make repeated calls of python sm_inference_asum.py --task gen_qa to generate question-ansewr pairs, as many times as there are sub-files as a result of line splits. The python function automatically checks which sub-files have been processed (based on output files) so it always processes the next available sub-file. If all sub-files have been processed, it will simply do nothing so it's safe if it's called more times than there are available sub-files.
  • Similarly, sm_inference_asum.py --task qa_eval needs to be repeated called to cover all sub-files.
  • The speed of question-answer pairs generation depends on the batch size setting. Depending on the summary file and the batch_lines setting, batching is handled differently. If the summary file contains only a single summary per input document, batch_lines should be set to True and bsz number of input lines are batched together as input to the QAGen model. If the summary file contains multiple summaries per input document, batch_lines should be set to False and the batching is done using bsz within each input example (line). For example, if there are 6 summaries per line in the summary file, we should set batch_lines to False; setting bsz to 7 will batch all the 7 summaries in a line together, which gives the best speed. (setting it higher won't improve speed since we do not do batching over different lines of input as batch_lines is False). On CNN-DM, bsz of 7 would sometimes result in OOM errors with 16G GPU memory so I use 3 or 4; on XSUM, it is safe to use 7.
  • num_workers should be the number of GPUs available on the machine. The lines in each input files will be distributed per GPU.
  • Finally, run the following to concatenate the QUALS scores from the sub-files: cat *.quals > train.source.source_eval_noprepend.quals

Citations

@inproceedings{nan-etal-2021-entity,
    title = "Entity-level Factual Consistency of Abstractive Text Summarization",
    author = "Nan, Feng  and
      Nallapati, Ramesh  and
      Wang, Zhiguo  and
      Nogueira dos Santos, Cicero  and
      Zhu, Henghui  and
      Zhang, Dejiao  and
      McKeown, Kathleen  and
      Xiang, Bing",
    booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume",
    month = apr,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.eacl-main.235",
    pages = "2727--2733",
}

@inproceedings{nan-etal-2021-improving,
    title = {Improving Factual Consistency of Abstractive Summarization via Question Answering},
    author = "Nan, Feng  and
      Nogueira dos Santos, Cicero  and
      Zhu, Henghui  and
      Ng, Patrick  and
      McKeown, Kathleen  and
      Nallapati, Ramesh  and
      Zhang, Dejiao  and
      Wang, Zhiguo  and
      Arnold, Andrew  and
      Xiang, Bing",
    booktitle = {Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP)},
    address = {Virtual},
    month = {August},
    url = {https://arxiv.org/abs/2105.04623},
    year = {2021}
}
Comments
  • Unable to proceed tokenize and binarize the data

    Unable to proceed tokenize and binarize the data

    I'm reproducing the experiment following the guide provided in README, more specifically: here

    While executing the processing script, the command raise an error:

    usage: preprocess.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
                         [--log-format {json,none,simple,tqdm}] [--tensorboard-logdir TENSORBOARD_LOGDIR]
                         [--seed SEED] [--cpu] [--tpu] [--bf16] [--memory-efficient-bf16] [--fp16]
                         [--memory-efficient-fp16] [--fp16-no-flatten-grads] [--fp16-init-scale FP16_INIT_SCALE]
                         [--fp16-scale-window FP16_SCALE_WINDOW] [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                         [--min-loss-scale MIN_LOSS_SCALE] [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                         [--user-dir USER_DIR] [--empty-cache-freq EMPTY_CACHE_FREQ]
                         [--all-gather-list-size ALL_GATHER_LIST_SIZE] [--model-parallel-size MODEL_PARALLEL_SIZE]
                         [--checkpoint-suffix CHECKPOINT_SUFFIX] [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                         [--quantization-config-path QUANTIZATION_CONFIG_PATH] [--profile]
                         [--criterion {composite_loss,label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,sentence_prediction,wav2vec,adaptive_loss,sentence_ranking,masked_lm,nat_loss,cross_entropy,legacy_masked_lm_loss,ctc,vocab_parallel_cross_entropy}]
                         [--tokenizer {space,nltk,moses}]
                         [--bpe {subword_nmt,fastbpe,gpt2,bert,characters,hf_byte_bpe,sentencepiece,bytes,byte_bpe}]
                         [--optimizer {adam,lamb,adadelta,adamax,adafactor,adagrad,nag,sgd}]
                         [--lr-scheduler {reduce_lr_on_plateau,tri_stage,fixed,polynomial_decay,cosine,inverse_sqrt,triangular}]
                         [--scoring {sacrebleu,bleu,wer,chrf}] [--task TASK] [-s SRC] [-t TARGET] [--trainpref FP]
                         [--validpref FP] [--testpref FP] [--align-suffix FP] [--destdir DIR] [--thresholdtgt N]
                         [--thresholdsrc N] [--tgtdict FP] [--srcdict FP] [--nwordstgt N] [--nwordssrc N]
                         [--alignfile ALIGN] [--dataset-impl FORMAT] [--joined-dictionary] [--only-source]
                         [--padding-factor N] [--workers N]
    preprocess.py: error: unrecognized arguments: --only-target False
    

    This happens because data_prepro_clean.py does not generate a proper command executing the script, at here.

    https://github.com/amazon-research/fact-check-summarization/blob/551bd525906b2bbff49ff2d8f37c6dca2cd4dc22/preprocess/data_prepro_clean.py#L416

    I've tried to hotfix it, but there's simply too many places to fix, I'm afraid this could result in a bad result.

    Please review, Thanks. 👋🏻

    opened by tomy0000000 5
  • OOM problem

    OOM problem

    Hi, when i reimplement the experiment , i met the OOM problem:

    2021-07-04 14:54:41 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 1.51 GiB (GPU 0; 10.92 GiB total capacity; 8.34 GiB already allocated; 319.00 MiB free; 10.08 GiB reserved in total by PyTorch)

    My GPU is 11G, where can i change the batch_size or something else for running.

    opened by gaozhiguang 3
  • Code to fine-tune CONSEQ model using constrastive loss

    Code to fine-tune CONSEQ model using constrastive loss

    Hi,

    Thank you for sharing wonderful ideas and the code repository for improving the factual consistency of summarization.

    I had a question regarding the code base. I am wondering if the repository also includes the code to fine-tune CONSEQ model using constrastive loss? As far as I can tell, the repo only includes code to preprocess the dataset into positive and negative samples. Please correct me if I am wrong. Appreciate your time and response. Thank you!

    opened by rajesh-gupta-14 1
  • A question about 'Generating question and answer pairs from summaries'

    A question about 'Generating question and answer pairs from summaries'

    I have now obtained the data for training QAGen. I am still not familiar with fairseq. I don’t know how to use fairseq to get the QAGen model in your paper, so as to execute: python sm_inference_asum.py --task gen_qa --base_dir <processed-data-dir> --source_dir <any-sub-directory-to-data> --output_dir <output-dir> --num_workers <num-of-gpus> --bsz 5 --beam 60 --max_len 60 --min_len 8 --checkpoint_dir <QAGen-model-dir> --ckp_file checkpoint2.pt --bin_dir <processed-data-dir>/data_bin --diverse_beam_groups 60 --diverse_beam_strength 0.5 --batch_lines True --input_file test.target.hypo --return_token_scores True Thank you for your help

    opened by 01070 1
  • JAENS Model generated summaries and model checkpoint

    JAENS Model generated summaries and model checkpoint

    Hi, thanks for sharing this work! We are trying to reproduce your results. Could you please share the JAENS model generated summaries and/or model checkpoint?

    opened by anton164 0
Owner
null
Code and data for ACL2021 paper Cross-Lingual Abstractive Summarization with Limited Parallel Resources.

Multi-Task Framework for Cross-Lingual Abstractive Summarization (MCLAS) The code for ACL2021 paper Cross-Lingual Abstractive Summarization with Limit

Yu Bai 43 Nov 7, 2022
Code for our paper "SimCLS: A Simple Framework for Contrastive Learning of Abstractive Summarization", ACL 2021

SimCLS Code for our paper: "SimCLS: A Simple Framework for Contrastive Learning of Abstractive Summarization", ACL 2021 1. How to Install Requirements

Yixin Liu 150 Dec 12, 2022
null 190 Jan 3, 2023
The code repository for EMNLP 2021 paper "Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization".

Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization [Paper] accepted at the EMNLP 2021: Vision Guided Genera

CAiRE 42 Jan 7, 2023
Abstractive opinion summarization system (SelSum) and the largest dataset of Amazon product summaries (AmaSum). EMNLP 2021 conference paper.

Learning Opinion Summarizers by Selecting Informative Reviews This repository contains the codebase and the dataset for the corresponding EMNLP 2021

Arthur Bražinskas 39 Jan 1, 2023
Code and data for ACL2021 paper Cross-Lingual Abstractive Summarization with Limited Parallel Resources.

Multi-Task Framework for Cross-Lingual Abstractive Summarization (MCLAS) The code for ACL2021 paper Cross-Lingual Abstractive Summarization with Limit

Yu Bai 43 Nov 7, 2022
NAACL'2021: Factual Probing Is [MASK]: Learning vs. Learning to Recall

OptiPrompt This is the PyTorch implementation of the paper Factual Probing Is [MASK]: Learning vs. Learning to Recall. We propose OptiPrompt, a simple

Princeton Natural Language Processing 150 Dec 20, 2022
Measuring and Improving Consistency in Pretrained Language Models

ParaRel ?? This repository contains the code and data for the paper: Measuring and Improving Consistency in Pretrained Language Models as well as the

Yanai Elazar 26 Dec 2, 2022
This repository is an implementation of paper : Improving the Training of Graph Neural Networks with Consistency Regularization

CRGNN Paper : Improving the Training of Graph Neural Networks with Consistency Regularization Environments Implementing environment: GeForce RTX™ 3090

THUDM 1 Dec 9, 2021
A New Approach to Overgenerating and Scoring Abstractive Summaries

We provide the source code for the paper "A New Approach to Overgenerating and Scoring Abstractive Summaries" accepted at NAACL'21. If you find the code useful, please cite the following paper.

Kaiqiang Song 4 Apr 3, 2022
Summary Explorer is a tool to visually explore the state-of-the-art in text summarization.

Summary Explorer Summary Explorer is a tool to visually inspect the summaries from several state-of-the-art neural summarization models across multipl

Webis 42 Aug 14, 2022
Code for the paper "Improving Vision-and-Language Navigation with Image-Text Pairs from the Web" (ECCV 2020)

Improving Vision-and-Language Navigation with Image-Text Pairs from the Web Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh

Arjun Majumdar 44 Dec 14, 2022
RTS3D: Real-time Stereo 3D Detection from 4D Feature-Consistency Embedding Space for Autonomous Driving

RTS3D: Real-time Stereo 3D Detection from 4D Feature-Consistency Embedding Space for Autonomous Driving (AAAI2021). RTS3D is efficiency and accuracy s

null 71 Nov 29, 2022
Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning, CVPR 2021

Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning By Zhenda Xie*, Yutong Lin*, Zheng Zhang, Yue Ca

Zhenda Xie 293 Dec 20, 2022
Consistency Regularization for Adversarial Robustness

Consistency Regularization for Adversarial Robustness Official PyTorch implementation of Consistency Regularization for Adversarial Robustness by Jiho

null 40 Dec 17, 2022
Multi-Scale Geometric Consistency Guided Multi-View Stereo

ACMM [News] The code for ACMH is released!!! [News] The code for ACMP is released!!! About ACMM is a multi-scale geometric consistency guided multi-vi

Qingshan Xu 118 Jan 4, 2023
[PyTorch] Official implementation of CVPR2021 paper "PointDSC: Robust Point Cloud Registration using Deep Spatial Consistency". https://arxiv.org/abs/2103.05465

PointDSC repository PyTorch implementation of PointDSC for CVPR'2021 paper "PointDSC: Robust Point Cloud Registration using Deep Spatial Consistency",

null 153 Dec 14, 2022
Semi-supervised Semantic Segmentation with Directional Context-aware Consistency (CVPR 2021)

Semi-supervised Semantic Segmentation with Directional Context-aware Consistency (CAC) Xin Lai*, Zhuotao Tian*, Li Jiang, Shu Liu, Hengshuang Zhao, Li

Jia Research Lab 137 Dec 14, 2022
PyTorch code for SENTRY: Selective Entropy Optimization via Committee Consistency for Unsupervised DA

PyTorch Code for SENTRY: Selective Entropy Optimization via Committee Consistency for Unsupervised Domain Adaptation Viraj Prabhu, Shivam Khare, Deeks

Viraj Prabhu 46 Dec 24, 2022