Abstractive opinion summarization system (SelSum) and the largest dataset of Amazon product summaries (AmaSum). EMNLP 2021 conference paper.

Arthur Bražinskas

Last update: Jan 1, 2023

Learning Opinion Summarizers by Selecting Informative Reviews

This repository contains the codebase and the dataset for the corresponding EMNLP 2021 paper. Please star the repository and cite the paper if you find it useful.

SelSum is a probabilistic (latent) model that selects informative reviews from large collections and subsequently summarizes them.

AmaSum is the largest abstractive opinion summarization dataset, consisting of more than 33,000 human-written summaries for Amazon products. Each summary is paired, on average, with more than 320 customer reviews. Summaries consist of verdicts, pros, and cons; see the example below.

Verdict: The Olympus Evolt E-500 is a compact, easy-to-use digital SLR camera with a broad feature set for its class and very nice photo quality overall.

Pros:

Compact design
Strong autofocus performance even in low-light situations
Intuitive and easy-to-navigate menu system
Wide range of automated and manual features to appeal to both serious hobbyists and curious SLR newcomers

Cons:

Unreliable automatic white balance in some conditions
Slow start-up time when dust reduction is enabled
Compatible Zuiko lenses don't indicate focal distance
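
The three-part structure of a summary can be sketched as a small data structure. This is only an illustration: the class name `OpinionSummary` and its fields are hypothetical and do not reflect the dataset's actual file format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OpinionSummary:
    """Hypothetical in-memory view of one summary (not the on-disk format)."""
    verdict: str
    pros: List[str] = field(default_factory=list)
    cons: List[str] = field(default_factory=list)

    def as_text(self):
        """Render the summary in the verdict / pros / cons layout shown above."""
        lines = ["Verdict: " + self.verdict, "Pros:"]
        lines += ["- " + p for p in self.pros]
        lines.append("Cons:")
        lines += ["- " + c for c in self.cons]
        return "\n".join(lines)


summary = OpinionSummary(
    verdict="A compact, easy-to-use digital SLR camera.",
    pros=["Compact design"],
    cons=["Slow start-up time when dust reduction is enabled"],
)
print(summary.as_text())
```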

1. Setting up

1.1. Environment

The easiest way to proceed is to create a separate conda environment with Python 3.7.0.

conda create -n selsum python=3.7.0

Further, install PyTorch as shown below.

conda install -c pytorch pytorch=1.7.0

In addition, install the essential Python modules:

pip install -r requirements.txt

The codebase relies on FairSeq. To avoid version conflicts, please download our version and store it in ../fairseq_lib. Then follow the installation instructions in the unzipped directory.

1.2. Environment variables

Before running the scripts, please set the environment variables below.

export PYTHONPATH=../fairseq_lib/.:$PYTHONPATH
export CUDA_VISIBLE_DEVICES=0,1,2,3
export MKL_THREADING_LAYER=GNU
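
As a quick sanity check before launching the scripts, the variables above can be verified from Python. This is a minimal sketch and not part of the repository; the helper name `missing_env_vars` is hypothetical.

```python
import os

# Variables listed in section 1.2.
REQUIRED_VARS = ("PYTHONPATH", "CUDA_VISIBLE_DEVICES", "MKL_THREADING_LAYER")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```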

1.3. Data

The dataset is available in various formats in the dataset folder. To run the model, please binarize the fairseq-specific version.

1.4. Checkpoints

We also provide the checkpoints of the trained models. These should be placed in artifacts/checkpoints.

2. Training

2.1. Posterior and Summarizer training

First, the posterior and the summarizer need to be trained. The summarizer is initialized using the BART base model; please download the checkpoint and store it in artifacts/bart. Note: please adjust the hyper-parameters and paths in the script if needed.

bash selsum/scripts/training/train_selsum.sh

Please note that the REINFORCE-based loss for posterior training can be negative, because the value computed in the forward pass does not correspond to the actual loss function. Instead, the loss is reformulated so that the backward pass yields the intended gradients (Eq. 5 in the paper).
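
The sign behaviour can be illustrated with a generic score-function (REINFORCE) surrogate. The sketch below is an assumption for illustration only: it uses a textbook form with an optional baseline, which may differ from the exact expression in Eq. 5, and the function name `reinforce_surrogate` and all numbers are hypothetical.

```python
import math


def reinforce_surrogate(log_prob, reward, baseline=0.0):
    """Generic REINFORCE surrogate.

    Differentiating this value with respect to the parameters behind
    log_prob gives the score-function gradient estimate, but the value
    itself is not the true objective and has no fixed sign.
    """
    return -(reward - baseline) * log_prob


# Log-probability of a sampled selection (hypothetical probability of 0.2).
log_prob = math.log(0.2)  # negative, since the probability is below 1

# Reward above the baseline: the surrogate value is positive.
above = reinforce_surrogate(log_prob, reward=1.5, baseline=1.0)

# Reward below the baseline: the surrogate value is negative.
below = reinforce_surrogate(log_prob, reward=0.5, baseline=1.0)

print(above > 0, below < 0)
```

A negative value printed in the training log is therefore not, by itself, a sign that training has diverged.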

2.2. Selecting reviews with the Posterior

Once the posterior is trained (jointly with the summarizer), informative reviews need to be selected. The script below produces binary tags indicating selected reviews.

python selsum/scripts/inference/posterior_select_revs.py --data-path=../data/form  \
--checkpoint-path=artifacts/checkpoints/selsum.pt \
--bart-dir=artifacts/bart \
--output-folder-path=artifacts/output/q_sel \
--split=test \
--ndocs=10 \
--batch-size=30

Alternatively, the output can be downloaded and stored in artifacts/output/q_sel.
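
To illustrate what the binary tags encode, the sketch below keeps only the reviews whose tag is 1. The in-memory layout (a list of review strings with a parallel list of 0/1 tags) and the helper name `keep_selected` are assumptions for illustration; the files actually written by the script may be organized differently.

```python
def keep_selected(reviews, tags):
    """Return the reviews whose corresponding binary tag equals 1."""
    if len(reviews) != len(tags):
        raise ValueError("reviews and tags must have the same length")
    return [review for review, tag in zip(reviews, tags) if tag == 1]


# Hypothetical reviews and tags for a single product.
reviews = [
    "Great autofocus, even indoors.",
    "Arrived on time.",
    "White balance is unreliable outdoors.",
]
tags = [1, 0, 1]

print(keep_selected(reviews, tags))
```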

2.3. Fitting the Prior

Once tags are produced by the posterior, we can fit the prior to approximate it.

bash selsum/scripts/training/train_prior.sh

2.4. Selecting Reviews with the Prior

After the prior is trained, we select informative reviews for downstream summarization.

python selsum/scripts/inference/prior_select_revs.py --data-path=../data/form \
--checkpoint-path=artifacts/checkpoints/prior.pt \
--bart-dir=artifacts/bart \
--output-folder-path=artifacts/output/p_sel \
--split=test \
--ndocs=10 \
--batch-size=10

Alternatively, the output can be downloaded and stored in artifacts/output/p_sel.

3. Inference

3.1. Summary generation

To generate summaries, run the command below:

python selsum/scripts/inference/gen_summs.py --data-path=artifacts/output/p_sel/ \
--bart-dir=artifacts/bart \
--checkpoint-path=artifacts/checkpoints/selsum.pt \
--output-folder-path=artifacts/output/p_summs \
--split=test \
--batch-size=20

The model outputs are also available at artifacts/summs.

3.2. Evaluation

For evaluation, we used a wrapper over ROUGE and the CoreNLP tokenizer.

The tokenizer requires the CoreNLP library to be downloaded. Please unzip it to the artifacts/misc folder. Further, make it visible in the classpath as shown below.

export CLASSPATH=artifacts/misc/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar

After the installations, please adjust the paths and use the commands below.

GEN_FILE_PATH=artifacts/summs/test.verd
GOLD_FILE_PATH=../data/form/eval/test.verd

# tokenization
cat "${GEN_FILE_PATH}" | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > "${GEN_FILE_PATH}.tokenized"
cat "${GOLD_FILE_PATH}" | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > "${GOLD_FILE_PATH}.tokenized"

# rouge evaluation
files2rouge "${GOLD_FILE_PATH}.tokenized" "${GEN_FILE_PATH}.tokenized"
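
For intuition about what the ROUGE scores measure, the sketch below computes a simplified unigram-overlap F1 (ROUGE-1 style) on whitespace-tokenized text. It is not the implementation used by files2rouge, which may apply additional processing, so its numbers should not be expected to match the official scores. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    # Counter intersection clips each token count to the smaller of the two.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("compact design", "compact camera design"))
```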

Citation

@inproceedings{bražinskas2021learning,
      title={Learning Opinion Summarizers by Selecting Informative Reviews}, 
      author={Arthur Bražinskas and Mirella Lapata and Ivan Titov},
      booktitle={Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)},
      year={2021},
}

License

Codebase: MIT

Dataset: non-commercial

Notes

Occasionally, log output stops being printed while the model is training. In this case, the log may appear either after a gap or only at the end of the epoch.
SelSum is trained with a single data worker process, because cross-parallel errors are encountered otherwise.
