SummVis

SummVis is an interactive visualization tool for analyzing abstractive summarization model outputs and datasets.

Installation

IMPORTANT: Please use Python >= 3.8, since some dependencies require it for installation.

git clone https://github.com/robustness-gym/summvis.git
cd summvis
pip install -r requirements.txt
python -m spacy download en_core_web_sm
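
Optionally, you can run a quick sanity check (a minimal sketch, not part of the official setup) to confirm that the spaCy model downloaded correctly:

# Optional check: the model loads and its pipeline components are listed.
import spacy

nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)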

Quickstart

Follow the steps below to start using SummVis immediately.

1. Download and extract data

Download our pre-cached dataset that contains predictions for state-of-the-art models such as PEGASUS and BART on 1000 examples taken from the CNN / Daily Mail validation set.

mkdir data
mkdir preprocessing
curl https://storage.googleapis.com/sfr-summvis-data-research/cnn_dailymail_1000.validation.anonymized.zip --output preprocessing/cnn_dailymail_1000.validation.anonymized.zip
unzip preprocessing/cnn_dailymail_1000.validation.anonymized.zip -d preprocessing/

2. Deanonymize data

Next, we'll need to add the original examples from the CNN / Daily Mail dataset to deanonymize the data (this information is omitted for copyright reasons). The preprocessing.py script can be used for this with the --deanonymize flag.

Deanonymize 10 examples (try_it mode):

python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--processed_dataset_path data/try:cnn_dailymail_1000.validation \
--try_it

This will take between 10 seconds and several minutes depending on whether you've previously loaded CNN/DailyMail from the Datasets library.

3. Run SummVis

Finally, we're ready to run the Streamlit app. Once the app loads, make sure it's pointing to the right File at the top of the interface.

streamlit run summvis.py

General instructions for running with pre-loaded datasets

1. Download one of the pre-loaded datasets:

CNN / Daily Mail (1000 examples from validation set): https://storage.googleapis.com/sfr-summvis-data-research/cnn_dailymail_1000.validation.anonymized.zip
CNN / Daily Mail (full validation set): https://storage.googleapis.com/sfr-summvis-data-research/cnn_dailymail.validation.anonymized.zip
XSum (1000 examples from validation set): https://storage.googleapis.com/sfr-summvis-data-research/xsum_1000.validation.anonymized.zip
XSum (full validation set): https://storage.googleapis.com/sfr-summvis-data-research/xsum.validation.anonymized.zip

We recommend that you choose the smallest dataset that fits your need in order to minimize download / preprocessing time.

Example: Download and unzip CNN / Daily Mail

mkdir data
mkdir preprocessing
curl https://storage.googleapis.com/sfr-summvis-data-research/cnn_dailymail_1000.validation.anonymized.zip --output preprocessing/cnn_dailymail_1000.validation.anonymized.zip
unzip preprocessing/cnn_dailymail_1000.validation.anonymized.zip -d preprocessing/

2. Deanonymize n examples:

Set the --n_samples argument and name the --processed_dataset_path output file accordingly.

Example: Deanonymize 100 examples from CNN / Daily Mail:

python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--processed_dataset_path data/100:cnn_dailymail_1000.validation \
--n_samples 100

Example: Deanonymize all pre-loaded examples from CNN / Daily Mail (1000 examples dataset):

python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--processed_dataset_path data/full:cnn_dailymail_1000.validation \
--n_samples 1000

Example: Deanonymize all pre-loaded examples from CNN / Daily Mail (full dataset):

python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/cnn_dailymail.validation.anonymized \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--processed_dataset_path data/full:cnn_dailymail.validation

Example: Deanonymize all pre-loaded examples from XSum (1000 examples dataset):

python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/xsum_1000.validation.anonymized \
--dataset xsum \
--split validation \
--processed_dataset_path data/full:xsum_1000.validation \
--n_samples 1000

3. Run SummVis

Once the app loads, make sure it's pointing to the right File at the top of the interface.

streamlit run summvis.py

Alternatively, if you need to point SummVis to the folder where your data is stored:

streamlit run summvis.py -- --path your/path/to/data

Note that the additional -- is not a mistake; it is required to pass command-line arguments to the script when using Streamlit.

Get your data into SummVis: end-to-end preprocessing

You can also perform preprocessing end-to-end to load any summarization dataset or model predictions into SummVis. Instructions for this are provided below.

Prior to running the following, an additional install step is required:

python -m spacy download en_core_web_lg

1. Standardize and save dataset to disk.

Loads a dataset from the Hugging Face (HF) Datasets library, or any dataset that you have, and stores it in a standardized format with columns for document and summary:reference.

Example: Save CNN / Daily Mail validation split to disk as a jsonl file.

python preprocessing.py \
--standardize \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--save_jsonl_path preprocessing/cnn_dailymail.validation.jsonl

Example: Load custom my_dataset.jsonl, standardize, and save.

python preprocessing.py \
--standardize \
--dataset_jsonl path/to/my_dataset.jsonl \
--doc_column name_of_document_column \
--reference_column name_of_reference_summary_column \
--save_jsonl_path preprocessing/my_dataset.jsonl
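
For reference, the custom input is a jsonl file with one example per line. The following is a minimal, hypothetical sketch of building such a file; the article and summary column names are placeholders that should match whatever you pass to --doc_column and --reference_column:

# Hypothetical helper for building my_dataset.jsonl; the column names are
# placeholders and must match the --doc_column / --reference_column arguments.
import json

examples = [
    {"article": "Full text of the source document ...",
     "summary": "Reference summary of the document ..."},
]
with open("path/to/my_dataset.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")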

2. Add predictions to the saved dataset.

Takes a saved dataset that has already been standardized and adds predictions to it from prediction jsonl files. Cached predictions for several models are available here: https://storage.googleapis.com/sfr-summvis-data-research/predictions.zip

You may also generate your own predictions using this script.
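
As a hedged illustration (not the repository's generation script), the sketch below shows one way to produce such a prediction file with the Hugging Face transformers library; the "summary" field name is an assumption, so check the cached prediction files or preprocessing.py for the exact schema expected by --prediction_jsonls:

# Hedged sketch: generate summaries with a pretrained model and write one
# prediction per line to a jsonl file. The "summary" field name is an assumption.
import json
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

documents = [
    "Full text of the first article ...",
    "Full text of the second article ...",
]
with open("predictions/my-model.cnndm.validation.jsonl", "w") as f:
    for doc in documents:
        prediction = summarizer(doc, truncation=True)[0]["summary_text"]
        f.write(json.dumps({"summary": prediction}) + "\n")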

Example: Add 6 prediction files for PEGASUS and BART to the dataset.

python preprocessing.py \
--join_predictions \
--dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
--prediction_jsonls \
predictions/bart-cnndm.cnndm.validation.results.anonymized \
predictions/bart-xsum.cnndm.validation.results.anonymized \
predictions/pegasus-cnndm.cnndm.validation.results.anonymized \
predictions/pegasus-multinews.cnndm.validation.results.anonymized \
predictions/pegasus-newsroom.cnndm.validation.results.anonymized \
predictions/pegasus-xsum.cnndm.validation.results.anonymized \
--save_jsonl_path preprocessing/cnn_dailymail.validation.jsonl

3. Run the preprocessing workflow and save the dataset.

Takes a saved dataset that has been standardized and has had predictions added, applies all the preprocessing steps (running spaCy, lexical and semantic aligners), and stores the processed dataset back to disk.

Example: Autorun with default settings on a few examples to try it.

python preprocessing.py \
--workflow \
--dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
--processed_dataset_path data/cnn_dailymail.validation \
--try_it

Example: Autorun with default settings on all examples.

python preprocessing.py \
--workflow \
--dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
--processed_dataset_path data/cnn_dailymail
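
Once the workflow completes, launch SummVis on the processed output as described above, pointing it at the directory that contains the processed dataset:

streamlit run summvis.py -- --path data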

Citation

When referencing this repository, please cite this paper:

@misc{vig2021summvis,
      title={SummVis: Interactive Visual Analysis of Models, Data, and Evaluation for Text Summarization}, 
      author={Jesse Vig and Wojciech Kryscinski and Karan Goel and Nazneen Fatema Rajani},
      year={2021},
      eprint={2104.07605},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2104.07605}
}

Acknowledgements

We thank Michael Correll for his valuable feedback.

Issues
  • Following the SummVis Quickstart yields "ImportError: cannot import name 'Dataset' from 'robustnessgym'"

    Initial note: I'm using Apple Silicon.

    Describe the bug: Running the documentation's quickstart example doesn't create the validation data. There are ImportErrors for Dataset, Spacy and CachedOperation. I have tried with some of the other pre-loaded datasets as well.

    To reproduce: The following code runs without throwing any errors:

    git clone https://github.com/robustness-gym/summvis.git
    cd summvis
    conda env create -f environment.yml
    conda activate summvis
    

    However, when trying to create the pre-cached examples with

    sh quickstart.sh
    

    a couple of errors are thrown, many of which I am able to fix. I do that by manually installing kaleido, grpcio, robustnessgym and libopenblas as follows (in the summvis conda environment):

    python -m pip install kaleido
    conda install grpcio
    python -m pip install robustnessgym
    conda install libopenblas
    

    I followed the recommendations in this issue to install grpcio using conda and took the same approach for libopenblas.

    Finally, running

    sh quickstart.sh
    

    yields the following error, which refers to line 8 of preprocessing.py:

      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 53.7M  100 53.7M    0     0  12.7M      0  0:00:04  0:00:04 --:--:-- 12.7M
    Archive:  preprocessing/cnn_dailymail_1000.validation.anonymized.zip
      inflating: preprocessing/cnn_dailymail_1000.validation.anonymized/metadata.json  
      inflating: preprocessing/cnn_dailymail_1000.validation.anonymized/_dataset/data.gz  
    Traceback (most recent call last):
      File "preprocessing.py", line 8, in <module>
        from robustnessgym import Dataset, Spacy, CachedOperation
    ImportError: cannot import name 'Dataset' from 'robustnessgym' (/opt/anaconda3/envs/summvis/lib/python3.8/site-packages/robustnessgym/__init__.py)
    

    Expected behavior

    A cnn_dailymail_10.validation file in the data folder, and ultimately a running SummVis app.

    System information

    • Apple Silicon
    • OS: MacOS 11.3.1
    • Python version: 3.8.11

    pip freeze output:

    absl-py==0.13.0
    aiohttp==3.7.4.post0
    antlr4-python3-runtime==4.8
    async-timeout==3.0.1
    attrs==21.2.0
    cachetools==4.2.2
    certifi==2021.5.30
    chardet==4.0.0
    charset-normalizer==2.0.4
    click==8.0.1
    coverage @ file:///Users/ktietz/demo/mc3/conda-bld/coverage_1630667500650/work
    Cython @ file:///Users/ktietz/demo/mc3/conda-bld/cython_1628585205880/work
    cytoolz==0.11.0
    dataclasses==0.6
    datasets==1.11.0
    dill==0.3.4
    fastBPE==0.1.0
    filelock==3.0.12
    fsspec==2021.8.1
    future==0.18.2
    fuzzywuzzy==0.18.0
    google-auth==1.35.0
    google-auth-oauthlib==0.4.6
    grpcio @ file:///Users/ktietz/demo/mc3/conda-bld/grpcio_1628724614448/work
    huggingface-hub==0.0.16
    idna==3.2
    joblib==1.0.1
    jsonlines==2.0.0
    kaleido==0.2.1
    Markdown==3.3.4
    meerkat-ml==0.1.2
    multidict==5.1.0
    multiprocess==0.70.12.2
    numpy==1.21.2
    oauthlib==3.1.1
    omegaconf==2.1.1
    packaging==21.0
    pandas==1.3.3
    plotly==5.3.1
    progressbar==2.5
    protobuf==3.17.3
    pyahocorasick==1.4.2
    pyarrow==5.0.0
    pyasn1==0.4.8
    pyasn1-modules==0.2.8
    pyDeprecate==0.3.1
    pyparsing==2.4.7
    python-dateutil==2.8.2
    python-Levenshtein==0.12.2
    pytorch-lightning==1.4.6
    pytz==2021.1
    PyYAML==5.4.1
    regex==2021.8.28
    requests==2.26.0
    requests-oauthlib==1.3.0
    robustnessgym==0.1.3
    rsa==4.7.2
    sacremoses==0.0.45
    scikit-learn==0.24.2
    scipy==1.7.1
    semver==2.13.0
    six @ file:///tmp/build/80754af9/six_1623709665295/work
    sklearn==0.0
    tenacity==8.0.1
    tensorboard==2.6.0
    tensorboard-data-server==0.6.1
    tensorboard-plugin-wit==1.8.0
    threadpoolctl==2.2.0
    tokenizers==0.10.3
    toolz==0.11.1
    torch==1.9.0
    torchaudio==0.9.0
    torchmetrics==0.5.1
    tqdm==4.62.2
    transformers==4.10.2
    typing-extensions==3.10.0.2
    ujson==4.1.0
    urllib3==1.26.6
    Werkzeug==2.0.1
    xxhash==2.0.2
    yarl==1.6.3
    
    Opened by JakobLS
  • Connect to remote server

    When I run this program on a remote server, can it be displayed locally?

    Opened by martin6336