SummVis is an interactive visualization tool for text summarization.

Overview

SummVis

SummVis is an interactive visualization tool for analyzing abstractive summarization model outputs and datasets.

Installation

IMPORTANT: Please use Python >= 3.8, since some dependencies require it for installation.

git clone https://github.com/robustness-gym/summvis.git
cd summvis
pip install -r requirements.txt
python -m spacy download en_core_web_sm
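
If the installation succeeded, the following optional sanity check should run without errors (it simply loads the spaCy model downloaded above):

# Optional sanity check: loads the spaCy model downloaded above
python -c "import spacy; spacy.load('en_core_web_sm')"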

Quickstart

Follow the steps below to start using SummVis immediately.

1. Download and extract data

Download our pre-cached dataset that contains predictions for state-of-the-art models such as PEGASUS and BART on 1000 examples taken from the CNN / Daily Mail validation set.

mkdir data
mkdir preprocessing
curl https://storage.googleapis.com/sfr-summvis-data-research/cnn_dailymail_1000.validation.anonymized.zip --output preprocessing/cnn_dailymail_1000.validation.anonymized.zip
unzip preprocessing/cnn_dailymail_1000.validation.anonymized.zip -d preprocessing/

2. Deanonymize data

Next, we'll need to add the original examples from the CNN / Daily Mail dataset to deanonymize the data (this information is omitted for copyright reasons). The preprocessing.py script can be used for this with the --deanonymize flag.

Deanonymize 10 examples (try_it mode):

python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--processed_dataset_path data/try:cnn_dailymail_1000.validation \
--try_it

This will take between 10 seconds and several minutes depending on whether you've previously loaded CNN/DailyMail from the Datasets library.

3. Run SummVis

Finally, we're ready to run the Streamlit app. Once the app loads, make sure it's pointing to the right file in the File selector at the top of the interface.

streamlit run summvis.py
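
Streamlit will print a local URL (http://localhost:8501 by default) that you can open in your browser.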

General instructions for running with pre-loaded datasets

1. Download one of the pre-loaded datasets:

CNN / Daily Mail (1000 examples from validation set): https://storage.googleapis.com/sfr-summvis-data-research/cnn_dailymail_1000.validation.anonymized.zip
CNN / Daily Mail (full validation set): https://storage.googleapis.com/sfr-summvis-data-research/cnn_dailymail.validation.anonymized.zip
XSum (1000 examples from validation set): https://storage.googleapis.com/sfr-summvis-data-research/xsum_1000.validation.anonymized.zip
XSum (full validation set): https://storage.googleapis.com/sfr-summvis-data-research/xsum.validation.anonymized.zip

We recommend choosing the smallest dataset that fits your needs, in order to minimize download and preprocessing time.

Example: Download and unzip CNN / Daily Mail

mkdir data
mkdir preprocessing
curl https://storage.googleapis.com/sfr-summvis-data-research/cnn_dailymail_1000.validation.anonymized.zip --output preprocessing/cnn_dailymail_1000.validation.anonymized.zip
unzip preprocessing/cnn_dailymail_1000.validation.anonymized.zip -d preprocessing/

2. Deanonymize n examples:

Set the --n_samples argument and name the --processed_dataset_path output file accordingly.

Example: Deanonymize 100 examples from CNN / Daily Mail:

python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--processed_dataset_path data/100:cnn_dailymail_1000.validation \
--n_samples 100

Example: Deanonymize all pre-loaded examples from CNN / Daily Mail (1000 examples dataset):

python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--processed_dataset_path data/full:cnn_dailymail_1000.validation \
--n_samples 1000

Example: Deanonymize all pre-loaded examples from CNN / Daily Mail (full dataset):

python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/cnn_dailymail.validation.anonymized \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--processed_dataset_path data/full:cnn_dailymail.validation

Example: Deanonymize all pre-loaded examples from XSum (1000 examples dataset):

python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/xsum_1000.validation.anonymized \
--dataset xsum \
--split validation \
--processed_dataset_path data/full:xsum_1000.validation \
--n_samples 1000

3. Run SummVis

Once the app loads, make sure it's pointing to the right file in the File selector at the top of the interface.

streamlit run summvis.py

Alternatively, you can point SummVis to the folder where your data is stored:

streamlit run summvis.py -- --path your/path/to/data

Note that the additional -- is not a mistake; it is required for passing command-line arguments to Streamlit.

Get your data into SummVis: end-to-end preprocessing

You can also perform preprocessing end-to-end to load any summarization dataset or model predictions into SummVis. Instructions for this are provided below.

Prior to running the following, an additional install step is required:

python -m spacy download en_core_web_lg

1. Standardize and save dataset to disk.

Loads a dataset from the Hugging Face Datasets library, or any dataset that you provide, and stores it in a standardized format with columns for document and summary:reference.

Example: Save CNN / Daily Mail validation split to disk as a jsonl file.

python preprocessing.py \
--standardize \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--save_jsonl_path preprocessing/cnn_dailymail.validation.jsonl

Example: Load custom my_dataset.jsonl, standardize, and save.

python preprocessing.py \
--standardize \
--dataset_jsonl path/to/my_dataset.jsonl \
--doc_column name_of_document_column \
--reference_column name_of_reference_summary_column \
--save_jsonl_path preprocessing/my_dataset.jsonl
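
For reference, each line of the jsonl file is a single JSON record. A minimal, hypothetical example of a standardized record (after standardizing, the columns are document and summary:reference):

{"document": "The quick brown fox jumps over the lazy dog.", "summary:reference": "A fox jumps over a dog."}

Predictions joined in the next step appear as additional summary-prefixed columns (e.g. summary:bart-cnndm).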

2. Add predictions to the saved dataset.

Takes a saved dataset that has already been standardized and adds predictions to it from prediction jsonl files. Cached predictions for several models are available here: https://storage.googleapis.com/sfr-summvis-data-research/predictions.zip
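
To fetch and extract the cached predictions, you can mirror the download steps used earlier (this assumes the archive unpacks into a predictions/ directory, matching the paths in the example below):

# Assumes the archive unpacks into a predictions/ directory
curl https://storage.googleapis.com/sfr-summvis-data-research/predictions.zip --output predictions.zip
unzip predictions.zip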

You may also generate your own predictions using the generation.py script.
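
A minimal invocation, based on the flags shown in the issues at the end of this page (the model and data paths are placeholders; see generation.py for the full set of options):

# Hypothetical model and data paths; flags as used in the issues below
python generation.py \
--model_name_or_path facebook/bart-large-cnn \
--data_path preprocessing/cnn_dailymail.validation.jsonl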

Example: Add 6 prediction files for PEGASUS and BART to the dataset.

python preprocessing.py \
--join_predictions \
--dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
--prediction_jsonls \
predictions/bart-cnndm.cnndm.validation.results.anonymized \
predictions/bart-xsum.cnndm.validation.results.anonymized \
predictions/pegasus-cnndm.cnndm.validation.results.anonymized \
predictions/pegasus-multinews.cnndm.validation.results.anonymized \
predictions/pegasus-newsroom.cnndm.validation.results.anonymized \
predictions/pegasus-xsum.cnndm.validation.results.anonymized \
--save_jsonl_path preprocessing/cnn_dailymail.validation.jsonl

3. Run the preprocessing workflow and save the dataset.

Takes a saved dataset that has been standardized and has had predictions added, applies all of the preprocessing steps (running spaCy, plus the lexical and semantic aligners), and stores the processed dataset back to disk.

Example: Autorun with default settings on a few examples to try it.

python preprocessing.py \
--workflow \
--dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
--processed_dataset_path data/cnn_dailymail.validation \
--try_it

Example: Autorun with default settings on all examples.

python preprocessing.py \
--workflow \
--dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
--processed_dataset_path data/cnn_dailymail
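
Putting the three steps together, an end-to-end run on a custom dataset might look like the following sketch (all file names and column names are placeholders; adjust the flags as described above):

# 1. Standardize (doc/reference column names are placeholders)
python preprocessing.py \
--standardize \
--dataset_jsonl path/to/my_dataset.jsonl \
--doc_column article \
--reference_column highlights \
--save_jsonl_path preprocessing/my_dataset.jsonl

# 2. Join model predictions (prediction file name is a placeholder)
python preprocessing.py \
--join_predictions \
--dataset_jsonl preprocessing/my_dataset.jsonl \
--prediction_jsonls predictions/my_model.validation.results \
--save_jsonl_path preprocessing/my_dataset.jsonl

# 3. Run the preprocessing workflow and launch the app
python preprocessing.py \
--workflow \
--dataset_jsonl preprocessing/my_dataset.jsonl \
--processed_dataset_path data/my_dataset

streamlit run summvis.py -- --path data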

Citation

When referencing this repository, please cite this paper:

@misc{vig2021summvis,
      title={SummVis: Interactive Visual Analysis of Models, Data, and Evaluation for Text Summarization}, 
      author={Jesse Vig and Wojciech Kryscinski and Karan Goel and Nazneen Fatema Rajani},
      year={2021},
      eprint={2104.07605},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2104.07605}
}

Acknowledgements

We thank Michael Correll for his valuable feedback.

Comments
  • Following the SummVis Quickstart yields "ImportError: cannot import name 'Dataset' from 'robustnessgym'"

    Initial note: I'm using Apple silicon.

    Describe the bug: Running the documentation's quickstart example doesn't create the validation data. There are ImportErrors for Dataset, Spacy, and CachedOperation. I have tried with some of the other pre-loaded datasets as well.

    To reproduce: The following code runs without throwing any errors:

    git clone https://github.com/robustness-gym/summvis.git
    cd summvis
    conda env create -f environment.yml
    conda activate summvis
    

    However, when trying to create the pre-cached examples with

    sh quickstart.sh
    

    a couple of errors are thrown, many of which I am able to fix. I do that by manually installing kaleido, grpcio, robustnessgym, and libopenblas as follows (in the summvis conda environment):

    python -m pip install kaleido
    conda install grpcio
    python -m pip install robustnessgym
    conda install libopenblas
    

    I followed the recommendations in this issue to install grpcio using conda and took the same approach for libopenblas.

    Finally, running

    sh quickstart.sh
    

    yields the following error, which refers to line 8 of preprocessing.py:

      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 53.7M  100 53.7M    0     0  12.7M      0  0:00:04  0:00:04 --:--:-- 12.7M
    Archive:  preprocessing/cnn_dailymail_1000.validation.anonymized.zip
      inflating: preprocessing/cnn_dailymail_1000.validation.anonymized/metadata.json  
      inflating: preprocessing/cnn_dailymail_1000.validation.anonymized/_dataset/data.gz  
    Traceback (most recent call last):
      File "preprocessing.py", line 8, in <module>
        from robustnessgym import Dataset, Spacy, CachedOperation
    ImportError: cannot import name 'Dataset' from 'robustnessgym' (/opt/anaconda3/envs/summvis/lib/python3.8/site-packages/robustnessgym/__init__.py)
    

    Expected behavior: A cnn_dailymail_10.validation file in the data folder, and ultimately a running SummVis app.

    System information

    • Apple Silicon
    • OS: MacOS 11.3.1
    • Python version: 3.8.11

    pip freeze output:

    absl-py==0.13.0
    aiohttp==3.7.4.post0
    antlr4-python3-runtime==4.8
    async-timeout==3.0.1
    attrs==21.2.0
    cachetools==4.2.2
    certifi==2021.5.30
    chardet==4.0.0
    charset-normalizer==2.0.4
    click==8.0.1
    coverage @ file:///Users/ktietz/demo/mc3/conda-bld/coverage_1630667500650/work
    Cython @ file:///Users/ktietz/demo/mc3/conda-bld/cython_1628585205880/work
    cytoolz==0.11.0
    dataclasses==0.6
    datasets==1.11.0
    dill==0.3.4
    fastBPE==0.1.0
    filelock==3.0.12
    fsspec==2021.8.1
    future==0.18.2
    fuzzywuzzy==0.18.0
    google-auth==1.35.0
    google-auth-oauthlib==0.4.6
    grpcio @ file:///Users/ktietz/demo/mc3/conda-bld/grpcio_1628724614448/work
    huggingface-hub==0.0.16
    idna==3.2
    joblib==1.0.1
    jsonlines==2.0.0
    kaleido==0.2.1
    Markdown==3.3.4
    meerkat-ml==0.1.2
    multidict==5.1.0
    multiprocess==0.70.12.2
    numpy==1.21.2
    oauthlib==3.1.1
    omegaconf==2.1.1
    packaging==21.0
    pandas==1.3.3
    plotly==5.3.1
    progressbar==2.5
    protobuf==3.17.3
    pyahocorasick==1.4.2
    pyarrow==5.0.0
    pyasn1==0.4.8
    pyasn1-modules==0.2.8
    pyDeprecate==0.3.1
    pyparsing==2.4.7
    python-dateutil==2.8.2
    python-Levenshtein==0.12.2
    pytorch-lightning==1.4.6
    pytz==2021.1
    PyYAML==5.4.1
    regex==2021.8.28
    requests==2.26.0
    requests-oauthlib==1.3.0
    robustnessgym==0.1.3
    rsa==4.7.2
    sacremoses==0.0.45
    scikit-learn==0.24.2
    scipy==1.7.1
    semver==2.13.0
    six @ file:///tmp/build/80754af9/six_1623709665295/work
    sklearn==0.0
    tenacity==8.0.1
    tensorboard==2.6.0
    tensorboard-data-server==0.6.1
    tensorboard-plugin-wit==1.8.0
    threadpoolctl==2.2.0
    tokenizers==0.10.3
    toolz==0.11.1
    torch==1.9.0
    torchaudio==0.9.0
    torchmetrics==0.5.1
    tqdm==4.62.2
    transformers==4.10.2
    typing-extensions==3.10.0.2
    ujson==4.1.0
    urllib3==1.26.6
    Werkzeug==2.0.1
    xxhash==2.0.2
    yarl==1.6.3
    
    opened by JakobLS 8
  • Custom BART model

    Hi, first off thank you for this great work and framework.

    I am really keen to visualise the summarisations of my custom BART model and wanted to try out the framework.

    Unfortunately, I always receive the following error when I start the service:

    
    ValueError: Expected object or value
    Traceback:
    File "/home/anaconda3/lib/python3.8/site-packages/streamlit/script_runner.py", line 333, in _run_script
        exec(code, module.__dict__)
    File "/home/manschuetz/lebeling_train/BARTTrain/summvis/summvis.py", line 312, in <module>
        dataset = load_dataset(str(path_dir / filename), nlp=nlp)
    File "/home/anaconda3/lib/python3.8/site-packages/streamlit/caching.py", line 603, in wrapped_func
        return get_or_create_cached_value()
    File "/home/anaconda3/lib/python3.8/site-packages/streamlit/caching.py", line 587, in get_or_create_cached_value
        return_value = func(*args, **kwargs)
    File "/home/manschuetz/lebeling_train/BARTTrain/summvis/summvis.py", line 53, in load_dataset
        return DataPanel.from_jsonl(path)
    File "/home/anaconda3/lib/python3.8/site-packages/meerkat/provenance.py", line 205, in _wrapper
        return fn(*args, **kwargs)
    File "/home/anaconda3/lib/python3.8/site-packages/meerkat/datapanel.py", line 341, in from_jsonl
        return cls.from_pandas(pd.read_json(json_path, orient="records", lines=True))
    File "/home/anaconda3/lib/python3.8/site-packages/pandas/util/_decorators.py", line 199, in wrapper
        return func(*args, **kwargs)
    File "/home/anaconda3/lib/python3.8/site-packages/pandas/util/_decorators.py", line 299, in wrapper
        return func(*args, **kwargs)
    File "/home/anaconda3/lib/python3.8/site-packages/pandas/io/json/_json.py", line 563, in read_json
        return json_reader.read()
    File "/home/anaconda3/lib/python3.8/site-packages/pandas/io/json/_json.py", line 692, in read
        obj = self._get_object_parser(self._combine_lines(data_lines))
    File "/home/anaconda3/lib/python3.8/site-packages/pandas/io/json/_json.py", line 716, in _get_object_parser
        obj = FrameParser(json, **kwargs).parse()
    File "/home/anaconda3/lib/python3.8/site-packages/pandas/io/json/_json.py", line 831, in parse
        self._parse_no_numpy()
    File "/home/anaconda3/lib/python3.8/site-packages/pandas/io/json/_json.py", line 1098, in _parse_no_numpy
        loads(json, precise_float=self.precise_float), dtype=None 
    

    Here are my steps:

    1. I create a document and reference summary json (ref.json): {"document": "Cricket (englisch [ˈkɹɪkɪt]; in Deutschland amtlich Kricket,[1][2] in den Anfängen auch „Thorball“) ist ein Schlagballspiel mit zwei Mannschaften. Dabei dreht sich alles um das Duell zwischen dem Werfer (Bowler) und dem Schlagmann (Batter). Der Bowler versucht, den Batter zu einem Fehler zu bewegen, damit dieser ausscheidet, der Batter seinerseits versucht, den Ball wegzuschlagen, um Punkte (Runs) zu erzielen. Der Bowler wird durch die anderen Feldspieler unterstützt, die den Ball so schnell wie möglich zurückzubringen versuchen.", "summary:reference": "Cricket ist ein Schlagballspiel. Der Werfer wirft den Ball. Der Schlagmann schlägt den Ball."}
    2. I call python generation.py --model_name_or_path /path/to/my/BARTModel --data_path /path/to/ref.json
    3. I run streamlit run summvis.py -- --path ref.predictions

    I also tried the same with a json file where I included the summary of the BART model by hand: {"document": "Cricket (englisch [ˈkɹɪkɪt]; in Deutschland amtlich Kricket,[1][2] in den Anfängen auch „Thorball“) ist ein Schlagballspiel mit zwei Mannschaften. Dabei dreht sich alles um das Duell zwischen dem Werfer (Bowler) und dem Schlagmann (Batter). Der Bowler versucht, den Batter zu einem Fehler zu bewegen, damit dieser ausscheidet, der Batter seinerseits versucht, den Ball wegzuschlagen, um Punkte (Runs) zu erzielen. Der Bowler wird durch die anderen Feldspieler unterstützt, die den Ball so schnell wie möglich zurückzubringen versuchen.", "summary:reference": "Cricket ist ein Schlagballspiel. Der Werfer wirft den Ball. Der Schlagmann schlägt den Ball.","summary:BART_custom": " Cricket ist ein Schlagballspiel mit zwei Mannschaften"}

    Any help is highly appreciated. Thank you.

    opened by LukasFides 3
  • [PEGASUS and BART] summarization with csv dataset

    Hi,

    I have fine-tuned two models for text summarization, BART (facebook-bart) and PEGASUS (financial), and I have a dataset stored in .csv files. I want to use this tool for the analysis. Can you help me? How can I run this code with my fine-tuned models and my dataset?

    opened by SatishDeshbhratar 3
  • ":" on Windows can't be used in file names.

    Hello, and thanks for your great work! But I noticed that both in preprocessing and the main application, the default way to separate file names is ":", which cannot be used on Windows. After I downloaded your tool, the file names in examples\wikinews\wikinews.cache\mgr\columns became things like 'BertscoreAligner_spacy_document_spacy_summary_bart-cnndm', and so SummVis cannot find the files it needs (separated by ":"). Therefore, I can't even run this example on my PC. Is there any way to solve it? Thanks again!

    opened by Zhao-Linke 1
  • summvis visualisation in scientific paper

    I wanted to know if there is already a way to export the outputs displayed in the framework, in order to e.g. show them in a text document (LaTeX preferably). I assume the .cache folder holds the necessary data (in the .dill files)? I wanted to ask before trying to extract the data on my own. I am happy to have found this framework!

    Cheers, Lukas

    opened by LukasFides 1