SummVis
SummVis is an interactive visualization tool for analyzing abstractive summarization model outputs and datasets.
Installation
IMPORTANT: Please use python>=3.8, since some dependencies require it for installation.
git clone https://github.com/robustness-gym/summvis.git
cd summvis
pip install -r requirements.txt
python -m spacy download en_core_web_sm
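To confirm the setup worked, you can run a quick sanity check (a minimal snippet; the model name matches the download step above). If it exits without error, the spaCy model is installed:
python -c "import spacy; spacy.load('en_core_web_sm')"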
Quickstart
Follow the steps below to start using SummVis immediately.
1. Download and extract data
Download our pre-cached dataset that contains predictions for state-of-the-art models such as PEGASUS and BART on 1000 examples taken from the CNN / Daily Mail validation set.
mkdir data
mkdir preprocessing
curl https://storage.googleapis.com/sfr-summvis-data-research/cnn_dailymail_1000.validation.anonymized.zip --output preprocessing/cnn_dailymail_1000.validation.anonymized.zip
unzip preprocessing/cnn_dailymail_1000.validation.anonymized.zip -d preprocessing/
2. Deanonymize data
Next, we'll need to add the original examples from the CNN / Daily Mail dataset to deanonymize the data (this information is omitted for copyright reasons). The preprocessing.py script can be used for this with the --deanonymize flag.
Example: Deanonymize 10 examples (try_it mode):
python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--processed_dataset_path data/try:cnn_dailymail_1000.validation \
--try_it
This will take between 10 seconds and several minutes, depending on whether you've previously loaded CNN / Daily Mail from the Datasets library.
3. Run SummVis
Finally, we're ready to run the Streamlit app. Once the app loads, make sure it's pointing to the right File at the top of the interface.
streamlit run summvis.py
General instructions for running with pre-loaded datasets
1. Download one of the pre-loaded datasets:
CNN / Daily Mail (1000 examples from validation set): https://storage.googleapis.com/sfr-summvis-data-research/cnn_dailymail_1000.validation.anonymized.zip
CNN / Daily Mail (full validation set): https://storage.googleapis.com/sfr-summvis-data-research/cnn_dailymail.validation.anonymized.zip
XSum (1000 examples from validation set): https://storage.googleapis.com/sfr-summvis-data-research/xsum_1000.validation.anonymized.zip
XSum (full validation set): https://storage.googleapis.com/sfr-summvis-data-research/xsum.validation.anonymized.zip
We recommend that you choose the smallest dataset that fits your needs in order to minimize download / preprocessing time.
Example: Download and unzip CNN / Daily Mail
mkdir data
mkdir preprocessing
curl https://storage.googleapis.com/sfr-summvis-data-research/cnn_dailymail_1000.validation.anonymized.zip --output preprocessing/cnn_dailymail_1000.validation.anonymized.zip
unzip preprocessing/cnn_dailymail_1000.validation.anonymized.zip -d preprocessing/
2. Deanonymize n examples:
Set the --n_samples argument and name the --processed_dataset_path output file accordingly.
Example: Deanonymize 100 examples from CNN / Daily Mail:
python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--processed_dataset_path data/100:cnn_dailymail_1000.validation \
--n_samples 100
Example: Deanonymize all pre-loaded examples from CNN / Daily Mail (1000 examples dataset):
python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--processed_dataset_path data/full:cnn_dailymail_1000.validation \
--n_samples 1000
Example: Deanonymize all pre-loaded examples from CNN / Daily Mail (full dataset):
python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/cnn_dailymail.validation.anonymized \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--processed_dataset_path data/full:cnn_dailymail.validation
Example: Deanonymize all pre-loaded examples from XSum (1000 examples dataset):
python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/xsum_1000.validation.anonymized \
--dataset xsum \
--split validation \
--processed_dataset_path data/full:xsum_1000.validation \
--n_samples 1000
3. Run SummVis
Once the app loads, make sure it's pointing to the right File at the top of the interface.
streamlit run summvis.py
Alternatively, you can point SummVis to a folder where your data is stored:
streamlit run summvis.py -- --path your/path/to/data
Note that the additional -- is not a mistake; it is required to pass command-line arguments to Streamlit.
Get your data into SummVis: end-to-end preprocessing
You can also perform preprocessing end-to-end to load any summarization dataset or model predictions into SummVis. Instructions for this are provided below.
Prior to running the following, an additional install step is required:
python -m spacy download en_core_web_lg
1. Standardize and save dataset to disk.
Loads in a dataset from HF, or any dataset that you have, and stores it in a standardized format with columns for document and summary:reference.
Example: Save CNN / Daily Mail validation split to disk as a jsonl file.
python preprocessing.py \
--standardize \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--save_jsonl_path preprocessing/cnn_dailymail.validation.jsonl
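For reference, each line of the saved jsonl file holds one example with the standardized column names from above. The record below is a hypothetical, truncated illustration, not actual dataset content:
{"document": "(CNN) Full article text goes here ...", "summary:reference": "Reference summary goes here ..."}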
Example: Load custom my_dataset.jsonl, standardize, and save.
python preprocessing.py \
--standardize \
--dataset_jsonl path/to/my_dataset.jsonl \
--doc_column name_of_document_column \
--reference_column name_of_reference_summary_column \
--save_jsonl_path preprocessing/my_dataset.jsonl
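As a concrete illustration (with hypothetical column names), if my_dataset.jsonl stored its text under article and its reference summaries under highlights, each line would look like:
{"article": "Full document text ...", "highlights": "Reference summary ..."}
and you would pass --doc_column article and --reference_column highlights.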
2. Add predictions to the saved dataset.
Takes a saved dataset that has already been standardized and adds predictions to it from prediction jsonl files. Cached predictions for several models are available here: https://storage.googleapis.com/sfr-summvis-data-research/predictions.zip
You may also generate your own predictions using this script.
Example: Add 6 prediction files for PEGASUS and BART to the dataset.
python preprocessing.py \
--join_predictions \
--dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
--prediction_jsonls \
predictions/bart-cnndm.cnndm.validation.results.anonymized \
predictions/bart-xsum.cnndm.validation.results.anonymized \
predictions/pegasus-cnndm.cnndm.validation.results.anonymized \
predictions/pegasus-multinews.cnndm.validation.results.anonymized \
predictions/pegasus-newsroom.cnndm.validation.results.anonymized \
predictions/pegasus-xsum.cnndm.validation.results.anonymized \
--save_jsonl_path preprocessing/cnn_dailymail.validation.jsonl
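Judging from the summary:reference naming convention above, the joined file presumably gains one summary:<model> column per prediction file (an assumption, not confirmed here); a joined record might then include fields like:
{"document": "...", "summary:reference": "...", "summary:pegasus-cnndm": "..."}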
3. Run the preprocessing workflow and save the dataset.
Takes a saved dataset that has been standardized and has had predictions added, applies all the preprocessing steps to it (running spaCy, lexical and semantic aligners), and stores the processed dataset back to disk.
Example: Autorun with default settings on a few examples to try it.
python preprocessing.py \
--workflow \
--dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
--processed_dataset_path data/cnn_dailymail.validation \
--try_it
Example: Autorun with default settings on all examples.
python preprocessing.py \
--workflow \
--dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
--processed_dataset_path data/cnn_dailymail
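Once the workflow finishes, you can launch the app directly against the processed output using the --path argument described earlier:
streamlit run summvis.py -- --path data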
Citation
When referencing this repository, please cite this paper:
@misc{vig2021summvis,
title={SummVis: Interactive Visual Analysis of Models, Data, and Evaluation for Text Summarization},
author={Jesse Vig and Wojciech Kryscinski and Karan Goel and Nazneen Fatema Rajani},
year={2021},
eprint={2104.07605},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2104.07605}
}
Acknowledgements
We thank Michael Correll for his valuable feedback.