Emblaze - Interactive Embedding Comparison

Emblaze is a Jupyter notebook widget for visually comparing embeddings using animated scatter plots. It bundles an easy-to-use Python API for performing dimensionality reduction on multiple sets of embedding data (including aligning the results for easier comparison), and a full-featured interactive platform for probing and comparing embeddings that runs within a Jupyter notebook cell.

Installation

Compatibility note: This widget has been tested with Python >= 3.7. If you are using JupyterLab, make sure you are running version 3.0 or higher. The widget does not currently support display in the VS Code interactive notebook environment.

Install Emblaze using pip:

pip install emblaze

The widget should work out of the box when you run jupyter lab (see example code below).

Jupyter Notebook note: If you are using Jupyter Notebook 5.2 or earlier, you may also need to enable the nbextension:

jupyter nbextension enable --py --sys-prefix emblaze

Standalone Demo

Although the full application is designed to work as a Jupyter widget, you can run a standalone version with most of the available features directly in your browser. To do so, simply run the following command after pip-installing the package (note: you do not need to clone the repository to run the standalone app):

python -m emblaze.server

Visit localhost:5000 to see the running application. This will allow you to view two demo datasets: one showing five different t-SNE projections of a subset of MNIST digits, and one showing embeddings of the same 5,000 words according to three different data sources (Google News, Wikipedia, and Twitter). To add your own datasets to the standalone app, you can create a directory containing your saved comparison JSON files (see Saving and Loading below), then pass it as a command-line argument:

python -m emblaze.server /path/to/comparisons
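
Before pointing the server at a directory, it can help to confirm that the directory actually contains JSON files. The sketch below is a hypothetical helper, not part of Emblaze: the names `list_comparison_files` and `launch_standalone` are made up for illustration, and it assumes saved comparisons use a `.json` extension. It only wraps the command shown above.

```python
import subprocess
import sys
from pathlib import Path


def list_comparison_files(directory):
    """Return the sorted .json files found directly inside `directory`."""
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not a directory")
    return sorted(
        p for p in root.iterdir() if p.is_file() and p.suffix.lower() == ".json"
    )


def launch_standalone(directory):
    """Run `python -m emblaze.server <directory>` if any JSON files are present."""
    if not list_comparison_files(directory):
        raise FileNotFoundError(f"no .json files found in {directory}")
    # Same command as above, run with the current Python interpreter
    return subprocess.run(
        [sys.executable, "-m", "emblaze.server", str(directory)], check=True
    )
```

Calling `launch_standalone("/path/to/comparisons")` blocks until the server exits, just as the command-line invocation does.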

Examples

Please see examples/example.ipynb to try using the Emblaze widget on the Boston housing prices or MNIST (TensorFlow import required) datasets.

Example 1: Multiple projections of the same embedding dataset. This can reveal areas of variation in the dimensionality reduction process, since t-SNE and UMAP are randomized algorithms.

import emblaze
from emblaze.utils import Field, ProjectionTechnique

# X is an n x k array, Y is a length-n array
X, Y = ...

# Represent the high-dimensional embedding
emb = emblaze.Embedding({Field.POSITION: X, Field.COLOR: Y})
# Compute nearest neighbors in the high-D space (for display)
emb.compute_neighbors(metric='cosine')

# Generate UMAP 2D representations - you can pass UMAP parameters to project()
variants = emblaze.EmbeddingSet([
    emb.project(method=ProjectionTechnique.UMAP) for _ in range(10)
])
# Compute neighbors again (to indicate that we want to compare projections)
variants.compute_neighbors(metric='euclidean')

w = emblaze.Viewer(embeddings=variants)
w

Example 2: Multiple embeddings of the same data from different models. This is useful to see how different models embed data differently.

# Xs is a list of n x k arrays corresponding to different embedding spaces
Xs = ...
# Y is a length-n array of labels for color-coding
Y = ...
# List of strings representing the name of each embedding space (e.g.
# "Google News", "Wikipedia", "Twitter"). Omit to use generic names
embedding_names = [...]

# Make high-dimensional embedding objects
embeddings = emblaze.EmbeddingSet([
    emblaze.Embedding({Field.POSITION: X, Field.COLOR: Y}, label=emb_name)
    for X, emb_name in zip(Xs, embedding_names)
])
embeddings.compute_neighbors(metric='cosine')

# Make aligned UMAP
reduced = embeddings.project(method=ProjectionTechnique.ALIGNED_UMAP)

w = emblaze.Viewer(embeddings=reduced)
w

Example 3: Visualizing image data with image thumbnails. The viewer will display image previews for each point as well as its nearest neighbors. (For text data, you can use TextThumbnails to show small pieces of text next to the points.)

# images is an n x 100 x 100 x 3 numpy array of 100x100 RGB images (values from 0-255)
images = ...
thumbnails = emblaze.ImageThumbnails(images)
w = emblaze.Viewer(embeddings=embeddings, thumbnails=thumbnails)
w

You can also visualize embeddings with multimodal labels (i.e. where some points have text labels and others have image labels) by initializing an emblaze.CombinedThumbnails instance with a list of other Thumbnails objects to combine.

Interactive Analysis

Once you have loaded a Viewer instance in the notebook, you can read and write its properties to dynamically work with the visualization. The following properties are reactive:

embeddings (EmbeddingSet) Modify this to change the entire dataset that is displayed.
thumbnails (Thumbnails) Represents the image or text thumbnails displayed on hover and click.
currentFrame (int) The current frame or embedding space that is being viewed (from 0 to len(embeddings) - 1).
selectedIDs (List[int]) The selected ID numbers. Unless you provide custom IDs when constructing the EmbeddingSet, these are simply zero-indexed integers.
alignedIDs (List[int]) The IDs of the points to which the embedding spaces are aligned (same format as selectedIDs). Alignment is computed relative to the positions of the points in the current frame.
colorScheme (string) The name of a color scheme to use to render the points. A variety of color schemes are available, listed in src/colorschemes.ts. This property can also be changed in the Settings panel of the widget.
previewMode (string) The method to use to generate preview lines, which should be one of the values in `utils.
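
As a rough illustration of working with these properties from Python, the sketch below defines two small helpers. Both function names are hypothetical and not part of Emblaze. They only assign to the `selectedIDs` and `currentFrame` properties listed above, and they assume the default zero-indexed integer IDs.

```python
def select_by_label(viewer, labels, target):
    """Select every point whose label equals `target` and return the IDs."""
    # With default IDs, a point's ID is its position in the label array
    ids = [i for i, label in enumerate(labels) if label == target]
    viewer.selectedIDs = ids
    return ids


def next_frame(viewer, num_frames):
    """Advance currentFrame by one, wrapping to 0 after the last frame."""
    viewer.currentFrame = (viewer.currentFrame + 1) % num_frames
    return viewer.currentFrame
```

For instance, assuming `w` is the Viewer and `Y` the label array from the examples above, `select_by_label(w, Y, 3)` would select the points labeled 3, and `w.alignedIDs = w.selectedIDs` would then align the frames to that selection.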

Saving and Loading

You can save the data used to make comparisons to JSON, so that it is easy to load them again in Jupyter or the standalone application without re-running the embedding/projection code. Comparisons consist of an EmbeddingSet (containing the positions of the points in each 2D projection), a Thumbnails object (dictating how to display each point), and one or more NeighborSets (which contain the nearest-neighbor sets used for comparison and display).

To save a comparison, call the save_comparison() method on the Viewer. If you are using high-dimensional nearest neighbors (most use cases), this method by default saves both the high-dimensional coordinates and the nearest-neighbor IDs, which can produce files ranging from hundreds of megabytes to gigabytes. To store only the nearest-neighbor IDs, pass ancestor_data=False as a keyword argument. If you disable storing the high-dimensional coordinates, you will be unable to use tools that depend on distances in the high-dimensional space (such as the high-dimensional radius select).

To load a comparison, simply initialize the Viewer as follows:

w = emblaze.Viewer(file="/path/to/comparison.json")
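
A minimal sketch of saving without the high-dimensional coordinates is shown below. The helper name `save_compact` is hypothetical, and passing the output path as the first positional argument to save_comparison() is an assumption; only the ancestor_data keyword is described above.

```python
import os


def save_compact(viewer, path):
    """Save a comparison without high-D coordinates; return file size in bytes."""
    # Assumption: save_comparison() accepts the output path as its first argument
    viewer.save_comparison(path, ancestor_data=False)
    return os.path.getsize(path)
```

The returned size gives a quick way to compare against a save made with the default settings before deciding which file to keep.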

Development Installation

Clone repository, then install dependencies:

pip install -r requirements.txt

Install the python package. This will also build the JS packages.

pip install -e .

Run the following commands if you use Jupyter Lab:

jupyter labextension install @jupyter-widgets/jupyterlab-manager --no-build
jupyter labextension install .

Run the following commands if you use Jupyter Notebook:

jupyter nbextension install --sys-prefix --symlink --overwrite --py emblaze
jupyter nbextension enable --sys-prefix --py emblaze

Note that the --symlink flag doesn't work on Windows, so there you will have to run the install command every time you rebuild the extension. For certain installations you might also need a different flag instead of --sys-prefix, but we won't cover the meaning of those flags here.

How to see your changes

Open JupyterLab in watch mode with jupyter lab --watch. Then, in a separate terminal, watch the source directory for changes with npm run watch. After a change to the JavaScript code, wait for the build to finish, then refresh your browser. After a change to the Python code, restart the notebook kernel to see your changes take effect.

Standalone App Development

To develop using the standalone app, run npm run watch:standalone in a separate terminal from the Flask server to continuously build the frontend. You will need to reload the page to see your changes.

The standalone application serves datasets stored at the data path that is printed when the Flask server starts (should be something like .../lib/python3.9/site-packages/emblaze/data for the pip-installed version, or .../emblaze/emblaze/data for a local repository). You can add your own datasets by building an EmbeddingSet and (optionally) a Thumbnails object, then saving the results to files in the data directory:

import os, json

dataset_name = "my-dataset"
data_dir = ... # data directory printed by flask server

embeddings = ... # EmbeddingSet object
thumbnails = ... # (Text|Image)Thumbnails object

# Create the dataset directory (no error if it already exists)
os.makedirs(os.path.join(data_dir, dataset_name), exist_ok=True)
with open(os.path.join(data_dir, dataset_name, "data.json"), "w") as file:
    json.dump(embeddings.to_json(), file)
with open(os.path.join(data_dir, dataset_name, "thumbnails.json"), "w") as file:
    json.dump(thumbnails.to_json(), file)

Deployment

First clean all npm build intermediates:

npm run clean

Bump the widget version in emblaze/_version.py, emblaze/_frontend.py, and package.json if applicable. Then build the notebook widgets and standalone app:

npm run build:all

Run the packaging script to generate the wheel for distribution:

python -m build

Upload to PyPI (replace <VERSION> with the version number):

twine upload dist/emblaze-<VERSION>*

Development Notes

Svelte transitions don't seem to work well as they force an expensive re-layout operation. Avoid using them during interactions.

Comments

Feature parity for standalone demo application

Converts the old standalone application, which only supported a subset of Emblaze's functionality, to a socket-based Flask server that supports all of the widget's functionality on two example datasets.

opened by venkatesh-sivaraman 4
Handle multiple neighbor sets

Adds backend support for dealing with both high-dimensional and low-dimensional neighbors at different points in the visualization. For example, when visualizing the differences between multiple 2D projections of the same data, we may want the sidebar to display high-dimensional neighbors, but for the comparison functions (color stripes, suggested selections) to operate on the low-dimensional neighbors.

In this implementation, we define two additional classes, Neighbors and NeighborSet, which contain and handle serialization of nearest neighbors. We also add two methods on Embedding/EmbeddingSet, get_ancestor_neighbors() and get_recent_neighbors(). Ancestor neighbors are the neighbors that are farthest back in the parent tree of an embedding, i.e. the neighbors from the highest-dimensional embedding. Recent neighbors are the closest neighbor set in the parent tree of an embedding, i.e. the neighbors that should be compared.

Which neighbors are used as the ancestor and recent neighbors can be controlled by deciding where to call compute_neighbors(), which associates a Neighbors object with a particular embedding. To make a certain level the target of comparison, simply clear the parent embeddings' neighbors by calling clear_upstream_neighbors().
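
The lookup rules described above can be sketched with a toy parent chain. This is an illustrative model only, not Emblaze's implementation: `ToyEmbedding` is a made-up class, neighbor sets are represented by plain strings, and the method names merely mirror the ones mentioned in this description.

```python
class ToyEmbedding:
    """Toy stand-in for an embedding with an optional parent and neighbor set."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.neighbors = None  # populated by compute_neighbors()

    def compute_neighbors(self, neighbors):
        self.neighbors = neighbors

    def _chain(self):
        # Walk from this embedding back through its parents
        node = self
        while node is not None:
            yield node
            node = node.parent

    def get_recent_neighbors(self):
        # Closest neighbor set in the parent chain
        for node in self._chain():
            if node.neighbors is not None:
                return node.neighbors
        return None

    def get_ancestor_neighbors(self):
        # Neighbor set farthest back in the parent chain
        found = None
        for node in self._chain():
            if node.neighbors is not None:
                found = node.neighbors
        return found

    def clear_upstream_neighbors(self):
        # Drop neighbor sets on every parent, keeping this level's own
        node = self.parent
        while node is not None:
            node.neighbors = None
            node = node.parent


high_d = ToyEmbedding("high-D")
high_d.compute_neighbors("cosine neighbors (high-D)")
projection = ToyEmbedding("2D projection", parent=high_d)
projection.compute_neighbors("euclidean neighbors (2D)")
```

In this toy setup the projection's ancestor neighbors are the high-dimensional ones and its recent neighbors are the 2D ones. After `projection.clear_upstream_neighbors()`, both lookups return the 2D set.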

opened by venkatesh-sivaraman 1
Multimodal thumbnails and save/load improvements
Closes #8

Fixes error when creating CombinedThumbnails without descriptions

Adds save() and load() convenience methods to Thumbnails and EmbeddingSet
opened by venkatesh-sivaraman 1
Multimodal thumbnails
Adds support for creating Thumbnails objects with both images and text, where some thumbnails only have images and others only have text. This was already supported by the existing architecture of the frontend, so this PR simply adds support for defining such thumbnail sets in the backend. To use the feature, define some text and image thumbnails and specify what point IDs each thumbnail belongs to, then create a CombinedThumbnails object:

text_thumbnails = emblaze.TextThumbnails(names, ids=text_ids)
image_thumbnails = emblaze.ImageThumbnails(images, ids=image_ids)
combined = emblaze.CombinedThumbnails([text_thumbnails, image_thumbnails])
opened by venkatesh-sivaraman 1
More efficient data transfer

Adds a new compression method to reduce the amount of data that needs to pass from the kernel backend to the client. Instead of sending the entire dataset as a blob of uncompressed JSON, we now store each field as a base64-encoded typed array. In particular, this allows us to save a lot of space on the neighbors matrix, which is the main space occupant in the data structure. If the number of IDs is less than 65535, we can store all neighbors as Uint16 integers, which means each value only takes two bytes as opposed to four. This allows for ~60% reduction in data size when transferring to file or to the widget.
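
The idea can be sketched with the Python standard library. This is not the code from the PR: the function names and the payload layout (`typecode`, `width`, `values`) are assumptions made for illustration, and the bytes are written in the machine's native byte order.

```python
import base64
from array import array


def encode_neighbors(rows):
    """Pack rectangular, non-empty rows of neighbor IDs into a base64 payload."""
    flat = [value for row in rows for value in row]
    # Unsigned 16-bit integers suffice when every ID is below 65535
    typecode = "H" if max(flat, default=0) < 65535 else "I"
    packed = array(typecode, flat)
    return {
        "typecode": typecode,
        "width": len(rows[0]) if rows else 0,
        "values": base64.b64encode(packed.tobytes()).decode("ascii"),
    }


def decode_neighbors(payload):
    """Invert encode_neighbors(), returning a list of lists of ints."""
    unpacked = array(payload["typecode"])
    unpacked.frombytes(base64.b64decode(payload["values"]))
    width = payload["width"]
    if width == 0:
        return []
    return [list(unpacked[i:i + width]) for i in range(0, len(unpacked), width)]
```

Each 16-bit value costs two bytes before base64 encoding, so the encoded string is typically much shorter than the same matrix written out as JSON numbers.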

Also bumps version number to 0.9.3.

opened by venkatesh-sivaraman 1
Interaction logging

Logs timestamped basic interactions (selections, alignment, filtering, frame changes) to a JSON file if the loggingEnabled flag is set to True when initializing the widget.

opened by venkatesh-sivaraman 1
High-performance mode for >5k points

Includes graphical optimizations that are automatically performed when the number of points displayed is greater than 2,000. This can handle approximately 80k points smoothly on a current-generation MacBook Pro.

Also includes improvements to nearest-neighbor display in the sidebar, and updates to how frame colors are computed.

opened by venkatesh-sivaraman 1
Error handling

Raise errors when using an invalid number of embeddings, invalid embedding dimension, or invalid projection technique. Fixes a bug in which the sidebar pane control disabled selection.

opened by venkatesh-sivaraman 0
Standalone app improvements
Tutorial messages showing what each part of the interface means

Alert when trying to use continuous color schemes on categorical data

Prevent redundant update events, which caused Recent selections to be broken
opened by venkatesh-sivaraman 0
Precompute suggested selections

Ability to compute suggested selections for an entire dataset non-interactively, saving time when browsing the interface. This will also be used to supply pregenerated recommendations for the standalone demo application.

opened by venkatesh-sivaraman 0
Combined thumbnails crash when descriptions are not provided
The following thumbnail specification causes an error when loading the widget:

img_thumbnails = emblaze.ImageThumbnails(images, ids=image_ids)
text_thumbnails = emblaze.TextThumbnails(names, ids=text_ids)
thumbnails = emblaze.CombinedThumbnails([img_thumbnails, text_thumbnails])

The error seems to occur because text_thumbnails doesn't have a descriptions object when it is serialized to JSON. Current workaround is to add a descriptions parameter containing None for every id: emblaze.TextThumbnails(names, [None for _ in text_ids], ids=text_ids).
opened by venkatesh-sivaraman 0

Owner

CMU Data Interaction Group