Package to compute MAUVE, a similarity score between neural text and human text. Install with `pip install mauve-text`.

Overview

MAUVE

MAUVE is a library built on PyTorch and HuggingFace Transformers to measure the gap between neural text and human text with the eponymous MAUVE measure, introduced in this paper (NeurIPS 2021 Oral).

MAUVE summarizes both Type I and Type II errors measured softly using Kullback–Leibler (KL) divergences.
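
Concretely (summarizing the paper's definition; this notation is not part of the package API): for a scaling constant c > 0 and mixtures R_λ = λP + (1 − λ)Q, the divergence curve is

\[
\mathcal{C}(P, Q) = \left\{ \left( e^{-c\,\mathrm{KL}(Q \,\|\, R_\lambda)},\; e^{-c\,\mathrm{KL}(P \,\|\, R_\lambda)} \right) : \lambda \in (0, 1) \right\},
\]

and MAUVE(P, Q) is the area under this curve; the two coordinates softly measure the two error types.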

Documentation Link

Features:

  • MAUVE with quantization using k-means.
  • Adaptive selection of k-means hyperparameters.
  • Compute MAUVE using pre-computed GPT-2 features (i.e., terminal hidden state), or featurize raw text using HuggingFace transformers + PyTorch.

For scripts to reproduce the experiments in the paper, please see this repo.

Installation

For a direct install, run this command from your terminal:

pip install mauve-text

If you wish to edit or contribute to MAUVE, install from source:

git clone git@github.com:krishnap25/mauve.git
cd mauve
pip install -e .

Some functionality requires more packages. Please see the requirements below.

Requirements

The installation command above installs the main requirements, which are:

  • numpy>=1.18.1
  • scikit-learn>=0.22.1
  • faiss-cpu>=1.7.0
  • tqdm>=4.40.0

In addition, if you wish to use featurization within MAUVE (i.e., to pass raw text or tokens rather than pre-computed features), you need to manually install PyTorch and HuggingFace Transformers.
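
Both are available on PyPI, so a minimal install is the following (you may prefer a torch build matched to your CUDA version; see the PyTorch website for instructions):

pip install torch transformers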

Quick Start

Let p_text and q_text each be a list of strings, where each string is a complete generation (including context). As a best practice, MAUVE needs at least a few thousand generations each for p_text and q_text (the paper uses 5000 each). For this demo, we use 100 generations each for a fast running time.

To demonstrate this package on some real data, the repository provides scripts to download and use sample data in the ./examples folder (these are not part of the MAUVE package; you need to clone the repository to use them).

Let us download some Amazon product reviews as well as machine generations, provided by the GPT-2 output dataset repo, by running this command in our shell (the download is ~17 MB):

python examples/download_gpt2_dataset.py

The data is downloaded into the ./data folder. We can load the data (100 samples out of the available 5000) in Python as follows:

from examples import load_gpt2_dataset
p_text = load_gpt2_dataset('data/amazon.valid.jsonl', num_examples=100) # human
q_text = load_gpt2_dataset('data/amazon-xl-1542M.valid.jsonl', num_examples=100) # machine

We can now compute MAUVE as follows (note that this requires installation of PyTorch and HF Transformers).

import mauve 

# call mauve.compute_mauve using raw text on GPU 0; each generation is truncated to 256 tokens
out = mauve.compute_mauve(p_text=p_text, q_text=q_text, device_id=0, max_text_length=256, verbose=False)
print(out.mauve) # prints 0.9917

This first downloads the GPT-2 large tokenizer and pre-trained model (if you do not have them downloaded already). Even if the model is available locally, it can take up to 30 seconds to load the first time. out now contains the fields:

  • out.mauve: MAUVE score, a number between 0 and 1. Larger values indicate that P and Q are closer.
  • out.frontier_integral: Frontier Integral, a number between 0 and 1. Smaller values indicate that P and Q are closer.
  • out.divergence_curve: a numpy.ndarray of shape (m, 2); plot it with matplotlib to view the divergence curve
  • out.p_hist: a discrete distribution, which is a quantized version of the text distribution p_text
  • out.q_hist: same as above, but with q_text

You can plot the divergence curve using

# Make sure matplotlib is installed in your environment
import matplotlib.pyplot as plt
plt.plot(out.divergence_curve[:, 1], out.divergence_curve[:, 0])
plt.show()

Other Ways of Using MAUVE

For each text (in both p_text and q_text), MAUVE internally uses the terminal hidden state from GPT-2 large as a feature representation. This featurization process can be rather slow (~10 minutes for 5000 generations at a maximum length of 1024; the implementation could be made more efficient, see Contributing). Alternatively, this package allows you to use cached hidden states directly (this does not require PyTorch and HF Transformers to be installed):

# call mauve.compute_mauve using features obtained directly
# p_feats and q_feats are `np.ndarray`s of shape (n, dim)
# we use a synthetic example here
import numpy as np
import mauve

p_feats = np.random.randn(100, 1024)  # feature dimension = 1024
q_feats = np.random.randn(100, 1024)
out = mauve.compute_mauve(p_features=p_feats, q_features=q_feats)
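
If you wish to construct such cached features yourself, the following is a minimal sketch using plain HuggingFace Transformers. It mirrors the terminal-hidden-state featurization described above, but it is not the package's internal code, so treat it as an approximation:

import numpy as np
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = GPT2Model.from_pretrained('gpt2-large').eval()

@torch.no_grad()
def featurize(texts, max_len=256):
    feats = []
    for text in texts:
        ids = tokenizer(text, return_tensors='pt', truncation=True, max_length=max_len).input_ids
        hidden_states = model(ids, output_hidden_states=True).hidden_states
        feats.append(hidden_states[-1][0, -1].numpy())  # last layer, last token
    return np.stack(feats)

p_feats = featurize(p_text)  # shape (len(p_text), 1280) for gpt2-large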

You can also compute MAUVE using the tokenized (BPE) representation under the GPT-2 vocabulary (e.g., obtained from an explicit call to transformers.GPT2Tokenizer):

# call mauve.compute_mauve using tokens on GPU 1
# p_toks, q_toks are each a list of LongTensors of shape [1, length]
# we use synthetic examples here
import numpy as np
import torch
import mauve

p_toks = [torch.LongTensor(np.random.choice(50257, size=(1, 32), replace=True)) for _ in range(100)]
q_toks = [torch.LongTensor(np.random.choice(50257, size=(1, 32), replace=True)) for _ in range(100)]
out = mauve.compute_mauve(p_tokens=p_toks, q_tokens=q_toks, device_id=1, max_text_length=1024)

To view the progress messages, pass in the argument verbose=True to mauve.compute_mauve. You can also use different forms as inputs for p and q, e.g., p via p_text and q via q_features.
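
For instance, a mixed-input call might look like this (a minimal sketch reusing p_text from the demo and pre-computed q_feats as above):

# p supplied as raw text, q as pre-computed features
out = mauve.compute_mauve(p_text=p_text, q_features=q_feats, device_id=0, max_text_length=256)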

Available Options

mauve.compute_mauve takes the following arguments:

  • p_features: numpy.ndarray of shape (n, d), where n is the number of generations
  • q_features: numpy.ndarray of shape (n, d), where n is the number of generations
  • p_tokens: list of length n, each entry is torch.LongTensor of shape (1, length); length can vary between generations
  • q_tokens: list of length n, each entry is torch.LongTensor of shape (1, length); length can vary between generations
  • p_text: list of length n, each entry is a string
  • q_text: list of length n, each entry is a string
  • num_buckets: the size of the histogram to quantize P and Q. Options: 'auto' (default) or an integer
  • pca_max_data: the number of data points to use for PCA dimensionality reduction prior to clustering. If -1, use all the data. Default -1
  • kmeans_explained_var: amount of variance of the data to keep in dimensionality reduction by PCA. Default 0.9
  • kmeans_num_redo: number of times to redo k-means clustering (the best objective is kept). Default 5
  • kmeans_max_iter: maximum number of k-means iterations. Default 500
  • featurize_model_name: name of the model from which features are obtained. Default: 'gpt2-large'. Use one of ['gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'].
  • device_id: device for featurization. Supply a GPU id (e.g., 0 or 3) to use the GPU; if no GPU with this id is found, the CPU is used
  • max_text_length: maximum number of tokens to consider. Default 1024
  • divergence_curve_discretization_size: Number of points to consider on the divergence curve. Default 25
  • mauve_scaling_factor: "c" from the paper. Default 5.
  • verbose: If True (default), print running time updates
  • seed: random seed to initialize k-means cluster assignments.

Note: p and q can be of different lengths, but it is recommended that they are the same length.
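
For example, overriding several defaults at once (a sketch reusing p_feats and q_feats from above; every argument shown is one of the documented options):

out = mauve.compute_mauve(
    p_features=p_feats, q_features=q_feats,
    num_buckets=10,          # fixed histogram size instead of 'auto' (0.1 * 100 samples)
    mauve_scaling_factor=5,  # "c" from the paper
    seed=0,                  # fix the k-means initialization for reproducibility
    verbose=True,            # print running time updates
)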

Contact

The best way to contact the authors in case of any questions or clarifications (about the package or the paper) is by raising an issue on GitHub. We are not able to respond to queries over email.

Contributing

If you find any bugs, please raise an issue on GitHub. If you would like to contribute, please submit a pull request. We encourage and highly value community contributions.

Some features which would be good to have are:

  • batched implementation of featurization (the current implementation featurizes generations sequentially); this requires appropriate padding/masking
  • featurization in HuggingFace Transformers with a TensorFlow backend.

Best Practices for MAUVE

MAUVE is quite different from most metrics in common use, so here are a few guidelines on proper usage of MAUVE:

  1. Relative comparisons:
    • We find that MAUVE is best suited for relative comparisons while the absolute MAUVE score is less meaningful.
    • For instance, if we wish to find which of model1 and model2 better matches the human distribution, we can compare MAUVE(text_model1, text_human) and MAUVE(text_model2, text_human).
    • The absolute number MAUVE(text_model1, text_human) can vary based on the hyperparameters selected below, but the relative trends remain the same.
    • One must ensure that the hyperparameters are exactly the same for the MAUVE scores under comparison.
    • Some hyperparameters are described below.
  2. Number of generations:
    • MAUVE computes the similarity between two distributions.
    • Therefore, each distribution must contain at least a few thousand samples (we use 5000 each). MAUVE with a smaller number of samples is biased towards optimism (that is, MAUVE typically goes down as the number of samples increases) and exhibits a larger standard deviation between runs.
  3. Number of clusters (discretization size):
    • We take num_buckets to be 0.1 * the number of samples.
    • The performance of MAUVE is quite robust to this, provided the number of generations is not too small.
  4. MAUVE is too large or too small:
    • The parameter mauve_scaling_factor controls the absolute value of the MAUVE score without changing the relative ordering between various methods. The main purpose of this parameter is to help with interpretability.
    • If you find that all your methods get a very high MAUVE score (e.g., 0.995, 0.994), try increasing the value of mauve_scaling_factor. (note: this also increases the per-run standard deviation of MAUVE).
    • If you find that all your methods get a very low MAUVE score (e.g. < 0.4), then try decreasing the value of mauve_scaling_factor.
  5. MAUVE takes too long to run:
    • Try reducing the number of clusters using the argument num_buckets. The clustering algorithm's run time scales as the square of the number of clusters, and clustering really starts to slow down once the number of clusters exceeds 500. In that case, it can help to cap the number of clusters at 500 by overriding the default (which is num_data_points / 10, so this applies when each of p and q has more than 5000 samples).
    • Alternatively, relax the clustering hyperparameters: set kmeans_num_redo to 1 and, if that does not suffice, kmeans_max_iter to 100. This makes the clustering run faster at the cost of a worse clustering; a sketch combining these options appears after this list.
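
The following sketch combines the tuning knobs from items 4 and 5 (all argument names are documented options above; the values are the suggestions from this list):

out = mauve.compute_mauve(
    p_text=p_text, q_text=q_text, device_id=0,
    num_buckets=500,         # cap the number of clusters (default: num_data_points / 10)
    kmeans_num_redo=1,       # default 5; fewer k-means restarts run faster but cluster worse
    kmeans_max_iter=100,     # default 500; reduce only if kmeans_num_redo=1 is not enough
    mauve_scaling_factor=5,  # "c" from the paper; raise it if all scores crowd near 1
)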

Citation

If you find this package useful, or you use it in your research, please cite:

@inproceedings{pillutla-etal:mauve:neurips2021,
  title={MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers},
  author={Pillutla, Krishna and Swayamdipta, Swabha and Zellers, Rowan and Thickstun, John and Welleck, Sean and Choi, Yejin and Harchaoui, Zaid},
  booktitle = {NeurIPS},
  year      = {2021}
}

Further, the Frontier Integral was introduced in this paper:

@inproceedings{liu-etal:divergence:neurips2021,
  title={{Divergence Frontiers for Generative Models: Sample Complexity, Quantization Effects, and Frontier Integrals}},
  author={Liu, Lang and Pillutla, Krishna and  Welleck, Sean and Oh, Sewoong and Choi, Yejin and Harchaoui, Zaid},
  booktitle = {NeurIPS},
  year      = {2021}
}

Acknowledgements

This work was supported by NSF DMS-2134012, NSF CCF-2019844, NSF DMS-2023166, the DARPA MCS program through NIWC Pacific (N66001-19-2-4031), the CIFAR "Learning in Machines & Brains" program, a Qualcomm Innovation Fellowship, and faculty research awards.

Comments
  • Add batched implementation of mauve

    Here, I have written a simplified version of a batched MAUVE implementation; there are still some places where the code could be cleaned up. If you think this implementation is correct, it would be nice to replace the original MAUVE implementation with the batched version. I tested it with the sample data and it gives a MAUVE score of 0.991679398536574.

    opened by wade3han 9
  • How can I accelerate the featurizing of MAUVE score

    I try to directly use the MAUVE API, such as:

    out = mauve.compute_mauve(p_text=p_text, q_text=q_text, device_id=0, max_text_length=256, verbose=False).

    I find that the GPU memory usage is quite low. How could I accelerate the calculation (featurization of sequences) for the MAUVE score?

    I note there is a hyperparameter batch_size. Could I tune this hyperparameter? Do different values cause different results? What is the default value of batch_size?

    opened by Jxu-Thu 4
  • Maybe incorrect calculation in one small line

    Hi, First, thanks for the awesome paper.

    https://github.com/krishnap25/mauve/blob/20613eecd7b084281ed4df9bfeee67d66cbfe5ee/src/mauve/compute_mauve.py#L249-L263

    I think there is an issue at line 258, with where the closing parenthesis is placed. Maybe it's supposed to be like this:

    elif abs(p1 - q1) > 1e-8:
    

    Tried to find this in the paper and in the frontier integral, but nothing jumped out at me.

    Thanks

    opened by stav95 3
  • Comparing learned representations

    I was wondering if mauve can be applied in a slightly different setting. So if I understand correctly -- Given two text distributions t1 and t2, mauve first computes fixed dimensional representations using a pretrained model like GPT-2. Let's say the features are: t1-(100, 1024), t2- (150, 1024). Then a common dataset d1 (250, 1024) is created and clustered using K-means. The cluster assignments are then used to compute normalized histograms which are then used to plot the divergence curve and then quantify the gap using AUC.

    So my question is: can we use mauve to compare the representations and quantify the information gap between the representations? Here we do not have two datasets from different data distributions, but two representations of the same data distribution -- e.g., 8 and 16 dimensions. I am guessing we cannot use mauve directly here because we cannot combine them to form d1 (as the dimensions are different) and then cluster them, right? Do you have any suggestions on how it can be done by modifying mauve? Thanks, Kalyani

    opened by kalyani7195 2
  • Question about the Spearman rank correlation

    Thanks for the great paper! However, I am a little confused about the Spearman rank correlation in § 4.3 and Appendix E. When calculating the Spearman rank correlation between MAUVE and human judgments, do you mean to use the eight values from MAUVE and the other eight values from the Bradley-Terry model?

    opened by bigbrother001 2
  • WARNING clustering 13104 points to 655 centroids: please provide at least 25545 training points

    I received this warning when running MAUVE with basically default settings:

    mauve_out = mauve.compute_mauve(p_text=p_text, q_text=q_text, verbose=False, device_id=0)
    WARNING clustering 13104 points to 655 centroids: please provide at least 25545 training points

    I've provided ~6,500 generations each for p_text and q_text, which is more than the 5,000 used in the paper. Is this warning safe to ignore? The generations are fairly short compared to those in the paper (~20 BPE tokens).

    opened by ivnle 1
  • Support BERT-models

    Context here: https://github.com/krishnap25/mauve/issues/3 I have run the code in the way suggested in the readme, but with roberta-large:

    import mauve 
    
    # call mauve.compute_mauve using raw text on GPU 0; each generation is truncated to 256 tokens
    out = mauve.compute_mauve(p_text=p_text, q_text=q_text, device_id=0, max_text_length=256, verbose=False, featurize_model_name="roberta_large")
    print(out.mauve)
    

    I ran it once on 2-3 samples, and it did not crash, so, great success?

    opened by samhedin 1
  • Warnings about the number of centroids used for clustering

    Dear authors, thanks for releasing the MAUVE metric in such an easy-to-use repository!

    I had a quick question about a warning I get each time I run MAUVE using the command provided in the README,

    mauve_out = mauve.compute_mauve(p_text=gen_seqs, q_text=gold_seqs, device_id=0, max_text_length=768, verbose=False)
    

    This nearly always gives me a warning of the form,

    WARNING clustering 9286 points to 464 centroids: please provide at least 18096 training points
    

    Is this something I need to worry about? Should I set num_buckets manually?

    Thanks again for your excellent project!

    opened by martiansideofthemoon 1
  • Allowing other models for extracting features

    Hello!

    First off, thanks for sharing the code. In the paper, it says that MAUVE works with other embedding models. Therefore, I wanted to try out models such as DialoGPT from Microsoft. But in the code, the model and tokenizer names are limited to the "gpt2" family. I think it would be better to remove this restriction, since others might also want to try out other models.

    If you want, I can make a PR to change this.

    https://github.com/krishnap25/mauve/blob/b3c01d5b0f3be85a997b1171b3f2efa3ba16280b/src/mauve/utils.py#L25-L39

    opened by jinyongyoo 1
  • MAUVE can vary greatly when computed with different K-Means random seeds

    While using MAUVE in a real use case, I decided to compute MAUVE multiple times per comparison with different K-Means random seeds.

    I've noticed that the value of the MAUVE metric varies a lot across these K-Means seeds.

    In particular, MAUVE varies about as much across K-Means seeds as it does across GPT sampling seeds. Typical values for std. dev. across 5 seeds are ~0.005 to ~0.01, for either type of seed (while holding the other constant).

    This is also comparable in size to the MAUVE differences reported in some model/sampler comparisons, e.g. between nucleus GPT-2-large and nucleus GPT-2-xl in Table 6 of the original paper.

    Do you have recommendations about what to do about this variability?

    • Am I doing something wrong?
    • Is this less of an issue with the DRMM or SPV algorithms? I haven't tried them.
    • I have an (untested) hypothesis that MAUVE would be less variable if fewer clusters were used.
      • The rule k = n/10 gives us an average of 10 members per cluster for each of p and q, with many clusters having fewer than this. The small counts mean there is high uncertainty in the individual terms of the KL-divergence sum.
      • By the same token, we are averaging over a large number of bins, and one might hope the errors would wash out in the average, but perhaps they don't as much as we would hope.
      • Using fewer clusters would tend to push MAUVE estimates close to one another (Fig. 8 in the original paper), but maybe we could compensate for this by using a higher scaling constant (Fig. 5). What do you think about this idea?

    Colab notebook with an example: https://colab.research.google.com/drive/1wh38JRSr5vkOqlWUxNkP4tUgkFwZwAD0?usp=sharing

    opened by nostalgebraist 3
  • S-mauve

    Hi, I made this modification to MAUVE during my MSc thesis, please see the updated README for details. If you find S-MAUVE interesting, feel free to include it in your code. If not, I can keep it as a separate repository. The thesis isn't published yet, but I can provide it at your request if you want more information.

    opened by samhedin 0
  • Questions regarding hidden states.

    Hi, and thank you for the great paper. I have questions regarding https://github.com/krishnap25/mauve/blob/20613eecd7b084281ed4df9bfeee67d66cbfe5ee/src/mauve/utils.py#L120-L121 The activations in the final hidden layer are taken: outs.hidden_states[-1], right? Is my understanding of why sent_length is needed correct?

    • Looking at hidden_state[0] is looking at the embedding of the first word in the sentence
    • When sent_length < len(hidden_state), hidden_state[-1] is padding
    • Therefore, hidden_state[sent_length - 1] is the embedding of the entire sentence.

    Second, is there a particular reason you chose to only look at the embeddings in the final hidden layer? Did you consider taking an average of the embeddings in all hidden layers?

    opened by samhedin 10