NLG evaluation via Statistical Measures of Similarity: BaryScore, DepthScore, InfoLM

Automatic evaluation metrics described in the papers BaryScore (EMNLP 2021), DepthScore (Submitted), and InfoLM (AAAI 2022).

Authors:

Goal:

This repository deals with the automatic evaluation of NLG and addresses the special case of reference-based evaluation. The goal is to build a metric m : S × S → R, where S is the space of sentences. An example is given below:

Metric examples: similar sentences should have a high score, dissimilar sentences should have a low score according to m.

Overview

We start by giving an overview of the proposed metrics.

DepthScore (Submitted)

DepthScore is a single-layer metric based on pretrained contextualized representations. Similar to BERTScore, it embeds both the candidate (C: It is freezing this morning) and the reference (R: The weather is cold today) using a single layer of BERT to obtain two discrete probability measures, one for the candidate and one for the reference. Then, a similarity score is computed using the pseudo-metric based on depth-trimmed regions introduced in the paper cited in the References.

Depth Score

This statistical measure has been tested on Data2text and Summarization.
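
For illustration, a minimal usage sketch, assuming the DepthScore implementation in depth_score.py exposes a DepthScoreMetric class with the same prepare_idfs / evaluate_batch interface as the BaryScore example shown in the Usage section below (module and class names are assumptions to be checked against the repo):

from depth_score import DepthScoreMetric  # assumed module/class names; adjust to the actual file layout

metric_call = DepthScoreMetric()

# Candidate / reference pair from the example above
ref = ['The weather is cold today']
hypothesis = ['It is freezing this morning']

metric_call.prepare_idfs(ref, hypothesis)
final_preds = metric_call.evaluate_batch(ref, hypothesis)
print(final_preds)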

BaryScore (EMNLP 2021)

BaryScore is a multi-layer metric based on pretrained contextualized representations. Similar to MoverScore, it aggregates the layers of BERT before computing a similarity score. By modelling the layer outputs of deep contextualized embeddings as probability distributions rather than as vector embeddings, BaryScore (left) aggregates the different outputs through the Wasserstein space topology. MoverScore (right) instead leverages the information available in other layers by aggregating them with a power mean and then uses a Wasserstein distance; a small numerical sketch of this quantity is given at the end of this subsection.

BaryScore (left) vs MoverScore (right)

This statistical measure has been tested on Data2text, Summarization, Image captioning and NMT.
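
To make the core quantity concrete, here is a small self-contained sketch (not the repository's implementation) that computes the Wasserstein distance between two discrete measures over token embeddings using the POT library (the ot package the code depends on); the embeddings here are random stand-ins for BERT layer outputs:

import numpy as np
import ot  # POT: Python Optimal Transport (pip install POT)

# Random stand-ins for contextual embeddings of candidate and reference tokens (n_tokens x dim)
cand_emb = np.random.rand(6, 768)
ref_emb = np.random.rand(8, 768)

# Uniform weights turn the two sets of embeddings into discrete probability measures
a = np.full(len(cand_emb), 1.0 / len(cand_emb))
b = np.full(len(ref_emb), 1.0 / len(ref_emb))

# Pairwise cost matrix and exact optimal-transport cost between the two measures
M = ot.dist(cand_emb, ref_emb, metric='sqeuclidean')
print(ot.emd2(a, b, M))

BaryScore additionally aggregates the measures obtained from several BERT layers through a Wasserstein barycenter before this comparison; see the EMNLP 2021 paper for the exact construction.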

InfoLM (AAAI 2022)

InfoLM is a metric based on a pretrained language model (PLM). Given an input sentence S masked at position i, the PLM outputs a discrete probability distribution over the vocabulary. The second key ingredient of InfoLM is a measure of information that computes the similarity between the aggregated distributions. Formally, InfoLM involves 3 steps:

  • 1. Compute the individual token-level distributions with the PLM for the candidate C and the reference R.
  • 2. Aggregate the individual distributions using a weighted sum.
  • 3. Compute the similarity between the aggregated distributions using the chosen measure of information.
InfoLM

InfoLM is flexible as it can adapt to different criteria by using different measures of information. This metric has been tested on Data2text and Summarization.
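
As an illustration, a minimal usage sketch mirroring the interface that appears in the Usage section and in the comments below (the hyperparameter values are examples, not recommendations, and the import path may differ depending on install):

from infolm import InfoLM  # adjust the import path to where the repo is installed

metric = InfoLM(measure_to_use='fisher_rao', alpha=0.5, beta=2, temperature=1.5)

ref = ['The weather is cold today']
hypothesis = ['It is freezing this morning']

metric.prepare_idfs(ref, hypothesis)
final_preds = metric.evaluate_batch(hypothesis, ref)
print(final_preds)  # e.g. {'fisher_rao': [...], 'r_fisher_rao': [...], 'sim_fisher_rao': [...]}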

References

If you find this repo useful, please cite our papers:

@article{infolm_aaai2022,
  title={InfoLM: A New Metric to Evaluate Summarization \& Data2Text Generation},
  author={Colombo, Pierre and Clavel, Chloe and Piantanida, Pablo},
  journal={arXiv preprint arXiv:2112.01589},
  year={2021}
}
@inproceedings{colombo-etal-2021-automatic,
  title = "Automatic Text Evaluation through the Lens of {W}asserstein Barycenters",
  author = "Colombo, Pierre and Staerman, Guillaume and Clavel, Chlo{\'e} and Piantanida, Pablo",
  booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
  year = "2021",
  pages = "10450--10466"
}
@article{depth_score,
  title={A pseudo-metric between probability distributions based on depth-trimmed regions},
  author={Staerman, Guillaume and Mozharovskyi, Pavlo and Colombo, Pierre and Cl{\'e}men{\c{c}}on, St{\'e}phan and d'Alch{\'e}-Buc, Florence},
  journal={arXiv preprint arXiv:2103.12711},
  year={2021}
}

Usage

Python Function

Running our metrics can be computationally intensive (because they rely on pretrained models), so a GPU is usually necessary. If you don't have access to a GPU, you can use lighter pretrained representations such as TinyBERT or DistilBERT.

We provide example inputs under <metric_name>.py. For example, for BaryScore:

from bary_score import BaryScoreMetric  # adjust the import path to where the repo is installed

metric_call = BaryScoreMetric()

ref = ['I like my cakes very much',
       'I hate these cakes!']
hypothesis = ['I like my cakes very much',
              'I like my cakes very much']

# Pre-compute idf weights over both corpora, then score each (reference, hypothesis) pair
metric_call.prepare_idfs(ref, hypothesis)
final_preds = metric_call.evaluate_batch(ref, hypothesis)
print(final_preds)
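
If you want to trade some accuracy for speed, as suggested above, the pretrained backbone can in principle be swapped for a lighter one. The sketch below assumes a model_name constructor argument, which is a hypothetical parameter name; check the actual signature of BaryScoreMetric in bary_score.py before relying on it:

# 'model_name' is a hypothetical argument name; verify it against the BaryScoreMetric signature
metric_call = BaryScoreMetric(model_name='distilbert-base-uncased')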

Command Line Interface (CLI)

We provide a command line interface (CLI) for our metrics as well as a python module. The CLI can be used as follows:

export metric=infolm
export measure_to_use=fisher_rao
CUDA_VISIBLE_DEVICES=0 python score_cli.py --ref="samples/refs.txt" --cand="samples/hyps.txt" --metric_name=${metric} --measure_to_use=${measure_to_use}

See more options with python score_cli.py -h.

Practical Tips

  • Unlike BERT, RoBERTa uses a GPT-2-style tokenizer which creates additional " " tokens when multiple spaces appear together. It is recommended to remove the extra spaces with sent = re.sub(r' +', ' ', sent) or sent = re.sub(r'\s+', ' ', sent); see the helper sketch after this list.
  • Using inverse document frequency (idf) over the reference sentences to weigh word importance may correlate better with human judgment. However, when the set of reference sentences becomes too small, the idf scores become inaccurate or invalid. To use idf, pass --idf to the CLI tool.
  • When you are low on GPU memory, consider setting batch_size to a lower value.
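
A minimal helper for the whitespace tip above, using only the Python standard library (not specific to this repository):

import re

def normalize_spaces(sent: str) -> str:
    # Collapse runs of whitespace into single spaces, as recommended for RoBERTa-style tokenizers
    return re.sub(r'\s+', ' ', sent).strip()

ref = [normalize_spaces(s) for s in ref]
hypothesis = [normalize_spaces(s) for s in hypothesis]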

Practical Limitation

  • Because the pretrained representations use learned positional embeddings with a maximum length of 512, our scores are undefined for sentences longer than 510 tokens (512 after adding the [CLS] and [SEP] tokens); longer sentences will be truncated. Please consider using models that support longer inputs if this is a concern; a length-check sketch is given below.
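
To avoid silent truncation, inputs can be checked against the model's limit beforehand. A sketch using the Hugging Face tokenizers, assuming a BERT backbone:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def fits_in_model(sent: str, max_length: int = 512) -> bool:
    # Token count includes the [CLS] and [SEP] special tokens added by the tokenizer
    return len(tokenizer(sent)['input_ids']) <= max_length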

Acknowledgements

Our research was granted access to the HPC resources of IDRIS under the allocation 2021-AP010611665 as well as under the project 2021-101838 made by GENCI.

Comments
  • WMT15 BaryScore

    I also have a problem with reproducing your results. If I run your command_line.sh on WMT15 de-en I get a Pearson correlation of -0.3559000480869218, but in your paper you reported 75.9. How did you run these experiments?

    opened by jbgruenwald 10
  • Reproduction on WMT15/16

    Hi,

    thanks for your great work! I'm interested in BaryScore and want to reproduce your results. May I know which human judgments you correlate BaryScore with on WMT15? As far as I know, WMT15 officially calculates Kendall's tau with ranking-based human judgments, but you reported three types of correlations, so I guess there should be some human ratings like DA scores. Unfortunately, I was not able to find the DA scores. Could you help me with this? Also, the link to the WMT16 DA judgments on its official website is currently invalid; do you still have this data?

    Besides, I'm not sure if I did something wrong, but BaryScore runs really slowly on my computer (over 200s for one language pair in WMT15, versus about 8s for BERTScore). It seems the code already supports GPU computation?

    Thanks a lot!

    opened by cyr19 9
  • Make a package installable + add CI pipelines

    What does this PR do?

    Fixes #6

    This PR updates the structure of this repo and adds some necessary components to make this package easily installable from the source. Furthermore, it defines some CI jobs checking the package and also handling publishing the package distribution to PyPI once a package version is released.


    @PierreColombo - Could I kindly ask you to check the author information? I'm not really sure if I set these properly. Also, could you please confirm for me that the dependency versions (in requirements.txt) make sense w.r.t. your setup?

    @Borda - Could I kindly ask you to have a quick look at the changes I made so that we can be sure everything is alright and the package will soon be ready as a reference for torchmetrics testing? :] (Not sure if there's anything missing)

    opened by stancld 8
  • Questions about reproducing scores

    Hello,

    I am having trouble reproducing the scores from the toy example Jupyter Notebook. I want to make sure I have faithfully reproduced the results from the metrics when I incorporate them into Repro.

    Here are some issues I'm running into:

    • Could you provide a requirements.txt with the package versions you are using (e.g., what transformers/pytorch versions)?
    • The Jupyter Notebook passes in the references first (metric_call.evaluate_batch(refs, hypothesis)) but the evaluate_batch methods seem to expect the arguments in the opposite order
    • When I run InfoLM on the GPU, there is a tensor which is still on the CPU. It occurs here:
    File "/app/nlg_eval_via_simi_measures/infolm.py", line 280, in evaluate_batch
        dict_final_distribution_batch_refs, idfs_ref = self.get_distribution(batch_refs, idf_ref)
    File "/app/nlg_eval_via_simi_measures/infolm.py", line 245, in get_distribution
        outputs = self.model(**unmasked_data, labels=labels)
    ...
    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper__index_select)
    
    • It would also be very helpful to have the expected output for the data in the "samples" directory

    Thanks!

    opened by danieldeutsch 7
  • Question about the robustness of the metric, seems easy to be affected by the candidate sentences

    We found that the word frequency of candidate sentences might affect the score. We used the following code to test the performance of InfoLM, but got different results for the same pair of candidate and reference sentences in different subsets.

    Firstly, we made the candidate and reference sentences exactly the same; as expected, the distances between sentences are 0.

    metric = InfoLM(measure_to_use='fisher_rao', alpha=0.5, beta=2, temperature=1.5, )
    ref = [
           "germany produces what kind of beer?", ]
    hypothesis = [
                  "germany produces what kind of beer?", ]
    metric.prepare_idfs(ref, hypothesis)
    final_preds = metric.evaluate_batch(hypothesis, ref)
    print(final_preds)
    

    result: {'fisher_rao': [0.0], 'r_fisher_rao': [0.0], 'sim_fisher_rao': [0.0]}

    However, when we make the candidate and reference sets different but include one identical sentence in each, the distance between these two sentences is not 0:

    metric = InfoLM(measure_to_use='fisher_rao', alpha=0.5, beta=2, temperature=1.5, )
    ref = [
           "name all of the rivers located in china.",
           "germany produces what kind of beer?",
           "lisp was designed by who?",
           "which of the bassists has a height of at least 1.94?",
           "name the aircraft that has the least amount of cargo space.",
           "what building has the earliest construction starting date?",
           "what lighthouses were constructed with sandstones?",
           "germany produces what kind of beer?",
           "belgium produces what kind of beer?",
           "belgium produces what kind of wine?",
           "belgium produces what type of wine?",
           "belgium produces which type of wine?",
           "belgium has which type of wine?",
           ]
    hypothesis = [
                  "name all of the rivers located in korea.",
                  "france produces what kind of beer?",
                  "ocaml was designed by who?",
                  "which of the bassists has a height of at most 1.94?",
                  "name the aircraft that has the most amount of cargo space.",
                  "what building has the oldest construction starting date?",
                  "what lighthouses were constructed with cements?",
                  "germany produces what kind of beer?",
                  "germany produces what kind of beer?",
                  "germany produces what kind of beer?",
                  "germany produces what kind of beer?",
                  "germany produces what kind of beer?",
                  "germany produces what kind of beer?"
    ]
    metric.prepare_idfs(ref, hypothesis)
    final_preds = metric.evaluate_batch(hypothesis, ref)
    print(final_preds)
    

    result: {'fisher_rao': [0.13154096901416779, 0.3968910872936249, 0.8951906561851501, 0.17877444624900818, 0.3418828845024109, 0.28145506978034973, 0.42686304450035095, 0.23330892622470856, 0.4328637421131134, 0.5381670594215393, 0.6704646348953247, 0.8761745095252991, 0.7861247658729553], 'r_fisher_rao': [0.13154096901416779, 0.3968910872936249, 0.8951906561851501, 0.17877444624900818, 0.3418828845024109, 0.28145506978034973, 0.42686304450035095, 0.23330892622470856, 0.4328637421131134, 0.5381670594215393, 0.6704646348953247, 0.8761745095252991, 0.7861247658729553], ' sim_fisher_rao': [0.13154096901416779, 0.3968910872936249, 0.8951906561851501, 0.17877444624900818, 0.3418828845024109, 0.28145506978034973, 0.42686304450035095, 0.23330892622470856, 0.4328637421131134, 0.5381670594215393, 0.6704646348953247, 0.8761745095252991, 0.7861247658729553]}

    Here we found that the identical sentence, "germany produces what kind of beer?", appearing in both the reference and candidate sets does not lead to a 0 distance; its distance is 0.23330892622470856 according to the code you provided.

    After reading the paper carefully, we found that the word frequency in the candidate sentences might affect the InfoLM score. On page 3 of your paper, you wrote "where $\widetilde{γ_k}$ and $γ_k$ are measures of the importance of the k-th token in the candidate and reference text, respectively". From our understanding, since the set of candidate sentences might differ between runs, and the IDF of words is computed over the candidate sentences, the distance obtained for a given (fixed) pair of (candidate sentence, reference sentence) could be different at each run. Could you confirm our understanding here: is this a feature of your metric's design?

    Thank you for spending some time reading our mail. I’d appreciate it if you could help me with my questions.

    opened by guihuzhang 1
  • Make a package release

    Hi @PierreColombo,

    First of all, I'd like to thank you and your team for the great work in developing NLG metrics. I'm just curious whether you'd be interested in a contribution creating a setup.py file and preparing other related pieces so that the repo can be installed from source with pip and potentially published via PyPI. I believe this could help further adoption of your evaluation metrics and make them easier to use in various use cases. Also, we would like to implement these metrics within the torchmetrics package (see the InfoLM, BaryScore and DepthScore issues); therefore, the possibility of installing the package from PyPI would make it easier to test our implementation against yours so that we can be sure ours is correct.

    Thanks once again, and please let me know what you think about my proposal :]

    cc: @Borda

    opened by stancld 1
  • import ot

    Thanks a lot for publishing your code :)

    I have a problem with bary_score.py. There is an import ot, which I couldn't install with pip. Does it refer to some file that I am missing?

    opened by jbgruenwald 1
  • Bug in `InfoLM` metric

    Hello,

    when you call

    info_dic = self.compute_infolm(sum_distribution_hypothesis, sum_distribution_refs)
    

    in the evaluate_batch method, you pass the arguments in the incorrect order, as the signature of compute_infolm is

    def compute_infolm(self, ref_distribution, hyp_distribution):
    
    opened by stancld 0
Owner: Pierre Colombo