NLG evaluation via Statistical Measures of Similarity: BaryScore, DepthScore, InfoLM

Automatic evaluation metrics described in the papers BaryScore (EMNLP 2021), DepthScore (Submitted), and InfoLM (AAAI 2022).

Authors:

Goal:

This repository deals with the automatic evaluation of NLG and addresses the special case of reference-based evaluation. The goal is to build a metric m : S × S → R, where S is the space of sentences. An example is given below:

Metric examples: similar sentences should have a high score, dissimilar sentences should have a low score according to m.

Overview

We start by giving an overview of the proposed metrics.

DepthScore (Submitted)

DepthScore is a single-layer metric based on pretrained contextualized representations. Similar to BERTScore, it embeds both the candidate (C: It is freezing this morning) and the reference (R: The weather is cold today) using a single layer of BERT to obtain two discrete probability measures, one for the candidate and one for the reference. Then, a similarity score is computed using the pseudo-metric based on depth-trimmed regions introduced in the paper cited in the References.

Depth Score

This statistical measure has been tested on Data2text and Summarization.
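
For illustration, a minimal usage sketch, assuming the DepthScore implementation in depth_score.py exposes a DepthScoreMetric class with the same prepare_idfs / evaluate_batch interface as the BaryScore example shown in the Usage section below (module and class names are assumptions to be checked against the repo):

from depth_score import DepthScoreMetric  # assumed module/class names; adjust to the actual file layout

metric_call = DepthScoreMetric()

# Candidate / reference pair from the example above
ref = ['The weather is cold today']
hypothesis = ['It is freezing this morning']

metric_call.prepare_idfs(ref, hypothesis)
final_preds = metric_call.evaluate_batch(ref, hypothesis)
print(final_preds)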

BaryScore (EMNLP 2021)

BaryScore is a multi-layer metric based on pretrained contextualized representations. Similar to MoverScore, it aggregates the layers of BERT before computing a similarity score. By modelling the layer outputs of deep contextualized embeddings as probability distributions rather than as vector embeddings, BaryScore (left) aggregates the different outputs through the Wasserstein space topology. MoverScore (right) instead leverages the information available in other layers by aggregating them with a power mean and then uses a Wasserstein distance; a small numerical sketch of this quantity is given at the end of this subsection.

BaryScore (left) vs MoverScore (right)

This statistical measure has been tested on Data2text, Summarization, Image captioning and NMT.
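
To make the core quantity concrete, here is a small self-contained sketch (not the repository's implementation) that computes the Wasserstein distance between two discrete measures over token embeddings using the POT library (the ot package the code depends on); the embeddings here are random stand-ins for BERT layer outputs:

import numpy as np
import ot  # POT: Python Optimal Transport (pip install POT)

# Random stand-ins for contextual embeddings of candidate and reference tokens (n_tokens x dim)
cand_emb = np.random.rand(6, 768)
ref_emb = np.random.rand(8, 768)

# Uniform weights turn the two sets of embeddings into discrete probability measures
a = np.full(len(cand_emb), 1.0 / len(cand_emb))
b = np.full(len(ref_emb), 1.0 / len(ref_emb))

# Pairwise cost matrix and exact optimal-transport cost between the two measures
M = ot.dist(cand_emb, ref_emb, metric='sqeuclidean')
print(ot.emd2(a, b, M))

BaryScore additionally aggregates the measures obtained from several BERT layers through a Wasserstein barycenter before this comparison; see the EMNLP 2021 paper for the exact construction.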

InfoLM (AAAI 2022)

InfoLM is a metric based on a pretrained language model (PLM). Given an input sentence S masked at position i, the PLM outputs a discrete probability distribution over the vocabulary. The second key ingredient of InfoLM is a measure of information that computes the similarity between the aggregated distributions. Formally, InfoLM involves 3 steps:

  • 1. Compute the individual token-level distributions with the PLM for the candidate C and the reference R.
  • 2. Aggregate the individual distributions using a weighted sum.
  • 3. Compute the similarity between the aggregated distributions using the chosen measure of information.
InfoLM

InfoLM is flexible as it can adapt to different criteria by using different measures of information. This metric has been tested on Data2text and Summarization.
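
As an illustration, a minimal usage sketch mirroring the interface that appears in the Usage section and in the comments below (the hyperparameter values are examples, not recommendations, and the import path may differ depending on install):

from infolm import InfoLM  # adjust the import path to where the repo is installed

metric = InfoLM(measure_to_use='fisher_rao', alpha=0.5, beta=2, temperature=1.5)

ref = ['The weather is cold today']
hypothesis = ['It is freezing this morning']

metric.prepare_idfs(ref, hypothesis)
final_preds = metric.evaluate_batch(hypothesis, ref)
print(final_preds)  # e.g. {'fisher_rao': [...], 'r_fisher_rao': [...], 'sim_fisher_rao': [...]}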

References

If you find this repo useful, please cite our papers:

@article{infolm_aaai2022,
  title={InfoLM: A New Metric to Evaluate Summarization \& Data2Text Generation},
  author={Colombo, Pierre and Clavel, Chloe and Piantanida, Pablo},
  journal={arXiv preprint arXiv:2112.01589},
  year={2021}
}
@inproceedings{colombo-etal-2021-automatic,
  title = "Automatic Text Evaluation through the Lens of {W}asserstein Barycenters",
  author = "Colombo, Pierre and Staerman, Guillaume and Clavel, Chlo{\'e} and Piantanida, Pablo",
  booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
  year = "2021",
  pages = "10450--10466"
}
@article{depth_score,
  title={A pseudo-metric between probability distributions based on depth-trimmed regions},
  author={Staerman, Guillaume and Mozharovskyi, Pavlo and Colombo, Pierre and Cl{\'e}men{\c{c}}on, St{\'e}phan and d'Alch{\'e}-Buc, Florence},
  journal={arXiv preprint arXiv:2103.12711},
  year={2021}
}

Usage

Python Function

Running our metrics can be computationally intensive (because they rely on pretrained models), so a GPU is usually necessary. If you don't have access to a GPU, you can use lighter pretrained representations such as TinyBERT or DistilBERT.

We provide example inputs under <metric_name>.py. For example, for BaryScore:

from bary_score import BaryScoreMetric  # adjust the import path to where the repo is installed

metric_call = BaryScoreMetric()

ref = ['I like my cakes very much',
       'I hate these cakes!']
hypothesis = ['I like my cakes very much',
              'I like my cakes very much']

# Pre-compute idf weights over both corpora, then score each (reference, hypothesis) pair
metric_call.prepare_idfs(ref, hypothesis)
final_preds = metric_call.evaluate_batch(ref, hypothesis)
print(final_preds)
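
If you want to trade some accuracy for speed, as suggested above, the pretrained backbone can in principle be swapped for a lighter one. The sketch below assumes a model_name constructor argument, which is a hypothetical parameter name; check the actual signature of BaryScoreMetric in bary_score.py before relying on it:

# 'model_name' is a hypothetical argument name; verify it against the BaryScoreMetric signature
metric_call = BaryScoreMetric(model_name='distilbert-base-uncased')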

Command Line Interface (CLI)

We provide a command line interface (CLI) for our metrics as well as a python module. The CLI can be used as follows:

export metric=infolm
export measure_to_use=fisher_rao
CUDA_VISIBLE_DEVICES=0 python score_cli.py --ref="samples/refs.txt" --cand="samples/hyps.txt" --metric_name=${metric} --measure_to_use=${measure_to_use}

See more options with python score_cli.py -h.

Practical Tips

  • Unlike BERT, RoBERTa uses a GPT-2-style tokenizer which creates additional " " tokens when multiple spaces appear together. It is recommended to remove the extra spaces with sent = re.sub(r' +', ' ', sent) or sent = re.sub(r'\s+', ' ', sent); see the helper sketch after this list.
  • Using inverse document frequency (idf) over the reference sentences to weigh word importance may correlate better with human judgment. However, when the set of reference sentences becomes too small, the idf scores become inaccurate or invalid. To use idf, pass --idf to the CLI tool.
  • When you are low on GPU memory, consider setting batch_size to a lower value.
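
A minimal helper for the whitespace tip above, using only the Python standard library (not specific to this repository):

import re

def normalize_spaces(sent: str) -> str:
    # Collapse runs of whitespace into single spaces, as recommended for RoBERTa-style tokenizers
    return re.sub(r'\s+', ' ', sent).strip()

ref = [normalize_spaces(s) for s in ref]
hypothesis = [normalize_spaces(s) for s in hypothesis]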

Practical Limitation

  • Because the pretrained representations use learned positional embeddings with a maximum length of 512, our scores are undefined for sentences longer than 510 tokens (512 after adding the [CLS] and [SEP] tokens); longer sentences will be truncated. Please consider using models that support longer inputs if this is a concern; a length-check sketch is given below.
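
To avoid silent truncation, inputs can be checked against the model's limit beforehand. A sketch using the Hugging Face tokenizers, assuming a BERT backbone:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def fits_in_model(sent: str, max_length: int = 512) -> bool:
    # Token count includes the [CLS] and [SEP] special tokens added by the tokenizer
    return len(tokenizer(sent)['input_ids']) <= max_length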

Acknowledgements

Our research was granted access to the HPC resources of IDRIS under the allocation 2021-AP010611665 as well as under the project 2021-101838 made by GENCI.

Comments
  • WMT15 BaryScore

    I also have a problem with reproducing your results. If I run your command_line.sh on WMT15 de-en I get a Pearson correlation of -0.3559000480869218, but in your paper you reported 75.9. How did you run these experiments?

    opened by jbgruenwald 10
  • Reproduction on WMT15/16

    Hi,

    thanks for your great work! I'm interested in BaryScore and want to reproduce your results. May I know which human judgments you correlate BaryScore with on WMT15? As far as I know, WMT15 officially calculates Kendall's tau with ranking-based human judgments, but you reported three types of correlations, so I guess there should be some human ratings like DA scores. Unfortunately, I was not able to find the DA scores. Could you help me with this? Also, the link to the WMT16 DA judgments on its official website is currently invalid; do you still have this data?

    Besides, I'm not sure if I did something wrong, but BaryScore runs really slowly on my computer (over 200s for one language pair in WMT15, versus about 8s for BERTScore). It seems the code already supports GPU computation?

    Thanks a lot!

    opened by cyr19 9
  • Make a package installable + add CI pipelines

    What does this PR do?

    Fixes #6

    This PR updates the structure of this repo and adds some necessary components to make this package easily installable from the source. Furthermore, it defines some CI jobs checking the package and also handling publishing the package distribution to PyPI once a package version is released.


    @PierreColombo - Could I kindly ask you to check the author information? I'm not really sure if I set these properly. Also, could you please confirm for me that the dependency versions (in requirements.txt) make sense w.r.t. your setup?

    @Borda - Could I kindly ask you to have a quick look at the changes I made so that we can be sure everything is alright and the package will soon be ready as a reference for torchmetrics testing? :] (Not sure if there's anything missing)

    opened by stancld 8
  • Questions about reproducing scores

    Hello,

    I am having trouble reproducing the scores from the toy example Jupyter Notebook. I want to make sure I have faithfully reproduced the results from the metrics when I incorporate them into Repro.

    Here are some issues I'm running into:

    • Could you provide a requirements.txt with the package versions you are using (e.g., what transformers/pytorch versions)?
    • The Jupyter Notebook passes in the references first (metric_call.evaluate_batch(refs, hypothesis)) but the evaluate_batch methods seem to expect the arguments in the opposite order
    • When I run InfoLM on the GPU, there is a tensor which is still on the CPU. It occurs here:
    File "/app/nlg_eval_via_simi_measures/infolm.py", line 280, in evaluate_batch
        dict_final_distribution_batch_refs, idfs_ref = self.get_distribution(batch_refs, idf_ref)
    File "/app/nlg_eval_via_simi_measures/infolm.py", line 245, in get_distribution
        outputs = self.model(**unmasked_data, labels=labels)
    ...
    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper__index_select)
    
    • It would also be very helpful to have the expected output for the data in the "samples" directory

    Thanks!

    opened by danieldeutsch 7
  • Question about the robustness of the metric, seems easy to be affected by the candidate sentences

    We found that the word frequency of candidate sentences might affect the score. We used the following code to test the performance of InfoLM, but got different results for the same pair of candidate and reference sentences in different subsets.

    Firstly, we made the candidate and reference sentences exactly the same; as expected, the distances between sentences are 0.

    metric = InfoLM(measure_to_use='fisher_rao', alpha=0.5, beta=2, temperature=1.5, )
    ref = [
           "germany produces what kind of beer?", ]
    hypothesis = [
                  "germany produces what kind of beer?", ]
    metric.prepare_idfs(ref, hypothesis)
    final_preds = metric.evaluate_batch(hypothesis, ref)
    print(final_preds)
    

    result: {'fisher_rao': [0.0], 'r_fisher_rao': [0.0], 'sim_fisher_rao': [0.0]}

    However, when we make the candidate and reference sets different but include one identical sentence in each, the distance between these two sentences is not 0:

    metric = InfoLM(measure_to_use='fisher_rao', alpha=0.5, beta=2, temperature=1.5, )
    ref = [
           "name all of the rivers located in china.",
           "germany produces what kind of beer?",
           "lisp was designed by who?",
           "which of the bassists has a height of at least 1.94?",
           "name the aircraft that has the least amount of cargo space.",
           "what building has the earliest construction starting date?",
           "what lighthouses were constructed with sandstones?",
           "germany produces what kind of beer?",
           "belgium produces what kind of beer?",
           "belgium produces what kind of wine?",
           "belgium produces what type of wine?",
           "belgium produces which type of wine?",
           "belgium has which type of wine?",
           ]
    hypothesis = [
                  "name all of the rivers located in korea.",
                  "france produces what kind of beer?",
                  "ocaml was designed by who?",
                  "which of the bassists has a height of at most 1.94?",
                  "name the aircraft that has the most amount of cargo space.",
                  "what building has the oldest construction starting date?",
                  "what lighthouses were constructed with cements?",
                  "germany produces what kind of beer?",
                  "germany produces what kind of beer?",
                  "germany produces what kind of beer?",
                  "germany produces what kind of beer?",
                  "germany produces what kind of beer?",
                  "germany produces what kind of beer?"
    ]
    metric.prepare_idfs(ref, hypothesis)
    final_preds = metric.evaluate_batch(hypothesis, ref)
    print(final_preds)
    

    result: {'fisher_rao': [0.13154096901416779, 0.3968910872936249, 0.8951906561851501, 0.17877444624900818, 0.3418828845024109, 0.28145506978034973, 0.42686304450035095, 0.23330892622470856, 0.4328637421131134, 0.5381670594215393, 0.6704646348953247, 0.8761745095252991, 0.7861247658729553], 'r_fisher_rao': [0.13154096901416779, 0.3968910872936249, 0.8951906561851501, 0.17877444624900818, 0.3418828845024109, 0.28145506978034973, 0.42686304450035095, 0.23330892622470856, 0.4328637421131134, 0.5381670594215393, 0.6704646348953247, 0.8761745095252991, 0.7861247658729553], ' sim_fisher_rao': [0.13154096901416779, 0.3968910872936249, 0.8951906561851501, 0.17877444624900818, 0.3418828845024109, 0.28145506978034973, 0.42686304450035095, 0.23330892622470856, 0.4328637421131134, 0.5381670594215393, 0.6704646348953247, 0.8761745095252991, 0.7861247658729553]}

    Here we found that the identical sentence, "germany produces what kind of beer?", appearing in both the reference and candidate sets does not lead to a 0 distance; its distance is 0.23330892622470856 according to the code you provided.

    After reading the paper carefully, we found that the word frequency in the candidate sentences might affect the InfoLM score. On page 3 of your paper, you wrote "where $\widetilde{γ_k}$ and $γ_k$ are measures of the importance of the k-th token in the candidate and reference text, respectively". From our understanding, since the set of candidate sentences might differ between runs, and the IDF of words is computed over the candidate sentences, the distance obtained for a given (fixed) pair of (candidate sentence, reference sentence) could be different at each run. Could you confirm our understanding here: is this a feature of your metric's design?

    Thank you for spending some time reading our mail. I’d appreciate it if you could help me with my questions.

    opened by guihuzhang 1
  • Make a package release

    Hi @PierreColombo,

    First of all, I'd like to thank you and your team for the great work in developing NLG metrics. I'm just curious whether you'd be interested in a contribution creating a setup.py file and preparing other related pieces so that the repo can be installed from source with pip and potentially published via PyPI. I believe this could help further adoption of your evaluation metrics and make them easier to use in various use cases. Also, we would like to implement these metrics within the torchmetrics package (see the InfoLM, BaryScore and DepthScore issues); therefore, the possibility of installing the package from PyPI would make it easier to test our implementation against yours so that we can be sure ours is correct.

    Thanks once again, and please let me know what you think about my proposal :]

    cc: @Borda

    opened by stancld 1
  • import ot

    Thanks a lot for publishing your code :)

    I have a problem with bary_score.py. There is an import ot, which I couldn't install with pip. Does it refer to some file that I am missing?

    opened by jbgruenwald 1
  • Bug in `InfoLM` metric

    Hello,

    when you call

    info_dic = self.compute_infolm(sum_distribution_hypothesis, sum_distribution_refs)
    

    in the evaluate_batch method, you pass the arguments in the incorrect order, as the signature of compute_infolm is

    def compute_infolm(self, ref_distribution, hyp_distribution):
    
    opened by stancld 0
Owner: Pierre Colombo