BARTScore: Evaluating Generated Text as Text Generation

NeuLab

Last update: Dec 17, 2022

Overview

This is the repository for the paper "BARTScore: Evaluating Generated Text as Text Generation".

Updates

2021.06.28 Release online evaluation Demo
2021.06.25 Release online Explainable Leaderboard for Meta-evaluation
2021.06.22 Code will be released soon

Background

There is a recent trend that leverages neural models for automated evaluation in different ways, as shown in Fig.1.

(a) Evaluation as matching task. Unsupervised matching metrics aim to measure the semantic equivalence between the reference and the hypothesis by using token-level matching functions in distributed representation space (e.g. BERT) or discrete string space (e.g. ROUGE).

(b) Evaluation as regression task. Regression-based metrics (e.g. BLEURT) introduce a parameterized regression layer, which is learned in a supervised fashion to accurately predict human judgments.

(c) Evaluation as ranking task. Ranking-based metrics (e.g. COMET) aim to learn a scoring function that assigns a higher score to better hypotheses than to worse ones.

(d) Evaluation as generation task. In this work, we formulate evaluating generated text as a text generation task from pre-trained language models.

Comments

Some problems while fine-tuning on the paraphrase dataset

I'm sorry to trouble you, but I am interested in the paraphrase fine-tuning part.
I used the concise code you provide in train/bart.py. However, when I run the command "python bart.py", I encounter an OOM problem (on a single V100-32GB). I reread your paper and noticed that you can even use batch size 20 and fine-tune for one epoch within an hour (impressive speed~).

I also checked train/args.py in this repo: the batch size is 8 and the max epoch is 3, which is not consistent with the setting in the paper (batch = 20, epoch = 1).

Can you release the settings/code you used for fine-tuning? I want to check what causes the OOM (maybe a larger max_length, or a larger batch size?). Currently I have cut TrainBatchSize to 4 (EvalBatchSize to 2), and the model fine-tunes very, very slowly....

Or can you give me some advice on reproducing it? I am not sure whether a lower batch size or a shorter sequence length could achieve the same precision as the paper's result.

Thank you very much; any suggestions are greatly appreciated. Sorry to trouble you QAQ

opened by Ricardokevins 9
Meaning of the value range?

Thank you for your work. Could you please tell us how to interpret the value range? For example, [-1, -3]: similar; [-3, -6]: normal; [-6, -100]: not similar, or something along those lines. Thanks.

opened by GabrielLin 4
A problem while using

Awesome Work~~

While using the implementation in the repo, I have a question to confirm:

BARTScore uses the generation loss as the score to estimate the quality of a sentence. In the implementation of bart_score.score, I noticed the code below:

curr_score_list = [-x.item() for x in loss]

The returned score list contains some negative values. If SummaryA gets a score of -1 while SummaryB gets a score of -100, does that mean SummaryA is better than SummaryB (because the loss has been negated, and a loss of 1 is smaller than a loss of 100)? I'd appreciate any reply ~

opened by Ricardokevins 4
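
On the sign convention asked about above: since the list is built by negating a loss, a smaller loss maps to a larger score, so under that reading a score of -1 would rank above a score of -100. A minimal sketch with made-up loss values (the names and numbers are illustrative assumptions, not taken from bart_score.py):

```python
# Hypothetical per-example losses; a lower loss means the model finds the text more likely.
losses = {"SummaryA": 1.0, "SummaryB": 100.0}

# Same negation as `curr_score_list = [-x.item() for x in loss]`, without the tensors.
scores = {name: -loss for name, loss in losses.items()}

# Ranking by score (highest first) matches ranking by loss (lowest first).
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```
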
human scores in QAGS_XSUM are all 0/1

Hi,

thanks for your great work.

When I use the extracted QAGS_XSUM dataset, I find that all fact scores are 0 or 1, which differs from QAGS_CNN. Is that expected, or is something wrong?

Thanks a lot!

Best,

opened by cyr19 2
A slight problem about Different Baseline result

Hi, I noticed that the QAGSCNN/QAGSXsum ROUGE results in Table 5 (Factual Test) are different from those in other papers (e.g. the paper that released the QAGS dataset). Is this because this paper calculates ROUGE between the document and the summary, while other papers calculate ROUGE between the reference and the summary? (In your experiment setting, you set "ref_sum" to be the same as "src".) I am unsure whether the baselines used for comparison are the same.

opened by Ricardokevins 2
A question about prompt-based BARTScore

Hello, this is a very meaningful work.

I have a question, why is the prompt added to the right of y instead of the left?

For example, if we do a summarization task, X is a long document, Y is its summary, and z is the prompt "in summary". I feel "[X]. In summary, [Y]" is more natural than "[X]. [Y], in summary".

Looking forward to your reply, thank you!

opened by StevenTang1998 2
Pearson Correlations results in Table 4
Hi @yyy-Apple

I had a couple of questions about reproducing the scores in Table 4 of the paper. I hope you can help me out.

Are the correlations at the system or sentence level? I went through the experiment setup section and also the SUM folder but couldn't find an answer

When calculating correlations are the BARTScore values scaled to something like 0-1 range or taken as it is?

Thanks a lot for your time and amazing work!
opened by tanay2001 2
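
Regarding the scaling question: Pearson correlation is unchanged by any positive linear rescaling, so min-max scaling the metric scores to a 0-1 range would give the same coefficient as using them as is. This says nothing about which option the authors chose; the sketch below only checks the invariance, using invented metric scores and invented human ratings and a hand-written `pearson` helper.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def min_max_scale(xs):
    """Map values linearly onto the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

# Made-up numbers for illustration only.
metric_scores = [-1.2, -3.5, -2.1, -6.0, -4.4]
human_ratings = [4.5, 3.0, 4.0, 1.5, 2.0]

r_raw = pearson(metric_scores, human_ratings)
r_scaled = pearson(min_max_scale(metric_scores), human_ratings)
print(r_raw, r_scaled)
```
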

Multi-reference support

Hey there!

I am trying to use the BARTScore metric as follows, but my dataset contains multiple references. Is there any feature that I should enable to allow multiple references? I can't seem to find any...

from bart_score import BARTScorer
bart_scorer = BARTScorer(device='cuda:0', checkpoint='facebook/bart-large-cnn')
bart_scorer.score(['This is interesting.'], [['This is fun.'],['This is amazing']]) # generation scores from the first list of texts to the second list of texts.

Thanks a lot!

opened by tanay2001 2
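
One generic way to handle several references with any pairwise scorer is to score the hypothesis against each reference separately and then aggregate, for example with the maximum or the mean. The sketch below assumes only that the scorer takes two equal-length lists of strings and returns one float per pair, as the call in the question does. `multi_ref_scores` and `toy_score_fn` are hypothetical names introduced here for illustration; `toy_score_fn` is a stand-in so the example runs without a model, and a real scorer would be passed in its place.

```python
from statistics import mean

def multi_ref_scores(score_fn, hyps, refs_per_hyp, agg=max):
    """Score each hypothesis against all of its references, then aggregate.

    score_fn(srcs, tgts) must return one score per (src, tgt) pair.
    """
    results = []
    for hyp, refs in zip(hyps, refs_per_hyp):
        per_ref = score_fn([hyp] * len(refs), refs)  # one score per reference
        results.append(agg(per_ref))
    return results

def toy_score_fn(srcs, tgts):
    """Stand-in scorer (NOT BARTScore): negative length difference."""
    return [-abs(len(s) - len(t)) for s, t in zip(srcs, tgts)]

hyps = ['This is interesting.']
refs = [['This is fun.', 'This is amazing']]  # all references for the one hypothesis

best = multi_ref_scores(toy_score_fn, hyps, refs, agg=max)
avg = multi_ref_scores(toy_score_fn, hyps, refs, agg=mean)
print(best, avg)
```

Taking the maximum rewards a hypothesis for matching any one reference well, while the mean rewards agreement with all of them; which is more appropriate depends on the dataset.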

BARTScore calculation question
In equation 2 of the paper, BARTScore is calculated as the sum of log probabilities. However, in line 63 of bart_score.py it is calculated as the average of the log probabilities.

loss = loss.sum(dim=1) / tgt_len

Could you let us know which equation is the correct one?
opened by gmichalo 1
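
To make the difference between the two formulas concrete, here is a small sketch with invented token probabilities. It is not the repository's code; it only contrasts a summed log-likelihood (as the question reads equation 2) with a length-normalized one (as in the quoted line), and the function names are assumptions made for this example.

```python
import math

def sum_log_prob(token_log_probs):
    """Sequence score as the sum of token log-probabilities."""
    return sum(token_log_probs)

def mean_log_prob(token_log_probs):
    """Sequence score as the sum divided by the target length."""
    return sum(token_log_probs) / len(token_log_probs)

# Two hypothetical targets whose tokens are all equally likely (p = 0.5),
# differing only in length.
short_target = [math.log(0.5)] * 5
long_target = [math.log(0.5)] * 20

sum_short, sum_long = sum_log_prob(short_target), sum_log_prob(long_target)
mean_short, mean_long = mean_log_prob(short_target), mean_log_prob(long_target)
print(sum_short, sum_long, mean_short, mean_long)
```

With the sum, the longer target receives a lower score even though every token is equally likely; dividing by the length removes that effect. For targets of the same length the two variants differ only by a constant factor, so they rank candidates identically.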
"EOFError: Ran out of input" when trying to load QAGS_CNN/data.pkl
Hello, thanks for this useful software!

I hit the following error when trying to load the file QAGS_CNN/data.pkl. If you have a solution I would appreciate it!

>>> import pickle
>>> pickle.load(open('BARTScore/SUM/QAGS_CNN/data.pkl', 'rb'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
EOFError: Ran out of input

environment

ubuntu:20.04

python3.8.10
opened by taku-ito 1
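
For what it is worth, pickle.load raises "EOFError: Ran out of input" when the file it reads is empty, so one thing worth checking is whether data.pkl on disk has a size of 0 bytes. Whether that is the cause here is an assumption. The sketch below uses a hypothetical helper, `load_pickle_checked`, that reports an empty file explicitly, and exercises it on temporary files rather than on the real dataset.

```python
import os
import pickle
import tempfile

def load_pickle_checked(path):
    """Load a pickle file, failing with a clearer message if the file is empty."""
    if os.path.getsize(path) == 0:
        raise ValueError(f"{path} is empty (0 bytes); it may not have been downloaded fully")
    with open(path, "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as tmp:
    # A valid pickle file round-trips normally.
    good_path = os.path.join(tmp, "good.pkl")
    with open(good_path, "wb") as f:
        pickle.dump({"a": 1}, f)
    loaded = load_pickle_checked(good_path)

    # An empty file triggers the explicit error instead of EOFError.
    empty_path = os.path.join(tmp, "empty.pkl")
    open(empty_path, "wb").close()
    try:
        load_pickle_checked(empty_path)
        empty_error = None
    except ValueError as exc:
        empty_error = str(exc)

print(loaded, empty_error)
```
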
Trouble with ROUGE score
I am having trouble running the score.py script from SUM when --rouge is passed

python score.py --file "REALSumm/data.pkl" --device "cuda:0" --output "REALSumm/scores.pkl" --rouge

Leads to

Cannot open /content/BARTScore/SUM/gehrmann_rouge_opennmt/rouge_baselines/pyrouge/RELEASE-1.5.5/smart_common_words.txt

I can find smart_common_words.txt under BARTScore/SUM/gehrmann_rouge_opennmt/rouge_baselines/pyrouge/RELEASE-1.5.5/data, but even if I move it to RELEASE-1.5.5 (as the script seems to expect) this just leads to other issues

Cannot open exception db file for reading: /content/BARTScore/SUM/gehrmann_rouge_opennmt/rouge_baselines/pyrouge/RELEASE-1.5.5/WordNet-2.0.exc.db

Are there extra dependencies/instructions not currently listed that are needed to run the scripts with rouge?
opened by JohnGiorgi 1
WMT 2019 - Can you provide more details about the split and dataset you used?

Hi (:

Great work with this paper! I was wondering if you could provide me with some additional feedback on the dataset you used for MT. If I understand correctly, you used the WMT-19 DARR dataset, but would you mind confirming if this was the validation or test set? Also, were you able to get actual scores for each of the sentences?

Thanks in advance :)

opened by PastelBelem8 0
Spearman Correlations for Table-4

In Table-4 of the paper, you measured COH, FAC, FLU, and INFO on the SummEval dataset. I wanted to know which variants of BARTScore you used.

From my understanding of the paper, for factuality (FAC) you must have used BARTScore(s->h), i.e. source -> hypothesis.

But I am not clear about FLU, COH, and INFO.

If you could please elaborate, that would be really helpful.

opened by Atharva-Phatak 3
PyPi?

Hello

I was wondering whether you were planning to make the library pip installable. That would make further integration a lot easier. Inspiration may be taken from BERTScore's implementation.

Thanks

opened by BramVanroy 3

Owner

NeuLab

Graham Neubig's Lab at LTI/CMU
