BARTScore: Evaluating Generated Text as Text Generation

Overview

This is the repo for the paper BARTScore: Evaluating Generated Text as Text Generation.

Updates

  • 2021.06.28 Release online evaluation Demo
  • 2021.06.25 Release online Explainable Leaderboard for Meta-evaluation
  • 2021.06.22 Code will be released soon

Background

There is a recent trend of leveraging neural models for automated evaluation in different ways, as shown in Fig. 1.

(a) Evaluation as a matching task. Unsupervised matching metrics aim to measure the semantic equivalence between the reference and the hypothesis using token-level matching functions in a distributed representation space (e.g. BERT) or a discrete string space (e.g. ROUGE).

(b) Evaluation as a regression task. Regression-based metrics (e.g. BLEURT) introduce a parameterized regression layer, which is learned in a supervised fashion to accurately predict human judgments.

(c) Evaluation as a ranking task. Ranking-based metrics (e.g. COMET) aim to learn a scoring function that assigns a higher score to better hypotheses than to worse ones.

(d) Evaluation as a generation task. In this work, we formulate the evaluation of generated text as a text generation task from pre-trained language models.
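
To make this formulation concrete, the sketch below scores a hypothesis by the average token-level log-likelihood that a pre-trained BART model assigns to it given the source text. This is only a minimal illustration built on the Hugging Face transformers API, not the repository's bart_score.py; the helper name generation_score and the plain length-normalization are assumptions made for the example.

    import torch
    import torch.nn.functional as F
    from transformers import BartTokenizer, BartForConditionalGeneration

    # Minimal sketch (not the repo's bart_score.py): score a hypothesis y given a
    # source x as the length-averaged log-probability log p(y | x) under BART.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").to(device).eval()

    def generation_score(source: str, hypothesis: str) -> float:
        src = tokenizer(source, return_tensors="pt", truncation=True).to(device)
        tgt = tokenizer(hypothesis, return_tensors="pt", truncation=True).to(device)
        with torch.no_grad():
            # Passing labels makes the decoder produce one logit vector per target token.
            logits = model(**src, labels=tgt["input_ids"]).logits
            log_probs = F.log_softmax(logits, dim=-1)
            # Pick out the log-probability of each actual hypothesis token.
            token_log_probs = log_probs.gather(-1, tgt["input_ids"].unsqueeze(-1)).squeeze(-1)
        # Average over target length: higher (less negative) means y is more likely given x.
        return token_log_probs.mean().item()

    print(generation_score("The quick brown fox jumped over the lazy dog near the river bank.",
                           "A fox jumped over a dog."))

Scoring in different directions (e.g. source to hypothesis, or hypothesis to reference) corresponds to the different evaluation perspectives described in the paper.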

Comments
  • Some problems while fine-tuning on the Paraphrase Dataset

    Sorry to trouble you; I'm interested in the paraphrase fine-tuning part.
    I used the code you provide in train/bart.py. However, when I run "python bart.py", I hit an OOM error (on a single V100-32GB). I reread your paper and noticed that you fine-tuned for one epoch with batch size 20 within an hour (impressive speed!).

    I also checked train/args.py in this repo: the batch size is 8 and the max epochs is 3, which is not consistent with the setting in the paper (batch size 20, 1 epoch).

    Could you release the settings/code you used for fine-tuning? I want to find the cause of the OOM (maybe the larger max_length, or the larger batch size?). Currently I have cut the train batch size to 4 (eval batch size to 2) and the model fine-tunes very slowly.

    Or could you give me some advice for reproducing the results? I am not sure whether a lower batch size or a shorter sequence length can achieve the same results as in the paper.

    Thank you very much; any suggestions are greatly appreciated. Sorry to trouble you!

    opened by Ricardokevins 9
  • Meaning of the value range?

    Thank you for your work. Could you please tell us the corresponding interpretation of the value range? For example, [-1, -3]: similar; [-3, -6]: normal; [-6, -100]: not similar, or something along those lines. Thanks.

    opened by GabrielLin 4
  • A problem while using

    Awesome work!

    While using the implementation in this repo, I have a question to confirm:

    BARTScore uses the generation loss as the score to estimate the quality of a sentence. In the implementation of bart_score.score, I noticed the following code:

    curr_score_list = [-x.item() for x in loss]

    The returned score list contains negative values. If summary A gets a score of -1 while summary B gets -100, does that mean summary A is better than summary B (because the loss has been negated, so a loss of 1 is smaller than a loss of 100)? Any reply is appreciated. (See the sketch after this issue list.)

    opened by Ricardokevins 4
  • Human scores in QAGS_XSUM are all 0/1

    Hi,

    thanks for your great work.

    When I use the extracted QAGS_XSUM dataset, I found that all the factuality scores are 0 or 1, which differs from QAGS_CNN. Should it be like that, or is something wrong?

    Thanks a lot!

    Best,

    opened by cyr19 2
  • A slight problem with different baseline results

    Hi, I noticed that the QAGS_CNN/QAGS_XSUM ROUGE results in Table 5 (factuality test) differ from those in other papers (e.g. the paper that released the QAGS dataset). Is that because this paper calculates ROUGE between the document and the summary, while other papers calculate ROUGE between the reference and the summary? (In your experimental setting, you set "ref_sum" the same as "src".) I am confused about whether the baselines being compared are the same.

    opened by Ricardokevins 2
  • A question about prompt-based BARTScore

    Hello, this is very meaningful work.

    I have a question: why is the prompt added to the right of y instead of the left?

    For example, in a summarization task, X is a long document, Y is its summary, and z is the prompt "in summary". I feel "[X]. In summary, [Y]" is more natural than "[X]. [Y], in summary".

    Looking forward to your reply, thank you!

    opened by StevenTang1998 2
  • Pearson correlation results in Table 4

    Hi @yyy-Apple

    I have a couple of questions about reproducing the scores in Table 4 of the paper; I hope you can help me out.

    • Are the correlations at the system level or the sentence level? I went through the experimental setup section and also the SUM folder but couldn't find an answer.
    • When calculating correlations, are the BARTScore values scaled to something like a 0-1 range or taken as they are?

    Thanks a lot for your time and amazing work!

    opened by tanay2001 2
  • Multi-reference support

    Hey there!

    I am trying to use the BARTScore metric as follows, but my dataset contains multiple references. Is there a feature I should enable to allow multiple references? I can't seem to find one...

    from bart_score import BARTScorer
    bart_scorer = BARTScorer(device='cuda:0', checkpoint='facebook/bart-large-cnn')
    bart_scorer.score(['This is interesting.'], [['This is fun.'],['This is amazing']]) # generation scores from the first list of texts to the second list of texts.
    

    Thanks a lot!

    opened by tanay2001 2
  • BARTScore calculation question

    In Equation 2 of the paper, BARTScore is calculated as the sum of log probabilities. However, in line 63 of bart_score.py it is calculated as the average of the log probabilities:

     loss = loss.sum(dim=1) / tgt_len
    

    Could you let us know which formulation is the correct one?

    opened by gmichalo 1
  • "EOFError: Ran out of input" when trying to load QAGS_CNN/data.pkl

    Hello, thanks for this useful software!

    I hit the following error when trying to load the file QAGS_CNN/data.pkl. If you have a solution, I would appreciate it!

    >>> import pickle
    >>> pickle.load(open('BARTScore/SUM/QAGS_CNN/data.pkl', 'rb'))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    EOFError: Ran out of input
    

    Environment

    • ubuntu:20.04
    • python3.8.10
    opened by taku-ito 1
  • Trouble with ROUGE score

    I am having trouble running the score.py script from SUM when --rouge is passed:

    python score.py --file "REALSumm/data.pkl" --device "cuda:0" --output "REALSumm/scores.pkl" --rouge
    

    Leads to

    Cannot open /content/BARTScore/SUM/gehrmann_rouge_opennmt/rouge_baselines/pyrouge/RELEASE-1.5.5/smart_common_words.txt
    

    I can find smart_common_words.txt under BARTScore/SUM/gehrmann_rouge_opennmt/rouge_baselines/pyrouge/RELEASE-1.5.5/data, but even if I move it to RELEASE-1.5.5 (as the script seems to expect), this just leads to other issues:

    Cannot open exception db file for reading: /content/BARTScore/SUM/gehrmann_rouge_opennmt/rouge_baselines/pyrouge/RELEASE-1.5.5/WordNet-2.0.exc.db
    

    Are there extra dependencies/instructions not currently listed that are needed to run the scripts with rouge?

    opened by JohnGiorgi 1
  • WMT 2019 - Can you provide more details about the split and dataset you used?

    Hi (:

    Great work on this paper! I was wondering if you could provide some additional details about the dataset you used for MT. If I understand correctly, you used the WMT-19 DARR dataset, but would you mind confirming whether this was the validation or the test set? Also, were you able to get actual scores for each of the sentences?

    Thanks in advance :)

    opened by PastelBelem8 0
  • Spearman correlations for Table 4

    In Table 4 of the paper, for the SummEval dataset you measured COH, FAC, FLU, and INFO. I wanted to know which variants of BARTScore you used.

    From my understanding of the paper, for factuality (FAC) you must have used BARTScore (s -> h), i.e. source -> hypothesis.

    But I am not clear about FLU, COH, and INFO.

    If you could please elaborate, that would be really helpful.

    opened by Atharva-Phatak 3
  • PyPI?

    Hello

    I was wondering whether you were planning to make the library pip installable. That would make further integration a lot easier. Inspiration may be taken from BERTScore's implementation.

    Thanks

    opened by BramVanroy 3
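
Several of the issues above touch on how to interpret the returned values (whether -1 is better than -100, and whether scores are summed or averaged over the target length). The toy comparison below is only a sketch of the convention shown in the code lines quoted in those issues: the per-token negative log-likelihood is averaged over the target length and then negated, so scores are at most 0 and a less negative score corresponds to a lower loss. The per-token numbers are made up for illustration.

    import torch

    # Toy illustration of the two lines quoted in the issues above:
    #     loss = loss.sum(dim=1) / tgt_len            # average NLL per target token
    #     curr_score_list = [-x.item() for x in loss] # negate, so higher is better
    # The per-token NLL values below are hypothetical.
    token_nll_a = torch.tensor([[0.8, 1.2, 1.0]])        # summary A, 3 target tokens
    token_nll_b = torch.tensor([[4.5, 5.5, 5.0, 5.0]])   # summary B, 4 target tokens

    score_a = -(token_nll_a.sum(dim=1) / token_nll_a.size(1)).item()  # -1.0
    score_b = -(token_nll_b.sum(dim=1) / token_nll_b.size(1)).item()  # -5.0

    # The less negative score (summary A) corresponds to the lower average loss,
    # i.e. the hypothesis to which the model assigns higher likelihood.
    print(score_a, score_b, score_a > score_b)  # -1.0 -5.0 True
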
Owner

NeuLab (Graham Neubig's Lab at LTI/CMU)