# FENSE
FENSE (Fluency ENhanced Sentence-bert Evaluation) is a metric for audio caption evaluation, proposed in the paper "Can Audio Captions Be Evaluated with Image Caption Metrics?"

The main branch contains an easy-to-use interface for fast evaluation of an audio captioning system.

An online demo is available at https://share.streamlit.io/blmoistawinde/fense/main/streamlit_demo/app.py .

To get the datasets (AudioCaps-Eval and Clotho-Eval) and the code to reproduce the results, please refer to the experiment-code branch.
## Installation

Clone the repository and install it in editable mode with pip:

```bash
git clone https://github.com/blmoistawinde/fense.git
cd fense
pip install -e .
```
## Usage

### Single Sentence

To get the detailed scores of each component for a single sentence:
```python
from fense.evaluator import Evaluator

print("----Using tiny models----")
# Load the SBERT similarity model and the tiny fluency error checker on CPU
evaluator = Evaluator(device='cpu', sbert_model='paraphrase-MiniLM-L6-v2',
                      echecker_model='echecker_clotho_audiocaps_tiny')

eval_cap = "An engine in idling and a man is speaking and then"
ref_cap = "A machine makes stitching sounds while people are talking in the background"

# Returns the raw SBERT similarity, the fluency error probability,
# and the final penalized score
score, error_prob, penalized_score = evaluator.sentence_score(
    eval_cap, [ref_cap], return_error_prob=True)

print("Cand:", eval_cap)
print("Ref:", ref_cap)
print(f"SBERT sim: {score:.4f}, Error Prob: {error_prob:.4f}, Penalized score: {penalized_score:.4f}")
```
### System Score

To get a system's overall score on a dataset by averaging sentence-level FENSE, use `eval_system.py`, with your system outputs prepared in the same format as `test_data/audiocaps_cands.csv` or `test_data/clotho_cands.csv`.
For the AudioCaps test set:

```bash
python eval_system.py --device cuda --dataset audiocaps --cands_dir ./test_data/audiocaps_cands.csv
```

For the Clotho-Eval set:

```bash
python eval_system.py --device cuda --dataset clotho --cands_dir ./test_data/clotho_cands.csv
```
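If your candidates and references are already in Python rather than a CSV, you can also average sentence-level scores directly with the API from the previous section. A minimal sketch, assuming `sentence_score` returns only the penalized score when `return_error_prob` is left at its default:

```python
from fense.evaluator import Evaluator

evaluator = Evaluator(device='cpu', sbert_model='paraphrase-MiniLM-L6-v2',
                      echecker_model='echecker_clotho_audiocaps_tiny')

# One candidate caption per clip, with a list of references per clip
cands = ["An engine idles while a man is speaking",
         "Rain is falling on a hard surface"]
list_refs = [["A machine makes stitching sounds while people are talking in the background"],
             ["Heavy rain hitting a tin roof"]]

# System score = mean of sentence-level FENSE over the dataset
scores = [evaluator.sentence_score(cand, refs)
          for cand, refs in zip(cands, list_refs)]
print(f"System FENSE: {sum(scores) / len(scores):.4f}")
```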
## Performance Benchmark

We benchmark FENSE with different choices of SBERT model and error checker on the two benchmark datasets, AudioCaps-Eval and Clotho-Eval. (*) marks the combination reported in the paper.
### AudioCaps-Eval

| SBERT | echecker | HC | HI | HM | MM | Total |
|---|---|---|---|---|---|---|
| paraphrase-MiniLM-L6-v2 | none | 62.1 | 98.8 | 93.7 | 75.4 | 80.4 |
| paraphrase-MiniLM-L6-v2 | tiny | 57.6 | 94.7 | 89.5 | 82.6 | 82.3 |
| paraphrase-MiniLM-L6-v2 | base | 62.6 | 98.0 | 82.5 | 85.4 | 85.5 |
| paraphrase-TinyBERT-L6-v2 | none | 64.0 | 99.2 | 92.5 | 73.6 | 79.6 |
| paraphrase-TinyBERT-L6-v2 | tiny | 58.6 | 95.1 | 88.3 | 82.2 | 82.1 |
| paraphrase-TinyBERT-L6-v2 | base | 64.5 | 98.4 | 91.6 | 84.6 | 85.3 (*) |
| paraphrase-mpnet-base-v2 | none | 63.1 | 98.8 | 94.1 | 74.1 | 80.1 |
| paraphrase-mpnet-base-v2 | tiny | 58.1 | 94.3 | 90.0 | 83.2 | 82.7 |
| paraphrase-mpnet-base-v2 | base | 63.5 | 98.0 | 92.5 | 85.9 | 85.9 |
### Clotho-Eval

| SBERT | echecker | HC | HI | HM | MM | Total |
|---|---|---|---|---|---|---|
| paraphrase-MiniLM-L6-v2 | none | 59.5 | 95.1 | 76.3 | 66.2 | 71.3 |
| paraphrase-MiniLM-L6-v2 | tiny | 56.7 | 90.6 | 79.3 | 70.9 | 73.3 |
| paraphrase-MiniLM-L6-v2 | base | 60.0 | 94.3 | 80.6 | 72.3 | 75.3 |
| paraphrase-TinyBERT-L6-v2 | none | 60.0 | 95.5 | 75.9 | 66.9 | 71.8 |
| paraphrase-TinyBERT-L6-v2 | tiny | 59.0 | 93.0 | 79.7 | 71.5 | 74.4 |
| paraphrase-TinyBERT-L6-v2 | base | 60.5 | 94.7 | 80.2 | 72.8 | 75.7 (*) |
| paraphrase-mpnet-base-v2 | none | 56.2 | 96.3 | 77.6 | 65.2 | 70.7 |
| paraphrase-mpnet-base-v2 | tiny | 54.8 | 91.8 | 80.6 | 70.1 | 73.0 |
| paraphrase-mpnet-base-v2 | base | 57.1 | 95.5 | 81.9 | 71.6 | 74.9 |
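To evaluate with the (*) combination from the tables, instantiate the evaluator with the corresponding models. A sketch: the base checker name below is assumed by analogy with the tiny checker shown earlier, so verify the exact identifier in the repo:

```python
from fense.evaluator import Evaluator

# The (*) combination: TinyBERT SBERT model + base error checker.
# NOTE: 'echecker_clotho_audiocaps_base' is assumed to follow the naming
# pattern of 'echecker_clotho_audiocaps_tiny'; check the repo to confirm.
evaluator = Evaluator(device='cuda',
                      sbert_model='paraphrase-TinyBERT-L6-v2',
                      echecker_model='echecker_clotho_audiocaps_base')
```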
## Reference

If you use FENSE in your research, please cite:
```bibtex
@misc{zhou2021audio,
      title={Can Audio Captions Be Evaluated with Image Caption Metrics?},
      author={Zelin Zhou and Zhiling Zhang and Xuenan Xu and Zeyu Xie and Mengyue Wu and Kenny Q. Zhu},
      year={2021},
      eprint={2110.04684},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
```