Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

Last update: Dec 3, 2022

Related tags

Deep Learning billboard

Overview

Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

Introduction

We propose a generalization of leaderboards, bidimensional leaderboards (Billboards), that simultaneously drives progress in language generation tasks and their evaluation. We accept two types of submissions:

Generator developers submit output text. A Billboard computes all metric scores.
Metric developers submit an executable program. A Billboard computes correlations with the human judgments, updates the ensemble metric, and measures how much it overrates machine over human generations.

Anonymous submissions are allowed!!

Submit

Submission guides and examples are available here.

Scoring Results

Scoring results for all past public submissions are available here. We have generator-name||metric-name.csv files from the Cartesian product between the generators and metrics: each contains instance-level scores.

Citations

Bidimesional Leaderboards

@misc{kasai2021billboard,
    title   = {Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand},
    author  = {Jungo Kasai and Keisuke Sakaguchi and Ronan Le Bras and Lavinia Dunagan and Jacob Morrison and Alexander R. Fabbri and Yejin Choi and Noah A. Smith},
    year    = {2021},
    url     = {https://arxiv.org/abs/2112.04139}, 
}

MSCOCO Captioning Evaluations and THumB 1.0 Protocol

@misc{kasai2021thumb,
    title   = {Transparent Human Evaluation for Image Captioning},
    author  = {Jungo Kasai and Keisuke Sakaguchi and Lavinia Dunagan and Jacob Morrison and Ronan Le Bras and Yejin Choi and Noah A. Smith},
    year    = {2021},
    url     = {https://arxiv.org/abs/2111.08940}, 
}

CNNDM Summarization Evaluations

@article{fabbri2021summeval,
    title   = {{SummEval}: Re-evaluating Summarization Evaluation},
    author  = {Fabbri, Alexander R and Kry{\'s}ci{\'n}ski, Wojciech and McCann, Bryan and Xiong, Caiming and Socher, Richard and Radev, Dragomir},
    journal = {TACL},
    year    = {2021},
    url     = {https://arxiv.org/abs/2007.12626},
}

WMT20 ZH-EN/EN-DE Machine Translation Evaluations

@misc{freitag2021experts,
      title={Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation}, 
      author={Markus Freitag and George Foster and David Grangier and Viresh Ratnakar and Qijun Tan and Wolfgang Macherey},
      year={2021},
      url={https://arxiv.org/abs/2104.14478},
}

Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

Related tags

Overview

Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

Introduction

Submit

Scoring Results

Citations

Bidimesional Leaderboards

MSCOCO Captioning Evaluations and THumB 1.0 Protocol

CNNDM Summarization Evaluations

WMT20 ZH-EN/EN-DE Machine Translation Evaluations

You might also like...

HeatNet is a python package that provides tools to build, train and evaluate neural networks designed to predict extreme heat wave events globally on daily to subseasonal timescales.

Metrics to evaluate quality and efficacy of synthetic datasets.

Train/evaluate a Keras model, get metrics streamed to a dashboard in your browser.

This repository contains a set of codes to run (i.e., train, perform inference with, evaluate) a diarization method called EEND-vector-clustering.

I have created this Virtual Paint Program, in this you can paint(draw) on your screen using hand gestures, created in Python-3 using OpenCV and Mediapipe library. Gestures :- Index Finger for drawing and Index+Middle Finger for changing position and objects.

This is a virtual picture dragging application. Users may virtually slide photos across the screen. The distance between the index and middle fingers determines the movement. Smaller distances indicate click and motion, whereas bigger distances indicate only hand movement.

This tool uses Deep Learning to help you draw and write with your hand and webcam.

Episodic Transformer (E.T.) is a novel attention-based architecture for vision-and-language navigation. E.T. is based on a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions.

Owner

Hand-distance-measurement-game - Hand Distance Measurement Game

We evaluate our method on different datasets (including ShapeNet, CUB-200-2011, and Pascal3D+) and achieve state-of-the-art results, outperforming all the other supervised and unsupervised methods and 3D representations, all in terms of performance, accuracy, and training time.

Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It can use GPUs and perform efficient symbolic differentiation.

Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It can use GPUs and perform efficient symbolic differentiation.

Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It can use GPUs and perform efficient symbolic differentiation.

Ludwig is a toolbox that allows to train and evaluate deep learning models without the need to write code.

Ludwig is a toolbox that allows to train and evaluate deep learning models without the need to write code.

OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)

Narya API allows you track soccer player from camera inputs, and evaluate them with an Expected Discounted Goal (EDG) Agent

nnDetection is a self-configuring framework for 3D (volumetric) medical object detection which can be applied to new data sets without manual intervention. It includes guides for 12 data sets that were used to develop and evaluate the performance of the proposed method.