SIMULEVAL A General Evaluation Toolkit for Simultaneous Translation

Facebook Research

Last update: Dec 28, 2022

Related tags

Deep Learning SimulEval

Overview

SimulEval

SimulEval is a general evaluation framework for simultaneous translation on text and speech.

Requirement

python >= 3.7.0

Installation

git clone [email protected]:fairinternal/SimulEval.git
cd SimulEval
pip install -e .

Quick Start

Following is the evaluation of a dummy agent which operates wait-k (k = 3) policy and generates random words until the length of the generated words is the same as the number of all the source words. A tutorial can be found here.

cd examples
simuleval \
  --agent dummy/dummy_waitk_text_agent.py \
  --source data/src.txt \
  --target data/tgt.txt

License

SimulEval is licensed under Creative Commons BY-SA 4.0.

Citation

Please cite as:

@inproceedings{simuleval2020,
  title = {Simuleval: An evaluation toolkit for simultaneous translation},
  author = {Xutai Ma, Mohammad Javad Dousti, Changhan Wang, Jiatao Gu, Juan Pino},
  booktitle = {Proceedings of the EMNLP},
  year = {2020},
}

Comments

Length Adaptive Average Lagging (LAAL)

Implementation of the Length Adaptive Average Lagging (LAAL) as proposed in CUNI-KIT System for Simultaneous Speech Translation Task at IWSLT 2022 (https://arxiv.org/abs/2204.06028). The name was suggested in Over-Generation Cannot Be Rewarded: Length-Adaptive Average Lagging for Simultaneous Speech Translation (https://arxiv.org/abs/2206.05807).
CLA Signed Merged

opened by pe-trik 7
Several bug and format bug fix
Change github test plan to pytest on each individual files.

Fix the bug when submitting the slurm jobs

Fix the bug when infinite loop happens if an empty segment is produced

Formatting with black

CLA Signed Merged
opened by xutaima 6
Length Adaptive Average Lagging (LAAL)

Implementation of the Length Adaptive Average Lagging (LAAL) as proposed in CUNI-KIT System for Simultaneous Speech Translation Task at IWSLT 2022 (https://arxiv.org/abs/2204.06028). The name was suggested in Over-Generation Cannot Be Rewarded: Length-Adaptive Average Lagging for Simultaneous Speech Translation (https://arxiv.org/abs/2206.05807).

opened by pe-trik 3
Error during Simul ST(MuST-C) Evaluation
Following the steps in this guide, I got this error when trying to evaluate the simultaneous speech-to-text model.

2021-03-12 13:03:29 | ERROR | simuleval.util.agent_builder | No 'Agent' class found in /project_scratch/nmt_shdata/speech_translation/simultaneous/fairseq-master/examples/speech_to_text/simultaneous_translation/agents/fairseq_simul_st_agent.py

This commit in the fairseq repo changed the superclass of the ST Agent, hence this error(due to the following if statement)

https://github.com/facebookresearch/SimulEval/blob/927072ebb7df83023adcc206b7458bf9c9aad3a2/simuleval/utils/agent_finder.py#L71

Environment

python 3.8.5

torch - 1.7.0

OS: Linux

fairseq and SimulEval - master codes

Scripts: Provided in the guide
opened by mzaidi59 3
Commit 1753363 changes AL scores for audio input

In commit 1753363, AudioInstance was changed to use the base class function Instance.sentence_level_eval(). This introduced the bug (?) that 1 is added to self.source_length() also for the audio input case. (For text input, this is added because of sentence end).

This corresponds to 1ms only, however this changes the lagging_padding_mask in AverageLagging: all words predicted at the end of the sentence, i.e. after all audio was read, are now not taken into account for the AL computation. I observed the AL value to change from 5598ms to 3267ms for a very high latency system where most of the words are predicted at sentence end.

See the line:

lagging_padding_mask = delays >= src_lens.unsqueeze(1)

src_lens is now 1 greater than delays[-1], i.e. the audio length.

I think the old behaviour was correct, but please decide yourself 😬

opened by patrick-wilken 2
questions about AL and DAL

Hi! I was looking at the code SimulEval/simuleval/metrics/latency.py in order to understand how you implemented the metrics for latency that you defined in your paper about SimulEval. I have a few questions I'd like to ask: can you confirm that in the code the calculation of AL considers the reference length rather than the prediction length as you described in the paper in equation 8? If so, why did you not apply the same correction also to the calculation of DAL (both in the code and in the paper)?

Thank you very much in advance!

opened by Onoratofra 2

no module named

# Copyright (c) Facebook, Inc. and its affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

from simuleval.agents import TextAgent
from simuleval import READ_ACTION, WRITE_ACTION, DEFAULT_EOS
from utils.fun import function


class DummyWaitkTextAgent(TextAgent):

    data_type = "text"

    def __init__(self, args):
        super().__init__(args)
        self.waitk = args.waitk
        # Initialize your agent here, for example load model, vocab, etc

    @staticmethod
    def add_args(parser):
        # Add additional command line arguments here
        parser.add_argument("--waitk", type=int, default=3)

    def policy(self, states):
        # Make decision here
        print(states)
        if len(states.source) - len(states.target) < self.waitk and not states.finish_read():
            return READ_ACTION
        else:
            return WRITE_ACTION

    def predict(self, states):
        # predict token here
        if states.finish_read():
            if states.target.length() == states.source.length():
                return DEFAULT_EOS

        # return f"word_{len(states.target)}"
        return function(len(states.target))

And the content of the utils/fun.py is

def function(number):
    return 'word_' + number

The directory looks like

├── data
│   ├── src.txt
│   └── tgt.txt
└── dummy
    ├── dummy_waitk_text_agent.py
    └── utils
        ├── fun.py
        └── __init__.py

When I run the command simuleval --agent dummy_waitk_text_agent.py --source ../data/src.txt --target ../data/tgt.txt, it failed to load the utils module.

opened by Cppowboy 2

Fails to evaluate wait-k model on SimulEval if sequence is shorter than k

(Repost of the issue: https://github.com/pytorch/fairseq/issues/3445)

I've got an error while evaluating the wait-k model following the instruction of README.md.

2021-04-03 10:36:28 | INFO     | simuleval.scorer | Evaluating on text
2021-04-03 10:36:28 | INFO     | simuleval.scorer | Source: /userdir/iwslt17.test0.en
2021-04-03 10:36:28 | INFO     | simuleval.scorer | Target: /userdir/iwslt17.test0.ja
2021-04-03 10:36:28 | INFO     | simuleval.scorer | Number of sentences: 1549
2021-04-03 10:36:28 | INFO     | simuleval.server | Evaluation Server Started (process id 132622). Listening to port 12321
2021-04-03 10:36:31 | WARNING  | simuleval.scorer | Resetting scorer
2021-04-03 10:36:31 | INFO     | simuleval.cli    | Output dir: /userdir/iwslt17.test0
2021-04-03 10:36:31 | INFO     | simuleval.cli    | Start data writer (process id 132639)
2021-04-03 10:36:31 | INFO     | simuleval.cli    | Evaluating SimulTransTextAgentJA (process id 132556) on instances from 0 to 1548
2021-04-03 10:36:31 | INFO     | fairseq.tasks.translation | [en] dictionary: 16004 types
2021-04-03 10:36:31 | INFO     | fairseq.tasks.translation | [ja] dictionary: 16004 types
Traceback (most recent call last):
  File "/userdir/.venv/bin/simuleval", line 11, in <module>
    load_entry_point('simuleval', 'console_scripts', 'simuleval')()
  File "/userdir/SimulEval/simuleval/cli.py", line 165, in main
    _main(args.client_only)
  File "/userdir/SimulEval/simuleval/cli.py", line 192, in _main
    evaluate(args, client, server_process)
  File "/userdir/SimulEval/simuleval/cli.py", line 145, in evaluate
    decode(args, client, result_queue, indices)
  File "/userdir/SimulEval/simuleval/cli.py", line 108, in decode
    action = agent.policy(states)
  File "/userdir/fairseq-v0.10.2/examples/simultaneous_translation/eval/agents/simul_t2t_enja.py", line 196, in policy
    x, outputs = self.model.decoder.forward(
  File "/userdir/fairseq-v0.10.2/fairseq/models/transformer.py", line 817, in forward
    x, extra = self.extract_features(
  File "/userdir/fairseq-v0.10.2/examples/simultaneous_translation/models/transformer_monotonic_attention.py", line 219, in extract_features
    x, attn, _ = layer(
  File "/userdir/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/userdir/fairseq-v0.10.2/examples/simultaneous_translation/modules/monotonic_transformer_layer.py", line 160, in forward
    x, attn = self.encoder_attn(
  File "/userdir/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/userdir/fairseq-v0.10.2/examples/simultaneous_translation/modules/monotonic_multihead_attention.py", line 667, in forward
    alpha = self.expected_alignment_infer(
  File "/userdir/fairseq-v0.10.2/examples/simultaneous_translation/modules/monotonic_multihead_attention.py", line 528, in expected_alignment_infer
    assert tgt_len == 1
AssertionError

This problem occurs when the sequence is shorter than k.

Environment

fairseq Version: 0.10.2 (commit 14807a361202ba34dbbd3a533899db57a0ebda19)
SimulEval Version: latest (commit 1753363071f989ea3b79fdf5a21b96089a002f36)

opened by fury00812 1

bug in AL calculation

https://github.com/facebookresearch/SimulEval/blob/927072ebb7df83023adcc206b7458bf9c9aad3a2/simuleval/metrics/latency.py#L105

Dear developers, I just updated simuleval from 0.1.0 to newest version, and got signifitianlty lower AL score compared to older version. And here I found you made a small bug, range oracle_latency as range(1, ref_len+1), but based on AL definition in your paper, $d_i^{}$ should be $(i-1)\gamma$, thansk a lot.

opened by danliu2 1
Re-sync with internal repository

The internal and external repositories are out of sync. This attempts to brings them back in sync by patching the GitHub repository. Please carefully review this patch. You must disable ShipIt for your project in order to merge this pull request. DO NOT IMPORT this pull request. Instead, merge it directly on GitHub using the MERGE BUTTON. Re-enable ShipIt after merging.
CLA Signed fh:direct-merge-enabled

opened by facebook-github-bot 0
Add ATDScore

Added Average Token Delay (ATD) for a latency metric. paper: Average Token Delay: A Latency Metric for Simultaneous Translation (https://arxiv.org/abs/2211.13173)
CLA Signed

opened by master-possible 5
Connection refused when testing set is large
Hi, I noticed that when using dev/test set with large number of sentences (e.g. CoVoST2's dev/test generally have 15k sentences), the evaluation will hang due to the following error:

2021-11-27 03:49:51 | INFO | simuleval.scorer | Number of sentences: 15492 HTTPConnectionPool(host='localhost', port=12347): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fec33199ca0>: Failed to establish a new connection: [Errno 111] Connection refused'))

After some printing here and there, I suspect this error comes from here . however I'm not familiar with web app so I have no idea how to correct/avoid this error.

Is there any easy solution to this? Much appreciated. @xutaima

EDIT: Forgot to mention that this does not happen if I use a subset of sentences e.g. 1500 sentences.
opened by George0828Zhang 5
Character level latency for speech-to-text
Hello, Will this feature ever be updated?

2021-11-14 19:58:00 | ERROR | simuleval.scorer | Character level latency for speech-to-text model is not supported at the moment. We will update this feature very soon.

Also, I'm curious as to why this combination was not implemented, does it require additional handling? AFAIK, the only differenct between word and char is when calculating the reference length here. I'm trying to implemented myself, but I'm struggling to see the difference, would be great if you can provide some explanations or if this is updated. Thanks.
opened by George0828Zhang 1
Pre- and post-processing text in Simuleval
I am playing with the MMA-hard model to replicate WMT15 DE-EN experiments reported in the paper and my question is about preprocessing and postprocessing data. The paper says that:

For each dataset, we apply tokenization with the Moses (Koehn et al., 2007) tokenizer and preserve casing. We apply byte pair encoding (BPE) (Sennrich et al., 2016) jointly on the source and target to construct a shared vocabulary with 32K symbols

Following what is said above, I applied moses scripts to tokenize raw files and applied BPE to the tokenized files. Then, tokenized and BPE applied train, val and test files were binarized using following fairseq preprocess command:

fairseq-preprocess --source-lang de --target-lang en \ --trainpref ~/wmt15_de_en_32k/train --validpref ~/wmt15_de_en_32k/valid --testpref ~/wmt15_de_en_32k/test \ --destdir ~/wmt15_de_en_32k/data-bin/ \ --workers 20

Afer that, I trained a MMA-hard model using the binarized data. Now, I would like to evaluate (w.r.t. Latency and Bleu) a checkpoint using SimulEval. My first question is about the file format: Which format should I provide the test files as --source and --target to simuleval command? There are three options as far as I can see:

Using Raw files.

Using tokenized files

Using tokenized and bpe applied files.

I am following EN-JA waitk model's agent file to understand what should be done. However, the difference between the experiment I'd like to replicate and EN-JA experiment is that in EN-JA sentencepiece model is used for tokenization whereas in my case moses is used and also bpe is applied.

So, I tried following:

I provided path of TOKENIZED files as --source and --target to simuleval. Also, I've implemented segment_to_units and build_word_splitter functions as follows but I couldn't figure out how I should implement units_to_segment.

I tried to test this implementation as follows:

$ head -n 1 ~/wmt15_de_en_32k/tmp/test.de Die Premierminister Indiens und Japans trafen sich in Tokio . $ head -n 1 ~/wmt15_de_en_32k/tmp/test.en India and Japan prime ministers meet in Tokyo simuleval --agent mma-dummy/mmaAgent.py --source ~/wmt15_de_en_32k/tmp/test.de \ --target ~/wmt15_de_en_32k/tmp/test.en --data-bin ~/wmt15_de_en_32k/data-bin/ \ --model-path ~/checkpoints/checkpoint_best.pt --bpe_code ~/wmt15_de_en_32k/code

So, my questions are:

Is it correct to provide tokenized but not bpe applied test files as --source and --target to simuleval?

Do implementations of segment_to_units and build_word_splitter functions seem correct?

Could you please explain how units_to_segment and update_states_write should be implemented?

Edit: When I evaluate the best checkpoint on a subset of test-set using the above code I got the following output:

2021-09-19 22:10:08 | WARNING | sacrebleu | That's 100 lines that end in a tokenized period ('.') 2021-09-19 22:10:08 | WARNING | sacrebleu | It looks like you forgot to detokenize your test data, which may hurt your score. 2021-09-19 22:10:08 | WARNING | sacrebleu | If you insist your data is detokenized, or don't care, you can suppress this message with '--force'. 2021-09-19 22:10:08 | INFO | simuleval.cli | Evaluation results: { "Quality": { "BLEU": 6.068334932433579 }, "Latency": { "AL": 7.8185020314753055, "AP": 0.833324143320322, "DAL": 11.775593814849854 } }
opened by kurtisxx 22
Moses tokenizer/detokenizer in cli options

Hi, I'm wondering if you can add moses detokenizer in the cli options? More specifically, when post processing prediction and reference, the " ".join() function can be optionally replaced by sacremoses's detokenizer.

Thanks.

opened by George0828Zhang 0
visualization tool

Hi, I'm using SimulEval for evaluating my system and everything works. I've read in your paper that there is a visualization tool that can be run via: simuleval server --visual --log-dir $DIR but, when I try to run this command, it raises an error which is the request for the agent. Looking at your code I don't see any reference to this visualization tool, is it available at the moment? Thanks

opened by sarapapi 1

Releases(v1.0.2)

v1.0.2(Dec 7, 2022)

Legacy SimulEval main branch
Source code(tar.gz)
Source code(zip)

Owner

Facebook Research

GitHub

Simultaneous NMT/MMT framework in PyTorch

This repository includes the codes, the experiment configurations and the scripts to prepare/download data for the Simultaneous Machine Translation wi

37 Sep 29, 2022

Paper Title: Heterogeneous Knowledge Distillation for Simultaneous Infrared-Visible Image Fusion and Super-Resolution

HKDnet Paper Title: "Heterogeneous Knowledge Distillation for Simultaneous Infrared-Visible Image Fusion and Super-Resolution" Email: 18186470991@163.

11 Nov 12, 2022

This is the official repository for evaluation on the NoW Benchmark Dataset. The goal of the NoW benchmark is to introduce a standard evaluation metric to measure the accuracy and robustness of 3D face reconstruction methods from a single image under variations in viewing angle, lighting, and common occlusions.

NoW Evaluation This is the official repository for evaluation on the NoW Benchmark Dataset. The goal of the NoW benchmark is to introduce a standard e

71 Dec 30, 2022

Framework for joint representation learning, evaluation through multimodal registration and comparison with image translation based approaches

CoMIR: Contrastive Multimodal Image Representation for Registration Framework ?? Registration of images in different modalities with Deep Learning ??

55 Dec 9, 2022

An evaluation toolkit for voice conversion models.

Voice-conversion-evaluation An evaluation toolkit for voice conversion models. Sample test pair Generate the metadata for evaluating models. The direc

30 Aug 29, 2022

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit

CNTK Chat Windows build status Linux build status The Microsoft Cognitive Toolkit (https://cntk.ai) is a unified deep learning toolkit that describes

17.3k Dec 29, 2022

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit

CNTK Chat Windows build status Linux build status The Microsoft Cognitive Toolkit (https://cntk.ai) is a unified deep learning toolkit that describes

17k Feb 11, 2021

Machine learning evaluation metrics, implemented in Python, R, Haskell, and MATLAB / Octave

Note: the current releases of this toolbox are a beta release, to test working with Haskell's, Python's, and R's code repositories. Metrics provides i

1.6k Dec 26, 2022

A general 3D Object Detection codebase in PyTorch.

Det3D is the first 3D Object Detection toolbox which provides off the box implementations of many 3D object detection algorithms such as PointPillars, SECOND, PIXOR, etc, as well as state-of-the-art methods on major benchmarks like KITTI(ViP) and nuScenes(CBGS).

1.4k Jan 5, 2023

Implementation of self-attention mechanisms for general purpose. Focused on computer vision modules. Ongoing repository.

Self-attention building blocks for computer vision applications in PyTorch Implementation of self attention mechanisms for computer vision in PyTorch

962 Dec 23, 2022

Deploy tensorflow graphs for fast evaluation and export to tensorflow-less environments running numpy.

Deploy tensorflow graphs for fast evaluation and export to tensorflow-less environments running numpy. Now with tensorflow 1.0 support. Evaluation usa

349 Aug 6, 2022

Scikit-learn compatible estimation of general graphical models

skggm : Gaussian graphical models using the scikit-learn API In the last decade, learning networks that encode conditional independence relationships

213 Jan 2, 2023

(CVPR2021) ClassSR: A General Framework to Accelerate Super-Resolution Networks by Data Characteristic

ClassSR (CVPR2021) ClassSR: A General Framework to Accelerate Super-Resolution Networks by Data Characteristic Paper Authors: Xiangtao Kong, Hengyuan

308 Jan 5, 2023

Implementation of Perceiver, General Perception with Iterative Attention, in Pytorch

Perceiver - Pytorch Implementation of Perceiver, General Perception with Iterative Attention, in Pytorch Install $ pip install perceiver-pytorch Usage

876 Dec 29, 2022

We have implemented shaDow-GNN as a general and powerful pipeline for graph representation learning. For more details, please find our paper titled Deep Graph Neural Networks with Shallow Subgraph Samplers, available on arXiv (https//arxiv.org/abs/2012.01380).

Deep GNN, Shallow Sampling Hanqing Zeng, Muhan Zhang, Yinglong Xia, Ajitesh Srivastava, Andrey Malevich, Rajgopal Kannan, Viktor Prasanna, Long Jin, R

117 Dec 20, 2022

a general-purpose Transformer based vision backbone

Swin Transformer By Ze Liu*, Yutong Lin*, Yue Cao*, Han Hu*, Yixuan Wei, Zheng Zhang, Stephen Lin and Baining Guo. This repo is the official implement

9.9k Jan 8, 2023

Source code for the GPT-2 story generation models in the EMNLP 2020 paper "STORIUM: A Dataset and Evaluation Platform for Human-in-the-Loop Story Generation"

Storium GPT-2 Models This is the official repository for the GPT-2 models described in the EMNLP 2020 paper [STORIUM: A Dataset and Evaluation Platfor

27 Dec 20, 2022

Provided is code that demonstrates the training and evaluation of the work presented in the paper: "On the Detection of Digital Face Manipulation" published in CVPR 2020.

FFD Source Code Provided is code that demonstrates the training and evaluation of the work presented in the paper: "On the Detection of Digital Face M

88 Nov 22, 2022

GLM (General Language Model)

GLM GLM is a General Language Model pretrained with an autoregressive blank-filling objective and can be finetuned on various natural language underst

421 Jan 4, 2023

SIMULEVAL A General Evaluation Toolkit for Simultaneous Translation

Related tags

Overview

SimulEval

Requirement

Installation

Quick Start

License

Citation

Comments

Environment

Releases(v1.0.2)

v1.0.2(Dec 7, 2022)

Owner

Facebook Research

Simultaneous NMT/MMT framework in PyTorch

Paper Title: Heterogeneous Knowledge Distillation for Simultaneous Infrared-Visible Image Fusion and Super-Resolution

Framework for joint representation learning, evaluation through multimodal registration and comparison with image translation based approaches

An evaluation toolkit for voice conversion models.

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit

Machine learning evaluation metrics, implemented in Python, R, Haskell, and MATLAB / Octave

A general 3D Object Detection codebase in PyTorch.

Implementation of self-attention mechanisms for general purpose. Focused on computer vision modules. Ongoing repository.

Deploy tensorflow graphs for fast evaluation and export to tensorflow-less environments running numpy.

Scikit-learn compatible estimation of general graphical models

(CVPR2021) ClassSR: A General Framework to Accelerate Super-Resolution Networks by Data Characteristic

Implementation of Perceiver, General Perception with Iterative Attention, in Pytorch

We have implemented shaDow-GNN as a general and powerful pipeline for graph representation learning. For more details, please find our paper titled Deep Graph Neural Networks with Shallow Subgraph Samplers, available on arXiv (https//arxiv.org/abs/2012.01380).

a general-purpose Transformer based vision backbone

Source code for the GPT-2 story generation models in the EMNLP 2020 paper "STORIUM: A Dataset and Evaluation Platform for Human-in-the-Loop Story Generation"

Provided is code that demonstrates the training and evaluation of the work presented in the paper: "On the Detection of Digital Face Manipulation" published in CVPR 2020.

GLM (General Language Model)