Unified MultiWOZ evaluation scripts for the context-to-response task.

Overview

MultiWOZ Context-to-Response Evaluation

Standardized and easy-to-use Inform & Success rates and BLEU

~ See the paper ~

Easy-to-use scripts for standardized evaluation of response generation on the MultiWOZ benchmark. This repository contains an implementation of the MultiWOZ database with fuzzy matching, functions for normalizing slot names and values, and a careful implementation of the BLEU score and the Inform & Success rates.

🚀 Usage

Install the repository:

pip install git+https://github.com/Tomiinek/MultiWOZ_Evaluation.git@master

Use it directly from your code. Instantiate an evaluator and then call its evaluate method with a dictionary of your predictions in a specific format (described later). Set bleu to evaluate the BLEU score, success to get the Inform & Success rates, and richness to get lexical richness metrics such as the number of unique unigrams and trigrams, token entropy, bigram conditional entropy, corpus MSTTR-50, and average turn length. Pseudo-code:

from mwzeval.metrics import Evaluator
...

e = Evaluator(bleu=True, success=False, richness=False)

# Collect predictions keyed by dialogue id (input format described below).
my_predictions = {}
for item in data:
    my_predictions[item.dialog_id] = model.predict(item)
    ...

results = e.evaluate(my_predictions)
print(f"Epoch {epoch} BLEU: {results['bleu']}")

Alternative usage:

git clone https://github.com/Tomiinek/MultiWOZ_Evaluation.git && cd MultiWOZ_Evaluation
pip install -r requirements.txt

Then evaluate your predictions from an input file:

python evaluate.py [--bleu] [--success] [--richness] --input INPUT.json [--output OUTPUT.json]

Set the options --bleu, --success, and --richness as needed; an example invocation follows.
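For instance, to compute BLEU and the Inform & Success rates for a file of predictions (the path predictions/augpt.json here is illustrative; any file in the input format below works):

python evaluate.py --bleu --success --input predictions/augpt.json --output results.json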

Input format:

{
    "xxx0000" : [
        {
            "response": "Your generated delexicalized response.",
            "state": {
                "restaurant" : {
                    "food" : "eatable"
                }, ...
            }, 
            "active_domains": ["restaurant"]
        }, ...
    ], ...
}

The input to the evaluator should be a dictionary (or a .json file) with keys matching dialogue ids in the xxx0000 format (e.g. sng0073 instead of SNG0073.json), and values containing a list of turns. Each turn is a dictionary with keys:

  • response – Your generated delexicalized response. You can use either slot names prefixed with domain names, e.g. restaurant_food, or the domain-adaptive delexicalization scheme, e.g. food.

  • state – Optional, the predicted dialog state. If not present (for example, in the case of policy optimization models), the ground-truth dialog state from MultiWOZ 2.2 is used during the Inform & Success computation. Slot names and values are normalized prior to use.

  • active_domains – Optional, a list of active domains for the corresponding turn. If not present, the active domains are estimated from changes in the dialog state during the Inform & Success rate computation. If your model predicts the domain of each turn, place the predictions here. If you use domains in slot names, run the following command to extract the active domains from the slot names automatically:

    python add_slot_domains.py [-h] -i INPUT.json -o OUTPUT.json

See the predictions folder for examples.
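Putting the input format into code, a minimal predictions dictionary for a single dialogue might look as follows in Python (the dialogue id, response text, and values are purely illustrative):

my_predictions = {
    "sng0073": [
        {
            "response": "there are several restaurants serving [food] food in the centre .",
            "state": {"restaurant": {"food": "chinese"}},  # optional
            "active_domains": ["restaurant"],              # optional
        },
        # ... one dictionary per generated turn
    ]
}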

Output format:

{
    "bleu" : {"damd": … , "uniconv": … , "hdsa": … , "lava": … , "augpt": … , "mwz22": … },
    "success" : {
        "inform"  : {"attraction": … , "hotel": … , "restaurant": … , "taxi": … , "total": … , "train": … },
        "success" : {"attraction": … , "hotel": … , "restaurant": … , "taxi": … , "total": … , "train": … },
    },
    "richness" : {
        "entropy": … , "cond_entropy": … , "avg_lengths": … , "msttr": … ,
        "num_unigrams": … , "num_bigrams": … , "num_trigrams": …
    }
}

The evaluation script outputs a dictionary with keys bleu, success, and richness corresponding to the BLEU score, the Inform & Success rates, and the lexical richness metrics, respectively. Their values are None if not evaluated; otherwise:

  • BLEU results contain multiple scores corresponding to different delexicalization styles and references. Currently included references are DAMD, HDSA, AuGPT, LAVA, UniConv, and MultiWOZ 2.2, which we consider to be the canonical one that should be reported in the future.
  • Inform & Success rates are reported for each domain (i.e. attraction, restaurant, hotel, taxi, and train in the case of the test set) separately and in total.
  • Lexical richness contains the number of distinct uni-, bi-, and tri-grams, the average number of tokens in generated responses, token entropy, conditional bigram entropy, and MSTTR-50 calculated on concatenated responses.
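For instance, the canonical numbers to report can be read out of the result dictionary as follows (a minimal sketch, assuming the evaluator was created with bleu=True and success=True):

results = e.evaluate(my_predictions)
print(f"BLEU (mwz22): {results['bleu']['mwz22']}")
print(f"Inform: {results['success']['inform']['total']}")
print(f"Success: {results['success']['success']['total']}")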

Secret feature

You can use this code even for the evaluation of dialogue state tracking (DST) on MultiWOZ 2.2. Set dst=True during initialization of the Evaluator to get joint state accuracy and slot precision, recall, and F1. Note that the resulting numbers are very different from the DST results in the original MultiWOZ evaluation because we use slot name and value normalization and careful fuzzy slot value matching.
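A minimal sketch of the DST evaluation (each turn of the predictions must then carry the state field described above; the exact layout of the returned dst entry is not documented here, so treat the last line as illustrative):

from mwzeval.metrics import Evaluator

e = Evaluator(bleu=False, success=False, richness=False, dst=True)
results = e.evaluate(my_predictions)  # turns need "state" entries
print(results["dst"])  # joint state accuracy, slot precision, recall, F1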

🏆 Results

Please see the original MultiWOZ repository for the benchmark results.

👏 Contributing

  • If you would like to add your results, modify the particular table in the original repository via a pull request, add the file with your predictions to the predictions folder in this repository, and create another pull request here.
  • If you need to update the slot name mapping because of your different delexicalization style, feel free to make the changes, and create a pull request.
  • If you would like to improve normalization of slot values, add your new rules, and create a pull request.

💭 Citation

@inproceedings{nekvinda-dusek-2021-shades,
    title = "Shades of {BLEU}, Flavours of Success: The Case of {M}ulti{WOZ}",
    author = "Nekvinda, Tom{\'a}{\v{s}} and Du{\v{s}}ek, Ond{\v{r}}ej",
    booktitle = "Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.gem-1.4",
    doi = "10.18653/v1/2021.gem-1.4",
    pages = "34--46"
}

Comments
  • How to generate the prediction file?

    Hi, thanks for your great work. I wonder how to generate the prediction results file. For example, there are no related scripts in the UBAR code; how do I generate the predicted results in the expected format?

    opened by SkyAndCloud 10
  • ConvLab pipeline approaches

    Hi,

    I am investigating pipeline task-oriented agents using the ConvLab-2 toolkit and I have some doubts regarding its results. I think that their MultiWOZ evaluation method is not the same, although it uses the same metrics. Have you ever evaluated any ConvLab-2 agent using your scripts?

    Ref: https://github.com/thu-coai/ConvLab-2

    opened by yukioichida 2
  • Does the old evaluation need the generated system actions or generated DB entries?

    Following the question by @SkyAndCloud in #1 and our email communication:

    In the new evaluation, you noted in README.md that the inputs are: 1) generated response, 2) generated belief states, 3) active domains ... It does not need the generated system actions or the generated DB entries to calculate the metrics. My question is: in the old evaluation scripts (I mean the original implementation inside MultiWOZ, https://github.com/budzianowski/multiwoz/blob/master/evaluate.py), are the inputs these same three? Does the old evaluation need the generated system actions or generated DB entries to calculate the metrics?

    opened by Tomiinek 2
  • Add RSTOD predictions.

    Hello, we've used your standardized scripts to (re-)evaluate our RSTOD architecture on MultiWOZ 2.2. We're also opening a pull request in the official dataset repository to report the benchmark results.

    Please check it out :) Best Regards, Radostin

    opened by radi-cho 1
  • Add GALAXY Prediction File

    Hi,

    Our GALAXY model uses your standardized scoring scripts to conduct new evaluations on both the end-to-end modeling and policy optimization tasks. We have already reported the new SOTA results in the official MultiWOZ repo, so we upload two prediction files here: 'galaxy-e2e.json' for end-to-end generation and 'galaxy-policy.json' for policy optimization.

    Please check it. Thanks a lot!

    Best, Wanwei

    opened by HwwAncient 1
  • Is this an error or am I misunderstanding something?

    # Go through the venues returned by the API call matching the dialog state of the current turn and
    # use them as the list of possibly offered venues if and only if any of them is not present in the
    # list of possibly offered venues from the previous dialog turn
    if current_domain not in offered_venues or len(offered_venues[current_domain]) == 0:
        offered_venues[current_domain] = matching_venues
    else:
        if any(venue not in matching_venues for venue in offered_venues[current_domain]):
            offered_venues[current_domain] = matching_venues

    According to the comment above, is the condition supposed to be any(venue not in offered_venues[current_domain] for venue in matching_venues)?

    opened by tomyoung903 1
  • Add PPTOD Prediction File

    Hello Guys,

    Just uploaded the prediction file from our new paper, PPTOD (https://arxiv.org/abs/2109.14739). We also updated our scores in the main repo (https://github.com/budzianowski/multiwoz).

    Thank you very much!

    Best,

    Yixuan

    opened by yxuansu 1
  • Add Mars & BORT predictions

    Hello,

    We uploaded the prediction files from our papers Mars (https://arxiv.org/abs/2210.08917) and BORT (https://aclanthology.org/2022.findings-naacl.166/). We also updated our scores in the main repo (https://github.com/budzianowski/multiwoz).

    Thank you very much!

    Best,

    Haipeng

    opened by hpsun1109 0
  • Adding DST evaluation.

    This PR adds the possibility of evaluating dialog state tracking on the MultiWOZ 2.2 test set.

    The implementation involves slot value normalization and careful fuzzy matching, so it produces much higher accuracy than other implementations.

    It calculates:

    • joint accuracy -- the percentage of turns that have exactly the same dialog state as the reference
    • slot F1, precision, recall, where:
      • TP are slots mentioned in the reference and prediction with matching value
      • TN are not calculated at all, but they would correspond to slots mentioned neither in the reference nor in the prediction
      • FN are slots that are not mentioned in the prediction but are in the reference
      • FP are slots mentioned in the prediction but not in the reference, or mentioned in both but not matching by value
    opened by Tomiinek 0
  • Problems getting perfect metrics

    I am testing the UBAR method on MultiWOZ 2.2 and created a parser to convert the generated sequence back into a state dict. First I loaded the ground-truth data with datasets and parsed it into a "predicted" dictionary, then I computed the metrics:

    from datasets import load_dataset
    from mwzeval.metrics import Evaluator

    datasets = load_dataset("json", data_files={
                "train": "data/multiwoz/train/encoded.json",
                "valid": "data/multiwoz/dev/encoded.json",
                "test": "data/multiwoz/test/encoded.json",
            })
    
    predicted = {}
    
    for d in datasets["test"]:
        id = d["id"].rstrip(".json").lower()
        turns = []
        for belief in d["text"].split("<sos_b>")[1:]:
            bs = parse_state(belief.split("<eos_b>")[0])
            response = belief.split("<sos_r>")[1].split("<eos_r>")[0]
            state = {"response": response, "state":{}}
            for k,v in bs:
                state["state"][k] = v
            turns.append(state)
        predicted[id] = turns
    
    e = Evaluator(bleu=True, success=True, richness=True)
    results = e.evaluate(predicted)
    print(results)
    
    >>> {'bleu': {'mwz22': 99.15078876656015},
    'success': {'inform': {'total': 93.0, 'train': 95.6, 'restaurant': 95.9, 'hotel': 95.9, 'taxi': 100.0, 'attraction': 96.0}, 'success': {'total': 88.1, 'train': 89.1, 'restaurant': 90.2, 'hotel': 87.6, 'taxi': 90.8, 'attraction': 90.7}},
    'richness': {'entropy': 7.218217822046144, 'cond_entropy': 3.3791865228994378, 'avg_lengths': 14.094411285946826, 'msttr': 0.7501539942252144, 'num_unigrams': 1467, 'num_bigrams': 11614, 'num_trigrams': 25497}, 'dst': None}
    

    I didn't get 100% for inform and success.

    Then I compared it with the golden states:

    import json
    from pprint import pprint

    with open("venv/lib/python3.10/site-packages/mwzeval/data/gold_states.json") as fin:
        data = json.load(fin)
    
    counter = 0
    for key in predicted:
        for i, value in enumerate(predicted[key]):
            if value["state"] != data[key][i]:
                counter += 1
                print(key, i)
                pprint(value["response"])
                pprint(value["state"])
                pprint(data[key][i])
    if not counter:
        print("100% matched")
    

    As output I got that everything matched 100%. Is this the expected behavior? Am I missing something?

    opened by jadermcs 0