UNION
Automatic evaluation metric described in the paper UNION: An UNreferenced MetrIc for Evaluating Open-eNded Story Generation (EMNLP 2020). Please refer to the Paper List for more information about Open-eNded Language Generation (ONLG) tasks. Hopefully the paper list will help you learn more about this field.
Prerequisites
The code is written with the TensorFlow library. To use the program, the following prerequisites need to be installed.
- Python 3.7.0
- tensorflow-gpu 1.14.0
- numpy 1.18.1
- regex 2020.2.20
- nltk 3.4.5
Computing Infrastructure
We train UNION based on the platform:
- OS: Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-98-generic x86_64)
- GPU: NVIDIA TITAN Xp
Quick Start
1. Constructing Negative Samples
Execute the following command:
cd ./Data
python3 ./get_vocab.py your_mode
python3 ./gen_train_data.py your_mode
- `your_mode` is `roc` for the ROCStories corpus or `wp` for the WritingPrompts dataset. Then the summary of the vocabulary with the corresponding frequency and POS tagging will be found under `ROCStories/ini_data/entity_vocab.txt` or `WritingPrompts/ini_data/entity_vocab.txt`.
- Negative samples and human-written stories will be constructed based on the original training set. The training set will be found under `ROCStories/train_data` or `WritingPrompts/train_data`.
- Note: currently only 10 samples of the full original data and training data are provided. The full data can be downloaded from THUcloud or GoogleDrive.
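
`gen_train_data.py` implements the actual construction. Purely as an illustration of what turning a human-written story into a negative sample can look like, here is a minimal Python sketch of one possible perturbation, antonym substitution, assuming a simple word-to-antonym dictionary built from the antonym triples in `conceptnet_antonym.txt` (the real script and file format differ):

```python
import random

def substitute_antonyms(story_tokens, antonym_dict, ratio=0.15):
    """Illustrative perturbation: replace a fraction of the tokens that have
    a known antonym, turning a human-written story into a negative sample."""
    perturbed = list(story_tokens)
    candidates = [i for i, tok in enumerate(perturbed) if tok.lower() in antonym_dict]
    random.shuffle(candidates)
    n_swaps = max(1, int(ratio * len(candidates))) if candidates else 0
    for i in candidates[:n_swaps]:
        perturbed[i] = antonym_dict[perturbed[i].lower()]
    return perturbed

# Toy usage with a hand-made antonym dictionary (hypothetical entries).
antonyms = {"happy": "sad", "remembered": "forgot"}
story = "She was happy because she remembered her keys .".split()
print(" ".join(substitute_antonyms(story, antonyms)))
```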
2. Training of UNION
Execute the following command:
python3 ./run_union.py --data_dir your_data_dir \
--output_dir ./model/union \
--task_name train \
--init_checkpoint ./model/uncased_L-12_H-768_A-12/bert_model.ckpt
- `your_data_dir` is `./Data/ROCStories` or `./Data/WritingPrompts`.
- The initial checkpoint of BERT can be downloaded from bert. We use the uncased base version of BERT (about 110M parameters). We train the model for 40000 steps at most. The training process will take about 1~2 days.
3. Prediction with UNION
Execute the following command:
python3 ./run_union.py --data_dir your_data_dir \
--output_dir ./model/output \
--task_name pred \
--init_checkpoint your_model_name
- `your_data_dir` is `./Data/ROCStories` or `./Data/WritingPrompts`. If you want to evaluate your custom texts, you only need to change your file format into ours (see the sketch after this list).
- `your_model_name` is `./model/union_roc/union_roc` or `./model/union_wp/union_wp`. The fine-tuned checkpoint can be downloaded from the following link:
Dataset | Fine-tuned Model |
---|---|
ROCStories | THUcloud; GoogleDrive |
WritingPrompts | THUcloud; GoogleDrive |
- The UNION score of the stories under `your_data_dir/ant_data` can be found under the output directory `./model/output`.
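
For custom texts, the expected layout is the `Story ID ||| Story ||| Seven Annotated Scores` format described in the Data Instruction section below. A minimal sketch of writing custom stories into that layout, assuming the seven scores are space-separated and can be filled with placeholders when no human annotation exists (an assumption; only the story text is needed for scoring):

```python
def write_ant_data(stories, path):
    """Write stories as `Story ID ||| Story ||| Seven Annotated Scores` lines.
    The seven scores are placeholders (assumption: not used when task_name is pred)."""
    with open(path, "w", encoding="utf-8") as f:
        for idx, story in enumerate(stories):
            placeholder_scores = " ".join(["1"] * 7)
            f.write(f"{idx} ||| {story.strip()} ||| {placeholder_scores}\n")

# Toy usage: two custom stories written to an illustrative path under your_data_dir/ant_data.
write_ant_data(
    ["My first generated story ...", "My second generated story ..."],
    "./Data/ROCStories/ant_data/ant_data.txt",
)
```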
4. Correlation Calculation
Execute the following command:
python3 ./correlation.py your_mode
Then the correlation between the human judgments under `your_data_dir/ant_data` and the scores of metrics under `your_data_dir/metric_output` will be output. The figures under `./figure` show the score graph between metric scores and human judgments for the ROCStories corpus.
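
`correlation.py` performs this computation. Just to illustrate the idea, a simplified sketch of computing correlations between per-story human judgments and metric scores with SciPy (the exact correlation measures and score aggregation used by the script are not reproduced here):

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

def correlations(human_scores, metric_scores):
    """Correlation between human judgments and metric scores, one value per story."""
    return {
        "pearson": pearsonr(human_scores, metric_scores)[0],
        "spearman": spearmanr(human_scores, metric_scores)[0],
        "kendall": kendalltau(human_scores, metric_scores)[0],
    }

# Toy usage: binary human labels vs. metric scores for five stories.
print(correlations([0, 1, 1, 0, 1], [0.2, 0.8, 0.7, 0.4, 0.9]))
```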
Data Instruction for files under `./Data`
├── Data
└── `negation.txt` # manually constructed negation word vocabulary.
└── `conceptnet_antonym.txt` # triples with antonym relations extracted from ConceptNet.
└── `conceptnet_entity.csv` # entities acquired from ConceptNet.
└── `ROCStories`
├── `ant_data` # sampled stories and corresponding human annotation.
└── `ant_data.txt` # includes only the binary annotation for reasonable (1) or unreasonable (0).
└── `ant_data_all.txt` # includes the annotation for specific error types: reasonable (0), repeated plots (1), bad coherence (2), conflicting logic (3), chaotic scenes (4), and others (5).
└── `reference.txt` # human-written stories with the same leading context as the annotated stories.
└── `reference_ipt.txt`
└── `reference_opt.txt`
├── `ini_data` # original dataset for training/validation/testing.
└── `train.txt`
└── `dev.txt`
└── `test.txt`
└── `entity_vocab.txt` # generated by `get_vocab.py`, consisting of all the entities and the corresponding tagged POS followed by the mention frequency in the dataset.
├── `train_data` # negative samples and corresponding human-written stories for training, which are constructed by `gen_train_data.py`.
└── `train_human.txt`
└── `train_negative.txt`
└── `dev_human.txt`
└── `dev_negative.txt`
└── `test_human.txt`
└── `test_negative.txt`
├── `metric_output` # the scores of different metrics, which can be used to replicate the correlation in Table 5 of the paper.
└── `bleu.txt`
└── `bleurt.txt`
└── `ppl.txt` # the sign of the result of Perplexity needs to be changed to get the result for *minus* Perplexity.
└── `union.txt`
└── `union_recon.txt` # the ablated model without the reconstruction task
└── ...
└── `WritingPrompts`
├── ...
- The annotated data files `ant_data.txt` and `ant_data_all.txt` are formatted as `Story ID ||| Story ||| Seven Annotated Scores`. `ant_data_all.txt` is only available for the ROCStories corpus; for the WritingPrompts dataset, `ant_data_all.txt` is the same as `ant_data.txt`.
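
A minimal sketch of parsing these files, assuming the seven scores are whitespace-separated integers; the majority-vote aggregation is only an example, not necessarily what `correlation.py` does:

```python
def read_ant_data(path):
    """Parse `Story ID ||| Story ||| Seven Annotated Scores` lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            story_id, story, score_str = (field.strip() for field in line.split("|||"))
            records.append((story_id, story, [int(s) for s in score_str.split()]))
    return records

# Toy usage: majority vote over the seven binary annotations in ant_data.txt.
for story_id, story, scores in read_ant_data("./Data/ROCStories/ant_data/ant_data.txt"):
    label = 1 if sum(scores) > len(scores) / 2 else 0
    print(story_id, label)
```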
Citation
Please kindly cite our paper if you find the paper and the code helpful.
@misc{guan2020union,
title={UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation},
author={Jian Guan and Minlie Huang},
year={2020},
eprint={2009.07602},
archivePrefix={arXiv},
primaryClass={cs.CL}
}