I am playing with the MMA-hard model, trying to replicate the WMT15 DE-EN experiments reported in the paper, and my question is about preprocessing and postprocessing the data. The paper says:
> For each dataset, we apply tokenization with the Moses (Koehn et al., 2007) tokenizer and preserve casing. We apply byte pair encoding (BPE) (Sennrich et al., 2016) jointly on the source and target to construct a shared vocabulary with 32K symbols.
Following the above, I applied the Moses scripts to tokenize the raw files and then applied BPE to the tokenized files. The tokenized, BPE-applied train, valid, and test files were then binarized with the following fairseq-preprocess command:
fairseq-preprocess --source-lang de --target-lang en \
--trainpref ~/wmt15_de_en_32k/train --validpref ~/wmt15_de_en_32k/valid --testpref ~/wmt15_de_en_32k/test \
--destdir ~/wmt15_de_en_32k/data-bin/ \
--workers 20
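For context, the tokenize-and-BPE step before this binarization is roughly equivalent to the Python sketch below (I actually ran the Moses perl scripts and the subword-nmt CLI; the sacremoses port and the file names here are just placeholders):

```python
from sacremoses import MosesTokenizer
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

tok = {"de": MosesTokenizer(lang="de"), "en": MosesTokenizer(lang="en")}

# 1) Moses-tokenize every split, preserving casing (no truecasing step).
for split in ("train", "valid", "test"):
    for lang in ("de", "en"):
        with open(f"{split}.{lang}") as fin, open(f"{split}.tok.{lang}", "w") as fout:
            for line in fin:
                fout.write(tok[lang].tokenize(line.strip(), return_str=True) + "\n")

# 2) Learn joint BPE codes (32K merge operations) on the concatenation of
#    both sides of the tokenized training data.
with open("train.tok.deen", "w") as fout:
    for lang in ("de", "en"):
        with open(f"train.tok.{lang}") as fin:
            fout.write(fin.read())
with open("train.tok.deen") as fin, open("code", "w") as fcodes:
    learn_bpe(fin, fcodes, num_symbols=32000)

# 3) Apply the shared codes to both sides of every split; the *.bpe.* files
#    are what fairseq-preprocess then binarizes.
with open("code") as fcodes:
    bpe = BPE(fcodes)
for split in ("train", "valid", "test"):
    for lang in ("de", "en"):
        with open(f"{split}.tok.{lang}") as fin, open(f"{split}.bpe.{lang}", "w") as fout:
            for line in fin:
                fout.write(bpe.process_line(line))
```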
After that, I trained an MMA-hard model on the binarized data. Now I would like to evaluate a checkpoint (w.r.t. latency and BLEU) using SimulEval. My first question is about the file format: in which format should I provide the test files passed as --source and --target to the simuleval command? As far as I can see, there are three options:
- Raw files.
- Tokenized files.
- Tokenized and BPE-applied files.
I am following the EN-JA wait-k model's agent file to understand what should be done. However, the difference between the experiment I'd like to replicate and the EN-JA one is that EN-JA uses a SentencePiece model for tokenization, whereas in my case Moses is used for tokenization and BPE is applied on top.
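Concretely, the two schemes mark subword boundaries differently, which is exactly what segment_to_units and units_to_segment have to deal with; a quick illustration (the codes path is a placeholder, and the exact split depends on the learned merges):

```python
from subword_nmt.apply_bpe import BPE

with open("code") as f:  # joint BPE codes from preprocessing
    bpe = BPE(f)

print(bpe.process_line("Premierminister"))
# e.g. "Premier@@ minister": subword-nmt marks word CONTINUATIONS with "@@",
# whereas SentencePiece (as in the EN-JA example) marks word BEGINNINGS with "▁".
```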
So, I tried the following:
I provided the paths of the TOKENIZED files as --source and --target to simuleval. I have also implemented the segment_to_units and build_word_splitter functions along the lines below, but I couldn't figure out how I should implement units_to_segment.
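Here is a simplified sketch of what I have (I reuse the method signatures from the EN-JA agent and subword-nmt's apply_bpe.BPE; --bpe_code is an argument my agent adds itself):

```python
from subword_nmt.apply_bpe import BPE

def build_word_splitter(self, args):
    # Load the joint BPE codes learned during preprocessing.
    with open(args.bpe_code, encoding="utf-8") as f:
        self.bpe = BPE(f)

def segment_to_units(self, segment, states):
    # A segment is one Moses token from the source stream; split it into
    # its BPE pieces, e.g. "Premierminister" -> ["Premier@@", "minister"].
    return self.bpe.process_line(segment).split()
```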
I tried to test this implementation as follows:
$ head -n 1 ~/wmt15_de_en_32k/tmp/test.de
Die Premierminister Indiens und Japans trafen sich in Tokio .
$ head -n 1 ~/wmt15_de_en_32k/tmp/test.en
India and Japan prime ministers meet in Tokyo
simuleval --agent mma-dummy/mmaAgent.py --source ~/wmt15_de_en_32k/tmp/test.de \
--target ~/wmt15_de_en_32k/tmp/test.en --data-bin ~/wmt15_de_en_32k/data-bin/ \
--model-path ~/checkpoints/checkpoint_best.pt --bpe_code ~/wmt15_de_en_32k/code
So, my questions are:
- Is it correct to provide tokenized but not BPE-applied test files as --source and --target to simuleval?
- Do the implementations of segment_to_units and build_word_splitter seem correct?
- Could you please explain how units_to_segment and update_states_write should be implemented?
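For units_to_segment, my current (untested) guess, modeled on the EN-JA agent, is to buffer units until one arrives without the "@@" continuation marker and then merge them; I have no guess at all for update_states_write:

```python
def units_to_segment(self, units, states):
    # Untested guess. I assume units.value is the list of subword strings
    # queued so far, units.pop() consumes the oldest one, and returning
    # None tells SimulEval that no complete word is ready yet.
    if len(units.value) == 0 or units.value[-1].endswith("@@"):
        return None

    # The newest unit ends a word: strip the markers and merge the buffer
    # back into one full word.
    word = "".join(u.replace("@@", "") for u in units.value)
    for _ in range(len(units.value)):
        units.pop()
    return word
```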
Edit: When I evaluate the best checkpoint on a subset of the test set using the setup above, I get the following output:
2021-09-19 22:10:08 | WARNING | sacrebleu | That's 100 lines that end in a tokenized period ('.')
2021-09-19 22:10:08 | WARNING | sacrebleu | It looks like you forgot to detokenize your test data, which may hurt your score.
2021-09-19 22:10:08 | WARNING | sacrebleu | If you insist your data is detokenized, or don't care, you can suppress this message with '--force'.
2021-09-19 22:10:08 | INFO | simuleval.cli | Evaluation results:
{
"Quality": {
"BLEU": 6.068334932433579
},
"Latency": {
"AL": 7.8185020314753055,
"AP": 0.833324143320322,
"DAL": 11.775593814849854
}
}
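Given the sacrebleu warning, I suspect BLEU here is computed on tokenized output; the detokenization it is complaining about would be something like this (a sacremoses sketch, not wired into my agent yet):

```python
from sacremoses import MosesDetokenizer

detok = MosesDetokenizer(lang="en")
print(detok.detokenize("India and Japan prime ministers meet in Tokyo .".split()))
# -> "India and Japan prime ministers meet in Tokyo."
```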