Hello!
I have read the paper "Extract, Denoise and Enforce: Evaluating and Improving Concept Preservation for Text-to-Text Generation" and found it really interesting. I have tried to reproduce the results for Question Generation on the SQuAD dataset, but failed. My ROUGE-L score for DBA is 13.4818 and for DDBA it is 9.5297. Clearly I've done something dramatically wrong, and I would appreciate your help. Here are all the steps I've taken:
- I've downloaded the SQuAD dataset here. Then I've separated the source inputs and target outputs into train/val/test.source/target files. The train files contain the whole training set. The val and test files are identical and contain the dev set from the SQuAD website. The examples can be found here. (A sketch of the split script is included after this list.)
- I've run `python finetune.py`. I did not modify `finetune.py` or `conf.py`. The code completed successfully and saved all the checkpoints.
- To test the pipeline, I started with simple spaCy-generated constraints, which the paper refers to as "gold constraints". I used the `en_core_web_sm` spaCy model to extract entities, following the example here. The results were placed in a `constraint_kpe_em.json` file. You can check it here. (A sketch of the extraction script is included after this list.)
- Finally, for the evaluation I ran `python run_eval.py` and `python run_eval.py --partial True` to get the DBA and DDBA scores, respectively. I did not change anything in the `run_eval.py` file. The scores came out low, as mentioned above. (I've also added a snippet after this list showing how I double-check the ROUGE-L numbers offline.)
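In case the data preprocessing is the problem, here is roughly the script I used to build the `.source`/`.target` files from the official SQuAD JSON (first bullet above). The "answer [SEP] context" source format is my own guess at what the Question Generation setup expects, not something I took from the repo, so please correct me if the inputs should look different:

```python
import json

# Rough sketch of how I produced the .source/.target files from the official
# SQuAD v1.1 JSON. The "answer [SEP] context" -> "question" pairing is my own
# assumption about the expected Question Generation format.
def squad_to_pairs(path):
    with open(path) as f:
        data = json.load(f)["data"]
    sources, targets = [], []
    for article in data:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answer = qa["answers"][0]["text"] if qa["answers"] else ""
                sources.append(f"{answer} [SEP] {context}".replace("\n", " "))
                targets.append(qa["question"].replace("\n", " "))
    return sources, targets

def write_split(split, sources, targets):
    with open(f"{split}.source", "w") as fs, open(f"{split}.target", "w") as ft:
        fs.write("\n".join(sources) + "\n")
        ft.write("\n".join(targets) + "\n")

if __name__ == "__main__":
    train_src, train_tgt = squad_to_pairs("train-v1.1.json")
    dev_src, dev_tgt = squad_to_pairs("dev-v1.1.json")
    write_split("train", train_src, train_tgt)
    # As described above, val and test are identical copies of the dev set.
    write_split("val", dev_src, dev_tgt)
    write_split("test", dev_src, dev_tgt)
```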
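For the constraints (third bullet), this is roughly what my extraction script looks like. The one-list-of-entity-strings-per-example layout, and the fact that I extract from the source side rather than from the reference questions, are both my assumptions about what `run_eval.py` expects; if either is wrong, that alone might explain the low scores:

```python
import json

import spacy

# Sketch of how I built constraint_kpe_em.json with the small English spaCy
# model. Assumption: the file is a JSON list with one list of entity strings
# per test example, aligned line-by-line with test.source; I'm not certain this
# matches the format run_eval.py expects.
nlp = spacy.load("en_core_web_sm")

constraints = []
with open("test.source") as f:
    for line in f:
        doc = nlp(line.strip())
        # one constraint per named entity, deduplicated while preserving order
        constraints.append(list(dict.fromkeys(ent.text for ent in doc.ents)))

with open("constraint_kpe_em.json", "w") as f:
    json.dump(constraints, f)
```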
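And in case the discrepancy comes from the scoring itself rather than from generation, I can recompute ROUGE-L offline on whatever hypotheses `run_eval.py` produces. A minimal sketch using the `rouge-score` package (my own choice of scorer, not necessarily the one used in the repo or the paper); the prediction file name is a placeholder:

```python
from rouge_score import rouge_scorer  # pip install rouge-score


def avg_rouge_l(pred_path, target_path):
    """Average ROUGE-L F1 (in %) over line-aligned prediction/reference files."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    with open(pred_path) as fp, open(target_path) as ft:
        scores = [
            scorer.score(ref.strip(), hyp.strip())["rougeL"].fmeasure
            for ref, hyp in zip(ft, fp)
        ]
    return 100 * sum(scores) / len(scores)


# "dba_predictions.txt" is a placeholder; I'm not sure which file run_eval.py
# writes its hypotheses to.
print(avg_rouge_l("dba_predictions.txt", "test.target"))
```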
I am now working on automatic constraint generation and trying to apply this repo to the SQuAD dataset. Am I correct that in your repo you are using this code to create constraints? I couldn't figure out how to apply it to SQuAD, though.
However, given the low scores, I have a feeling that there is also something I may have done wrong in the steps described above. Were any special hyperparameters (different from the defaults) used in the paper for the Question Generation task? Could you please help me figure out what's wrong, or suggest what steps to take to get better scores?