Hello there, thank you for a great paper and piece of work!
I tried to train the multiencoder, but the raw data I got from https://github.com/Yale-LILY/QMSum seems to have a slightly different format.
preprocess.py fails to run: meeting_id is missing, and meeting_transcripts is expected to be a list of str, but the original data has a list of dict.
I can hack around this and change the format, introducing a dummy meeting_id and making the transcripts look as expected, but I wanted to check first whether I am missing something or whether there is a cleaner way to do it.
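For reference, the workaround I have in mind looks roughly like the sketch below. It is only a guess at what preprocess.py wants: the function names adapt_record and adapt_jsonl are my own, the "speaker" and "content" keys and the "speaker: content" string layout are assumptions about the raw data and the expected input, and the meeting_id values are dummies derived from the line index.

```python
import json


def adapt_record(record, index):
    """Return a copy of one raw record in the shape preprocess.py seems to expect."""
    adapted = dict(record)
    # Dummy ID (assumption): the raw data has no meeting_id, so use the line index.
    adapted.setdefault("meeting_id", f"meeting_{index}")
    turns = []
    for turn in record.get("meeting_transcripts", []):
        if isinstance(turn, dict):
            # Assumed keys in the raw data: "speaker" and "content".
            turns.append(f"{turn.get('speaker', '')}: {turn.get('content', '')}")
        else:
            # Already a plain string (or something else): keep it as text.
            turns.append(str(turn))
    adapted["meeting_transcripts"] = turns
    return adapted


def adapt_jsonl(src_path, dst_path):
    """Rewrite a jsonl file line by line, skipping blank lines."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for index, line in enumerate(src):
            if line.strip():
                adapted = adapt_record(json.loads(line), index)
                dst.write(json.dumps(adapted) + "\n")
```

I would run adapt_jsonl on each of the raw jsonl files and point preprocess.py at the outputs, but I would rather not do this if the expected data is meant to come from somewhere else.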
My question is: before running preprocess.py, should one just take the jsonl files from https://github.com/Yale-LILY/QMSum, or is additional or different data expected beyond a simple transform of the original data?
Thank you in advance!