Hello, I'm Steven, from Johns Hopkins University. I'm currently working on a research project studying different methods to denoise the training data for low-resource languages. I came across your papers (DDS, TCS, and multiDDS) and I'm very interested in your implementation. I started checking this code repo very carefully and found some issues (I sort of "fixed" them in my own way in a forked repo; if you think it's useful to incorporate them into your repo, I can submit a pull request for you to review my changes). Here are the issues:
fairseq beam search is out of date
The `torch.div` call in fairseq/search.py relies on deprecated behavior, so I updated it using the beam-search code from the most recent fairseq.
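For context, the change boils down to something like this minimal sketch (the variable names follow fairseq's `BeamSearch.step`; the exact lines may differ between versions):

```python
import torch

indices_buf = torch.tensor([7, 12, 23])  # flat (beam * vocab) candidate indices
vocab_size = 10

# Old pattern: torch.div(indices_buf, vocab_size) relied on the deprecated
# integer floor-division behavior of torch.div.
# Updated pattern: make the flooring explicit (PyTorch >= 1.8).
beams_buf = torch.div(indices_buf, vocab_size, rounding_mode="floor")
indices_buf = indices_buf.fmod(vocab_size)

print(beams_buf)    # tensor([0, 1, 2]) -> beam index of each candidate
print(indices_buf)  # tensor([7, 2, 3]) -> token id within the vocab
```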
Undefined variables in `trainer.py`'s `update_language_sampler()`
I think this is the most important part of the code, since you calculate the gradient similarity between the training set and the dev set to get the similarity scores that update the language distribution. There are some undefined or unused variables, such as `self.optimizers` and `all_sim_list`. I changed the code so that it only uses a single `sim_list` vector, though theoretically there should be an N*N similarity matrix (N being the number of language pairs), and that's why you need `all_sim_list` to collect the individual `sim_list`s, right? My change only helps me run my own code, since I'm using just one pair of languages instead of a multilingual setting, but I think it shouldn't be hard to fix properly; you might just have left those variables there by accident.
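If it helps, here is roughly how I pictured the multilingual bookkeeping being restored; apart from the names `sim_list` and `all_sim_list`, everything in this sketch (the gradient lists, the cosine helper) is my own guess rather than your code:

```python
import torch
import torch.nn.functional as F

def grad_cosine(g_train: torch.Tensor, g_dev: torch.Tensor) -> float:
    """Cosine similarity between two flattened gradient vectors."""
    return F.cosine_similarity(g_train, g_dev, dim=0).item()

def build_all_sim_list(train_grads, dev_grads):
    """Collect one sim_list per dev pair into an N x N all_sim_list.

    train_grads / dev_grads: lists holding one flattened gradient vector
    per language pair (hypothetical inputs, for illustration only).
    """
    all_sim_list = []
    for g_dev in dev_grads:
        sim_list = [grad_cosine(g_train, g_dev) for g_train in train_grads]
        all_sim_list.append(sim_list)
    return all_sim_list
```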
The generator is not reporting the score properly
It seems that if I use `--sacrebleu` to generate, the result is not a string but a `BLEUScore` object: `| Generate test with beam=5: <sacrebleu.metrics.bleu.BLEUScore object at 0x7fec308a75b0>`. I'm not sure what causes the object's default repr to be printed instead of the formatted score.
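For what it's worth, here is a minimal sketch of the workaround I tried, assuming the scorer ultimately calls `sacrebleu.corpus_bleu`, which returns a `BLEUScore` object in recent sacrebleu versions:

```python
import sacrebleu

hyps = ["the cat sat on the mat"]
refs = [["the cat sat on the mat"]]  # one list of references per reference set

bleu = sacrebleu.corpus_bleu(hyps, refs)

print(bleu.score)    # the raw BLEU value as a float, e.g. 100.0
print(bleu.format()) # a formatted result string instead of the bare object
```

So printing `bleu.score` or `bleu.format()` (instead of the object itself) fixes the log line on my end.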
The code is not working with the `ave` type `data_actor`
Since I'm more interested in a one-pair setting than in multilingual input, I want the scorer to work directly on `src_tokens` and `trg_tokens`, which is the method you proposed in the DDS paper. If I interpret your code correctly, this block should never be run, right?
```python
# data selection: reset epoch iter to filter out unselected data
if epoch_itr.epoch == args.select_by_dds_epoch and args.select_by_dds_epoch > 0:
    epoch_itr, _ = trainer.get_filtered_train_iterator(
        epoch_itr.epoch, filtered_maxpos_indices=filtered_maxpos_indices)
```
Since I want to work with data filtering, and I realized that the `base` data actor only sees language IDs instead of real tokens, I have to use the `ave` type. To make it work, I changed your initialization steps (basically I added an `elif self.args.data_actor == 'ave':` branch and an Adam optimizer for it in your trainer.py). I'm not sure whether this modification is correct, but `select_by_dds_epoch` works after the change. So I just want some confirmation/help from you that this is indeed the correct way to implement data filtering with the `ave` data actor.
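Concretely, my change looks roughly like the sketch below. To be clear, the `AveEmbActor` internals and the learning rate here are my own stand-ins for illustration; only the `'ave'` string comes from your args:

```python
import torch
import torch.nn as nn

class AveEmbActor(nn.Module):
    """Hypothetical stand-in for the ave data actor: scores a batch
    directly from src/trg tokens via averaged embeddings."""
    def __init__(self, vocab_size: int = 1000, emb_dim: int = 64):
        super().__init__()
        self.src_emb = nn.Embedding(vocab_size, emb_dim)
        self.trg_emb = nn.Embedding(vocab_size, emb_dim)
        self.score = nn.Linear(2 * emb_dim, 1)

    def forward(self, src_tokens, trg_tokens):
        src = self.src_emb(src_tokens).mean(dim=1)  # average over time steps
        trg = self.trg_emb(trg_tokens).mean(dim=1)
        return self.score(torch.cat([src, trg], dim=-1))

# The initialization branch I added in trainer.py, roughly:
args_data_actor = 'ave'  # stand-in for self.args.data_actor
if args_data_actor == 'ave':
    data_actor = AveEmbActor()
    data_optimizer = torch.optim.Adam(data_actor.parameters(), lr=1e-4)
```

The idea is just that the `ave` branch gets its own actor and Adam optimizer during trainer initialization, the same way the `base` branch does.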
Last questions
I'm just curious what `--utility-type` in the args is used for; I didn't find where it's triggered when I debugged through my console. Also, could you share the training script/hyperparameters you used for DDS (Optimizing Data Usage via Differentiable Rewards)? I want to train directly on one pair of languages and replicate your results.
I'm really impressed by how well you modified the fairseq toolkit and incorporated the reinforcement optimization into the data loading. If I have any misunderstanding about your methods or code implementation, please let me know. Also, please let me know if you'd like me to submit a pull request so you can review my changes more easily. Thank you in advance for your help!