Overview

Cross-Attention Transfer for Machine Translation

This repo hosts the code to accompany the camera-ready version of "Cross-Attention is All You Need: Adapting Pretrained Transformers for Machine Translation" in EMNLP 2021.

Setup

We provide our scripts and our modifications to Fairseq. This section describes how to run the code and, for instance, reproduce Table 2 in the paper.

Data

To view the data as we prepared and used it, switch to the main branch. However, we recommend cloning the code from this branch to avoid downloading a large amount of data at once; you can always pull any data you need from the main branch later.
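
For example, getting the code from this branch and later pulling individual data files from the main branch could look like the sketch below (the repository URL, branch name, and file path are placeholders to substitute with your own):

    git clone -b <code-branch> https://github.com/<user>/xattn-transfer-for-mt.git
    cd xattn-transfer-for-mt
    # fetch a single data file from the main branch when you need it
    git checkout origin/main -- De-En/iwslt16.train.de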

Installations

We worked in a conda environment with Python 3.8.
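
To mirror that setup, a minimal sketch (the environment name is arbitrary):

    conda create -n xattn-mt python=3.8
    conda activate xattn-mt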

  • First install the requirements.
      pip install -r requirements.txt
  • Then install Fairseq. To have the option to modify the package, install it in editable mode.
      cd fairseq-modified
      pip install -e .
  • Finally, set the following environment variable.
      export FAIRSEQ=$PWD
      cd ..
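  • Optionally, sanity-check that the editable install resolves to the modified copy and that the variable is set.
      python -c "import fairseq; print(fairseq.__file__)"  # should point inside fairseq-modified/
      echo $FAIRSEQ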

Experiments

For the purpose of this walk-through, we assume we want to train a De–En model, using the following data:

De-En
├── iwslt13.test.de
├── iwslt13.test.en
├── iwslt13.test.tok.de
├── iwslt13.test.tok.en
├── iwslt15.tune.de
├── iwslt15.tune.en
├── iwslt15.tune.tok.de
├── iwslt15.tune.tok.en
├── iwslt16.train.de
├── iwslt16.train.en
├── iwslt16.train.tok.de
└── iwslt16.train.tok.en

by transferring from a Fr–En parent model whose experiment files are stored under FrEn/checkpoints.
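
For reference, we assume the parent experiment files are laid out roughly as follows (the checkpoint file name is illustrative; only the files referenced below are shown):

FrEn
├── checkpoints
│   └── checkpoint_best.pt
└── data
    ├── tgt.sentencepiece.bpe.model
    └── tgt.sentencepiece.bpe.vocab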

  • Start by making an experiment folder and preprocessing the data.
      mkdir test_exp
      ./xattn-transfer-for-mt/scripts/data_preprocessing/prepare_bi.sh \
          de en test_exp/ \
          De-En/iwslt16.train.tok De-En/iwslt15.tune.tok De-En/iwslt13.test.tok \
          8000
    Please note that prepare_bi.sh is written for the most general case, where you learn vocabularies for both the source and target sides. Modify it as needed and reuse whatever vocabulary you want. In this case, for example, since we are transferring from Fr–En to De–En, we reuse the target-side vocabulary from the parent. So 8000 refers to the source vocabulary size, and instead of learning a target vocabulary in the script we copy the parent's into the experiment's data directory (referred to as $DATA below):
      cp ./FrEn/data/tgt.sentencepiece.bpe.model $DATA
      cp ./FrEn/data/tgt.sentencepiece.bpe.vocab $DATA
  • Now you can run an experiment. Here we want to update only the source embeddings and the cross-attention, so we run the corresponding script (the script names are self-explanatory). Set the correct path to the desired parent model checkpoint in the script, and run:
      bash ./xattn-transfer-for-mt/scripts/training/reinit-src-embeddings-and-finetune-parent-model-on-translation_src+xattn.sh \
          test_exp/ de en
  • Finally, after training, evaluate your model. Set the correct path to the detokenizer that you use in the script, and:
      bash ./xattn-transfer-for-mt/scripts/evaluation/decode_and_score_valid_and_test.sh \
          test_exp/ de en \
          $PWD/De-En/iwslt15.tune.en $PWD/De-En/iwslt13.test.en
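
Putting the three steps above together, an end-to-end run could look roughly like the sketch below (the paths, the parent checkpoint location set inside the training script, and the destination the parent vocabulary is copied to are assumptions to adjust for your setup):

    #!/bin/bash
    set -e

    EXP=test_exp
    mkdir -p $EXP

    # 1. Preprocess: learn an 8k source vocabulary; the target vocabulary is
    #    reused from the parent (modify prepare_bi.sh accordingly, as noted above).
    ./xattn-transfer-for-mt/scripts/data_preprocessing/prepare_bi.sh \
        de en $EXP/ \
        De-En/iwslt16.train.tok De-En/iwslt15.tune.tok De-En/iwslt13.test.tok \
        8000
    # Copy the parent's target vocabulary into the experiment's data directory
    # (the exact destination depends on how prepare_bi.sh lays out $EXP).
    cp ./FrEn/data/tgt.sentencepiece.bpe.model ./FrEn/data/tgt.sentencepiece.bpe.vocab $EXP/data/

    # 2. Train: update only the source embeddings and the cross-attention.
    #    The parent checkpoint path (e.g. FrEn/checkpoints/checkpoint_best.pt) is set inside the script.
    bash ./xattn-transfer-for-mt/scripts/training/reinit-src-embeddings-and-finetune-parent-model-on-translation_src+xattn.sh \
        $EXP/ de en

    # 3. Evaluate on the tuning and test references (the detokenizer path is set inside the script).
    bash ./xattn-transfer-for-mt/scripts/evaluation/decode_and_score_valid_and_test.sh \
        $EXP/ de en \
        $PWD/De-En/iwslt15.tune.en $PWD/De-En/iwslt13.test.en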

Issues

Please contact us and report any problems you might face through the issues tab of the repo. Thanks in advance for helping us improve the repo!

Credits

The main body of code is built upon Fairseq. We found it very easy to navigate and modify. Kudos to the developers!
The data preprocessing scripts are adapted from the FLORES scripts.
To fit mBART into the memory of the GPUs we worked with, we used the trimming solution provided here.

Citation

@inproceedings{gheini-cross-attention,
  title = "Cross-Attention is All You Need: {A}dapting Pretrained {T}ransformers for Machine Translation",
  author = "Gheini, Mozhdeh and Ren, Xiang and May, Jonathan",
  booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
  month = nov,
  year = "2021"
}
Comments
  • Model fails to train with training scripts

    Hi there, I am doing an experiment with En-Th as the parent language pair and En-Vn as the child language pair. I have managed to replace the source vocab with the one from the parent language pair and have changed the path to the parent checkpoint. However, when running the script reinit-src-embeddings-and-finetune-parent-model-on-translation_src+xattn.sh, I get an error which I suspect is related to the parent model checkpoint. The full error message is as follows:

    Exception:
    
    -- Process 0 terminated with the following error:
    Traceback (most recent call last):
      File "/ldap_home/claudia.ong/miniconda3/envs/xattn/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
        fn(i, *args)
      File "/data/ldap_home/claudia.ong/xattn/xattn-transfer-for-mt_claudia/fairseq-modified/fairseq/distributed_utils.py", line 224, in distributed_main
        main(args, **kwargs)
      File "/data/ldap_home/claudia.ong/xattn/xattn-transfer-for-mt_claudia/fairseq-modified/fairseq_cli/train.py", line 84, in main
        param.copy_(pretrained_state_dict[name])
    KeyError: 'encoder.layers.0.self_attn.k_proj.weight'
    

    Any help will be appreciated, thanks!

    opened by coloteong 10
  • Error when resuming training from mBART checkpoint

    Hi,

    I'm trying to run xattn-transfer-for-mt/scripts/training/reinit-src-embeddings-and-finetune-parent-model-on-translation_src+xattn.sh on my own dataset with the mBART checkpoint (instead of the Fr–En checkpoint), but when I do I get this error:

    [--encoder-ffn-embed-dim ENCODER_FFN_EMBED_DIM] [... the rest of train.py's usage listing of supported arguments omitted ...] data
    train.py: error: unrecognized arguments: --min-lr -1 --load-model-but-src-embeddings-and-freeze-tgt-embeddings-from ../FrEn/checkpoints/checkpoint_best.pt --only-finetune-cross-attn

    Why does it not recognize some of the arguments? And why does it load the FrEn checkpoint when I set the path to the desired model in the script?

    Any input would be much appreciated.

    opened by theamato 3
  • STEP: 'Start by making an experiment folder and preprocessing the data' error

    cp: cannot stat 'De-En/iwslt15.tune.tok.de': No such file or directory
    cp: cannot stat 'De-En/iwslt15.tune.tok.en': No such file or directory
    cp: cannot stat 'De-En/iwslt13.test.tok.de': No such file or directory
    cp: cannot stat 'De-En/iwslt13.test.tok.en': No such file or directory
    cp: cannot stat 'De-En/iwslt16.train.tok.de': No such file or directory
    cp: cannot stat 'De-En/iwslt16.train.tok.en': No such file or directory
    ...
    FileNotFoundError: [Errno 2] No such file or directory: 'test_exp//data/train.bpe.de'

    How to solve this problem?

    opened by zozni 2
  • TypeError: forward() missing 1 required positional argument: 'prev_output_tokens'

    Hi,

    When running finetune-mbart-on-translation_embed+xattn.sh I get the error TypeError: forward() missing 1 required positional argument: 'prev_output_tokens' at the beginning of epoch 1. When checking the dictionary that is passed into forward, there is no key called 'prev_output_tokens', so I suppose this is what it is complaining about. Moreover, 'target' is None. Any idea what could cause this?

    I'm working in a conda environment with python 3.7.15.

    This is what the arguments passed into forward look like:

    samples {'id': tensor([67186, 27642,  9526, 57293, 27958, 19522, 48434, 26559]), 'nsentences': 8, 'ntokens': 312, 'net_input': {'src_tokens': tensor([[  91, 1176,  598,  ...,    5,    2,    3], ..., [ 482,  562,  495,  ...,    5,    2,    3]]), 'src_lengths': tensor([39, 39, 39, 39, 39, 39, 39, 39])}, 'target': None}

    And this is the whole terminal output when running the script: 2022-11-20 15:22:11 | INFO | fairseq_cli.train | Namespace(activation_dropout=0.0, activation_fn='gelu', adam_betas='(0.9, 0.98)', adam_eps=1e-06, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, arch='mbart_large', attention_dropout=0.1, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_suffix='', clip_norm=0.0, cpu=False, criterion='label_smoothed_cross_entropy', cross_self_attention=False, curriculum=0, data='data-bin', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layerdrop=0, decoder_layers=12, decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_normalize_before=True, decoder_output_dim=1024, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_num_procs=1, distributed_port=-1, distributed_rank=0, distributed_world_size=1, distributed_wrapper='DDP', dropout=0.3, empty_cache_freq=0, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layerdrop=0, encoder_layers=12, encoder_layers_to_keep=None, encoder_learned_pos=True, encoder_normalize_before=True, end_learning_rate=0.0, eval_bleu=True, eval_bleu_args=None, eval_bleu_detok='moses', eval_bleu_detok_args=None, eval_bleu_print_samples=False, eval_bleu_remove_bpe=None, eval_tokenized_bleu=False, fast_stat_sync=False, find_unused_parameters=False, finetune_from_mbart_at='/proj/uppmax2022-2-18/cross_attn/cross_attn/trimmed_mbart_new', finetune_from_mbart_with_reinit_xattn_at=None, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, freeze_pretrained_transformer_body=False, keep_best_checkpoints=-1, keep_interval_updates=1, keep_last_epochs=-1, label_smoothing=0.2, langs='ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN', layernorm_embedding=True, left_pad_source='True', left_pad_target='False', load_alignments=False, load_model_but_src_embeddings_and_freeze_tgt_embeddings_from=None, load_model_but_src_embeddings_and_xattn_and_freeze_tgt_embeddings_from=None, load_model_but_tgt_embeddings_and_freeze_src_embeddings_from=None, load_model_but_tgt_embeddings_and_xattn_and_freeze_src_embeddings_from=None, localsgd_frequency=3, log_format='simple', log_interval=20, lr=[3e-05], lr_scheduler='polynomial_decay', max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=512, max_tokens_valid=512, max_update=150000, maximize_best_checkpoint_metric=False, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1.0, model_parallel_size=1, no_cross_attention=False, no_epoch_checkpoints=True, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_scale_embedding=False, no_seed_provided=False, no_token_positional_embeddings=False, nprocs_per_node=1, num_batch_buckets=0, num_workers=8, only_finetune_cross_attn=True, optimizer='adam', optimizer_overrides='{}', patience=25, pipeline_balance=None, 
pipeline_checkpoint='never', pipeline_chunks=None, pipeline_devices=None, pipeline_model_parallel=False, pooler_activation_fn='tanh', pooler_dropout=0.0, power=1.0, prepend_bos=False, profile=False, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, quantization_config_path=None, relu_dropout=0.0, required_batch_size_multiple=8, required_seq_len_multiple=1, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints', save_interval=1, save_interval_updates=500, scoring='bleu', seed=222, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=True, skip_invalid_size_inputs_valid_test=True, slowmo_algorithm='LocalSGD', slowmo_momentum=None, source_lang='som', stop_time_hours=0, target_lang='en', task='translation_from_pretrained_bart', tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, total_num_update=1000000, tpu=False, train_subset='train', truncate_source=False, update_freq=[8], upsample_primary=1, use_bmuf=False, use_old_adam=False, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=0, warmup_updates=2500, weight_decay=0.0, zero_sharding='none') 2022-11-20 15:22:11 | INFO | fairseq.tasks.translation | [som] dictionary: 10000 types 2022-11-20 15:22:11 | INFO | fairseq.tasks.translation | [en] dictionary: 10000 types 2022-11-20 15:22:11 | INFO | fairseq.data.data_utils | loaded 9106 examples from: data-bin/valid.som-en.som 2022-11-20 15:22:11 | INFO | fairseq.data.data_utils | loaded 9106 examples from: data-bin/valid.som-en.en 2022-11-20 15:22:11 | INFO | fairseq.tasks.translation | data-bin valid som-en 9106 examples 2022-11-20 15:22:28 | INFO | fairseq_cli.train | loading the (trimmed) mbart from /proj/uppmax2022-2-18/cross_attn/cross_attn/trimmed_mbart_new 2022-11-20 15:23:02 | INFO | fairseq_cli.train | only cross-attention layers will be trained in addition to embeddings, freezing all other parameters 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter encoder.embed_tokens.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter encoder.embed_positions.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter encoder.layernorm_embedding.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter encoder.layernorm_embedding.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.embed_positions.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layernorm_embedding.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layernorm_embedding.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.0.encoder_attn.k_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.0.encoder_attn.k_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.0.encoder_attn.v_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.0.encoder_attn.v_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.0.encoder_attn.q_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.0.encoder_attn.q_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | 
parameter decoder.layers.0.encoder_attn.out_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.0.encoder_attn.out_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.0.encoder_attn_layer_norm.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.0.encoder_attn_layer_norm.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.1.encoder_attn.k_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.1.encoder_attn.k_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.1.encoder_attn.v_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.1.encoder_attn.v_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.1.encoder_attn.q_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.1.encoder_attn.q_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.1.encoder_attn.out_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.1.encoder_attn.out_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.1.encoder_attn_layer_norm.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.1.encoder_attn_layer_norm.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.2.encoder_attn.k_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.2.encoder_attn.k_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.2.encoder_attn.v_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.2.encoder_attn.v_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.2.encoder_attn.q_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.2.encoder_attn.q_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.2.encoder_attn.out_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.2.encoder_attn.out_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.2.encoder_attn_layer_norm.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.2.encoder_attn_layer_norm.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.3.encoder_attn.k_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.3.encoder_attn.k_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.3.encoder_attn.v_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.3.encoder_attn.v_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.3.encoder_attn.q_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.3.encoder_attn.q_proj.bias will be trained 2022-11-20 15:23:02 | INFO 
| fairseq_cli.train | parameter decoder.layers.3.encoder_attn.out_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.3.encoder_attn.out_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.3.encoder_attn_layer_norm.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.3.encoder_attn_layer_norm.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.4.encoder_attn.k_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.4.encoder_attn.k_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.4.encoder_attn.v_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.4.encoder_attn.v_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.4.encoder_attn.q_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.4.encoder_attn.q_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.4.encoder_attn.out_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.4.encoder_attn.out_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.4.encoder_attn_layer_norm.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.4.encoder_attn_layer_norm.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.5.encoder_attn.k_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.5.encoder_attn.k_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.5.encoder_attn.v_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.5.encoder_attn.v_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.5.encoder_attn.q_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.5.encoder_attn.q_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.5.encoder_attn.out_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.5.encoder_attn.out_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.5.encoder_attn_layer_norm.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.5.encoder_attn_layer_norm.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.6.encoder_attn.k_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.6.encoder_attn.k_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.6.encoder_attn.v_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.6.encoder_attn.v_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.6.encoder_attn.q_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.6.encoder_attn.q_proj.bias will be trained 
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.6.encoder_attn.out_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.6.encoder_attn.out_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.6.encoder_attn_layer_norm.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.6.encoder_attn_layer_norm.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.7.encoder_attn.k_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.7.encoder_attn.k_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.7.encoder_attn.v_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.7.encoder_attn.v_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.7.encoder_attn.q_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.7.encoder_attn.q_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.7.encoder_attn.out_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.7.encoder_attn.out_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.7.encoder_attn_layer_norm.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.7.encoder_attn_layer_norm.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.8.encoder_attn.k_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.8.encoder_attn.k_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.8.encoder_attn.v_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.8.encoder_attn.v_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.8.encoder_attn.q_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.8.encoder_attn.q_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.8.encoder_attn.out_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.8.encoder_attn.out_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.8.encoder_attn_layer_norm.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.8.encoder_attn_layer_norm.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.9.encoder_attn.k_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.9.encoder_attn.k_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.9.encoder_attn.v_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.9.encoder_attn.v_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.9.encoder_attn.q_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter 
decoder.layers.9.encoder_attn.q_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.9.encoder_attn.out_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.9.encoder_attn.out_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.9.encoder_attn_layer_norm.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.9.encoder_attn_layer_norm.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.10.encoder_attn.k_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.10.encoder_attn.k_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.10.encoder_attn.v_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.10.encoder_attn.v_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.10.encoder_attn.q_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.10.encoder_attn.q_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.10.encoder_attn.out_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.10.encoder_attn.out_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.10.encoder_attn_layer_norm.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.10.encoder_attn_layer_norm.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.11.encoder_attn.k_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.11.encoder_attn.k_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.11.encoder_attn.v_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.11.encoder_attn.v_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.11.encoder_attn.q_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.11.encoder_attn.q_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.11.encoder_attn.out_proj.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.11.encoder_attn.out_proj.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.11.encoder_attn_layer_norm.weight will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.11.encoder_attn_layer_norm.bias will be trained 2022-11-20 15:23:02 | INFO | fairseq_cli.train | BARTModel( (encoder): TransformerEncoder( (dropout_module): FairseqDropout() (embed_tokens): Embedding(10026, 1024, padding_idx=1) (embed_positions): LearnedPositionalEmbedding(1026, 1024, padding_idx=1) (layernorm_embedding): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (layers): ModuleList( (0): TransformerEncoderLayer( (self_attn): MultiheadAttention( (dropout_module): FairseqDropout() (k_proj): Linear(in_features=1024, out_features=1024, bias=True) (v_proj): Linear(in_features=1024, out_features=1024, bias=True) (q_proj): 
Linear(in_features=1024, out_features=1024, bias=True) (out_proj): Linear(in_features=1024, out_features=1024, bias=True) ) (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (dropout_module): FairseqDropout() (activation_dropout_module): FairseqDropout() (fc1): Linear(in_features=1024, out_features=4096, bias=True) (fc2): Linear(in_features=4096, out_features=1024, bias=True) (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) ) (1): TransformerEncoderLayer( (self_attn): MultiheadAttention( (dropout_module): FairseqDropout() (k_proj): Linear(in_features=1024, out_features=1024, bias=True) (v_proj): Linear(in_features=1024, out_features=1024, bias=True) (q_proj): Linear(in_features=1024, out_features=1024, bias=True) (out_proj): Linear(in_features=1024, out_features=1024, bias=True) ) (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (dropout_module): FairseqDropout() (activation_dropout_module): FairseqDropout() (fc1): Linear(in_features=1024, out_features=4096, bias=True) (fc2): Linear(in_features=4096, out_features=1024, bias=True) (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) ) (2): TransformerEncoderLayer( (self_attn): MultiheadAttention( (dropout_module): FairseqDropout() (k_proj): Linear(in_features=1024, out_features=1024, bias=True) (v_proj): Linear(in_features=1024, out_features=1024, bias=True) (q_proj): Linear(in_features=1024, out_features=1024, bias=True) (out_proj): Linear(in_features=1024, out_features=1024, bias=True) ) (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (dropout_module): FairseqDropout() (activation_dropout_module): FairseqDropout() (fc1): Linear(in_features=1024, out_features=4096, bias=True) (fc2): Linear(in_features=4096, out_features=1024, bias=True) (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) ) (3): TransformerEncoderLayer( (self_attn): MultiheadAttention( (dropout_module): FairseqDropout() (k_proj): Linear(in_features=1024, out_features=1024, bias=True) (v_proj): Linear(in_features=1024, out_features=1024, bias=True) (q_proj): Linear(in_features=1024, out_features=1024, bias=True) (out_proj): Linear(in_features=1024, out_features=1024, bias=True) ) (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (dropout_module): FairseqDropout() (activation_dropout_module): FairseqDropout() (fc1): Linear(in_features=1024, out_features=4096, bias=True) (fc2): Linear(in_features=4096, out_features=1024, bias=True) (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) ) (4): TransformerEncoderLayer( (self_attn): MultiheadAttention( (dropout_module): FairseqDropout() (k_proj): Linear(in_features=1024, out_features=1024, bias=True) (v_proj): Linear(in_features=1024, out_features=1024, bias=True) (q_proj): Linear(in_features=1024, out_features=1024, bias=True) (out_proj): Linear(in_features=1024, out_features=1024, bias=True) ) (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (dropout_module): FairseqDropout() (activation_dropout_module): FairseqDropout() (fc1): Linear(in_features=1024, out_features=4096, bias=True) (fc2): Linear(in_features=4096, out_features=1024, bias=True) (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) ) (5): TransformerEncoderLayer( (self_attn): MultiheadAttention( (dropout_module): FairseqDropout() (k_proj): Linear(in_features=1024, out_features=1024, bias=True) (v_proj): 
Linear(in_features=1024, out_features=1024, bias=True) (q_proj): Linear(in_features=1024, out_features=1024, bias=True) (out_proj): Linear(in_features=1024, out_features=1024, bias=True) ) (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (dropout_module): FairseqDropout() (activation_dropout_module): FairseqDropout() (fc1): Linear(in_features=1024, out_features=4096, bias=True) (fc2): Linear(in_features=4096, out_features=1024, bias=True) (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) ) (6): TransformerEncoderLayer( (self_attn): MultiheadAttention( (dropout_module): FairseqDropout() (k_proj): Linear(in_features=1024, out_features=1024, bias=True) (v_proj): Linear(in_features=1024, out_features=1024, bias=True) (q_proj): Linear(in_features=1024, out_features=1024, bias=True) (out_proj): Linear(in_features=1024, out_features=1024, bias=True) ) (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (dropout_module): FairseqDropout() (activation_dropout_module): FairseqDropout() (fc1): Linear(in_features=1024, out_features=4096, bias=True) (fc2): Linear(in_features=4096, out_features=1024, bias=True) (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) ) (7): TransformerEncoderLayer( (self_attn): MultiheadAttention( (dropout_module): FairseqDropout() (k_proj): Linear(in_features=1024, out_features=1024, bias=True) (v_proj): Linear(in_features=1024, out_features=1024, bias=True) (q_proj): Linear(in_features=1024, out_features=1024, bias=True) (out_proj): Linear(in_features=1024, out_features=1024, bias=True) ) (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (dropout_module): FairseqDropout() (activation_dropout_module): FairseqDropout() (fc1): Linear(in_features=1024, out_features=4096, bias=True) (fc2): Linear(in_features=4096, out_features=1024, bias=True) (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) ) (8): TransformerEncoderLayer( (self_attn): MultiheadAttention( (dropout_module): FairseqDropout() (k_proj): Linear(in_features=1024, out_features=1024, bias=True) (v_proj): Linear(in_features=1024, out_features=1024, bias=True) (q_proj): Linear(in_features=1024, out_features=1024, bias=True) (out_proj): Linear(in_features=1024, out_features=1024, bias=True) ) (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (dropout_module): FairseqDropout() (activation_dropout_module): FairseqDropout() (fc1): Linear(in_features=1024, out_features=4096, bias=True) (fc2): Linear(in_features=4096, out_features=1024, bias=True) (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) ) (9): TransformerEncoderLayer( (self_attn): MultiheadAttention( (dropout_module): FairseqDropout() (k_proj): Linear(in_features=1024, out_features=1024, bias=True) (v_proj): Linear(in_features=1024, out_features=1024, bias=True) (q_proj): Linear(in_features=1024, out_features=1024, bias=True) (out_proj): Linear(in_features=1024, out_features=1024, bias=True) ) (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (dropout_module): FairseqDropout() (activation_dropout_module): FairseqDropout() (fc1): Linear(in_features=1024, out_features=4096, bias=True) (fc2): Linear(in_features=4096, out_features=1024, bias=True) (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) ) (10): TransformerEncoderLayer( (self_attn): MultiheadAttention( (dropout_module): FairseqDropout() (k_proj): 
Linear(in_features=1024, out_features=1024, bias=True) (v_proj): Linear(in_features=1024, out_features=1024, bias=True) (q_proj): Linear(in_features=1024, out_features=1024, bias=True) (out_proj): Linear(in_features=1024, out_features=1024, bias=True) ) (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (dropout_module): FairseqDropout() (activation_dropout_module): FairseqDropout() (fc1): Linear(in_features=1024, out_features=4096, bias=True) (fc2): Linear(in_features=4096, out_features=1024, bias=True) (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) ) (11): TransformerEncoderLayer( (self_attn): MultiheadAttention( (dropout_module): FairseqDropout() (k_proj): Linear(in_features=1024, out_features=1024, bias=True) (v_proj): Linear(in_features=1024, out_features=1024, bias=True) (q_proj): Linear(in_features=1024, out_features=1024, bias=True) (out_proj): Linear(in_features=1024, out_features=1024, bias=True) ) (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (dropout_module): FairseqDropout() (activation_dropout_module): FairseqDropout() (fc1): Linear(in_features=1024, out_features=4096, bias=True) (fc2): Linear(in_features=4096, out_features=1024, bias=True) (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) ) ) (layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) ) (decoder): TransformerDecoder( (dropout_module): FairseqDropout() (embed_tokens): Embedding(10026, 1024, padding_idx=1) (embed_positions): LearnedPositionalEmbedding(1026, 1024, padding_idx=1) (layernorm_embedding): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (layers): ModuleList( (0): TransformerDecoderLayer( (dropout_module): FairseqDropout() (self_attn): MultiheadAttention( (dropout_module): FairseqDropout() (k_proj): Linear(in_features=1024, out_features=1024, bias=True) (v_proj): Linear(in_features=1024, out_features=1024, bias=True) (q_proj): Linear(in_features=1024, out_features=1024, bias=True) (out_proj): Linear(in_features=1024, out_features=1024, bias=True) ) (activation_dropout_module): FairseqDropout() (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (encoder_attn): MultiheadAttention( (dropout_module): FairseqDropout() (k_proj): Linear(in_features=1024, out_features=1024, bias=True) (v_proj): Linear(in_features=1024, out_features=1024, bias=True) (q_proj): Linear(in_features=1024, out_features=1024, bias=True) (out_proj): Linear(in_features=1024, out_features=1024, bias=True) ) (encoder_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (fc1): Linear(in_features=1024, out_features=4096, bias=True) (fc2): Linear(in_features=4096, out_features=1024, bias=True) (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) ) (1): TransformerDecoderLayer( (dropout_module): FairseqDropout() (self_attn): MultiheadAttention( (dropout_module): FairseqDropout() (k_proj): Linear(in_features=1024, out_features=1024, bias=True) (v_proj): Linear(in_features=1024, out_features=1024, bias=True) (q_proj): Linear(in_features=1024, out_features=1024, bias=True) (out_proj): Linear(in_features=1024, out_features=1024, bias=True) ) (activation_dropout_module): FairseqDropout() (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (encoder_attn): MultiheadAttention( (dropout_module): FairseqDropout() (k_proj): Linear(in_features=1024, out_features=1024, bias=True) (v_proj): Linear(in_features=1024, 
out_features=1024, bias=True) (q_proj): Linear(in_features=1024, out_features=1024, bias=True) (out_proj): Linear(in_features=1024, out_features=1024, bias=True) ) (encoder_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (fc1): Linear(in_features=1024, out_features=4096, bias=True) (fc2): Linear(in_features=4096, out_features=1024, bias=True) (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) ) (2): TransformerDecoderLayer( (dropout_module): FairseqDropout() (self_attn): MultiheadAttention( (dropout_module): FairseqDropout() (k_proj): Linear(in_features=1024, out_features=1024, bias=True) (v_proj): Linear(in_features=1024, out_features=1024, bias=True) (q_proj): Linear(in_features=1024, out_features=1024, bias=True) (out_proj): Linear(in_features=1024, out_features=1024, bias=True) ) (activation_dropout_module): FairseqDropout() (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (encoder_attn): MultiheadAttention( (dropout_module): FairseqDropout() (k_proj): Linear(in_features=1024, out_features=1024, bias=True) (v_proj): Linear(in_features=1024, out_features=1024, bias=True) (q_proj): Linear(in_features=1024, out_features=1024, bias=True) (out_proj): Linear(in_features=1024, out_features=1024, bias=True) ) (encoder_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (fc1): Linear(in_features=1024, out_features=4096, bias=True) (fc2): Linear(in_features=4096, out_features=1024, bias=True) (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) ) (3): TransformerDecoderLayer( (dropout_module): FairseqDropout() (self_attn): MultiheadAttention( (dropout_module): FairseqDropout() (k_proj): Linear(in_features=1024, out_features=1024, bias=True) (v_proj): Linear(in_features=1024, out_features=1024, bias=True) (q_proj): Linear(in_features=1024, out_features=1024, bias=True) (out_proj): Linear(in_features=1024, out_features=1024, bias=True) ) (activation_dropout_module): FairseqDropout() (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (encoder_attn): MultiheadAttention( (dropout_module): FairseqDropout() (k_proj): Linear(in_features=1024, out_features=1024, bias=True) (v_proj): Linear(in_features=1024, out_features=1024, bias=True) (q_proj): Linear(in_features=1024, out_features=1024, bias=True) (out_proj): Linear(in_features=1024, out_features=1024, bias=True) ) (encoder_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (fc1): Linear(in_features=1024, out_features=4096, bias=True) (fc2): Linear(in_features=4096, out_features=1024, bias=True) (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) ) (4): TransformerDecoderLayer( (dropout_module): FairseqDropout() (self_attn): MultiheadAttention( (dropout_module): FairseqDropout() (k_proj): Linear(in_features=1024, out_features=1024, bias=True) (v_proj): Linear(in_features=1024, out_features=1024, bias=True) (q_proj): Linear(in_features=1024, out_features=1024, bias=True) (out_proj): Linear(in_features=1024, out_features=1024, bias=True) ) (activation_dropout_module): FairseqDropout() (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (encoder_attn): MultiheadAttention( (dropout_module): FairseqDropout() (k_proj): Linear(in_features=1024, out_features=1024, bias=True) (v_proj): Linear(in_features=1024, out_features=1024, bias=True) (q_proj): Linear(in_features=1024, out_features=1024, bias=True) (out_proj): Linear(in_features=1024, 
out_features=1024, bias=True) ) [... printout of the remaining decoder layers (5–11), identical in structure to the layers shown above, truncated for brevity ...] (layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (output_projection): Linear(in_features=1024, out_features=10026, bias=False) ) (classification_heads): ModuleDict() )

    2022-11-20 15:23:02 | INFO | fairseq_cli.train | task: translation_from_pretrained_bart (TranslationFromPretrainedBARTTask)
    2022-11-20 15:23:02 | INFO | fairseq_cli.train | model: mbart_large (BARTModel)
    2022-11-20 15:23:02 | INFO | fairseq_cli.train | criterion: label_smoothed_cross_entropy (LabelSmoothedCrossEntropyCriterion)
    2022-11-20 15:23:02 | INFO | fairseq_cli.train | num. model params: 365090816 (num. trained: 62777344)
    2022-11-20 15:23:04 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.embed_tokens.weight
    2022-11-20 15:23:04 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.output_projection.weight
    2022-11-20 15:23:04 | INFO | fairseq.utils | CUDA enviroments for all 1 workers
    2022-11-20 15:23:04 | INFO | fairseq.utils | rank 0: capabilities = 5.0 ; total memory = 3.949 GB ; name = Quadro K2200
    2022-11-20 15:23:04 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
    2022-11-20 15:23:04 | INFO | fairseq_cli.train | max tokens per GPU = 512 and max sentences per GPU = None
    2022-11-20 15:23:04 | INFO | fairseq.trainer | no existing checkpoint found checkpoints/checkpoint_last.pt
    2022-11-20 15:23:04 | INFO | fairseq.trainer | loading train data for epoch 1
    2022-11-20 15:23:04 | INFO | fairseq.data.data_utils | loaded 69865 examples from: data-bin/train.som-en.som
    2022-11-20 15:23:04 | INFO | fairseq.tasks.translation | data-bin train som-en 69865 examples
    Namespace(arch='mbart_large', task='translation_from_pretrained_bart', source_lang='som', target_lang='en', criterion='label_smoothed_cross_entropy', only_finetune_cross_attn=True, finetune_from_mbart_at='/proj/uppmax2022-2-18/cross_attn/cross_attn/trimmed_mbart_new', lr=[3e-05], lr_scheduler='polynomial_decay', max_tokens=512, max_update=150000, update_freq=[8], warmup_updates=2500, seed=222, ... [remaining arguments truncated for brevity])
    2022-11-20 15:23:04 | INFO | fairseq.trainer | begin training epoch 1

    Traceback (most recent call last):
      File "cross_attn/xattn-transfer-for-mt/fairseq-modified/fairseq_cli/train.py", line 545, in <module>
        cli_main()
      File "cross_attn/xattn-transfer-for-mt/fairseq-modified/fairseq_cli/train.py", line 541, in cli_main
        distributed_utils.call_main(args, main)
      File "cross_attn/xattn-transfer-for-mt/fairseq-modified/fairseq/distributed_utils.py", line 255, in call_main
        main(args, **kwargs)
      File "cross_attn/xattn-transfer-for-mt/fairseq-modified/fairseq_cli/train.py", line 311, in main
        valid_losses, should_stop = train(args, trainer, task, epoch_itr)
      File "cross_attn/conda_env/lib/python3.7/contextlib.py", line 74, in inner
        return func(*args, **kwds)
      File "cross_attn/xattn-transfer-for-mt/fairseq-modified/fairseq_cli/train.py", line 396, in train
        log_output = trainer.train_step(samples)
      File "cross_attn/conda_env/lib/python3.7/contextlib.py", line 74, in inner
        return func(*args, **kwds)
      File "cross_attn/xattn-transfer-for-mt/fairseq-modified/fairseq/trainer.py", line 479, in train_step
        ignore_grad=is_dummy_batch,
      File "cross_attn/xattn-transfer-for-mt/fairseq-modified/fairseq/tasks/fairseq_task.py", line 412, in train_step
        loss, sample_size, logging_output = criterion(model, sample)
      File "cross_attn/conda_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 551, in __call__
        result = self.forward(*input, **kwargs)
      File "/crex/proj/uppmax2022-2-18/cross_attn/cross_attn/xattn-transfer-for-mt/fairseq-modified/fairseq/criterions/label_smoothed_cross_entropy.py", line 56, in forward
        net_output = model(**sample['net_input'])
      File "cross_attn/conda_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 551, in __call__
        result = self.forward(*input, **kwargs)
    TypeError: forward() missing 1 required positional argument: 'prev_output_tokens'

Opened by theamato
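
For context on the error above: fairseq's label-smoothed cross-entropy criterion calls the model as model(**sample['net_input']), and a sequence-to-sequence model such as mbart_large defines forward(src_tokens, src_lengths, prev_output_tokens), so the TypeError means the batch reached the model without a prev_output_tokens tensor. The snippet below is only an illustrative sketch with a stand-in forward function and made-up tensors (none of it is taken from this repo); it shows the keys net_input is expected to carry.

    import torch

    # Illustrative stand-in for a fairseq seq2seq forward() signature
    # (e.g., BARTModel.forward): it takes src_tokens, src_lengths, and
    # prev_output_tokens. The tensor values below are made up for demonstration.
    def forward(src_tokens, src_lengths, prev_output_tokens):
        return src_tokens.shape, src_lengths.shape, prev_output_tokens.shape

    sample = {
        "net_input": {
            "src_tokens": torch.tensor([[5, 6, 7, 2]]),       # source token ids
            "src_lengths": torch.tensor([4]),                 # source lengths
            "prev_output_tokens": torch.tensor([[2, 8, 9]]),  # shifted target ids
        }
    }

    # Works with all three keys present; deleting "prev_output_tokens" from
    # net_input reproduces the TypeError shown in the traceback.
    print(forward(**sample["net_input"]))

The actual root cause of the missing key depends on the reporter's data and preprocessing setup; the sketch only clarifies what the error message refers to.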