Hi,
When running finetune-mbart-on-transaltion_embed+xattn.sh, I get the error TypeError: forward() missing 1 required positional argument: 'prev_output_tokens' at the beginning of epoch 1.
When I check the dictionary that is passed into forward, there is no key called 'prev_output_tokens' in 'net_input', so I suppose that is what it is complaining about. Moreover, 'target' is None. Any idea what could cause this?
I'm working in a conda environment with Python 3.7.15.
This is what the sample passed into the criterion's forward looks like:
samples {'id': tensor([67186, 27642, 9526, 57293, 27958, 19522, 48434, 26559]), 'nsentences': 8, 'ntokens': 312, 'net_input': {'src_tokens': tensor([[ 91, 1176, 598, 276, 3504, 1322, 902, 17, 419, 18, 1074, 2246,
4, 1074, 77, 332, 63, 83, 4787, 85, 769, 1393, 25, 662,
769, 2368, 102, 4, 83, 47, 1088, 1205, 85, 47, 6217, 650,
5, 2, 3],
[1290, 15, 2490, 100, 297, 218, 119, 889, 4, 294, 12, 3290,
493, 3, 1789, 3414, 1069, 155, 2780, 2142, 173, 78, 251, 1553,
5473, 145, 9, 415, 9, 630, 12, 9767, 9, 4997, 137, 25,
5, 2, 3],
[ 34, 802, 44, 460, 6847, 1558, 4, 9, 44, 201, 522, 1208,
4583, 2219, 323, 9, 941, 4, 9, 44, 201, 1975, 265, 1126,
507, 9, 888, 25, 9, 930, 16, 46, 1666, 905, 289, 145,
4, 2, 3],
[2520, 88, 29, 5256, 2813, 210, 1611, 7949, 551, 727, 3041, 980,
4, 247, 1368, 1088, 8195, 4563, 447, 1414, 3322, 2209, 102, 926,
1072, 2571, 4, 4102, 15, 340, 3, 131, 45, 6850, 203, 1049,
5, 2, 3],
[ 189, 369, 22, 428, 11, 385, 191, 47, 254, 9, 567, 5406,
4, 9, 44, 16, 9773, 1650, 1891, 19, 9, 414, 3634, 4433,
63, 964, 93, 782, 25, 9, 81, 1650, 1891, 16, 1317, 4332,
5, 2, 3],
[3709, 3, 151, 554, 3745, 205, 98, 3, 9, 8756, 4, 5273,
107, 2782, 12, 5546, 1158, 5, 2532, 893, 4275, 107, 12, 5546,
191, 151, 8328, 253, 9, 111, 6920, 12, 468, 384, 9989, 889,
5, 2, 3],
[ 466, 1592, 178, 868, 8147, 29, 2953, 8609, 4, 4504, 2249, 2131,
4504, 3844, 63, 557, 4, 106, 701, 16, 6478, 102, 44, 5728,
297, 65, 350, 3476, 71, 1592, 9, 16, 93, 25, 882, 3385,
5, 2, 3],
[ 482, 562, 495, 5325, 22, 69, 4222, 12, 1931, 4571, 76, 11,
1023, 417, 11, 2927, 7987, 4, 253, 2280, 2644, 2247, 261, 4,
373, 232, 4, 722, 1529, 841, 286, 2642, 5920, 102, 15, 1753,
5, 2, 3]]), 'src_lengths': tensor([39, 39, 39, 39, 39, 39, 39, 39])}, 'target': None}
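For reference, this is roughly the check I used to confirm which keys are present (a minimal sketch only, assuming the standard fairseq batch layout in which 'net_input' also carries 'prev_output_tokens' and 'target' holds the gold target tokens; check_sample is just a throwaway debugging helper, not part of the repo):

# Minimal sanity check on a collated batch (sketch; assumes the stock fairseq
# sample layout with src_tokens, src_lengths and prev_output_tokens inside
# net_input, plus a 'target' entry holding the gold tokens).
def check_sample(sample):
    net_input = sample["net_input"]
    for key in ("src_tokens", "src_lengths", "prev_output_tokens"):
        if key not in net_input:
            print(f"missing net_input key: {key}")
    if sample.get("target") is None:
        print("sample['target'] is None")

For the batch above, this reports that 'prev_output_tokens' is missing and that 'target' is None, which matches the TypeError.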
And this is the whole terminal output when running the script:
2022-11-20 15:22:11 | INFO | fairseq_cli.train | Namespace(activation_dropout=0.0, activation_fn='gelu', adam_betas='(0.9, 0.98)', adam_eps=1e-06, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, arch='mbart_large', attention_dropout=0.1, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_suffix='', clip_norm=0.0, cpu=False, criterion='label_smoothed_cross_entropy', cross_self_attention=False, curriculum=0, data='data-bin', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layerdrop=0, decoder_layers=12, decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_normalize_before=True, decoder_output_dim=1024, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_num_procs=1, distributed_port=-1, distributed_rank=0, distributed_world_size=1, distributed_wrapper='DDP', dropout=0.3, empty_cache_freq=0, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layerdrop=0, encoder_layers=12, encoder_layers_to_keep=None, encoder_learned_pos=True, encoder_normalize_before=True, end_learning_rate=0.0, eval_bleu=True, eval_bleu_args=None, eval_bleu_detok='moses', eval_bleu_detok_args=None, eval_bleu_print_samples=False, eval_bleu_remove_bpe=None, eval_tokenized_bleu=False, fast_stat_sync=False, find_unused_parameters=False, finetune_from_mbart_at='/proj/uppmax2022-2-18/cross_attn/cross_attn/trimmed_mbart_new', finetune_from_mbart_with_reinit_xattn_at=None, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, freeze_pretrained_transformer_body=False, keep_best_checkpoints=-1, keep_interval_updates=1, keep_last_epochs=-1, label_smoothing=0.2, langs='ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN', layernorm_embedding=True, left_pad_source='True', left_pad_target='False', load_alignments=False, load_model_but_src_embeddings_and_freeze_tgt_embeddings_from=None, load_model_but_src_embeddings_and_xattn_and_freeze_tgt_embeddings_from=None, load_model_but_tgt_embeddings_and_freeze_src_embeddings_from=None, load_model_but_tgt_embeddings_and_xattn_and_freeze_src_embeddings_from=None, localsgd_frequency=3, log_format='simple', log_interval=20, lr=[3e-05], lr_scheduler='polynomial_decay', max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=512, max_tokens_valid=512, max_update=150000, maximize_best_checkpoint_metric=False, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1.0, model_parallel_size=1, no_cross_attention=False, no_epoch_checkpoints=True, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_scale_embedding=False, no_seed_provided=False, no_token_positional_embeddings=False, nprocs_per_node=1, num_batch_buckets=0, num_workers=8, only_finetune_cross_attn=True, optimizer='adam', optimizer_overrides='{}', patience=25, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=None, pipeline_devices=None, 
pipeline_model_parallel=False, pooler_activation_fn='tanh', pooler_dropout=0.0, power=1.0, prepend_bos=False, profile=False, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, quantization_config_path=None, relu_dropout=0.0, required_batch_size_multiple=8, required_seq_len_multiple=1, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints', save_interval=1, save_interval_updates=500, scoring='bleu', seed=222, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=True, skip_invalid_size_inputs_valid_test=True, slowmo_algorithm='LocalSGD', slowmo_momentum=None, source_lang='som', stop_time_hours=0, target_lang='en', task='translation_from_pretrained_bart', tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, total_num_update=1000000, tpu=False, train_subset='train', truncate_source=False, update_freq=[8], upsample_primary=1, use_bmuf=False, use_old_adam=False, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=0, warmup_updates=2500, weight_decay=0.0, zero_sharding='none')
2022-11-20 15:22:11 | INFO | fairseq.tasks.translation | [som] dictionary: 10000 types
2022-11-20 15:22:11 | INFO | fairseq.tasks.translation | [en] dictionary: 10000 types
2022-11-20 15:22:11 | INFO | fairseq.data.data_utils | loaded 9106 examples from: data-bin/valid.som-en.som
2022-11-20 15:22:11 | INFO | fairseq.data.data_utils | loaded 9106 examples from: data-bin/valid.som-en.en
2022-11-20 15:22:11 | INFO | fairseq.tasks.translation | data-bin valid som-en 9106 examples
2022-11-20 15:22:28 | INFO | fairseq_cli.train | loading the (trimmed) mbart from /proj/uppmax2022-2-18/cross_attn/cross_attn/trimmed_mbart_new
2022-11-20 15:23:02 | INFO | fairseq_cli.train | only cross-attention layers will be trained in addition to embeddings, freezing all other parameters
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter encoder.embed_tokens.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter encoder.embed_positions.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter encoder.layernorm_embedding.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter encoder.layernorm_embedding.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.embed_positions.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layernorm_embedding.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layernorm_embedding.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.0.encoder_attn.k_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.0.encoder_attn.k_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.0.encoder_attn.v_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.0.encoder_attn.v_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.0.encoder_attn.q_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.0.encoder_attn.q_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.0.encoder_attn.out_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.0.encoder_attn.out_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.0.encoder_attn_layer_norm.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.0.encoder_attn_layer_norm.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.1.encoder_attn.k_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.1.encoder_attn.k_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.1.encoder_attn.v_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.1.encoder_attn.v_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.1.encoder_attn.q_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.1.encoder_attn.q_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.1.encoder_attn.out_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.1.encoder_attn.out_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.1.encoder_attn_layer_norm.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.1.encoder_attn_layer_norm.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.2.encoder_attn.k_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.2.encoder_attn.k_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.2.encoder_attn.v_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.2.encoder_attn.v_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.2.encoder_attn.q_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.2.encoder_attn.q_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.2.encoder_attn.out_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.2.encoder_attn.out_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.2.encoder_attn_layer_norm.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.2.encoder_attn_layer_norm.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.3.encoder_attn.k_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.3.encoder_attn.k_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.3.encoder_attn.v_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.3.encoder_attn.v_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.3.encoder_attn.q_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.3.encoder_attn.q_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.3.encoder_attn.out_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.3.encoder_attn.out_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.3.encoder_attn_layer_norm.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.3.encoder_attn_layer_norm.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.4.encoder_attn.k_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.4.encoder_attn.k_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.4.encoder_attn.v_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.4.encoder_attn.v_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.4.encoder_attn.q_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.4.encoder_attn.q_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.4.encoder_attn.out_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.4.encoder_attn.out_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.4.encoder_attn_layer_norm.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.4.encoder_attn_layer_norm.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.5.encoder_attn.k_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.5.encoder_attn.k_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.5.encoder_attn.v_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.5.encoder_attn.v_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.5.encoder_attn.q_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.5.encoder_attn.q_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.5.encoder_attn.out_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.5.encoder_attn.out_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.5.encoder_attn_layer_norm.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.5.encoder_attn_layer_norm.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.6.encoder_attn.k_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.6.encoder_attn.k_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.6.encoder_attn.v_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.6.encoder_attn.v_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.6.encoder_attn.q_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.6.encoder_attn.q_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.6.encoder_attn.out_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.6.encoder_attn.out_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.6.encoder_attn_layer_norm.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.6.encoder_attn_layer_norm.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.7.encoder_attn.k_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.7.encoder_attn.k_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.7.encoder_attn.v_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.7.encoder_attn.v_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.7.encoder_attn.q_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.7.encoder_attn.q_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.7.encoder_attn.out_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.7.encoder_attn.out_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.7.encoder_attn_layer_norm.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.7.encoder_attn_layer_norm.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.8.encoder_attn.k_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.8.encoder_attn.k_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.8.encoder_attn.v_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.8.encoder_attn.v_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.8.encoder_attn.q_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.8.encoder_attn.q_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.8.encoder_attn.out_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.8.encoder_attn.out_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.8.encoder_attn_layer_norm.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.8.encoder_attn_layer_norm.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.9.encoder_attn.k_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.9.encoder_attn.k_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.9.encoder_attn.v_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.9.encoder_attn.v_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.9.encoder_attn.q_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.9.encoder_attn.q_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.9.encoder_attn.out_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.9.encoder_attn.out_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.9.encoder_attn_layer_norm.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.9.encoder_attn_layer_norm.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.10.encoder_attn.k_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.10.encoder_attn.k_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.10.encoder_attn.v_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.10.encoder_attn.v_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.10.encoder_attn.q_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.10.encoder_attn.q_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.10.encoder_attn.out_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.10.encoder_attn.out_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.10.encoder_attn_layer_norm.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.10.encoder_attn_layer_norm.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.11.encoder_attn.k_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.11.encoder_attn.k_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.11.encoder_attn.v_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.11.encoder_attn.v_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.11.encoder_attn.q_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.11.encoder_attn.q_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.11.encoder_attn.out_proj.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.11.encoder_attn.out_proj.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.11.encoder_attn_layer_norm.weight will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | parameter decoder.layers.11.encoder_attn_layer_norm.bias will be trained
2022-11-20 15:23:02 | INFO | fairseq_cli.train | BARTModel(
(encoder): TransformerEncoder(
(dropout_module): FairseqDropout()
(embed_tokens): Embedding(10026, 1024, padding_idx=1)
(embed_positions): LearnedPositionalEmbedding(1026, 1024, padding_idx=1)
(layernorm_embedding): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(layers): ModuleList(
(0): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(dropout_module): FairseqDropout()
(activation_dropout_module): FairseqDropout()
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(1): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(dropout_module): FairseqDropout()
(activation_dropout_module): FairseqDropout()
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(2): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(dropout_module): FairseqDropout()
(activation_dropout_module): FairseqDropout()
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(3): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(dropout_module): FairseqDropout()
(activation_dropout_module): FairseqDropout()
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(4): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(dropout_module): FairseqDropout()
(activation_dropout_module): FairseqDropout()
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(5): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(dropout_module): FairseqDropout()
(activation_dropout_module): FairseqDropout()
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(6): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(dropout_module): FairseqDropout()
(activation_dropout_module): FairseqDropout()
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(7): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(dropout_module): FairseqDropout()
(activation_dropout_module): FairseqDropout()
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(8): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(dropout_module): FairseqDropout()
(activation_dropout_module): FairseqDropout()
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(9): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(dropout_module): FairseqDropout()
(activation_dropout_module): FairseqDropout()
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(10): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(dropout_module): FairseqDropout()
(activation_dropout_module): FairseqDropout()
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(11): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(dropout_module): FairseqDropout()
(activation_dropout_module): FairseqDropout()
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
)
(layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(decoder): TransformerDecoder(
(dropout_module): FairseqDropout()
(embed_tokens): Embedding(10026, 1024, padding_idx=1)
(embed_positions): LearnedPositionalEmbedding(1026, 1024, padding_idx=1)
(layernorm_embedding): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(layers): ModuleList(
(0): TransformerDecoderLayer(
(dropout_module): FairseqDropout()
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(activation_dropout_module): FairseqDropout()
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(1): TransformerDecoderLayer(
(dropout_module): FairseqDropout()
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(activation_dropout_module): FairseqDropout()
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(2): TransformerDecoderLayer(
(dropout_module): FairseqDropout()
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(activation_dropout_module): FairseqDropout()
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(3): TransformerDecoderLayer(
(dropout_module): FairseqDropout()
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(activation_dropout_module): FairseqDropout()
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(4): TransformerDecoderLayer(
(dropout_module): FairseqDropout()
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(activation_dropout_module): FairseqDropout()
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(5): TransformerDecoderLayer(
(dropout_module): FairseqDropout()
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(activation_dropout_module): FairseqDropout()
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(6): TransformerDecoderLayer(
(dropout_module): FairseqDropout()
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(activation_dropout_module): FairseqDropout()
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(7): TransformerDecoderLayer(
(dropout_module): FairseqDropout()
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(activation_dropout_module): FairseqDropout()
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(8): TransformerDecoderLayer(
(dropout_module): FairseqDropout()
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(activation_dropout_module): FairseqDropout()
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(9): TransformerDecoderLayer(
(dropout_module): FairseqDropout()
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(activation_dropout_module): FairseqDropout()
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(10): TransformerDecoderLayer(
(dropout_module): FairseqDropout()
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(activation_dropout_module): FairseqDropout()
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(11): TransformerDecoderLayer(
(dropout_module): FairseqDropout()
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(activation_dropout_module): FairseqDropout()
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
)
(layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(output_projection): Linear(in_features=1024, out_features=10026, bias=False)
)
(classification_heads): ModuleDict()
)
2022-11-20 15:23:02 | INFO | fairseq_cli.train | task: translation_from_pretrained_bart (TranslationFromPretrainedBARTTask)
2022-11-20 15:23:02 | INFO | fairseq_cli.train | model: mbart_large (BARTModel)
2022-11-20 15:23:02 | INFO | fairseq_cli.train | criterion: label_smoothed_cross_entropy (LabelSmoothedCrossEntropyCriterion)
2022-11-20 15:23:02 | INFO | fairseq_cli.train | num. model params: 365090816 (num. trained: 62777344)
2022-11-20 15:23:04 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.embed_tokens.weight
2022-11-20 15:23:04 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.output_projection.weight
2022-11-20 15:23:04 | INFO | fairseq.utils | CUDA enviroments for all 1 workers
2022-11-20 15:23:04 | INFO | fairseq.utils | rank 0: capabilities = 5.0 ; total memory = 3.949 GB ; name = Quadro K2200
2022-11-20 15:23:04 | INFO | fairseq.utils | CUDA enviroments for all 1 workers
2022-11-20 15:23:04 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
2022-11-20 15:23:04 | INFO | fairseq_cli.train | max tokens per GPU = 512 and max sentences per GPU = None
2022-11-20 15:23:04 | INFO | fairseq.trainer | no existing checkpoint found checkpoints/checkpoint_last.pt
2022-11-20 15:23:04 | INFO | fairseq.trainer | loading train data for epoch 1
2022-11-20 15:23:04 | INFO | fairseq.data.data_utils | loaded 69865 examples from: data-bin/train.som-en.som
2022-11-20 15:23:04 | INFO | fairseq.tasks.translation | data-bin train som-en 69865 examples
Namespace(activation_dropout=0.0, activation_fn='gelu', adam_betas='(0.9, 0.98)', adam_eps=1e-06, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, arch='mbart_large', attention_dropout=0.1, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_suffix='', clip_norm=0.0, cpu=False, criterion='label_smoothed_cross_entropy', cross_self_attention=False, curriculum=0, data='data-bin', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layerdrop=0, decoder_layers=12, decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_normalize_before=True, decoder_output_dim=1024, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_num_procs=1, distributed_port=-1, distributed_rank=0, distributed_world_size=1, distributed_wrapper='DDP', dropout=0.3, empty_cache_freq=0, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layerdrop=0, encoder_layers=12, encoder_layers_to_keep=None, encoder_learned_pos=True, encoder_normalize_before=True, end_learning_rate=0.0, eval_bleu=True, eval_bleu_args=None, eval_bleu_detok='moses', eval_bleu_detok_args=None, eval_bleu_print_samples=False, eval_bleu_remove_bpe=None, eval_tokenized_bleu=False, fast_stat_sync=False, find_unused_parameters=False, finetune_from_mbart_at='/proj/uppmax2022-2-18/cross_attn/cross_attn/trimmed_mbart_new', finetune_from_mbart_with_reinit_xattn_at=None, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, freeze_pretrained_transformer_body=False, keep_best_checkpoints=-1, keep_interval_updates=1, keep_last_epochs=-1, label_smoothing=0.2, langs='ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN', layernorm_embedding=True, left_pad_source=True, left_pad_target=False, load_alignments=False, load_model_but_src_embeddings_and_freeze_tgt_embeddings_from=None, load_model_but_src_embeddings_and_xattn_and_freeze_tgt_embeddings_from=None, load_model_but_tgt_embeddings_and_freeze_src_embeddings_from=None, load_model_but_tgt_embeddings_and_xattn_and_freeze_src_embeddings_from=None, localsgd_frequency=3, log_format='simple', log_interval=20, lr=[3e-05], lr_scheduler='polynomial_decay', max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=512, max_tokens_valid=512, max_update=150000, maximize_best_checkpoint_metric=False, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1.0, model_parallel_size=1, no_cross_attention=False, no_epoch_checkpoints=True, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_scale_embedding=False, no_seed_provided=False, no_token_positional_embeddings=False, nprocs_per_node=1, num_batch_buckets=0, num_workers=8, only_finetune_cross_attn=True, optimizer='adam', optimizer_overrides='{}', patience=25, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=None, pipeline_devices=None, pipeline_model_parallel=False, 
pooler_activation_fn='tanh', pooler_dropout=0.0, power=1.0, prepend_bos=False, profile=False, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, quantization_config_path=None, relu_dropout=0.0, required_batch_size_multiple=8, required_seq_len_multiple=1, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints', save_interval=1, save_interval_updates=500, scoring='bleu', seed=222, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=True, skip_invalid_size_inputs_valid_test=True, slowmo_algorithm='LocalSGD', slowmo_momentum=None, source_lang='som', stop_time_hours=0, target_lang='en', task='translation_from_pretrained_bart', tensorboard_logdir='', threshold_loss_scale=None, tie_adaptive_weights=False, tokenizer=None, total_num_update=1000000, tpu=False, train_subset='train', truncate_source=False, update_freq=[8], upsample_primary=1, use_bmuf=False, use_old_adam=False, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=0, warmup_updates=2500, weight_decay=0.0, zero_sharding='none')
2022-11-20 15:23:04 | INFO | fairseq.trainer | begin training epoch 1
Traceback (most recent call last):
File "cross_attn/xattn-transfer-for-mt/fairseq-modified/fairseq_cli/train.py", line 545, in
cli_main()
File "cross_attn/xattn-transfer-for-mt/fairseq-modified/fairseq_cli/train.py", line 541, in cli_main
distributed_utils.call_main(args, main)
File "cross_attn/xattn-transfer-for-mt/fairseq-modified/fairseq/distributed_utils.py", line 255, in call_main
main(args, **kwargs)
File "cross_attn/xattn-transfer-for-mt/fairseq-modified/fairseq_cli/train.py", line 311, in main
valid_losses, should_stop = train(args, trainer, task, epoch_itr)
File "cross_attn/conda_env/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "cross_attn/xattn-transfer-for-mt/fairseq-modified/fairseq_cli/train.py", line 396, in train
log_output = trainer.train_step(samples)
File "cross_attn/conda_env/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "cross_attn/xattn-transfer-for-mt/fairseq-modified/fairseq/trainer.py", line 479, in train_step
ignore_grad=is_dummy_batch,
File "cross_attn/xattn-transfer-for-mt/fairseq-modified/fairseq/tasks/fairseq_task.py", line 412, in train_step
loss, sample_size, logging_output = criterion(model, sample)
File "cross_attn/conda_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 551, in call
result = self.forward(*input, **kwargs)
File "/crex/proj/uppmax2022-2-18/cross_attn/cross_attn/xattn-transfer-for-mt/fairseq-modified/fairseq/criterions/label_smoothed_cross_entropy.py", line 56, in forward
net_output = model(**sample['net_input'])
File "cross_attn/conda_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 551, in call
result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'prev_output_tokens'
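One more observation from the log above: for the valid split both valid.som-en.som and valid.som-en.en are loaded, but for the train split only train.som-en.som appears before the crash, so maybe the target side of the training data is not being picked up. For context, the failing call is model(**sample['net_input']), and as far as I understand the stock fairseq code (which this fork modifies), the model's forward takes prev_output_tokens as a positional argument, so a batch without that key would raise exactly this error. A tiny self-contained illustration of the mechanism (not code from the fork):

# Toy reproduction of the error mechanism (illustration only, not fairseq code):
# the criterion calls model(**sample['net_input']), and forward() requires
# prev_output_tokens, so a net_input without that key raises this TypeError.
def forward(src_tokens, src_lengths, prev_output_tokens):
    return None

net_input = {"src_tokens": [[91, 1176, 598]], "src_lengths": [3]}  # no prev_output_tokens
forward(**net_input)  # TypeError: forward() missing 1 required positional argument: 'prev_output_tokens'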