Official repository of OFA. Paper: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

Overview



Paper | Blog





OFA is a unified multimodal pretrained model that brings modalities (i.e., cross-modality, vision, language) and tasks (e.g., image generation, visual grounding, image captioning, image classification, text generation) into a simple sequence-to-sequence learning framework. For more information, please refer to our paper: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework.

We welcome contributions to our project. Feel free to contact us or send us issues/PRs!

Online Demos

We provide online demos via Hugging Face Spaces so that you can interact with our pretrained and finetuned models.

We also provide Colab notebooks that walk through the procedures step by step.

News

  • 2022.4.28: Added support for inference with Hugging Face Transformers. For usage, please refer to the doc transformers.md and our Hugging Face models.
  • 2022.4.16: Released lightweight pretrained models OFA-Medium (~93M params) and OFA-Tiny (~33M params) in checkpoints.md. To use them, you just need to load the corresponding checkpoint and set --arch=ofa_medium or --arch=ofa_tiny in the scripts.
  • 2022.3.23: Added Encouraging Loss as a feature. See README_EncouragingLoss.md. Leveraging this feature, OFA-Large has achieved improved results in both VQA (test-std acc: 80.67) and Image Classification (test acc: 85.6) recently.
  • 2022.3.21: Released codes for pretraining OFA.
  • 2022.3.18: Released the finetuned OFA-Base (~180M parameters) checkpoints and running scripts for vision & language tasks, including: Caption (146.4 CIDEr), VQA (78.07 on test-std), SNLI-VE (89.3 on dev), RefCOCO (90.67 on testA), RefCOCO+ (87.15 on testA) and RefCOCOg (82.31 on test-u).
  • 2022.3.11: Released the finetuning & inference code/checkpoints for Gigaword.
  • 2022.3.08: Released the pretrained checkpoint of OFA-Base in checkpoints.md. To use OFA-Base, you just need to load ofa_base.pt and change --arch=ofa_large to --arch=ofa_base in the training scripts.
More News

  • 2022.3.07: Released the finetuning & inference code/checkpoints for Image Classification, which achieves 85.0 accuracy on ImageNet-1K, slightly better than reported in OFA paper.
  • 2022.3.04: Released the finetuning & inference code/checkpoints for Text-to-Image Generation.
  • 2022.3.03: Released the finetuning & inference code/checkpoints for SNLI-VE and GLUE.
  • 2022.2.22: Released the finetuning & inference code/checkpoints for Visual Question Answering, which can reproduce the reported VQA accuracy in OFA paper (80.02 on test-std). Check our results on the VQA Challenge.
  • 2022.2.15: Released finetuning & inference code/checkpoints for Referring Expression Comprehension
  • 2022.2.10: Released the inference code & finetuned checkpoint for Image Captioning, which can reproduce the results on the COCO Karpathy test split (149.6 CIDEr). OFA also ranks No. 1 on the COCO image captioning online leaderboard Link (marked as M6-Team).



Model Card

We list the parameters and pretrained checkpoints of OFAs below. For finetuned checkpoints, please refer to checkpoints.md.

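| Model | Ckpt | Params | Backbone | Hidden size | Intermediate size | Num. of heads | Enc layers | Dec layers |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OFA-Tiny | Download | 33M | ResNet50 | 256 | 1024 | 4 | 4 | 4 |
| OFA-Medium | Download | 93M | ResNet101 | 512 | 2048 | 8 | 4 | 4 |
| OFA-Base | Download | 180M | ResNet101 | 768 | 3072 | 12 | 6 | 6 |
| OFA-Large | Download | 470M | ResNet152 | 1024 | 4096 | 16 | 12 | 12 |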


Results

Below we demonstrate the results of OFAs on cross-modal understanding and generation.

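| Task | Image Captioning | Text-to-Image Generation | Text-to-Image Generation | Text-to-Image Generation | VQA | VQA | Visual Entailment | Visual Entailment | Referring Expression Comprehension | Referring Expression Comprehension | Referring Expression Comprehension | Referring Expression Comprehension | Referring Expression Comprehension | Referring Expression Comprehension | Referring Expression Comprehension | Referring Expression Comprehension |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dataset | COCO | COCO | COCO | COCO | VQA v2 | VQA v2 | SNLI-VE | SNLI-VE | RefCOCO | RefCOCO | RefCOCO | RefCOCO+ | RefCOCO+ | RefCOCO+ | RefCOCOg | RefCOCOg |
| Split | Karpathy test | test | test | test | test-dev | test-std | val | test | val | test-a | test-b | val | test-a | test-b | val-u | test-u |
| Metric | CIDEr | FID | CLIPSIM | IS | Acc. | Acc. | Acc. | Acc. | Acc. | Acc. | Acc. | Acc. | Acc. | Acc. | Acc. | Acc. |
| OFA-Tiny | 128.4 | - | - | - | 70.25 | 70.41 | 85.3 | 85.2 | 80.20 | 84.07 | 75.00 | 68.22 | 75.13 | 57.66 | 72.02 | 69.74 |
| OFA-Medium | 140.3 | - | - | - | 75.35 | 75.45 | 86.6 | 87.0 | 85.34 | 87.68 | 77.92 | 76.09 | 83.04 | 66.25 | 78.76 | 78.58 |
| OFA-Base | 146.4 | 20.8 | 31.6 | 21.8 | 77.98 | 78.07 | 89.3 | 89.2 | 88.48 | 90.67 | 83.30 | 81.39 | 87.15 | 74.29 | 82.29 | 82.31 |
| OFA-Large | 150.2 | 10.5 | 34.4 | 31.1 | 80.43 | 80.67 | 90.3 | 90.2 | 90.05 | 92.93 | 85.26 | 85.80 | 89.87 | 79.22 | 85.89 | 86.55 |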


Requirements

  • python 3.7.4
  • pytorch 1.8.1
  • torchvision 0.9.1
  • JAVA 1.8 (for COCO evaluation)

Installation

git clone https://github.com/OFA-Sys/OFA
cd OFA  # requirements.txt lives in the repository root
pip install -r requirements.txt



Datasets and Checkpoints

See datasets.md and checkpoints.md.

Pretraining

Below we provide methods for pretraining OFA.

1. Prepare the Dataset

To pretrain OFA, you should first download the dataset we provide (pretrain_data_examples.zip, a small subset of the original pretraining data). For your custom pretraining datasets, please prepare your training samples in the same format. pretrain_data_examples.zip contains 4 TSV files: vision_language_examples.tsv, text_examples.tsv, image_examples.tsv and detection_examples.tsv. Details of these files are as follows:

  • vision_language_examples.tsv: Each line contains uniq-id, image (base64 string), caption, question, answer, ground-truth objects (objects appearing in the caption or question), dataset name (source of the data) and task type (caption, qa or visual grounding). Prepared for the pretraining tasks of visual grounding, grounded captioning, image-text matching, image captioning and visual question answering.
  • text_examples.tsv: Each line contains uniq-id and text. Prepared for the pretraining task of text infilling.
  • image_examples.tsv: Each line contains uniq-id, image (base64 string) and image-code (generated by VQ-GAN). Prepared for the pretraining task of image infilling.
  • detection_examples.tsv: Each line contains uniq-id, image (base64 string) and bounding box annotations (the top-left and bottom-right coordinates of the bounding box, object_id and object_name, separated by commas). Prepared for the pretraining task of detection.
In addition, the folder negative_sample in pretrain_data_examples.zip contains three files all_captions.txt, object.txt and type2ans.json. The data in these files are used as negative samples for the image-text matching (ITM) task.
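As a rough illustration of the vision_language_examples.tsv format described above, the sketch below splits one line into named fields and decodes the image. It assumes the columns appear in exactly the order listed in the bullet; the field names (`VL_FIELDS`) and the helper `parse_vl_line` are hypothetical and not part of the OFA codebase.

```python
import base64
from typing import Dict, List

# Column order as listed in the description above (assumed to match the file).
VL_FIELDS: List[str] = [
    "uniq_id", "image", "caption", "question",
    "answer", "gt_objects", "dataset_name", "task_type",
]


def parse_vl_line(line: str) -> Dict[str, object]:
    """Split one tab-separated line and decode the base64 image into raw bytes."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(VL_FIELDS):
        raise ValueError(f"expected {len(VL_FIELDS)} fields, got {len(values)}")
    record: Dict[str, object] = dict(zip(VL_FIELDS, values))
    record["image"] = base64.b64decode(values[1])
    return record


# Synthetic line: the image payload is placeholder bytes, not a real image.
fake_image = base64.b64encode(b"not-a-real-image").decode("ascii")
example_line = "\t".join(
    ["0", fake_image, "a dog on the grass", "", "", "dog", "demo", "caption"]
)
example_record = parse_vl_line(example_line)
print(example_record["task_type"], len(example_record["image"]))
```

Checking the field count up front makes a malformed custom dataset fail early instead of during pretraining.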

2. Pretraining

By default, the pretraining script attempts to restore the released pretrained checkpoint of OFA-Base or OFA-Large and continues pretraining from it. We recommend continued pretraining, which achieves much better results than pretraining from scratch. To use it, download the pretrained weights in advance (see checkpoints.md) and put them in the directory OFA/checkpoints/. If no checkpoint is found there, pretraining starts from scratch.

cd run_scripts/pretraining
bash pretrain_ofa_large.sh # Pretrain OFA-Large. For OFA-Base, use pretrain_ofa_base.sh

If the pretrained OFA checkpoint is restored successfully, you will see the following information in the log:

INFO: Loaded checkpoint ../../checkpoints/ofa_large.pt



Finetuning & Inference

Below we provide methods for finetuning and inference on different downstream tasks. We provide both pretrained OFA-Large and OFA-Base in checkpoints.md. The scripts mentioned in this section are prepared for OFA-Large. To reproduce the downstream results of OFA-Base, we also provide the corresponding finetuning and inference scripts for OFA-Base in the run_scripts/ folder.

We recommend that your workspace directory should be organized like this:

OFA/
├── checkpoints/
│   ├── ofa_base.pt
│   ├── ofa_large.pt
│   ├── caption_large_best_clean.pt
│   └── ...
├── criterions/
├── data/
├── dataset/
│   ├── caption_data/
│   ├── gigaword_data/
│   └── ...
├── fairseq/
├── models/
├── run_scripts/
├── tasks/
├── train.py
├── trainer.py
└── utils/
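Before launching any run script, it can help to confirm that the workspace matches the layout above. The following is only a sketch: the helper `missing_dirs` is hypothetical, the directory names are copied from the tree, and checkpoint or dataset files are not checked.

```python
import tempfile
from pathlib import Path
from typing import List

# Sub-directories taken from the recommended layout; files are not checked here.
EXPECTED_DIRS = [
    "checkpoints", "criterions", "data", "dataset",
    "fairseq", "models", "run_scripts", "tasks", "utils",
]


def missing_dirs(root: Path) -> List[str]:
    """Return the expected sub-directories that do not exist under root."""
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


# Usage on a throwaway directory that only contains two of the folders.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "checkpoints").mkdir()
    (demo_root / "dataset").mkdir()
    demo_missing = missing_dirs(demo_root)
print(demo_missing)
```

In a real workspace you would call `missing_dirs(Path("OFA"))` and expect an empty list.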

Image Captioning

We provide procedures to reproduce the image captioning results reported in our paper below.

1. Prepare the Dataset & Checkpoints

Download data (see datasets.md) and models (see checkpoints.md) and put them in the correct directory. The dataset zipfile caption_data.zip contains caption_stage1_train.tsv, caption_stage2_train.tsv, caption_val.tsv and caption_test.tsv. Each image corresponds to only 1 caption in caption_stage1_train.tsv and corresponds to multiple captions in other TSV files (about 5 captions per image). Each line of the dataset represents a caption sample with the following format. The information of uniq-id, image-id, caption, predicted object labels (taken from VinVL, not used), image base64 string are separated by tabs.

162365  12455   the sun sets over the trees beyond some docks.  sky&&water&&dock&&pole  /9j/4AAQSkZJ....UCP/2Q==
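To make the caption TSV format concrete, here is a minimal parsing sketch. It assumes the five fields appear in the order given above and that object labels are joined with `&&` as in the sample line; `CaptionSample` and `parse_caption_line` are hypothetical names, not functions from the OFA codebase.

```python
import base64
from dataclasses import dataclass
from typing import List


@dataclass
class CaptionSample:
    uniq_id: str
    image_id: str
    caption: str
    object_labels: List[str]  # VinVL labels; described above as not used
    image_bytes: bytes


def parse_caption_line(line: str) -> CaptionSample:
    """Parse one tab-separated caption line into a CaptionSample."""
    uniq_id, image_id, caption, labels, image_b64 = line.rstrip("\n").split("\t")
    return CaptionSample(
        uniq_id=uniq_id,
        image_id=image_id,
        caption=caption,
        object_labels=labels.split("&&"),
        image_bytes=base64.b64decode(image_b64),
    )


# Synthetic line: the image payload is placeholder bytes, not a real JPEG.
payload = base64.b64encode(b"placeholder").decode("ascii")
caption_sample = parse_caption_line(
    "162365\t12455\tthe sun sets over the trees beyond some docks.\t"
    "sky&&water&&dock&&pole\t" + payload
)
print(caption_sample.caption, caption_sample.object_labels)
```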
2. Finetuning

Following previous standard practice, we divide the finetuning process of image captioning into two stages. In stage 1, we finetune OFA with cross-entropy loss on 4 NVIDIA-V100 GPUs with 32GB memory (expected to obtain ~139.5 CIDEr on the validation set at this stage). In stage 2, we select the best checkpoint of stage 1 and train with CIDEr optimization on 8 NVIDIA-V100 GPUs. Note that CIDEr optimization is very unstable and requires careful hyperparameter tuning. If you encounter training errors in stage 2, try increasing the batch size or reducing the learning rate. If neither works, set --freeze-resnet to freeze the internal states of batch normalization.

cd run_scripts/caption
nohup sh train_caption_stage1.sh > train_stage1.out &  # stage 1, train with cross-entropy loss
nohup sh train_caption_stage2.sh > train_stage2.out &  # stage 2 (start after stage 1 finishes): load the best ckpt of stage 1 and train with CIDEr optimization

3. Inference

Run the following commands to get your results and evaluate your model.

cd run_scripts/caption ; sh evaluate_caption.sh  # inference & evaluate

Text-to-Image Generation

This part provides procedures for the finetuning and inference of text-to-image generation. See below.

1. Prepare the Dataset & Checkpoints

Download data (see datasets.md) and models (see checkpoints.md) and put them in the correct directory. The dataset zipfile coco_image_gen.zip contains coco_vqgan_train.tsv, coco_vqgan_dev.tsv and coco_vqgan_full_test.tsv. Each line of the dataset represents a sample with the following format. The information of uniq-id, image-code (produced by vqgan, a list of integers separated by single-whitespaces), lowercased caption are separated by tabs.

1	6674 4336 4532 5334 3251 5461 3615 2469 ...4965 4190 1846	the people are posing for a group photo.
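The image-generation TSV line can be read in the same way. The sketch below assumes the three tab-separated fields described above and turns the image-code into a list of integers; `parse_image_gen_line` is a hypothetical helper, and the shortened code sequence in the usage example is synthetic.

```python
from typing import List, Tuple


def parse_image_gen_line(line: str) -> Tuple[str, List[int], str]:
    """Return (uniq_id, image_code, caption) from one tab-separated line."""
    uniq_id, code_str, caption = line.rstrip("\n").split("\t")
    # The image-code is a list of integers separated by single whitespaces.
    image_code = [int(token) for token in code_str.split(" ")]
    return uniq_id, image_code, caption


# Synthetic line with a shortened code sequence.
gen_id, gen_code, gen_caption = parse_image_gen_line(
    "1\t6674 4336 4532 5334\tthe people are posing for a group photo."
)
print(gen_id, gen_code, gen_caption)
```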

The checkpoint zipfile image_gen_large_best.zip contains image_gen_large_best.pt, vqgan/last.ckpt, vqgan/model.yaml and clip/Vit-B-16.pt.

2. Shuffle the Training Data

(Optional, but achieves better results): If disk storage is sufficient, we recommend preparing the shuffled training data for each epoch in advance.

cd dataset/image_gen
ln coco_vqgan_train.tsv coco_vqgan_train_1.tsv
for idx in $(seq 1 9); do shuf coco_vqgan_train_${idx}.tsv > coco_vqgan_train_$((idx+1)).tsv; done  # each file is used for one epoch

3. Finetuning

Following previous practice, we divide the finetuning process of image generation into two stages. In stage 1, we finetune OFA with cross-entropy loss on 4 servers, each with 8 V100 32GB GPUs (expected to obtain ~32.5+ CLIP Score on the validation set at this stage). In stage 2, we select the last checkpoint of stage 1 and train with CLIP Score optimization on 4 servers, each with 8 V100 32GB GPUs (expected to obtain ~34.0+ CLIP Score on the validation set at this stage). During validation, the generated images are dumped into _GEN_IMAGE_PATH_.

# run on each worker after the distributed and data configs have been correctly set following the guide in train_image_gen_stage1_distributed.sh 
cd run_scripts/image_gen
nohup sh train_image_gen_stage1_distributed.sh # stage 1, train with cross-entropy loss
nohup sh train_image_gen_stage2_distributed.sh # stage 2 (start after stage 1 finishes): load the last ckpt of stage 1 and train with CLIP Score optimization

4. Inference

Run the command below to generate your images.

cd run_scripts/image_gen ; sh evaluate_image_gen.sh  # inference & evaluate (FID, IS and CLIP Score)

Visual Question Answering

Here we provide the finetuning and inference codes to reproduce the VQAv2 result reported in our paper (test-std 80.02). We believe much improvement on accuracy can still be achieved based on this codebase :)

1. Prepare the Dataset & Checkpoints

Download data (see datasets.md) and models (see checkpoints.md) and put them in the correct directory. The dataset zipfile vqa_data.zip is around 100G, and the decompressed data takes around 135G of disk storage. It contains the training, validation and testing samples together with other necessary data resources. (Since vqa_data.zip is large, we also provide chunked parts of the dataset files for more convenient and stable downloading. Please refer to issue #68.) Following common practice, VG-QA samples are also included in the training data. To adapt to the seq2seq paradigm of OFA, we transform original VQA training questions with multiple golden answers into multiple training samples. For the original VQA validation set, we keep around 10k samples for our validation and use the other samples for training. Each line of the dataset represents a VQA sample in the following format: question-id, image-id, question, answer (with confidence), predicted object labels (taken from VinVL, bringing around +0.1 accuracy improvement) and image base64 string, separated by tabs.

79459   79459   is this person wearing shorts?  0.6|!+no    house&&short&&...&&sky  /9j/4AAQS...tigZ/9k=
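The answer field pairs a confidence with an answer string, joined by `|!+` as in the sample above. The sketch below parses that field into a dictionary. Note the assumption: when a field holds several answers, they are taken to be joined with `&&` (the same separator the object labels use), which the text above does not state. `parse_answer_field` is a hypothetical helper.

```python
from typing import Dict


def parse_answer_field(field: str) -> Dict[str, float]:
    """Map each answer to its confidence, e.g. '0.6|!+no' -> {'no': 0.6}."""
    answers: Dict[str, float] = {}
    for item in field.split("&&"):  # '&&' between answers is an assumption
        confidence, answer = item.split("|!+", 1)
        answers[answer] = float(confidence)
    return answers


single_answer = parse_answer_field("0.6|!+no")
multi_answer = parse_answer_field("1.0|!+yes&&0.3|!+no")
print(single_answer, multi_answer)
```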

For finetuning on custom VQA-formulated tasks, please refer to issues #76 and #73 for more information.

2. Shuffle the Training Data

(Optional, but achieves better finetuning accuracy): If disk storage is sufficient, we recommend preparing the shuffled training data for each epoch in advance. In our experiments, shuffling brings around +0.3 improvement in VQA accuracy.

cd dataset/vqa_data
ln vqa_train.tsv vqa_train_1.tsv
for idx in $(seq 1 9); do shuf vqa_train_${idx}.tsv > vqa_train_$((idx+1)).tsv; done  # each file is used for one epoch

3. Finetuning

In our experiments, the VQA finetuning is performed on 4 servers with 8 A100 GPUs each (with RDMA). We provide the finetuning script train_vqa_distributed.sh, which supports multi-server distributed training (as well as single-server training). Please refer to the comments at the beginning of the script and set the configs according to your distributed environment. If you have shuffled the training data in the previous step, specify the training data path following the guide in the script comments. The command should be run on each worker.

# run on each worker after the distributed and data configs have been correctly set following the guide in train_vqa_distributed.sh 
cd run_scripts/vqa
bash train_vqa_distributed.sh 

In our experiments, the finetuning costs around 36 hours (for 12 epochs). After each epoch, an evaluation on validation set is performed. The best validation accuracy during finetuning will be around 80.8. The log is saved in ${log_dir}.

(Update on validation time-cost) As described in the 4. Inference section, we support 2 types of inference: beam-search and all-candidate inference. By default, all-candidate inference is used for validation during finetuning, which achieves better accuracy but takes much longer. We have added a new option to the training scripts, --val-inference-type, to switch the validation inference type during finetuning. If validation takes too long, you can refer to PR #79 to activate beam-search validation, which takes much less time, at the cost of a validation score around 0.5-0.6 lower than all-candidate validation.

4. Inference

We provide 2 types of inference, beam-search (much faster but gets sub-optimal accuracy) and all-candidate evaluation (slower but best accuracy).

For beam-search inference, use the script evaluate_vqa_beam.sh. Refer to the command below. The inference on test set costs around 16 GPU hours. After inference on test set, the result JSON file will be dumped in the ${result_path} defined in the shell script. You can submit the result test_predict.json to EvalAI. Using our released finetuned checkpoint, beam-search inference will get 80.15 validation accuracy, 79.36 test-dev accuracy and 79.48 test-std accuracy (around 0.6 lower than all-candidate evaluation).

cd run_scripts/vqa
bash evaluate_vqa_beam.sh val # specify 'val' or 'test'

For all-candidate evaluation, we recommend using the distributed script evaluate_vqa_allcand_distributed.sh. Please refer to the guide in the script to set the distributed configs before running. The result JSON file will be dumped in the ${result_path} defined in the shell script on the rank-0 server. All-candidate evaluation computes scores on all the candidate answers in the VQA dataset, which achieves 80.82 validation accuracy, 79.87 test-dev accuracy and 80.02 test-std accuracy, reproducing the results reported in our paper. However, inference on the test set costs around 1k GPU hours, which is much slower.

# run on each worker after the distributed configs have been correctly set following the guide in evaluate_vqa_allcand_distributed.sh
cd run_scripts/vqa
bash evaluate_vqa_allcand_distributed.sh val # specify 'val' or 'test'

Referring Expression Comprehension

Here we provide procedures for preparing data, training, and evaluating your model on visual grounding.

1. Prepare the Dataset & Checkpoints

Download data (see datasets.md) and models (see checkpoints.md) and put them in the correct directory. We provide RefCOCO (split by UNC), RefCOCO+ (split by UNC) and RefCOCOg (split by UMD) datasets. See RefCOCO and Refer for more details. Note that in the original dataset, each region-coord (or bounding box) may correspond to multiple descriptive texts. We split these texts into multiple samples so that the region-coord in each sample corresponds to only one text. Each line of the processed dataset represents a sample in the following format: uniq-id, image-id, text, region-coord (separated by commas) and image base64 string, separated by tabs.

79_1    237367  A woman in a white blouse holding a glass of wine.  230.79,121.75,423.66,463.06 9j/4AAQ...1pAz/9k=
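The region-coord field can be parsed into a box and compared against a prediction. The sketch below assumes the four numbers are the top-left and bottom-right corners (x0, y0, x1, y1), which is not stated explicitly for this dataset. Referring expression comprehension is commonly scored as the fraction of predictions whose IoU with the ground truth is at least 0.5; whether evaluate_refcoco.sh uses exactly this threshold is also an assumption. `parse_region_coord` and `iou` are hypothetical helpers.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed order: x0, y0, x1, y1


def parse_region_coord(field: str) -> Box:
    """Parse 'x0,y0,x1,y1' into a tuple of floats."""
    x0, y0, x1, y1 = (float(value) for value in field.split(","))
    return (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


gt_box = parse_region_coord("230.79,121.75,423.66,463.06")
same_iou = iou(gt_box, gt_box)
disjoint_iou = iou(gt_box, (0.0, 0.0, 10.0, 10.0))
print(gt_box, same_iou, disjoint_iou)
```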
2. Finetuning

Unlike the original paper, we finetune OFA with a drop-path rate of 0.2, and found that training with this hyper-parameter achieves better results. We will update the reported results of the paper later.

cd run_scripts/refcoco
nohup sh train_refcoco.sh > train_refcoco.out &  # finetune for refcoco
nohup sh train_refcocoplus.sh > train_refcocoplus.out &  # finetune for refcoco+
nohup sh train_refcocog.sh > train_refcocog.out &  # finetune for refcocog

3. Inference

Run the following commands for the evaluation.

cd run_scripts/refcoco ; sh evaluate_refcoco.sh  # inference & evaluate for refcoco/refcoco+/refcocog

Visual Entailment

We provide steps for you to reproduce our results in visual entailment. See the details below.

1. Prepare the Dataset & Checkpoints

Download data (see datasets.md) and models (see checkpoints.md) and put them in the correct directory. Each line of the processed dataset represents a sample with the following format. The information of uniq-id, image-id, image base64 string, hypothesis, caption (or text premise), label are separated by tabs.

252244149.jpg#1r1n  252244149   /9j/4AAQ...MD/2Q==   a man in pink and gold is chewing on a wooden toothpick.   a man in pink is chewing a toothpick on the subway.   neutral 
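A minimal sketch of reading one SNLI-VE line follows. It assumes the six fields appear in the order given above and that labels come from the standard set of entailment, neutral and contradiction (only "neutral" is shown in the sample). Fields are stripped because the sample line carries padding spaces. `parse_snli_ve_line` is a hypothetical helper.

```python
from typing import Dict

# Standard SNLI-VE label set; assumed to be the one used in the TSV files.
VE_LABELS = {"entailment", "neutral", "contradiction"}


def parse_snli_ve_line(line: str) -> Dict[str, str]:
    """Parse one tab-separated SNLI-VE line into named, stripped fields."""
    fields = [field.strip() for field in line.rstrip("\n").split("\t")]
    uniq_id, image_id, image_b64, hypothesis, premise, label = fields
    if label not in VE_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return {
        "uniq_id": uniq_id,
        "image_id": image_id,
        "image_b64": image_b64,
        "hypothesis": hypothesis,
        "premise": premise,
        "label": label,
    }


# Synthetic line: 'AAAA' stands in for the real base64 image string.
ve_sample = parse_snli_ve_line(
    "252244149.jpg#1r1n\t252244149\tAAAA\t"
    " a man in pink and gold is chewing on a wooden toothpick. \t"
    " a man in pink is chewing a toothpick on the subway. \tneutral "
)
print(ve_sample["label"], ve_sample["hypothesis"])
```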
2. Finetuning

In our experiments, the SNLI-VE finetuning is performed on 8 NVIDIA-V100 GPUs with 32GB memory. In this task, we experimented with only a few sets of hyperparameters. We believe that proper hyperparameter tuning can lead to further accuracy improvement.

cd run_scripts/snli_ve
nohup sh train_snli_ve.sh > train_snli_ve.out &  # finetune for snli_ve

3. Inference

Run the following command to obtain the results.

cd run_scripts/snli_ve ; sh evaluate_snli_ve.sh  # inference & evaluate for snli_ve

GLUE

Here we provide steps for you to finetune and evaluate our model on language understanding tasks. We demonstrate our practice for the GLUE benchmark.

1. Prepare the Dataset & Checkpoints

Download data (see datasets.md) and models (see checkpoints.md) and put them in the correct directory. We provide 7 language understanding datasets from the GLUE benchmark, including COLA, MNLI, MRPC, QNLI, QQP, RTE and SST2. More details about these datasets can be found in this link.

2. Finetuning

For each task, we have tried multiple sets of hyperparameters (including learning rate, batch size, training epochs). The results under different sets of hyperparameters can be found in ${log_dir}.

cd run_scripts/glue
nohup sh train_cola.sh > train_cola.out &  # finetune for cola
nohup sh train_mnli.sh > train_mnli.out &  # finetune for mnli
nohup sh train_mrpc.sh > train_mrpc.out &  # finetune for mrpc
nohup sh train_qnli.sh > train_qnli.out &  # finetune for qnli
nohup sh train_qqp.sh > train_qqp.out &  # finetune for qqp
nohup sh train_rte.sh > train_rte.out &  # finetune for rte
nohup sh train_sst2.sh > train_sst2.out &  # finetune for sst2

Image Classification on ImageNet-1K

We provide the finetuning and inference codes which reproduce 85.0 ImageNet-1K accuracy, slightly better than reported in our paper.

1. Prepare the Dataset & Checkpoints

Download data (see datasets.md) and models (see checkpoints.md) and put them in the correct directory. Our provided data is derived from the original ImageNet-1K (ILSVRC2012 train & validation) dataset and shares the same data split with it. To cast the classification task into the seq2seq paradigm, we use the synset words provided by Caffe as the generation target for each image class. Each line of the processed dataset represents a sample in the following format: image base64 string, classification label (1-indexed, conforming to the order in synset_words.txt) and synset words of the label, separated by tabs.

_9j_4AAQS...fzX__Z  769 rugby ball
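Because the classification label is 1-indexed, code that indexes into a Python list of class names needs to subtract one. The sketch below shows this under the assumption that the three fields appear in the order given above; `parse_imagenet_line` is a hypothetical helper, and the image string is left undecoded.

```python
from typing import Tuple

NUM_CLASSES = 1000  # ImageNet-1K


def parse_imagenet_line(line: str) -> Tuple[str, int, str]:
    """Return (image_b64, zero_based_label, synset_words) from one line."""
    image_b64, label_str, synset_words = line.rstrip("\n").split("\t")
    label = int(label_str)  # 1-indexed in the TSV file
    if not 1 <= label <= NUM_CLASSES:
        raise ValueError(f"label out of range: {label}")
    return image_b64, label - 1, synset_words


# Synthetic line: 'AAAA' stands in for the real image string.
img_b64, class_index, class_name = parse_imagenet_line("AAAA\t769\trugby ball")
print(class_index, class_name)
```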
2. Shuffle the Training Data

(Optional, but achieves better finetuning accuracy): If disk storage is sufficient, we recommend preparing the shuffled training data for each epoch in advance. In our experiments, shuffling brings around +0.2 improvement in ImageNet-1K accuracy.

cd dataset/imagenet_1k_data
ln imagenet_1k_train.tsv imagenet_1k_train_1.tsv
for idx in $(seq 1 9); do shuf imagenet_1k_train_${idx}.tsv > imagenet_1k_train_$((idx+1)).tsv; done  # each file is used for one epoch, in order

3. Finetuning

In our experiments, the ImageNet-1K finetuning is performed on 2 servers with 8 A100 GPUs each (with RDMA). We provide the finetuning script train_imagenet_distributed.sh, which supports multi-server distributed training (as well as single-server training). Please refer to the comments at the beginning of the script and set the configs according to your distributed environment. If you have shuffled the training data in the previous step, specify the training data path following the guide in the script comments. The command should be run on each worker. For quick evaluation during finetuning, by default we sample 20% of the original validation split and report accuracy on this subset after each epoch. The accuracy on the validation subset is generally within ±0.1 of the accuracy on the whole validation split.

# run on each worker after the distributed and data configs have been correctly set following the guide in train_imagenet_distributed.sh
cd run_scripts/image_classify
bash train_imagenet_distributed.sh

In our experiments, the finetuning costs around 80 hours (for 32 epochs). The best accuracy on validation subset during finetuning will be around 85.0. The log is saved in ${log_dir}.

4. Inference

To get the validation accuracy on the whole ImageNet-1K validation set, run the following command. The evaluation costs around 10 GPU hours. The accuracy will be reported in the stdout (expected to be around 85.0).

cd run_scripts/image_classify ; sh evaluate_imagenet.sh  # inference & evaluate for imagenet-1k

Gigaword

We provide steps for you to reproduce our results in Gigaword. See the details below.

1. Prepare the Dataset & Checkpoints

Download data (see datasets.md) and models (see checkpoints.md) and put them in the correct directory. The original dataset is taken from UniLM and we organized the data into the tsv format. Each line of the processed dataset represents a sample with the following format. The information of source and target texts are separated by tabs.

factory orders for manufactured goods rose #.# percent in september...  us september factory orders up #.# percent

2. Finetuning

Run the following command to train the model.

cd run_scripts/gigaword
nohup sh train_gigaword.sh > train_gigaword.out &  # finetune for gigaword

3. Inference

Run the following command to obtain the results (~36.43 rougeL).

cd run_scripts/gigaword ; sh evaluate_gigaword.sh  # inference & evaluate for gigaword



Gallery

Below we provide examples of OFA in text-to-image generation and open-ended VQA. We also demonstrate its performance on an unseen task (Grounded QA) and in an unseen domain (Visual Grounding on images from unseen domains).

Text-to-Image Generation (normal query)

[Figure: t2i_normal]

Text-to-Image Generation (counterfactual query)

[Figure: t2i_counterfactual]

Open-Ended VQA

[Figure: open_vqa]

Grounded QA (unseen task)

[Figure: grounded_qa]

Visual Grounding (unseen domain)

[Figure: vg]

Related Codebase

Getting Involved

Feel free to submit GitHub issues or pull requests. Contributions to our project are welcome!

To contact us, do not hesitate to send an email to [email protected] or [email protected]!

Citation

Please cite our paper if you find it helpful :)

@article{wang2022OFA,
  title={Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework},
  author={Wang, Peng and Yang, An and Men, Rui and Lin, Junyang and Bai, Shuai and Li, Zhikang and Ma, Jianxin and Zhou, Chang and Zhou, Jingren and Yang, Hongxia},
  journal={arXiv preprint arXiv:2202.03052},
  year={2022}
}



Comments
  • Inference model using Huggingface library

    First of all, thanks for your amazing work. 👍 I'm very surprised at the results you've achieved. But I have a question: is it possible to use this model with the transformers library, using the model checkpoint? You made it possible to run inference in Hugging Face Spaces, so are you planning to upload a checkpoint to the transformers library and support inference through it? The Colab you posted only shows how to use fairseq. I'll be waiting for your reply. Once again, thank you for the amazing results!

    enhancement 
    opened by fightnyy 20
  • How to evaluate prompt tuning model?

    Hi, OFA team,

    I have used the prompt tuning method to train a vqa-gen task and evaluated the model directly via run_scripts/vqa/evaluate_vqa_beam.sh, but got the error below:

    Traceback (most recent call last):
      File "../../evaluate.py", line 160, in <module>
        cli_main()
      File "../../evaluate.py", line 154, in cli_main
        distributed_utils.call_main(
      File "/workspace/project/OFA/fairseq/fairseq/distributed/utils.py", line 376, in call_main
        distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
      File "/workspace/project/OFA/fairseq/fairseq/distributed/utils.py", line 350, in distributed_main
        main(cfg, **kwargs)
      File "../../evaluate.py", line 138, in main
        result, scores = eval_step(task, generator, models, sample, **kwargs)
      File "/workspace/project/OFA/utils/eval_utils.py", line 306, in eval_step
        return eval_vqa_gen(task, generator, models, sample, **kwargs)
      File "/workspace/project/OFA/utils/eval_utils.py", line 47, in eval_vqa_gen
        hypos = task.inference_step(generator, models, sample, prefix_tokens=sample['prefix_tokens'])
      File "/workspace/project/OFA/fairseq/fairseq/tasks/fairseq_task.py", line 517, in inference_step
        return generator.generate(
      File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
        return func(*args, **kwargs)
      File "/workspace/project/OFA/models/sequence_generator.py", line 209, in generate
        return self._generate(models, sample, **kwargs)
      File "/workspace/project/OFA/models/sequence_generator.py", line 354, in _generate
        lprobs, avg_attn_scores = model.forward_decoder(
      File "/workspace/project/OFA/models/sequence_generator.py", line 824, in forward_decoder
        decoder_out = model.decoder.forward(
      File "/workspace/project/OFA/models/ofa/unify_transformer.py", line 1343, in forward
        x, extra = self.extract_features(
      File "/workspace/project/OFA/models/ofa/unify_transformer.py", line 1367, in extract_features
        return self.extract_features_scriptable(
      File "/workspace/project/OFA/models/ofa/unify_transformer.py", line 1532, in extract_features_scriptable
        x, layer_attn, _ = layer(
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/workspace/project/OFA/models/ofa/unify_transformer_layer.py", line 500, in forward
        x, attn = self.self_attn(
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/workspace/project/OFA/models/ofa/unify_multihead_attention.py", line 342, in forward
        assert key_padding_mask.size(1) == k.size(1), "{} vs {}".format(
    AssertionError: 101 vs 102
    

    Could you please provide an example for evaluating prompt tuning models, thanks?

    opened by flymark2010 14
  • ConfigAttributeError when load the checkpoint

    Hi, thanks for the great work! I ran into a problem when loading the pre-trained checkpoint (refcocog_large_best.pt). I load the model with:

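    from fairseq import checkpoint_utils, utils

    # Override the BPE directory stored in the checkpoint config
    overrides = {"bpe_dir": "utils/BPE"}
    models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
        utils.split_paths('checkpoints/refcocog.pt'),
        arg_overrides=overrides,
    )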
    

    The error occurs

    Traceback (most recent call last):
      File "eval_refcoco.py", line 22, in <module>
        arg_overrides=overrides
      File "/home/tiger/.local/lib/python3.7/site-packages/fairseq-1.0.0a0+4095baa-py3.7-linux-x86_64.egg/fairseq/checkpoint_utils.py", line 457, in load_model_ensemble_and_task
        model = task.build_model(cfg.model)
      File "/opt/tiger/OFA_offical/tasks/mm_tasks/refcoco.py", line 79, in build_model
        if self.cfg.scst:
      File "/home/tiger/.local/lib/python3.7/site-packages/omegaconf/dictconfig.py", line 305, in __getattr__
        self._format_and_raise(key=key, value=None, cause=e)
      File "/home/tiger/.local/lib/python3.7/site-packages/omegaconf/base.py", line 101, in _format_and_raise
        type_override=type_override,
      File "/home/tiger/.local/lib/python3.7/site-packages/omegaconf/_utils.py", line 629, in format_and_raise
        _raise(ex, cause)
      File "/home/tiger/.local/lib/python3.7/site-packages/omegaconf/_utils.py", line 610, in _raise
        raise ex  # set end OC_CAUSE=1 for full backtrace
      File "/home/tiger/.local/lib/python3.7/site-packages/omegaconf/dictconfig.py", line 303, in __getattr__
        return self._get_impl(key=key, default_value=DEFAULT_VALUE_MARKER)
      File "/home/tiger/.local/lib/python3.7/site-packages/omegaconf/dictconfig.py", line 361, in _get_impl
        node = self._get_node(key=key)
      File "/home/tiger/.local/lib/python3.7/site-packages/omegaconf/dictconfig.py", line 383, in _get_node
        self._validate_get(key)
      File "/home/tiger/.local/lib/python3.7/site-packages/omegaconf/dictconfig.py", line 136, in _validate_get
        key=key, value=value, cause=ConfigAttributeError(msg)
      File "/home/tiger/.local/lib/python3.7/site-packages/omegaconf/base.py", line 101, in _format_and_raise
        type_override=type_override,
      File "/home/tiger/.local/lib/python3.7/site-packages/omegaconf/_utils.py", line 694, in format_and_raise
        _raise(ex, cause)
      File "/home/tiger/.local/lib/python3.7/site-packages/omegaconf/_utils.py", line 610, in _raise
        raise ex  # set end OC_CAUSE=1 for full backtrace
    omegaconf.errors.ConfigAttributeError: Key 'scst' not in 'RefcocoConfig'
            full_key: scst
            reference_type=Optional[RefcocoConfig]
            object_type=RefcocoConfig
    

    I would appreciate your help!

    opened by zd11024 12
  • Problems with finetuned model for VQAv2 (ms coco)


    I am running inference on the VQA val set manually, using your demo Colab notebook, to collect all answers. At first I followed the notebook exactly and used the pre-trained checkpoint (OFA-Large), as in the tutorial; accuracy on the val set was around 68%. Then I switched to the finetuned checkpoint for VQAv2. It runs with the same code, but its behaviour is strange. Inference is very slow: the pre-trained model produced 214k answers in 10 hours, whereas the finetuned model produced only 30k answers in 15 hours on the same Tesla V100. Quality is also worse, at around 60% accuracy. Some answers are completely correct, but others are degenerate, for example "bedroom bedroom bedroom bedroom bedroom ...", "no no no no no no no...". For some reason the model generates very long answers and does not stop generating.

    I am a bit confused about why this is happening; maybe I am doing something wrong. I run the model as in this code: https://colab.research.google.com/drive/1lsMsF-Vum3MVyXwSVF5E-Y23rHFvj_3y?usp=sharing

    I only change the path to the finetuned model in this part:

    from fairseq import options
    from fairseq.dataclass.utils import convert_namespace_to_omegaconf
    
    parser = options.get_generation_parser()
    input_args = ["", "--task=vqa_gen", "--beam=100", "--unnormalized", "--path=checkpoints/vqa_large_best.pt", "--bpe-dir=utils/BPE"]
    args = options.parse_args_and_arch(parser, input_args)
    cfg = convert_namespace_to_omegaconf(args)
    
    opened by 25icecreamflavors 11
  • How to train VQA on my custom data?


    Hello! I am trying to finetune OFA-Large on VQA with a custom dataset, following the finetuning instructions in the repo. I have checked my .tsv and .pkl files several times, and they match the sample you provided. But after running "bash train_vqa_distributed.sh", the terminal just prints:

    total_num_updates 40000 warmup_updates 1000 lr 5e-5 patch_image_size 480

    GPU usage rises to a certain value, then suddenly drops to zero, and the program exits. I am training on a single server with 2 GPUs. Looking forward to your reply, and thanks for sharing your work!

    opened by xiaoqiang-lu 11
  • The code of OFA-base is inconsistent with the pre-trained checkpoint


    Thanks for your awesome work. Something has bothered me recently. When I continued to train OFA-base (I tried to collect all the pre-training data of OFA), I found that a few training steps (10 steps) would make the performance of OFA-base worse. I checked the config in checkpoint and the config in pretrain_ofa_base.sh, and found many differences. What might affect the results?

    In addition, I found a dimension inconsistency in the network: decoder.image_position_idx is a torch.Tensor of torch.Size([1026]) in the code but of torch.Size([1025]) in the checkpoint. Is this the reason for the decline in performance?
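
    A mismatch like this can be located systematically by comparing parameter shapes from the code against those stored in the checkpoint. The following is a minimal sketch, not part of OFA: `diff_shapes` is a hypothetical helper, and the two dictionaries are filled by hand with the shapes quoted above. With PyTorch, such dictionaries could presumably be built from `state_dict()` entries.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def diff_shapes(code_shapes: Dict[str, Shape], ckpt_shapes: Dict[str, Shape]) -> List[str]:
    """Return one line per parameter that is missing on one side or differs in shape."""
    report = []
    for name in sorted(set(code_shapes) | set(ckpt_shapes)):
        in_code = code_shapes.get(name)
        in_ckpt = ckpt_shapes.get(name)
        if in_code is None:
            report.append(f"{name}: only in checkpoint {in_ckpt}")
        elif in_ckpt is None:
            report.append(f"{name}: only in code {in_code}")
        elif in_code != in_ckpt:
            report.append(f"{name}: code {in_code} vs checkpoint {in_ckpt}")
    return report


# Shapes as reported in the issue (hand-entered, illustrative only).
code_shapes = {"decoder.image_position_idx": (1026,)}
ckpt_shapes = {"decoder.image_position_idx": (1025,)}

mismatches = diff_shapes(code_shapes, ckpt_shapes)
print("\n".join(mismatches))
```

    Whether a one-element difference in a position index buffer affects accuracy is a separate question that this comparison alone cannot answer.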

    opened by zzhanghub 10
  • Huggingface transformers inference: ModuleNotFoundError: No module named 'generate'


    When running the imports listed in transformers.md:

    from PIL import Image
    from torchvision import transforms
    from transformers import OFATokenizer, OFAModel
    from generate import sequence_generator
    

    I get ModuleNotFoundError: No module named 'generate'

    Where is generate supposed to come from? The implementations of sequence_generator.SequenceGenerator that I see in e.g. fairseq also don't have the same signature, so it's not clear how to proceed.

    opened by steve-marmalade 10
  • subprocess.CalledProcessError


    Hello

    I tried finetuning the large model for image captioning, but I keep getting subprocess.CalledProcessError. I have tried various port numbers, but none of them worked. What could be the reason for this error? (It looks like a GPU distribution issue.) Thank you so much for your help.

    export MASTER_PORT=1052
    
    log_dir=./stage2_logs
    save_dir=./stage2_checkpoints
    mkdir -p $log_dir $save_dir
    
    bpe_dir=../../utils/BPE
    user_dir=../../ofa_module
    
    data_dir=../../dataset/caption_data
    data=${data_dir}/caption_train_stage2_new.tsv,${data_dir}/caption_val_ct.tsv
    restore_file=../../checkpoints/caption_stage1_best.pt
    selected_cols=1,4,2
    
    task=caption
    arch=ofa_large
    criterion=scst_reward_criterion
    label_smoothing=0.1
    lr=1e-5
    max_epoch=5
    warmup_ratio=0.06
    batch_size=1
    update_freq=4
    resnet_drop_path_rate=0.0
    encoder_drop_path_rate=0.0
    decoder_drop_path_rate=0.0
    dropout=0.0
    attention_dropout=0.0
    max_src_length=80
    max_tgt_length=20
    num_bins=1000
    patch_image_size=480
    eval_cider_cached=${data_dir}/cider_cached_tokens/coco-valid-words.p
    scst_cider_cached=${data_dir}/cider_cached_tokens/coco-train-words.p
    
    for lr in 5e-6,; do
      echo "lr "${lr}
      for max_epoch in 5; do
        echo "max_epoch "${max_epoch}
    
        log_file=${log_dir}/${lr}"_"${max_epoch}".log"
        save_path=${save_dir}/${lr}"_"${max_epoch}
        mkdir -p $save_path
    
        CUDA_VISIBLE_DEVICES=1,2 python3 -m torch.distributed.launch --nproc_per_node=2 --master_port=${MASTER_PORT} ../../train.py \
            $data \
            --selected-cols=${selected_cols} \
            --bpe-dir=${bpe_dir} \
            --user-dir=${user_dir} \
            --restore-file=${restore_file} \
            --reset-optimizer --reset-dataloader --reset-meters \
            --save-dir=${save_path} \
            --task=${task} \
            --arch=${arch} \
            --criterion=${criterion} \
            --batch-size=${batch_size} \
            --update-freq=${update_freq} \
            --encoder-normalize-before \
            --decoder-normalize-before \
            --share-decoder-input-output-embed \
            --share-all-embeddings \
            --layernorm-embedding \
            --patch-layernorm-embedding \
            --code-layernorm-embedding \
            --resnet-drop-path-rate=${resnet_drop_path_rate} \
            --encoder-drop-path-rate=${encoder_drop_path_rate} \
            --decoder-drop-path-rate=${decoder_drop_path_rate} \
            --dropout=${dropout} \
            --attention-dropout=${attention_dropout} \
            --weight-decay=0.01 --optimizer=adam --adam-betas="(0.9,0.999)" --adam-eps=1e-08 --clip-norm=1.0 \
            --lr-scheduler=polynomial_decay --lr=${lr} --end-learning-rate=2e-7 \
            --max-epoch=${max_epoch} --warmup-ratio=${warmup_ratio} \
            --log-format=simple --log-interval=10 \
            --fixed-validation-seed=7 \
            --no-epoch-checkpoints --keep-best-checkpoints=1 \
            --save-interval=1 --validate-interval=1 \
            --save-interval-updates=500 --validate-interval-updates=500 \
            --eval-cider \
            --eval-cider-cached-tokens=${eval_cider_cached} \
            --eval-args='{"beam":5,"max_len_b":16,"no_repeat_ngram_size":3}' \
            --best-checkpoint-metric=cider --maximize-best-checkpoint-metric \
            --max-src-length=${max_src_length} \
            --max-tgt-length=${max_tgt_length} \
            --find-unused-parameters \
            --freeze-encoder-embedding \
            --freeze-decoder-embedding \
            --freeze-resnet \
            --add-type-embedding \
            --scale-attn \
            --scale-fc \
            --scale-heads \
            --disable-entangle \
            --num-bins=${num_bins} \
            --patch-image-size=${patch_image_size} \
            --scst \
            --scst-cider-cached-tokens=${scst_cider_cached} \
            --scst-args='{"beam":5,"max_len_b":16,"no_repeat_ngram_size":3}' \
            --memory-efficient-fp16 \
            --fp16-scale-window=512 \
            --num-workers=0 > ${log_file} 2>&1
      done
    done
    
    opened by Jihyun0510 10
  • How to train OFA for open-ended VQA?

    Dear authors, thanks for the great work! In VQA validation, I want the model to predict the most likely next token (i.e. one token of the answer) from the output logits, append that token to the input, and repeat this step until the model predicts ⟨EOS⟩. How can I achieve this? Thanks a lot!
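
    The loop described above is usually called greedy decoding. Below is a minimal, model-agnostic sketch of it. It does not use the OFA or fairseq API: `next_token_logits` is a hypothetical callable standing in for one forward pass of the model, and `toy_logits` is a made-up stand-in so that the example runs on its own.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    next_token_logits: Callable[[List[int]], Sequence[float]],
    prompt: List[int],
    eos_id: int,
    max_new_tokens: int = 20,
) -> List[int]:
    """Pick the arg-max token at each step and feed it back until EOS appears."""
    tokens = list(prompt)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)
        # Index of the largest logit, i.e. the most likely next token.
        next_id = max(range(len(logits)), key=lambda i: logits[i])
        generated.append(next_id)
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return generated


def toy_logits(tokens: List[int]) -> List[float]:
    """Hypothetical model over a 5-token vocabulary: always favors last token + 1."""
    target = (tokens[-1] + 1) % 5
    return [1.0 if i == target else 0.0 for i in range(5)]


answer = greedy_decode(toy_logits, prompt=[1], eos_id=0)
print(answer)
```

    The `max_new_tokens` cap is there so that the loop terminates even if the model never emits EOS.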

    opened by qyc-98 10
  • Additional issues trying to finetune on custom (VQA-like) dataset (VizWiz)


    Hello, first I'd like to thank you for your amazing work and especially all the detailed answers to issues.

    I've been following the different issues on the finetuning on a custom dataset (VizWiz) and produced the .tsv files according to your format. You stated in issue #76 that the trainval_ans2label.pkl file is not used when using beam-search evaluation - is this correct?

    I've skipped its creation and training does run for the first epoch. However, upon validation on the valid subset, I get an assertion error in the sequence_generator.py - I've tracked down the error and I can "fix" it by removing the one extra step that is for the EOS marker, but my understanding of how to properly fix that error is limited.

    To give some more information of how the .tsv files look, I have attached an image for the train and val subset.

    Thank you very much in advance for any kind of input!

    opened by Velcorn 10
  • How can I handle this in a modified model?


    Hi, I added another layer to the model, but a problem occurs after several steps.

    2022-03-21 23:16:50 - progress_bar.py[line:272] - INFO: epoch 001:     41 / 24544 loss=1.825, loss_v1=0, loss_v2=0, nll_loss=1.825, ntokens=16, nsentences=16, sample_size=16, sample_size_v1=0, sample_size_v2=0, ppl=3.54, wps=11.3, ups=0.7, wpb=16, bsz=16, num_updates=41, lr=5.56838e-07, gnorm=32.218, clip=100, loss_scale=16, train_wall=1, gb_free=14.5, wall=67
    2022-03-21 23:16:51 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 8.0
    2022-03-21 23:16:53 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 4.0
    2022-03-21 23:16:54 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 2.0
    2022-03-21 23:16:55 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 1.0
    2022-03-21 23:16:56 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.5
    2022-03-21 23:16:57 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
    2022-03-21 23:16:58 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.125
    2022-03-21 23:16:59 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0625
    2022-03-21 23:17:01 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.03125
    2022-03-21 23:17:02 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.015625
    2022-03-21 23:17:02 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0078125
    2022-03-21 23:17:03 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.00390625
    2022-03-21 23:17:04 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.001953125
    2022-03-21 23:17:05 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0009765625
    2022-03-21 23:17:06 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.00048828125
    2022-03-21 23:17:07 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.000244140625
    2022-03-21 23:17:08 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0001220703125
    /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:787: UserWarning: Using a non-full backward hook when the forward contains multiple autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_input. Please use register_full_backward_hook to get the documented behavior.
      warnings.warn("Using a non-full backward hook when the forward contains multiple autograd Nodes "
    /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:787: UserWarning: Using a non-full backward hook when the forward contains multiple autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_input. Please use register_full_backward_hook to get the documented behavior.
      warnings.warn("Using a non-full backward hook when the forward contains multiple autograd Nodes "
    /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:787: UserWarning: Using a non-full backward hook when the forward contains multiple autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_input. Please use register_full_backward_hook to get the documented behavior.
      warnings.warn("Using a non-full backward hook when the forward contains multiple autograd Nodes "
    /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:752: UserWarning: Using non-full backward hooks on a Module that does not return a single Tensor or a tuple of Tensors is deprecated and will be removed in future versions. This hook will be missing some of the grad_output. Please use register_full_backward_hook to get the documented behavior.
      warnings.warn("Using non-full backward hooks on a Module that does not return a "
    /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:752: UserWarning: Using non-full backward hooks on a Module that does not return a single Tensor or a tuple of Tensors is deprecated and will be removed in future versions. This hook will be missing some of the grad_output. Please use register_full_backward_hook to get the documented behavior.
      warnings.warn("Using non-full backward hooks on a Module that does not return a "
    /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:787: UserWarning: Using a non-full backward hook when the forward contains multiple autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_input. Please use register_full_backward_hook to get the documented behavior.
      warnings.warn("Using a non-full backward hook when the forward contains multiple autograd Nodes "
    /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:752: UserWarning: Using non-full backward hooks on a Module that does not return a single Tensor or a tuple of Tensors is deprecated and will be removed in future versions. This hook will be missing some of the grad_output. Please use register_full_backward_hook to get the documented behavior.
      warnings.warn("Using non-full backward hooks on a Module that does not return a "
    /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:752: UserWarning: Using non-full backward hooks on a Module that does not return a single Tensor or a tuple of Tensors is deprecated and will be removed in future versions. This hook will be missing some of the grad_output. Please use register_full_backward_hook to get the documented behavior.
      warnings.warn("Using non-full backward hooks on a Module that does not return a "
    /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:762: UserWarning: Using non-full backward hooks on a Module that does not take as input a single Tensor or a tuple of Tensors is deprecated and will be removed in future versions. This hook will be missing some of the grad_input. Please use register_full_backward_hook to get the documented behavior.
      warnings.warn("Using non-full backward hooks on a Module that does not take as input a "
    /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:762: UserWarning: Using non-full backward hooks on a Module that does not take as input a single Tensor or a tuple of Tensors is deprecated and will be removed in future versions. This hook will be missing some of the grad_input. Please use register_full_backward_hook to get the documented behavior.
      warnings.warn("Using non-full backward hooks on a Module that does not take as input a "
    /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:762: UserWarning: Using non-full backward hooks on a Module that does not take as input a single Tensor or a tuple of Tensors is deprecated and will be removed in future versions. This hook will be missing some of the grad_input. Please use register_full_backward_hook to get the documented behavior.
      warnings.warn("Using non-full backward hooks on a Module that does not take as input a "
    /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:762: UserWarning: Using non-full backward hooks on a Module that does not take as input a single Tensor or a tuple of Tensors is deprecated and will be removed in future versions. This hook will be missing some of the grad_input. Please use register_full_backward_hook to get the documented behavior.
      warnings.warn("Using non-full backward hooks on a Module that does not take as input a "
    /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:777: UserWarning: Using a non-full backward hook when outputs are generated by different autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_output. Please use register_full_backward_hook to get the documented behavior.
      warnings.warn("Using a non-full backward hook when outputs are generated by different autograd Nodes "
    /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:777: UserWarning: Using a non-full backward hook when outputs are generated by different autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_output. Please use register_full_backward_hook to get the documented behavior.
      warnings.warn("Using a non-full backward hook when outputs are generated by different autograd Nodes "
    /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:777: UserWarning: Using a non-full backward hook when outputs are generated by different autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_output. Please use register_full_backward_hook to get the documented behavior.
      warnings.warn("Using a non-full backward hook when outputs are generated by different autograd Nodes "
    /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:777: UserWarning: Using a non-full backward hook when outputs are generated by different autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_output. Please use register_full_backward_hook to get the documented behavior.
      warnings.warn("Using a non-full backward hook when outputs are generated by different autograd Nodes "
    2022-03-21 23:17:09 - nan_detector.py[line:89] - WARNING: NaN detected in output of encoder.layers.2.moe.moe_layer, shape: torch.Size([60, 1, 768]), forward input max: 3.67578125, input min: -7.75
    Traceback (most recent call last):
      File "/workspace/OFA/trainer.py", line 871, in train_step
        grad_norm = self.clip_grad_norm(self.cfg.optimization.clip_norm)
      File "/workspace/OFA/trainer.py", line 1208, in clip_grad_norm
        return self.optimizer.clip_grad_norm(
      File "/workspace/OFA/fairseq/fairseq/optim/fp16_optimizer.py", line 200, in clip_grad_norm
        self.scaler.check_overflow(grad_norm)
      File "/workspace/OFA/fairseq/fairseq/optim/dynamic_loss_scaler.py", line 61, in check_overflow
        raise FloatingPointError(
    FloatingPointError: Minimum loss scale reached (0.0001). Your loss is probably exploding. Try lowering the learning rate, using gradient clipping or increasing the batch size.
    

    Then training broke down. How can I fix this problem? Is it a matter of hyperparameter tuning, or is there something else I need to pay attention to? I would really appreciate your help!
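
    The sequence of loss scales in the log follows from simple arithmetic: every overflowing step halves the scale until it would fall to or below the minimum of 0.0001 named in the error message. The sketch below reproduces that sequence under simplifying assumptions (every step overflows, the scale is divided by 2 each time). It is not fairseq's scaler, and `simulate_overflows` is a hypothetical helper.

```python
from typing import List


def simulate_overflows(initial_scale: float, min_scale: float, factor: float = 2.0) -> List[float]:
    """Return the loss scales set after each overflow, stopping before the minimum is hit."""
    scales: List[float] = []
    scale = initial_scale
    while True:
        scale /= factor
        if scale <= min_scale:
            # At this point training would abort with "Minimum loss scale reached".
            return scales
        scales.append(scale)


# loss_scale=16 is the value shown in the progress line; 1e-4 is the reported minimum.
scales = simulate_overflows(initial_scale=16.0, min_scale=1e-4)
print(len(scales), scales[0], scales[-1])
```

    Under these assumptions the simulation yields 17 scale values, from 8.0 down to 0.0001220703125, matching the 17 "gradient overflow detected" lines above. In other words, every single step after update 41 overflowed, which points at the forward pass producing non-finite values (the NaN in `encoder.layers.2.moe.moe_layer`) rather than at an occasional spike.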

    opened by dannyxiaocn 10
  • Amount of training data for OFA-CN-MUGE

        It is fine to just use the pretrained model as it is, since it is pretrained on many image-text pairs. See https://github.com/OFA-Sys/OFA/blob/main/checkpoints_cn.md. To achieve better results, finetuning on domain-specific data is recommended. For now we only provide one caption model, finetuned on MUGE caption data, which was collected from e-commerce.
    

    Originally posted by @JustinLin610 in https://github.com/OFA-Sys/OFA/issues/227#issuecomment-1236575608

    How much training data did you use to finetune OFA-CN-MUGE? The 50000 images in ECommerce-IC.zip from https://tianchi.aliyun.com/dataset/107332?

    opened by yangjianxin1 0
  • Question about finetuning the document task

    Thanks for the great repo. I really appreciate the document task; it is admirable. I want to finetune the document task, that is, train the OCR task. I would be very happy if you could give me a step-by-step guide or tutorial for finetuning it. I may be wrong, but I don't see any documentation for this; you only provide caption, classification, etc., not OCR. Once more, many thanks.

    opened by phamkhactu 1
  • code for converting OFA fairseq ckpt --> huggingface inference ckpt


    Hi there,

    I was curious if you might be able to make public the code that was used to convert the OFA fairseq checkpoints to the huggingface format. Apologies if this is already in the codebase --- I couldn't find it anywhere after some searching.

    What do you think?

    Jack

    opened by jmhessel 4
  • cannot import name 'TransformerEncoderLayer' from partially initialized module 'fairseq.modules' (most likely due to a circular import)


    /home/pemi/miniconda3/envs/env/lib/python3.9/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
    and will be removed in future. Use torchrun.
    Note that --use_env is set by default in torchrun.
    If your script expects `--local_rank` argument to be set, please
    change it to read from `os.environ['LOCAL_RANK']` instead. See 
    https://pytorch.org/docs/stable/distributed.html#launch-utility for 
    further instructions
    
      warnings.warn(
    WARNING:torch.distributed.run:
    *****************************************
    Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
    *****************************************
    Traceback (most recent call last):
      File "/home/pemi/OFA/run_scripts/glue/../../train.py", line 29, in <module>
        from fairseq import (
      File "/home/pemi/OFA/fairseq/fairseq/quantization_utils.py", line 8, in <module>
        from fairseq.modules.quantization import pq, quantization_options, scalar
      File "/home/pemi/OFA/fairseq/fairseq/modules/__init__.py", line 39, in <module>
        from .transformer_layer import TransformerDecoderLayer, TransformerEncoderLayer
      File "/home/pemi/OFA/fairseq/fairseq/modules/transformer_layer.py", line 15, in <module>
        from fairseq.models.transformer import (
      File "/home/pemi/OFA/fairseq/fairseq/models/__init__.py", line 236, in <module>
        import_models(models_dir, "fairseq.models")
      File "/home/pemi/OFA/fairseq/fairseq/models/__init__.py", line 218, in import_models
        importlib.import_module(namespace + "." + model_name)
      File "/home/pemi/miniconda3/envs/env/lib/python3.9/importlib/__init__.py", line 127, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "/home/pemi/OFA/fairseq/fairseq/models/speech_to_text/__init__.py", line 7, in <module>
        from .convtransformer import *  # noqa
      File "/home/pemi/OFA/fairseq/fairseq/models/speech_to_text/convtransformer.py", line 19, in <module>
        from fairseq.modules import LayerNorm, PositionalEmbedding, TransformerEncoderLayer
    ImportError: cannot import name 'TransformerEncoderLayer' from partially initialized module 'fairseq.modules' (most likely due to a circular import) (/home/pemi/OFA/fairseq/fairseq/modules/__init__.py)
    Traceback (most recent call last):
      File "/home/pemi/OFA/run_scripts/glue/../../train.py", line 29, in <module>
        from fairseq import (
      File "/home/pemi/OFA/fairseq/fairseq/quantization_utils.py", line 8, in <module>
        from fairseq.modules.quantization import pq, quantization_options, scalar
      File "/home/pemi/OFA/fairseq/fairseq/modules/__init__.py", line 39, in <module>
        from .transformer_layer import TransformerDecoderLayer, TransformerEncoderLayer
      File "/home/pemi/OFA/fairseq/fairseq/modules/transformer_layer.py", line 15, in <module>
        from fairseq.models.transformer import (
      File "/home/pemi/OFA/fairseq/fairseq/models/__init__.py", line 236, in <module>
        import_models(models_dir, "fairseq.models")
      File "/home/pemi/OFA/fairseq/fairseq/models/__init__.py", line 218, in import_models
        importlib.import_module(namespace + "." + model_name)
      File "/home/pemi/miniconda3/envs/env/lib/python3.9/importlib/__init__.py", line 127, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "/home/pemi/OFA/fairseq/fairseq/models/speech_to_text/__init__.py", line 7, in <module>
        from .convtransformer import *  # noqa
      File "/home/pemi/OFA/fairseq/fairseq/models/speech_to_text/convtransformer.py", line 19, in <module>
        from fairseq.modules import LayerNorm, PositionalEmbedding, TransformerEncoderLayer
    ImportError: cannot import name 'TransformerEncoderLayer' from partially initialized module 'fairseq.modules' (most likely due to a circular import) (/home/pemi/OFA/fairseq/fairseq/modules/__init__.py)
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6079) of binary: /home/pemi/miniconda3/envs/env/bin/python3
    Traceback (most recent call last):
      File "/home/pemi/miniconda3/envs/env/lib/python3.9/runpy.py", line 197, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/home/pemi/miniconda3/envs/env/lib/python3.9/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/home/pemi/miniconda3/envs/env/lib/python3.9/site-packages/torch/distributed/launch.py", line 195, in <module>
        main()
      File "/home/pemi/miniconda3/envs/env/lib/python3.9/site-packages/torch/distributed/launch.py", line 191, in main
        launch(args)
      File "/home/pemi/miniconda3/envs/env/lib/python3.9/site-packages/torch/distributed/launch.py", line 176, in launch
        run(args)
      File "/home/pemi/miniconda3/envs/env/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
        elastic_launch(
      File "/home/pemi/miniconda3/envs/env/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/home/pemi/miniconda3/envs/env/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
    ============================================================
    ../../train.py FAILED
    ------------------------------------------------------------
    Failures:
    [1]:
      time      : 2022-12-16_09:50:47
      host      : baker
      rank      : 1 (local_rank: 1)
      exitcode  : 1 (pid: 6080)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
    ------------------------------------------------------------
    Root Cause (first observed failure):
    [0]:
      time      : 2022-12-16_09:50:47
      host      : baker
      rank      : 0 (local_rank: 0)
      exitcode  : 1 (pid: 6079)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
    ============================================================
    
    
    opened by peminguyen 0
  • Is a single RTX A6000 able to finetune this model on a custom dataset?

    Hello, I have recently been trying to use this model in my research, and I wonder whether it is possible to finetune OFA on a custom dataset, with most images from a specialized field (civil engineering), using only one RTX A6000 (48 GB). I would also like to know whether this is likely to give good results, since the model would need to learn a lot of domain-specific knowledge. Thank you.

    opened by Practicing7 1
Owner
OFA Sys
Understanding and Improving Encoder Layer Fusion in Sequence-to-Sequence Learning (ICLR 2021)

Understanding and Improving Encoder Layer Fusion in Sequence-to-Sequence Learning (ICLR 2021) Citation Please cite as: @inproceedings{liu2020understan

Sunbow Liu 22 Nov 25, 2022
Sequence-to-Sequence learning using PyTorch

Seq2Seq in PyTorch This is a complete suite for training sequence-to-sequence models in PyTorch. It consists of several models and code to both train

Elad Hoffer 514 Nov 17, 2022
Open source implementation of AceNAS: Learning to Rank Ace Neural Architectures with Weak Supervision of Weight Sharing

AceNAS This repo is the experiment code of AceNAS, and is not considered as an official release. We are working on integrating AceNAS as a built-in st

Yuge Zhang 6 Sep 7, 2022
Keras like implementation of Deep Learning architectures from scratch using numpy.

Mini-Keras Keras like implementation of Deep Learning architectures from scratch using numpy. How to contribute? The project contains implementations

MANU S PILLAI 5 Oct 10, 2021
Learning Versatile Neural Architectures by Propagating Network Codes

Learning Versatile Neural Architectures by Propagating Network Codes Mingyu Ding, Yuqi Huo, Haoyu Lu, Linjie Yang, Zhe Wang, Zhiwu Lu, Jingdong Wang,

Mingyu Ding 36 Dec 6, 2022
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

Segmentation Transformer Implementation of Segmentation Transformer in PyTorch, a new model to achieve SOTA in semantic segmentation while using trans

Abhay Gupta 161 Dec 8, 2022
[CVPR 2021] Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

[CVPR 2021] Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

Fudan Zhang Vision Group 897 Jan 5, 2023
Sequence to Sequence Models with PyTorch

Sequence to Sequence models with PyTorch This repository contains implementations of Sequence to Sequence (Seq2Seq) models in PyTorch At present it ha

Sandeep Subramanian 708 Dec 19, 2022
Pervasive Attention: 2D Convolutional Networks for Sequence-to-Sequence Prediction

This is a fork of Fairseq(-py) with implementations of the following models: Pervasive Attention - 2D Convolutional Neural Networks for Sequence-to-Se

Maha 490 Dec 15, 2022