Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks

Jie Lei 雷杰

Last update: Jan 4, 2023

Overview

ClipBERT

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling

Jie Lei*, Linjie Li*, Luowei Zhou, Zhe Gan, Tamara L. Berg, Mohit Bansal, Jingjing Liu

Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks. It takes raw videos/images + text as inputs, and outputs task predictions. ClipBERT is designed based on 2D CNNs and transformers, and uses a sparse sampling strategy to enable efficient end-to-end video-and-language learning. In this repository, we support end-to-end pretraining and finetuning for the following tasks:

Image-text pretraining on COCO and VG captions.
Text-to-video retrieval finetuning on MSRVTT, DiDeMo, and ActivityNet Captions.
Video-QA finetuning on TGIF-QA and MSRVTT-QA.
Image-QA finetuning on VQA 2.0.

It is also feasible and easy to add other image-text or video-text tasks for pretraining and finetuning.
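
To make the sparse sampling idea concrete, the sketch below picks a few short clips from a video by frame index instead of using every frame. It is a simplified illustration and not the repository's data loader: the function name `sample_sparse_clips`, its parameters, and the choice of consecutive frames within a clip are all assumptions.

```python
import random


def sample_sparse_clips(num_frames, n_clips, frames_per_clip, seed=None):
    """Pick `n_clips` random clips, each a list of consecutive frame indices.

    Hypothetical helper for illustration; not part of the ClipBERT code base.
    """
    if frames_per_clip > num_frames:
        raise ValueError("clip is longer than the video")
    rng = random.Random(seed)
    last_start = num_frames - frames_per_clip
    clips = []
    for _ in range(n_clips):
        start = rng.randint(0, last_start)  # randint is inclusive on both ends
        clips.append(list(range(start, start + frames_per_clip)))
    return clips


# A 300-frame video reduced to 2 clips of 4 frames each (8 frames in total).
clips = sample_sparse_clips(num_frames=300, n_clips=2, frames_per_clip=4, seed=0)
print(clips)
```

Only the sampled frames would then be passed to the model, which is what keeps end-to-end training affordable.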

Requirements

We provide a Docker image for easier reproduction. Please install the following:

Our scripts require the user to be a member of the docker group so that docker commands can be run without sudo. We only support Linux with NVIDIA GPUs. We test on Ubuntu 18.04 and V100 cards. We use mixed-precision training, so GPUs with Tensor Cores are recommended.

Getting Started

General

Create a folder that stores pretrained models, all the data, and results.

PATH_TO_STORAGE=/path/to/your/data/
mkdir -p $PATH_TO_STORAGE/txt_db  # annotations
mkdir -p $PATH_TO_STORAGE/vis_db  # image and video 
mkdir -p $PATH_TO_STORAGE/finetune  # finetuning results
mkdir -p $PATH_TO_STORAGE/pretrained  # pretrained models
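
If you prefer to prepare the storage folder from Python, the following minimal sketch creates the same four subfolders as the mkdir commands above. The helper name `make_storage_layout` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Subfolder names taken from the mkdir commands above.
SUBFOLDERS = ("txt_db", "vis_db", "finetune", "pretrained")


def make_storage_layout(root):
    """Create the storage subfolders under `root` and return their paths."""
    root = Path(root)
    paths = {name: root / name for name in SUBFOLDERS}
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    return paths
```

For example, `make_storage_layout("/path/to/your/data")` returns a dict that maps each subfolder name to its path.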

Download pretrained models.

Our e2e pretrained ClipBERT model (849MB) can be downloaded with the following command.
```
bash scripts/download_pretrained.sh $PATH_TO_STORAGE
```
This pretrained model can be used for finetuning on video-text tasks and image-text tasks. For your convenience, this script will also download bert-base-uncased and grid-feat-vqa model weights, which are used as initialization for pretraining.

Launch the Docker container for running the experiments.
```
# docker image should be automatically pulled
source launch_container.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/vis_db \
    $PATH_TO_STORAGE/finetune $PATH_TO_STORAGE/pretrained
```
The launch script respects the $CUDA_VISIBLE_DEVICES environment variable. Note that the source code is mounted into the container under /clipbert instead of being built into the image, so user modifications are reflected without rebuilding the image. (Data folders are mounted into the container separately for flexibility in folder structure.)
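
As a rough illustration of how $CUDA_VISIBLE_DEVICES can guide the number of processes to launch, the sketch below parses the variable into a list of GPU ids. The helper `visible_gpu_ids` is hypothetical, and matching the process count to the number of visible GPUs is an assumption about typical usage, not something the launch script enforces.

```python
import os


def visible_gpu_ids(environ=None):
    """Return the GPU ids listed in CUDA_VISIBLE_DEVICES, or None if it is unset."""
    environ = os.environ if environ is None else environ
    value = environ.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None  # unset: no restriction on visible GPUs
    return [item.strip() for item in value.split(",") if item.strip()]


# Example with an explicit mapping instead of the real environment.
ids = visible_gpu_ids({"CUDA_VISIBLE_DEVICES": "0,1,2,3"})
print(len(ids))  # number of visible GPUs, a natural candidate for `horovodrun -np`
```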

Downstream Task Finetuning

Text-to-Video Retrieval

Tasks: MSRVTT retrieval, DiDeMo and ActivityNet Captions paragraph-to-video retrieval, MSRVTT MC Test.

Download data.

# outside the container  
# download videos + annotations for $DSET
bash scripts/download_$DSET.sh $PATH_TO_STORAGE

$DSET can be one of msrvtt, didemo, anet.

Finetuning.

# inside the container
horovodrun -np 4 python src/tasks/run_video_retrieval.py \
    --config $CONFIG_PATH \
    --output_dir $OUTPUT_DIR

# for single GPU
python src/tasks/run_video_retrieval.py \
    --config $CONFIG_PATH \
    --output_dir $OUTPUT_DIR

$CONFIG_PATH should be set to one of the .json config files in src/configs whose names contain the substring _ret. For example, you can use src/configs/msrvtt_ret_base_resnet50.json for MSRVTT retrieval.
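
To see which config files are candidates for $CONFIG_PATH, a small sketch like the one below can list the .json files whose names contain `_ret`, as in the example file name msrvtt_ret_base_resnet50.json. The helper `find_configs` is hypothetical and only inspects file names.

```python
from pathlib import Path


def find_configs(config_dir, substring):
    """Return sorted .json config files in `config_dir` whose names contain `substring`."""
    return sorted(p for p in Path(config_dir).glob("*.json") if substring in p.name)
```

Calling `find_configs("src/configs", "_ret")` from the repository root would return the retrieval configs; the same helper works for other task substrings.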

Run inference.
```
# inside the container
horovodrun -np 4 python src/tasks/run_video_retrieval.py \
  --do_inference 1 --output_dir $OUTPUT_DIR \
  --inference_split val --inference_model_step $STEP \
  --inference_txt_db $TXT_DB \
  --inference_img_db $IMG_DB --inference_batch_size 64 \
  --inference_n_clips $INFERENCE_N_CLIPS
```
$STEP is an integer that tells the script to use the checkpoint $OUTPUT_DIR/ckpt/model_step_$STEP.pt for inference. $TXT_DB and $IMG_DB are the paths to the annotation file and the video data. You can use TXT_DB=/txt/downstream/msrvtt_retrieval/msrvtt_retrieval_val.jsonl and IMG_DB=/img/msrvtt for inference on the MSRVTT retrieval val split. The results will be written under $OUTPUT_DIR. You can use different $INFERENCE_N_CLIPS for inference, such as 1 or 16. Using more clips will have a large impact on inference speed and memory usage, so you may want to use a smaller batch size when a larger value is set.
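
The checkpoint naming scheme $OUTPUT_DIR/ckpt/model_step_$STEP.pt can be used to find valid values for $STEP. The following sketch builds the path for a given step and lists the steps available in an output folder; both helper names are hypothetical and rely only on the naming pattern stated above.

```python
import re
from pathlib import Path


def checkpoint_path(output_dir, step):
    """Build the checkpoint path $OUTPUT_DIR/ckpt/model_step_$STEP.pt."""
    return Path(output_dir) / "ckpt" / f"model_step_{int(step)}.pt"


def available_steps(output_dir):
    """List the step numbers of checkpoints found under $OUTPUT_DIR/ckpt."""
    pattern = re.compile(r"model_step_(\d+)\.pt$")
    steps = []
    for path in (Path(output_dir) / "ckpt").glob("model_step_*.pt"):
        match = pattern.match(path.name)
        if match:
            steps.append(int(match.group(1)))
    return sorted(steps)
```

Any value returned by `available_steps(output_dir)` can be passed as `--inference_model_step`.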

After the MSRVTT retrieval model is trained, you can also use it for inference on the MSRVTT MC Test task, which is essentially a retrieval task in a multiple-choice setup.
```
# inside the container
horovodrun -np 4 python src/tasks/run_msrvtt_mc.py \
  --do_inference 1 --output_dir $OUTPUT_DIR \
  --inference_split val --inference_model_step $STEP \
  --inference_txt_db /txt/downstream/msrvtt_retrieval_mc/msrvtt_retrieval_mc_test.jsonl \
  --inference_img_db /img/msrvtt --inference_batch_size 64 \
  --inference_n_clips $INFERENCE_N_CLIPS
```

Video Question Answering

Tasks: TGIF-QA action, transition, and frameQA tasks; MSRVTT-QA.

Download data.

# outside the container  
# download MSRVTT videos, and QA + retrieval annotations
bash scripts/download_msrvtt.sh $PATH_TO_STORAGE  
# download TGIF-QA videos and annotations
bash scripts/download_tgif_qa.sh $PATH_TO_STORAGE

Finetuning.
```
# inside the container
horovodrun -np 4 python src/tasks/run_video_qa.py \
    --config $CONFIG_PATH \
    --output_dir $OUTPUT_DIR
```
$CONFIG_PATH should be set to one of the .json config files in src/configs whose names contain the substring _qa. For example, you can use src/configs/msrvtt_qa_base_resnet50.json for MSRVTT-QA.

Run inference.
```
# inside the container
horovodrun -np 4 python src/tasks/run_video_qa.py \
  --do_inference 1 --output_dir $OUTPUT_DIR \
  --inference_split val --inference_model_step $STEP \
  --inference_txt_db $TXT_DB \
  --inference_img_db $IMG_DB --inference_batch_size 64 \
  --inference_n_clips $INFERENCE_N_CLIPS
```
$STEP is an integer that tells the script to use the checkpoint $OUTPUT_DIR/ckpt/model_step_$STEP.pt for inference. $TXT_DB and $IMG_DB are the paths to the annotation file and the video data. You can use TXT_DB=/txt/downstream/msrvtt_qa/msrvtt_qa_val.jsonl and IMG_DB=/img/msrvtt for inference on the MSRVTT-QA val split.

The results will be written under $OUTPUT_DIR. You can use different $INFERENCE_N_CLIPS for inference, such as 1 or 16. Using more clips will have a large impact on inference speed and memory usage. You may want to use smaller batch sizes if larger values are set.
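
One simple way to pick a smaller batch size when raising $INFERENCE_N_CLIPS is to keep the product of batch size and number of clips roughly constant. This is only a rule of thumb that assumes memory use grows about linearly with the number of clips per batch; it has not been measured on ClipBERT, and the helper name is hypothetical.

```python
def scaled_batch_size(base_batch_size, base_n_clips, n_clips):
    """Shrink the batch size so that batch_size * n_clips stays roughly constant."""
    if min(base_batch_size, base_n_clips, n_clips) < 1:
        raise ValueError("all arguments must be positive integers")
    return max(1, (base_batch_size * base_n_clips) // n_clips)


# Starting from batch size 64 with 1 clip, moving to 16 clips.
print(scaled_batch_size(64, 1, 16))  # 4
```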

Image Question Answering (VQA)

Download data

# outside the container
# download COCO and VG data
bash scripts/download_coco_vg.sh $PATH_TO_STORAGE
# download VQA annotations
bash scripts/download_vqa.sh $PATH_TO_STORAGE

Finetuning

# inside the container
horovodrun -np 4 python src/tasks/run_vqa.py \
    --config src/configs/vqa_base_resnet50.json \
    --output_dir $OUTPUT_DIR

Inference

# inside the container
horovodrun -np 4 python src/tasks/run_vqa.py \
  --do_inference 1 --output_dir $OUTPUT_DIR \
  --inference_split val --inference_model_step $STEP \
  --inference_txt_db $TXT_DB \
  --inference_img_db $IMG_DB \
  --inference_batch_size 64
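
When inference is launched from a script, it can help to assemble the command above programmatically. The sketch below builds the argument list using only the flags shown in this section; the function name and the example paths are hypothetical placeholders.

```python
import shlex


def build_vqa_inference_cmd(output_dir, step, txt_db, img_db,
                            n_procs=4, batch_size=64, split="val"):
    """Assemble the VQA inference command shown above as an argument list."""
    return [
        "horovodrun", "-np", str(n_procs),
        "python", "src/tasks/run_vqa.py",
        "--do_inference", "1",
        "--output_dir", str(output_dir),
        "--inference_split", split,
        "--inference_model_step", str(step),
        "--inference_txt_db", str(txt_db),
        "--inference_img_db", str(img_db),
        "--inference_batch_size", str(batch_size),
    ]


# Placeholder paths for illustration only; substitute your own values.
cmd = build_vqa_inference_cmd("/storage/out", 1000, "/txt/vqa.jsonl", "/img/coco")
print(shlex.join(cmd))
```

Inside the container, the resulting list could be passed to `subprocess.run(cmd, check=True)`.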

Pretraining

Download data

# outside the container
bash scripts/download_coco_vg.sh $PATH_TO_STORAGE

Pretraining

# inside the container
horovodrun -np 8 python src/pretrain/run_pretrain.py \
    --config src/configs/pretrain_indomain_base_resnet50_mlm_itm.json \
    --output_dir $OUTPUT_DIR

Data Preprocessing

ClipBERT takes raw video and text as inputs, so there is no need to do feature extraction. However, to improve data loading speed, we use LMDB to store the raw image and video files. You can use the following script to convert a list of videos with file extensions mp4 and avi into LMDB:

# outside the container
python src/preprocessing/file2lmdb.py \
    --data_root /path/to/videos \
    --lmdb_save_dir /path/to/save/lmdb \
    --ext avi mp4 \
    --file_type video

For images, use appropriate file extensions for --ext and set --file_type image. Text annotation files are reorganized into jsonl files; see the example preprocessed files downloaded by the scripts in scripts/.
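
For readers unfamiliar with jsonl, each line of such a file is one standalone JSON object. The sketch below writes and reads files in that format; the helper names and the field names in the usage note are hypothetical, so check the downloaded example files for the fields ClipBERT actually expects.

```python
import json


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a jsonl file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `write_jsonl([{"video_id": "video0", "text": "a man is talking"}], "train.jsonl")` produces a one-line file that `read_jsonl("train.jsonl")` loads back as a list with one dict.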

Citation

If you find this code useful for your research, please consider citing:

@article{lei2021less,
  title={Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling},
  author={Lei, Jie and Li, Linjie and Zhou, Luowei and Gan, Zhe and Berg, Tamara L. and Bansal, Mohit and Liu, Jingjing},
  journal={arXiv},
  year={2021}
}

Acknowledgement

We thank Yen-Chun Chen and Ruotian Luo for suggestions on the implementation. We also thank other members and interns at Microsoft Multimodal AI for their helpful discussions.

This code uses resources from transformers, UNITER, HERO, grid-feats-vqa, SlowFast, and Detectron2. The code is implemented using PyTorch, with multi-GPU support from Horovod and mixed precision support from apex. We thank the authors for open-sourcing their awesome projects.

License

MIT

Comments

error: can't start new thread

During training, I frequently encounter error: can't start new thread, which occurs after <stderr>:BlockingIOError: [Errno 11] Resource temporarily unavailable. I also notice that CPU usage is incredibly high during training.

I am currently following what zoe did in #32 and setting n_workers to 0. However, this drastically increases the training time. Is there any workaround for this problem?

Here is a more complete error output:

[1,3]<stderr>:Traceback (most recent call last):
[1,3]<stderr>:  File "src/tasks/run_video_retrieval.py", line 829, in <module>
[1,3]<stderr>:  File "src/tasks/run_video_retrieval.py", line 509, in start_training
[1,3]<stderr>:    model_saver.save(step=global_step, model=model)
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
[1,3]<stderr>:    return func(*args, **kwargs)
[1,3]<stderr>:  File "src/tasks/run_video_retrieval.py", line 238, in validate
[1,3]<stderr>:    for val_step, batch in enumerate(val_loader):
[1,3]<stderr>:  File "/clipbert/src/datasets/dataloader.py", line 97, in __iter__
[1,3]<stderr>:    loader_it = iter(self.loader)
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 279, in __iter__
[1,3]<stderr>:    return _MultiProcessingDataLoaderIter(self)
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 719, in __init__
[1,3]<stderr>:    w.start()
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 105, in start
[1,3]<stderr>:    self._popen = self._Popen(self)
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
[1,3]<stderr>:    return _default_context.get_context().Process._Popen(process_obj)
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
[1,3]<stderr>:    return Popen(process_obj)
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
[1,3]<stderr>:    self._launch(process_obj)
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 66, in _launch
[1,3]<stderr>:    self.pid = os.fork()
[1,3]<stderr>:BlockingIOError: [Errno 11] Resource temporarily unavailable
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>:  File "src/tasks/run_video_retrieval.py", line 829, in <module>
[1,1]<stderr>:  File "src/tasks/run_video_retrieval.py", line 509, in start_training
[1,1]<stderr>:    model_saver.save(step=global_step, model=model)
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
[1,1]<stderr>:    return func(*args, **kwargs)
[1,1]<stderr>:  File "src/tasks/run_video_retrieval.py", line 238, in validate
[1,1]<stderr>:    for val_step, batch in enumerate(val_loader):
[1,1]<stderr>:  File "/clipbert/src/datasets/dataloader.py", line 97, in __iter__
[1,1]<stderr>:    loader_it = iter(self.loader)
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 279, in __iter__
[1,1]<stderr>:    return _MultiProcessingDataLoaderIter(self)
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 733, in __init__
[1,1]<stderr>:    pin_memory_thread.start()
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/threading.py", line 846, in start
[1,1]<stderr>:    _start_new_thread(self._bootstrap, ())
[1,1]<stderr>:RuntimeError: can't start new thread
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[32362,1],1]
  Exit code:    1

opened by Tangolin 5

When running the MSRVTT retrieval finetuning task, an error occurred: "Failed Resource Temporarily Unavailable".

We found that the dataloader constantly creates worker threads, but the threads cannot exit normally. When the number of threads exceeds the upper limit, run_video_retrieval.py exits unexpectedly. We use your docker image to run the program. Have you ever had this problem before? Thanks!

opened by MrZihan 5
Fine-tuning ClipBERT on custom datasets

Hi, thank you for sharing this interesting work!

I would like to try fine-tuning ClipBERT on other video-and-language datasets, such as YouCook2. My target downstream task is sentence-level cross-modal retrieval, rather than paragraph-level.

Do you have any recommendations to train ClipBERT on custom datasets? In particular, I am curious about how to decide hyper-parameters described in config files for other datasets. Thank you.

opened by misogil0116 4
Extracting frame level visual features

Hi,

Thanks for making your code and pretrained models available publicly. I was wondering if you have suggestions regarding extracting frame level features using your models.

Thanks!

opened by srikanth-sfu 4

CUDA error: no kernel image is available for execution on the device

I followed the set up instructions pretty much step by step and bumped into this error:

12/30/2021 06:50:47 - INFO - __main__ -     Total #steps = 175250
12/30/2021 06:50:47 - INFO - __main__ -     Validate every 1800 steps, in total 98 times
Traceback (most recent call last):
  File "src/tasks/run_video_retrieval.py", line 833, in <module>
    start_training(input_cfg)
  File "src/tasks/run_video_retrieval.py", line 385, in start_training
    optimizer.step()
  File "/opt/conda/lib/python3.6/site-packages/apex/amp/_process_optimizer.py", line 359, in new_step
    self._master_params_to_model_params()
  File "/opt/conda/lib/python3.6/site-packages/apex/amp/_process_optimizer.py", line 22, in _master_params_to_model_params
    1.0)
  File "/opt/conda/lib/python3.6/site-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 30, in __call__
    *args)
RuntimeError: CUDA error: no kernel image is available for execution on the device

Might you have any experience with this error? If not I will redirect it to the APEX repo instead. Thanks in advance!

opened by Tangolin 3

How was the T set in the default setting?

Section 4.2, Analysis of Sparse Sampling, reads: "If not otherwise stated, we randomly sample a single frame (Ntrain=1 and T=1) from full-length videos for training, and use the middle frame (Ntest=1) for inference, with input image size L=448." I am confused: if not otherwise stated in the following analysis, is T for training equal to T for testing, or is T for testing always 1? I ask because I noticed that there is no T_train or T_test.

opened by JianJuly 3
Pre-training speed is slow.

Thank you for releasing the code. We tried to reproduce the pre-training using 8 V100 GPUs with the same parameter settings as in the paper (batch_size=32, num_workers=4), but training is always very slow: GPU utilization fluctuates between 0 and 100% and is 0 most of the time, while CPU usage is about 30%. We need about 350 hours to train for 40 epochs, which is about 4 times the amount mentioned in the paper. We suspected that the dataloader is the bottleneck, but the training speed did not improve when we used a larger num_workers such as 8. On the other hand, when we use only one GPU for training, GPU utilization always reaches 100%, and the total time is only about 370 hours. So we would like to ask whether there is anything wrong with our pre-training setup and what the possible reason is. Thank you, and we look forward to your reply.

opened by wangtianbaowtb 3
Question on the for loop in forward pass

Hi, Jie. Thank you for doing this excellent work and publishing the code. I have one question regarding the fine-tuning for downstream tasks. I noticed that N clips sampled from each video are forwarded individually using a for loop: https://github.com/jayleicn/ClipBERT/blob/main/src/tasks/run_video_qa.py#L250

May I ask what is the purpose of forwarding these clips separately instead of grouping them as batch_size * n_clips?

opened by Chuhanxx 3
Disk full when fine-tuning Image Question Answering

Thank you for your work! I encountered a problem when running VQA fine-tuning with:

horovodrun -np 1 python src/tasks/run_vqa.py \
    --config src/configs/vqa_base_resnet50.json \
    --output_dir $OUTPUT_DIR

The output message is as follows:

root@a2d64a8b9de3:/clipbert# horovodrun -np 1 python src/tasks/run_vqa.py --config src/configs/vqa_base_resnet50.json --output_dir ./output
[1,0]<stderr>:04/18/2021 11:07:07 - INFO - __main__ - device: cuda:0 n_gpu: 1, rank: 0, 16-bits training: True
[1,0]<stderr>:04/18/2021 11:07:07 - INFO - __main__ - Setup model...
[1,0]<stderr>:04/18/2021 11:07:07 - INFO - __main__ - setup e2e model
[1,0]<stdout>:cnn_cls <class 'src.modeling.grid_feat.GridFeatBackbone'>
[1,0]<stderr>:04/18/2021 11:07:10 - INFO - __main__ - Loading e2e weights from /pretrain/clipbert_image_text_pretrained.pt
[1,0]<stderr>:04/18/2021 11:07:34 - INFO - __main__ - You can ignore the keys with `num_batches_tracked` or from task heads
[1,0]:04/18/2021 11:07:34 - INFO - __main__ - Keys in loaded but not in model:
[1,0]:04/18/2021 11:07:34 - INFO - __main__ - In total 9, ['transformer.cls.predictions.bias', 'transformer.cls.predictions.decoder.bias', 'transformer.cls.predictions.decoder.weight', 'transformer.cls.predictions.transform.LayerNorm.bias', 'transformer.cls.predictions.transform.LayerNorm.weight', 'transformer.cls.predictions.transform.dense.bias', 'transformer.cls.predictions.transform.dense.weight', 'transformer.cls.seq_relationship.bias', 'transformer.cls.seq_relationship.weight']
[1,0]:04/18/2021 11:07:34 - INFO - __main__ - Keys in model but not in loaded:
[1,0]:04/18/2021 11:07:34 - INFO - __main__ - In total 4, ['transformer.classifier.0.bias', 'transformer.classifier.0.weight', 'transformer.classifier.2.bias', 'transformer.classifier.2.weight']
[1,0]:04/18/2021 11:07:34 - INFO - __main__ - Keys in model and loaded, but shape mismatched:
[1,0]:04/18/2021 11:07:34 - INFO - __main__ - In total 0, []
[1,0]:04/18/2021 11:07:37 - INFO - __main__ - Setup model done!
[1,0]:Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.
[1,0]:
[1,0]:Defaults for this optimization level are:
[1,0]:enabled : True
[1,0]:opt_level : O2
[1,0]:cast_model_type : torch.float16
[1,0]:patch_torch_functions : False
[1,0]:keep_batchnorm_fp32 : True
[1,0]:master_weights : True
[1,0]:loss_scale : dynamic
[1,0]:Processing user overrides (additional kwargs that are not None)...
[1,0]:After processing overrides, optimization options are:
[1,0]:enabled : True
[1,0]:opt_level : O2
[1,0]:cast_model_type : torch.float16
[1,0]:patch_torch_functions : False
[1,0]:keep_batchnorm_fp32 : True
[1,0]:master_weights : True
[1,0]:loss_scale : dynamic
[1,0]:/pytorch/torch/csrc/utils/python_arg_parser.cpp:756: UserWarning: This overload of add_ is deprecated:
[1,0]:  add_(Number alpha, Tensor other)
[1,0]:Consider using one of the following signatures instead:
[1,0]:  add_(Tensor other, *, Number alpha)
[1,0]:04/18/2021 11:07:41 - INFO - transformers.tokenization_utils - Model name '/pretrain/bert-base-uncased/' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1, wietsedv/bert-base-dutch-cased). Assuming '/pretrain/bert-base-uncased/' is a path, a model identifier, or url to a directory containing tokenizer files.
[1,0]:04/18/2021 11:07:41 - INFO - transformers.tokenization_utils - Didn't find file /pretrain/bert-base-uncased/added_tokens.json. We won't load it.
[1,0]:04/18/2021 11:07:41 - INFO - transformers.tokenization_utils - loading file /pretrain/bert-base-uncased/vocab.txt
[1,0]:04/18/2021 11:07:41 - INFO - transformers.tokenization_utils - loading file None
[1,0]:04/18/2021 11:07:41 - INFO - transformers.tokenization_utils - loading file /pretrain/bert-base-uncased/special_tokens_map.json
[1,0]:04/18/2021 11:07:41 - INFO - transformers.tokenization_utils - loading file /pretrain/bert-base-uncased/tokenizer_config.json
[1,0]:04/18/2021 11:07:41 - INFO - __main__ - Init. train_loader and val_loader...
[1,0]:Using example_unique_key question_id to check whether input and output ids m
[1,0]:04/18/2021 11:07:53 - INFO - __main__ - is_train True, dataset size 587314 groups, each group 2
[1,0]:Using example_unique_key question_id to check whether input and output ids m
[1,0]:04/18/2021 11:07:54 - INFO - __main__ - is_train False, dataset size 26280 groups, each group 1
[1,0]:04/18/2021 11:07:54 - INFO - __main__ - Saving training meta...
[1,0]:04/18/2021 11:07:54 - INFO - __main__ - Saving code from /clipbert to ./output/code.zip...
[1,0]:Traceback (most recent call last):
[1,0]:  File "/opt/conda/lib/python3.6/zipfile.py", line 1646, in write
[1,0]:    shutil.copyfileobj(src, dest, 1024*8)
[1,0]:  File "/opt/conda/lib/python3.6/shutil.py", line 82, in copyfileobj
[1,0]:    fdst.write(buf)
[1,0]:  File "/opt/conda/lib/python3.6/zipfile.py", line 1015, in write
[1,0]:    self._fileobj.write(data)
[1,0]:OSError: [Errno 28] No space left on device
[1,0]:
[1,0]:During handling of the above exception, another exception occurred:
[1,0]:
[1,0]:Traceback (most recent call last):
[1,0]:  File "/clipbert/src/utils/basic_utils.py", line 122, in make_zipfile
[1,0]:    zf.write(absname, arcname)
[1,0]:  File "/opt/conda/lib/python3.6/zipfile.py", line 1646, in write
[1,0]:    shutil.copyfileobj(src, dest, 1024*8)
[1,0]:  File "/opt/conda/lib/python3.6/zipfile.py", line 1043, in close
[1,0]:    raise RuntimeError('File size unexpectedly exceeded ZIP64 '
[1,0]:RuntimeError: File size unexpectedly exceeded ZIP64 limit
[1,0]:
[1,0]:During handling of the above exception, another exception occurred:
[1,0]:
[1,0]:Traceback (most recent call last):
[1,0]:  File "src/tasks/run_vqa.py", line 568, in <module>
[1,0]:    start_training(input_cfg)
[1,0]:  File "src/tasks/run_vqa.py", line 314, in start_training
[1,0]:    save_training_meta(cfg)
[1,0]:  File "/clipbert/src/utils/load_save.py", line 39, in save_training_meta
[1,0]:    exclude_extensions=[".pyc", ".ipynb", ".swap"])
[1,0]:  File "/clipbert/src/utils/basic_utils.py", line 122, in make_zipfile
[1,0]:    zf.write(absname, arcname)
[1,0]:  File "/opt/conda/lib/python3.6/zipfile.py", line 1174, in __exit__
[1,0]:    self.close()
[1,0]:  File "/opt/conda/lib/python3.6/zipfile.py", line 1695, in close
[1,0]:    raise ValueError("Can't close the ZIP file while there is "
[1,0]:ValueError: Can't close the ZIP file while there is an open writing handle on it. Close the writing handle before closing the zip.

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

  Process name: [[48786,1],0]
  Exit code:    1

I found that my disk is full after running it:

/dev/nvme0n1p10 83G 83G 0 100% /

Is this normal? How can I solve this problem?

opened by junyi-tiger 3
Environment Setup

Dear all,

I try to start my temporal grounding project based on ClipBERT. Currently, I am struggling with the environment setup for ClipBERT. I have tried to build a docker image and also create a virtual environment. However, I failed in both ways.

Could you please share the docker image? It would be really helpful and I would really appreciate it.

Waiting for your reply. Thank you very much.

Best regards, Yimeng

opened by damon-demon 2
Problems with vqa config

Hi, I found in the vqa config, there are two lines for txt files of the train dataset, for coco and vg separately, but only one line for txt file for the image file (coco). It seems there's a mismatch between image and text for vg dataset. https://github.com/jayleicn/ClipBERT/blob/7adfe795c6056190885c14ec0c3cb8f12b50238a/src/configs/vqa_base_resnet50.json#L7

opened by Steve-Tod 2
Problem with import statement of transformer

This import statement raises an error: ImportError: cannot import name 'swish' from 'transformers.activations' Why does this happen? Is it related to the version of transformers you use? How should we make the code runnable?

opened by JisenRen 0
409 status code when downloading pretrained_model

Running bash scripts/download_pretrained.sh $PATH_STORAGE returns a 409 status code:

--2022-11-11 00:42:49-- https://convaisharables.blob.core.windows.net/clipbert/pretrained/clipbert_image_text_pretrained.pt
Resolving convaisharables.blob.core.windows.net (convaisharables.blob.core.windows.net)... 20.60.20.68
Connecting to convaisharables.blob.core.windows.net (convaisharables.blob.core.windows.net)|20.60.20.68|:443... connected.
HTTP request sent, awaiting response... 409 Public access is not permitted on this storage account.
2022-11-11 00:42:50 ERROR 409: Public access is not permitted on this storage account..
--2022-11-11 00:42:50-- https://convaisharables.blob.core.windows.net/clipbert/pretrained/bert-base-uncased.tar
Resolving convaisharables.blob.core.windows.net (convaisharables.blob.core.windows.net)... 20.60.20.68
Connecting to convaisharables.blob.core.windows.net (convaisharables.blob.core.windows.net)|20.60.20.68|:443... connected.
HTTP request sent, awaiting response... 409 Public access is not permitted on this storage account.
2022-11-11 00:42:51 ERROR 409: Public access is not permitted on this storage account..
tar: all_data/pretrained/bert-base-uncased.tar: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now
rm: cannot remove 'all_data/pretrained/bert-base-uncased.tar': No such file or directory
--2022-11-11 00:42:51-- https://convaisharables.blob.core.windows.net/clipbert/pretrained/grid_feat_R-50.pth
Resolving convaisharables.blob.core.windows.net (convaisharables.blob.core.windows.net)... 20.60.20.68
Connecting to convaisharables.blob.core.windows.net (convaisharables.blob.core.windows.net)|20.60.20.68|:443... connected.
HTTP request sent, awaiting response... 409 Public access is not permitted on this storage account.
2022-11-11 00:42:51 ERROR 409: Public access is not permitted on this storage account..

Does it mean I need to re-download all the pretrained model?

opened by svetlana-work 2

TypeError: __init__() got an unexpected keyword argument 'train_on_pred_boxes'

This happens while running run_video_qa.py with tgif_qa_action. I'm unable to figure out whether something is wrong with the dataset or with Detectron2. I've installed the latest version.

08/18/2022 22:46:34 - INFO - __main__ -   device: cuda:0 n_gpu: 1, rank: 0, 16-bits training: True
08/18/2022 22:46:34 - INFO - __main__ -   Setup model...
08/18/2022 22:46:34 - INFO - __main__ -   setup e2e model
Traceback (most recent call last):
  File "/home/rishihazra/PycharmProjects/VisionLangaugeGrounding/baselines/ClipBERT/src/tasks/run_video_qa.py", line 722, in <module>
    start_training(input_cfg)
  File "/home/rishihazra/PycharmProjects/VisionLangaugeGrounding/baselines/ClipBERT/src/tasks/run_video_qa.py", line 385, in start_training
    model = setup_model(cfg, device=device)
  File "/home/rishihazra/PycharmProjects/VisionLangaugeGrounding/baselines/ClipBERT/src/tasks/run_video_qa.py", line 193, in setup_model
    transformer_cls=transformer_model_cls)
  File "/home/rishihazra/PycharmProjects/VisionLangaugeGrounding/baselines/ClipBERT/src/modeling/e2e_model.py", line 25, in __init__
    config=config, input_format=input_format)
  File "/home/rishihazra/PycharmProjects/VisionLangaugeGrounding/baselines/ClipBERT/src/modeling/grid_feat.py", line 42, in __init__
    self.feature = build_model(self.detectron2_cfg)
  File "/home/rishihazra/detectron2/detectron2/modeling/meta_arch/build.py", line 22, in build_model
    model = META_ARCH_REGISTRY.get(meta_arch)(cfg)
  File "/home/rishihazra/detectron2/detectron2/config/config.py", line 189, in wrapped
    explicit_args = _get_args_from_config(from_config_func, *args, **kwargs)
  File "/home/rishihazra/detectron2/detectron2/config/config.py", line 245, in _get_args_from_config
    ret = from_config_func(*args, **kwargs)
  File "/home/rishihazra/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 77, in from_config
    "roi_heads": build_roi_heads(cfg, backbone.output_shape()),
  File "/home/rishihazra/detectron2/detectron2/modeling/roi_heads/roi_heads.py", line 43, in build_roi_heads
    return ROI_HEADS_REGISTRY.get(name)(cfg, input_shape)
  File "/home/rishihazra/PycharmProjects/VisionLangaugeGrounding/baselines/ClipBERT/src/modeling/grid_feats/roi_heads.py", line 175, in __init__
    super(StandardROIHeads, self).__init__(cfg, input_shape)
  File "/home/rishihazra/detectron2/detectron2/config/config.py", line 190, in wrapped
    init_func(self, **explicit_args)
TypeError: __init__() got an unexpected keyword argument 'train_on_pred_boxes'

Process finished with exit code 1

opened by RishiHazra 0

Purpose of having both E2E_TrainingRestorer and ModelSaver

I noticed that both classes are used, but they seem to perform similar functions of saving the model and optimizer state dict. May I know the reason for using both classes?

opened by Tangolin 0
Got 'Resource temporarily unavailable' using docker

Hi, I always get the error 'runtime/cgo: pthread_create failed: Resource temporarily unavailable' when using docker. The docker process also cannot stop itself, so I need to use sudo to kill it, which is very inconvenient. What's more, I found that saving the code and backup checkpoints needs very large memory space (~GB), which may cause the above error. Any suggestions for this error? Thanks a lot!

opened by Zoe-Ziyi 2

