Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks

Overview

ClipBERT

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling

Jie Lei*, Linjie Li*, Luowei Zhou, Zhe Gan, Tamara L. Berg, Mohit Bansal, Jingjing Liu

Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks. It takes raw videos/images + text as inputs, and outputs task predictions. ClipBERT is built on 2D CNNs and transformers, and uses a sparse sampling strategy to enable efficient end-to-end video-and-language learning. In this repository, we support end-to-end pretraining and finetuning for the following tasks:

  • Image-text pretraining on COCO and VG captions.
  • Text-to-video retrieval finetuning on MSRVTT, DiDeMo, and ActivityNet Captions.
  • Video-QA finetuning on TGIF-QA and MSRVTT-QA.
  • Image-QA finetuning on VQA 2.0.

It is also straightforward to add other image-text or video-text tasks for pretraining and finetuning.

Requirements

We provide a Docker image for easier reproduction. Please install Docker (with NVIDIA GPU support for containers, e.g., the NVIDIA Container Toolkit) before proceeding.

Our scripts require the user to be in the docker group so that docker commands can be run without sudo. We only support Linux with NVIDIA GPUs. We test on Ubuntu 18.04 and V100 cards. We use mixed-precision training, hence GPUs with Tensor Cores are recommended. A quick sanity check of this setup is sketched below.
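A minimal way to verify the requirements above (these commands are only a suggested sanity check, not part of the repository's scripts):

# should print your groups, including "docker", so docker runs without sudo
groups
# should list your NVIDIA GPUs and driver version
nvidia-smi
# should succeed without sudo if the docker group membership is effective
docker info > /dev/null && echo "docker OK"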

Getting Started

General

  1. Create a folder that stores pretrained models, all the data, and results.

    PATH_TO_STORAGE=/path/to/your/data/
    mkdir -p $PATH_TO_STORAGE/txt_db  # annotations
    mkdir -p $PATH_TO_STORAGE/vis_db  # image and video 
    mkdir -p $PATH_TO_STORAGE/finetune  # finetuning results
    mkdir -p $PATH_TO_STORAGE/pretrained  # pretrained models
  2. Download pretrained models.

    Our e2e pretrained ClipBERT model (849MB) can be downloaded with the following command.

    bash scripts/download_pretrained.sh $PATH_TO_STORAGE

    This pretrained model can be used for finetuning on video-text tasks and image-text tasks. For your convenience, this script will also download bert-base-uncased and grid-feat-vqa model weights, which are used as initialization for pretraining.

  3. Launch the Docker container for running the experiments.

    # docker image should be automatically pulled
    source launch_container.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/img_db \
        $PATH_TO_STORAGE/finetune $PATH_TO_STORAGE/pretrained

    The launch script respects the $CUDA_VISIBLE_DEVICES environment variable. Note that the source code is mounted into the container under /clipbert instead of built into the image, so that user modifications are reflected without re-building the image. (Data folders are mounted into the container separately for flexibility on folder structures.)
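    For example, to restrict the experiments to the first two GPUs, you can export $CUDA_VISIBLE_DEVICES before launching the container (the device ids below are only illustrative):

    # outside the container
    export CUDA_VISIBLE_DEVICES=0,1
    source launch_container.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/img_db \
        $PATH_TO_STORAGE/finetune $PATH_TO_STORAGE/pretrained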

Downstream Task Finetuning

Text-to-Video Retrieval

Tasks: MSRVTT retrieval, DiDeMo and ActivityNet Captions paragraph-to-video retrieval, MSRVTT MC Test.

  1. Download data.

    # outside the container  
    # download videos + annotations for $DSET
    bash scripts/download_$DSET.sh $PATH_TO_STORAGE

    $DSET can be one of msrvtt, didemo, anet.

  2. Finetuning.

    # inside the container
    horovodrun -np 4 python src/tasks/run_video_retrieval.py \
        --config $CONFIG_PATH \
        --output_dir $OUTPUT_DIR
    
    # for single GPU
    python src/tasks/run_video_retrieval.py \
        --config $CONFIG_PATH \
        --output_dir $OUTPUT_DIR

    $CONFIG_PATH should be set to one of the .json config files available at src/configs whose names contain _ret. For example, you can use src/configs/msrvtt_ret_base_resnet50.json for MSRVTT retrieval, as in the example below.
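    As a concrete illustration, you could set the variables as follows (the output directory below is only a placeholder; point it to wherever you want checkpoints and logs saved):

    # inside the container
    CONFIG_PATH=src/configs/msrvtt_ret_base_resnet50.json
    OUTPUT_DIR=/path/to/save/msrvtt_retrieval_output  # placeholder
    horovodrun -np 4 python src/tasks/run_video_retrieval.py \
        --config $CONFIG_PATH \
        --output_dir $OUTPUT_DIR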

  3. Run inference.

    # inside the container
    horovodrun -np 4 python src/tasks/run_video_retrieval.py \
      --do_inference 1 --output_dir $OUTPUT_DIR \
      --inference_split val --inference_model_step $STEP \
      --inference_txt_db $TXT_DB \
      --inference_img_db $IMG_DB --inference_batch_size 64 \
      --inference_n_clips $INFERENCE_N_CLIPS

    $STEP is an integer that tells the script to use the checkpoint $OUTPUT_DIR/ckpt/model_step_$STEP.pt for inference. $TXT_DB and $IMG_DB are paths to the annotation file and the video data. You can use TXT_DB=/txt/downstream/msrvtt_retrieval/msrvtt_retrieval_val.jsonl and IMG_DB=/img/msrvtt for inference on the MSRVTT retrieval val split; a full example command is given below. The results will be written under $OUTPUT_DIR. You can use different $INFERENCE_N_CLIPS values for inference, such as 1 or 16. Using more clips has a large impact on inference speed and memory usage, so you may want to use a smaller batch size for larger values.
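    For example, evaluating a checkpoint saved at step 4000 (the step value is only an illustration) on the MSRVTT retrieval val split would look like:

    # inside the container
    horovodrun -np 4 python src/tasks/run_video_retrieval.py \
      --do_inference 1 --output_dir $OUTPUT_DIR \
      --inference_split val --inference_model_step 4000 \
      --inference_txt_db /txt/downstream/msrvtt_retrieval/msrvtt_retrieval_val.jsonl \
      --inference_img_db /img/msrvtt --inference_batch_size 64 \
      --inference_n_clips 16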

    After the MSRVTT retrieval model is trained, you can also use it for inference on the MSRVTT MC Test task, which is essentially a retrieval task in a multiple-choice setup.

    # inside the container
    horovodrun -np 4 python src/tasks/run_msrvtt_mc.py \
      --do_inference 1 --output_dir $OUTPUT_DIR \
      --inference_split val --inference_model_step $STEP \
      --inference_txt_db /txt/downstream/msrvtt_retrieval_mc/msrvtt_retrieval_mc_test.jsonl \
      --inference_img_db /img/msrvtt --inference_batch_size 64 \
      --inference_n_clips $INFERENCE_N_CLIPS

Video Question Answering

Tasks: TGIF-QA action, transition, and frameQA tasks; MSRVTT-QA.

  1. Download data.

    # outside the container  
    # download MSRVTT videos, and QA + retrieval annotations
    bash scripts/download_msrvtt.sh $PATH_TO_STORAGE  
    # download TGIF-QA videos and annotations
    bash scripts/download_tgif_qa.sh $PATH_TO_STORAGE  
  2. Finetuning.

    # inside the container
    horovodrun -np 4 python src/tasks/run_video_qa.py \
        --config $CONFIG_PATH \
        --output_dir $OUTPUT_DIR

    $CONFIG_PATH should be set to one of the .json config files available at src/configs whose names contain the substring _qa. For example, you can use src/configs/msrvtt_qa_base_resnet50.json for MSRVTT-QA.

  3. Run inference.

    # inside the container
    horovodrun -np 4 python src/tasks/run_video_qa.py \
      --do_inference 1 --output_dir $OUTPUT_DIR \
      --inference_split val --inference_model_step $STEP \
      --inference_txt_db $TXT_DB \
      --inference_img_db $IMG_DB --inference_batch_size 64 \
      --inference_n_clips $INFERENCE_N_CLIPS

    $STEP is an integer that tells the script to use the checkpoint $OUTPUT_DIR/ckpt/model_step_$STEP.pt for inference. $TXT_DB and $IMG_DB are paths to the annotation file and the video data. You can use TXT_DB=/txt/downstream/msrvtt_retrieval/msrvtt_qa_val.jsonl and IMG_DB=/img/msrvtt for inference on the MSRVTT-QA val split.

    The results will be written under $OUTPUT_DIR. You can use different $INFERENCE_N_CLIPS values for inference, such as 1 or 16. Using more clips has a large impact on inference speed and memory usage, so you may want to use a smaller batch size for larger values.

Image Question Answering (VQA)

  1. Download data

    # outside the container
    # download COCO and VG data
    bash scripts/download_coco_vg.sh $PATH_TO_STORAGE
    # download VQA annotations
    bash scripts/download_vqa.sh $PATH_TO_STORAGE
  2. Finetuning

    # inside the container
    horovodrun -np 4 python src/tasks/run_vqa.py \
        --config src/configs/vqa_base_resnet50.json \
        --output_dir $OUTPUT_DIR
  3. Inference

    # inside the container
    horovodrun -np 4 python src/tasks/run_vqa.py \
      --do_inference 1 --output_dir $OUTPUT_DIR \
      --inference_split val --inference_model_step $STEP \
      --inference_txt_db $TXT_DB \
      --inference_img_db $IMG_DB \
      --inference_batch_size 64

Pretraining

  1. Download data

    # outside the container
    bash scripts/download_coco_vg.sh $PATH_TO_STORAGE
  2. Pretraining

    # inside the container
    horovodrun -np 8 python src/pretrain/run_pretrain.py \
        --config src/configs/pretrain_indomain_base_resnet50_mlm_itm.json \
        --output_dir $OUTPUT_DIR 

Data Preprocessing

ClipBERT takes raw video and text as inputs, so there is no need for offline feature extraction. However, to improve data loading speed, we use LMDB to store the raw image and video files. You can use the following script to convert a list of videos with file extensions mp4 and avi into LMDB:

# outside the container
python src/preprocessing/file2lmdb.py \
    --data_root /path/to/videos \
    --lmdb_save_dir /path/to/save/lmdb \
    --ext avi mp4 \
    --file_type video 

For images, pass the appropriate file extensions to --ext and use --file_type image, as in the sketch below. Text annotation files are reorganized into jsonl files; see the example preprocessed files downloaded by the scripts in scripts/.
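A hypothetical invocation for a folder of images (the jpg/png extensions below are only an illustration) would be:

# outside the container
python src/preprocessing/file2lmdb.py \
    --data_root /path/to/images \
    --lmdb_save_dir /path/to/save/lmdb \
    --ext jpg png \
    --file_type image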

Citation

If you find this code useful for your research, please consider citing:

@article{lei2021less,
  title={Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling},
  author={Lei, Jie and Li, Linjie and Zhou, Luowei and Gan, Zhe and Berg, Tamara L. and Bansal, Mohit and Liu, Jingjing},
  journal={arXiv},
  year={2021}
}

Acknowledgement

We thank Yen-Chun Chen and Ruotian Luo for suggestions on the implementation. We also thank other members and interns at Microsoft Multimodal AI for their helpful discussions.

This code used resources from transformers, UNITER, HERO, grid-feats-vqa, SlowFast, Detectron2. The code is implemented using PyTorch, with multi-GPU support from Horovod and mixed precision support from apex. We thank the authors for open-sourcing their awesome projects.

License

MIT

Comments
  • error: can't start new thread

    During the training of the model, I frequently encounter the error "can't start new thread", which occurs after <stderr>:BlockingIOError: [Errno 11] Resource temporarily unavailable. I also notice that the CPU usage is incredibly high during the training process.

    I am currently following what zoe did in #32, changing n_workers to 0; however, this drastically increases the training time. Is there any workaround for this problem?

    Here is a more complete error output:

    [1,3]<stderr>:Traceback (most recent call last):
    [1,3]<stderr>:  File "src/tasks/run_video_retrieval.py", line 829, in <module>
    [1,3]<stderr>:  File "src/tasks/run_video_retrieval.py", line 509, in start_training
    [1,3]<stderr>:    model_saver.save(step=global_step, model=model)
    [1,3]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    [1,3]<stderr>:    return func(*args, **kwargs)
    [1,3]<stderr>:  File "src/tasks/run_video_retrieval.py", line 238, in validate
    [1,3]<stderr>:    for val_step, batch in enumerate(val_loader):
    [1,3]<stderr>:  File "/clipbert/src/datasets/dataloader.py", line 97, in __iter__
    [1,3]<stderr>:    loader_it = iter(self.loader)
    [1,3]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 279, in __iter__
    [1,3]<stderr>:    return _MultiProcessingDataLoaderIter(self)
    [1,3]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 719, in __init__
    [1,3]<stderr>:    w.start()
    [1,3]<stderr>:  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 105, in start
    [1,3]<stderr>:    self._popen = self._Popen(self)
    [1,3]<stderr>:  File "/opt/conda/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
    [1,3]<stderr>:    return _default_context.get_context().Process._Popen(process_obj)
    [1,3]<stderr>:  File "/opt/conda/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
    [1,3]<stderr>:    return Popen(process_obj)
    [1,3]<stderr>:  File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    [1,3]<stderr>:    self._launch(process_obj)
    [1,3]<stderr>:  File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 66, in _launch
    [1,3]<stderr>:    self.pid = os.fork()
    [1,3]<stderr>:BlockingIOError: [Errno 11] Resource temporarily unavailable
    [1,1]<stderr>:Traceback (most recent call last):
    [1,1]<stderr>:  File "src/tasks/run_video_retrieval.py", line 829, in <module>
    [1,1]<stderr>:  File "src/tasks/run_video_retrieval.py", line 509, in start_training
    [1,1]<stderr>:    model_saver.save(step=global_step, model=model)
    [1,1]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    [1,1]<stderr>:    return func(*args, **kwargs)
    [1,1]<stderr>:  File "src/tasks/run_video_retrieval.py", line 238, in validate
    [1,1]<stderr>:    for val_step, batch in enumerate(val_loader):
    [1,1]<stderr>:  File "/clipbert/src/datasets/dataloader.py", line 97, in __iter__
    [1,1]<stderr>:    loader_it = iter(self.loader)
    [1,1]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 279, in __iter__
    [1,1]<stderr>:    return _MultiProcessingDataLoaderIter(self)
    [1,1]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 733, in __init__
    [1,1]<stderr>:    pin_memory_thread.start()
    [1,1]<stderr>:  File "/opt/conda/lib/python3.6/threading.py", line 846, in start
    [1,1]<stderr>:    _start_new_thread(self._bootstrap, ())
    [1,1]<stderr>:RuntimeError: can't start new thread
    Primary job  terminated normally, but 1 process returned
    a non-zero exit code. Per user-direction, the job has been aborted.
    mpirun detected that one or more processes exited with non-zero status, thus causing
    the job to be terminated. The first process to do so was:
    
      Process name: [[32362,1],1]
      Exit code:    1
    
    opened by Tangolin 5
  • When running the MSRVTT retrieval finetuning task, an error occurred

    When running the MSRVTT retrieval finetuning task, an error occurred: "Failed Resource Temporarily Unavailable".

    We found the reason is that the dataloader constantly creates worker threads, but the threads cannot exit normally. So when the number of threads exceeds the upper limit, run_video_retrieval.py exits unexpectedly. We used your Docker image to run the program; have you encountered this problem before? Thanks!

    opened by MrZihan 5
  • Fine-tuning ClipBERT on custom datasets

    Hi, thank you for sharing this interesting work!

    I would like to try fine-tuning ClipBERT on other video-and-language datasets, such as YouCook2. My target downstream task is sentence-level cross-modal retrieval, rather than paragraph-level.

    Do you have any recommendations for training ClipBERT on custom datasets? In particular, I am curious about how to choose the hyper-parameters in the config files for other datasets. Thank you.

    opened by misogil0116 4
  • Extracting frame level visual features

    Hi,

    Thanks for making your code and pretrained models available publicly. I was wondering if you have suggestions regarding extracting frame level features using your models.

    Thanks!

    opened by srikanth-sfu 4
  • CUDA error: no kernel image is available for execution on the device

    I followed the setup instructions pretty much step by step and bumped into this error:

    12/30/2021 06:50:47 - INFO - __main__ -     Total #steps = 175250
    12/30/2021 06:50:47 - INFO - __main__ -     Validate every 1800 steps, in total 98 times
    Traceback (most recent call last):
      File "src/tasks/run_video_retrieval.py", line 833, in <module>
        start_training(input_cfg)
      File "src/tasks/run_video_retrieval.py", line 385, in start_training
        optimizer.step()
      File "/opt/conda/lib/python3.6/site-packages/apex/amp/_process_optimizer.py", line 359, in new_step
        self._master_params_to_model_params()
      File "/opt/conda/lib/python3.6/site-packages/apex/amp/_process_optimizer.py", line 22, in _master_params_to_model_params
        1.0)
      File "/opt/conda/lib/python3.6/site-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 30, in __call__
        *args)
    RuntimeError: CUDA error: no kernel image is available for execution on the device
    

    Might you have any experience with this error? If not I will redirect it to the APEX repo instead. Thanks in advance!

    opened by Tangolin 3
  • How was the T set in the default setting?

    In Section 4.2, Analysis of Sparse Sampling, it reads: "If not otherwise stated, we randomly sample a single frame (Ntrain=1 and T=1) from full-length videos for training, and use the middle frame (Ntest=1) for inference, with input image size L=448." I am confused: if not otherwise stated in the following analysis, does T for training equal T for test, or does T for test always equal 1? I have noticed that there is no T_train or T_test.

    opened by JianJuly 3
  • Pre-training speed is slow.

    Thank you for releasing the code. We tried to reproduce the pre-training, using 8 V100s and the same parameter settings as in the paper (batch_size=32, num_workers=4), but the training speed is always very slow: the GPU utilization oscillates between 0 and 100%, and is 0 most of the time, while the CPU occupancy is about 30%. We need about 350 hours to train for 40 epochs, which is about 4 times the amount mentioned in the paper. We think the dataloader may be the bottleneck, but the training speed did not improve when we used a larger num_workers such as 8. On the other hand, when we use only one GPU for training, the GPU utilization can always reach 100%, and the total time is only about 370 hours. So we would like to ask whether there is anything wrong in our pre-training setup and what the possible reason is. Thank you, and looking forward to a reply.

    opened by wangtianbaowtb 3
  • Question on the for loop in forward pass

    Hi, Jie. Thank you for doing this excellent work and publishing the code. I have one question regarding the fine-tuning for downstream tasks. I noticed that N clips sampled from each video are forwarded individually using a for loop: https://github.com/jayleicn/ClipBERT/blob/main/src/tasks/run_video_qa.py#L250

    May I ask what is the purpose of forwarding these clips separately instead of grouping them as batch_size * n_clips?

    opened by Chuhanxx 3
  • Disk full when fine-tuning Image Question Answering

    Thank you for your work! I encountered a problem when running VQA fine-tuning with:
    horovodrun -np 1 python src/tasks/run_vqa.py \
        --config src/configs/vqa_base_resnet50.json \
        --output_dir $OUTPUT_DIR

    The output message is as follows (model setup, weight loading, apex O2 optimization settings, tokenizer loading, and train/val dataloader initialization all complete normally; the failure occurs while saving the code snapshot):

    [1,0]:04/18/2021 11:07:54 - INFO - main - Saving training meta...
    [1,0]:04/18/2021 11:07:54 - INFO - main - Saving code from /clipbert to ./output/code.zip...
    [1,0]:Traceback (most recent call last):
    [1,0]:  File "/opt/conda/lib/python3.6/zipfile.py", line 1646, in write
    [1,0]:    shutil.copyfileobj(src, dest, 1024*8)
    [1,0]:  File "/opt/conda/lib/python3.6/shutil.py", line 82, in copyfileobj
    [1,0]:    fdst.write(buf)
    [1,0]:  File "/opt/conda/lib/python3.6/zipfile.py", line 1015, in write
    [1,0]:    self._fileobj.write(data)
    [1,0]:OSError: [Errno 28] No space left on device
    [1,0]:
    [1,0]:During handling of the above exception, another exception occurred:
    [1,0]:
    [1,0]:Traceback (most recent call last):
    [1,0]:  File "/clipbert/src/utils/basic_utils.py", line 122, in make_zipfile
    [1,0]:    zf.write(absname, arcname)
    [1,0]:  File "/opt/conda/lib/python3.6/zipfile.py", line 1646, in write
    [1,0]:    shutil.copyfileobj(src, dest, 1024*8)
    [1,0]:  File "/opt/conda/lib/python3.6/zipfile.py", line 1043, in close
    [1,0]:    raise RuntimeError('File size unexpectedly exceeded ZIP64 '
    [1,0]:RuntimeError: File size unexpectedly exceeded ZIP64 limit
    [1,0]:
    [1,0]:During handling of the above exception, another exception occurred:
    [1,0]:
    [1,0]:Traceback (most recent call last):
    [1,0]:  File "src/tasks/run_vqa.py", line 568, in <module>
    [1,0]:    start_training(input_cfg)
    [1,0]:  File "src/tasks/run_vqa.py", line 314, in start_training
    [1,0]:    save_training_meta(cfg)
    [1,0]:  File "/clipbert/src/utils/load_save.py", line 39, in save_training_meta
    [1,0]:    exclude_extensions=[".pyc", ".ipynb", ".swap"])
    [1,0]:  File "/clipbert/src/utils/basic_utils.py", line 122, in make_zipfile
    [1,0]:    zf.write(absname, arcname)
    [1,0]:  File "/opt/conda/lib/python3.6/zipfile.py", line 1174, in __exit__
    [1,0]:    self.close()
    [1,0]:  File "/opt/conda/lib/python3.6/zipfile.py", line 1695, in close
    [1,0]:    raise ValueError("Can't close the ZIP file while there is "
    [1,0]:ValueError: Can't close the ZIP file while there is an open writing handle on it. Close the writing handle before closing the zip.

    Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


    mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

    Process name: [[48786,1],0] Exit code: 1

    I found my disk storage is full after running it (/dev/nvme0n1p10 83G 83G 0 100% /). Is this normal? How can I solve this problem?

    opened by junyi-tiger 3
  • Environment Setup

    Dear all,

    I am trying to start my temporal grounding project based on ClipBERT. Currently, I am struggling with the environment setup for ClipBERT. I have tried to build a Docker image and also to create a virtual environment, but failed in both ways.

    Could you please share the docker image? It would be really helpful and I would really appreciate it.

    Waiting for your reply. Thank you very much.

    Best regards, Yimeng

    opened by damon-demon 2
  • Problems with vqa config

    Hi, I found that in the vqa config there are two lines for the txt files of the train dataset, one for coco and one for vg, but only one line for the image files (coco). It seems there is a mismatch between image and text for the vg dataset. https://github.com/jayleicn/ClipBERT/blob/7adfe795c6056190885c14ec0c3cb8f12b50238a/src/configs/vqa_base_resnet50.json#L7

    opened by Steve-Tod 2
  • Problem with import statement of transformer

    This import statement raises an error: ImportError: cannot import name 'swish' from 'transformers.activations'. Why does this happen? Is it related to the version of transformers you use? How should we make the code runnable?

    opened by JisenRen 0
  • 409 status code when downloading pretrained_model

    bash scripts/download_pretrained.sh $PATH_STORAGE returns a 409 status code:

    --2022-11-11 00:42:49--  https://convaisharables.blob.core.windows.net/clipbert/pretrained/clipbert_image_text_pretrained.pt
    Resolving convaisharables.blob.core.windows.net... 20.60.20.68
    Connecting to convaisharables.blob.core.windows.net|20.60.20.68|:443... connected.
    HTTP request sent, awaiting response... 409 Public access is not permitted on this storage account.
    2022-11-11 00:42:50 ERROR 409: Public access is not permitted on this storage account..
    --2022-11-11 00:42:50--  https://convaisharables.blob.core.windows.net/clipbert/pretrained/bert-base-uncased.tar
    HTTP request sent, awaiting response... 409 Public access is not permitted on this storage account.
    2022-11-11 00:42:51 ERROR 409: Public access is not permitted on this storage account..
    tar: all_data/pretrained/bert-base-uncased.tar: Cannot open: No such file or directory
    tar: Error is not recoverable: exiting now
    rm: cannot remove 'all_data/pretrained/bert-base-uncased.tar': No such file or directory
    --2022-11-11 00:42:51--  https://convaisharables.blob.core.windows.net/clipbert/pretrained/grid_feat_R-50.pth
    HTTP request sent, awaiting response... 409 Public access is not permitted on this storage account.
    2022-11-11 00:42:51 ERROR 409: Public access is not permitted on this storage account..

    Does it mean I need to re-download all the pretrained model?

    opened by svetlana-work 2
  • TypeError: __init__() got an unexpected keyword argument 'train_on_pred_boxes'

    While running run_video_qa.py with tgif_qa_action, I get the error below. I'm unable to figure out whether there is something wrong with the dataset or with Detectron2. I've installed the latest version.

    08/18/2022 22:46:34 - INFO - __main__ -   device: cuda:0 n_gpu: 1, rank: 0, 16-bits training: True
    08/18/2022 22:46:34 - INFO - __main__ -   Setup model...
    08/18/2022 22:46:34 - INFO - __main__ -   setup e2e model
    Traceback (most recent call last):
      File "/home/rishihazra/PycharmProjects/VisionLangaugeGrounding/baselines/ClipBERT/src/tasks/run_video_qa.py", line 722, in <module>
        start_training(input_cfg)
      File "/home/rishihazra/PycharmProjects/VisionLangaugeGrounding/baselines/ClipBERT/src/tasks/run_video_qa.py", line 385, in start_training
        model = setup_model(cfg, device=device)
      File "/home/rishihazra/PycharmProjects/VisionLangaugeGrounding/baselines/ClipBERT/src/tasks/run_video_qa.py", line 193, in setup_model
        transformer_cls=transformer_model_cls)
      File "/home/rishihazra/PycharmProjects/VisionLangaugeGrounding/baselines/ClipBERT/src/modeling/e2e_model.py", line 25, in __init__
        config=config, input_format=input_format)
      File "/home/rishihazra/PycharmProjects/VisionLangaugeGrounding/baselines/ClipBERT/src/modeling/grid_feat.py", line 42, in __init__
        self.feature = build_model(self.detectron2_cfg)
      File "/home/rishihazra/detectron2/detectron2/modeling/meta_arch/build.py", line 22, in build_model
        model = META_ARCH_REGISTRY.get(meta_arch)(cfg)
      File "/home/rishihazra/detectron2/detectron2/config/config.py", line 189, in wrapped
        explicit_args = _get_args_from_config(from_config_func, *args, **kwargs)
      File "/home/rishihazra/detectron2/detectron2/config/config.py", line 245, in _get_args_from_config
        ret = from_config_func(*args, **kwargs)
      File "/home/rishihazra/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 77, in from_config
        "roi_heads": build_roi_heads(cfg, backbone.output_shape()),
      File "/home/rishihazra/detectron2/detectron2/modeling/roi_heads/roi_heads.py", line 43, in build_roi_heads
        return ROI_HEADS_REGISTRY.get(name)(cfg, input_shape)
      File "/home/rishihazra/PycharmProjects/VisionLangaugeGrounding/baselines/ClipBERT/src/modeling/grid_feats/roi_heads.py", line 175, in __init__
        super(StandardROIHeads, self).__init__(cfg, input_shape)
      File "/home/rishihazra/detectron2/detectron2/config/config.py", line 190, in wrapped
        init_func(self, **explicit_args)
    TypeError: __init__() got an unexpected keyword argument 'train_on_pred_boxes'
    
    Process finished with exit code 1
    
    
    opened by RishiHazra 0
  • Purpose of having both E2E_TrainingRestorer and ModelSaver

    I noticed that both classes are used, but they seem to perform the similar function of saving the model and optimizer state dicts. May I know the reason for utilising both?

    opened by Tangolin 0
  • Got 'Resource temporarily unavailable' using docker

    Hi, I always get a 'runtime/cgo: pthread_create failed: Resource temporarily unavailable' error when using Docker, and the Docker process cannot stop itself; I need to use sudo to kill the process, which is very inconvenient. What's more, I found that saving the code and backup checkpoints needs a very large amount of space (~GBs), which may cause the above error. Any suggestions for this error? Thanks a lot!

    opened by Zoe-Ziyi 2