SA-tensorflow

Tensorflow implementation of soft-attention mechanism for video caption generation.

An example of the soft-attention mechanism. The attention weight alpha indicates the temporal attention over one video for each generated word.

[Yao et al. 2015, Describing Videos by Exploiting Temporal Structure] The original code, implemented in Torch, can be found here.
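
For intuition, here is a minimal NumPy sketch of one temporal soft-attention step (the names and the bilinear scoring form are illustrative assumptions, not this repo's exact code): the decoder state scores every clip feature, the scores are normalized into the weights alpha with a softmax over time, and the context vector is the alpha-weighted sum of the features.

```python
import numpy as np

def soft_attention(clip_feats, decoder_state, W):
    """One temporal soft-attention step (illustrative sketch).

    clip_feats:    (n_steps, hidden_dim) visual features, one per clip
    decoder_state: (hidden_dim,) current decoder LSTM state
    W:             (hidden_dim, hidden_dim) learned projection
    """
    # Relevance score of each clip given the current decoder state.
    scores = clip_feats @ W @ decoder_state      # (n_steps,)
    # Softmax over time: attention weights alpha sum to 1.
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    # Context vector: alpha-weighted sum of clip features.
    context = alpha @ clip_feats                 # (hidden_dim,)
    return context, alpha
```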

Prerequisites

  • Python 2.7
  • Tensorflow >= 0.7.1
  • NumPy
  • pandas
  • Keras
  • Java 1.8.0

Data

The MSVD [2] dataset can be downloaded from here.

We pack the data into the HDF5 format, where each file is a mini-batch for training and has the following keys:

[u'data', u'fname', u'label', u'title']

batch['data'] stores the visual features. shape (n_step_lstm, batch_size, hidden_dim)

batch['fname'] stores the filenames (without extension) of the videos. shape (batch_size)

batch['title'] stores the descriptions. If multiple sentences correspond to one video, the other fields, such as visual features, filenames, and labels, have to be duplicated for a one-to-one mapping. shape (batch_size)

batch['label'] indicates where the video ends. For instance, [-1., -1., -1., -1., 0., -1., -1.] means that the video ends at index 4. shape (n_step_lstm, batch_size)
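
For a quick sanity check, a mini-batch can be inspected with h5py; the filename below is hypothetical, and the end-index recovery just applies the label convention described above.

```python
import h5py
import numpy as np

# Hypothetical batch filename; any training batch file will do.
with h5py.File('train000000.h5', 'r') as batch:
    data  = batch['data'][:]    # (n_step_lstm, batch_size, hidden_dim)
    fname = batch['fname'][:]   # (batch_size,)
    title = batch['title'][:]   # (batch_size,)
    label = batch['label'][:]   # (n_step_lstm, batch_size)

# The single 0 in each label column marks the last step of that video,
# e.g. [-1, -1, -1, -1, 0, -1, -1] means the video ends at index 4.
end_index = np.argmax(label == 0, axis=0)
print(data.shape, end_index)
```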

Generate HDF5 data

We generate the HDF5 data by following the steps below. The code is a little messy; if you have any questions, feel free to ask.

1. Generate Label

Once you have changed video_path and output_path, you can generate the labels by running the script:

python hdf5_generator/generate_nolabel.py

I set the length of each clip to 10 frames and the maximum number of frames per video to 450. You can change these parameters in the function get_frame_list(frame_num).
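
A sketch of what that splitting amounts to, assuming the two parameters above (the actual get_frame_list(frame_num) may differ in detail):

```python
def get_frame_list(frame_num, clip_len=10, max_frames=450):
    """Return the start index of each fixed-length clip (sketch).

    The video is capped at max_frames frames and cut into
    non-overlapping clips of clip_len frames each.
    """
    frame_num = min(frame_num, max_frames)
    return list(range(0, frame_num - clip_len + 1, clip_len))

# e.g. get_frame_list(35) -> [0, 10, 20]
```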

2. Pack features together (no caption information)

Inputs:

label_path: The path for the labels generated earlier.

feature_path: The path that stores features such as VGG and C3D. You can name the directory whatever you want.

Outputs:

h5py_path: The path where you store the concatenation of the different features; the code will automatically put the features in the subdirectory cont.

python hdf5_generator/input_generator.py

Note that in the function get_feats_depend_on_label(), you can choose whether to take the mean feature or a randomly sampled frame feature for each clip. The random-sampling code is commented out since its performance is worse.
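
The two options amount to the following reduction over the frames of one clip (a sketch under assumed shapes, not the repo's exact code):

```python
import numpy as np

def clip_feature(frame_feats, use_mean=True):
    """Reduce one clip's per-frame features to a single vector (sketch).

    frame_feats: (clip_len, feat_dim) array, e.g. VGG or C3D features.
    """
    if use_mean:
        # Mean feature over the clip (the option kept in the repo).
        return frame_feats.mean(axis=0)
    # Random-sample alternative (commented out upstream; performs worse).
    return frame_feats[np.random.randint(len(frame_feats))]
```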

3. Add captions into HDF5 data

I set the maximum number of words in a caption to 35. The feature folder is where the final output features are stored.

python hdf5_generator/trans_video_youtube.py

(The code here was written by Kuo-Hao.)
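
Capping captions at 35 words boils down to truncating the token list and padding it to a fixed length so sentences can be batched. A sketch, with hypothetical vocabulary and padding IDs:

```python
def encode_caption(caption, word_to_id, max_words=35, unk_id=1, pad_id=0):
    """Map a caption to a fixed-length list of word IDs (sketch)."""
    tokens = caption.lower().split()[:max_words]    # truncate to 35 words
    ids = [word_to_id.get(w, unk_id) for w in tokens]
    return ids + [pad_id] * (max_words - len(ids))  # right-pad to 35
```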

Generate data list

video_data_path_train = '$ROOTPATH/SA-tensorflow/examples/train_vn.txt'

You can change the path variable to the absolute path of your data. Then simply run python getlist.py to generate the lists.

P.S. The filenames of the HDF5 data start with train, val, or test.
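
Given that naming convention, generating the lists is essentially a prefix glob; a sketch of what getlist.py presumably does (paths are placeholders):

```python
import glob
import os

data_dir = '/path/to/SA-tensorflow/examples'  # your $ROOTPATH data folder
for split in ('train', 'val', 'test'):
    files = sorted(glob.glob(os.path.join(data_dir, split + '*.h5')))
    with open(os.path.join(data_dir, split + '_vn.txt'), 'w') as f:
        f.write('\n'.join(files) + '\n')
```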

Usage

Training

$ python Att.py --task train

Testing

Test the model after a certain number of training epochs.

$ python Att.py --task test --net models/model-20

Author

Tseng-Hung Chen

Kuo-Hao Zeng

Disclaimer

We adapted the code from the repository jazzsaxmafia/video_to_sequence to build the temporal-attention model.

References

[1] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Describing videos by exploiting temporal structure. arXiv:1502.08029v4, 2015.

[2] D. L. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011), Portland, OR, June 2011.

[3] Microsoft COCO Caption Evaluation

Comments
  • Handling longer videos while preparing data files

    Thanks for preparing a wonderful code!

    While preparing the data h5 files: as mentioned, "batch['data'] stores the visual features. shape (n_step_lstm, batch_size, hidden_dim)". How should videos longer than n_step_lstm be handled? If a video is broken into parts and stored as separate input samples, would the model figure this out and learn from the parts of the same video using the batch['label'] parameter?

    Any help on preparing the data h5 files would be appreciated. Thanks.

    enhancement 
    opened by sxs4337
  • Error in generate_nolabel

    [h264 @ 0x163d0e0] missing picture in access unit (message repeated 16 times)

    Traceback (most recent call last):
      File "hdf5_generator/generate_nolabel.py", line 88, in <module>
        get_label_list(fname)
      File "hdf5_generator/generate_nolabel.py", line 71, in get_label_list
        frame_len = get_total_frame_number(fname)
      File "hdf5_generator/generate_nolabel.py", line 33, in get_total_frame_number
        length = float(cap.get(cv2.cv.CV_CAP_PROP_FRAME_COUNT))
    AttributeError: 'module' object has no attribute 'cv'

    opened by loveJasmine
  • Number of epochs to reproduce paper scores

    I was able to write a script for data generation for MSVD. Could you please comment on the number of epochs needed to reproduce the scores from the [Yao et al. 2015, Describing Videos by Exploiting Temporal Structure] paper? I see that 900 epochs are mentioned in the code. Thanks.

    opened by sxs4337
  • Generating vocabulary only from the training set

    The vocabulary should be generated only from the training data. Currently, in the function https://github.com/tsenghungchen/SA-tensorflow/blob/master/Att.py#L370, the input is "captions", which is generated from all data (train+val+test). Ideally, the network should not be fed any words from the test set: any unseen new words during testing should just be <unknown_word> for evaluation. Thanks.

    bug 
    opened by sxs4337
  • Can't run "input_generator.py"

    The function splitdata(path, train_num, val_num) is never executed in "generate_nolabel.py", so the file "msvd_dataset_final.npz" is not generated and "input_generator.py" cannot run.

    opened by Nadern96
  • HDF5 example data

    Hello! Thank you very much for the nice project! I am trying to reproduce the results, but I have problems creating the hdf5 files. Could anyone provide one or two of the hdf5 files as an example to compare against? That would be great!

    opened by jiseungshin
  • What does the 'feature_path' mean?

    Hi, I have trouble understanding what the feature_path means.
    As you said, "The path that stores features such as VGG and C3D". Does it mean the weights of an already-trained VGG-16? I have a pretrained vgg16_weights.h5 file, but it doesn't work well.

    opened by lcmaster-hx