A machine Learning project to generate captions for video frames indicating the relationship between the objects in the video.
In our framework we use a sequence-to-sequence model to perform video visual relationship predictions where the input is a sequence of video frames and the output is a relation triplet < object1 − relationship − object2 > representing the videos. We extend the sequence-to-sequence modelling approach to an input of sequence of video frames.
Figure: Bidirectional LSTM layer (coloured red) encodes visual feature inputs, and the LSTM layer (coloured green) decodes the features into a sequence of words.
For training the model, run the script train.py.
For training on your own dataset: Save your data in a directory (for the format check the data folder). Update the json files.
object1_object2.json: It contains a dictionary for each object, with object labels as keys and ids as values.
relationship.json: It contains a dictionary for each relationship, with relationship labels as keys and ids as values.
training_annotations.json: It contains a dictionary for each video in the training data, with video ids as keys and a list of
While running the script provide your directory path.
python eval.py --train_data
For testing the model or making predictions on your own dataset, run the script eval.py.
python eval.py --test_data
Result will be saved to a csv file 'test_data_predictions.csv'.