Codebase for "Revisiting spatio-temporal layouts for compositional action recognition" (Oral at BMVC 2021).

Overview

Revisiting spatio-temporal layouts for compositional action recognition

New Year's arrival (01.01.2022) > code release date > CVPR 2022 deadline (17.11.2021), i.e., the code will be released after the CVPR 2022 deadline and before the New Year...

Comments
  • Clarification on Models

    Am I right in understanding that, currently, you only implement the STLT model, without any of the appearance features?

    Is there a timeline for when the use of appearance features will be incorporated? I am particularly interested in either the PBF model (hoping to use my own appearance features) or the CACNF model (if it can be pretrained end-to-end).

    Thanks

    opened by michael-camilleri 15
  • Position Embedding

    As I was tweaking the code, I found some design choices in the position embeddings which I cannot understand.

    1. For the STLT, it seems that the position embedding vector does not take the number of sampled frames into account. Specifically, although it uses the config parameter config.layout_num_frames, this is never set in train.py (lines 87-96) or inference.py (lines 47-54), so the position embedding vector is always of size 256 (configs.py line 109). Is there a reason for this? I have changed this in my code, making the position embedding length always config.layout_num_frames + 1 (see the sketch after this list) - does this make sense?

    2. For CACNF, the position embedding seems to depend on the batch size and the image size itself for the summation with the features to work (models.py line 267). If I change config.spatial_size or use a batch size smaller than 16, there is a dimension mismatch. I hacked around this by computing some intermediate values in my code, but it is still somewhat hard-coded for my data. Is there a reason for this dependency/architecture?
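
    For reference, a minimal sketch of the change from point 1, sizing the position embedding from config.layout_num_frames plus one extra slot (presumably for a CLS-style token). The module below is hypothetical and only mirrors the config parameter names; it is not the repo's actual code:

    ```python
    import torch
    import torch.nn as nn


    class FramePositionEmbedding(nn.Module):
        """Hypothetical sketch: size the temporal position embedding from the
        number of sampled frames (plus one extra slot) instead of a fixed 256."""

        def __init__(self, layout_num_frames: int, hidden_size: int):
            super().__init__()
            self.position_embedding = nn.Embedding(layout_num_frames + 1, hidden_size)

        def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
            # frame_features: [batch_size, layout_num_frames + 1, hidden_size]
            positions = torch.arange(frame_features.size(1), device=frame_features.device)
            return frame_features + self.position_embedding(positions)
    ```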

    opened by michael-camilleri 12
  • Mismatch between appearance_num_frames and feature-size

    Hi Gorjan

    I am running into an issue with setting the number of frames to sample from each video. To put it into context, I need to classify 1-second clips at a time, which amount to 25 frames, so I cannot sample more than that. The current setup uses 32 frames, and I changed appearance_num_frames to 25. However, this seems to interfere with the forward_features() method of TransformerResnet: the ResNet appears to output a sequence of length 32 by default, and I am not sure whether this can be modified.

    The error happens in models.py line 267, when the features are combined with the position embedding. Any idea how I can fix this? And am I interpreting appearance_num_frames correctly to begin with?
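
    In case it clarifies what I am after, one workaround I have considered is interpolating the temporal position embedding to the actual number of sampled frames. The helper and tensor layout below are my assumptions, not the repo's API:

    ```python
    import torch
    import torch.nn.functional as F


    def resize_position_embedding(pos_embed: torch.Tensor, num_frames: int) -> torch.Tensor:
        """Hypothetical helper: linearly interpolate a temporal position embedding
        of shape [1, old_len, hidden_size] to length `num_frames`, so it can be
        added to features sampled with a different frame count (e.g. 25 vs. 32)."""
        # [1, old_len, hidden] -> [1, hidden, old_len] for 1D interpolation
        resized = F.interpolate(
            pos_embed.transpose(1, 2), size=num_frames, mode="linear", align_corners=False
        )
        return resized.transpose(1, 2)  # back to [1, num_frames, hidden]
    ```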

    opened by michael-camilleri 7
  • file missing for training stlt

    Hello, I tried to train the STLT model but ran into some errors. I first ran create_dataset.py to prepare the dataset, but afterwards I cannot find the videoid2size.json file, which is necessary for running the code. How can I obtain this JSON file? Thanks a lot.
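
    In case it helps, this is roughly how I would expect such a file to be built, assuming videoid2size.json maps each video id to its [width, height]; the directory layout, file format, and helper below are my guesses rather than the repo's actual tooling:

    ```python
    import json
    import os

    import cv2  # used only to read each video's frame dimensions


    def build_videoid2size(videos_dir: str, output_path: str) -> None:
        """Hypothetical helper: write a {video_id: [width, height]} mapping,
        which is what videoid2size.json is assumed to contain."""
        videoid2size = {}
        for filename in os.listdir(videos_dir):
            video_id, _ = os.path.splitext(filename)
            capture = cv2.VideoCapture(os.path.join(videos_dir, filename))
            width = int(capture.get(cv2.CAP_PROP_FRAME_WIDTH))
            height = int(capture.get(cv2.CAP_PROP_FRAME_HEIGHT))
            capture.release()
            videoid2size[video_id] = [width, height]
        with open(output_path, "w") as f:
            json.dump(videoid2size, f)
    ```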

    opened by patrolli 4
  • Saving Trained Model

    Hi Gorjan

    Once trained, how can I save a CACNF model? Does setting save_model_path also store the backbones (including the fine-tuned ResNet), or do I need to store them (and load them) separately using resnet_model_path and save_backbone_path?
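
    For context, the generic PyTorch pattern I would expect is sketched below; the attribute and checkpoint names are hypothetical, not necessarily the repo's exact API:

    ```python
    import torch
    from torch import nn


    def save_checkpoints(model: nn.Module, backbone: nn.Module) -> None:
        # Hypothetical sketch: save the full model (backbone included) and,
        # separately, only the backbone weights.
        torch.save(model.state_dict(), "checkpoints/cacnf_full.pt")
        torch.save(backbone.state_dict(), "checkpoints/resnet_backbone.pt")


    def load_full_model(model: nn.Module) -> nn.Module:
        # Restore the full checkpoint into a freshly constructed model of the
        # same architecture.
        model.load_state_dict(torch.load("checkpoints/cacnf_full.pt", map_location="cpu"))
        return model
    ```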

    opened by michael-camilleri 3
  • 0- or 1- based indexing

    Just checking

    Are the labels for the classifier 0- or 1-based? I.e., is the smallest label 0 or 1? The reason I am asking is that the associated label file seems to start from 1, but when I did the same for my classes, I got indexing errors and had to start everything from 0.
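
    For what it's worth, this is the remapping I ended up doing, assuming the labels file is a {class name: label} JSON that starts at 1 (the path and exact format are assumptions about my own setup, not the repo's):

    ```python
    import json

    # Hypothetical example: shift 1-based labels to the 0-based indices that
    # PyTorch's CrossEntropyLoss expects (valid targets are 0 .. num_classes - 1).
    with open("data/labels.json") as f:
        name_to_label = json.load(f)  # assumed format: {"class name": "1", ...}

    name_to_index = {name: int(label) - 1 for name, label in name_to_label.items()}
    assert min(name_to_index.values()) == 0
    ```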

    opened by michael-camilleri 2
  • results on ActionGenome dataset

    @gorjanradevski Dear author, I downloaded the STLT model for Action Genome (Oracle) and ran the command below, but the mAP is 58.75, slightly lower than the 60.6 reported in the paper. Can you tell me why?

    ```
    python inference.py --test_dataset_path "data/ActionGenome/val_dataset.json" --batch_size 1 --dataset_type "layout" --model_name "stlt" --checkpoint_path "checkpoints/action_genome_gt_stlt.pt" --dataset_name "action_genome" --labels_path "data/ActionGenome/labels.json" --videoid2size_path "data/ActionGenome/videoid2size.json"
    ```

    The logged model configuration is:

    • Unique categories: 38
    • Number of classes: 157
    • Hidden size: 768
    • Hidden dropout probability: 0.1
    • Layer normalization epsilon: 1e-12
    • Number of attention heads: 12
    • Number of spatial layers: 4
    • Number of temporal layers: 8
    • Max number of layout frames: 256
    • The backbone path is: None
    • Freezing the backbone: False

    Inference runs on 1814 samples and reports map: 58.75.
    opened by NingWang2049 1
Owner
Gorjan
Third-year PhD student in Machine Learning at KU Leuven, interested in multi-modal deep learning and spatial reasoning involving vision and language.