Hi Gorjan
I am running into an issue with setting the number of frames sampled from each video. For context, I need to classify 1-second clips at a time, which amounts to 25 frames each, so I cannot sample more frames than that. The current setup uses 32 frames, so I changed appearance_num_frames to 25. However, this seems to interfere with the forward_features() method of TransformerResnet: the ResNet appears to output a sequence length of 32 by default, and I am not sure whether that can be changed.
The error happens at models.py line 267, where the features are combined with the position embedding. Any idea how I can fix this? And am I interpreting appearance_num_frames correctly to begin with?
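For reference, here is a minimal sketch of what I think is going on; the names and shapes below are my guesses from the error, not the actual code:

```python
import torch
import torch.nn as nn

# Hypothetical shapes, purely illustrative:
# the position embedding seems to be a fixed-size parameter created for 32 frames,
# while the ResNet features now have a sequence length of 25.
batch_size, embed_dim = 4, 512
pos_embed = nn.Parameter(torch.zeros(1, 32, embed_dim))  # built for 32 frames
features = torch.randn(batch_size, 25, embed_dim)         # 25 sampled frames

# I suspect models.py line 267 does something like this, which fails:
# features + pos_embed  -> RuntimeError: size mismatch (25 vs 32) along dim 1

# One workaround I considered is slicing the embedding to the actual sequence
# length, but I am not sure if that is the intended fix:
out = features + pos_embed[:, : features.size(1), :]
print(out.shape)  # torch.Size([4, 25, 512])
```

Does slicing (or rebuilding) the position embedding for 25 frames sound like the right approach, or is there a config option I am missing?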