A PyTorch Implementation of PGL-SUM from "Combining Global and Local Attention with Positional Encoding for Video Summarization", Proc. IEEE ISM 2021


PGL-SUM: Combining Global and Local Attention with Positional Encoding for Video Summarization

PyTorch Implementation of PGL-SUM

  • From "PGL-SUM: Combining Global and Local Attention with Positional Encoding for Video Summarization", Proc. IEEE ISM 2021.
  • Written by Evlampios Apostolidis, Georgios Balaouras, Vasileios Mezaris and Ioannis Patras.
  • This software can be used for training a deep learning architecture which estimates frames' importance after modeling their dependencies with the help of global and local multi-head attention mechanisms that integrate a positional encoding component. Training is performed in a supervised manner based on ground-truth data (human-generated video summaries). After being trained on a collection of videos, the PGL-SUM model is capable of producing representative summaries for unseen videos, according to a user-specified time budget for the summary duration.

Main dependencies

Developed, checked and verified on an Ubuntu 20.04.3 PC with an NVIDIA TITAN Xp GPU. Main packages required:

Python | PyTorch | CUDA Version | cuDNN Version | TensorBoard | TensorFlow | NumPy | H5py
3.8(.8) | 1.7.1 | 11.0 | 8005 | 2.4.1 | 2.3.0 | 1.20.2 | 2.10.0


Structured h5 files with the video features and annotations of the SumMe and TVSum datasets are available within the data folder. The GoogleNet features of the video frames were extracted by Ke Zhang and Wei-Lun Chao and the h5 files were obtained from Kaiyang Zhou. These files have the following structure:

    /features                 2D-array with shape (n_steps, feature-dimension)
    /gtscore                  1D-array with shape (n_steps), stores ground truth importance score (used for training, e.g. regression loss)
    /user_summary             2D-array with shape (num_users, n_frames), each row is a binary vector (used for test)
    /change_points            2D-array with shape (num_segments, 2), each row stores indices of a segment
    /n_frame_per_seg          1D-array with shape (num_segments), indicates number of frames in each segment
    /n_frames                 number of frames in original video
    /picks                    positions of subsampled frames in original video
    /n_steps                  number of subsampled frames
    /gtsummary                1D-array with shape (n_steps), ground truth summary provided by user (used for training, e.g. maximum likelihood)
    /video_name (optional)    original video name, only available for SumMe dataset
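
As a minimal sketch of how such a file could be inspected, assuming h5py is installed and that each top-level group of the file corresponds to one video: the helper names and the path passed to `print_dataset_overview` are hypothetical and not part of this repository.

```python
def summarize_video_group(group):
    """Return the shape of every field stored under one video group.

    `group` can be any mapping whose values expose a `.shape` attribute,
    such as an h5py.Group holding the fields listed above. Scalar fields
    (e.g. n_frames, n_steps) are reported with an empty shape.
    """
    return {name: tuple(getattr(item, "shape", ())) for name, item in group.items()}


def print_dataset_overview(h5_path):
    """Print, for every video in the file, its fields and their shapes."""
    import h5py  # imported here so the helper above works without h5py

    with h5py.File(h5_path, "r") as hdf:
        # Assumption: one top-level group per video.
        for video_key in hdf.keys():
            print(video_key, summarize_video_group(hdf[video_key]))
```

Calling `print_dataset_overview` with the path to one of the h5 files in the data folder should list one line per video, which is a quick way to check that `/features` has the expected feature dimension before training.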

Original videos and annotations for each dataset are also available in the dataset providers' webpages:


To train the model on one of the aforementioned datasets over a number of randomly created splits (where each split uses 80% of the data for training and 20% for testing), use the corresponding JSON file included in the data/splits directory. This file contains the 5 randomly generated splits that were used in our experiments.
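
For illustration only, random 80/20 splits of this kind could be generated as sketched below. The function name, the `train_keys`/`test_keys` field names, the `video_i` key names and the seed are assumptions; this sketch does not reproduce the splits shipped in data/splits, which should be used to replicate the reported experiments.

```python
import json
import random


def make_random_splits(video_keys, n_splits=5, train_ratio=0.8, seed=12345):
    """Create `n_splits` random train/test partitions of `video_keys`."""
    rng = random.Random(seed)  # local generator, so results are reproducible
    n_train = round(len(video_keys) * train_ratio)
    splits = []
    for _ in range(n_splits):
        shuffled = list(video_keys)
        rng.shuffle(shuffled)
        splits.append({
            "train_keys": sorted(shuffled[:n_train]),  # assumed field name
            "test_keys": sorted(shuffled[n_train:]),   # assumed field name
        })
    return splits


if __name__ == "__main__":
    keys = [f"video_{i}" for i in range(1, 26)]  # hypothetical key names
    print(json.dumps(make_random_splits(keys)[0], indent=2))
```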

For training the model using a single split, run:

python model/main.py --split_index N --n_epochs E --batch_size B --video_type 'dataset_name'

where N is the index of the data split, E is the number of training epochs, B is the batch size, and dataset_name is the name of the dataset ('SumMe' or 'TVSum').

Alternatively, to train the model for all 5 splits, use the run_summe_splits.sh and/or run_tvsum_splits.sh script and do the following:

chmod +x model/run_summe_splits.sh    # Makes the script executable.
chmod +x model/run_tvsum_splits.sh    # Makes the script executable.
./model/run_summe_splits.sh           # Runs the script. 
./model/run_tvsum_splits.sh           # Runs the script.  
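
The same loop over splits can also be driven from Python. The sketch below is not the content of the shell scripts; it only assembles the single-split command shown earlier for each split index, and the function names are hypothetical.

```python
import subprocess
import sys


def build_train_command(split_index, video_type, n_epochs=200, batch_size=20):
    """Assemble the single-split training command as an argument list."""
    return [
        sys.executable, "model/main.py",
        "--split_index", str(split_index),
        "--n_epochs", str(n_epochs),
        "--batch_size", str(batch_size),
        "--video_type", video_type,
    ]


def train_all_splits(video_type, n_splits=5, **kwargs):
    """Run training once per split, stopping at the first failure."""
    for split_index in range(n_splits):
        command = build_train_command(split_index, video_type, **kwargs)
        subprocess.run(command, check=True)
```

For example, `train_all_splits("TVSum", batch_size=40)` would train on the 5 TVSum splits one after the other, assuming it is launched from the repository root.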

Please note that after each training epoch the algorithm performs an evaluation step, using the trained model to compute the importance scores for the frames of each video of the test set. These scores are then used by the provided evaluation scripts to assess the overall performance of the model (in F-Score).
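
To make the step from importance scores to a summary concrete, the sketch below selects video segments under a duration budget by solving a 0/1 knapsack problem over the segments given by `/change_points`. This is an assumption based on common practice in the video summarization literature, not a description of the provided evaluation scripts; the 15% default budget, the use of the mean frame score as segment value, and the function name are likewise assumptions.

```python
def select_segments(frame_scores, change_points, budget_ratio=0.15):
    """Pick segments maximizing total score under a length budget.

    frame_scores: importance score of every frame of the video.
    change_points: list of (start, end) frame indices, end inclusive.
    Returns the sorted indices of the selected segments.
    """
    lengths = [end - start + 1 for start, end in change_points]
    # Segment value: mean importance of its frames (assumed convention).
    values = [sum(frame_scores[start:end + 1]) / (end - start + 1)
              for start, end in change_points]
    capacity = int(len(frame_scores) * budget_ratio)

    # best[i][c]: best total value using the first i segments within c frames.
    best = [[0.0] * (capacity + 1) for _ in range(len(lengths) + 1)]
    for i, (length, value) in enumerate(zip(lengths, values), start=1):
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]
            if length <= c and best[i - 1][c - length] + value > best[i][c]:
                best[i][c] = best[i - 1][c - length] + value

    # Walk the table backwards to recover which segments were taken.
    selected, c = [], capacity
    for i in range(len(lengths), 0, -1):
        if best[i][c] != best[i - 1][c]:
            selected.append(i - 1)
            c -= lengths[i - 1]
    return sorted(selected)
```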

The progress of the training can be monitored with TensorBoard by:

  • opening a command line (cmd) and running: tensorboard --logdir=/path/to/log-directory --host=localhost
  • opening a browser and pasting the returned URL from cmd.


Setup for the training process:

  • In data_loader.py, specify the path to the h5 file of the used dataset, and the path to the JSON file containing data about the utilized data splits.
  • In configs.py, define the directory where the analysis results will be saved to.

Arguments in configs.py:

Parameter name | Description | Default Value | Options
--mode | Mode for the configuration. | 'train' | 'train', 'test'
--verbose | Print or not training messages. | 'false' | 'true', 'false'
--video_type | Used dataset for training the model. | 'SumMe' | 'SumMe', 'TVSum'
--input_size | Size of the input feature vectors. | 1024 | int > 0
--seed | Chosen number for generating reproducible random numbers. | 12345 | None, int
--fusion | Type of the used approach for feature fusion. | 'add' | None, 'add', 'mult', 'avg', 'max'
--n_segments | Number of video segments; equal to the number of local attention mechanisms. | 4 | None, int ≥ 2
--pos_enc | Type of the applied positional encoding. | 'absolute' | None, 'absolute', 'relative'
--heads | Number of heads of the global attention mechanism. | 8 | int > 0
--n_epochs | Number of training epochs. | 200 | int > 0
--batch_size | Size of the training batch, 20 for 'SumMe' and 40 for 'TVSum'. | 20 | 0 < int ≤ len(Dataset)
--clip | Gradient norm clipping parameter. | 5 | float
--lr | Value of the adopted learning rate. | 5e-5 | float
--l2_req | Value of the regularization factor. | 1e-5 | float
--split_index | Index of the utilized data split. | 0 | 0 ≤ int ≤ 4
--init_type | Weight initialization method. | 'xavier' | None, 'xavier', 'normal', 'kaiming', 'orthogonal'
--init_gain | Scaling factor for the initialization methods. | None | None, float
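
As a rough illustration of how a subset of these options could be declared with argparse, consider the sketch below. It is not the actual configs.py; the `str2bool` and `build_parser` helpers are hypothetical, and only the names, defaults and options listed above are taken from this README.

```python
import argparse


def str2bool(value):
    """Parse 'true'/'false' style strings, as used for --verbose."""
    if value.lower() in ("true", "yes", "1"):
        return True
    if value.lower() in ("false", "no", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    """Declare a subset of the documented options with their defaults."""
    parser = argparse.ArgumentParser(description="PGL-SUM options (sketch)")
    parser.add_argument("--mode", default="train", choices=["train", "test"])
    parser.add_argument("--verbose", type=str2bool, default=False)
    parser.add_argument("--video_type", default="SumMe", choices=["SumMe", "TVSum"])
    parser.add_argument("--n_segments", type=int, default=4)
    parser.add_argument("--heads", type=int, default=8)
    parser.add_argument("--n_epochs", type=int, default=200)
    parser.add_argument("--batch_size", type=int, default=20)
    parser.add_argument("--lr", type=float, default=5e-5)
    parser.add_argument("--split_index", type=int, default=0, choices=range(5))
    return parser
```

With such a parser, `build_parser().parse_args(["--video_type", "TVSum", "--batch_size", "40"])` would override the dataset and batch size while keeping every other default.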

Model Selection and Evaluation

The utilized model selection criterion relies on the post-processing of the calculated losses over the training epochs and enables the selection of a well-trained model by indicating the training epoch. To evaluate the trained models of the architecture and automatically select a well-trained model, define the dataset_path in compute_fscores.py and run evaluate_exp.sh. To run this file, specify:

  • base_path/exp$exp_num: the path to the folder where the analysis results are stored,
  • $dataset: the dataset being used, and
  • $eval_method: the used approach for computing the overall F-Score after comparing the generated summary with all the available user summaries (i.e., 'max' for SumMe and 'avg' for TVSum).

sh evaluation/evaluate_exp.sh $exp_num $dataset $eval_method
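
The difference between the 'max' and 'avg' options of $eval_method can be sketched as follows. This is a simplified illustration rather than the code in compute_fscores.py: summaries are assumed to be binary per-frame vectors of equal length (as in `/user_summary`), the F-Score is assumed to be the usual harmonic mean of precision and recall expressed as a percentage, and the function names are hypothetical.

```python
def f_score(machine_summary, user_summary):
    """F-Score (in %) between two binary per-frame selection vectors."""
    overlap = sum(1 for m, u in zip(machine_summary, user_summary) if m and u)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(machine_summary)
    recall = overlap / sum(user_summary)
    return 100 * 2 * precision * recall / (precision + recall)


def overall_f_score(machine_summary, user_summaries, eval_method):
    """Combine the per-user F-Scores with 'max' or 'avg'."""
    scores = [f_score(machine_summary, user) for user in user_summaries]
    if eval_method == "max":
        return max(scores)
    if eval_method == "avg":
        return sum(scores) / len(scores)
    raise ValueError(f"unknown eval_method: {eval_method!r}")
```

With 'max', a generated summary is credited for matching its closest user summary; with 'avg', it is compared against all users and the scores are averaged.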

For further details about the adopted structure of directories in our implementation, please check line #6 and line #11 of evaluate_exp.sh.

Trained models and Inference

We have released the trained models for the main experiments of our ISM 2021 paper (namely, Table III and Table IV). The inference.py script lets you evaluate the reported trained models on our 5 randomly created data splits. First, download the trained models with the following script:

sudo apt-get install unzip wget
wget "https://zenodo.org/record/5635735/files/pretrained_models.zip?download=1" -O pretrained_models.zip
unzip pretrained_models.zip -d inference
rm -f pretrained_models.zip

Then, specify the paths for the model, the split_file and the dataset in use. Finally, run the script with the following syntax:

python inference/inference.py --table ID --dataset 'dataset_name'

where ID is the id of the reported table, and dataset_name is the name of the dataset.


Copyright (c) 2021, Evlampios Apostolidis, Georgios Balaouras, Vasileios Mezaris, Ioannis Patras / CERTH-ITI. All rights reserved. This code is provided for academic, non-commercial use only. Redistribution and use in source and binary forms, with or without modification, are permitted for academic non-commercial use provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation provided with the distribution.

This software is provided by the authors "as is" and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In no event shall the authors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.


This work was supported by the EU Horizon 2020 programme under grant agreement H2020-832921 MIRROR, and by EPSRC under grant No. EP/R026424/1.