Goal of the project: Detecting Temporal Boundaries in Sign Language Videos

Overview

This is the final project for the MVA RecVis course. Its goal is to detect temporal boundaries in sign language videos.

Automatic sign language indexing is an important step toward better communication tools for the deaf community. However, annotated sign language datasets are limited, and few people have the skills to annotate such data, which makes it hard to train well-performing machine learning models. Two important challenges are therefore to:

  • Increase the amount of available training data.
  • Make labeling easier for professionals, to reduce the risk of bad annotations.

In this context, techniques have emerged to perform automatic sign segmentation, which marks the boundaries between individual signs in sign language videos. The development of such tools could help alleviate the limited supply of labelled datasets currently available for sign language research.

demo

Previous work and personal contribution:

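This repository provides code for the final project of the Object Recognition & Computer Vision (RecVis) course. For more details, please refer to the project report, report.pdf. In this project, we reproduced the results of the following paper, using the code from the github.com/RenzKa/sign-segmentation repository:

  • Sign Language Segmentation with Temporal Convolutional Networks, Katrin Renz, Nicolaj C. Stache, Samuel Albanie, Gül Varol: ICASSP (2021).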

We used the pre-extracted frame-level features, obtained by applying the I3D model to the videos, to retrain the MS-TCN architecture for frame-level binary classification and reproduce the paper's results. The tests folder contains a notebook for reproducing the original paper's results, with mF1B = 68.68 on the evaluation set of the BSL Corpus.
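To make the frame-level formulation concrete, here is a minimal sketch of how a sequence of per-frame binary labels can be turned into boundary intervals and sign segments. It assumes that label 1 marks a boundary frame and label 0 a frame inside a sign; this convention, the function names, and the toy label sequence are illustrative assumptions and are not taken from the repository code.

```python
from itertools import groupby


def runs(labels):
    """Return (value, start, end) for each maximal run of equal labels.

    `end` is exclusive, so a run covers frames start .. end - 1.
    """
    out = []
    pos = 0
    for value, group in groupby(labels):
        length = sum(1 for _ in group)
        out.append((value, pos, pos + length))
        pos += length
    return out


def boundaries_and_segments(labels, boundary_label=1):
    """Split per-frame labels into boundary intervals and sign segments.

    Assumption: `boundary_label` marks boundary frames; every other
    label is treated as a frame inside a sign.
    """
    all_runs = runs(labels)
    boundaries = [(s, e) for v, s, e in all_runs if v == boundary_label]
    segments = [(s, e) for v, s, e in all_runs if v != boundary_label]
    return boundaries, segments


# Toy example: 12 frames containing two boundaries.
frame_labels = [0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0]
boundaries, segments = boundaries_and_segments(frame_labels)
print(boundaries)  # [(3, 5), (9, 10)]
print(segments)    # [(0, 3), (5, 9), (10, 12)]
```

Intervals are half-open, so a segment's length in frames is simply end minus start.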

We then implemented new models to improve on this result. We wanted to try attention-based models, which have recently attracted a great deal of interest in the vision research community. We first trained a vanilla Transformer encoder from scratch, but the results were not satisfactory.

  • Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin: (2017).

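We then implemented the ASFormer model (Transformer for Action Segmentation), using the code from github.com/ChinaYi/ASFormer. ASFormer is a hybrid transformer model that borrows some ideas from the MS-TCN architecture. The motivations behind the model and its architecture are detailed in the following paper:

  • ASFormer: Transformer for Action Segmentation, Fangqiu Yi, Hongyu Wen, Tingting Jiang: arXiv:2110.08568 (2021).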

We trained this model on the I3D-extracted features and obtained an improvement over the MS-TCN architecture. The results are given in the following table, where mF1B and mF1S are the mean F1 scores for boundaries and for sign segments, respectively:

ID  Model                mF1B       mF1S
1   MS-TCN               68.68±0.6  47.71±0.8
2   Transformer Encoder  60.28±0.3  42.70±0.2
3   ASFormer             69.79±0.2  49.23±1.2
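For readers unfamiliar with boundary-level scores, the sketch below shows one plausible way to compute an F1 score over boundary positions and average it over several distance tolerances. It is an illustration under stated assumptions, not the evaluation code used for the table: the greedy matching rule, the tolerance values (1 to 4 frames), and the function names are all hypothetical choices.

```python
from statistics import mean


def boundary_f1(pred, truth, tolerance):
    """F1 over boundary positions (frame indices).

    A predicted boundary counts as a hit if it lies within `tolerance`
    frames of a ground-truth boundary that has not been matched yet.
    Matching is greedy, in order of predicted position (an assumption).
    """
    unmatched = sorted(truth)
    hits = 0
    for p in sorted(pred):
        candidates = [t for t in unmatched if abs(t - p) <= tolerance]
        if candidates:
            nearest = min(candidates, key=lambda t: abs(t - p))
            unmatched.remove(nearest)
            hits += 1
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(truth)
    return 2 * precision * recall / (precision + recall)


def mean_boundary_f1(pred, truth, tolerances=(1, 2, 3, 4)):
    """Average the boundary F1 over several tolerances (placeholder values)."""
    return mean(boundary_f1(pred, truth, t) for t in tolerances)


# Toy example with three predicted and three ground-truth boundaries.
predicted = [10, 25, 40]
ground_truth = [11, 28, 60]
print(round(mean_boundary_f1(predicted, ground_truth), 3))  # 0.5
```

In this toy example one prediction matches at tolerances 1 and 2 (F1 = 1/3) and two match at tolerances 3 and 4 (F1 = 2/3), which gives the mean of 0.5.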

Setup

# Clone this repository
git clone https://github.com/loubnabnl/Sign-Segmentation-with-Transformers.git
cd Sign-Segmentation-with-Transformers/
# Create signseg_env environment
conda env create -f environment.yml
conda activate signseg_env

Data and models

You can download the pretrained models (I3D and MS-TCN) (models.zip [302MB]) and the data (data.zip [5.5GB]) used in the experiments here, or by executing download/download_*.sh. The unzipped data/ and models/ folders should be placed in the root directory of the repository. To run the demo, downloading the models folder is sufficient.

You can download our best pretrained ASFormer model weights here.

Data:

Please cite the original dataset when using the data: BSL Corpus. The authors of github.com/RenzKa/sign-segmentation provided the pre-extracted features and metadata. See here for a detailed description of the data files.

  • Features: data/features/*/*/features.mat
  • Metadata: data/info/*/info.pkl
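As a quick sanity check after unzipping, the two path patterns above can be expanded with `pathlib`. This is a minimal sketch that assumes the data folder sits at `data/` relative to the current directory; the helper names are hypothetical, and the sketch only lists files without opening them, since the internal layout of `features.mat` and `info.pkl` is not described here.

```python
from pathlib import Path


def list_feature_files(data_root="data"):
    """Return all files matching data/features/*/*/features.mat, sorted."""
    return sorted(Path(data_root).glob("features/*/*/features.mat"))


def list_info_files(data_root="data"):
    """Return all files matching data/info/*/info.pkl, sorted."""
    return sorted(Path(data_root).glob("info/*/info.pkl"))


if __name__ == "__main__":
    # Run from the repository root; both counts are 0 if data/ is missing.
    print(f"{len(list_feature_files())} feature file(s) found")
    print(f"{len(list_info_files())} metadata file(s) found")
```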

Models:

  • I3D weights, trained for sign classification: models/i3d/*.pth.tar
  • MS-TCN weights for the demo (see tables below for links to the other models): models/ms-tcn/*.model
  • ASFormer weights of our best model: models/asformer/*.model

The folder structure should be as below:

sign-segmentation/models/
  i3d/
    i3d_kinetics_bslcp.pth.tar
  ms-tcn/
    mstcn_bslcp_i3d_bslcp.model
  asformer/
    best_asformer_bslcp.model
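The layout above can be verified with a few lines of Python before running the demo. This is a minimal sketch: the file names are copied from the tree above, while the helper name `missing_model_files` and the default `models` root are illustrative assumptions.

```python
from pathlib import Path

# Relative paths taken from the folder structure listed above.
EXPECTED_MODEL_FILES = (
    "i3d/i3d_kinetics_bslcp.pth.tar",
    "ms-tcn/mstcn_bslcp_i3d_bslcp.model",
    "asformer/best_asformer_bslcp.model",
)


def missing_model_files(models_root="models"):
    """Return the expected model files that are absent under `models_root`."""
    root = Path(models_root)
    return [rel for rel in EXPECTED_MODEL_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_model_files()
    if missing:
        print("Missing model files:")
        for rel in missing:
            print(f"  models/{rel}")
    else:
        print("All expected model files are in place.")
```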

Demo

The demo folder contains a sample script that estimates the segments of a given sign language video. Run demo.py to get a visualization on a sample video.

cd demo
python demo.py

The demo will:

  1. use the models/i3d/i3d_kinetics_bslcp.pth.tar pretrained I3D model to extract features,
  2. use the models/asformer/best_asformer_model.model pretrained ASFormer model to predict the segments out of the features,
  3. save results.

Training

To train I3D, please refer to github.com/RenzKa/sign-segmentation. To train ASFormer on the pre-extracted I3D features, run main.py; you can change the hyperparameters through the arguments defined inside the file. Alternatively, you can run the notebook in the test_asformer folder.

Citation

If you use this code and data, please cite the following original papers:

@inproceedings{Renz2021signsegmentation_a,
    author       = "Katrin Renz and Nicolaj C. Stache and Samuel Albanie and G{\"u}l Varol",
    title        = "Sign Language Segmentation with Temporal Convolutional Networks",
    booktitle    = "ICASSP",
    year         = "2021",
}
@article{yi2021asformer,
  title={Asformer: Transformer for action segmentation},
  author={Yi, Fangqiu and Wen, Hongyu and Jiang, Tingting},
  journal={arXiv preprint arXiv:2110.08568},
  year={2021}
}

License

The license in this repository only covers the code. For data.zip and models.zip, we refer to the terms and conditions of the original datasets.

Acknowledgements

The code builds on the github.com/RenzKa/sign-segmentation and github.com/ChinaYi/ASFormer repositories.

Owner
Loubna Ben Allal
MVA (Mathematics, Vision, Learning) student at ENS Paris Saclay.