A new video text spotting framework with Transformer

Overview

TransVTSpotter: End-to-end Video Text Spotter with Transformer

License: MIT

Introduction

A Multilingual, Open World Video Text Dataset and End-to-end Video Text Spotter with Transformer

See also our MOVText: A Large-Scale, Multilingual Open World Dataset for Video Text Spotting.

Updates

  • (08/04/2021) Refactored the code.

  • (10/20/2021) The complete code has been released.

ICDAR2015 (video) Tracking Challenge

Methods          MOTA    MOTP    IDF1    Mostly Matched    Partially Matched    Mostly Lost
TransVTSpotter   45.75   73.58   57.56   658               611                  647

Models are also available on Baidu Drive (password: m4iv).

Notes

  • Training times are measured on 8 NVIDIA V100 GPUs with a batch size of 16.
  • We use models pre-trained on COCOTextV2.
  • We do not release the recognition code due to company regulations.

Demo

Installation

The codebases are built on top of Deformable DETR and TransTrack.

Requirements

  • Linux, CUDA>=9.2, GCC>=5.4
  • Python>=3.7
  • PyTorch>=1.5 and a torchvision version that matches the PyTorch installation. You can install them together from pytorch.org to make sure of this.
  • OpenCV is optional, needed only for the demo and visualization. (A quick environment check is sketched below.)
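
A minimal sanity check that the environment meets these requirements; these are plain PyTorch calls, nothing specific to this repository:

import torch
import torchvision

print("PyTorch:", torch.__version__)                 # expect >= 1.5
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())  # CUDA >= 9.2 is needed to build the ops
print("CUDA version:", torch.version.cuda)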

Steps

  1. Install and build libs
git clone [email protected]:weijiawu/TransVTSpotter.git
cd TransVTSpotter
cd models/ops
python setup.py build install
cd ../..
pip install -r requirements.txt
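
If the build succeeded, the compiled deformable-attention CUDA extension should be importable. The module name below comes from Deformable DETR, on which this codebase is built, so treat it as an assumption:

# If this import succeeds, the CUDA ops built and installed correctly.
import MultiScaleDeformableAttention  # module name taken from Deformable DETR; assumption
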
  2. Prepare datasets and annotations
# pretrain COCOTextV2
python3 track_tools/convert_COCOText_to_coco.py

# ICDAR15
python3 track_tools/convert_ICDAR15video_to_coco.py

The COCOTextV2 dataset is available from COCOTextV2, and the ICDAR2015 video dataset from icdar2015.

Converters for CrowdHuman and MOT-style data are also included:

python3 track_tools/convert_crowdhuman_to_coco.py
python3 track_tools/convert_mot_to_coco.py
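
The convert_*_to_coco.py scripts write COCO-style JSON annotations (for ICDAR15, train.json and test.json under annotations_coco_rotate/, as mentioned in the issues below). A minimal sketch of that layout, using the standard COCO detection fields; it is illustrative only and not confirmed against the converters' exact output:

# Sketch of a COCO-style annotation dict (standard COCO detection fields;
# file names and values are hypothetical).
coco = {
    "images": [
        {"id": 1, "file_name": "Video_1/frame_0001.jpg", "height": 720, "width": 1280},
    ],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [100.0, 200.0, 50.0, 20.0],  # [x, y, width, height]
         "area": 1000.0, "iscrowd": 0},
    ],
    "categories": [{"id": 1, "name": "text"}],
}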
  3. Pre-train on COCOTextV2
python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main_track.py  --output_dir ./output/Pretrain_COCOTextV2 --dataset_file pretrain --coco_path ./Data/COCOTextV2 --batch_size 2  --with_box_refine --num_queries 500 --epochs 300 --lr_drop 100 --resume ./output/Pretrain_COCOTextV2/checkpoint.pth

python3 track_tools/Pretrain_model_to_mot.py

The pre-trained model is available as COCOTextV2_pretrain.pth (password: 59w8), and a checkpoint reaching 44% MOTA can be found here (password: xnlw).
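
Pretrain_model_to_mot.py adapts the pre-training checkpoint for the tracking run. A minimal sketch of that kind of conversion, assuming the standard Deformable DETR checkpoint layout; the real script's key handling may differ:

import torch

# Load the full pre-training checkpoint (weights plus optimizer/scheduler state).
ckpt = torch.load("./output/Pretrain_COCOTextV2/checkpoint.pth", map_location="cpu")

# Keep only the model weights; the ICDAR15 fine-tuning run resumes from these.
torch.save({"model": ckpt["model"]}, "./output/Pretrain_COCOTextV2/pretrain_coco.pth")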

  4. Train TransVTSpotter
python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main_track.py --output_dir ./output/ICDAR15 --dataset_file text --coco_path ./Data/ICDAR2015_video --batch_size 2 --with_box_refine --num_queries 300 --epochs 80 --lr_drop 40 --resume ./output/Pretrain_COCOTextV2/pretrain_coco.pth

With 8 processes and --batch_size 2 per GPU, the effective batch size is 16, matching the training setup described in the notes above.
  5. Visualize TransVTSpotter
python3 track_tools/Evaluation_ICDAR15_video/vis_tracking.py
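
Visualization relies on OpenCV (listed as optional above). A minimal sketch of drawing tracked boxes from a per-video result JSON; the paths and the record layout here are assumptions, not the actual format used by vis_tracking.py:

import json
import os
import cv2

# Hypothetical paths and record layout; the real vis_tracking.py may differ.
result_path = "./output/ICDAR15/test/best_json_tracks/res_video_1.json"
frame_dir = "./Data/ICDAR2015_video/frames/Video_1"
out_dir = "./output/vis/Video_1"
os.makedirs(out_dir, exist_ok=True)

with open(result_path) as f:
    tracks = json.load(f)  # assumed: {frame_name: [[track_id, x1, y1, x2, y2], ...]}

for frame_name, boxes in tracks.items():
    img = cv2.imread(os.path.join(frame_dir, frame_name))
    for track_id, x1, y1, x2, y2 in boxes:
        cv2.rectangle(img, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
        cv2.putText(img, str(track_id), (int(x1), int(y1) - 4),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    cv2.imwrite(os.path.join(out_dir, frame_name), img)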

License

TransVTSpotter is released under the MIT License.

Citing

If you use TransVTSpotter in your research or wish to refer to the baseline results published here, please use the following BibTeX entries:

Comments
  • About Recognition model

    Hi, the recognition model in your paper is MASTER. I know that for some reason you can't release the recognition code. Could you please tell me whether you use the vanilla MASTER or a modified one? Thanks!

    opened by imMid-Star 1
  • No res_video_1.json after running "python track_tools/convert_ICDAR15video_to_coco.py"

    Hi,

    Thanks for your great work!

    I am a bit confused: after I run python track_tools/Evaluation_ICDAR15_video/vis_tracking.py, I get "No such file or directory: './output/ICDAR15/test/best_json_tracks/res_video_1.json'".

    I have seen issue #2 and confirm I have run python track_tools/convert_ICDAR15video_to_coco.py, but it seems that "res_video_1.json" has not been generated. I only find "train.json" and "test.json" under "annotations_coco_rotate/"; should I rename one of them to "res_video_1.json" and copy it to "./output/ICDAR15/test/best_json_tracks/res_video_1.json"?

    Please help me! Thanks a lot!

    opened by liuruijin17 4
  • RuntimeError: median cannot be called with empty tensor

    Traceback (most recent call last):
      File "main_track.py", line 363, in <module>
        main(args)
      File "main_track.py", line 326, in main
        model, criterion, data_loader_train, optimizer, device, epoch, args.clip_max_norm)
      File "TransVTSpotter/engine_track.py", line 41, in train_one_epoch
        for _ in metric_logger.log_every(range(len(data_loader)), print_freq, header):
      File "TransVTSpotter/util/misc.py", line 260, in log_every
        meters=str(self),
      File "TransVTSpotter/util/misc.py", line 210, in __str__
        "{}: {}".format(name, str(meter))
      File "TransVTSpotter/util/misc.py", line 109, in __str__
        median=self.median,
      File "TransVTSpotter/util/misc.py", line 88, in median
        return d.median().item()
    RuntimeError: median cannot be called with empty tensor

    I think there might be something wrong with the datasets. My dataset directory layout is shown in the attached screenshot.

    Is that right? Can you give me some examples of the expected dataset structure, or a solution to this error? Thanks!

    opened by imMid-Star 2
  • Couldn't get the json file

    There was an error: "FileNotFoundError: [Errno 2] No such file or directory: './output/ICDAR15/test/best_json_tracks/res_video_1.mp4.json'"

    I downloaded the IC15 video dataset and ran "python track_tools/convert_ICDAR15video_to_coco.py".

    I couldn't find any files in json or jpg format in the downloads from the ICDAR website https://rrc.cvc.uab.es/?ch=3&com=downloads; the unzipped files only contain '***.mp4', '***.xml', and '***.txt' files.

    How can I get the json annotation files such as 'res_video_1.mp4.json'?

    opened by D201830280253 2
  • Cannot reproduce results

    Thank you for the nice work! I'm having problems reproducing the results in your paper and was hoping you could help.

    I have done the following steps.

    1. Download the ICDAR15 video training set and the official test video dataset.
    2. Prepare the training and test dataset folders using video2frames and convert_ICDAR15video_to_coco.
    3. Download pretrain_coco.pth from your Baidu drive.
    4. Train on ICDAR15 video using: python -m torch.distributed.launch --nproc_per_node=8 --use_env main_track.py --output_dir ./output/icdar_tiv --dataset_file text --coco_path "${MY_DATA_DIR}/icdar_tiv" --batch_size 2 --with_box_refine --num_queries 300 --epochs 80 --lr_drop 40 --resume ./pths/pretrain_coco.pth
    5. Generate inferences with the trained model on the official test set: python main_track.py --eval --output_dir ./output/icdar_tiv_submit --resume ./output/icdar_tiv/checkpoint0079.pth --dataset_file text --coco_path "${MY_DATA_DIR}/icdar_tiv_test" --batch_size 1 --with_box_refine --num_queries 300
    6. Zip up the results in output/icdar_tiv_submit/text/xml_dir.
    7. Submit the results to the official ICDAR2015 site.

    The resulting MOTA is 2.08%, very far from the expected ~45%. Note that "Mostly Matched" is 842, in line with the reported results, so it seems that detection is working but tracking is failing. Am I missing something in the code? Thanks for any help.

    opened by tonysherbondy 13