TransVTSpotter: End-to-end Video Text Spotter with Transformer

Overview

TransVTSpotter is an end-to-end, Transformer-based video text spotter, released together with the MOVText dataset.

License: MIT

Introduction

A Multilingual, Open World Video Text Dataset and End-to-end Video Text Spotter with Transformer

See also our dataset: MOVText: A Large-Scale, Multilingual Open World Dataset for Video Text Spotting.

Updates

  • (08/04/2021) Refactoring the code.

  • (10/20/2021) The complete code has been released.

ICDAR2015 (video) Tracking Challenge

Methods          MOTA    MOTP    IDF1    Mostly Matched   Partially Matched   Mostly Lost
TransVTSpotter   45.75   73.58   57.56   658              611                 647

Notes

  • Training time is measured on 8 NVIDIA V100 GPUs with a total batch size of 16 (the commands below use --batch_size 2 per GPU across 8 processes).
  • We use the models pre-trained on COCOTextV2.
  • We do not release the recognition code due to company regulations.

Demo

Installation

The codebase is built on top of Deformable DETR and TransTrack.

Requirements

  • Linux, CUDA>=9.2, GCC>=5.4
  • Python>=3.7
  • PyTorch>=1.5 and a torchvision version that matches the PyTorch installation. You can install them together from pytorch.org to ensure compatibility (see the environment sketch after this list)
  • OpenCV is optional and is only needed for the demo and visualization
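
A minimal environment setup might look like the following sketch; the environment name and the cudatoolkit version are assumptions, so adjust them to your system:

# create and activate an isolated environment (name is arbitrary)
conda create -n transvtspotter python=3.7 -y
conda activate transvtspotter
# install PyTorch and torchvision together so their versions match
# (cudatoolkit 10.2 is an assumption; pick the build matching your driver)
conda install pytorch=1.5.1 torchvision=0.6.1 cudatoolkit=10.2 -c pytorch -y
# optional, only needed for the demo and visualization
pip install opencv-python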

Steps

  1. Install and build libs
git clone git@github.com:weijiawu/TransVTSpotter.git
cd TransVTSpotter
cd models/ops
python setup.py build install
cd ../..
pip install -r requirements.txt
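
If the build succeeds, you can optionally smoke-test the compiled deformable-attention extension; the module name below comes from upstream Deformable DETR and is assumed to be unchanged in this codebase:

# should print the message if the CUDA op was built and installed correctly
python -c "import MultiScaleDeformableAttention; print('deformable ops built OK')"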
  2. Prepare datasets and annotations
# pretrain COCOTextV2
python3 track_tools/convert_COCOText_to_coco.py

# ICDAR15
python3 track_tools/convert_ICDAR15video_to_coco.py

The COCOTextV2 dataset is available from the COCOTextV2 project page, and the ICDAR2015 video dataset from the ICDAR2015 "Text in Videos" challenge page.

Converters for CrowdHuman and MOT-style data are also provided:

python3 track_tools/convert_crowdhuman_to_coco.py
python3 track_tools/convert_mot_to_coco.py
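
For orientation, the training commands below assume a data layout roughly like this; the folder names are inferred from the --coco_path arguments and the converter scripts, so treat them as assumptions:

Data/
  COCOTextV2/          # images plus COCO-format annotations from convert_COCOText_to_coco.py
  ICDAR2015_video/     # extracted frames plus COCO-format annotations from convert_ICDAR15video_to_coco.py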
  3. Pre-train on COCOTextV2
python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main_track.py  --output_dir ./output/Pretrain_COCOTextV2 --dataset_file pretrain --coco_path ./Data/COCOTextV2 --batch_size 2  --with_box_refine --num_queries 500 --epochs 300 --lr_drop 100 --resume ./output/Pretrain_COCOTextV2/checkpoint.pth

python3 track_tools/Pretrain_model_to_mot.py

The pre-trained model is available on Baidu Netdisk (password: 59w8) and on Google Drive.

A trained model reaching 44% MOTA is also available on Baidu Netdisk (password: xnlw) and on Google Drive.

  4. Train TransVTSpotter
python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main_track.py  --output_dir ./output/ICDAR15 --dataset_file text --coco_path ./Data/ICDAR2015_video --batch_size 2  --with_box_refine  --num_queries 300 --epochs 80 --lr_drop 40 --resume ./output/Pretrain_COCOTextV2/pretrain_coco.pth
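
To generate tracking results on the test set (required by the visualization step below), an evaluation run along these lines can be used; it is adapted from a user report in the comments, so the checkpoint name and paths are placeholders:

python3 main_track.py --eval --output_dir ./output/ICDAR15 --resume ./output/ICDAR15/checkpoint0079.pth --dataset_file text --coco_path ./Data/ICDAR2015_video --batch_size 1 --with_box_refine --num_queries 300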
  5. Visualize TransVTSpotter
python3 track_tools/Evaluation_ICDAR15_video/vis_tracking.py

License

TransVTSpotter is released under the MIT License.

Citing

If you use TransVTSpotter in your research or wish to refer to the baseline results published here, please use the following BibTeX entry:

@article{wu2021opentext,
  title={A Bilingual, OpenWorld Video Text Dataset and End-to-end Video Text Spotter with Transformer},
  author={Wu, Weijia and Zhang, Debing and Cai, Yuanqiang and Wang, Sibo and Li, Jiahong and Li, Zhuang and Tang, Yejun and Zhou, Hong},
  journal={35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks},
  year={2021}
}
Comments
  • About Recognition model

    Hi, the recognition model in your paper is MASTER. I know that for some reason you can't release the recognition code. Could you please tell me whether you used the vanilla MASTER or a modified one? Thanks!

    opened by imMid-Star · 1 comment
  • No res_video_1.json after running "python track_tools/convert_ICDAR15video_to_coco.py"

    Hi,

    Thanks for your great work!

    I am a bit confused: after I run python track_tools/Evaluation_ICDAR15_video/vis_tracking.py, I get "No such file or directory: './output/ICDAR15/test/best_json_tracks/res_video_1.json'".

    I have seen issue #2 and confirm I have run python track_tools/convert_ICDAR15video_to_coco.py, but it seems that "res_video_1.json" has not been generated. I only find "train.json" and "test.json" under "annotations_coco_rotate/". Should I rename one of them to "res_video_1.json" and copy it to "./output/ICDAR15/test/best_json_tracks/res_video_1.json"?

    Please help me! Thanks a lot!

    opened by liuruijin17 · 4 comments
  • RuntimeError: median cannot be called with empty tensor

    Traceback (most recent call last):
      File "main_track.py", line 363, in <module>
        main(args)
      File "main_track.py", line 326, in main
        model, criterion, data_loader_train, optimizer, device, epoch, args.clip_max_norm)
      File "TransVTSpotter/engine_track.py", line 41, in train_one_epoch
        for _ in metric_logger.log_every(range(len(data_loader)), print_freq, header):
      File "TransVTSpotter/util/misc.py", line 260, in log_every
        meters=str(self),
      File "TransVTSpotter/util/misc.py", line 210, in __str__
        "{}: {}".format(name, str(meter))
      File "TransVTSpotter/util/misc.py", line 109, in __str__
        median=self.median,
      File "TransVTSpotter/util/misc.py", line 88, in median
        return d.median().item()
    RuntimeError: median cannot be called with empty tensor

    I think there might be something wrong with the datasets. My dataset path is as below: [screenshot of the dataset directory]

    Is that right? Can you give me an example of the dataset structure or a solution to this error? Thanks!

    opened by imMid-Star · 2 comments
  • Couldn't get the json file

    There was an error: "FileNotFoundError: [Errno 2] No such file or directory: './output/ICDAR15/test/best_json_tracks/res_video_1.mp4.json'".

    I downloaded the IC15 video dataset and ran "python track_tools/convert_ICDAR15video_to_coco.py".

    I couldn't find any files in JSON or JPG format in the download from the ICDAR website https://rrc.cvc.uab.es/?ch=3&com=downloads; the unzipped files only contain '***.mp4', '***.xml', and '***.txt' files.

    How could I get the JSON annotation files such as 'res_video_1.mp4.json'?

    opened by D201830280253 · 2 comments
  • Cannot reproduce results

    Thank you for the nice work! I'm having problems reproducing the results in your paper and was hoping you could help.

    I have done the following steps:

    1. Download the ICDAR15 video training set and the official test video dataset.
    2. Prepare the training and test dataset folders using video2frames and convert_ICDAR15video_to_coco.
    3. Download pretrain_coco.pth from your Baidu drive.
    4. Train on ICDAR15 video using: python -m torch.distributed.launch --nproc_per_node=8 --use_env main_track.py --output_dir ./output/icdar_tiv --dataset_file text --coco_path "${MY_DATA_DIR}/icdar_tiv" --batch_size 2 --with_box_refine --num_queries 300 --epochs 80 --lr_drop 40 --resume ./pths/pretrain_coco.pth
    5. Generate inferences with the trained model on the official test set: python main_track.py --eval --output_dir ./output/icdar_tiv_submit --resume ./output/icdar_tiv/checkpoint0079.pth --dataset_file text --coco_path "${MY_DATA_DIR}/icdar_tiv_test" --batch_size 1 --with_box_refine --num_queries 300
    6. Zip up the results in output/icdar_tiv_submit/text/xml_dir.
    7. Submit the results to the official ICDAR2015 evaluation server.

    The resulting MOTA is 2.08%, very far from the expected ~45%. Note that "Mostly Matched" is 842, which matches the reported results, so the object detection seems to be working but the tracking is failing. Am I missing something in the code? Thanks for any help.

    opened by tonysherbondy · 13 comments