TransVTSpotter: End-to-end Video Text Spotter with Transformer

Overview

TransVTSpotter is an end-to-end, Transformer-based video text spotter, released together with the MOVText dataset.

License: MIT

Introduction

A Multilingual, Open World Video Text Dataset and End-to-end Video Text Spotter with Transformer

See also our dataset: MOVText: A Large-Scale, Multilingual Open World Dataset for Video Text Spotting.

Updates

  • (08/04/2021) Refactoring the code.

  • (10/20/2021) The complete code has been released.

ICDAR2015 (video) Tracking Challenge

Methods          MOTA    MOTP    IDF1    Mostly Matched   Partially Matched   Mostly Lost
TransVTSpotter   45.75   73.58   57.56   658              611                 647

Notes

  • Training time is measured on 8 NVIDIA V100 GPUs with a total batch size of 16 (the commands below use --batch_size 2 per GPU across 8 processes).
  • We use the models pre-trained on COCOTextV2.
  • We do not release the recognition code due to company regulations.

Demo

Installation

The codebase is built on top of Deformable DETR and TransTrack.

Requirements

  • Linux, CUDA>=9.2, GCC>=5.4
  • Python>=3.7
  • PyTorch>=1.5 and a torchvision version that matches the PyTorch installation. You can install them together from pytorch.org to ensure compatibility (see the environment sketch after this list)
  • OpenCV is optional and is only needed for the demo and visualization
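
A minimal environment setup might look like the following sketch; the environment name and the cudatoolkit version are assumptions, so adjust them to your system:

# create and activate an isolated environment (name is arbitrary)
conda create -n transvtspotter python=3.7 -y
conda activate transvtspotter
# install PyTorch and torchvision together so their versions match
# (cudatoolkit 10.2 is an assumption; pick the build matching your driver)
conda install pytorch=1.5.1 torchvision=0.6.1 cudatoolkit=10.2 -c pytorch -y
# optional, only needed for the demo and visualization
pip install opencv-python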

Steps

  1. Install and build libs
git clone git@github.com:weijiawu/TransVTSpotter.git
cd TransVTSpotter
cd models/ops
python setup.py build install
cd ../..
pip install -r requirements.txt
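
If the build succeeds, you can optionally smoke-test the compiled deformable-attention extension; the module name below comes from upstream Deformable DETR and is assumed to be unchanged in this codebase:

# should print the message if the CUDA op was built and installed correctly
python -c "import MultiScaleDeformableAttention; print('deformable ops built OK')"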
  2. Prepare datasets and annotations
# pretrain COCOTextV2
python3 track_tools/convert_COCOText_to_coco.py

# ICDAR15
python3 track_tools/convert_ICDAR15video_to_coco.py

The COCOTextV2 dataset is available from the COCOTextV2 project page, and the ICDAR2015 video dataset from the ICDAR2015 "Text in Videos" challenge page.

Converters for CrowdHuman and MOT-style data are also provided:

python3 track_tools/convert_crowdhuman_to_coco.py
python3 track_tools/convert_mot_to_coco.py
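
For orientation, the training commands below assume a data layout roughly like this; the folder names are inferred from the --coco_path arguments and the converter scripts, so treat them as assumptions:

Data/
  COCOTextV2/          # images plus COCO-format annotations from convert_COCOText_to_coco.py
  ICDAR2015_video/     # extracted frames plus COCO-format annotations from convert_ICDAR15video_to_coco.py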
  3. Pre-train on COCOTextV2
python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main_track.py  --output_dir ./output/Pretrain_COCOTextV2 --dataset_file pretrain --coco_path ./Data/COCOTextV2 --batch_size 2  --with_box_refine --num_queries 500 --epochs 300 --lr_drop 100 --resume ./output/Pretrain_COCOTextV2/checkpoint.pth

python3 track_tools/Pretrain_model_to_mot.py

The pre-trained model is available on Baidu Netdisk (password: 59w8) and on Google Drive.

A trained model reaching 44% MOTA is also available on Baidu Netdisk (password: xnlw) and on Google Drive.

  4. Train TransVTSpotter
python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main_track.py  --output_dir ./output/ICDAR15 --dataset_file text --coco_path ./Data/ICDAR2015_video --batch_size 2  --with_box_refine  --num_queries 300 --epochs 80 --lr_drop 40 --resume ./output/Pretrain_COCOTextV2/pretrain_coco.pth
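
To generate tracking results on the test set (required by the visualization step below), an evaluation run along these lines can be used; it is adapted from a user report in the comments, so the checkpoint name and paths are placeholders:

python3 main_track.py --eval --output_dir ./output/ICDAR15 --resume ./output/ICDAR15/checkpoint0079.pth --dataset_file text --coco_path ./Data/ICDAR2015_video --batch_size 1 --with_box_refine --num_queries 300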
  5. Visualize TransVTSpotter
python3 track_tools/Evaluation_ICDAR15_video/vis_tracking.py

License

TransVTSpotter is released under the MIT License.

Citing

If you use TransVTSpotter in your research or wish to refer to the baseline results published here, please use the following BibTeX entry:

@article{wu2021opentext,
  title={A Bilingual, OpenWorld Video Text Dataset and End-to-end Video Text Spotter with Transformer},
  author={Wu, Weijia and Zhang, Debing and Cai, Yuanqiang and Wang, Sibo and Li, Jiahong and Li, Zhuang and Tang, Yejun and Zhou, Hong},
  journal={35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks},
  year={2021}
}
Comments
  • About Recognition model

    Hi, the recognition model in your paper is MASTER. I know that for some reason you can't release the recognition code. Could you please tell me whether you used the vanilla MASTER or a modified one? Thanks!

    opened by imMid-Star · 1 comment
  • No res_video_1.json after running "python track_tools/convert_ICDAR15video_to_coco.py"

    Hi,

    Thanks for your great work!

    I am a bit confused: after I run python track_tools/Evaluation_ICDAR15_video/vis_tracking.py, I get "No such file or directory: './output/ICDAR15/test/best_json_tracks/res_video_1.json'".

    I have seen issue #2 and confirm I have run python track_tools/convert_ICDAR15video_to_coco.py, but it seems that "res_video_1.json" has not been generated. I only find "train.json" and "test.json" under "annotations_coco_rotate/". Should I rename one of them to "res_video_1.json" and copy it to "./output/ICDAR15/test/best_json_tracks/res_video_1.json"?

    Please help me! Thanks a lot!

    opened by liuruijin17 · 4 comments
  • RuntimeError: median cannot be called with empty tensor

    Traceback (most recent call last):
      File "main_track.py", line 363, in <module>
        main(args)
      File "main_track.py", line 326, in main
        model, criterion, data_loader_train, optimizer, device, epoch, args.clip_max_norm)
      File "TransVTSpotter/engine_track.py", line 41, in train_one_epoch
        for _ in metric_logger.log_every(range(len(data_loader)), print_freq, header):
      File "TransVTSpotter/util/misc.py", line 260, in log_every
        meters=str(self),
      File "TransVTSpotter/util/misc.py", line 210, in __str__
        "{}: {}".format(name, str(meter))
      File "TransVTSpotter/util/misc.py", line 109, in __str__
        median=self.median,
      File "TransVTSpotter/util/misc.py", line 88, in median
        return d.median().item()
    RuntimeError: median cannot be called with empty tensor

    I think there might be something wrong with the datasets. My dataset path is as below: [screenshot of the dataset directory]

    Is that right? Can you give me an example of the dataset structure or a solution to this error? Thanks!

    opened by imMid-Star · 2 comments
  • Couldn't get the json file

    There was an error: "FileNotFoundError: [Errno 2] No such file or directory: './output/ICDAR15/test/best_json_tracks/res_video_1.mp4.json'".

    I downloaded the IC15 video dataset and ran "python track_tools/convert_ICDAR15video_to_coco.py".

    I couldn't find any files in JSON or JPG format in the download from the ICDAR website https://rrc.cvc.uab.es/?ch=3&com=downloads; the unzipped files only contain '***.mp4', '***.xml', and '***.txt' files.

    How could I get the JSON annotation files such as 'res_video_1.mp4.json'?

    opened by D201830280253 · 2 comments
  • Cannot reproduce results

    Thank you for the nice work! I'm having problems reproducing the results in your paper and was hoping you could help.

    I have done the following steps:

    1. Download the ICDAR15 video training set and the official test video dataset.
    2. Prepare the training and test dataset folders using video2frames and convert_ICDAR15video_to_coco.
    3. Download pretrain_coco.pth from your Baidu drive.
    4. Train on ICDAR15 video using: python -m torch.distributed.launch --nproc_per_node=8 --use_env main_track.py --output_dir ./output/icdar_tiv --dataset_file text --coco_path "${MY_DATA_DIR}/icdar_tiv" --batch_size 2 --with_box_refine --num_queries 300 --epochs 80 --lr_drop 40 --resume ./pths/pretrain_coco.pth
    5. Generate inferences with the trained model on the official test set: python main_track.py --eval --output_dir ./output/icdar_tiv_submit --resume ./output/icdar_tiv/checkpoint0079.pth --dataset_file text --coco_path "${MY_DATA_DIR}/icdar_tiv_test" --batch_size 1 --with_box_refine --num_queries 300
    6. Zip up the results in output/icdar_tiv_submit/text/xml_dir.
    7. Submit the results to the official ICDAR2015 evaluation server.

    The resulting MOTA is 2.08%, very far from the expected ~45%. Note that "Mostly Matched" is 842, which matches the reported results, so the object detection seems to be working but the tracking is failing. Am I missing something in the code? Thanks for any help.

    opened by tonysherbondy · 13 comments