A new video text spotting framework with Transformer

Overview

TransVTSpotter: End-to-end Video Text Spotter with Transformer

License: MIT

Introduction

A Multilingual, Open World Video Text Dataset and End-to-end Video Text Spotter with Transformer

See also our MOVText: A Large-Scale, Multilingual Open World Dataset for Video Text Spotting.

Updates

  • (08/04/2021) Refactored the code.

  • (10/20/2021) The complete code has been released.

ICDAR2015 (video) Tracking Challenge

Methods          MOTA    MOTP    IDF1    Mostly Matched    Partially Matched    Mostly Lost
TransVTSpotter   45.75   73.58   57.56   658               611                  647

Models are also available on Baidu Drive (password: m4iv).

Notes

  • Training times are measured on 8 NVIDIA V100 GPUs with a batch size of 16.
  • We use models pre-trained on COCOTextV2.
  • We do not release the recognition code due to company regulations.

Demo

Installation

The codebases are built on top of Deformable DETR and TransTrack.

Requirements

  • Linux, CUDA>=9.2, GCC>=5.4
  • Python>=3.7
  • PyTorch>=1.5 and a torchvision version that matches the PyTorch installation. You can install them together from pytorch.org to make sure of this.
  • OpenCV is optional, needed only for the demo and visualization. (A quick environment check is sketched below.)
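
A minimal sanity check that the environment meets these requirements; these are plain PyTorch calls, nothing specific to this repository:

import torch
import torchvision

print("PyTorch:", torch.__version__)                 # expect >= 1.5
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())  # CUDA >= 9.2 is needed to build the ops
print("CUDA version:", torch.version.cuda)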

Steps

  1. Install and build libs
git clone [email protected]:weijiawu/TransVTSpotter.git
cd TransVTSpotter
cd models/ops
python setup.py build install
cd ../..
pip install -r requirements.txt
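
If the build succeeded, the compiled deformable-attention CUDA extension should be importable. The module name below comes from Deformable DETR, on which this codebase is built, so treat it as an assumption:

# If this import succeeds, the CUDA ops built and installed correctly.
import MultiScaleDeformableAttention  # module name taken from Deformable DETR; assumption
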
  2. Prepare datasets and annotations
# pretrain COCOTextV2
python3 track_tools/convert_COCOText_to_coco.py

# ICDAR15
python3 track_tools/convert_ICDAR15video_to_coco.py

The COCOTextV2 dataset is available from COCOTextV2, and the ICDAR2015 video dataset from icdar2015.

Converters for CrowdHuman and MOT-style data are also included:

python3 track_tools/convert_crowdhuman_to_coco.py
python3 track_tools/convert_mot_to_coco.py
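
The convert_*_to_coco.py scripts write COCO-style JSON annotations (for ICDAR15, train.json and test.json under annotations_coco_rotate/, as mentioned in the issues below). A minimal sketch of that layout, using the standard COCO detection fields; it is illustrative only and not confirmed against the converters' exact output:

# Sketch of a COCO-style annotation dict (standard COCO detection fields;
# file names and values are hypothetical).
coco = {
    "images": [
        {"id": 1, "file_name": "Video_1/frame_0001.jpg", "height": 720, "width": 1280},
    ],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [100.0, 200.0, 50.0, 20.0],  # [x, y, width, height]
         "area": 1000.0, "iscrowd": 0},
    ],
    "categories": [{"id": 1, "name": "text"}],
}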
  3. Pre-train on COCOTextV2
python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main_track.py  --output_dir ./output/Pretrain_COCOTextV2 --dataset_file pretrain --coco_path ./Data/COCOTextV2 --batch_size 2  --with_box_refine --num_queries 500 --epochs 300 --lr_drop 100 --resume ./output/Pretrain_COCOTextV2/checkpoint.pth

python3 track_tools/Pretrain_model_to_mot.py

The pre-trained model is available as COCOTextV2_pretrain.pth (password: 59w8), and a checkpoint reaching 44% MOTA can be found here (password: xnlw).
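
Pretrain_model_to_mot.py adapts the pre-training checkpoint for the tracking run. A minimal sketch of that kind of conversion, assuming the standard Deformable DETR checkpoint layout; the real script's key handling may differ:

import torch

# Load the full pre-training checkpoint (weights plus optimizer/scheduler state).
ckpt = torch.load("./output/Pretrain_COCOTextV2/checkpoint.pth", map_location="cpu")

# Keep only the model weights; the ICDAR15 fine-tuning run resumes from these.
torch.save({"model": ckpt["model"]}, "./output/Pretrain_COCOTextV2/pretrain_coco.pth")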

  4. Train TransVTSpotter
python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main_track.py --output_dir ./output/ICDAR15 --dataset_file text --coco_path ./Data/ICDAR2015_video --batch_size 2 --with_box_refine --num_queries 300 --epochs 80 --lr_drop 40 --resume ./output/Pretrain_COCOTextV2/pretrain_coco.pth

With 8 processes and --batch_size 2 per GPU, the effective batch size is 16, matching the training setup described in the notes above.
  5. Visualize TransVTSpotter
python3 track_tools/Evaluation_ICDAR15_video/vis_tracking.py
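
Visualization relies on OpenCV (listed as optional above). A minimal sketch of drawing tracked boxes from a per-video result JSON; the paths and the record layout here are assumptions, not the actual format used by vis_tracking.py:

import json
import os
import cv2

# Hypothetical paths and record layout; the real vis_tracking.py may differ.
result_path = "./output/ICDAR15/test/best_json_tracks/res_video_1.json"
frame_dir = "./Data/ICDAR2015_video/frames/Video_1"
out_dir = "./output/vis/Video_1"
os.makedirs(out_dir, exist_ok=True)

with open(result_path) as f:
    tracks = json.load(f)  # assumed: {frame_name: [[track_id, x1, y1, x2, y2], ...]}

for frame_name, boxes in tracks.items():
    img = cv2.imread(os.path.join(frame_dir, frame_name))
    for track_id, x1, y1, x2, y2 in boxes:
        cv2.rectangle(img, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
        cv2.putText(img, str(track_id), (int(x1), int(y1) - 4),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    cv2.imwrite(os.path.join(out_dir, frame_name), img)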

License

TransVTSpotter is released under the MIT License.

Citing

If you use TransVTSpotter in your research or wish to refer to the baseline results published here, please use the following BibTeX entries:

Comments
  • About Recognition model

    Hi, the recognition model in your paper is MASTER. I know that for some reason you can't release the recognition code. Could you please tell me whether you use the vanilla MASTER or a modified one? Thanks!

    opened by imMid-Star 1
  • No res_video_1.json after running "python track_tools/convert_ICDAR15video_to_coco.py"

    Hi,

    Thanks for your great work!

    I am a bit confused: after I run python track_tools/Evaluation_ICDAR15_video/vis_tracking.py, I get "No such file or directory: './output/ICDAR15/test/best_json_tracks/res_video_1.json'".

    I have seen issue #2 and confirm I have run python track_tools/convert_ICDAR15video_to_coco.py, but it seems that "res_video_1.json" has not been generated. I only find "train.json" and "test.json" under "annotations_coco_rotate/"; should I rename one of them to "res_video_1.json" and copy it to "./output/ICDAR15/test/best_json_tracks/res_video_1.json"?

    Please help me! Thanks a lot!

    opened by liuruijin17 4
  • RuntimeError: median cannot be called with empty tensor

    Traceback (most recent call last):
      File "main_track.py", line 363, in <module>
        main(args)
      File "main_track.py", line 326, in main
        model, criterion, data_loader_train, optimizer, device, epoch, args.clip_max_norm)
      File "TransVTSpotter/engine_track.py", line 41, in train_one_epoch
        for _ in metric_logger.log_every(range(len(data_loader)), print_freq, header):
      File "TransVTSpotter/util/misc.py", line 260, in log_every
        meters=str(self),
      File "TransVTSpotter/util/misc.py", line 210, in __str__
        "{}: {}".format(name, str(meter))
      File "TransVTSpotter/util/misc.py", line 109, in __str__
        median=self.median,
      File "TransVTSpotter/util/misc.py", line 88, in median
        return d.median().item()
    RuntimeError: median cannot be called with empty tensor

    I think there might be something wrong with the datasets. My dataset directory layout is shown in the attached screenshot.

    Is that right? Can you give me some examples of the expected dataset structure, or a solution to this error? Thanks!

    opened by imMid-Star 2
  • Couldn't get the json file

    There was an error: "FileNotFoundError: [Errno 2] No such file or directory: './output/ICDAR15/test/best_json_tracks/res_video_1.mp4.json'"

    I downloaded the IC15 video dataset and ran "python track_tools/convert_ICDAR15video_to_coco.py".

    I couldn't find any files in json or jpg format in the downloads from the ICDAR website https://rrc.cvc.uab.es/?ch=3&com=downloads; the unzipped files only contain '***.mp4', '***.xml', and '***.txt' files.

    How can I get the json annotation files such as 'res_video_1.mp4.json'?

    opened by D201830280253 2
  • Cannot reproduce results

    Thank you for the nice work! I'm having problems reproducing the results in your paper and was hoping you could help.

    I have done the following steps.

    1. Download the ICDAR15 video training set and the official test video dataset.
    2. Prepare the training and test dataset folders using video2frames and convert_ICDAR15video_to_coco.
    3. Download pretrain_coco.pth from your Baidu drive.
    4. Train on ICDAR15 video using: python -m torch.distributed.launch --nproc_per_node=8 --use_env main_track.py --output_dir ./output/icdar_tiv --dataset_file text --coco_path "${MY_DATA_DIR}/icdar_tiv" --batch_size 2 --with_box_refine --num_queries 300 --epochs 80 --lr_drop 40 --resume ./pths/pretrain_coco.pth
    5. Generate inferences with the trained model on the official test set: python main_track.py --eval --output_dir ./output/icdar_tiv_submit --resume ./output/icdar_tiv/checkpoint0079.pth --dataset_file text --coco_path "${MY_DATA_DIR}/icdar_tiv_test" --batch_size 1 --with_box_refine --num_queries 300
    6. Zip up the results in output/icdar_tiv_submit/text/xml_dir.
    7. Submit the results to the official ICDAR2015 site.

    The resulting MOTA is 2.08%, very far from the expected ~45%. Note that "Mostly Matched" is 842, in line with the reported results, so it seems that detection is working but tracking is failing. Am I missing something in the code? Thanks for any help.

    opened by tonysherbondy 13