PyTorch code of my ICDAR 2021 paper Vision Transformer for Fast and Efficient Scene Text Recognition (ViTSTR)

Overview

Vision Transformer for Fast and Efficient Scene Text Recognition (ICDAR 2021)

ViTSTR is a simple single-stage model that uses a pre-trained Vision Transformer (ViT) to perform Scene Text Recognition (ViTSTR). Its accuracy is comparable with state-of-the-art STR models although it uses significantly fewer parameters and FLOPS. ViTSTR is also fast due to the parallel computation inherent to the ViT architecture.
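
To make the single-stage idea concrete, below is a minimal, self-contained PyTorch sketch of a ViTSTR-style forward pass: the image is split into patches, encoded by a transformer, and the first output tokens are mapped in parallel to per-position character logits. It is a schematic illustration only; the module sizes, token layout, and names (e.g. ToyViTSTR, max_text_len) are assumptions and do not mirror the repository's actual implementation.

import torch
import torch.nn as nn

class ToyViTSTR(nn.Module):
    # Schematic ViTSTR-style recognizer: ViT encoder + parallel per-token character head.
    def __init__(self, img_size=224, patch=16, dim=192, depth=4, heads=3,
                 num_chars=96, max_text_len=25):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_chars)   # one shared head, applied per token
        self.max_text_len = max_text_len

    def forward(self, x):                                         # x: (B, 1, 224, 224)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        tokens = self.encoder(tokens + self.pos_embed)
        # All character positions are predicted in parallel (no autoregressive loop).
        return self.head(tokens[:, : self.max_text_len])          # (B, T, num_chars)

logits = ToyViTSTR()(torch.randn(2, 1, 224, 224))
print(logits.shape)  # torch.Size([2, 25, 96])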

Paper

Arxiv

ViTSTR Model

ViTSTR is built using a fork of the CLOVA AI Deep Text Recognition Benchmark, whose original documentation is at the bottom. Below we document how to train and evaluate ViTSTR-Tiny and ViTSTR-Small.

Install requirements

pip3 install -r requirements.txt

Dataset

Download the lmdb dataset. See the original CLOVA AI documentation below.

Quick validation using a pre-trained model

ViTSTR-Small

CUDA_VISIBLE_DEVICES=0 python3 test.py --eval_data data_lmdb_release/evaluation \
--benchmark_all_eval --Transformation None --FeatureExtraction None \
--SequenceModeling None --Prediction None --Transformer \
--sensitive --data_filtering_off --imgH 224 --imgW 224 \
--TransformerModel=vitstr_small_patch16_224 \
--saved_model https://github.com/roatienza/deep-text-recognition-benchmark/releases/download/v0.1.0/vitstr_small_patch16_224_aug.pth

Available model weights:

| Tiny | Small | Base |
| --- | --- | --- |
| vitstr_tiny_patch16_224 | vitstr_small_patch16_224 | vitstr_base_patch16_224 |
| ViTSTR-Tiny | ViTSTR-Small | ViTSTR-Base |
| ViTSTR-Tiny+Aug | ViTSTR-Small+Aug | ViTSTR-Base+Aug |
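
To inspect a downloaded checkpoint outside of test.py, a minimal sketch is shown below. It assumes the released .pth file is a plain state_dict, possibly saved from a torch.nn.DataParallel-wrapped model (hence the optional module. prefix); this is an assumption about the release format, not a documented guarantee.

import torch

# Hypothetical local copy of one of the released weight files.
ckpt = torch.load("vitstr_small_patch16_224_aug.pth", map_location="cpu")

# Unwrap a {"state_dict": ...} container if present; otherwise use the object as-is.
state_dict = ckpt["state_dict"] if isinstance(ckpt, dict) and "state_dict" in ckpt else ckpt

# Strip a possible 'module.' prefix left behind by DataParallel training.
state_dict = {k[len("module."):] if k.startswith("module.") else k: v
              for k, v in state_dict.items()}

print(f"{len(state_dict)} tensors, first key: {next(iter(state_dict))}")
# model.load_state_dict(state_dict)  # once the model is built with matching options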

Benchmarks (top-1 accuracy)

| Model | IIIT (3000) | SVT (647) | IC03 (860) | IC03 (867) | IC13 (857) | IC13 (1015) | IC15 (1811) | IC15 (2077) | SVTP (645) | CT (288) | Acc (%) | Std (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TRBA (Baseline) | 87.7 | 87.4 | 94.5 | 94.2 | 93.4 | 92.1 | 77.3 | 71.6 | 78.1 | 75.5 | 84.3 | 0.1 |
| ViTSTR-Tiny | 83.7 | 83.2 | 92.8 | 92.5 | 90.8 | 89.3 | 72.0 | 66.4 | 74.5 | 65.0 | 80.3 | 0.2 |
| ViTSTR-Tiny+Aug | 85.1 | 85.0 | 93.4 | 93.2 | 90.9 | 89.7 | 74.7 | 68.9 | 78.3 | 74.2 | 82.1 | 0.1 |
| ViTSTR-Small | 85.6 | 85.3 | 93.9 | 93.6 | 91.7 | 90.6 | 75.3 | 69.5 | 78.1 | 71.3 | 82.6 | 0.3 |
| ViTSTR-Small+Aug | 86.6 | 87.3 | 94.2 | 94.2 | 92.1 | 91.2 | 77.9 | 71.7 | 81.4 | 77.9 | 84.2 | 0.1 |
| ViTSTR-Base | 86.9 | 87.2 | 93.8 | 93.4 | 92.1 | 91.3 | 76.8 | 71.1 | 80.0 | 74.7 | 83.7 | 0.1 |
| ViTSTR-Base+Aug | 88.4 | 87.7 | 94.7 | 94.3 | 93.2 | 92.4 | 78.5 | 72.6 | 81.8 | 81.3 | 85.2 | 0.1 |

Comparison with other STR models

Accuracy vs Number of Parameters (figure)

Accuracy vs Speed on a 2080Ti GPU (figure)

Accuracy vs FLOPS (figure)

Train

ViTSTR-Tiny without data augmentation

RANDOM=$$

CUDA_VISIBLE_DEVICES=0 python3 train.py --train_data data_lmdb_release/training \
--valid_data data_lmdb_release/evaluation --select_data MJ-ST \
--batch_ratio 0.5-0.5 --Transformation None --FeatureExtraction None \
--SequenceModeling None --Prediction None --Transformer \
--TransformerModel=vitstr_tiny_patch16_224 --imgH 224 --imgW 224 \
--manualSeed=$RANDOM --sensitive

Multi-GPU training

ViTSTR-Small on a 4-GPU machine

It is recommended to train larger networks like ViTSTR-Small and ViTSTR-Base on a multi-GPU machine. To keep the total batch size fixed at 192, use the --batch_size option and divide 192 by the number of GPUs. For example, to train ViTSTR-Small on a 4-GPU machine, use --batch_size=48.

python3 train.py --train_data data_lmdb_release/training \
--valid_data data_lmdb_release/evaluation --select_data MJ-ST \
--batch_ratio 0.5-0.5 --Transformation None --FeatureExtraction None \
--SequenceModeling None --Prediction None --Transformer \
--TransformerModel=vitstr_small_patch16_224 --imgH 224 --imgW 224 \
--manualSeed=$RANDOM --sensitive --batch_size=48

Data augmentation

ViTSTR-Tiny using rand augment

It is recommended to use more workers (e.g., 32 instead of the default 4) since the data augmentation process is CPU-intensive. A simple rule of thumb is to set the number of workers to between 25% and 50% of the total number of CPU cores; for example, on a system with 64 CPU cores, 32 workers use 50% of all cores. For multi-GPU systems, divide the number of workers by the number of GPUs; for example, for 32 workers on a 4-GPU system, use --workers=8. For convenience, simply use --workers=-1 and 50% of all cores will be used. Lastly, instead of a constant learning rate, a cosine scheduler improves the performance of the model during training.
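
To make the two rules of thumb concrete (the fixed total batch size of 192 from the multi-GPU section and the 25%-50% worker rule above), here is a small illustrative computation; the GPU count is an assumed example value, not a requirement.

import os

GLOBAL_BATCH = 192                 # fixed total batch size used throughout this README
num_gpus = 4                       # example: a 4-GPU machine
cpu_cores = os.cpu_count() or 1

batch_size = GLOBAL_BATCH // num_gpus            # 192 / 4 = 48  -> --batch_size=48
workers = max(1, (cpu_cores // 2) // num_gpus)   # ~50% of cores, split across GPUs

print(f"--batch_size={batch_size} --workers={workers}")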

Below is a sample configuration for a 4-GPU system using a total batch size of 192.

python3 train.py --train_data data_lmdb_release/training \
--valid_data data_lmdb_release/evaluation --select_data MJ-ST \
--batch_ratio 0.5-0.5 --Transformation None --FeatureExtraction None \
--SequenceModeling None --Prediction None --Transformer \
--TransformerModel=vitstr_tiny_patch16_224 --imgH 224 --imgW 224 \
--manualSeed=$RANDOM --sensitive \
--batch_size=48 --isrand_aug --workers=-1 --scheduler

Test

ViTSTR-Tiny: find the path to the best_accuracy.pth checkpoint file (usually in the saved_model folder).

CUDA_VISIBLE_DEVICES=0 python3 test.py --eval_data data_lmdb_release/evaluation \
--benchmark_all_eval --Transformation None --FeatureExtraction None \
--SequenceModeling None --Prediction None --Transformer \
--TransformerModel=vitstr_tiny_patch16_224 \
--sensitive --data_filtering_off --imgH 224 --imgW 224 \
--saved_model <path_to/best_accuracy.pth>

Citation

If you find this work useful, please cite:

@inproceedings{atienza2021vitstr,
  title={Vision Transformer for Fast and Efficient Scene Text Recognition},
  author={Atienza, Rowel},
  booktitle = {International Conference on Document Analysis and Recognition (ICDAR)},
  year={2021},
  pubstate={published},
  tppubtype={inproceedings}
}

What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis

| paper | training and evaluation data | failure cases and cleansed label | pretrained model | Baidu ver(passwd:rryk) |

Official PyTorch implementation of our four-stage STR framework, into which most existing STR models fit.
Using this framework allows for a module-wise analysis of contributions to performance in terms of accuracy, speed, and memory demand, under one consistent set of training and evaluation datasets.
Such analyses remove the hindrances in current comparisons to understanding the performance gains of existing modules.

Honors

Based on this framework, we achieved 1st place in ICDAR2013 focused scene text and ICDAR2019 ArT, and 3rd place in ICDAR2017 COCO-Text and ICDAR2019 ReCTS (task 1).
The differences between our paper and the ICDAR challenge entries are summarized here.

Updates

Aug 3, 2020: added a guideline to use Baidu warpctc, which reproduces the CTC results of our paper.
Dec 27, 2019: added FLOPS to our paper, plus minor updates such as log_dataset.txt and ICDAR2019-NormalizedED.
Oct 22, 2019: added confidence score, and arranged the output format of training logs.
Jul 31, 2019: the paper was accepted at the International Conference on Computer Vision (ICCV), Seoul 2019, as an oral talk.
Jul 25, 2019: added code for floating-point 16 calculation; check @YacobBY's pull request.
Jul 16, 2019: added the ST_spe.zip dataset, word images containing special characters in the SynthText (ST) dataset; see this issue.
Jun 24, 2019: added gt.txt of failure cases, which contains the path and label of each image; see image_release_190624.zip.
May 17, 2019: uploaded resources on Baidu Netdisk as well, and added Run demo (check @sharavsambuu's Colab demo also).
May 9, 2019: PyTorch version updated from 1.0.1 to 1.1.0; use torch.nn.CTCLoss instead of torch-baidu-ctc, plus various minor updates.

Getting Started

Dependency

  • This work was tested with PyTorch 1.3.1, CUDA 10.1, Python 3.6 and Ubuntu 16.04.
    You may need pip3 install torch==1.3.1.
    In the paper, experiments were performed with PyTorch 0.4.1, CUDA 9.0.
  • requirements: lmdb, pillow, torchvision, nltk, natsort
pip3 install lmdb pillow torchvision nltk natsort

Download the lmdb dataset for training and evaluation from here

data_lmdb_release.zip contains the following:
training datasets: MJSynth (MJ)[1] and SynthText (ST)[2]
validation datasets: the union of the training sets of IC13[3], IC15[4], IIIT[5], and SVT[6]
evaluation datasets: benchmark evaluation datasets consisting of IIIT[5], SVT[6], IC03[7], IC13[3], IC15[4], SVTP[8], and CUTE[9]

Run demo with pretrained model

  1. Download pretrained model from here
  2. Add image files to test into demo_image/
  3. Run demo.py (add the --sensitive option if you use a case-sensitive model)
CUDA_VISIBLE_DEVICES=0 python3 demo.py \
--Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn \
--image_folder demo_image/ \
--saved_model TPS-ResNet-BiLSTM-Attn.pth

prediction results

| demo images | TRBA (TPS-ResNet-BiLSTM-Attn) | TRBA (case-sensitive version) |
| --- | --- | --- |
| (image) | available | Available |
| (image) | shakeshack | SHARESHACK |
| (image) | london | Londen |
| (image) | greenstead | Greenstead |
| (image) | toast | TOAST |
| (image) | merry | MERRY |
| (image) | underground | underground |
| (image) | ronaldo | RONALDO |
| (image) | bally | BALLY |
| (image) | university | UNIVERSITY |

Training and evaluation

  1. Train CRNN[10] model
CUDA_VISIBLE_DEVICES=0 python3 train.py \
--train_data data_lmdb_release/training --valid_data data_lmdb_release/validation \
--select_data MJ-ST --batch_ratio 0.5-0.5 \
--Transformation None --FeatureExtraction VGG --SequenceModeling BiLSTM --Prediction CTC
  2. Test the CRNN[10] model. If you want to evaluate IC15-2077, check the data filtering part.
CUDA_VISIBLE_DEVICES=0 python3 test.py \
--eval_data data_lmdb_release/evaluation --benchmark_all_eval \
--Transformation None --FeatureExtraction VGG --SequenceModeling BiLSTM --Prediction CTC \
--saved_model saved_models/None-VGG-BiLSTM-CTC-Seed1111/best_accuracy.pth
  3. Try to train and test our best accuracy model TRBA (TPS-ResNet-BiLSTM-Attn) as well (download the pretrained model).
CUDA_VISIBLE_DEVICES=0 python3 train.py \
--train_data data_lmdb_release/training --valid_data data_lmdb_release/validation \
--select_data MJ-ST --batch_ratio 0.5-0.5 \
--Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn
CUDA_VISIBLE_DEVICES=0 python3 test.py \
--eval_data data_lmdb_release/evaluation --benchmark_all_eval \
--Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn \
--saved_model saved_models/TPS-ResNet-BiLSTM-Attn-Seed1111/best_accuracy.pth

Arguments

  • --train_data: folder path to training lmdb dataset.
  • --valid_data: folder path to validation lmdb dataset.
  • --eval_data: folder path to evaluation (with test.py) lmdb dataset.
  • --select_data: select training data. The default is MJ-ST, which means MJ and ST are used as training data.
  • --batch_ratio: assign the ratio for each selected dataset in the batch. The default is 0.5-0.5, which means 50% of the batch is filled with MJ and the other 50% with ST.
  • --data_filtering_off: skip data filtering when creating LmdbDataset.
  • --Transformation: select Transformation module [None | TPS].
  • --FeatureExtraction: select FeatureExtraction module [VGG | RCNN | ResNet].
  • --SequenceModeling: select SequenceModeling module [None | BiLSTM].
  • --Prediction: select Prediction module [CTC | Attn].
  • --saved_model: assign saved model to evaluation.
  • --benchmark_all_eval: evaluate with 10 evaluation dataset versions, same with Table 1 in our paper.

Download failure cases and cleansed label from here

image_release.zip contains failure case images and benchmark evaluation images with cleansed label.

When you need to train on your own dataset or non-Latin language datasets

  1. Create your own lmdb dataset.
pip3 install fire
python3 create_lmdb_dataset.py --inputPath data/ --gtFile data/gt.txt --outputPath result/

The structure of the data folder is as follows.

data
├── gt.txt
└── test
    ├── word_1.png
    ├── word_2.png
    ├── word_3.png
    └── ...

At this time, gt.txt should contain one {imagepath}\t{label}\n entry per line (a small generation sketch follows this list).
For example:

test/word_1.png Tiredness
test/word_2.png kills
test/word_3.png A
...
  2. Modify --select_data, --batch_ratio, and opt.character; see this issue.
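
As referenced above, a minimal sketch for generating gt.txt from a folder of word crops might look like the following; the label list and file names are hypothetical placeholders, and the tab-separated {imagepath}\t{label} format is what create_lmdb_dataset.py reads.

from pathlib import Path

# Hypothetical labels for the crops data/test/word_1.png, word_2.png, ...
labels = ["Tiredness", "kills", "A"]

data_dir = Path("data")
data_dir.mkdir(exist_ok=True)

lines = []
for i, label in enumerate(labels, start=1):
    image_path = f"test/word_{i}.png"            # path relative to data/
    lines.append(f"{image_path}\t{label}\n")     # {imagepath}\t{label}\n

(data_dir / "gt.txt").write_text("".join(lines), encoding="utf-8")
# Then: python3 create_lmdb_dataset.py --inputPath data/ --gtFile data/gt.txt --outputPath result/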

Acknowledgements

This implementation is based on the repositories crnn.pytorch and ocr_attention.

Reference

[1] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. In Workshop on Deep Learning, NIPS, 2014.
[2] A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic data for text localisation in natural images. In CVPR, 2016.
[3] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. De Las Heras. ICDAR 2013 robust reading competition. In ICDAR, pages 1484–1493, 2013.
[4] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, et al. ICDAR 2015 competition on robust reading. In ICDAR, pages 1156–1160, 2015.
[5] A. Mishra, K. Alahari, and C. Jawahar. Scene text recognition using higher order language priors. In BMVC, 2012.
[6] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In ICCV, pages 1457–1464, 2011.
[7] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young. ICDAR 2003 robust reading competitions. In ICDAR, pages 682–687, 2003.
[8] T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan. Recognizing text with perspective distortion in natural scenes. In ICCV, pages 569–576, 2013.
[9] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan. A robust arbitrary text detection system for natural scene images. In ESWA, volume 41, pages 8027–8048, 2014.
[10] B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. In TPAMI, volume 39, pages 2298–2304, 2017.

Citation

Please consider citing this work in your publications if it helps your research.

@inproceedings{baek2019STRcomparisons,
  title={What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis},
  author={Baek, Jeonghun and Kim, Geewook and Lee, Junyeop and Park, Sungrae and Han, Dongyoon and Yun, Sangdoo and Oh, Seong Joon and Lee, Hwalsuk},
  booktitle = {International Conference on Computer Vision (ICCV)},
  year={2019},
  pubstate={published},
  tppubtype={inproceedings}
}

Contact

Feel free to contact us if there are any questions:
for code/paper, Jeonghun Baek [email protected]; for collaboration, [email protected] (our team leader).

License

Copyright (c) 2019-present NAVER Corp.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Comments
  • pretrained-model loading with errors

    Hello, I used a single-GPU env with Python == 3.8, torch == 1.8.1 and torchvision == 0.9.1. I followed the GitHub hint with the following command:

    python3 infer.py --gpu --image demo_image/demo_2.jpg --model vitstr_small_patch16_224.pth
    

    It returned an error with

    AttributeError: 'collections.OrderedDict' object has no attribute 'to'
    

    it seems that the function model = torch.load(checkpoint) in infer.py returns an ordered dict instead of the model object. One way to solve the problem is:

    ordered_dict = torch.load(checkpoint)
    model.load_state_dict(ordered_dict)
    

    But I do not know the hyperparameters used when vitstr_small_patch16_224.pth was trained, so it is very hard for me to initialize the model object with the correct hyperparameters. Would it be possible to make the hyperparameters of the pretrained models public?

    I also tried the pt models

    python3 infer.py --gpu --image demo_image/demo_2.jpg --model vitstr_small_patch16_jit.pt
    

    it gives the following error:

      File "E:\ProgramFiles\anaconda3\envs\vitstr\lib\site-packages\spyder_kernels\py3compat.py", line 356, in compat_exec
        exec(code, globals, locals)
    
      File "e:\projects\deep-text-recognition-benchmark-master\infer.py", line 147, in <module>
        data = infer(args)
    
      File "e:\projects\deep-text-recognition-benchmark-master\infer.py", line 121, in infer
        model = torch.load(checkpoint)
    
      File "E:\ProgramFiles\anaconda3\envs\vitstr\lib\site-packages\torch\serialization.py", line 591, in load
        return torch.jit.load(opened_file)
    
      File "E:\ProgramFiles\anaconda3\envs\vitstr\lib\site-packages\torch\jit\_serialization.py", line 163, in load
        cpp_module = torch._C.import_ir_module_from_buffer(
    
    RuntimeError: 
    Unknown type name 'NoneType':
    Serialized   File "code/__torch__/modules/vitstr.py", line 12
      embed_dim : int
      num_tokens : int
      dist_token : NoneType
                   ~~~~~~~~ <--- HERE
      head_dist : NoneType
      patch_embed : __torch__.timm.models.layers.patch_embed.PatchEmbed
    

    Is there any way to load the model correctly, please? Many thanks.

    opened by Ao-Lee 8
  • model state loading issue

    I tried to rerun the model with the vitstr tiny version weights but I got Missing and Unexpected key(s) in state_dict issues while loading the model state.

    opened by rouarouatbi 3
  • About the difference between the number of training iters in the paper and this Repo

    Thanks for your great work and source code! The number of training epochs in Table 2 of the paper is 300, but the source code uses 300000 iterations. The data augmentations in the code are very thorough, so I think a longer training process is necessary. Which one is your experimental strategy? Have you done similar experiments showing after how many iterations training performance becomes basically stable under your strong data augmentation setting? I look forward to your reply!

    opened by superPangpang 3
  • A question about the [GO] token

    criterion = torch.nn.CrossEntropyLoss(ignore_index=0).to(device) # ignore [GO] token = ignore index 0

    Why do you ignore the [GO] token when setting up the loss?

    Thank you

    opened by zhaiyukun 2
  • About input size

    Hi, thank you for your work. This is a very meaningful job. I am curious if the input size is the same as TRBA (32 x 100). Have you tried training with 32 x 100 input-sized images?

    opened by terryoo 2
  • Question about [GO] and [s]

    Hi, thanks for your amazing work. When you convert the label using the class TokenLabelConverter, you pad the label with [GO], which is ignored during loss calculation; however, in the paper, Figure 4 shows that the label is padded with [s]. Does this make any difference in accuracy?

    opened by sparrow0629 1
  • How to draw the attention map of ViTSTR?

    Hello, thank you very much for open-sourcing the code; it is very rewarding work. I am a graduate student and I want to do a small experiment of my own based on ViTSTR. Now I want to draw an attention map similar to the one shown in Fig. 9 of your paper; can you give me some help? Thank you very much!

    opened by lexiaoyuan 1
  • Poor performance on some images

    Thank you for the awesome research!

    I ran the code on the demo images and it worked perfectly. But when I run the code on a few sample images, the model seems to be incoherent.

    It would be great if you could answer a few of my questions:

    1. Does the model perform end-to-end STR or does the model require a cropped image (using for ex: EAST or TextFuseNet text detectors)? Example: 1st and 2nd images below (where 1st image is cropped version of 2nd image), same case with 5th and 6th image
    2. Does the model perform multi-line text recognition?
    3. The paper "Why You Should Try the Real Data for the Scene Text Recognition" mentions in Section 4.7 room for improvement using the OpenImage v5 dataset on this research; have you tried this?

    Examples:

    I used vitstr_base_patch16_224_aug.pth model for prediction.

    | Image | Prediction |
    | ----------- | ----------- |
    | test6 | middleborough |
    | test6_1 | midleerooogg |
    | test4 | qatm |
    | img_11 | aoe |
    | test2 | castlecampbell |
    | test1 | coaeeea |

    opened by dudeperf3ct 1
  • About the parameter `--valid_data` in the training command mentioned in README.md

    Hi, thanks for your work! When training, should the parameter --valid_data in the command be followed by data_lmdb_release/validation? But I found it written as data_lmdb_release/evaluation in README.md. Looking forward to your reply!

    opened by lexiaoyuan 1
  • Code refactoring (model.py, dataset.py) and add backslash to commands in README.md

    model.py

    • remove unused library(torch, math)
    • add space after self.vitstr

    dataset.py

    • rename function isless to is_less according to PEP8.
    • I think this function should be above the classes for code readability. But I didn't modify it.

    README.md

    • add backslash(\) so that commands can execute right away in shell.
    opened by oikosohn 0
  • Code refactoring for dataset.py and dataset.py.

    model.py

    • remove unused library(torch, math)
    • add space after self.vitstr

    dataset.py

    • rename function isless to is_less
    • I think this function should be above the classes.
    opened by oikosohn 0
  • Available Model weights.

    Hi, thanks for the nice work. I'm trying to get the available model weights for vitstr_base_patch16_224_aug to work with the infer.py script. So far it is not working, because the model is not built properly. Could you please give me advice on how to load the pretrained model from the given checkpoint? Thanks.

    opened by schreiterjp 1
  • CTC error

    Hi. Appreciate your contribution, but I have a problem when using CTC:

    CUDA_VISIBLE_DEVICES=4 python3 train.py --batch_ratio 1 --Transformation None --FeatureExtraction None --SequenceModeling None --Prediction CTC --Transformer --TransformerModel=vitstr_tiny_patch16_224 --imgH 224 --imgW 224 --manualSeed=27720

    error:

    Traceback (most recent call last):
      File "train.py", line 320, in <module>
        train(opt)
      File "train.py", line 175, in train
        preds = model(image, text)
    UnboundLocalError: local variable 'text' referenced before assignment

    opened by LeeBronOff23 1
  • train error

    CUDA_VISIBLE_DEVICES=0 python train.py --train_data mydata/mytrain --valid_data mydata/mytrain --select_data / --batch_ratio 1 --Transformation None --FeatureExtraction None --SequenceModeling None --Prediction None --Transformer --TransformerModel=vitstr_tiny_patch16_224 --imgH 224 --imgW 224 --manualSeed=$RANDOM --sensitive

    Traceback (most recent call last):
      File "train.py", line 310, in <module>
        train(opt)
      File "train.py", line 72, in train
        model = Model(opt)
      File "/media/passwd123/faba01fd-198e-4aa7-853f-bf64370f708c/home/passwd123/text_recognition/VITSTR/model.py", line 47, in __init__
        self.vitstr = create_vitstr(num_tokens=opt.num_class, model=opt.TransformerModel)
      File "/media/passwd123/faba01fd-198e-4aa7-853f-bf64370f708c/home/passwd123/text_recognition/VITSTR/modules/vitstr.py", line 42, in create_vitstr
        checkpoint_path=checkpoint_path)
      File "/home/passwd123/anaconda3/envs/pytorch_zls/lib/python3.7/site-packages/timm/models/factory.py", line 71, in create_model
        model = create_fn(pretrained=pretrained, pretrained_cfg=pretrained_cfg, **kwargs)
      File "/media/passwd123/faba01fd-198e-4aa7-853f-bf64370f708c/home/passwd123/text_recognition/VITSTR/modules/vitstr.py", line 159, in vitstr_tiny_patch16_224
        patch_size=16, embed_dim=192, depth=12, num_heads=3, mlp_ratio=4, qkv_bias=True, **kwargs)
      File "/media/passwd123/faba01fd-198e-4aa7-853f-bf64370f708c/home/passwd123/text_recognition/VITSTR/modules/vitstr.py", line 55, in __init__
        super().__init__(*args, **kwargs)
    TypeError: __init__() got an unexpected keyword argument 'pretrained_cfg'

    opened by chungluensing 1
  • Rand Aug

    Hello @roatienza!

    Thanks for this great repo!

    I am trying to train using rand_aug but I am facing some issues. I get an error on blur.py when trying to convert from BGR to Grayscale. It seems the image has just one channel.

    `error: Caught error in DataLoader worker process 0.
    Original Traceback (most recent call last):
      File "/home/fmobrj/anaconda3/envs/vitstr/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
        data = fetcher.fetch(index)
      File "/home/fmobrj/anaconda3/envs/vitstr/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
        return self.collate_fn(data)
      File "/media/hdd6tb/jupyter/notebooks/vitstr/deep-text-recognition-benchmark/dataset.py", line 500, in __call__
        image_tensors = [transform(image) for image in images]
      File "/media/hdd6tb/jupyter/notebooks/vitstr/deep-text-recognition-benchmark/dataset.py", line 500, in <listcomp>
        image_tensors = [transform(image) for image in images]
      File "/media/hdd6tb/jupyter/notebooks/vitstr/deep-text-recognition-benchmark/dataset.py", line 336, in __call__
        img = self.rand_aug(img)
      File "/media/hdd6tb/jupyter/notebooks/vitstr/deep-text-recognition-benchmark/dataset.py", line 357, in rand_aug
        img = op(img, mag=mag)
      File "/media/hdd6tb/jupyter/notebooks/vitstr/deep-text-recognition-benchmark/augmentation/blur.py", line 104, in __call__
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    cv2.error: OpenCV(3.4.18) /io/opencv/modules/imgproc/src/color.simd_helpers.hpp:88: error: (-2:Unspecified error) in function 'cv::impl::{anonymous}::CvtHelper<VScn, VDcn, VDepth, sizePolicy>::CvtHelper(cv::InputArray, cv::OutputArray, int) [with VScn = cv::impl::{anonymous}::Set<3, 4>; VDcn = cv::impl::{anonymous}::Set<3, 4>; VDepth = cv::impl::{anonymous}::Set<0, 2, 5>; cv::impl::{anonymous}::SizePolicy sizePolicy = cv::impl::<unnamed>::NONE; cv::InputArray = const cv::_InputArray&; cv::OutputArray = const cv::_OutputArray&]'
    > Invalid number of channels in input image:
    >     'VScn::contains(scn)'
    > where
    >     'scn' is 1`
    
    opened by fmobrj 1