Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features

Overview

| paper |

Official PyTorch implementation for Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features (MATRN).

This paper introduces a novel method, called Multi-modAl Text Recognition Network (MATRN), that enables interactions between visual and semantic features for better recognition performance.
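As a rough illustration of the idea, bidirectional cross-attention lets each modality be enhanced by the other. The sketch below is a simplification for exposition, not MATRN's actual modules; all shapes and names are assumptions.

import torch
import torch.nn as nn

d = 512
# Two cross-attention blocks: semantic features query visual features,
# and visual features query semantic features.
vis2sem = nn.MultiheadAttention(d, num_heads=8)
sem2vis = nn.MultiheadAttention(d, num_heads=8)

visual = torch.randn(8 * 32, 2, d)   # flattened feature map: (H*W, B, C)
semantic = torch.randn(26, 2, d)     # per-character features: (T, B, C)

# Each modality attends to the other; the enhanced features then feed
# the final character classifier.
sem_enhanced, _ = vis2sem(semantic, visual, visual)
vis_enhanced, _ = sem2vis(visual, semantic, semantic)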

Datasets

We use LMDB datasets for training and evaluation. The datasets can be downloaded from clova (for validation and evaluation) and ABINet (for training and evaluation). A quick way to inspect these LMDB files is sketched after the directory tree below.

  • Training datasets
  • Validation datasets
  • Evaluation datasets
  • Tree structure of data directory
    data
    ├── charset_36.txt
    ├── evaluation
    │   ├── CUTE80
    │   ├── IC13_857
    │   ├── IC13_1015
    │   ├── IC15_1811
    │   ├── IC15_2077
    │   ├── IIIT5k_3000
    │   ├── SVT
    │   └── SVTP
    ├── training
    │   ├── MJ
    │   │   ├── MJ_test
    │   │   ├── MJ_train
    │   │   └── MJ_valid
    │   └── ST
    ├── validation
    ├── WikiText-103.csv
    └── WikiText-103_eval_d1.csv
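
Each LMDB file can be inspected directly. The snippet below is a minimal sketch assuming the clova-ai key scheme ('num-samples', 'image-%09d', 'label-%09d') used by these datasets; the CUTE80 path is just an example.

import io
import lmdb
from PIL import Image

env = lmdb.open('data/evaluation/CUTE80', readonly=True, lock=False)
with env.begin() as txn:
    # Keys follow the clova-ai convention and are 1-indexed.
    n = int(txn.get(b'num-samples').decode())
    label = txn.get(b'label-000000001').decode()
    image = Image.open(io.BytesIO(txn.get(b'image-000000001')))
    print(n, label, image.size)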
    

Requirements

pip install torch==1.7.1 torchvision==0.8.2 fastai==1.0.60 lmdb pillow opencv-python

Pretrained Models

  • Download the pretrained MATRN model from this link. The performance of the pretrained model is:
Model  IIIT  SVT   IC13S  IC13L  IC15S  IC15L  SVTP  CUTE
MATRN  96.7  94.9  97.9   95.8   86.6   82.9   90.5  94.1

Training and Evaluation

  • Training
python main.py --config=configs/train_matrn.yaml
  • Evaluation
python main.py --config=configs/train_matrn.yaml --phase test --image_only

Additional flags (a combined example follows the list):

  • --checkpoint /path/to/checkpoint sets the path of the evaluation model
  • --test_root /path/to/dataset sets the path of the evaluation dataset
  • --model_eval [alignment|vision|language] selects which sub-model to evaluate
  • --image_only disables dumping visualizations of attention masks
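
For example, to evaluate a downloaded checkpoint on a single benchmark (the checkpoint path is a placeholder):

python main.py --config=configs/train_matrn.yaml --phase test --image_only --checkpoint /path/to/matrn.pth --test_root data/evaluation/CUTE80 --model_eval alignment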

Acknowledgements

This implementation is based on ABINet.

Citation

Please cite this work in your publications if it helps your research.

@article{na2021multi,
  title={Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features},
  author={Na, Byeonghu and Kim, Yoonsik and Park, Sungrae},
  journal={arXiv preprint arXiv:2111.15263},
  year={2021}
}
Comments
  • Predict More Characters

    Hello there!

    • Great work. I'd like to ask how to train the alignment model with more characters. The current implementation can only recognize 36 characters (0~9, a~z); I want to recognize 90 characters (0~9, a~z, A~Z, and some symbols).
    • I modified some code and can now train on 90 characters. However, I am facing a problem: I cannot load the pre-trained language and vision models, as they were trained on 36 characters. Is there a way to modify the code so that I can load the pre-trained weights?
    opened by Mountchicken 4
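
    A common workaround for the shape mismatch described above (a sketch, not the maintainers' recommended fix; model and the checkpoint filename are placeholders) is to load only the tensors whose names and shapes match, leaving the character-embedding and classifier layers to their fresh initialization:

    import torch

    ckpt = torch.load('best-train-matrn.pth', map_location='cpu')
    state = ckpt.get('model', ckpt)  # some checkpoints nest weights under 'model'
    own = model.state_dict()
    # Keep only checkpoint tensors that the 90-character model can accept.
    compatible = {k: v for k, v in state.items()
                  if k in own and v.shape == own[k].shape}
    own.update(compatible)
    model.load_state_dict(own)
    print('loaded', len(compatible), 'of', len(own), 'tensors')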
  • Question about the performance of pre-trained model that the link contains.

    First, thank you very much for your work; it is very impressive. However, when I evaluate with the pre-trained model provided by the link, I get results lower than the reported performance. What could be the reason? Thank you very much for your answer. My results on the six datasets IIIT5k_3000, SVT, SVTP, IC13_857, IC15_1811, and CUTE80 are as follows:

    [2022-03-03 23:28:19,374 main.py:276 INFO train-matrn] validation time: 62.44528245925903 / batch size: 384
    [2022-03-03 23:28:19,374 main.py:281 INFO train-matrn] eval loss = 1.435, ccr = 0.957, cwr = 0.904, ted = 1542.000, ned = 297, ted/w = 0.213.

    Your results on the same six datasets average to a cwr of 93.450.

    Thank you very much again!

    opened by Zhou2019 4
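
    A note on the 93.450 figure above: it matches the unweighted mean of the six corresponding cwr scores in the pretrained-model table, so the comparison itself is consistent:

    # 93.450 = plain average of the six reported cwr scores
    # (IIIT, SVT, IC13S, IC15S, SVTP, CUTE).
    scores = [96.7, 94.9, 97.9, 86.6, 90.5, 94.1]
    print(sum(scores) / len(scores))  # 93.45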
  • Question about the usage of text input

    I noticed that texts (index encodings) are passed to the forward function but not used anywhere. Just curious: are you planning to "add" text embeddings to the final output? My guess is that you have probably tried it but got limited performance improvements. I've been thinking about the use of text embeddings for a while, but it's hard to convince myself to add them to the training pipeline, since no text information is given at inference time. Please correct me if my guess is wrong. Thanks.

    opened by laoShuaiGe 3
  • Question about reproducing.

    Thanks for your great work.

    Can you tell me how long the model needs to train on 4 NVIDIA GeForce RTX 3090 GPUs to converge to the results in the paper? If it is convenient, could you provide your training logs?

    I'm in the process of reproducing it now, but I found that the loss becomes jittery after a period of training. I don't know whether I configured something wrong or whether it is inherently like this, slowly converging to the result through a long stretch of jitter. So I hope the authors can provide a training log, if possible (thanks a lot).

    Thank you very much!

    opened by mrazhou 2
  • About Code in Line 81 of main.py

    Hi! Thanks for your great work. While debugging your code, I found that lines 81 and 82 in main.py seem to be swapped. I am not sure whether this is intended. Thanks :D https://github.com/byeonghu-na/MATRN/blob/f4d43a92555c93df67dbb8c597483e9a5c3fed14/main.py#L81 https://github.com/byeonghu-na/MATRN/blob/f4d43a92555c93df67dbb8c597483e9a5c3fed14/main.py#L82

    opened by Gmbition 1
  • Error about the model when using resnet as the backbone.

    Hello, the following error occurs when I use ResNet as the model backbone instead of ResTransformer. No such error occurred when I ran ABINet with ResNet as the backbone.

    RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation:
     [torch.FloatTensor [10, 256, 512]], which is output 0 of ViewBackward, is at version 28; expected version 0 instead. 
    Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
    

    It seems the backbone in your model must include a transformer; without a transformer after the CNN, the error above occurs, but I can't find a more specific reason. I only changed the backbone in the train_matrn.yaml configuration file.

    Thanks for your reply!

    opened by Zhou2019 1
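
    Errors like this can often be localized with PyTorch's anomaly detection, which extends the traceback to the forward-pass operation whose output was later modified in place. A general debugging sketch (model and batch are placeholders for the actual training step):

    import torch

    # Re-run the failing step with anomaly detection enabled; the extended
    # traceback points at the op whose output was modified in place.
    torch.autograd.set_detect_anomaly(True)
    loss = model(batch).sum()  # stands in for the actual loss computation
    loss.backward()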
  • Question about code

    Thank you for sharing the code!

    On line 34 of modules/model_matrn_iter.py, self.semantic_visual has no pe attribute, so I get the error "torch.nn.modules.module.ModuleAttributeError: 'BaseSemanticVisual_backbone_feature' object has no attribute 'pe'".

    The same applies to lines 39 and 44.

    Is there something wrong here?

    opened by Sisi0518 1