PyTorch re-implementation of the paper SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition (CVPR 2022)

Overview

SwinTextSpotter

This is the PyTorch implementation of the paper SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition (CVPR 2022). The paper is available at this link.

Models

SWINTS-swin-english-pretrain [config] | model_Google Drive | model_BaiduYun PW: 954t

SWINTS-swin-Total-Text [config] | model_Google Drive | model_BaiduYun PW: tf0i

SWINTS-swin-ctw [config] | model_Google Drive | model_BaiduYun PW: 4etq

SWINTS-swin-icdar2015 [config] | model_Google Drive | model_BaiduYun PW: 3n82

SWINTS-swin-ReCTS [config] | model_Google Drive | model_BaiduYun PW: a4be

SWINTS-swin-vintext [config] | model_Google Drive | model_BaiduYun PW: slmp

Installation

  • Python=3.8
  • PyTorch=1.8.0, torchvision=0.9.0, cudatoolkit=11.1
  • OpenCV for visualization
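
One way to confirm that an environment matches the versions listed above is to compare the installed package versions against them. The following is a minimal sketch and not part of the repository; the distribution names `torch` and `torchvision` and the helper `check_versions` are assumptions made for illustration.

```python
from importlib import metadata

# Versions taken from the requirements list above.
# The distribution names are an assumption of this sketch.
REQUIRED = {"torch": "1.8.0", "torchvision": "0.9.0"}


def check_versions(required):
    """Return {package: (expected, installed_or_None, ok)} for each requirement."""
    report = {}
    for name, expected in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Local build tags such as "+cu111" are ignored when comparing.
        ok = installed is not None and installed.split("+")[0] == expected
        report[name] = (expected, installed, ok)
    return report


if __name__ == "__main__":
    for name, (expected, installed, ok) in check_versions(REQUIRED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: expected {expected}, found {installed} [{status}]")
```

A mismatch is not necessarily fatal, but it is a useful first thing to rule out when a build or import fails.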

Steps

  1. Install the repository (we recommend using Anaconda for installation).
conda create -n SWINTS python=3.8 -y
conda activate SWINTS
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge
pip install opencv-python
pip install scipy
pip install shapely
pip install rapidfuzz
pip install timm
pip install Polygon3
git clone https://github.com/mxin262/SwinTextSpotter.git
cd SwinTextSpotter
python setup.py build develop
  2. Dataset path
datasets
|_ totaltext
|  |_ train_images
|  |_ test_images
|  |_ totaltext_train.json
|  |_ weak_voc_new.txt
|  |_ weak_voc_pair_list.txt
|_ mlt2017
|  |_ train_images
|  |_ annotations/icdar_2017_mlt.json
.......
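
Before training or evaluating, it can help to check that the files in the tree above are actually present, since a missing file such as `weak_voc_new.txt` only surfaces later as a `FileNotFoundError`. This is a small sketch under the assumption that the layout is exactly the one shown; the helper name `missing_entries` is hypothetical and the list covers only the entries visible in the tree.

```python
from pathlib import Path

# Paths copied from the tree above, relative to the datasets/ directory.
EXPECTED = [
    "totaltext/train_images",
    "totaltext/test_images",
    "totaltext/totaltext_train.json",
    "totaltext/weak_voc_new.txt",
    "totaltext/weak_voc_pair_list.txt",
    "mlt2017/train_images",
    "mlt2017/annotations/icdar_2017_mlt.json",
]


def missing_entries(root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    for rel in missing_entries("datasets"):
        print(f"missing: datasets/{rel}")
```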

Downloaded images

Downloaded labels: [Google Drive] [BaiduYun] PW: 46vd

Download the lexicon [Google Drive] and place it in the corresponding dataset directory.

You can also prepare your custom dataset following the example scripts. [example scripts]

Totaltext

To evaluate on Total-Text, CTW1500, and ICDAR2015, first download the zipped annotations:

cd datasets
mkdir evaluation
cd evaluation
wget -O gt_ctw1500.zip https://cloudstor.aarnet.edu.au/plus/s/xU3yeM3GnidiSTr/download
wget -O gt_totaltext.zip https://cloudstor.aarnet.edu.au/plus/s/SFHvin8BLUM4cNd/download
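# The two Google Drive URLs below are share-page links, so wget may save an HTML page instead of the zip; if that happens, download the files in a browser.
wget -O gt_icdar2015.zip "https://drive.google.com/file/d/1wrq_-qIyb_8dhYVlDzLZTTajQzbic82Z/view?usp=sharing"
wget -O gt_vintext.zip "https://drive.google.com/file/d/11lNH0uKfWJ7Wc74PGshWCOgSxgEnUPEV/view?usp=sharing"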
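
The Google Drive addresses in the commands above are `.../file/d/<id>/view` share links. A commonly used convention, which is an assumption here and not something this README documents, is to rewrite such a link into a `uc?export=download&id=<id>` URL. The sketch below only performs that string rewrite; the helper name `drive_direct_url` is hypothetical, and Google may still show a confirmation page for large files.

```python
import re

# Captures the file id from ".../file/d/<id>/..." share links.
_FILE_ID = re.compile(r"/file/d/([A-Za-z0-9_-]+)")


def drive_direct_url(share_url):
    """Rewrite a Google Drive share link into a direct-download style URL."""
    match = _FILE_ID.search(share_url)
    if match is None:
        raise ValueError(f"no Google Drive file id found in: {share_url}")
    return f"https://drive.google.com/uc?export=download&id={match.group(1)}"


if __name__ == "__main__":
    share = "https://drive.google.com/file/d/1wrq_-qIyb_8dhYVlDzLZTTajQzbic82Z/view?usp=sharing"
    print(drive_direct_url(share))
```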
  3. Pretrain SWINTS (e.g., with Swin-Transformer backbone)
python projects/SWINTS/train_net.py \
  --num-gpus 8 \
  --config-file projects/SWINTS/configs/SWINTS-swin-pretrain.yaml
  4. Fine-tune the model on the mixed real dataset
python projects/SWINTS/train_net.py \
  --num-gpus 8 \
  --config-file projects/SWINTS/configs/SWINTS-swin-mixtrain.yaml
  5. Fine-tune the model on the target dataset (e.g., Total-Text)
python projects/SWINTS/train_net.py \
  --num-gpus 8 \
  --config-file projects/SWINTS/configs/SWINTS-swin-finetune-totaltext.yaml
  6. Evaluate SWINTS (e.g., with Swin-Transformer backbone)
python projects/SWINTS/train_net.py \
  --config-file projects/SWINTS/configs/SWINTS-swin-finetune-totaltext.yaml \
  --eval-only MODEL.WEIGHTS ./output/model_final.pth
  7. Visualize the detection and recognition results (e.g., with Swin-Transformer backbone)
python demo/demo.py \
  --config-file projects/SWINTS/configs/SWINTS-swin-finetune-totaltext.yaml \
  --input input1.jpg \
  --output ./output \
  --confidence-threshold 0.4 \
  --opts MODEL.WEIGHTS ./output/model_final.pth
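
To run the demo over several images, the command above can be assembled programmatically. The sketch below is an illustration that uses only the flags shown in that command; the helper `build_demo_command` is hypothetical, and it only builds the argument list without launching anything.

```python
import shlex


def build_demo_command(image, config, weights, output_dir="./output", threshold=0.4):
    """Build the demo/demo.py argument list for one image."""
    return [
        "python", "demo/demo.py",
        "--config-file", str(config),
        "--input", str(image),
        "--output", str(output_dir),
        "--confidence-threshold", str(threshold),
        "--opts", "MODEL.WEIGHTS", str(weights),
    ]


if __name__ == "__main__":
    config = "projects/SWINTS/configs/SWINTS-swin-finetune-totaltext.yaml"
    weights = "./output/model_final.pth"
    # Print the command for the example image used above.
    print(shlex.join(build_demo_command("input1.jpg", config, weights)))
```

The returned list can be passed to `subprocess.run` once per image, which avoids shell quoting problems with paths that contain spaces.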

Example results:

Acknowledgement

AdelaiDet, Detectron2, ISTR, SwinT_detectron2, Focal-Transformer and MaskTextSpotterV3.

Citation

If our paper helps your research, please cite it in your publications:

@article{huang2022swints,
  title = {SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition},
  author = {Mingxin Huang and Yuliang Liu and Zhenghao Peng and Chongyu Liu and Dahua Lin and Shenggao Zhu and Nicholas Yuan and Kai Ding and Lianwen Jin},
  journal={arXiv preprint arXiv:2203.10209},
  year = {2022}
}

Copyright

For commercial use, please contact Dr. Lianwen Jin: [email protected]

Copyright 2019, Deep Learning and Vision Computing Lab, South China University of Technology. http://www.dlvc-lab.net

Comments
  • ReCTS model doesn't work

    Parameter settings:

    • weights: rects_model_final.pth
    • config: ./projects/SWINTS/configs/SWINTS-swin-chn_finetune.yaml
    • prediction data: ReCTS/icdar2019_rects_images1/*
    • run script: python3 demo/demo.py --opts MODEL.WEIGHTS ./weights/rects_model_final.pth

    With these settings the model predicts nothing, but 'tt_model_final.pth' works correctly.

    opened by oyrq 5
  • FileNotFoundError: [Errno 2] No such file or directory: 'datasets/totaltext/weak_voc_new.txt'

    Thank you for kindly sharing your code!

    When I run: python projects/SWINTS/train_net.py --config-file projects/SWINTS/configs/SWINTS-swin-finetune-totaltext.yaml --eval-only MODEL.WEIGHTS model.pth

    the following error occurs: FileNotFoundError: [Errno 2] No such file or directory: 'datasets/totaltext/weak_voc_new.txt'

    I cannot find this file in the dataset downloads.

    opened by madajie9 4
  • How can I get gt_masks for my custom dataset?

    I want to train SwinTextSpotter on my custom dataset (in Total-Text format). When I start training, I get this error at line 230 of projects/SWINTS/swints/swints.py: "AttributeError: Cannot find field 'gt_masks' in the given Instances!" How can I get gt_masks for my custom dataset?

    Thank you for your hard work!

    opened by khiemledev 4
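
One possible cause of the missing `gt_masks` field, offered as an assumption and not a confirmed diagnosis, is that some annotations in the custom COCO-format file carry no `segmentation` entry, since Detectron2 only builds `gt_masks` from that field. The sketch below lists such annotations; the helper name `annotations_without_masks` is hypothetical.

```python
import json


def annotations_without_masks(coco):
    """Return ids of annotations that have no non-empty "segmentation" entry."""
    return [
        ann.get("id")
        for ann in coco.get("annotations", [])
        if not ann.get("segmentation")
    ]


def load_coco(path):
    """Load a COCO-format annotation file into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

For example, `annotations_without_masks(load_coco("my_train.json"))`, where `my_train.json` is a placeholder file name, returns an empty list when every annotation has a polygon.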
  • rec loss does not decrease during training

    I ran the following command: python projects/SWINTS/train_net.py --num-gpus 4 --config-file projects/SWINTS/configs/SWINTS-swin-pretrain.yaml

    With 4 A100 80G GPUs and batch size 24, the rec loss stays at around 6-7.

    [08/03 14:38:52] detectron2 INFO: Rank of current process: 0. World size: 4
    [08/03 14:38:53] detectron2 INFO: Environment info:

    sys.platform             linux
    Python                   3.8.1 (default, Jan 8 2020, 22:29:32) [GCC 7.3.0]
    numpy                    1.23.1
    detectron2               0.4 @/home/SwinTextSpotter/detectron2
    Compiler                 GCC 7.5
    CUDA compiler            CUDA 11.3
    detectron2 arch flags    6.1
    DETECTRON2_ENV_MODULE
    PyTorch                  1.10.0+cu113 @/miniconda/lib/python3.8/site-packages/torch
    PyTorch debug build      False
    GPU available            True
    GPU 0,1,2,3              NVIDIA A100 80GB PCIe (arch=8.0)
    CUDA_HOME                /usr/local/cuda
    Pillow                   9.2.0
    torchvision              0.11.0+cu113 @/miniconda/lib/python3.8/site-packages/torchvision
    torchvision arch flags   3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
    fvcore                   0.1.5.post20220512
    iopath                   0.1.7
    cv2                      4.6.0


    PyTorch built with:

    • GCC 7.3
    • C++ Version: 201402
    • Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
    • Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
    • OpenMP 201511 (a.k.a. OpenMP 4.5)
    • LAPACK is enabled (usually provided by MKL)
    • NNPACK is enabled
    • CPU capability usage: AVX512
    • CUDA Runtime 11.3
    • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
    • CuDNN 8.2
    • Magma 2.5.2
    • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

    [08/03 14:38:53] detectron2 INFO: Command line arguments: Namespace(config_file='projects/SWINTS/configs/SWINTS-swin-pretrain.yaml', dist_url='tcp://127.0.0.1:49152', eval_only=False, machine_rank=0, num_gpus=4, num_machines=1, opts=[], resume=False)
    [08/03 14:38:53] detectron2 INFO: Contents of args.config_file=projects/SWINTS/configs/SWINTS-swin-pretrain.yaml:

    _BASE_: "Base-SWINTS_swin.yaml"
    MODEL:
      WEIGHTS: "ckpt/swin_imagenet_pretrain.pth"
      SWINTS:
        NUM_PROPOSALS: 300
        NUM_CLASSES: 2
    DATASETS:
      TRAIN: ("totaltext_train","icdar_2015_train","icdar_2013_train","icdar_2017_validation_mlt","icdar_2017_mlt","icdar_curvesynthtext_train1","icdar_curvesynthtext_train2",)
      TEST: ("totaltext_test",)
    SOLVER:
      STEPS: (120000,140000)
      MAX_ITER: 150000
      CHECKPOINT_PERIOD: 5000
    INPUT:
      FORMAT: "RGB"

    [08/03 14:38:53] detectron2 INFO: Running with full config: CUDNN_BENCHMARK: False DATALOADER: ASPECT_RATIO_GROUPING: True FILTER_EMPTY_ANNOTATIONS: True NUM_WORKERS: 4 REPEAT_THRESHOLD: 0.0 SAMPLER_TRAIN: TrainingSampler DATASETS: PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000 PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000 PROPOSAL_FILES_TEST: () PROPOSAL_FILES_TRAIN: () TEST: ('totaltext_test',) TRAIN: ('totaltext_train', 'icdar_2015_train', 'icdar_2013_train', 'icdar_2017_validation_mlt', 'icdar_2017_mlt', 'icdar_curvesynthtext_train1', 'icdar_curvesynthtext_train2') GLOBAL: HACK: 1.0 INPUT: CROP: CROP_INSTANCE: False ENABLED: True SIZE: [0.1, 0.1] TYPE: relative_range FORMAT: RGB MASK_FORMAT: polygon MAX_SIZE_TEST: 1824 MAX_SIZE_TRAIN: 1600 MIN_SIZE_TEST: 1000 MIN_SIZE_TRAIN: (640, 672, 704, 736, 768, 800, 832, 864, 896) MIN_SIZE_TRAIN_SAMPLING: choice RANDOM_FLIP: horizontal MODEL: ANCHOR_GENERATOR: ANGLES: [[-90, 0, 90]] ASPECT_RATIOS: [[0.5, 1.0, 2.0]] NAME: DefaultAnchorGenerator OFFSET: 0.0 SIZES: [[32, 64, 128, 256, 512]] BACKBONE: FREEZE_AT: -1 NAME: build_swint_fpn_backbone DEVICE: cuda FPN: FUSE_TYPE: sum IN_FEATURES: ['stage2', 'stage3', 'stage4', 'stage5'] NORM: OUT_CHANNELS: 256 TOP_LEVELS: 2 KEYPOINT_ON: False LOAD_PROPOSALS: False MASK_ON: True META_ARCHITECTURE: SWINTS PANOPTIC_FPN: COMBINE: ENABLED: True INSTANCES_CONFIDENCE_THRESH: 0.5 OVERLAP_THRESH: 0.5 STUFF_AREA_LIMIT: 4096 INSTANCE_LOSS_WEIGHT: 1.0 PIXEL_MEAN: [123.675, 116.28, 103.53] PIXEL_STD: [58.395, 57.12, 57.375] PROPOSAL_GENERATOR: MIN_SIZE: 0 NAME: RPN REC_HEAD: BATCH_SIZE: 128 NUM_CLASSES: 107 POOLER_RESOLUTION: (28, 28) RESOLUTION: (32, 32) RESNETS: DEFORM_MODULATED: False DEFORM_NUM_GROUPS: 1 DEFORM_ON_PER_STAGE: [False, False, False, False] DEPTH: 50 NORM: FrozenBN NUM_GROUPS: 1 OUT_FEATURES: ['res4'] RES2_OUT_CHANNELS: 256 RES5_DILATION: 1 STEM_OUT_CHANNELS: 64 STRIDE_IN_1X1: True WIDTH_PER_GROUP: 64 RETINANET: BBOX_REG_LOSS_TYPE: smooth_l1 BBOX_REG_WEIGHTS: (1.0, 1.0, 1.0, 1.0) 
FOCAL_LOSS_ALPHA: 0.25 FOCAL_LOSS_GAMMA: 2.0 IN_FEATURES: ['p3', 'p4', 'p5', 'p6', 'p7'] IOU_LABELS: [0, -1, 1] IOU_THRESHOLDS: [0.4, 0.5] NMS_THRESH_TEST: 0.5 NORM: NUM_CLASSES: 80 NUM_CONVS: 4 PRIOR_PROB: 0.01 SCORE_THRESH_TEST: 0.05 SMOOTH_L1_LOSS_BETA: 0.1 TOPK_CANDIDATES_TEST: 1000 ROI_BOX_CASCADE_HEAD: BBOX_REG_WEIGHTS: ((10.0, 10.0, 5.0, 5.0), (20.0, 20.0, 10.0, 10.0), (30.0, 30.0, 15.0, 15.0)) IOUS: (0.5, 0.6, 0.7) ROI_BOX_HEAD: BBOX_REG_LOSS_TYPE: smooth_l1 BBOX_REG_LOSS_WEIGHT: 1.0 BBOX_REG_WEIGHTS: (10.0, 10.0, 5.0, 5.0) CLS_AGNOSTIC_BBOX_REG: False CONV_DIM: 256 FC_DIM: 1024 NAME: NORM: NUM_CONV: 0 NUM_FC: 0 POOLER_RESOLUTION: 7 POOLER_SAMPLING_RATIO: 2 POOLER_TYPE: ROIAlignV2 SMOOTH_L1_BETA: 0.0 TRAIN_ON_PRED_BOXES: False ROI_HEADS: BATCH_SIZE_PER_IMAGE: 512 IN_FEATURES: ['p2', 'p3', 'p4', 'p5'] IOU_LABELS: [0, 1] IOU_THRESHOLDS: [0.5] NAME: Res5ROIHeads NMS_THRESH_TEST: 0.5 NUM_CLASSES: 80 POSITIVE_FRACTION: 0.25 PROPOSAL_APPEND_GT: True SCORE_THRESH_TEST: 0.05 ROI_KEYPOINT_HEAD: CONV_DIMS: (512, 512, 512, 512, 512, 512, 512, 512) LOSS_WEIGHT: 1.0 MIN_KEYPOINTS_PER_IMAGE: 1 NAME: KRCNNConvDeconvUpsampleHead NORMALIZE_LOSS_BY_VISIBLE_KEYPOINTS: True NUM_KEYPOINTS: 17 POOLER_RESOLUTION: 14 POOLER_SAMPLING_RATIO: 0 POOLER_TYPE: ROIAlignV2 ROI_MASK_HEAD: CLS_AGNOSTIC_MASK: False CONV_DIM: 256 NAME: MaskRCNNConvUpsampleHead NORM: NUM_CONV: 0 POOLER_RESOLUTION: 14 POOLER_SAMPLING_RATIO: 0 POOLER_TYPE: ROIAlignV2 RPN: BATCH_SIZE_PER_IMAGE: 256 BBOX_REG_LOSS_TYPE: smooth_l1 BBOX_REG_LOSS_WEIGHT: 1.0 BBOX_REG_WEIGHTS: (1.0, 1.0, 1.0, 1.0) BOUNDARY_THRESH: -1 CONV_DIMS: [-1] HEAD_NAME: StandardRPNHead IN_FEATURES: ['res4'] IOU_LABELS: [0, -1, 1] IOU_THRESHOLDS: [0.3, 0.7] LOSS_WEIGHT: 1.0 NMS_THRESH: 0.7 POSITIVE_FRACTION: 0.5 POST_NMS_TOPK_TEST: 1000 POST_NMS_TOPK_TRAIN: 2000 PRE_NMS_TOPK_TEST: 6000 PRE_NMS_TOPK_TRAIN: 12000 SMOOTH_L1_BETA: 0.0 SEM_SEG_HEAD: COMMON_STRIDE: 4 CONVS_DIM: 128 IGNORE_VALUE: 255 IN_FEATURES: ['p2', 'p3', 'p4', 'p5'] LOSS_WEIGHT: 
1.0 NAME: SemSegFPNHead NORM: GN NUM_CLASSES: 54 SWINT: APE: False DEPTHS: [2, 2, 6, 2] DROP_PATH_RATE: 0.2 EMBED_DIM: 96 MLP_RATIO: 4 NUM_HEADS: [3, 6, 12, 24] OUT_FEATURES: ['stage2', 'stage3', 'stage4', 'stage5'] WINDOW_SIZE: 7 SWINTS: ACTIVATION: relu ALPHA: 0.25 CLASS_WEIGHT: 2.0 DEEP_SUPERVISION: True DIM_DYNAMIC: 64 DIM_FEEDFORWARD: 2048 DROPOUT: 0.0 GAMMA: 2.0 GIOU_WEIGHT: 2.0 HIDDEN_DIM: 256 IOU_LABELS: [0, 1] IOU_THRESHOLDS: [0.5] L1_WEIGHT: 5.0 MASK_DIM: 60 MASK_WEIGHT: 2.0 NHEADS: 8 NO_OBJECT_WEIGHT: 0.1 NUM_CLASSES: 2 NUM_CLS: 3 NUM_DYNAMIC: 2 NUM_HEADS: 6 NUM_MASK: 3 NUM_PROPOSALS: 300 NUM_REG: 3 PATH_COMPONENTS: ./projects/SWINTS/LME/coco_2017_train_class_agnosticTrue_whitenTrue_sigmoidTrue_60_siz28.npz PRIOR_PROB: 0.01 REC_WEIGHT: 1.0 TEST_NUM_PROPOSALS: 100 WEIGHTS: ckpt/swin_imagenet_pretrain.pth OUTPUT_DIR: ./output SEED: 40244023 SOLVER: AMP: ENABLED: False BACKBONE_MULTIPLIER: 1.0 BASE_LR: 7.5e-05 BIAS_LR_FACTOR: 1.0 CHECKPOINT_PERIOD: 5000 CLIP_GRADIENTS: CLIP_TYPE: full_model CLIP_VALUE: 1.0 ENABLED: True NORM_TYPE: 2.0 GAMMA: 0.1 IMS_PER_BATCH: 24 LR_SCHEDULER_NAME: WarmupMultiStepLR MAX_ITER: 150000 MOMENTUM: 0.9 NESTEROV: False OPTIMIZER: ADAMW REFERENCE_WORLD_SIZE: 0 STEPS: (120000, 140000) WARMUP_FACTOR: 0.01 WARMUP_ITERS: 1000 WARMUP_METHOD: linear WEIGHT_DECAY: 0.0001 WEIGHT_DECAY_BIAS: 0.0001 WEIGHT_DECAY_NORM: 0.0 TEST: AUG: ENABLED: False FLIP: True MAX_SIZE: 4000 MIN_SIZES: (400, 500, 600, 700, 800, 900, 1000, 1100, 1200) DETECTIONS_PER_IMAGE: 100 EVAL_PERIOD: 100000 EXPECTED_RESULTS: [] INFERENCE_TH_TEST: 0.4 KEYPOINT_OKS_SIGMAS: [] PRECISE_BN: ENABLED: False NUM_ITER: 200 USE_NMS_IN_TSET: True VERSION: 2 VIS_PERIOD: 0 [08/03 14:38:53] detectron2 INFO: Full config saved to ./output/config.yaml [08/03 14:38:55] d2.engine.defaults INFO: Model:

    [08/03 14:38:56] d2.data.datasets.coco INFO: Loaded 1255 images in COCO format from datasets/totaltext/totaltext_train.json
    [08/03 14:38:56] d2.data.datasets.coco INFO: Loaded 1000 images in COCO format from datasets/icdar2015/icdar_2015_train.json
    [08/03 14:38:56] d2.data.datasets.coco INFO: Loaded 229 images in COCO format from datasets/icdar2013/annotations/icdar_2013.json
    [08/03 14:38:56] d2.data.datasets.coco INFO: Loaded 1797 images in COCO format from datasets/mlt2017/annotations/icdar_2017_validation_mlt.json
    [08/03 14:38:57] d2.data.datasets.coco INFO: Loaded 7160 images in COCO format from datasets/mlt2017/annotations/icdar_2017_mlt.json
    [08/03 14:39:18] d2.data.datasets.coco INFO: Loading datasets/syntext1/annotations/syntext_word_eng.json takes 21.03 seconds.
    [08/03 14:39:19] d2.data.datasets.coco INFO: Loaded 94723 images in COCO format from datasets/syntext1/annotations/syntext_word_eng.json
    [08/03 14:39:35] d2.data.datasets.coco INFO: Loading datasets/syntext2/annotations/ecms_v1_maxlen25.json takes 7.61 seconds.
    [08/03 14:39:35] d2.data.datasets.coco INFO: Loaded 54327 images in COCO format from datasets/syntext2/annotations/ecms_v1_maxlen25.json
    [08/03 14:39:36] d2.data.build INFO: Removed 2230 images with no usable annotations. 158261 images left.
    [08/03 14:39:42] d2.data.build INFO: Distribution of instances among all 1 categories:
    | category | #instances |
    |:--------:|:-----------|
    | text     | 1868890    |
    [08/03 14:39:42] d2.data.build INFO: Using training sampler TrainingSampler
    [08/03 14:39:42] d2.data.common INFO: Serializing 158261 elements to byte tensors and concatenating them all ...
    [08/03 14:39:48] d2.data.common INFO: Serialized dataset takes 460.55 MiB
    [08/03 14:39:50] fvcore.common.checkpoint INFO: [Checkpointer] Loading from ckpt/swin_imagenet_pretrain.pth ...

    [08/03 14:39:51] d2.engine.train_loop INFO: Starting training from iteration 0

    [08/04 14:53:46 d2.utils.events]: eta: 1 day, 14:59:42 iter: 56279 total_loss: 16.8 loss_ce: 0.1334 loss_giou: 0.29 loss_bbox: 0.1194 loss_feat: 0.5878 loss_dice: 0.119 loss_rec: 6.447 loss_ce_0: 0.6154 loss_giou_0: 0.7695 loss_bbox_0: 0.313 loss_feat_0: 1.204 loss_dice_0: 0.2254 loss_ce_1: 0.3727 loss_giou_1: 0.3973 loss_bbox_1: 0.1591 loss_feat_1: 0.8152 loss_dice_1: 0.1459 loss_ce_2: 0.31 loss_giou_2: 0.3258 loss_bbox_2: 0.1341 loss_feat_2: 0.6907 loss_dice_2: 0.1311 loss_ce_3: 0.2053 loss_giou_3: 0.3006 loss_bbox_3: 0.122 loss_feat_3: 0.6204 loss_dice_3: 0.1216 loss_ce_4: 0.1435 loss_giou_4: 0.2943 loss_bbox_4: 0.1182 loss_feat_4: 0.6 loss_dice_4: 0.1188 time: 1.5465 data_time: 0.0930 lr: 7.5e-05 max_mem: 66803M
    [08/04 14:54:16 d2.utils.events]: eta: 1 day, 14:59:52 iter: 56299 total_loss: 17.43 loss_ce: 0.1425 loss_giou: 0.2743 loss_bbox: 0.1041 loss_feat: 0.5848 loss_dice: 0.1222 loss_rec: 6.74 loss_ce_0: 0.6106 loss_giou_0: 0.7541 loss_bbox_0: 0.3087 loss_feat_0: 1.213 loss_dice_0: 0.2271 loss_ce_1: 0.378 loss_giou_1: 0.3861 loss_bbox_1: 0.1526 loss_feat_1: 0.8113 loss_dice_1: 0.1538 loss_ce_2: 0.3031 loss_giou_2: 0.3231 loss_bbox_2: 0.1218 loss_feat_2: 0.6802 loss_dice_2: 0.1366 loss_ce_3: 0.2059 loss_giou_3: 0.2907 loss_bbox_3: 0.1099 loss_feat_3: 0.6195 loss_dice_3: 0.127 loss_ce_4: 0.1465 loss_giou_4: 0.2804 loss_bbox_4: 0.1056 loss_feat_4: 0.5916 loss_dice_4: 0.123 time: 1.5464 data_time: 0.0578 lr: 7.5e-05 max_mem: 66803M
    [08/04 14:54:47 d2.utils.events]: eta: 1 day, 14:59:22 iter: 56319 total_loss: 16.86 loss_ce: 0.1324 loss_giou: 0.2875 loss_bbox: 0.1126 loss_feat: 0.5857 loss_dice: 0.1232 loss_rec: 6.414 loss_ce_0: 0.6167 loss_giou_0: 0.7572 loss_bbox_0: 0.3231 loss_feat_0: 1.195 loss_dice_0: 0.2153 loss_ce_1: 0.3705 loss_giou_1: 0.3863 loss_bbox_1: 0.1542 loss_feat_1: 0.8011 loss_dice_1: 0.1507 loss_ce_2: 0.2973 loss_giou_2: 0.32 loss_bbox_2: 0.126 loss_feat_2: 0.6879 loss_dice_2: 0.1377 loss_ce_3: 0.1973 loss_giou_3: 0.2947 loss_bbox_3: 0.1173 loss_feat_3: 0.6282 loss_dice_3: 0.1278 loss_ce_4: 0.1405 loss_giou_4: 0.2881 loss_bbox_4: 0.113 loss_feat_4: 0.5965 loss_dice_4: 0.1255 time: 1.5464 data_time: 0.0538 lr: 7.5e-05 max_mem: 66803M
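
To judge whether `loss_rec` is flat over the whole run and not only in a short excerpt, the values can be pulled out of the training log and inspected or plotted. This is a minimal sketch that assumes the `iter: N` and `loss_rec: X` fields appear as in the lines above; the helper name `rec_loss_points` is hypothetical.

```python
import re

# Lower-case "iter:" avoids matching config keys such as MAX_ITER.
ITER_RE = re.compile(r"\biter: (\d+)")
REC_RE = re.compile(r"\bloss_rec: (\d+(?:\.\d+)?)")


def rec_loss_points(log_text):
    """Return (iteration, loss_rec) pairs found in the log text, in order."""
    iters = [int(m) for m in ITER_RE.findall(log_text)]
    losses = [float(m) for m in REC_RE.findall(log_text)]
    return list(zip(iters, losses))


if __name__ == "__main__":
    sample = (
        "eta: 1 day, 14:59:42 iter: 56279 total_loss: 16.8 loss_rec: 6.447 lr: 7.5e-05 "
        "eta: 1 day, 14:59:52 iter: 56299 total_loss: 17.43 loss_rec: 6.74 lr: 7.5e-05"
    )
    for iteration, loss in rec_loss_points(sample):
        print(iteration, loss)
```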

    opened by Shualite 3
  • About using the pretrained model to spot Vietnamese

    Hello, I want to use your pretrained model that was fine-tuned on VinText to spot Vietnamese text, but it seems to predict only characters from the English alphabet. I don't see where to change the alphabet in the config files. Where should I look to modify this?

    opened by namdo281 3
  • Change character list

    Hello everyone, I want to change the character list to match the characters in my custom dataset. How can I do this? I have tried changing CTLABELS and chars wherever the character list appears in the code, but I end up with this error: "Assertion cur_target >= 0 && cur_target < n_classes failed".

    Thank you!

    opened by khiemledev 3
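
The assertion `cur_target >= 0 && cur_target < n_classes` is raised when a target label index falls outside the number of classes of the classifier, so one thing worth checking after changing the character list is whether every encoded character index is still below `REC_HEAD.NUM_CLASSES`. The sketch below is an illustration only: the encoding (index equals position in the character list, unknown characters map to `len(charset)`) is an assumption and may differ from the repository's own encoding, and `out_of_range_labels` is a hypothetical helper.

```python
def out_of_range_labels(texts, charset, num_classes):
    """Return sorted (char, index) pairs whose index is outside [0, num_classes)."""
    index = {ch: i for i, ch in enumerate(charset)}
    bad = set()
    for text in texts:
        for ch in text:
            # Assumption of this sketch: unknown characters map to len(charset).
            i = index.get(ch, len(charset))
            if not 0 <= i < num_classes:
                bad.add((ch, i))
    return sorted(bad)


if __name__ == "__main__":
    # A three-character list needs at least three classes.
    print(out_of_range_labels(["abc"], "abc", num_classes=2))
```

If the returned list is not empty, either the character list is larger than the configured number of classes or the transcriptions contain characters that the list does not cover.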
  • Lacking the introduction of char dictionary for recognition

    The README gives no details about the character dictionaries used for the various datasets. I printed the embedding sizes in rects_model_final.pth and tt_model_final.pth and got 5000+ and 107, respectively.

    opened by TangLinJie 3
  • rects_model_final.pth nothing

    Hi, I ran: python demo/demo.py --config-file projects/SWINTS/configs/SWINTS-swin-chn_finetune.yaml --input /xxx/x.jpg --output /xxxx --confidence-threshold 0.1 --opts MODEL.WEIGHTS /xxxxx/rects_model_final.pth, but I got no results on random Chinese street-view images.

    When I use tt_model_final.pth on random English street-view images, the results are good. Was the wrong ReCTS model uploaded? I also see that someone else asked the same question, but it has not been resolved. Please tell me how to solve it. Thank you very much.

    opened by Patickk 2
  • About data preparation

    Do the annotations need a fixed number of points, or is any number of points supported? Is there any documentation on the annotation specification? Thank you very much.

    opened by Randy-1009 2
  • published model for inference demo?

    I'm not sure if I understand how to use the code for inference on a single test picture, let's say input1.jpg. Have you published your pretrained model? If so, how do I adapt the line:

    python demo/demo.py --config-file projects/SWINTS/configs/SWINTS-swin-finetune-totaltext.yaml --input input1.jpg --output ./output --confidence-threshold 0.4 --opts MODEL.WEIGHTS ./output/model_final.pth
    
    opened by Horace89 2
  • train recognition only

    @mxin262 I want to train only the recognition part. Could you please tell me how to freeze the detection part and train only the recognition part, in order to improve recognition accuracy?

    opened by lerndeep 2
  • RuntimeError: CUDA out of memory.

    The following error occurs during training: "RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 7.92 GiB total capacity; 6.76 GiB already allocated; 66.50 MiB free; 6.87 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF". Setting BATCH_SIZE to 1 still produces this error. Does the author have a solution?

    opened by juneane 0
  • Question about the ReCTS dataset

    Hello, I tested demo.py on the ReCTS dataset and found that the output contains no Chinese characters. When I test on the English Total-Text data, the English text is shown.

    This is the command I ran, using the ReCTS .pth file provided by the author: python demo/demo.py --config-file projects/SWINTS/configs/SWINTS-swin-chn_finetune.yaml --input input1.jpg --output ./output --confidence-threshold 0.4 --opts MODEL.WEIGHTS ./output/rects_model_final.pth

    Contents of SWINTS-swin-chn_finetune.yaml:

    _BASE_: "Base-SWINTS_swin.yaml"
    MODEL:
      #WEIGHTS: "detectron2://ImageNetPretrained/torchvision/R-50.pkl"
      WEIGHTS: "./output/rects_model_final.pth"
      SWINTS:
        NUM_PROPOSALS: 300
        NUM_CLASSES: 2
      REC_HEAD:
        POOLER_RESOLUTION: (16,40)
        RESOLUTION: (32, 80)
        BATCH_SIZE: 128
        NUM_CLASSES: 5463
    DATASETS:
      TRAIN: ("rects",)
      TEST: ("totaltext_test",)
    SOLVER:
      STEPS: (140000,160000)
      MAX_ITER: 180000
      CHECKPOINT_PERIOD: 10000
    INPUT:
      FORMAT: "RGB"

    Test image: a test image from ReCTS (image: input1)

    Detection result: 7 instances were detected (image)

    Output: the saved jpg file contains no Chinese characters (image)

    opened by yxyxnrh 2
  • This happens when testing demo.py; does the author have a solution?

    Traceback (most recent call last):
      File "demo/demo.py", line 99, in <module>
        predictions, visualized_output = demo.run_on_image(img, args.confidence_threshold, path)
      File "C:\Users\Administrator\SwinTextSpotter\demo\predictor.py", line 67, in run_on_image
        vis_output = visualizer.draw_instance_predictions(predictions=instances, path=path)
      File "c:\users\administrator\swintextspotter\detectron2\utils\visualizer.py", line 413, in draw_instance_predictions
        labels = _create_text_labels(classes, scores, self.metadata.get("thing_classes", None))
      File "c:\users\administrator\swintextspotter\detectron2\utils\visualizer.py", line 271, in _create_text_labels
        labels = [class_names[i] for i in classes]
      File "c:\users\administrator\swintextspotter\detectron2\utils\visualizer.py", line 271, in <listcomp>
        labels = [class_names[i] for i in classes]
    IndexError: list index out of range

    Screenshot 2022-09-18 093209, Screenshot 2022-09-18 093223

    opened by yxyxnrh 1
  • Tensor shapes conflict when training on VinText

    I tried to train the model on VinText dataset and got this traceback after several iterations:

    Traceback (most recent call last):
      File "/home/ccbien/projects/SceneText/exp/SwinTextSpotter/detectron2/engine/train_loop.py", line 140, in train
        self.run_step()
      File "/home/ccbien/projects/SceneText/exp/SwinTextSpotter/detectron2/engine/defaults.py", line 441, in run_step
        self._trainer.run_step()
      File "/home/ccbien/projects/SceneText/exp/SwinTextSpotter/detectron2/engine/train_loop.py", line 234, in run_step
        loss_dict = self.model(data)
      File "/home/ccbien/miniconda3/envs/scene_text/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ccbien/projects/SceneText/exp/SwinTextSpotter/projects/SWINTS/swints/swints.py", line 184, in forward
        loss_dict = self.criterion(output, targets, self.mask_encoding)
      File "/home/ccbien/miniconda3/envs/scene_text/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ccbien/projects/SceneText/exp/SwinTextSpotter/projects/SWINTS/swints/loss.py", line 153, in forward
        losses.update(self.get_loss(loss, outputs, targets, indices, num_boxes, mask_encoding))
      File "/home/ccbien/projects/SceneText/exp/SwinTextSpotter/projects/SWINTS/swints/loss.py", line 135, in get_loss
        return loss_map[loss](outputs, targets, indices, num_boxes, mask_encoding, **kwargs)
      File "/home/ccbien/projects/SceneText/exp/SwinTextSpotter/projects/SWINTS/swints/loss.py", line 75, in loss_boxes
        raise e
      File "/home/ccbien/projects/SceneText/exp/SwinTextSpotter/projects/SWINTS/swints/loss.py", line 69, in loss_boxes
        src_boxes_ = src_boxes / image_size
    RuntimeError: The size of tensor a (300) must match the size of tensor b (377) at non-singleton dimension 0
    

    Config:

    _BASE_: "Base-SWINTS_swin.yaml"
    MODEL:
      SWINTS:
        NUM_PROPOSALS: 300
        NUM_CLASSES: 2
      REC_HEAD:
        BATCH_SIZE: 1
    DATASETS:
      TRAIN: ("vintext_train", "vintext_val")
      TEST:  ("vintext_test",)
    SOLVER:
      IMS_PER_BATCH: 1
      STEPS: (360000,420000)
      MAX_ITER: 100000
      CHECKPOINT_PERIOD: 10000
    INPUT:
      FORMAT: "RGB"
    

    The training progress was going well until reaching the bad sample:

    src_boxes.shape = torch.Size([300, 4])
    image_size.shape = torch.Size([377, 4])
    

    Here src_boxes.shape is consistent with NUM_PROPOSALS in the config, so I suspect an issue in the ground-truth annotations (downloaded from the links in README.md).

    opened by ccbien 1
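
One reading of the shapes above, stated as an assumption and not a confirmed diagnosis, is that the failing image has 377 ground-truth instances while only `NUM_PROPOSALS: 300` predictions are available to match against them. The sketch below scans a COCO-format annotation file for images with more instances than a given limit; the helper name `crowded_images` is hypothetical.

```python
import json
from collections import Counter


def crowded_images(coco, num_proposals=300):
    """Return {image_id: instance_count} for images with more than num_proposals instances."""
    counts = Counter(ann["image_id"] for ann in coco.get("annotations", []))
    return {image_id: n for image_id, n in counts.items() if n > num_proposals}


def load_coco(path):
    """Load a COCO-format annotation file into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

If this returns any images for the VinText training annotations, filtering those images or raising `NUM_PROPOSALS` would be two things to try.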
  • LSVT training available?

    Hi, I am trying to train on the LSVT dataset with this repo.

    It seems you have trained on LSVT (https://github.com/mxin262/SwinTextSpotter/blob/e238a4b5d0c127480a838c6245c1e5e9eb2f9d59/detectron2/data/datasets/builtin.py#L58), but there is no transcription for it.

    Is the LSVT data trainable with this code? If so, can you provide the config files and lexicon for it?

    opened by jeong-tae 19
  • demo.py predict for one image need 6s

    Hi, I have another question: why does one image take 6 s to predict? PAN++ reaches 85 FPS; what is the speed of this model? I didn't find it in the paper. Please tell me, thanks!

    opened by Patickk 2
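
A single run of demo.py may include process start-up and model loading, so a per-image figure is easier to compare with a reported FPS when the call is timed over repeated runs after a warm-up. The sketch below is a generic timing helper and not part of the repository; `time_calls` is a hypothetical name and `fn` stands for whatever function performs inference on one image.

```python
import time


def time_calls(fn, arg, warmup=1, runs=5):
    """Return the mean seconds per call of fn(arg), excluding warm-up calls."""
    for _ in range(warmup):
        fn(arg)  # first calls may include one-off setup cost
    start = time.perf_counter()
    for _ in range(runs):
        fn(arg)
    return (time.perf_counter() - start) / runs


if __name__ == "__main__":
    # Stand-in workload; replace with the real per-image inference call.
    mean_seconds = time_calls(lambda n: sum(range(n)), 100_000)
    print(f"{mean_seconds:.6f} s per call")
```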
Owner
mxin262
Automatic number plate recognition using tech: Yolo, OCR, Scene text detection, scene text recognation, flask, torch

Automatic Number Plate Recognition Automatic Number Plate Recognition (ANPR) is the process of reading the characters on the plate with various optica

Meftun AKARSU 47 Sep 19, 2022
Code for CVPR 2021 oral paper "Exploring Data-Efficient 3D Scene Understanding with Contrastive Scene Contexts"

Exploring Data-Efficient 3D Scene Understanding with Contrastive Scene Contexts The rapid progress in 3D scene understanding has come with growing dem

Facebook Research 170 Sep 30, 2022
[CVPR 2022] CoTTA Code for our CVPR 2022 paper Continual Test-Time Domain Adaptation

CoTTA Code for our CVPR 2022 paper Continual Test-Time Domain Adaptation Prerequisite Please create and activate the following conda envrionment. To r

Qin Wang 73 Sep 22, 2022
Neural Scene Graphs for Dynamic Scene (CVPR 2021)

Implementation of Neural Scene Graphs, that optimizes multiple radiance fields to represent different objects and a static scene background. Learned representations can be rendered with novel object compositions and views.

null 135 Sep 28, 2022
Code for the paper "MASTER: Multi-Aspect Non-local Network for Scene Text Recognition" (Pattern Recognition 2021)

MASTER-PyTorch PyTorch reimplementation of "MASTER: Multi-Aspect Non-local Network for Scene Text Recognition" (Pattern Recognition 2021). This projec

Wenwen Yu 251 Sep 23, 2022
Pytorch implementation of Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors

Make-A-Scene - PyTorch Pytorch implementation (inofficial) of Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors (https://arxiv.org/

Casual GAN Papers 239 Sep 26, 2022
PyTorch implementations of neural network models for keyword spotting

Honk: CNNs for Keyword Spotting Honk is a PyTorch reimplementation of Google's TensorFlow convolutional neural networks for keyword spotting, which ac

Castorini 468 Sep 8, 2022
[CVPR 2022 Oral] Crafting Better Contrastive Views for Siamese Representation Learning

Crafting Better Contrastive Views for Siamese Representation Learning (CVPR 2022 Oral) 2022-03-29: The paper was selected as a CVPR 2022 Oral paper! 2

null 228 Sep 28, 2022
A weakly-supervised scene graph generation codebase. The implementation of our CVPR2021 paper ``Linguistic Structures as Weak Supervision for Visual Scene Graph Generation''

README.md shall be finished soon. WSSGG 0 Overview 1 Installation 1.1 Faster-RCNN 1.2 Language Parser 1.3 GloVe Embeddings 2 Settings 2.1 VG-GT-Graph

Keren Ye 32 Aug 15, 2022
[TIP 2020] Multi-Temporal Scene Classification and Scene Change Detection with Correlation based Fusion

Multi-Temporal Scene Classification and Scene Change Detection with Correlation based Fusion Code for Multi-Temporal Scene Classification and Scene Ch

Lixiang Ru 32 Sep 23, 2022
Unofficial TensorFlow implementation of the Keyword Spotting Transformer model

Keyword Spotting Transformer This is the unofficial TensorFlow implementation of the Keyword Spotting Transformer model. This model is used to train o

Intelligent Machines Limited 8 May 11, 2022
Scene-Text-Detection-and-Recognition (Pytorch)

Scene-Text-Detection-and-Recognition (Pytorch) Competition URL: https://tbrain.t

Gi-Luen Huang 7 Aug 9, 2022
Code for "Primitive Representation Learning for Scene Text Recognition" (CVPR 2021)

Primitive Representation Learning Network (PREN) This repository contains the code for our paper accepted by CVPR 2021 Primitive Representation Learni

Ruijie Yan 75 Sep 15, 2022
HSC4D: Human-centered 4D Scene Capture in Large-scale Indoor-outdoor Space Using Wearable IMUs and LiDAR. CVPR 2022

HSC4D: Human-centered 4D Scene Capture in Large-scale Indoor-outdoor Space Using Wearable IMUs and LiDAR. CVPR 2022 [Project page | Video] Getting sta

null 46 Sep 19, 2022
Code for "Neural 3D Scene Reconstruction with the Manhattan-world Assumption" CVPR 2022 Oral

News 05/10/2022 To make the comparison on ScanNet easier, we provide all quantitative and qualitative results of baselines here, including COLMAP, COL

ZJU3DV 333 Sep 30, 2022
PyTorch implementation of paper "Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes", CVPR 2021

Neural Scene Flow Fields PyTorch implementation of paper "Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes", CVPR 20

Zhengqi Li 538 Sep 20, 2022
Official PyTorch implementation of Synergies Between Affordance and Geometry: 6-DoF Grasp Detection via Implicit Representations

Synergies Between Affordance and Geometry: 6-DoF Grasp Detection via Implicit Representations Zhenyu Jiang, Yifeng Zhu, Maxwell Svetlik, Kuan Fang, Yu

UT-Austin Robot Perception and Learning Lab 53 Sep 15, 2022
Official PyTorch code of DeepPanoContext: Panoramic 3D Scene Understanding with Holistic Scene Context Graph and Relation-based Optimization (ICCV 2021 Oral).

DeepPanoContext (DPC) [Project Page (with interactive results)][Paper] DeepPanoContext: Panoramic 3D Scene Understanding with Holistic Scene Context G

Cheng Zhang 60 Sep 26, 2022