Generate text line images for training deep learning OCR models (e.g. CRNN)

Last update: Jan 6, 2023

Overview

Text Renderer

Generate text line images for training deep learning OCR models (e.g. CRNN).

Modular design. You can easily add different components: Corpus, Effect, Layout.
Integrates with imgaug; see imgaug_example for usage.
Supports rendering multiple corpora on one image with different effects. Layout is responsible for arranging the corpora relative to each other.
Supports applying effects at different stages of the rendering process: corpus_effects, layout_effects, render_effects.
Generates vertical text.
Supports generating lmdb datasets compatible with PaddleOCR; see Dataset.
A web font viewer.
Corpus sampler: helps balance character frequencies.

Documentation

Run Example

Run the following commands to generate images using the example data:

git clone https://github.com/oh-my-ocr/text_renderer
cd text_renderer
python3 setup.py develop
pip3 install -r docker/requirements.txt
python3 main.py \
    --config example_data/example.py \
    --dataset img \
    --num_processes 2 \
    --log_period 10

The data is generated in the example_data/output directory. A labels.json file contains all annotations in the following format:

{
  "labels": {
    "000000000": "test",
    "000000001": "text2"
  },
  "sizes": {
    "000000000": [
      120,
      32 
    ],
    "000000001": [
      128,
      32 
    ]
  },
  "num-samples": 2
}
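
As a rough illustration of how this file could be consumed, the sketch below pairs each image ID with its text and size. The function names are hypothetical, and it assumes each sizes entry is [width, height], which is suggested by the second value matching the fixed height of 32 but is not stated in the README.

```python
import json
from pathlib import Path


def parse_labels(data):
    """Return (image_id, text, width, height) tuples from parsed labels.json."""
    samples = []
    for image_id, text in sorted(data["labels"].items()):
        # Assumption: each size entry is [width, height].
        width, height = data["sizes"][image_id]
        samples.append((image_id, text, width, height))
    # "num-samples" should match the number of label entries.
    if len(samples) != data["num-samples"]:
        raise ValueError("num-samples does not match the number of labels")
    return samples


def load_labels(labels_path):
    """Read labels.json from disk and parse it."""
    text = Path(labels_path).read_text(encoding="utf-8")
    return parse_labels(json.loads(text))
```

Calling load_labels on the generated labels.json would give one tuple per image, which can then be fed to a training data loader.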

You can also use --dataset lmdb to store images in an lmdb file. The lmdb file contains the following keys:

num-samples
image-000000000
label-000000000
size-000000000
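
The key names above follow a simple pattern, sketched below with a hypothetical helper. It assumes a 9-digit zero-padded sample index, as the listing suggests; how the values are encoded is not covered here.

```python
def lmdb_keys(index, width=9):
    """Build the three per-sample key names for a given sample index."""
    # Assumption: indices are zero-padded to 9 digits, as in the listing.
    sample_id = f"{index:0{width}d}"
    return {
        "image": f"image-{sample_id}",
        "label": f"label-{sample_id}",
        "size": f"size-{sample_id}",
    }


print(lmdb_keys(0))
```

When reading with the lmdb Python package, keys are passed as bytes, so each name would typically be encoded first, for example with key.encode().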

You can check the config file example_data/example.py to learn how to use text_renderer, or follow the Quick Start to learn how to set up a configuration.

Quick Start

Prepare file resources

Font files: .ttf, .otf, .ttc
Background images of any size, either from your business scenario or from publicly available datasets (COCO, VOC)
Corpus: text_renderer offers a wide variety of text sampling methods. To use them, consider the preparation of the corpus from two perspectives:

The corpus must be in the target language in which you want to perform OCR
The corpus should meet your actual business needs, such as the education field, the medical field, etc.

Charset file [optional but recommended]: OCR models in real-world scenarios (e.g. CRNN) usually support only a limited character set, so it is better to filter out characters outside that set during data generation. You can do this by setting the chars_file parameter.
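
To make the idea of charset filtering concrete, here is a minimal sketch. It is not the library's implementation: the helper names are hypothetical, and the one-character-per-line file format is an assumption. text_renderer may also treat out-of-set characters differently, for example by skipping the whole sample.

```python
def load_charset(lines):
    """Build a character set from an iterable of lines, one character per line."""
    charset = set()
    for line in lines:
        char = line.rstrip("\n")
        if char:
            charset.add(char)
    return charset


def filter_text(text, charset):
    """Drop every character of `text` that is not in `charset`."""
    return "".join(ch for ch in text if ch in charset)


charset = load_charset(["a\n", "b\n", "c\n"])
print(filter_text("abc-xyz-cab", charset))  # prints "abccab"
```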

You can download pre-prepared file resources for this Quick Start from here:

Save these resource files in the same directory:

workspace
├── bg
│ └── background.png
├── corpus
│ └── eng_text.txt
└── font
    └── simsun.ttf
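
Before writing the config, it can help to confirm that the layout above is in place. The following is a small sketch with a hypothetical helper; it checks only for the three directories and for at least one font file with a supported extension.

```python
from pathlib import Path

FONT_SUFFIXES = {".ttf", ".otf", ".ttc"}


def check_workspace(root):
    """Return a list of problems found in the workspace layout (empty if none)."""
    root = Path(root)
    problems = []
    for name in ("bg", "corpus", "font"):
        if not (root / name).is_dir():
            problems.append(f"missing directory: {name}")
    font_dir = root / "font"
    if font_dir.is_dir():
        # Look for at least one file with a supported font extension.
        fonts = [p for p in font_dir.iterdir() if p.suffix.lower() in FONT_SUFFIXES]
        if not fonts:
            problems.append("no .ttf/.otf/.ttc files in font")
    return problems
```

An empty list from check_workspace("workspace") means the directories expected by the Quick Start config are present.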

Create config file

Create a config.py file in the workspace directory. A configuration file must define a configs variable, which is a list of GeneratorCfg objects.

The complete configuration file is as follows:

import os
from pathlib import Path

from text_renderer.effect import *
from text_renderer.corpus import *
from text_renderer.config import (
    RenderCfg,
    NormPerspectiveTransformCfg,
    GeneratorCfg,
    SimpleTextColorCfg,
)

CURRENT_DIR = Path(os.path.abspath(os.path.dirname(__file__)))


def story_data():
    return GeneratorCfg(
        num_image=10,
        save_dir=CURRENT_DIR / "output",
        render_cfg=RenderCfg(
            bg_dir=CURRENT_DIR / "bg",
            height=32,
            perspective_transform=NormPerspectiveTransformCfg(20, 20, 1.5),
            corpus=WordCorpus(
                WordCorpusCfg(
                    text_paths=[CURRENT_DIR / "corpus" / "eng_text.txt"],
                    font_dir=CURRENT_DIR / "font",
                    font_size=(20, 30),
                    num_word=(2, 3),
                ),
            ),
            corpus_effects=Effects(Line(0.9, thickness=(2, 5))),
            gray=False,
            text_color_cfg=SimpleTextColorCfg(),
        ),
    )


configs = [story_data()]

In the configuration above we have:

Specified the location of the resource files
Specified the text sampling method: 2 or 3 words are randomly selected from the corpus
Configured some effects for generation
- Perspective transformation: NormPerspectiveTransformCfg
- Random line effect
- Output image height fixed at 32
- Color image output: gray=False, SimpleTextColorCfg()
Specified font-related parameters: font_size, font_dir
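
The effect of num_word=(2, 3) can be pictured with the sketch below. It is only an illustration under stated assumptions, not WordCorpus itself: sample_words is a hypothetical helper, the range is treated as inclusive, and the words are taken as a consecutive run from a random start position.

```python
import random


def sample_words(words, num_word=(2, 3), rng=random):
    """Pick a run of consecutive words; the run length is drawn from num_word."""
    low, high = num_word
    count = rng.randint(low, high)  # inclusive on both ends
    count = min(count, len(words))  # never ask for more words than exist
    start = rng.randint(0, len(words) - count)
    return " ".join(words[start:start + count])


corpus_words = "the quick brown fox jumps over the lazy dog".split()
print(sample_words(corpus_words, rng=random.Random(0)))
```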

Run

Run main.py. It takes only four arguments:

config: Python config file path
dataset: Dataset format, img or lmdb
num_processes: Number of processes used
log_period: Period of log printing, a value in (0, 100)

All Effect/Layout Examples

Find all effect/layout config examples at link

bg_and_text_mask: Three images of the same width are merged horizontally. This can be used to train GAN models such as EraseNet.

	Name	Example
0	bg_and_text_mask
1	char_spacing_compact
2	char_spacing_large
3	color_image
4	curve
5	dropout_horizontal
6	dropout_rand
7	dropout_vertical
8	emboss
9	extra_text_line_layout
10	line_bottom
11	line_bottom_left
12	line_bottom_right
13	line_horizontal_middle
14	line_left
15	line_right
16	line_top
17	line_top_left
18	line_top_right
19	line_vertical_middle
20	padding
21	perspective_transform
22	same_line_layout_different_font_size
23	vertical_text

Contribution

Corpus: Feel free to contribute more corpus generators to the project. A generator does not need to be generic; it can also be business-specific, such as one that generates ID numbers.

Run in Docker

Build image

docker build -f docker/Dockerfile -t text_renderer .

The config file is specified through the CONFIG environment variable. In example.py, data is generated in the example_data/output directory, so we map this directory to the host.

docker run --rm \
-v `pwd`/example_data/docker_output/:/app/example_data/output \
--env CONFIG=/app/example_data/example.py \
--env DATASET=img \
--env NUM_PROCESSES=2 \
--env LOG_PERIOD=10 \
text_renderer
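
The environment variables in the command above correspond to the four main.py arguments. As a sketch of that mapping, and not the image's actual entrypoint, a hypothetical wrapper could assemble the command like this. The fallback values are assumptions taken from the example commands in this README.

```python
import os


def build_command(env=None):
    """Translate CONFIG/DATASET/NUM_PROCESSES/LOG_PERIOD into main.py arguments."""
    env = os.environ if env is None else env
    return [
        "python3", "main.py",
        "--config", env["CONFIG"],  # required: raises KeyError if unset
        "--dataset", env.get("DATASET", "img"),  # assumed default
        "--num_processes", env.get("NUM_PROCESSES", "2"),  # assumed default
        "--log_period", env.get("LOG_PERIOD", "10"),  # assumed default
    ]


print(build_command({"CONFIG": "/app/example_data/example.py"}))
```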

Font Viewer

Start font viewer

streamlit run tools/font_viewer.py -- web /path/to/fonts_dir

Build docs

cd docs
make html
open _build/html/index.html

Citing text_renderer

If you use text_renderer in your research, please consider using the following BibTeX entry.

@misc{text_renderer,
  author =       {oh-my-ocr},
  title =        {text_renderer},
  howpublished = {\url{https://github.com/oh-my-ocr/text_renderer}},
  year =         {2021}
}

Comments

About the color of text in generated images
I want to generate white text, but it does not seem to work. Why is that? Starting from example.py, I changed:

def base_cfg(
    name: str, corpus, corpus_effects=None, layout_effects=None, layout=None, gray=True
):
    return GeneratorCfg(
        num_image=50,
        save_dir=OUT_DIR / name,
        render_cfg=RenderCfg(
            bg_dir=BG_DIR,
            perspective_transform=perspective_transform,
            gray=gray,
            layout_effects=layout_effects,
            layout=layout,
            corpus=corpus,
            corpus_effects=corpus_effects,
            text_color_cfg=FixedTextColorCfg(),  # SimpleTextColorCfg(),
        ),
    )

and I also modified FixedTextColorCfg:

class FixedTextColorCfg(TextColorCfg):
    # For generate effect/layout example
    def get_color(self, bg_img: PILImage) -> Tuple[int, int, int, int]:
        alpha = 255
        text_color = (255, 255, 255, alpha)

        return text_color
opened by ShulinHE 6
How to keep the background image size?

The generated image size automatically adapts to characters, but I want to keep the original size of the background. How to keep the background size? How to change the generated image background size?

opened by wendaogongyi 3
lmdb2img compatible with PaddleOCR
Steps:

Generate the lmdb files (data.mdb, lock.mdb):

python main.py --config example_data\example.py --dataset lmdb --num_processes 2 --log_period 50

But how do I convert the result into a format compatible with PaddleOCR? Is this right?

python tools/lmdb2img.py inputfiles1 outputfiles2
opened by chccc1994 3
Can I customize the content of the generated json file?

I want to turn the generated json file into a COCO-style annotation file, as in the picture. Can I customize the content of the generated json file?

opened by wendaogongyi 1
Apply principle of least surprise to OneOf
OneOf([ DropOutRand(p=0.1), Line(p=0.4), ])

Expected: Select one of DropOutRand or Line and invoke it with the given probability (10% or 40%).
Actual: Select one of DropOutRand or Line and invoke it with 100% probability.

This PR changes Actual to Expected, and allows OneOf to be used as an Effect.
opened by ELanning 1
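
The behaviour this PR describes as expected can be sketched as follows. The classes are hypothetical stand-ins rather than the library's Effect and OneOf, and uniform selection among the children is an assumption.

```python
import random


class Effect:
    """Hypothetical stand-in for an effect with an apply probability p."""

    def __init__(self, name, p):
        self.name = name
        self.p = p

    def __call__(self, log, rng):
        # Apply the effect only with probability p.
        if rng.random() < self.p:
            log.append(self.name)


class OneOf:
    """Pick one child uniformly, then let that child decide via its own p."""

    def __init__(self, effects):
        self.effects = effects

    def __call__(self, log, rng):
        chosen = rng.choice(self.effects)
        chosen(log, rng)


rng = random.Random(0)
one_of = OneOf([Effect("DropOutRand", p=0.1), Effect("Line", p=0.4)])
log = []
for _ in range(10_000):
    one_of(log, rng)
print(log.count("DropOutRand") / 10_000, log.count("Line") / 10_000)
```

Under these assumptions each child is chosen about half the time, so DropOutRand fires in roughly 5% of calls and Line in roughly 20%, rather than one of the two firing on every call.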
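What causes the white edges around the text?
def get_color(self, bg_img: PILImage) -> Tuple[int, int, int, int]:
    alpha = 255
    # text_color = (255, 50, 0, alpha)  # RGB
    text_color = (self.text_color_gray, self.text_color_gray, self.text_color_gray, alpha)  # RGB
    return text_color

As shown in the image above, I use a fixed text color (FixedTextColorCfg), with no char_spacing set and no variation in width or height. Only one font is configured, at a single size. Change: I modified the get_color method of FixedTextColorCfg so that a color value is passed in through the constructor and every BGR channel is set to that value. Test: the white edge is absent only when the color passed to the FixedTextColorCfg constructor is 255; it appears with every other color. I searched all the issues and did not find an answer. Looking forward to your reply.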
opened by wolfog 0
Unable to apply curve to my generated text

Hi,

I'm trying to generate some images to train a small OCR model. I've configured everything based on the default config.py file and modified a few small things to fit my needs. I would like to add the curve effect to the text, but I've been unable to do so.

Any help on how to add it to the config.py file?

Thanks in advance

opened by EBustoD 0
Who is using text_renderer?
Benchmarking Chinese Text Recognition: Datasets, Baselines, and an Empirical Study

PaddleOCR

"Three Years to Forge a Sword: WeChat OCR Image Text Extraction" (《三年磨一剑——微信OCR图片文字提取》)

MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining

太保科技
opened by Sanster 0
