IMGUR5K handwriting set. It is an in-the-wild handwriting dataset containing challenging real-world handwritten samples from different writers. The dataset is shared as a set of image URLs with annotations. This code downloads the images and verifies each image's hash to avoid data contamination.

Facebook Research

Last update: Dec 26, 2022

Related tags

Computer Vision IMGUR5K-Handwriting-Dataset

Overview

IMGUR5K Handwriting Dataset

To download the images from the URLs and generate the corresponding annotations, run:

Usage: python download_imgur5k.py --dataset_info_dir <dir_with_annotation_and_hashes> --output_dir <path_to_store_images>

Requirements

IMGUR5K download code works with

Python3

Downloading images of IMGUR5K

Run the command and set <path_to_store_images> to the target image directory

How IMGUR5K download works

The code validates each URL by comparing the MD5 hash of the downloaded image with the ground-truth MD5 hash. If the image is pristine, its annotations are added to the generated annotations file and to the file for its split.
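The check described above can be sketched as follows. This is a minimal illustration using Python's standard `hashlib`, not the code from download_imgur5k.py; the function names `md5_hex` and `is_pristine` are hypothetical, and stand-in bytes replace a real download.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of raw image bytes."""
    return hashlib.md5(data).hexdigest()


def is_pristine(image_bytes: bytes, reference_md5: str) -> bool:
    """True when the downloaded bytes hash to the ground-truth value."""
    return md5_hex(image_bytes) == reference_md5.strip().lower()


# Stand-in bytes instead of a real download (assumption for illustration).
payload = b"example image bytes"
reference = md5_hex(payload)

print(is_pristine(payload, reference))         # content unchanged
print(is_pristine(payload + b"x", reference))  # content altered
```

An image that fails this comparison would be skipped, so its annotations never reach the generated files.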

Full documentation

IMGUR5K is shared as a set of image URLs with annotations. This code downloads the images and verifies each image's hash to avoid data contamination.

REQUIRED FILES:

download_imgur5k.py : Code to download the images from the URLs and build the dataset.
<dataset_info_dir>/imgur5k_data.lst : File containing URLs with annotations and bounding boxes.
<dataset_info_dir>/imgur5k_hashes.lst : File containing URL indexes with ground-truth MD5 hashes.
<dataset_info_dir>/train_index_ids.lst : File containing URL indexes belonging to the train split.
<dataset_info_dir>/val_index_ids.lst : File containing URL indexes belonging to the val split.
<dataset_info_dir>/test_index_ids.lst : File containing URL indexes belonging to the test split.
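Before starting a long download it can help to confirm that the listed files are present. The sketch below is an illustrative pre-flight check, not part of the repository; the helper name `missing_required_files` is hypothetical, and only the file names come from the list above.

```python
from pathlib import Path

# File names taken from the REQUIRED FILES list.
REQUIRED_FILES = (
    "imgur5k_data.lst",
    "imgur5k_hashes.lst",
    "train_index_ids.lst",
    "val_index_ids.lst",
    "test_index_ids.lst",
)


def missing_required_files(dataset_info_dir: str) -> list:
    """Return the required file names that are absent from dataset_info_dir."""
    base = Path(dataset_info_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_required_files("dataset_info")  # example directory name
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All required files found.")
```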

Output:

<path_to_store_images>/.jpg :
- Images downloaded to output_dir
imgur5k_annotations.json :
- json file with image annotation mappings -> written to dataset_info_dir
  - Format: { "index_id" : {indexes}, "index_to_annotation_map" : { annotations ids for an index}, "annotation_id": { each annotation's info } }
  - Annotation ID: bounding_box in xywha format
  - Bounding boxes with '.' mean that the annotation was not done, for various reasons
imgur5k_annotations_train.json :
- json file with image annotation mappings of TRAIN split only -> written to dataset_info_dir
imgur5k_annotations_val.json :
- json file with image annotation mappings of VAL split only -> written to dataset_info_dir
imgur5k_annotations_test.json :
- json file with image annotation mappings of TEST split only -> written to dataset_info_dir

[All imgur5k_annotations_*.json files follow a format similar to that of imgur5k_annotations.json]
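The three-part layout described in the Format line can be navigated roughly as shown below. This is a sketch under assumptions: the top-level key names are copied from the Format line and should be checked against the generated file, the per-annotation field name `bounding_box` and its storage as a JSON-style string are assumed, and the sample record is invented for illustration.

```python
import json

UNANNOTATED = "."  # marker the README describes for boxes that were not annotated


def annotations_for_index(data: dict, index: str) -> list:
    """Return the annotation records attached to one image index."""
    ann_ids = data["index_to_annotation_map"].get(index, [])
    return [data["annotation_id"][ann_id] for ann_id in ann_ids]


def annotated_boxes(data: dict, index: str) -> list:
    """Return parsed boxes for an index, skipping unannotated entries."""
    boxes = []
    for info in annotations_for_index(data, index):
        box = info.get("bounding_box", UNANNOTATED)
        if box != UNANNOTATED:
            boxes.append(json.loads(box))  # assumed to be a JSON-style list string
    return boxes


# Invented sample mirroring the documented layout (not real dataset content).
sample = json.loads("""
{
  "index_id": {"abc123": {}},
  "index_to_annotation_map": {"abc123": ["abc123_0", "abc123_1"]},
  "annotation_id": {
    "abc123_0": {"bounding_box": "[10.0, 20.0, 30.0, 8.0, 0.0]"},
    "abc123_1": {"bounding_box": "."}
  }
}
""")

print(annotated_boxes(sample, "abc123"))
```

With a real file, `sample` would instead come from `json.load` on imgur5k_annotations.json or one of the split files.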

NOTE: Apart from the ~5K images used in the TextStyleBrush paper, ~4K more images have been added to the dataset to foster research in handwriting recognition.

Contribution

See the CONTRIBUTING file for how to help out.

License

IMGUR5K is licensed under the Creative Commons Attribution-NonCommercial 4.0 International Public License, as found in the LICENSE file.

Comments

Can't understand box format

Hello, thank you for sharing the dataset! It looks like hard work has been done! 👋👋👋

I have one question about the annotations: I can't understand the format of the bboxes: bounding_box in xywha format? What does the "a" at the end mean? I know three popular formats:

- pascal_voc: [x_min, y_min, x_max, y_max]
- coco: [x_min, y_min, width, height]
- yolo: [x_center, y_center, width, height], where x_center and y_center are the normalized coordinates

https://albumentations.ai/docs/getting_started/bounding_boxes_augmentation/

But I can't find any "a" in them. Can you please help me understand this?

opened by Kakoedlinnoeslovo 7
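The README does not spell out what xywha stands for. One plausible reading, assumed here and not confirmed by the text on this page, is a rotated box: (x, y) is the box center, w and h are its width and height, and a is a rotation angle in degrees. Under that assumption, the four corners could be recovered as in the sketch below; the function name `xywha_to_corners` is hypothetical, and the direction of rotation should be verified visually against a few images.

```python
import math


def xywha_to_corners(x, y, w, h, a_degrees):
    """Corners of a rotated box, ASSUMING (x, y) is the center and
    a_degrees is the rotation angle in degrees (unverified assumption)."""
    theta = math.radians(a_degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    half_w, half_h = w / 2.0, h / 2.0
    # Corner offsets from the center before rotation.
    offsets = [
        (-half_w, -half_h),
        (half_w, -half_h),
        (half_w, half_h),
        (-half_w, half_h),
    ]
    # Standard 2D rotation of each offset, then translation to the center.
    return [
        (x + dx * cos_t - dy * sin_t, y + dx * sin_t + dy * cos_t)
        for dx, dy in offsets
    ]


# With a zero angle the result is an ordinary axis-aligned box.
print(xywha_to_corners(10.0, 20.0, 4.0, 2.0, 0.0))
```

Drawing the resulting polygon over a sample image is a quick way to check whether the assumed center convention and angle sign match the data.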
Fix the `UnicodeDecodeError` in `np.loadtxt`

On Windows 10 np.loadtxt raises UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 1983: character maps to <undefined>. On CentOS 7: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 858: ordinal not in range(128)

Specifying the UTF-8 encoding fixes the problem. Tested on Windows 10 and CentOS 7.
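The failure mode and the fix can be reproduced with the standard library alone. The sketch below writes a small UTF-8 file containing a non-ASCII character, then reads it once with an explicit UTF-8 encoding and once with ASCII. The file content is invented for illustration. `np.loadtxt` accepts an `encoding` keyword that plays the same role; the exact call used in download_imgur5k.py is not shown on this page, so it is not reproduced here.

```python
import os
import tempfile

# Invented sample line with a non-ASCII character (assumption for illustration).
text = "caf\u00e9\tword\n"

with tempfile.NamedTemporaryFile(
    "w", suffix=".lst", encoding="utf-8", delete=False
) as handle:
    handle.write(text)
    path = handle.name

try:
    # An explicit encoding gives the same result on every platform.
    with open(path, encoding="utf-8") as handle:
        decoded = handle.read()

    # A narrower codec fails on the non-ASCII byte, as in the reported errors.
    try:
        with open(path, encoding="ascii") as handle:
            handle.read()
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True
finally:
    os.remove(path)

print(decoded == text, ascii_failed)
```

Relying on the platform default encoding is what makes the behavior differ between Windows and Linux, which is why pinning it to UTF-8 resolves both reports.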

In contributing guidelines you ask for tests. But I'm not sure what kind of tests should I add since there are no other tests in the repo. Happy to do that if you clarify this point. I completed the CLA.
CLA Signed

opened by dmitrijsk 2
Fix the `UnicodeDecodeError` in `np.loadtxt`

(This is a corrected version of PR #8)

On Windows 10 np.loadtxt raises UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 1983: character maps to <undefined>. On CentOS 7: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 858: ordinal not in range(128)

Specifying the UTF-8 encoding fixes the problem. Tested on Windows 10 and CentOS 7.

In contributing guidelines you ask for tests. But I'm not sure what kind of tests should I add since there are no other tests in the repo. Happy to do that if you clarify this point. I completed the CLA.
CLA Signed

opened by dmitrijsk 1
Some images were removed from imgur

2021.11.24

For IMG: lRgjZ, ref hash: c64945bd74c067f29e01f2f3b5eeff60 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: NeVsJy7, ref hash: 924eb5398cea242b01f43e73b1a12811 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: 7IUPrpZ, ref hash: 3e4f912a1e9d91c35c68c0880826e680 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: hdZPXS2, ref hash: 40f094f7bf1e56ed56cc2fcb8adcff14 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: Nokn65F, ref hash: 8ef5355846c5806d280a7ef563bc3f45 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: hwq46gA, ref hash: 8c27046ac37905291bbd9cb2cec72241 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: xBG71ye, ref hash: 9a7ea2e2e5c1ee3f5da627af7d253f09 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: KSSvYJH, ref hash: f7f71a1646fbdba638eca0365f09cff6 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: nUnLGVR, ref hash: 0e8e826cb85b53f5a459c0f0eed36d4f != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: PFTXe3d, ref hash: 1f372a9c9fc035ff0b57a4f21e070b9c != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: fa3NgTS, ref hash: 616f5801a0bdbdacf26e78536641a860 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: pyXUSxO, ref hash: ea7b303e76d47ce8555286f67bccad5b != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: GjgtyBl, ref hash: b6980d39ce80b2a3085cd89c537327b7 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: Cs0smsA, ref hash: ac942db4d0071e882db20dbca2de8d5d != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: kHgtG4H, ref hash: 532bf487cee2a3266f6985ce322626f2 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: kyRKrOy, ref hash: 79dab6bff97aa22fb8aac47676dd150f != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: 91V1uHF, ref hash: 6fd3da585984de869c9e3f85ab96fd72 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: 1IYlYlq, ref hash: 641076db3f95efea3fb35782777dabaf != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: sAOdjXq, ref hash: 730fd6033c8f255d4f1774b2d049922e != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: mlmRA89, ref hash: cb2b6705e71a3f8fb4ca29640b3de230 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: 3TIryzT, ref hash: fef7718a45ee39d5ab324a0d792f8ee0 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: pldp0ke, ref hash: 7cbd0528faa5018e08d4e08834ebc8ab != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: s7WGXwr, ref hash: 6c25277ca43925cd93eac806fb646937 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: HIAwuPd, ref hash: 150c87bb0dc4d7819abf46807eafbf39 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: DGafbuR, ref hash: 212a52ab552f75a6d4655e07865188c0 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: PPqWkdx, ref hash: d8c4a27288f0c4db3a716dc3fd06dee2 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: 03IJytp, ref hash: 9ff28f403eac64b136006a5c86a49c84 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: pjRXC0f, ref hash: 36016c1784a21f092f26e78c27c7d064 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: 6De62VB, ref hash: 3ad6e31174112f63b633db85644238a0 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: 00Wo8nQ, ref hash: 7cedb0a7914a5336d2de9a21a58eb788 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: bX1Ajfi, ref hash: d7cfd20cddfe6a9fee3b9bea5e1f6564 != cur hash: d835884373f4d6c8f24742ceabe74946
For IMG: Idip0tp, ref hash: 8785533373eb588fd1e49a7537894692 != cur hash: d835884373f4d6c8f24742ceabe74946
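Every entry in this report shows the same current hash, which is consistent with the issue title: the URLs now appear to return one identical replacement file instead of the original images. A log in this shape can be summarized with a short script such as the sketch below. The regular expression is inferred from the lines in this report, the helper name `parse_mismatches` is hypothetical, and only the first two entries are used as sample input.

```python
import re
from collections import Counter

# Pattern inferred from the log lines in this report.
LOG_PATTERN = re.compile(
    r"For IMG:\s*(?P<image_id>[^,\s]+),\s*"
    r"ref hash:\s*(?P<ref>[0-9a-f]{32})\s*!=\s*"
    r"cur hash:\s*(?P<cur>[0-9a-f]{32})"
)


def parse_mismatches(log_text: str) -> list:
    """Return (image_id, reference_hash, current_hash) tuples from a log."""
    return [
        (m.group("image_id"), m.group("ref"), m.group("cur"))
        for m in LOG_PATTERN.finditer(log_text)
    ]


sample_log = (
    "For IMG: lRgjZ, ref hash: c64945bd74c067f29e01f2f3b5eeff60 "
    "!= cur hash: d835884373f4d6c8f24742ceabe74946 "
    "For IMG: NeVsJy7, ref hash: 924eb5398cea242b01f43e73b1a12811 "
    "!= cur hash: d835884373f4d6c8f24742ceabe74946"
)

entries = parse_mismatches(sample_log)
current_hash_counts = Counter(cur for _, _, cur in entries)

print([image_id for image_id, _, _ in entries])
print(current_hash_counts.most_common(1))
```

A single dominant current hash points to removed images rather than to random corruption, which would produce many different hashes.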

opened by ymmshi 1
Licensing for commercial use?

Hello,

We would like to use the dataset as additional training data for our OCR model. However, the current license does not allow the data to be used for commercial purposes. Is it possible for a company to license the dataset for such purposes? If so, whom can we contact about this? Or is this not possible because the data originates from imgur?

opened by Luux 1
Missing Numpy Dependency in Documentation

NumPy is a required dependency, but it is missing from the README documentation.

https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset/blob/main/download_imgur5k.py#L24

opened by ColeMurray 1
Add Parallel Execution of Image Download

To improve runtime performance, add parallelism to image downloading

Note: The totExec count is slightly different than the original in the case of an image with content len < 100
CLA Signed

opened by ColeMurray 0
Add Parallel Execution of Image Download

To improve runtime performance, add parallelism to image downloading

Note: The tot_evals count is slightly different as we are now counting content with len(100). This avoids introducing additional logic to distinguish between mismatch hash vs bad content.
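One way to parallelize downloads is a thread pool, since the work is dominated by network waits. The sketch below is an illustration of that idea, not the code from this pull request. The names `download_all` and `fetch` are hypothetical, and the fetch function is injected so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def download_all(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    max_workers: int = 8,
) -> Dict[str, Optional[bytes]]:
    """Fetch every URL concurrently; a failed fetch maps to None."""

    def safe_fetch(url: str) -> Optional[bytes]:
        try:
            return fetch(url)
        except Exception:
            # One bad URL should not abort the whole batch.
            return None

    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results align with url_list.
        results = list(pool.map(safe_fetch, url_list))
    return dict(zip(url_list, results))


def fake_fetch(url: str) -> bytes:
    """Stand-in for a real HTTP request (assumption for illustration)."""
    if url.endswith("missing.jpg"):
        raise IOError("simulated download failure")
    return url.encode("utf-8")


downloaded = download_all(
    ["https://example.invalid/a.jpg", "https://example.invalid/missing.jpg"],
    fake_fetch,
)
print({url: data is not None for url, data in downloaded.items()})
```

In real use, `fetch` would wrap an HTTP client call, and the returned bytes would still go through the MD5 check before any annotations are written.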
CLA Signed

opened by ColeMurray 0
[Feature Request] Add Parallel Download for Image Urls

To improve user download speed, utilize Python's threading or multiprocessing library. https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset/blob/main/download_imgur5k.py#L93

Are we open to adding this to the downloader?

opened by ColeMurray 1

Owner

Facebook Research

GitHub

Dataset and Code for ICCV 2021 paper "Real-world Video Super-resolution: A Benchmark Dataset and A Decomposition based Learning Scheme"

Dataset and Code for RealVSR Real-world Video Super-resolution: A Benchmark Dataset and A Decomposition based Learning Scheme Xi Yang, Wangmeng Xiang,

91 Nov 22, 2022

Handwritten Text Recognition (HTR) system implemented with TensorFlow (TF) and trained on the IAM off-line HTR dataset. This Neural Network (NN) model recognizes the text contained in the images of segmented words.

Handwritten-Text-Recognition Handwritten Text Recognition (HTR) system implemented with TensorFlow (TF) and trained on the IAM off-line HTR dataset. T

27 Jan 8, 2023

Layout Analysis Evaluator for the ICDAR 2017 competition on Layout Analysis for Challenging Medieval Manuscripts

LayoutAnalysisEvaluator Layout Analysis Evaluator for: ICDAR 2019 Historical Document Reading Challenge on Large Structured Chinese Family Records ICD

17 Dec 8, 2022

This repository lets you train neural networks models for performing end-to-end full-page handwriting recognition using the Apache MXNet deep learning frameworks on the IAM Dataset.

Handwritten Text Recognition (OCR) with MXNet Gluon These notebooks have been created by Jonathan Chung, as part of his internship as Applied Scientis

422 Jan 3, 2023

ISI's Optical Character Recognition (OCR) software for machine-print and handwriting data

VistaOCR ISI's Optical Character Recognition (OCR) software for machine-print and handwriting data Publications "How to Efficiently Increase Resolutio

ISI Center for Vision, Image, Speech, and Text Analytics

21 Dec 8, 2021

Python package for handwriting and sketching in Jupyter cells

ipysketch A Python package for handwriting and sketching in Jupyter notebooks. Usage A movie is worth a thousand pictures is worth a million words...

16 Jan 5, 2023

Handwriting Recognition System based on a deep Convolutional Recurrent Neural Network architecture

Handwriting Recognition System This repository is the Tensorflow implementation of the Handwriting Recognition System described in Handwriting Recogni

346 Jan 7, 2023

Convert Text-to Handwriting Using Python

Convert Text-to Handwriting Using Python Description In this project we'll use python library that's "pywhatkit" for converting text to handwriting. t

8 Nov 19, 2022

This tool will help you convert your text to handwriting xD

So your teacher asked you to upload written assignments? Hate writing assigments? This tool will help you convert your text to handwriting xD

4.2k Jan 7, 2023

Total Text Dataset. It consists of 1555 images with more than 3 different text orientations: Horizontal, Multi-Oriented, and Curved, one of a kind.

Total-Text-Dataset (Official site) Updated on April 29, 2020 (Detection leaderboard is updated - highlighted E2E methods. Thank you shine-lcy.) Update

671 Dec 27, 2022

Detect handwritten words in a text-line (classic image processing method).

Word segmentation Implementation of scale space technique for word segmentation as proposed by R. Manmatha and N. Srimal. Even though the paper is fro

190 Jan 3, 2023

Use Convolutional Recurrent Neural Network to recognize the Handwritten line text image without pre segmentation into words or characters. Use CTC loss Function to train.

Handwritten Line Text Recognition using Deep Learning with Tensorflow Description Use Convolutional Recurrent Neural Network to recognize the Handwrit

224 Jan 7, 2023

This is used to convert a string to an Image with Handwritten Characters.

Text-to-Handwriting-using-python This is used to convert a string to an Image with Handwritten Characters. text_to_handwriting(string: str, save_to: s

3 Aug 15, 2022

Code for generating synthetic text images as described in "Synthetic Data for Text Localisation in Natural Images", Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, CVPR 2016.

SynthText Code for generating synthetic text images as described in "Synthetic Data for Text Localisation in Natural Images", Ankush Gupta, Andrea Ved

1.8k Dec 28, 2022


Handwritten Number Recognition using CNN and Character Segmentation

Turn images of tables into CSV data. Detect tables from images and run OCR on the cells.

Handwritten Text Recognition (HTR) using TensorFlow 2.x

Handwritten Text Recognition (HTR) system implemented with TensorFlow.

OCR software for recognition of handwritten text