IMGUR5K handwriting set. It is an in-the-wild handwriting dataset containing challenging real-world handwritten samples from different writers. The dataset is shared as a set of image URLs with annotations. This code downloads the images and verifies each image's hash to avoid data contamination.

Overview

IMGUR5K Handwriting Dataset

To download the images from the URLs and generate the corresponding annotations, run:

Usage: python download_imgur5k.py --dataset_info_dir <dir_with_annotation_and_hashes> --output_dir <path_to_store_images>

Requirements

IMGUR5K download code works with

  • Python3
  • NumPy (imported by download_imgur5k.py)

Downloading images of IMGUR5K

Run the command above, setting <path_to_store_images> to the target image directory.

How IMGUR5K download works

The code checks the validity of each URL by comparing the MD5 hash of the downloaded image with the ground-truth MD5 hash. If the image is pristine, its annotations are added to the generated annotations file and to the file for its respective split.
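The check described above can be sketched with the standard library. This is a minimal illustration, not the code in download_imgur5k.py: the helper names `md5_of_bytes` and `is_pristine` are hypothetical, and stand-in bytes replace a real download so the example runs offline.

```python
import hashlib


def md5_of_bytes(data: bytes) -> str:
    """Return the hex MD5 digest of raw image bytes."""
    return hashlib.md5(data).hexdigest()


def is_pristine(data: bytes, reference_md5: str) -> bool:
    """True when the downloaded bytes match the ground-truth hash."""
    return md5_of_bytes(data) == reference_md5.strip().lower()


# Stand-in bytes instead of a real download (hypothetical content).
original = b"example image bytes"
reference = md5_of_bytes(original)

print(is_pristine(original, reference))             # content unchanged
print(is_pristine(b"replaced content", reference))  # content changed upstream
```

Hashing the bytes rather than the URL string is what lets the check notice an image that was replaced or removed while its URL stayed the same.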

Full documentation

IMGUR5K is shared as a set of image URLs with annotations. This code downloads the images and verifies each image's hash to avoid data contamination.

REQUIRED FILES:

  • download_imgur5k.py : Code that downloads the images from the URLs to build the dataset.
  • <dataset_info_dir>/imgur5k_data.lst : File containing URLs with annotations and bounding boxes.
  • <dataset_info_dir>/imgur5k_hashes.lst : File containing URL indexes with ground-truth MD5 hashes.
  • <dataset_info_dir>/train_index_ids.lst : File containing URL indexes belonging to the train split.
  • <dataset_info_dir>/val_index_ids.lst : File containing URL indexes belonging to the val split.
  • <dataset_info_dir>/test_index_ids.lst : File containing URL indexes belonging to the test split.

Output:

  • <path_to_store_images>/.jpg :
    • Images downloaded to output_dir
  • imgur5k_annotations.json :
    • json file with image annotation mappings -> written to dataset_info_dir
      • Format: { "index_id" : {indexes}, "index_to_annotation_map" : { annotations ids for an index}, "annotation_id": { each annotation's info } }
      • Annotation ID: bounding_box in xywha format
      • Bounding boxes with '.' mean the annotations were not done for various reasons
  • imgur5k_annotations_train.json :
    • json file with image annotation mappings of TRAIN split only -> written to dataset_info_dir
  • imgur5k_annotations_val.json :
    • json file with image annotation mappings of VAL split only -> written to dataset_info_dir
  • imgur5k_annotations_test.json :
    • json file with image annotation mappings of TEST split only -> written to dataset_info_dir

[The format of every imgur5k_annotations_*.json file is similar to that of imgur5k_annotations.json]
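The following sketch shows one way to walk the annotation structure listed above and turn a box into corner points. Several things here are assumptions, not facts from this README: that xywha means x, y, width, height and rotation angle (the common rotated-box reading), that (x, y) is the box centre, that the angle is in degrees, and that a box is stored either as a JSON-style string or as a list. The sample dictionary, the "bounding_box" field name and the helpers `parse_xywha` and `box_corners` are hypothetical. Check a few boxes against the images before relying on it.

```python
import json
import math
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float, float]


def parse_xywha(raw) -> Optional[Box]:
    """Parse a box into (x, y, w, h, a); return None for the '.' placeholder."""
    if isinstance(raw, str):
        if raw.strip() == ".":
            return None  # annotation was not done for this entry
        raw = json.loads(raw)
    x, y, w, h, a = (float(v) for v in raw)
    return x, y, w, h, a


def box_corners(box: Box, degrees: bool = True) -> List[Tuple[float, float]]:
    """Corner points of a rotated box, ASSUMING (x, y) is the box centre."""
    x, y, w, h, a = box
    theta = math.radians(a) if degrees else a
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    offsets = ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2))
    return [(x + dx * cos_t - dy * sin_t, y + dx * sin_t + dy * cos_t)
            for dx, dy in offsets]


# Hypothetical data laid out like the format described above.
annotations = {
    "index_id": {"img0": {}},
    "index_to_annotation_map": {"img0": ["img0_0", "img0_1"]},
    "annotation_id": {
        "img0_0": {"bounding_box": "[50.0, 20.0, 40.0, 10.0, 0.0]"},
        "img0_1": {"bounding_box": "."},
    },
}

usable = {}
for ann_id in annotations["index_to_annotation_map"]["img0"]:
    box = parse_xywha(annotations["annotation_id"][ann_id]["bounding_box"])
    if box is not None:
        usable[ann_id] = box_corners(box)

print(usable)
```

Returning None for '.' keeps the unannotated entries visible to the caller instead of silently turning them into zero-sized boxes.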

NOTE: In addition to the ~5K images used in the TextStyleBrush paper, ~4K more images were added to the dataset to foster research in handwriting recognition.

Contribution

See the CONTRIBUTING file for how to help out.

License

IMGUR5K is licensed under the Creative Commons Attribution-NonCommercial 4.0 International Public License, as found in the LICENSE file.

Comments
  • Can't understand box format

    Hello, thank you for sharing the dataset! It looks like a lot of hard work went into it! 👋👋👋 I have one question about the annotations: I can't understand the format of the bboxes, "bounding_box in xywha format". What does the "a" at the end mean? I know three popular formats:

      • pascal_voc: [x_min, y_min, x_max, y_max]
      • coco: [x_min, y_min, width, height]
      • yolo: [x_center, y_center, width, height], where x_center and y_center are the normalized coordinates

    (see https://albumentations.ai/docs/getting_started/bounding_boxes_augmentation/), but I can't find any "a" in them. Can you please help me understand this?

    opened by Kakoedlinnoeslovo 7
  • Fix the `UnicodeDecodeError` in `np.loadtxt`

    On Windows 10 np.loadtxt raises UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 1983: character maps to <undefined>. On CentOS 7: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 858: ordinal not in range(128)

    Specifying the UTF-8 encoding fixes the problem. Tested on Windows 10 and CentOS 7.
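The failure and the fix can be reproduced with the standard library alone. The sketch below is an illustration, not the PR's diff: the file name and the sample line are hypothetical. `np.loadtxt` accepts an `encoding` keyword, so the equivalent change there is to pass `encoding="utf-8"`.

```python
import os
import tempfile

# Hypothetical annotation line containing a non-ASCII character.
# "é" is encoded in UTF-8 as the bytes 0xc3 0xa9.
line = "abc123\tcafé\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.lst")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(line)

    # Decoding with a narrower codec fails on the first non-ASCII byte.
    try:
        with open(path, encoding="ascii") as fh:
            fh.read()
        ascii_ok = True
    except UnicodeDecodeError:
        ascii_ok = False

    # Naming the encoding explicitly makes the read platform-independent.
    # With NumPy the same idea is: np.loadtxt(path, ..., encoding="utf-8")
    with open(path, encoding="utf-8") as fh:
        decoded = fh.read()

print(ascii_ok, decoded == line)
```

Without an explicit encoding, text decoding falls back to a platform-dependent default, which is why the two systems reported different codecs in their tracebacks.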

    In contributing guidelines you ask for tests. But I'm not sure what kind of tests should I add since there are no other tests in the repo. Happy to do that if you clarify this point. I completed the CLA.

    CLA Signed 
    opened by dmitrijsk 2
  • Fix the `UnicodeDecodeError` in `np.loadtxt`

    (This is a corrected version of PR #8)

    On Windows 10 np.loadtxt raises UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 1983: character maps to <undefined>. On CentOS 7: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 858: ordinal not in range(128)

    Specifying the UTF-8 encoding fixes the problem. Tested on Windows 10 and CentOS 7.

    In contributing guidelines you ask for tests. But I'm not sure what kind of tests should I add since there are no other tests in the repo. Happy to do that if you clarify this point. I completed the CLA.

    CLA Signed 
    opened by dmitrijsk 1
  • Some images were removed from imgur

    2021.11.24

    For IMG: lRgjZ, ref hash: c64945bd74c067f29e01f2f3b5eeff60 != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: NeVsJy7, ref hash: 924eb5398cea242b01f43e73b1a12811 != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: 7IUPrpZ, ref hash: 3e4f912a1e9d91c35c68c0880826e680 != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: hdZPXS2, ref hash: 40f094f7bf1e56ed56cc2fcb8adcff14 != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: Nokn65F, ref hash: 8ef5355846c5806d280a7ef563bc3f45 != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: hwq46gA, ref hash: 8c27046ac37905291bbd9cb2cec72241 != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: xBG71ye, ref hash: 9a7ea2e2e5c1ee3f5da627af7d253f09 != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: KSSvYJH, ref hash: f7f71a1646fbdba638eca0365f09cff6 != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: nUnLGVR, ref hash: 0e8e826cb85b53f5a459c0f0eed36d4f != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: PFTXe3d, ref hash: 1f372a9c9fc035ff0b57a4f21e070b9c != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: fa3NgTS, ref hash: 616f5801a0bdbdacf26e78536641a860 != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: pyXUSxO, ref hash: ea7b303e76d47ce8555286f67bccad5b != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: GjgtyBl, ref hash: b6980d39ce80b2a3085cd89c537327b7 != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: Cs0smsA, ref hash: ac942db4d0071e882db20dbca2de8d5d != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: kHgtG4H, ref hash: 532bf487cee2a3266f6985ce322626f2 != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: kyRKrOy, ref hash: 79dab6bff97aa22fb8aac47676dd150f != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: 91V1uHF, ref hash: 6fd3da585984de869c9e3f85ab96fd72 != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: 1IYlYlq, ref hash: 641076db3f95efea3fb35782777dabaf != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: sAOdjXq, ref hash: 730fd6033c8f255d4f1774b2d049922e != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: mlmRA89, ref hash: cb2b6705e71a3f8fb4ca29640b3de230 != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: 3TIryzT, ref hash: fef7718a45ee39d5ab324a0d792f8ee0 != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: pldp0ke, ref hash: 7cbd0528faa5018e08d4e08834ebc8ab != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: s7WGXwr, ref hash: 6c25277ca43925cd93eac806fb646937 != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: HIAwuPd, ref hash: 150c87bb0dc4d7819abf46807eafbf39 != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: DGafbuR, ref hash: 212a52ab552f75a6d4655e07865188c0 != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: PPqWkdx, ref hash: d8c4a27288f0c4db3a716dc3fd06dee2 != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: 03IJytp, ref hash: 9ff28f403eac64b136006a5c86a49c84 != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: pjRXC0f, ref hash: 36016c1784a21f092f26e78c27c7d064 != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: 6De62VB, ref hash: 3ad6e31174112f63b633db85644238a0 != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: 00Wo8nQ, ref hash: 7cedb0a7914a5336d2de9a21a58eb788 != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: bX1Ajfi, ref hash: d7cfd20cddfe6a9fee3b9bea5e1f6564 != cur hash: d835884373f4d6c8f24742ceabe74946
    For IMG: Idip0tp, ref hash: 8785533373eb588fd1e49a7537894692 != cur hash: d835884373f4d6c8f24742ceabe74946
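Every entry in the report above carries the same current hash, which suggests (but does not prove) that the server returned identical placeholder content for each removed image. A short sketch for tallying such a log follows. The regular expression is written against the message layout quoted in this report, and the helper name `parse_mismatches` is hypothetical.

```python
import re
from collections import Counter

# Matches one "For IMG: ..., ref hash: ... != cur hash: ..." entry.
# \s+ tolerates a line break between "cur" and "hash".
LOG_PATTERN = re.compile(
    r"For IMG: (?P<img>[^,\s]+), ref hash: (?P<ref>[0-9a-f]{32})"
    r" != cur\s+hash: (?P<cur>[0-9a-f]{32})"
)


def parse_mismatches(log_text: str):
    """Return one dict per mismatch entry found in the log text."""
    return [m.groupdict() for m in LOG_PATTERN.finditer(log_text)]


# Two entries copied from the report above.
sample = (
    "For IMG: lRgjZ, ref hash: c64945bd74c067f29e01f2f3b5eeff60 "
    "!= cur hash: d835884373f4d6c8f24742ceabe74946 "
    "For IMG: NeVsJy7, ref hash: 924eb5398cea242b01f43e73b1a12811 "
    "!= cur hash: d835884373f4d6c8f24742ceabe74946"
)

rows = parse_mismatches(sample)
cur_counts = Counter(row["cur"] for row in rows)

print(len(rows))
print(cur_counts.most_common(1))
```

A single dominant value in `cur_counts` points to images that now resolve to the same content, as opposed to images that were individually edited.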

    opened by ymmshi 1
  • Licensing for commercial use?

    Hello,

    we would like to use the dataset as additional training data for our OCR model. However, the current license does not allow the data to be used for commercial purposes. Is it possible for a company to license the dataset for such purposes? If so, who can we contact in that regard? Or is this not possible because the data originates from imgur?

    opened by Luux 1
  • Missing Numpy Dependency in Documentation

    Numpy is missing as a required dependency in the readme documentation

    https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset/blob/main/download_imgur5k.py#L24

    opened by ColeMurray 1
  • Add Parallel Execution of Image Download

    To improve runtime performance, add parallelism to image downloading

    Note: The totExec count is slightly different than the original in the case of an image with content len < 100

    CLA Signed 
    opened by ColeMurray 0
  • Add Parallel Execution of Image Download

    To improve runtime performance, add parallelism to image downloading

    Note: The tot_evals count is slightly different as we are now counting content with len(100). This avoids introducing additional logic to distinguish between mismatch hash vs bad content.
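For readers who want the general shape of such a change, here is a minimal sketch of parallel download-and-verify using a thread pool. It is not the code from this PR: `check_one`, `check_all`, the `fetch` callback and the in-memory `fake_store` are all hypothetical, and the stub replaces real network requests so the example runs offline.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple


def check_one(item: Tuple[str, str], fetch: Callable[[str], bytes]) -> Tuple[str, bool]:
    """Fetch one image and compare its MD5 with the reference hash."""
    image_id, reference_md5 = item
    try:
        data = fetch(image_id)
    except Exception:
        return image_id, False  # treat any fetch failure as "not usable"
    return image_id, hashlib.md5(data).hexdigest() == reference_md5


def check_all(references: Dict[str, str],
              fetch: Callable[[str], bytes],
              max_workers: int = 8) -> Dict[str, bool]:
    """Run check_one over all images on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda item: check_one(item, fetch), references.items())
        return dict(results)


# Offline stand-in for the network: image id -> bytes (hypothetical data).
fake_store = {"a": b"alpha", "b": b"beta"}


def fake_fetch(image_id: str) -> bytes:
    return fake_store[image_id]  # raises KeyError for a missing image


references = {
    "a": hashlib.md5(b"alpha").hexdigest(),    # matches
    "b": hashlib.md5(b"changed").hexdigest(),  # content differs
    "c": hashlib.md5(b"gone").hexdigest(),     # image missing
}

status = check_all(references, fake_fetch)
print(status)
```

Threads suit this workload because the time is spent waiting on the network rather than on computation. Collapsing fetch failures and hash mismatches into one False result mirrors the simplification the note above describes.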

    CLA Signed 
    opened by ColeMurray 0
  • [Feature Request] Add Parallel Download for Image Urls

    To improve user download speed, utilize Python's threading or multiprocessing library. https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset/blob/main/download_imgur5k.py#L93

    Are we open to adding this to the downloader?

    opened by ColeMurray 1
Owner
Facebook Research