Ground truth data for the Optical Character Recognition of Historical Classical Commentaries.

Overview

OCR Ground Truth for Historical Commentaries

The OCR ground truth for historical commentaries dataset (GT4HistComment) was created from the public domain subset of scholarly commentaries on Sophocles' Ajax. Its main goal is to enable the evaluation of OCR quality on printed materials that contain a mix of Latin and polytonic Greek scripts. It consists of five 19th-century commentaries written in German, English, and Latin, for a total of 3,356 ground truth (GT) lines.
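As an illustration of what evaluating OCR quality against a GT line can look like, here is a minimal sketch of a character error rate (CER) computation. The source does not state which metric or tooling the authors use, so the function names and the choice of CER are assumptions. Both strings are normalized to Unicode NFC first, because polytonic Greek letters can be encoded either precomposed or as a base letter plus combining marks, and comparing the two forms directly would count spurious errors.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # drop ca
                               current[j - 1] + 1,       # add cb
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def cer(ground_truth: str, ocr_output: str) -> float:
    """Character error rate of ocr_output relative to ground_truth (hypothetical helper)."""
    gt = unicodedata.normalize("NFC", ground_truth)
    ocr = unicodedata.normalize("NFC", ocr_output)
    if not gt:
        # Assumption: an empty reference scores 0.0 only if the OCR is empty too.
        return 0.0 if not ocr else 1.0
    return levenshtein(gt, ocr) / len(gt)
```

Dividing by the length of the ground truth line is one common convention; other tools normalize differently, so scores from this sketch are not necessarily comparable with published figures.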

Data

The GT4HistComment data are contained in data/, where each sub-folder corresponds to a different publication (i.e. commentary). For each commentary we provide the following data:

<commentary_id>/GT-pairs: pairs of image/text files for each GT line
<commentary_id>/imgs: original images on which the OCR was performed
<commentary_id>/<commentary_id>_olr.tsv: OLR annotations with image region coordinates and layout type ground truth label

The OCR output produced by the Kraken + Ciaconna pipeline was manually corrected by a pool of annotators using the Lace platform. In order to ensure the quality of the ground truth datasets, an additional verification of all transcriptions made in Lace was carried out by an annotator on line-by-line pairs of image and corresponding text.
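The folder layout described in this section can be walked with a few lines of Python. This is only a sketch: the README does not specify how image and text files are named inside GT-pairs, so the pairing rule used here (same file stem, with .png and .txt extensions) and the function name iter_gt_pairs are assumptions to adapt to the actual files.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_gt_pairs(commentary_dir: Path,
                  image_suffix: str = ".png",
                  text_suffix: str = ".txt") -> Iterator[Tuple[Path, str]]:
    """Yield (image path, transcription) for each GT line of one commentary.

    Assumption: an image and its transcription share the same file stem.
    """
    pairs_dir = commentary_dir / "GT-pairs"
    for image_path in sorted(pairs_dir.glob(f"*{image_suffix}")):
        text_path = image_path.with_suffix(text_suffix)
        if text_path.exists():  # skip images without a transcription
            yield image_path, text_path.read_text(encoding="utf-8").strip()
```

For example, `list(iter_gt_pairs(Path("data") / "bsb10234118"))` would collect the line pairs for the Lobeck commentary, provided the files follow the assumed naming.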

Commentary overview

ID	Commentator	Year	Languages	Image source
bsb10234118	Lobeck [1]	1835	Greek, Latin	BSB
sophokle1v3soph	Schneidewin [2]	1853	Greek, German	Internet Archive
cu31924087948174	Campbell [3]	1881	Greek, English	Internet Archive
sophoclesplaysa05campgoog	Jebb [4]	1896	Greek, English	Internet Archive
Wecklein1894	Wecklein [5]	1894	Greek, German	internal

Stats

Line, word, and character counts for each commentary are given in the following table. Detailed counts for each region can be found here.

ID	Commentator	Type	lines	words	all chars	greek chars
bsb10234118	Lobeck	training	574	2943	16081	5344
bsb10234118	Lobeck	groundtruth	202	1491	7917	2786
sophokle1v3soph	Schneidewin	training	583	2970	16112	3269
sophokle1v3soph	Schneidewin	groundtruth	382	1599	8436	2191
cu31924087948174	Campbell	groundtruth	464	2987	14291	3566
sophoclesplaysa05campgoog	Jebb	training	561	4102	19141	5314
sophoclesplaysa05campgoog	Jebb	groundtruth	324	2418	10986	2805
Wecklein1894	Wecklein	groundtruth	211	1912	9556	3268
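The counts in this table can be aggregated directly. The sketch below copies the rows as transcribed above and derives two summaries: the number of lines per value of the Type column, and the share of Greek characters per commentator. The helper names are hypothetical, and no interpretation of the Type labels is assumed beyond grouping by them.

```python
from collections import defaultdict

# (id, commentator, type, lines, words, all chars, greek chars), copied from the table
ROWS = [
    ("bsb10234118", "Lobeck", "training", 574, 2943, 16081, 5344),
    ("bsb10234118", "Lobeck", "groundtruth", 202, 1491, 7917, 2786),
    ("sophokle1v3soph", "Schneidewin", "training", 583, 2970, 16112, 3269),
    ("sophokle1v3soph", "Schneidewin", "groundtruth", 382, 1599, 8436, 2191),
    ("cu31924087948174", "Campbell", "groundtruth", 464, 2987, 14291, 3566),
    ("sophoclesplaysa05campgoog", "Jebb", "training", 561, 4102, 19141, 5314),
    ("sophoclesplaysa05campgoog", "Jebb", "groundtruth", 324, 2418, 10986, 2805),
    ("Wecklein1894", "Wecklein", "groundtruth", 211, 1912, 9556, 3268),
]


def lines_by_type(rows):
    """Sum the line counts for each value of the Type column."""
    totals = defaultdict(int)
    for _id, _name, kind, lines, _words, _chars, _greek in rows:
        totals[kind] += lines
    return dict(totals)


def greek_share(rows):
    """Fraction of Greek characters among all characters, per commentator."""
    chars = defaultdict(int)
    greek = defaultdict(int)
    for _id, name, _kind, _lines, _words, all_chars, greek_chars in rows:
        chars[name] += all_chars
        greek[name] += greek_chars
    return {name: greek[name] / chars[name] for name in chars}


if __name__ == "__main__":
    print(lines_by_type(ROWS))
    for name, share in greek_share(ROWS).items():
        print(f"{name}: {share:.1%} Greek characters")
```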

Commentary editions used:

[1] Lobeck, Christian August. 1835. Sophoclis Aiax. Leipzig: Weidmann.
[2] Sophokles. 1853. Sophokles Erklaert von F. W. Schneidewin. Erstes Baendchen: Aias. Philoktetes. Edited by Friedrich Wilhelm Schneidewin. Leipzig: Weidmann.
[3] Campbell, Lewis. 1881. Sophocles. Oxford: Clarendon Press.
[4] Jebb, Richard Claverhouse. 1896. Sophocles: The Plays and Fragments. London: Cambridge University Press.
[5] Wecklein, Nikolaus. 1894. Sophokleus Aias. München: Lindauer.

Citation

If you use this dataset in your research, please cite the following publication:

@inproceedings{romanello_optical_2021,
  title = {Optical {{Character Recognition}} of 19th {{Century Classical Commentaries}}: The {{Current State}} of {{Affairs}}},
  booktitle = {The 6th {{International Workshop}} on {{Historical Document Imaging}} and {{Processing}} ({{HIP}} '21)},
  author = {Romanello, Matteo and Najem-Meyer, Sven and Robertson, Bruce},
  year = {2021},
  publisher = {{Association for Computing Machinery}},
  address = {{Lausanne}},
  doi = {10.1145/3476887.3476911}
}

Acknowledgements

Data in this repository were produced in the context of the Ajax Multi-Commentary project, funded by the Swiss National Science Foundation under an Ambizione grant PZ00P1_186033.

Contributors: Carla Amaya (UNIL), Sven Najem-Meyer (EPFL), Matteo Romanello (UNIL), Bruce Robertson (Mount Allison University).

Comments

adds line-, word- and char-counts to README.md

Adds a table to README.md, as suggested by reviewer 1. The table also links to a more complete table, itself a public version of the spreadsheet OCR evaluation and stats!detailed_counts. Note that the publishable version is an external reference to our private version, meaning that updating the latter will also update the former.

opened by sven-nm

Pages to exclude - OCR

The page contains the metrical schemes of the passages. As a result, the OCR does not recognize them; moreover, the correction of the OCR has not been completed.

Pages to exclude: sophoclesplaysa05campgoog_0072.png (Jebb, p. 72)

opened by camaya28

Releases (v1.0)

v1.0 (Sep 24, 2021)

Owner

Ajax Multi-Commentary
