OCR Ground Truth for Historical Commentaries
The dataset OCR ground truth for historical commentaries (GT4HistComment) was created from the public domain subset of scholarly commentaries on Sophocles' Ajax. Its main goal is to enable the evaluation of the OCR quality on printed materials that contain a mix of Latin and polytonic Greek scripts. It consists of five 19C commentaries written in German, English, and Latin, for a total of 3,356 GT lines.
Data
GT4HistComment are contained in data/
, where each sub-folder corresponds to a different publication (i.e. commentary). For each each commentary we provide the following data:
<commentary_id>/GT-pairs
: pairs of image/text files for each GT line<commentary_id>/imgs
: original images on which the OCR was performed<commentary_id>/<commentary_id>_olr.tsv
: OLR annotations with image region coordinates and layout type ground truth label
The OCR output produced by the Kraken + Ciaconna pipeline was manually corrected by a pool of annotators using the Lace platform. In order to ensure the quality of the ground truth datasets, an additional verification of all transcriptions made in Lace was carried out by an annotator on line-by-line pairs of image and corresponding text.
Commentary overview
ID | Commentator | Year | Languages | Image source | Line example |
---|---|---|---|---|---|
bsb10234118 | Lobeck [1] | 1835 | Greek, Latin | BSB | |
sophokle1v3soph | Schneidewin [2] | 1853 | Greek, German | Internet Archive | |
cu31924087948174 | Campbell [3] | 1881 | Greek, English | Internet Archive | |
sophoclesplaysa05campgoog | Jebb [4] | 1896 | Greek, English | Internet Archive | |
Wecklein1894 | Wecklein [5] | 1894 [5] | Greek. German | internal |
Stats
Line, word and char counts for each commentary are indicated in the following table. Detailled counts for each region can be found here.
ID | Commentator | Type | lines | words | all chars | greek chars |
---|---|---|---|---|---|---|
bsb10234118 | Lobeck | training | 574 | 2943 | 16081 | 5344 |
bsb10234118 | Lobeck | groundtruth | 202 | 1491 | 7917 | 2786 |
sophokle1v3soph | Schneidewin | training | 583 | 2970 | 16112 | 3269 |
sophokle1v3soph | Schneidewin | groundtruth | 382 | 1599 | 8436 | 2191 |
cu31924087948174 | Campbell | groundtruth | 464 | 2987 | 14291 | 3566 |
sophoclesplaysa05campgoog | Jebb | training | 561 | 4102 | 19141 | 5314 |
sophoclesplaysa05campgoog | Jebb | groundtruth | 324 | 2418 | 10986 | 2805 |
Wecklein1894 | Wecklein | groundtruth | 211 | 1912 | 9556 | 3268 |
Commentary editions used:
- [1] Lobeck, Christian August. 1835. Sophoclis Aiax. Leipzig: Weidmann.
- [2] Sophokles. 1853. Sophokles Erklaert von F. W. Schneidewin. Erstes Baendchen: Aias. Philoktetes. Edited by Friedrich Wilhelm Schneidewin. Leipzig: Weidmann.
- [3] Lewis Campbell. 1881. Sophocles. Oxford : Clarendon Press.
- [4] Wecklein, Nikolaus. 1894. Sophokleus Aias. München: Lindauer.
- [5] Jebb, Richard Claverhouse. 1896. Sophocles: The Plays and Fragments. London: Cambridge University Press.
Citation
If you use this dataset in your research, please cite the following publication:
@inproceedings{romanello_optical_2021,
title = {Optical {{Character Recognition}} of 19th {{Century Classical Commentaries}}: The {{Current State}} of {{Affairs}}},
booktitle = {The 6th {{International Workshop}} on {{Historical Document Imaging}} and {{Processing}} ({{HIP}} '21)},
author = {Romanello, Matteo and Sven, Najem-Meyer and Robertson, Bruce},
year = {2021},
publisher = {{Association for Computing Machinery}},
address = {{Lausanne}},
doi = {10.1145/3476887.3476911}
}
Acknowledgements
Data in this repository were produced in the context of the Ajax Multi-Commentary project, funded by the Swiss National Science Foundation under an Ambizione grant PZ00P1_186033.
Contributors: Carla Amaya (UNIL), Sven Najem-Meyer (EPFL), Matteo Romanello (UNIL), Bruce Robertson (Mount Allison University).