A dataset for online Arabic calligraphy

ARBML

Last update: Dec 28, 2022

Overview

Calliar

Calliar is a dataset for Arabic calligraphy. It consists of 2,500 JSON files containing manually annotated strokes of Arabic calligraphy. This repository contains the dataset for the following paper:

Calliar: An Online Handwritten Dataset for Arabic Calligraphy
Zaid Alyafeai, Maged S. Al-shaibani, Mustafa Ghaleb, Yousif Ahmed Al-Wajih
https://arxiv.org/abs/2106.10745

Abstract: Calligraphy is an essential part of the Arabic heritage and culture. It has been used in the past for the decoration of houses and mosques. Usually, such calligraphy is designed manually by experts with aesthetic insights. In the past few years, there has been a considerable effort to digitize such type of art by either taking a photo of decorated buildings or drawing them using digital devices. The latter is considered an online form where the drawing is tracked by recording the apparatus movement, an electronic pen for instance, on a screen. In the literature, there are many offline datasets collected with a diversity of Arabic styles for calligraphy. However, there is no available online dataset for Arabic calligraphy. In this paper, we illustrate our approach for the collection and annotation of an online dataset for Arabic calligraphy called Calliar that consists of 2,500 sentences. Calliar is annotated for stroke, character, word and sentence level prediction.

Stats

Dataset	# of Samples	# of Words	# of Chars	# of Strokes
Train	2,000	6,065	24,722	36,561
Valid	250	738	2,946	4,410
Test	250	753	3,052	4,601
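
As a quick sanity check on the table, the columns can be totalled and averaged. The sketch below copies the numbers from the table above; the names STATS, totals and strokes_per_char are illustrative and not part of the repository.

```python
# Split statistics copied from the table above:
# (samples, words, chars, strokes)
STATS = {
    "train": (2000, 6065, 24722, 36561),
    "valid": (250, 738, 2946, 4410),
    "test": (250, 753, 3052, 4601),
}


def totals(stats):
    """Sum each column across all splits."""
    return tuple(sum(column) for column in zip(*stats.values()))


def strokes_per_char(stats, split):
    """Average number of strokes per character in one split."""
    _, _, chars, strokes = stats[split]
    return strokes / chars


if __name__ == "__main__":
    print(totals(STATS))
    print(round(strokes_per_char(STATS, "train"), 2))
```

The sample column sums to 2,500, which matches the number of json files. Each split has more strokes than characters.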

Dataset Formats

The dataset is provided in two basic formats.

.json

Each .json file contains a list of strokes. Each stroke is a dictionary of the stroke character and its list of points. Each composite character, like ت, is mapped to a list of primitive strokes, i.e., ..ٮ . Refer to the paper and to chars.py for more details on the mapping.
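
A minimal sketch of reading such a file follows. It assumes each stroke is a one-key dictionary of the form {character: [[x, y], ...]}. That layout is an assumption inferred from the description above, and the sample data and helper names (iter_strokes, bounding_box) are made up for illustration. Check a real file and chars.py before relying on it.

```python
import json

# Hypothetical sample in the ASSUMED layout: a list of strokes, each a
# one-key dictionary {stroke_character: [[x, y], ...]}.
SAMPLE = '[{"ٮ": [[10, 40], [30, 55], [60, 40]]}, {".": [[28, 20]]}, {".": [[40, 20]]}]'


def iter_strokes(drawing):
    """Yield (character, points) pairs from a parsed stroke list."""
    for stroke in drawing:
        for char, points in stroke.items():
            yield char, [tuple(p) for p in points]


def bounding_box(drawing):
    """Return (min_x, min_y, max_x, max_y) over all points in the drawing."""
    xs, ys = [], []
    for _, points in iter_strokes(drawing):
        xs.extend(x for x, _ in points)
        ys.extend(y for _, y in points)
    return min(xs), min(ys), max(xs), max(ys)


drawing = json.loads(SAMPLE)
labels = [char for char, _ in iter_strokes(drawing)]
print(labels)
print(bounding_box(drawing))
```

For a real file, replace json.loads(SAMPLE) with json.load on an opened .json file from the dataset.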

.npz

The compressed version of the dataset, dataset.npz, is only 8.6 MB. It uses the Ramer-Douglas-Peucker algorithm to reduce the number of points per stroke; the Python library rdp was used for this task. The .npz format follows the same approach as QuickDraw.
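
To illustrate the idea, here is a small pure-Python sketch of Ramer-Douglas-Peucker simplification. It is a re-implementation for illustration, not the rdp library's code. The to_stroke3 helper shows one common QuickDraw-style convention of (dx, dy, pen_lifted) rows; the exact layout stored in dataset.npz is not specified here, so treat that helper and all names as assumptions.

```python
import math


def _point_line_distance(point, start, end):
    """Perpendicular distance from point to the line through start and end."""
    (px, py), (sx, sy), (ex, ey) = point, start, end
    dx, dy = ex - sx, ey - sy
    if dx == 0 and dy == 0:
        # Degenerate chord: fall back to the distance to the single point.
        return math.hypot(px - sx, py - sy)
    return abs(dy * (px - sx) - dx * (py - sy)) / math.hypot(dx, dy)


def rdp_simplify(points, epsilon):
    """Ramer-Douglas-Peucker: drop points within epsilon of the chord."""
    if len(points) < 3:
        return list(points)
    # Find the interior point farthest from the chord between the endpoints.
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        dist = _point_line_distance(points[i], points[0], points[-1])
        if dist > max_dist:
            index, max_dist = i, dist
    if max_dist <= epsilon:
        return [points[0], points[-1]]
    # Keep the farthest point and simplify both halves recursively.
    left = rdp_simplify(points[: index + 1], epsilon)
    right = rdp_simplify(points[index:], epsilon)
    return left[:-1] + right


def to_stroke3(strokes):
    """Convert absolute-coordinate strokes to (dx, dy, pen_lifted) rows."""
    rows, prev_x, prev_y = [], 0, 0
    for stroke in strokes:
        for i, (x, y) in enumerate(stroke):
            pen_lifted = 1 if i == len(stroke) - 1 else 0
            rows.append((x - prev_x, y - prev_y, pen_lifted))
            prev_x, prev_y = x, y
    return rows


# An L-shaped stroke with redundant collinear points.
l_shape = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
simplified = rdp_simplify(l_shape, epsilon=0.5)
print(simplified)
print(to_stroke3([simplified, [(5, 5)]]))
```

A larger epsilon removes more points at the cost of fidelity. The value used to build dataset.npz is not stated here.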

Visualization

The vis.py file contains a set of Python functions for easily visualizing the dataset. Here are two examples: drawing a sample json file and creating an animation.

import json

from IPython.display import Video
from vis import *

# Path to one of the dataset's .json files (placeholder; point it at a real file).
json_path = "path/to/sample.json"

## show an image of the strokes
with open(json_path) as f:
    drawing = json.load(f)
print(get_annotation(json_path))
data = convert_3d(drawing)
draw_strokes(data, stroke_width = 2, crop = True)

## create an animation.
create_animation(json_path)
Video("tmp/video.mp4")

Samples

Animation

(Four embedded sample animations: video_twitter.mp4, video_twitter_1.mp4, video_twitter_2.mp4, video_twitter_3.mp4)

Citation

@misc{alyafeai2021calliar,
      title={Calliar: An Online Handwritten Dataset for Arabic Calligraphy}, 
      author={Zaid Alyafeai and Maged S. Al-shaibani and Mustafa Ghaleb and Yousif Ahmed Al-Wajih},
      year={2021},
      eprint={2106.10745},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Comments

Repository size is quite large

The repo size is quite large. I suggest moving the dataset-related materials to another location, either a Google Cloud Storage bucket or even Google Drive. What do you think, @zaidalyafeai?

opened by MagedSaeed 3

Codebase refactoring

Codebase Refactoring

I think the repo needs some refactoring.

Group the two main parts of the repo

Currently, the repo contains two different parts: the main Calliar part and the Calliar server. As of now, only the Calliar server files are grouped under one folder, while the main part sits at the top level. A simple solution would be to group everything in Calliar, except the Calliar server folder, into one folder.

Group the related parts in the Calliar.

Currently, I can see certain files that could be grouped to improve the quality and maintainability of the repo, e.g., the demos and the dataset.

Vis.py

The file contains some dead code and can be improved in some aspects.

Calliar server

Ignored files

.vscode, __pycache__, and .ipynb_checkpoints should not be tracked in GitHub.

The static folder

Files of similar types can be grouped together.

opened by yousef337 2

Pre-commit Utilization

Description

Introducing pre-commit to the project would help keep the codebase clean and consistent. An existing configuration from Masader-Webserver can be reused.

opened by yousef337 2

manage images root dir

This PR tries to achieve the following:

Introduce IMAGES_DIR as the root directory under the media folder (media is a standard folder for media-related files).

Introduce a local_setting.py file for any environment-dependent variables.

Fix the server-side bug that occurs when there are no images to draw.

@zaidalyafeai Please review and test the changes before merging.

opened by MagedSaeed 1

refactor the repo

@zaidalyafeai

Please take a look at this structural PR. I have tried to cover most of the points @yousef337 highlighted in this issue [https://github.com/ARBML/Calliar/issues/4]

Hopefully, this PR lays down a better structure for the project and opens doors for a better collaboration experience.

I ran a local server and things seem to work as expected. Please make sure, on your side, that the development server runs fine before merging into the main branch.

opened by MagedSaeed 0