Bangla handwritten document digitization

Mushfiqur Rahman

Last update: Dec 10, 2021

Related tags

Text Data & NLP bangla_doc_digitizaion

Overview

Bangla handwritten document digitization

This repo addresses the problem of digitizing handwritten documents in Bangla. These documents have fixed fields that hold specific information. We target the field of interest and crop that region.
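The cropping step can be illustrated with a minimal sketch. It assumes the scanned page is already loaded as a NumPy array and that the field position is known in advance; the function name `crop_field`, the box layout, and the coordinates are hypothetical and not taken from this repo.

```python
import numpy as np


def crop_field(page, box):
    """Return the sub-image for a field given as (top, left, height, width).

    The box is clipped to the page so a slightly oversized box still works.
    """
    top, left, height, width = box
    if top < 0 or left < 0 or height <= 0 or width <= 0:
        raise ValueError("box must have non-negative origin and positive size")
    bottom = min(top + height, page.shape[0])
    right = min(left + width, page.shape[1])
    return page[top:bottom, left:right]


# Placeholder grayscale page (rows x columns); a real scan would be loaded from disk.
page = np.zeros((1000, 700), dtype=np.uint8)

# Hypothetical location of the amount field on the form.
AMOUNT_BOX = (120, 400, 60, 250)

amount_region = crop_field(page, AMOUNT_BOX)
```

Because the form layout is fixed, a single box per field may be enough. Scans that are skewed or scaled would likely need an alignment step first, which this sketch does not cover.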

We focus only on extracting the amount (in currency), which is important in tax returns. Our approach first selects individual characters and separates digits from non-digit characters. The per-character classification results are then merged to obtain the full amount.
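The merging step described above could look roughly like the following sketch. It assumes each character has already been classified and carries its horizontal position; the `CharPrediction` type, the `NON_DIGIT` label, and `merge_amount` are illustrative names, not the repo's actual code.

```python
from dataclasses import dataclass
from typing import List, Optional

NON_DIGIT = -1  # assumed label for any character that is not a digit


@dataclass
class CharPrediction:
    x: int      # left edge of the character box, in pixels
    label: int  # predicted digit 0-9, or NON_DIGIT


def merge_amount(predictions: List[CharPrediction]) -> Optional[int]:
    """Combine per-character predictions into one integer amount.

    Characters are ordered left to right, non-digits are dropped,
    and the remaining digits are read as a base-10 number.
    """
    ordered = sorted(predictions, key=lambda p: p.x)
    digits = [p.label for p in ordered if 0 <= p.label <= 9]
    if not digits:
        return None  # no digits found in the field
    amount = 0
    for digit in digits:
        amount = amount * 10 + digit
    return amount


# Example: predictions arrive unordered and include one non-digit character.
preds = [
    CharPrediction(x=90, label=0),
    CharPrediction(x=10, label=NON_DIGIT),
    CharPrediction(x=30, label=2),
    CharPrediction(x=50, label=5),
    CharPrediction(x=70, label=0),
]
amount = merge_amount(preds)  # digits read left to right: 2, 5, 0, 0
```

Sorting by position before merging keeps the result independent of the order in which characters were detected.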

Result

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT

Owner

Mushfiqur Rahman
