a Deep Learning Framework for Text




DeLFT (Deep Learning Framework for Text) is a Keras and TensorFlow framework for text processing, focusing on sequence labelling (e.g. named entity tagging, information extraction) and text classification (e.g. comment classification). This library re-implements standard state-of-the-art Deep Learning architectures relevant to text processing tasks.

DeLFT has three main purposes:

  1. Usefulness, by targeting the most common textual content used by humans to communicate, which is not just simple text as usually considered by existing Deep Learning works in NLP, but rich text where tokens are associated with layout information (font, style, etc.), positions in structured documents, and possibly other lexical or symbolic contextual information. Such rich text also usually comes from large documents like PDF or HTML, and not just from text segments like sentences or paragraphs.

  2. Reproducibility and benchmarking, by implementing several state-of-the-art algorithms for both sequence labelling and text classification tasks, including the usage of ELMo contextualised embeddings and BERT transformer architecture, offering the capacity to validate reported results and to benchmark several methods under the same conditions and criteria.

  3. Production level, by offering optimized performance, robustness and integration possibilities, which can support better engineering decisions and successful production-level applications.

Some key elements include:

  • Reduction of the size of RNN models, in particular by removing word embeddings from them. For instance, the model for the toxic comment classifier went down from a size of 230 MB with embeddings to 1.8 MB. In practice the size of all the models of DeLFT is less than 2 MB, except for Ontonotes 5.0 NER model which is 4.7 MB.

  • Implementation of a generic support of features.

  • Usage of dynamic data generator so that the training data do not need to stand completely in memory.

  • Efficient loading and management of an unlimited volume of pre-trained embeddings.

  • A comprehensive evaluation framework with the standard metrics for sequence labeling and classification tasks, including n-fold cross validation.
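The dynamic data generator mentioned in the list above can be illustrated with a plain Python generator. This is only a minimal sketch of the general idea (yielding one batch at a time from a lazy source), not DeLFT's actual generator; the name batch_generator is hypothetical.

```python
from typing import Iterable, Iterator, List


def batch_generator(lines: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most batch_size items without materialising the whole input."""
    batch: List[str] = []
    for line in lines:
        batch.append(line.rstrip("\n"))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # last, possibly smaller, batch
        yield batch


if __name__ == "__main__":
    # Any lazy iterable works here, for instance an open file handle.
    examples = (f"sentence {i}" for i in range(5))
    for batch in batch_generator(examples, batch_size=2):
        print(batch)
```

Because only one batch is held at a time, the memory footprint depends on the batch size and not on the size of the training file.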

A native Java integration of the library has been realized in GROBID via JEP.

DeLFT has been tested with python 3.5 and 3.6, Keras 2.2 and Tensorflow 1.7+ as backend. As always, GPU(s) are required for decent training time: a GeForce GTX 1050 Ti for instance is absolutely fine without ELMo contextual embeddings. Using ELMo or BERT Base model is fine with a GeForce GTX 1080 Ti.


Get the github repo:

git clone https://github.com/kermitt2/delft
cd delft

It is advised to first set up a virtual environment to avoid falling into one of these gloomy python dependency marshlands:

virtualenv --system-site-packages -p python3 env
source env/bin/activate

Install the dependencies:

pip3 install -r requirements.txt

DeLFT uses tensorflow 1.12 as backend, and will exploit your available GPU with the condition that CUDA (>=8.0) is properly installed.

You then need to download some pre-trained word embeddings and register their path in the embedding registry. To exploit the provided models, we suggest:

  • glove Common Crawl (2.2M vocab., cased, 300 dim. vectors): glove-840B

  • fasttext Common Crawl (2M vocab., cased, 300 dim. vectors): fasttext-crawl

  • word2vec GoogleNews (3M vocab., cased, 300 dim. vectors): word2vec

  • fasttext_wiki_fr (1.1M, NOT cased, 300 dim. vectors) for French: wiki.fr

  • ELMo trained on 5.5B word corpus (will produce 1024 dim. vectors) for English: options and weights

  • BERT for English, we are using BERT-Base, Cased, 12-layer, 768-hidden, 12-heads, 110M parameters: available here

  • SciBERT for English and scientific content: SciBERT-cased

Then edit the file embedding-registry.json and modify the value for path according to the path where you have saved the corresponding embeddings. The embedding files must be unzipped.

{
    "embeddings": [
        {
            "name": "glove-840B",
            "path": "/PATH/TO/THE/UNZIPPED/EMBEDDINGS/FILE/glove.840B.300d.txt",
            "type": "glove",
            "format": "vec",
            "lang": "en",
            "item": "word"
        }
    ]
}

You're ready to use DeLFT.
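The registry edit described above can also be scripted. The following is a minimal sketch, assuming the registry is a JSON object with an "embeddings" list whose entries carry "name" and "path" keys, as in the excerpt; the helper names set_embedding_path and update_registry_file are hypothetical and not part of DeLFT.

```python
import json
from typing import Any, Dict


def set_embedding_path(registry: Dict[str, Any], name: str, path: str) -> None:
    """Set the 'path' of the registry entry called `name` (in place)."""
    for entry in registry["embeddings"]:
        if entry["name"] == name:
            entry["path"] = path
            return
    raise KeyError(f"no embedding named {name!r} in the registry")


def update_registry_file(registry_file: str, name: str, path: str) -> None:
    """Load the registry file, update one entry and write the file back."""
    with open(registry_file, encoding="utf-8") as f:
        registry = json.load(f)
    set_embedding_path(registry, name, path)
    with open(registry_file, "w", encoding="utf-8") as f:
        json.dump(registry, f, indent=4)
```

For instance, update_registry_file("embedding-registry.json", "glove-840B", "/data/embeddings/glove.840B.300d.txt") would point the glove-840B entry to a local file (the /data/embeddings location is only an example).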

Management of embeddings

The first time DeLFT starts and accesses pre-trained embeddings, these embeddings are serialised and stored in a LMDB database, a very efficient embedded database using memory-mapped files (already used in the Machine Learning world by Caffe and Torch for managing large training data). The next time these embeddings are accessed, they will be immediately available.

Our approach solves the bottleneck problem pointed out for instance here in a much better way than quantising+compression or pruning. After being compiled and stored at the first access, any volume of embeddings vectors can be used immediately without any loading, with a negligible usage of memory, without any accuracy loss and with a negligible impact on runtime when using SSD. In practice, we can exploit for instance embeddings for a dozen languages simultaneously, without any memory or runtime issues - a requirement for any ambitious industrial deployment of a neural NLP system.

For instance, in a traditional approach glove-840B takes around 2 minutes to load and 4GB in memory. Managed with LMDB, after a first load time of around 4 minutes, glove-840B can be accessed immediately and takes only a couple of MB in memory, with a negligible impact on runtime (around 1% slower) for any further command line calls.

By default, the LMDB databases are stored under the subdirectory data/db. The size of a database is roughly equivalent to the size of the original uncompressed embeddings file. To modify this path, edit the file embedding-registry.json and change the value of the attribute embedding-lmdb-path.
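The "compile once, then look up on demand" principle can be illustrated with a small sketch. Note that this is not DeLFT's implementation: to stay within the Python standard library it uses sqlite3 as a stand-in for LMDB, and the function names compile_embeddings and lookup are hypothetical.

```python
import sqlite3
from array import array
from typing import Iterable, List, Optional, Tuple


def compile_embeddings(db_path: str, vectors: Iterable[Tuple[str, List[float]]]) -> None:
    """One-off step: serialise each vector to float32 bytes and store it on disk."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success
            conn.execute(
                "CREATE TABLE IF NOT EXISTS vectors (word TEXT PRIMARY KEY, vec BLOB)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO vectors VALUES (?, ?)",
                ((word, array("f", values).tobytes()) for word, values in vectors),
            )
    finally:
        conn.close()


def lookup(db_path: str, word: str) -> Optional[List[float]]:
    """Fetch a single vector from disk; nothing else is loaded in memory."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT vec FROM vectors WHERE word = ?", (word,)
        ).fetchone()
    finally:
        conn.close()
    if row is None:
        return None
    vec = array("f")
    vec.frombytes(row[0])
    return vec.tolist()
```

The first call pays the serialisation cost once; later processes only open the database and read the vectors they actually need, which is the behaviour described above for the LMDB-backed store.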

To get FastText .bin format support please uncomment the package fasttextmirror==0.8.22 in requirements.txt or requirements-gpu.txt according to your system's configuration. Please note that the .bin format is not supported on Windows platforms. Installing the FastText .bin format support introduces the following additional dependencies:

  • (gcc-4.8 or newer) or (clang-3.3 or newer)
  • Python version 2.7 or >=3.4
  • pybind11

While the FastText .bin format is supported by DeLFT (including using ngrams for OOV words), this format will be loaded entirely in memory and does not take advantage of our memory-efficient management of embeddings.

I have plenty of memory on my machine, I don't care about load time because I need to grab a coffee every ten minutes, I only process one language at a time, so I am not interested in taking advantage of the LMDB embedding management!

Ok, ok, then set the embedding-lmdb-path value to "None" in the file embedding-registry.json. The embeddings will then be loaded in memory as immutable data, like in the usual Keras scripts.

Sequence Labelling

Available models

The following DL architectures are supported by DeLFT:

  • BidLSTM-CRF with words and characters input following:

     [1] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, Chris Dyer. "Neural Architectures for Named Entity Recognition". Proceedings of NAACL 2016. https://arxiv.org/abs/1603.01360

  • BidLSTM_CRF_FEATURES same as above, with generic feature channel (feature matrix can be provided in the usual CRF++/Wapiti/YamCha format).

  • BidLSTM-CNN with words, characters and custom casing features input, see:

     [2] Jason P. C. Chiu, Eric Nichols. "Named Entity Recognition with Bidirectional LSTM-CNNs". 2016. https://arxiv.org/abs/1511.08308

  • BidLSTM-CNN-CRF with words, characters and custom casing features input following:

     [3] Xuezhe Ma and Eduard Hovy. "End-to-end Sequence Labelling via Bi-directional LSTM-CNNs-CRF". 2016. https://arxiv.org/abs/1603.01354

  • BidGRU-CRF, similar to:

     [4] Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, Russell Power. "Semi-supervised sequence tagging with bidirectional language models". 2017. https://arxiv.org/pdf/1705.00108

  • BERT transformer architecture, with fine-tuning and a CRF as activation layer, adapted to sequence labeling. Any pre-trained TensorFlow BERT models can be used (e.g. SciBERT or BioBERT for scientific and medical texts).

     [6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2018. https://arxiv.org/abs/1810.04805

In addition, the following contextual embeddings can be used in combination with the RNN architectures:

  • ELMo contextualised embeddings, which lead to the state of the art (92.22% F1 on CoNLL2003 NER dataset, averaged over five runs) when combined with BidLSTM-CRF, see:

     [5] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer. "Deep contextualized word representations". 2018. https://arxiv.org/abs/1802.05365

  • BERT feature extraction to be used as contextual embeddings (as ELMo alternative), as explained in section 5.4 of:

     [6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2018. https://arxiv.org/abs/1810.04805

Note that all our annotation data for sequence labelling follows the IOB2 scheme. In our experiments, we did not find any advantage in adding alternative labelling schemes.
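As a reminder of what the IOB2 scheme encodes (every entity starts with a B- tag, continues with I- tags of the same type, and everything else is O), here is a minimal sketch that recovers entity spans from a tag sequence. The function name iob2_to_spans is hypothetical and the code is independent of DeLFT's own data readers.

```python
from typing import List, Optional, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Return (label, start, end) spans, end exclusive, from IOB2 tags.

    Stray I- tags that do not continue an open entity are ignored.
    """
    spans: List[Tuple[str, int, int]] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing entity
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


if __name__ == "__main__":
    tokens = ["The", "University", "of", "California", "is", "in", "California"]
    tags = ["O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "B-LOC"]
    for label, start, end in iob2_to_spans(tags):
        print(label, " ".join(tokens[start:end]))
```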




We have reimplemented in DeLFT the main neural architectures for NER of the last four years and performed a reproducibility analysis of these systems with comparable evaluation criteria. Unfortunately, in publications, systems are usually compared directly with reported results obtained in different settings, which can bias scores by more than 1.0 point and completely invalidate both comparison and interpretation of results.

You can read more about our reproducibility study of neural NER in this blog article. This effort is similar to the work of (Yang and Zhang, 2018) (see also NCRFpp) but has also been extended to BERT for a fair comparison of RNN for sequence labeling, and can also be related to the motivations of (Pressel et al., 2018) MEAD.

All reported scores below are f-scores for the CoNLL-2003 NER dataset. We report first the f-score averaged over 10 training runs, and second the best f-score over these 10 training runs. All the DeLFT trained models are included in this repository.

Architecture     Implementation          Glove only     Glove + valid. set  ELMo + Glove   ELMo + Glove + valid. set
                                         (avg / best)   (avg / best)        (avg / best)   (avg / best)
BidLSTM-CRF      DeLFT                   90.75 / 91.35  91.13 / 91.60       92.47 / 92.71  92.69 / 93.09
                 (Lample and al., 2016)  - / 90.94
BidLSTM-CNN-CRF  DeLFT                   90.73 / 91.07  91.01 / 91.26       92.30 / 92.57  92.67 / 93.04
                 (Ma & Hovy, 2016)       - / 91.21
                 (Peters & al. 2018)                                        92.22** / -
BidLSTM-CNN      DeLFT                   89.23 / 89.47  89.35 / 89.87       91.66 / 92.00  92.01 / 92.16
                 (Chiu & Nichols, 2016)                 90.88*** / -
BidGRU-CRF       DeLFT                   90.38 / 90.72  90.28 / 90.69       92.03 / 92.44  92.43 / 92.71
                 (Peters & al. 2017)                                                       91.93* / -

Results with BERT fine-tuning, including a final CRF activation layer, instead of a softmax (a CRF activation layer improves f-score on average by +0.30 for sequence labelling tasks):

Architecture      Implementation       f-score
bert-base-en      DeLFT                90.9
bert-base-en+CRF  DeLFT                91.2
bert-base-en      (Devlin & al. 2018)  92.4

For DeLFT, the average is obtained with 10 training runs (see full results) and for (Devlin & al. 2018) averaged with 5 runs. As noted here, the original CoNLL-2003 NER results with BERT reported by the Google Research paper are not reproducible, and the score obtained by DeLFT is very similar to those obtained by all the systems having reproduced this experiment (the original paper probably reported token-level metrics instead of the usual entity-level metrics, giving in our humble opinion a misleading conclusion about the performance of transformers for sequence labelling tasks).

* reported f-score using Senna word embeddings and not Glove.

** f-score is averaged over 5 training runs.

*** reported f-score with Senna word embeddings (Collobert 50d) averaged over 10 runs, including case features and not including lexical features. DeLFT implementation of the same architecture includes the capitalization features too, but uses the more efficient GloVe 300d embeddings.
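The difference between token-level and entity-level metrics mentioned above can be made concrete with a toy example. This is a minimal sketch, not DeLFT's evaluation code: the function names are hypothetical, and plain token accuracy is used as a simple stand-in for a token-level score.

```python
from typing import List, Set, Tuple


def spans(tags: List[str]) -> Set[Tuple[str, int, int]]:
    """Entity spans (label, start, end) from IOB2 tags, end exclusive."""
    found, start, label = set(), None, None
    for i, tag in enumerate(tags + ["O"]):
        if label is not None and not (tag.startswith("I-") and tag[2:] == label):
            found.add((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return found


def entity_f1(gold: List[str], pred: List[str]) -> float:
    """Entity-level f1: an entity counts only if label and boundaries both match."""
    g, p = spans(gold), spans(pred)
    correct = len(g & p)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(p), correct / len(g)
    return 2 * precision * recall / (precision + recall)


def token_accuracy(gold: List[str], pred: List[str]) -> float:
    """Token-level score: fraction of tokens with the right tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


if __name__ == "__main__":
    gold = ["B-ORG", "I-ORG", "I-ORG", "O", "O", "B-LOC"]
    pred = ["B-ORG", "I-ORG", "O", "O", "O", "B-LOC"]  # one boundary error
    print("token accuracy:", token_accuracy(gold, pred))
    print("entity f1:", entity_f1(gold, pred))
```

In this toy case a single wrong token costs one token out of six at token level, but a whole entity out of two at entity level, which is why the two kinds of scores are not comparable.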

Command Line Interface

Different datasets and languages are supported. They can be specified by the command line parameters. The general usage of the CLI is as follows:

usage: nerTagger.py [-h] [--fold-count FOLD_COUNT] [--lang LANG]
                    [--dataset-type DATASET_TYPE]
                    [--architecture ARCHITECTURE] [--use-ELMo] [--use-BERT]
                    [--data-path DATA_PATH] [--file-in FILE_IN]
                    [--file-out FILE_OUT]

Neural Named Entity Recognizers

positional arguments:
  action                one of [train, train_eval, eval, tag]

optional arguments:
  -h, --help            show this help message and exit
  --fold-count FOLD_COUNT
                        number of folds or re-runs to be used when training
  --lang LANG           language of the model as ISO 639-1 code
  --dataset-type DATASET_TYPE
                        dataset to be used for training the model
  --train-with-validation-set
                        Use the validation set for training together with the
                        training set
  --architecture ARCHITECTURE
                        type of model architecture to be used, one of
                        ['BidLSTM_CRF', 'BidLSTM_CRF_FEATURES', 'BidLSTM_CNN_CRF', 
                        'BidLSTM_CNN_CRF', 'BidGRU_CRF', 'BidLSTM_CNN', 
                        'BidLSTM_CRF_CASING', 'bert-base-en', 'bert-base-en', 
                        'scibert', 'biobert']
  --use-ELMo            Use ELMo contextual embeddings
  --use-BERT            Use BERT extracted features (embeddings)
  --data-path DATA_PATH
                        path to the corpus of documents for training (only use
                        currently with Ontonotes corpus in orginal XML format)
  --file-in FILE_IN     path to a text file to annotate
  --file-out FILE_OUT   path for outputting the resulting JSON NER anotations
  --embedding EMBEDDING
                        The desired pre-trained word embeddings using their
                        descriptions in the file embedding-registry.json. Be
                        sure to use here the same name as in the registry
                        ('glove-840B', 'fasttext-crawl', 'word2vec'), and that
                        the path in the registry to the embedding file is
                        correct on your system.

More explanations and examples are presented in the following sections.
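When experiments are launched from a script, the command lines can be assembled programmatically. The following is a minimal sketch using only the options and actions documented in the usage above; the helper build_ner_command is hypothetical and is not shipped with DeLFT.

```python
from typing import List, Optional

ACTIONS = ("train", "train_eval", "eval", "tag")


def build_ner_command(
    action: str,
    dataset_type: Optional[str] = None,
    lang: Optional[str] = None,
    architecture: Optional[str] = None,
    fold_count: Optional[int] = None,
    embedding: Optional[str] = None,
    file_in: Optional[str] = None,
    use_elmo: bool = False,
) -> List[str]:
    """Build the argument list for a nerTagger.py call (options first, action last)."""
    if action not in ACTIONS:
        raise ValueError(f"action must be one of {ACTIONS}, got {action!r}")
    command = ["python3", "nerTagger.py"]
    options = [
        ("--lang", lang),
        ("--dataset-type", dataset_type),
        ("--architecture", architecture),
        ("--fold-count", fold_count),
        ("--embedding", embedding),
        ("--file-in", file_in),
    ]
    for flag, value in options:
        if value is not None:
            command += [flag, str(value)]
    if use_elmo:
        command.append("--use-ELMo")
    command.append(action)
    return command


if __name__ == "__main__":
    print(" ".join(build_ner_command("train_eval", dataset_type="conll2003", fold_count=10)))
```

The returned list can be passed to subprocess.run from the root of the repository.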

CONLL 2003

DeLFT comes with various trained models for the CoNLL-2003 NER dataset.

By default, the BidLSTM-CRF architecture is used. With this available model, glove-840B word embeddings, and optimisation of hyperparameters, the current f1 score on the CoNLL 2003 testb set is 91.35 (best run over 10 training runs, using the train set for training and testa for validation), as compared to the 90.94 reported in [1], or 90.75 when averaged over 10 training runs. The best model f1 score becomes 91.60 when using both train and testa (validation set) for training (best run over 10 training runs), as is done by (Chiu & Nichols, 2016) or some recent works like (Peters and al., 2017).

Using the BidLSTM-CRF model with ELMo embeddings, following [5] with some parameter optimisations and warm-up, makes the predictions around 30 times slower but currently improves the f1 score on CoNLL 2003 to 92.47 (averaged over 10 training runs, 92.71 for the best model, using the train set for training and testa for validation), or 92.69 (averaged over 10 training runs, 93.09 for the best model) when training with the validation set (as in the paper Peters and al., 2017).

For re-training a model, the CoNLL-2003 NER dataset (eng.train, eng.testa, eng.testb) must be present under data/sequenceLabelling/CoNLL-2003/ in the IOB2 tagging scheme (look here for instance ;) and here. The CoNLL 2003 dataset (English) is the default dataset and English is the default language, but you can also indicate it explicitly as a parameter with --dataset-type conll2003 and specify the language explicitly with --lang en.
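If the copy of the dataset at hand is tagged with the IOB1 variant (an assumption to check on your own files, where B- is only used between two adjacent entities of the same type), the tags have to be converted before training. A minimal sketch of such a conversion, with the hypothetical name iob1_to_iob2:

```python
from typing import List


def iob1_to_iob2(tags: List[str]) -> List[str]:
    """Rewrite the tags of one sentence so that every entity starts with B-."""
    converted: List[str] = []
    previous = "O"
    for tag in tags:
        # An I- tag opens a new entity if it follows O or an entity of another type.
        if tag.startswith("I-") and (previous == "O" or previous[2:] != tag[2:]):
            tag = "B-" + tag[2:]
        converted.append(tag)
        previous = tag
    return converted


if __name__ == "__main__":
    print(iob1_to_iob2(["I-ORG", "I-ORG", "O", "I-PER", "B-PER", "I-LOC"]))
```

The conversion must be applied sentence by sentence, since the tag of the previous sentence must not influence the first token of the next one.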

For training and evaluating following the traditional approach (training with the train set without validation set, and evaluating on test set), use:

python3 nerTagger.py --dataset-type conll2003 train_eval

To use ELMo contextual embeddings, add the parameter --use-ELMo. This will considerably slow down (30 times) the first epoch of the training; the contextual embeddings are then cached and the rest of the training will be similar to usual embeddings in terms of training time. Alternatively add --use-BERT to use BERT extracted features as contextual embeddings to the RNN architecture.

python3 nerTagger.py --dataset-type conll2003 --use-ELMo train_eval

Some recent works like (Chiu & Nichols, 2016) and (Peters and al., 2017) also train with the validation set, leading obviously to a better accuracy (still they compare their scores with scores previously reported trained differently, which is arguably a bit unfair - this aspect is mentioned in (Ma & Hovy, 2016)). To train with both train and validation sets, use the parameter --train-with-validation-set:

python3 nerTagger.py --dataset-type conll2003 --train-with-validation-set train_eval

Note that, by default, the BidLSTM-CRF model is used. (Documentation on selecting other models and setting hyperparameters to be included here !)

For evaluating against CoNLL 2003 testb set with the existing model:

python3 nerTagger.py --dataset-type conll2003 eval

    Evaluation on test set:
        f1 (micro): 91.35
                 precision    recall  f1-score   support

            ORG     0.8795    0.9007    0.8899      1661
            PER     0.9647    0.9623    0.9635      1617
           MISC     0.8261    0.8120    0.8190       702
            LOC     0.9260    0.9305    0.9282      1668

    avg / total     0.9109    0.9161    0.9135      5648
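The summary line of such a report can be checked by hand. As a worked example on the figures above: the f1 score is the harmonic mean of precision and recall, and the overall recall is the support-weighted mean of the per-class recalls. The function names below are hypothetical.

```python
from typing import List, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_recall(per_class: List[Tuple[float, int]]) -> float:
    """Support-weighted mean of per-class recalls, given (recall, support) pairs."""
    total = sum(support for _, support in per_class)
    return sum(recall * support for recall, support in per_class) / total


if __name__ == "__main__":
    # Figures taken from the report above (ORG, PER, MISC, LOC).
    print(round(f1_score(0.9109, 0.9161), 4))
    print(round(weighted_recall([(0.9007, 1661), (0.9623, 1617), (0.8120, 702), (0.9305, 1668)]), 4))
```

Both values agree with the "avg / total" line of the report (f1 0.9135, recall 0.9161) up to rounding.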

If the model has been trained also with the validation set (--train-with-validation-set), similarly to (Chiu & Nichols, 2016) or (Peters and al., 2017), results are significantly better:

    Evaluation on test set:
        f1 (micro): 91.60
                 precision    recall    f1-score    support

            LOC     0.9219    0.9418    0.9318      1668
           MISC     0.8277    0.8077    0.8176       702
            PER     0.9594    0.9635    0.9614      1617
            ORG     0.9029    0.8904    0.8966      1661

    avg / total     0.9158    0.9163    0.9160      5648

Using ELMo with the best model obtained over 10 training runs (not using the validation set for training, only for early stop):

    Evaluation on test set:
        f1 (micro): 92.71
                      precision    recall  f1-score   support

                 PER     0.9787    0.9672    0.9729      1617
                 LOC     0.9368    0.9418    0.9393      1668
                MISC     0.8237    0.8319    0.8278       702
                 ORG     0.9072    0.9181    0.9126      1661

    all (micro avg.)     0.9257    0.9285    0.9271      5648

Using ELMo and training with the validation set gives an f-score of 93.09 (best model), 92.69 averaged over 10 runs (the best model is provided under data/models/sequenceLabelling/ner-en-conll2003-BidLSTM_CRF/with_validation_set/).

Using BERT architecture for sequence labelling (pre-trained transformer with fine-tuning), for instance here the bert-base-en, cased, pre-trained model, use:

python3 nerTagger.py --architecture bert-base-en --dataset-type conll2003 --fold-count 10 train_eval

average over 10 folds
            precision    recall  f1-score   support

       ORG     0.8804    0.9114    0.8957      1661
      MISC     0.7823    0.8189    0.8002       702
       PER     0.9633    0.9576    0.9605      1617
       LOC     0.9290    0.9316    0.9303      1668

  macro f1 = 0.9120
  macro precision = 0.9050
  macro recall = 0.9191

For training with all the available data:

python3 nerTagger.py --dataset-type conll2003 train

To take into account the strong impact of the random seed, you need to train multiple times with the --fold-count option. The model will be trained n times with different seed values but with the same sets if the evaluation set is provided. The evaluation will then give the average scores over these n models (against the test set) and the scores of the best model, which will be saved. For training 10 times, for instance, use:

python3 nerTagger.py --dataset-type conll2003 --fold-count 10 train_eval
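The logic behind repeated training can be sketched as follows: run the same training with different seeds, then report the average score and keep the best run. This is a minimal illustration under assumptions, not DeLFT's code; repeat_training and the train_and_score callback are hypothetical, and the callback used in the demo is a random stand-in for a real training run.

```python
import random
import statistics
from typing import Callable, List, Tuple


def repeat_training(
    train_and_score: Callable[[int], float], n_runs: int, base_seed: int = 0
) -> Tuple[float, float, int]:
    """Train n_runs times with different seeds; return (average, best score, best seed)."""
    results: List[Tuple[float, int]] = []
    for run in range(n_runs):
        seed = base_seed + run
        results.append((train_and_score(seed), seed))
    best_score, best_seed = max(results)
    average = statistics.mean([score for score, _ in results])
    return average, best_score, best_seed


if __name__ == "__main__":
    def fake_training(seed: int) -> float:
        # Stand-in for a real training run: returns a pseudo-random "f-score".
        return random.Random(seed).uniform(0.90, 0.92)

    average, best, seed = repeat_training(fake_training, n_runs=10)
    print(f"average: {average:.4f}, best: {best:.4f} (seed {seed})")
```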

After training a model, for tagging some text, for instance in a file data/test/test.ner.en.txt, use the command:

python3 nerTagger.py --dataset-type conll2003 --file-in data/test/test.ner.en.txt tag

For instance for tagging the text with a specific architecture:

python3 nerTagger.py --dataset-type conll2003 --file-in data/test/test.ner.en.txt --architecture bert-base-en tag

Note that, currently, the input text file must contain one sentence per line, so the text must be presegmented into sentences. To obtain the JSON annotations in a text file instead of in the standard output, use the parameter --file-out. Predictions work at around 7400 tokens per second for the BidLSTM_CRF architecture with a GeForce GTX 1080 Ti.
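Since the input must contain one sentence per line, raw text needs a segmentation step first. The following is a deliberately naive sketch (splitting after ., ! or ? when followed by whitespace and an uppercase letter); it will mis-split on abbreviations such as "Dr. Smith", so a real sentence segmenter is preferable for production data. The function name split_sentences is hypothetical.

```python
import re
from typing import List

# Split on whitespace that follows ., ! or ? and precedes an uppercase letter.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text: str) -> List[str]:
    """Return the sentences of `text`, one string per sentence."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(text.strip()) if s.strip()]


if __name__ == "__main__":
    raw = (
        "The University of California has found that 40 percent of its students "
        "suffer food insecurity. At four state universities in Illinois, that "
        "number is 35 percent."
    )
    # One sentence per line, ready to be written to the --file-in file.
    print("\n".join(split_sentences(raw)))
```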

This produces a JSON output with entities, scores and character offsets like this:

{
    "runtime": 0.34,
    "texts": [
        {
            "text": "The University of California has found that 40 percent of its students suffer food insecurity. At four state universities in Illinois, that number is 35 percent.",
            "entities": [
                {
                    "text": "University of California",
                    "endOffset": 32,
                    "score": 1.0,
                    "class": "ORG",
                    "beginOffset": 4
                },
                {
                    "text": "Illinois",
                    "endOffset": 134,
                    "score": 1.0,
                    "class": "LOC",
                    "beginOffset": 125
                }
            ]
        },
        {
            "text": "President Obama is not speaking anymore from the White House.",
            "entities": [
                {
                    "text": "Obama",
                    "endOffset": 18,
                    "score": 1.0,
                    "class": "PER",
                    "beginOffset": 10
                },
                {
                    "text": "White House",
                    "endOffset": 61,
                    "score": 1.0,
                    "class": "LOC",
                    "beginOffset": 49
                }
            ]
        }
    ],
    "software": "DeLFT",
    "date": "2018-05-02T12:24:55.529301",
    "model": "ner"
}

If you have trained the model with ELMo, you need to indicate that the ELMo-based model must be used when annotating, with the parameter --use-ELMo (note that the runtime impact is significant as compared to traditional embeddings):

python3 nerTagger.py --dataset-type conll2003 --use-ELMo --file-in data/test/test.ner.en.txt tag
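The JSON produced by the tag action (see the example output earlier in this section) is easy to post-process. A minimal sketch, assuming the structure of that example ("texts" entries holding an "entities" list with "text", "class" and "score" keys); the helper names are hypothetical.

```python
import json
from collections import Counter
from typing import Any, Dict, List, Tuple


def list_entities(result: Dict[str, Any]) -> List[Tuple[str, str, float]]:
    """Flatten the tagger output into (entity text, class, score) tuples."""
    return [
        (entity["text"], entity["class"], entity["score"])
        for sentence in result.get("texts", [])
        for entity in sentence.get("entities", [])
    ]


def count_by_class(result: Dict[str, Any]) -> Counter:
    """Number of predicted entities per class."""
    return Counter(label for _, label, _ in list_entities(result))


if __name__ == "__main__":
    # Abbreviated version of the example output (offsets omitted).
    output = json.loads("""
    {"texts": [{"text": "President Obama is not speaking anymore from the White House.",
                "entities": [{"text": "Obama", "class": "PER", "score": 1.0},
                             {"text": "White House", "class": "LOC", "score": 1.0}]}],
     "software": "DeLFT", "model": "ner"}
    """)
    for text, label, score in list_entities(output):
        print(label, text, score)
```

When the annotations were written with --file-out, the same functions can be applied to the result of json.load on that file.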

For English NER tagging, the default static embeddings are Glove (glove-840B). Other static embeddings can be specified with the parameter --embedding, for instance:

python3 nerTagger.py --dataset-type conll2003 --embedding word2vec train_eval

Ontonotes 5.0 CONLL 2012

DeLFT comes with pre-trained models with the Ontonotes 5.0 CoNLL-2012 NER dataset. As dataset-type identifier, use conll2012. All the options valid for CoNLL-2003 NER dataset are usable for this dataset. Default static embeddings for Ontonotes are fasttext-crawl, which can be changed with parameter --embedding.

With the default BidLSTM-CRF architecture, FastText embeddings and without any parameter tuning, f1 score is 86.65 averaged over 10 training runs, with the best run at 87.01 (provided model) when trained with the train set strictly.

With ELMo, f-score is 88.66 averaged over 10 training runs, with the best run at 89.01.

For re-training, the assembled Ontonotes datasets following CoNLL-2012 must be available and converted into IOB2 tagging scheme, see here for more details. To train and evaluate following the traditional approach (training with the train set without validation set, and evaluating on test set), use:

python3 nerTagger.py --dataset-type conll2012 train_eval

Evaluation on test set:
	f1 (micro): 87.01
                  precision    recall  f1-score   support

            DATE     0.8029    0.8695    0.8349      1602
        CARDINAL     0.8130    0.8139    0.8135       935
          PERSON     0.9061    0.9371    0.9214      1988
             GPE     0.9617    0.9411    0.9513      2240
             ORG     0.8799    0.8568    0.8682      1795
           MONEY     0.8903    0.8790    0.8846       314
            NORP     0.9226    0.9501    0.9361       841
         ORDINAL     0.7873    0.8923    0.8365       195
            TIME     0.5772    0.6698    0.6201       212
     WORK_OF_ART     0.6000    0.5060    0.5490       166
             LOC     0.7340    0.7709    0.7520       179
           EVENT     0.5000    0.5556    0.5263        63
         PRODUCT     0.6528    0.6184    0.6351        76
         PERCENT     0.8717    0.8567    0.8642       349
        QUANTITY     0.7155    0.7905    0.7511       105
             FAC     0.7167    0.6370    0.6745       135
        LANGUAGE     0.8462    0.5000    0.6286        22
             LAW     0.7308    0.4750    0.5758        40

all (micro avg.)     0.8647    0.8755    0.8701     11257

With ELMo embeddings (using the default hyper-parameters, except the batch size which is increased to better learn the less frequent classes):

Evaluation on test set:
  f1 (micro): 89.01
                  precision    recall  f1-score   support

             LAW     0.7188    0.5750    0.6389        40
         PERCENT     0.8946    0.8997    0.8971       349
           EVENT     0.6212    0.6508    0.6357        63
        CARDINAL     0.8616    0.7722    0.8144       935
        QUANTITY     0.7838    0.8286    0.8056       105
            NORP     0.9232    0.9572    0.9399       841
             LOC     0.7459    0.7709    0.7582       179
            DATE     0.8629    0.8252    0.8437      1602
        LANGUAGE     0.8750    0.6364    0.7368        22
             GPE     0.9637    0.9607    0.9622      2240
         ORDINAL     0.8145    0.9231    0.8654       195
             ORG     0.9033    0.8903    0.8967      1795
           MONEY     0.8851    0.9076    0.8962       314
             FAC     0.8257    0.6667    0.7377       135
            TIME     0.6592    0.6934    0.6759       212
          PERSON     0.9350    0.9477    0.9413      1988
     WORK_OF_ART     0.6467    0.7169    0.6800       166
         PRODUCT     0.6867    0.7500    0.7170        76

all (micro avg.)     0.8939    0.8864    0.8901     11257

To train ten models and report the average, worst and best model scores with ELMo embeddings, use:

python3 nerTagger.py --dataset-type conll2012 --use-ELMo --fold-count 10 train_eval

French model (based on Le Monde corpus)

Note that Le Monde corpus is subject to copyrights and is limited to research usage only, it is usually referred to as "corpus FTB". The corpus file ftb6_ALL.EN.docs.relinked.xml must be located under delft/data/sequenceLabelling/leMonde/. This is the default French model, so it will be used by simply indicating the language as parameter: --lang fr, but you can also indicate explicitly the dataset with --dataset-type ftb. Default static embeddings for French language models are wiki.fr, which can be changed with parameter --embedding.

Similarly as before, for training and evaluating use:

python3 nerTagger.py --lang fr --dataset-type ftb train_eval

In practice, we need to repeat training and evaluation several times to neutralise random seed effects and to average scores, here ten times:

python3 nerTagger.py --lang fr --dataset-type ftb --fold-count 10 train_eval

The performance is as follows, for the BidLSTM-CRF architecture and fasttext wiki.fr embeddings, with an f-score of 91.01 averaged over 10 training runs:

average over 10 folds
  macro f1 = 0.9100881012386587
  macro precision = 0.9048633201198737
  macro recall = 0.9153907496012759 

** Worst ** model scores - 

                  precision    recall  f1-score   support

      <location>     0.9467    0.9647    0.9556       368
   <institution>     0.8621    0.8333    0.8475        30
      <artifact>     1.0000    0.5000    0.6667         4
  <organisation>     0.9146    0.8089    0.8585       225
        <person>     0.9264    0.9522    0.9391       251
      <business>     0.8463    0.8936    0.8693       376

all (micro avg.)     0.9040    0.9083    0.9061      1254

** Best ** model scores - 

                  precision    recall  f1-score   support

      <location>     0.9439    0.9592    0.9515       368
   <institution>     0.8667    0.8667    0.8667        30
      <artifact>     1.0000    0.5000    0.6667         4
  <organisation>     0.8813    0.8578    0.8694       225
        <person>     0.9453    0.9641    0.9546       251
      <business>     0.8706    0.9122    0.8909       376

all (micro avg.)     0.9090    0.9242    0.9166      1254

With frELMo:

python3 nerTagger.py --lang fr --dataset-type ftb --fold-count 10 --use-ELMo train_eval

average over 10 folds
    macro f1 = 0.9209397554337976
    macro precision = 0.91949107960079
    macro recall = 0.9224082934609251 

** Worst ** model scores - 

                  precision    recall  f1-score   support

  <organisation>     0.8704    0.8356    0.8526       225
        <person>     0.9344    0.9641    0.9490       251
      <artifact>     1.0000    0.5000    0.6667         4
      <location>     0.9173    0.9647    0.9404       368
   <institution>     0.8889    0.8000    0.8421        30
      <business>     0.9130    0.8936    0.9032       376

all (micro avg.)     0.9110    0.9147    0.9129      1254

** Best ** model scores - 

                  precision    recall  f1-score   support

  <organisation>     0.9061    0.8578    0.8813       225
        <person>     0.9416    0.9641    0.9528       251
      <artifact>     1.0000    0.5000    0.6667         4
      <location>     0.9570    0.9674    0.9622       368
   <institution>     0.8889    0.8000    0.8421        30
      <business>     0.9016    0.9255    0.9134       376

all (micro avg.)     0.9268    0.9290    0.9279      1254

For historical reasons, we can also consider a particular split of the FTB corpus into train, dev and test sets with a forced tokenization (like the old CoNLL 2013 NER), which was used in previous work for comparison. Obviously the evaluation is dependent on this particular split, and n-fold cross validation is a much better practice and should be preferred (as well as a format that does not force a tokenization). To use the forced split FTB (using the files ftb6_dev.conll, ftb6_test.conll and ftb6_train.conll located under delft/data/sequenceLabelling/leMonde/), use the parameter --dataset-type ftb_force_split:

python3 nerTagger.py --lang fr --dataset-type ftb_force_split --fold-count 10 train_eval

which gives, for the BidLSTM-CRF architecture and fasttext wiki.fr embeddings, an f-score of 86.37 averaged over 10 training runs:

average over 10 folds
                    precision    recall  f1-score   support

      Organization     0.8410    0.7431    0.7888       311
            Person     0.9086    0.9327    0.9204       205
          Location     0.9219    0.9144    0.9181       347
           Company     0.8140    0.8603    0.8364       290
  FictionCharacter     0.0000    0.0000    0.0000         2
           Product     1.0000    1.0000    1.0000         3
               POI     0.0000    0.0000    0.0000         0
           company     0.0000    0.0000    0.0000         0

  macro f1 = 0.8637
  macro precision = 0.8708
  macro recall = 0.8567 

** Worst ** model scores -
                  precision    recall  f1-score   support

    Organization     0.8132    0.7138    0.7603       311
        Location     0.9152    0.9020    0.9086       347
         Company     0.7926    0.8172    0.8048       290
          Person     0.9095    0.9317    0.9205       205
         Product     1.0000    1.0000    1.0000         3
FictionCharacter     0.0000    0.0000    0.0000         2

all (micro avg.)     0.8571    0.8342    0.8455      1158

** Best ** model scores -
                  precision    recall  f1-score   support

    Organization     0.8542    0.7910    0.8214       311
        Location     0.9226    0.9280    0.9253       347
         Company     0.8212    0.8552    0.8378       290
          Person     0.9095    0.9317    0.9205       205
         Product     1.0000    1.0000    1.0000         3
FictionCharacter     0.0000    0.0000    0.0000         2

all (micro avg.)     0.8767    0.8722    0.8745      1158
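The f1-score column in these reports is the harmonic mean of precision and recall, and a micro average pools the counts of all labels before computing the scores. The sketch below illustrates both calculations; the helper names `f1_score` and `micro_average` and the toy counts are assumptions made for illustration, not DeLFT code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_average(counts):
    """Pool (true positive, false positive, false negative) counts over all labels."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Organization row of the best model above: precision 0.8542, recall 0.7910.
organization_f1 = round(f1_score(0.8542, 0.7910), 4)  # 0.8214, as reported

# Hypothetical counts, for illustration only: label -> (tp, fp, fn).
toy_counts = {"Person": (8, 2, 0), "Location": (2, 0, 2)}
micro_p, micro_r, micro_f1 = micro_average(toy_counts)
```

Labels with zero support, such as POI in the averaged report, contribute nothing to the pooled counts, which is one reason micro and macro figures can differ.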

With frELMo:

python3 nerTagger.py --lang fr --dataset-type ftb_force_split --fold-count 10 --use-ELMo train_eval

average over 10 folds
                    precision    recall  f1-score   support

      Organization     0.8605    0.7752    0.8155       311
            Person     0.9227    0.9371    0.9298       205
          Location     0.9281    0.9432    0.9356       347
           Company     0.8401    0.8779    0.8585       290
  FictionCharacter     0.1000    0.0500    0.0667         2
           Product     0.8750    1.0000    0.9286         3
               POI     0.0000    0.0000    0.0000         0
           company     0.0000    0.0000    0.0000         0

  macro f1 = 0.8831
  macro precision = 0.8870
  macro recall = 0.8793 

** Worst ** model scores -
                  precision    recall  f1-score   support

        Location     0.9366    0.9366    0.9366       347
    Organization     0.8309    0.7428    0.7844       311
          Person     0.9268    0.9268    0.9268       205
         Company     0.8179    0.8828    0.8491       290
         Product     0.7500    1.0000    0.8571         3
FictionCharacter     0.0000    0.0000    0.0000         2

all (micro avg.)     0.8762    0.8679    0.8720      1158

** Best ** model scores -
                  precision    recall  f1-score   support

        Location     0.9220    0.9539    0.9377       347
    Organization     0.8777    0.7846    0.8285       311
          Person     0.9187    0.9366    0.9275       205
         Company     0.8444    0.9172    0.8793       290
         Product     1.0000    1.0000    1.0000         3
FictionCharacter     0.0000    0.0000    0.0000         2

all (micro avg.)     0.8900    0.8946    0.8923      1158

For the ftb_force_split dataset, as for CoNLL 2003, you can use the train_with_validation_set parameter to add the validation set to the training data. The above results are all obtained without using train_with_validation_set (which is the common approach).

Finally, for training with the whole dataset without evaluation (e.g. for production):

python3 nerTagger.py --lang fr --dataset-type ftb train

and for annotating some examples:

python3 nerTagger.py --lang fr --dataset-type ftb --file-in data/test/test.ner.fr.txt tag

{
    "date": "2018-06-11T21:25:03.321818",
    "runtime": 0.511,
    "software": "DeLFT",
    "model": "ner-fr-lemonde",
    "texts": [
        {
            "entities": [
                {
                    "beginOffset": 5,
                    "endOffset": 13,
                    "score": 1.0,
                    "text": "Allemagne",
                    "class": "<location>"
                },
                {
                    "beginOffset": 57,
                    "endOffset": 68,
                    "score": 1.0,
                    "text": "Donald Trump",
                    "class": "<person>"
                }
            ],
            "text": "Or l’Allemagne pourrait préférer la retenue, de peur que Donald Trump ne surtaxe prochainement les automobiles étrangères."
        }
    ]
}

The above work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
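In the tagging output shown above, beginOffset and endOffset appear to be character positions in the input text, with endOffset inclusive (5 to 13 covers the nine characters of "Allemagne"). The following sketch recovers the mentions from such a result; the function name `extract_mentions` is hypothetical and the inclusive-offset reading is an assumption drawn from this one sample.

```python
# Result of the tag action, copied from the sample output as a Python dict.
result = {
    "software": "DeLFT",
    "model": "ner-fr-lemonde",
    "texts": [
        {
            "text": "Or l’Allemagne pourrait préférer la retenue, de peur que "
                    "Donald Trump ne surtaxe prochainement les automobiles étrangères.",
            "entities": [
                {"text": "Allemagne", "class": "<location>",
                 "beginOffset": 5, "endOffset": 13, "score": 1.0},
                {"text": "Donald Trump", "class": "<person>",
                 "beginOffset": 57, "endOffset": 68, "score": 1.0},
            ],
        }
    ],
}


def extract_mentions(result):
    """Return (surface form, class) pairs by slicing each text with its offsets."""
    mentions = []
    for item in result["texts"]:
        for entity in item["entities"]:
            # endOffset is treated as inclusive, hence the + 1 (assumption).
            span = item["text"][entity["beginOffset"]:entity["endOffset"] + 1]
            mentions.append((span, entity["class"]))
    return mentions


mentions = extract_mentions(result)
```

Slicing with the offsets rather than trusting the "text" field is a cheap consistency check when the output is post-processed elsewhere.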

GROBID models

DeLFT supports GROBID training data (originally for CRF) and GROBID feature matrix to be labelled. Default static embeddings for GROBID models are glove-840B, which can be changed with parameter --embedding.

Train a model:

python3 grobidTagger.py name-of-model train

where name-of-model is one of the GROBID models (date, affiliation-address, citation, header, name-citation, name-header, ...), for instance:

python3 grobidTagger.py date train

To split the training data and evaluate on 10%:

python3 grobidTagger.py name-of-model train_eval

For instance for the date model:

python3 grobidTagger.py date train_eval

        f1 (micro): 96.41
                 precision    recall  f1-score   support

        <month>     0.9667    0.9831    0.9748        59
         <year>     1.0000    0.9844    0.9921        64
          <day>     0.9091    0.9524    0.9302        42

    avg / total     0.9641    0.9758    0.9699       165

For applying a model on some examples:

python3 grobidTagger.py date tag

{
    "runtime": 0.509,
    "software": "DeLFT",
    "model": "grobid-date",
    "date": "2018-05-23T14:18:15.833959",
    "texts": [
        {
            "entities": [
                {
                    "score": 1.0,
                    "endOffset": 6,
                    "class": "<month>",
                    "beginOffset": 0,
                    "text": "January"
                },
                {
                    "score": 1.0,
                    "endOffset": 11,
                    "class": "<year>",
                    "beginOffset": 8,
                    "text": "2006"
                }
            ],
            "text": "January 2006"
        },
        {
            "entities": [
                {
                    "score": 1.0,
                    "endOffset": 4,
                    "class": "<month>",
                    "beginOffset": 0,
                    "text": "March"
                },
                {
                    "score": 1.0,
                    "endOffset": 13,
                    "class": "<day>",
                    "beginOffset": 10,
                    "text": "27th"
                },
                {
                    "score": 1.0,
                    "endOffset": 19,
                    "class": "<year>",
                    "beginOffset": 16,
                    "text": "2001"
                }
            ],
            "text": "March the 27th, 2001"
        }
    ]
}

As usual, the architecture to be used for the indicated model can be specified with the --architecture parameter:

python3 grobidTagger.py citation train_eval --architecture BidLSTM_CRF_FEATURES

With the architectures having a feature channel, the categorical features (as generated by GROBID) will be automatically selected (typically the layout and lexical class features). The models not having a feature channel will only use the tokens as input (as the usual Deep Learning models for text).

Similarly to the NER models, to use ELMo contextual embeddings, add the parameter --use-ELMo, e.g.:

python3 grobidTagger.py citation --use-ELMo train_eval

Add the parameter --use-BERT to use BERT extracted features as contextual embeddings for the RNN architecture.

Similarly to the NER models, for n-fold training (action train_eval only), specify the value of n with the parameter --fold-count, e.g.:

python3 grobidTagger.py citation --fold-count=10 train_eval

By default the Grobid data to be used are the ones available under the data/sequenceLabelling/grobid subdirectory, but a Grobid data file can be provided by the parameter --input:

python3 grobidTagger.py name-of-model train --input path-to-the-grobid-data-file-to-be-used-for-training


python3 grobidTagger.py name-of-model train_eval --input path-to-the-grobid-data-file-to-be-used-for-training_and_eval_with_random_split

The evaluation of a model with a specific Grobid data file can be performed using the eval action and specifying the data file with --input:

python3 grobidTagger.py citation eval --input path-to-the-grobid-data-file-to-be-used-for-evaluation

The evaluation of a model can be performed calling

python3 grobidTagger.py citation eval --input evaluation_data

Insult recognition

A small experimental model for recognising insults and threats in texts, based on Wikipedia comments from the Kaggle Wikipedia Toxic Comments dataset, English only. This uses a small dataset labelled manually.

For training:

python3 insultTagger.py train

By default training uses the whole train set.

Example of a small tagging test:

python3 insultTagger.py tag

will produce (socially offensive language warning!) a result like this:

{
    "runtime": 0.969,
    "texts": [
        {
            "entities": [],
            "text": "This is a gentle test."
        },
        {
            "entities": [
                {
                    "score": 1.0,
                    "endOffset": 20,
                    "class": "<insult>",
                    "beginOffset": 9,
                    "text": "moronic wimp"
                },
                {
                    "score": 1.0,
                    "endOffset": 56,
                    "class": "<threat>",
                    "beginOffset": 54,
                    "text": "die"
                }
            ],
            "text": "you're a moronic wimp who is too lazy to do research! die in hell !!"
        }
    ],
    "software": "DeLFT",
    "date": "2018-05-14T17:22:01.804050",
    "model": "insult"
}

Creating your own model

As long as your task is sequence labelling of text, adding a new corpus and creating an additional model should be straightforward. If you want to build a model named toto based on labelled data in one of the supported formats (CoNLL, TEI or GROBID CRF), create the subdirectory data/sequenceLabelling/toto and copy your training data under it.

(To be completed)

Text classification

Available models

All the following models include Dropout, Pooling and Dense layers with hyperparameters tuned for reasonable performance across standard text classification tasks. If necessary, they are a good basis for further performance tuning.

  • gru: two layers Bidirectional GRU
  • gru_simple: one layer Bidirectional GRU
  • bidLstm: a Bidirectional LSTM layer followed by an Attention layer
  • cnn: convolutional layers followed by a GRU
  • lstm_cnn: LSTM followed by convolutional layers
  • mix1: one layer Bidirectional GRU followed by a Bidirectional LSTM
  • dpcnn: Deep Pyramid Convolutional Neural Networks (but not working as expected - to be reviewed)

also available (via TensorFlow):

  • bert or scibert: BERT (Bidirectional Encoder Representations from Transformers) architecture (classification corresponds to a fine tuning)

Note: by default the first 300 tokens of the text to be classified are used, which is more than enough for any short-text classification task and works fine with a low-profile GPU (for instance a GeForce GTX 1050 Ti with 4 GB memory). To take a larger portion of the text into account, modify the config model parameter maxlen. However, using more than 1000 tokens, for instance, requires a modern GPU with enough memory (e.g. 10 GB).
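To make the effect of maxlen concrete, the sketch below keeps only the first tokens of a text, as described in the note above. It assumes plain whitespace tokenization, which is a simplification and not DeLFT's own tokenizer, and the function name `truncate_tokens` is hypothetical.

```python
def truncate_tokens(text, maxlen=300):
    """Keep at most the first `maxlen` whitespace-separated tokens of `text`."""
    tokens = text.split()
    return tokens[:maxlen]


# A synthetic 1000-token document: everything after the 300th token is ignored.
long_text = " ".join(f"token{i}" for i in range(1000))
kept = truncate_tokens(long_text)

# A short text is left untouched.
short = truncate_tokens("a short comment")
```

Anything past the cut-off never reaches the model, so a larger maxlen only helps when the relevant signal sits late in the text.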

For all these RNN architectures, it is possible to use ELMo contextual embeddings (--use-ELMo) or BERT extracted features as embeddings (--use-BERT). The integration of BERT as an additional non-RNN architecture is done via TensorFlow; we do not mix Keras and TensorFlow layers.


Toxic comment classification

The dataset of the Kaggle Toxic Comment Classification challenge can be found here: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data

This is a multi-label regression problem, where a Wikipedia comment (or any similar short text) should be associated with 6 possible types of toxicity (toxic, severe_toxic, obscene, threat, insult, identity_hate).
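In a multi-label setting each of the 6 types receives its own score, so a comment can be flagged with several types at once or with none. The sketch below illustrates this with one sigmoid per label and a 0.5 decision threshold; both choices, the raw scores and the helper names are assumptions for illustration and do not describe the internals of toxicCommentClassifier.py.

```python
import math

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def sigmoid(x):
    """Squash a raw score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def score_labels(raw_scores):
    """Map one raw score per label to an independent value in (0, 1)."""
    return {label: sigmoid(raw) for label, raw in zip(LABELS, raw_scores)}


def active_labels(scores, threshold=0.5):
    """Return every label whose score reaches the threshold."""
    return [label for label, score in scores.items() if score >= threshold]


# Hypothetical raw outputs for a single comment.
scores = score_labels([2.0, -3.0, 1.5, -4.0, 0.5, -2.5])
flagged = active_labels(scores)
```

Because the scores are independent, they do not have to sum to 1, unlike the output of a softmax over mutually exclusive classes.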

To launch the training:

python3 toxicCommentClassifier.py train

For training with n-folds, use the parameter --fold-count:

python3 toxicCommentClassifier.py train --fold-count 10

After training (1 or n-folds), to process the Kaggle test set, use:

python3 toxicCommentClassifier.py test

To classify a set of comments:

python3 toxicCommentClassifier.py classify

Citation classification

We use the dataset developed and presented by A. Athar in the following article:

[7] Awais Athar. "Sentiment Analysis of Citations using Sentence Structure-Based Features". Proceedings of the ACL 2011 Student Session, 81-87, 2011. http://www.aclweb.org/anthology/P11-3015

For a given scientific article, the task is to estimate if the occurrence of a bibliographical citation is positive, neutral or negative given its citation context. Note that the dataset, similarly to the Toxic Comment classification, is highly unbalanced (86% of the citations are neutral).

In this example, we formulate the problem as a 3-class regression (negative, neutral, positive). To train the model:

python3 citationClassifier.py train

with n-folds:

python3 citationClassifier.py train --fold-count 10

Training and evaluation (ratio) with 10 folds:

python3 citationClassifier.py train_eval --fold-count 10

which should produce the following evaluation (using the 2-layers Bidirectional GRU model gru):

Evaluation on 896 instances:
                   precision        recall       f-score       support
      negative        0.1494        0.4483        0.2241            29
       neutral        0.9653        0.8058        0.8784           793
      positive        0.3333        0.6622        0.4434            74

As with the other scripts, use --architecture to specify an alternative DL architecture, for instance SciBERT:

python3 citationClassifier.py train_eval --architecture scibert

Evaluation on 896 instances:
                   precision        recall       f-score       support
      negative        0.1712        0.6552        0.2714            29
       neutral        0.9740        0.8020        0.8797           793
      positive        0.4015        0.7162        0.5146            74

To classify a set of citation contexts with default model (2-layers Bidirectional GRU model gru):

python3 citationClassifier.py classify

which will produce some JSON output like this:

{
    "model": "citations",
    "date": "2018-05-13T16:06:12.995944",
    "software": "DeLFT",
    "classifications": [
        {
            "negative": 0.001178970211185515,
            "text": "One successful strategy [15] computes the set-similarity involving (multi-word) keyphrases about the mentions and the entities, collected from the KG.",
            "neutral": 0.187219500541687,
            "positive": 0.8640883564949036
        },
        {
            "negative": 0.4590276777744293,
            "text": "Unfortunately, fewer than half of the OCs in the DAML02 OC catalog (Dias et al. 2002) are suitable for use with the isochrone-fitting method because of the lack of a prominent main sequence, in addition to an absence of radial velocity and proper-motion data.",
            "neutral": 0.3570767939090729,
            "positive": 0.18021513521671295
        },
        {
            "negative": 0.0726129561662674,
            "text": "However, we found that the pairwise approach LambdaMART [41] achieved the best performance on our datasets among most learning to rank algorithms.",
            "neutral": 0.12469841539859772,
            "positive": 0.8224021196365356
        }
    ],
    "runtime": 1.202
}

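Each entry of the classification output above carries one score per class. A simple way to turn these scores into a single label is to take the class with the highest score, as in the sketch below; the helper name `predicted_class` is hypothetical and the texts are omitted for brevity.

```python
CLASSES = ("negative", "neutral", "positive")

# Scores copied from the sample output above.
classifications = [
    {"negative": 0.001178970211185515, "neutral": 0.187219500541687,
     "positive": 0.8640883564949036},
    {"negative": 0.4590276777744293, "neutral": 0.3570767939090729,
     "positive": 0.18021513521671295},
    {"negative": 0.0726129561662674, "neutral": 0.12469841539859772,
     "positive": 0.8224021196365356},
]


def predicted_class(entry):
    """Return the class name with the highest score in one classification entry."""
    return max(CLASSES, key=lambda name: entry[name])


labels = [predicted_class(entry) for entry in classifications]
```

Given how unbalanced the dataset is, a per-class threshold may be a better choice than a plain maximum in practice; that would need to be validated on held-out data.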


  • The integration of FLAIR contextual embeddings (branch flair and flair2) raised several issues and we did not manage to reproduce the results from the full FLAIR implementation. We should experiment with https://github.com/kensho-technologies/bubs, a Keras/TensorFlow reimplementation of the Flair Contextualized Embeddings.

  • Try to migrate to TF 2.0 and tf.keras

  • Review/rewrite the current Linear Chain CRF layer that we are using. This Keras CRF implementation is (i) a runtime bottleneck (we could try to use Cython to improve runtime) and (ii) incomplete in its Viterbi decoding: it does not output final decoded label scores and it cannot output n-best.

  • Port everything to Apache MXNet? :)


  • complete the benchmark with OntoNotes 5 - other languages

  • align the CoNLL corpus tokenisation (CoNLL corpus is "pre-tokenised", but we might not want to follow this particular tokenisation)


  • automatic download of embeddings on demand

  • improve runtime

Build more models and examples...

  • Model for entity disambiguation (deeptype for entity-fishing)

  • Relation extractions (in particular with medical texts)

Note that we are focusing on sequence labelling/information extraction and text classification tasks, which are our main applications, and not on text understanding and machine translation which are the object of already many other Open Source frameworks.


  • Keras CRF implementation by Philipp Gross

  • The evaluations for sequence labelling are based on a modified version of https://github.com/chakki-works/seqeval

  • The preprocessor of the sequence labelling part is derived from https://github.com/Hironsan/anago/

  • ELMo contextual embeddings are developed by the AllenNLP team and we use the TensorFlow library bilm-tf for integrating them into DeLFT.

  • BERT transformer original implementation by Google Research, which has been adapted for text classification and sequence labelling in DeLFT.

  • FastPredict by Marc Stogaitis, adapted to our BERT usages.

License and contact

Distributed under Apache 2.0 license. The dependencies used in the project are either themselves also distributed under Apache 2.0 license or distributed under a compatible license.

Contact: Patrice Lopez ([email protected])

How to cite

If you want to cite this work, please refer to the present GitHub project, together with the Software Heritage project-level permanent identifier. For example, with BibTeX:

    title = {DeLFT},
    howpublished = {\url{https://github.com/kermitt2/delft}},
    publisher = {GitHub},
    year = {2018--2020},
    archivePrefix = {swh},
    eprint = {1:dir:54eb292e1c0af764e27dd179596f64679e44d06e}
  • Implementing features channel

    See #76 for the initial comments (Closed as the source has been changed to be the branch features_layout on kermitt/delft)

    This PR is to solve the issue #42 . I took some code from @de-code 's implementation in https://github.com/elifesciences/sciencebeam-trainer-delft.

    The features are managed by the FeaturePreprocessor, passed to WordPreprocessor to deal with the features and a) pick just the ones specified by the user with the parameter feature_indices or b) select all features with unique cardinality below 12.

    As @de-code did, the value 0 is reserved for possibly unseen feature values not covered by the mapping (which, with the indexes, is stored in the model configuration).
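The selection and mapping described above can be sketched in plain Python as follows. This is only an illustration of the idea (keep the columns with fewer than 12 distinct values, map each value to an integer id, reserve 0 for unseen values); the function names and the toy feature rows are hypothetical and this is not the FeaturePreprocessor code.

```python
def select_feature_indices(rows, max_cardinality=12):
    """Return the column indices whose number of distinct values is below the limit."""
    n_columns = len(rows[0])
    return [
        index for index in range(n_columns)
        if len({row[index] for row in rows}) < max_cardinality
    ]


def build_value_maps(rows, indices):
    """Map each observed value of a selected column to an integer id starting at 1."""
    maps = {}
    for index in indices:
        values = sorted({row[index] for row in rows})
        maps[index] = {value: i + 1 for i, value in enumerate(values)}
    return maps


def encode(row, maps):
    """Encode one token's features; id 0 stands for values not seen when fitting."""
    return [maps[index].get(row[index], 0) for index in sorted(maps)]


# Toy rows, one per token: column 0 is the token itself (20 distinct values).
rows = [
    [f"tok{i}", "LINESTART" if i % 5 == 0 else "LINEIN", "ALLCAP" if i % 2 else "NOCAPS"]
    for i in range(20)
]
indices = select_feature_indices(rows)
maps = build_value_maps(rows, indices)
encoded = encode(["tokX", "LINEEND", "ALLCAP"], maps)
```

Here the token column is dropped because of its high cardinality, and the unseen value LINEEND falls back to the reserved id 0.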

    NOTE: There is a second implementation that uses the sklearn library (from @de-code) to select and vectorise the features, but I could not manage to make it work. We can think about it once the whole end-to-end process works.

    The architecture now has an additional input layer, feature_input. The features input channel is implemented only on the model BidLSTM_CRF_CASING; for this reason it needs to be run with --architecture BidLSTM_CRF_CASING. This input is concatenated in the input layer, and in x using TimeDistributed(Dense()). Not sure this is correct.

    Lastly, I also included some unit tests, again taken from @de-code. They were quite useful to cover different use cases without having to run the whole machinery.

    opened by lfoppiano 55
  • Add layout features to GROBID model

    Hi @kermitt2

    Something you are already well aware of but I thought it's good to have an issue to record the discussion around it. I am not sure whether you already experimented with adding layout features.

    I've started doing it and implemented something here: https://github.com/elifesciences/sciencebeam-trainer-delft/pull/16

    Maybe you'll find some of it useful. (I don't want to flood you with too many PRs)

    opened by de-code 20
  • fix wrong preprocessors serialisation

    This PR attempts to solve the problem caused by the serialisation of the FeaturePreprocessor object. The fix could and should be improved, but I could not quickly find out how.

    When running the tagger using a model that has been trained with features, it will fail because features=None in the DataGenerator creation.

    CC @de-code @kermitt2

    opened by lfoppiano 9
  • Do not truncate when tagging

    This PR avoids IndexOutOfBounds when running from Grobid.

    /Library/Java/JavaVirtualMachines/jdk-11.0.3.jdk/Contents/Home/bin/java "-javaagent:/Users/lfoppiano/Library/Application Support/JetBrains/Toolbox/apps/IDEA-U/ch-0/193.6494.35/IntelliJ IDEA.app/Contents/lib/idea_rt.jar=52842:/Users/lfoppiano/Library/Application Support/JetBrains/Toolbox/apps/IDEA-U/ch-0/193.6494.35/IntelliJ IDEA.app/Contents/bin" -Dfile.encoding=UTF-8 -classpath /Users/lfoppiano/development/projects/grobid/grobid-superconductors/out/production/classes:/Users/lfoppiano/development/projects/grobid/grobid-superconductors/out/production/resources:/Users/lfoppiano/.gradle/caches/modules-2/files-2.1/org.grobid/grobid-trainer/0.5.6/a8b97fbec6b1fd8f3666365bea0f3c07274794e6/grobid-trainer-0.5.6.jar:/Users/lfoppiano/.gradle/caches/modules-2/files-2.1/org.grobid/grobid-core/0.5.6/5df899777b169c8c19714b5b28367f6c2dbc58f0/grobid-core-0.5.6.jar:/Users/lfoppiano/.m2/repository/org/grobid/grobid-quantities/0.5.2-SNAPSHOT/grobid-quantities-0.5.2-SNAPSHOT.jar:/Users/lfoppiano/.gradle/caches/modules-2/files-2.1/systems.uom/systems-ucum-java8/0.9/27df6500aba81d86185c8543f46cb304a9501b57/systems-ucum-java8-0.9.jar:/Users/lfoppiano/.gradle/caches/modules-2/files-2.1/si.uom/si-units-java8/0.9/c2287c0b267ada40036f5592b3ad641b57dd3b45/si-units-java8-0.9.jar:/Users/lfoppiano/.gradle/caches/modules-2/files-2.1/systems.uom/systems-unicode-java8/0.9/f770a4456a75eab2b581981443f6d29edf2f7b95/systems-unicode-java8-0.9.jar:/Users/lfoppiano/.gradle/caches/modules-2/files-2.1/systems.uom/systems-common/0.9/1266d5faf6480e81fbc9f64eb35c668f41493d97/systems-common-0.9.jar:/Users/lfoppiano/.gradle/caches/modules-2/files-2.1/si.uom/si-units/0.9/19d1aeb893353ed16c7c099e526e044765cc2131/si-units-0.9.jar:/Users/lfoppiano/.gradle/caches/modules-2/files-2.1/si.uom/si-quantity/0.9/a72421162415059dd5804034d928e9112ea25034/si-quantity-0.9.jar:/Users/lfoppiano/.gradle/caches/modules-2/files-2.1/tec.uom/uom-se/1.0.9/6081f0c33677d9866f8587c9f6046f631aecba9/uom-se-1.0.9.jar:/Users/lfoppiano/.grad
le/caches/modules-2/files-2.1/systems.uom/systems-quantity/0.9/bd6c9e7ed57a41ad75f1849adeefe2a523707d37/systems-quantity-0.9.jar:/Users/lfoppiano/.gradle/caches/modules-2/files-2.1/com.googlecode.clearnlp/clearnlp/1.3.1/af3078412e740d6483d7b5e79768b290e0c8fe42/clearnlp-1.3.1.jar:/Users/lfoppiano/.gradle/caches/modules-2/files-2.1/org.apache.commons/commons-collections4/4.1/a4cf4688fe1c7e3a63aa636cc96d013af537768e/commons-collections4-4.1.jar:/Users/lfoppiano/.gradle/caches/modules-2/files-2.1/io.dropwizard/dropwizard-assets/1.3.16/6965027c5f6b076b5f0ce4012d838a20fe3eb487/dropwizard-assets-1.3.16.jar:/Users/lfoppiano/.gradle/caches/modules-2/files-2.1/com.hubspot.dropwizard/dropwizard-guicier/ org.grobid.service.GrobidSuperconductorsApplication trainingGeneration -dIn /Users/lfoppiano/development/projects/grobid/grobid-superconductors/resources/dataset/superconductors/corpus/pdf/batch-4 -dOut /Users/lfoppiano/development/projects/grobid/grobid-superconductors/resources/dataset/superconductors/corpus/staging -m superconductors -f xml -r resources/config/config.yml
    WARNING: An illegal reflective access operation has occurred
    WARNING: Illegal reflective access by com.fasterxml.jackson.module.afterburner.util.MyClassLoader (file:/Users/lfoppiano/.gradle/caches/modules-2/files-2.1/com.fasterxml.jackson.module/jackson-module-afterburner/2.9.10/6cca4a73cb54aa8631775023ca8cc37626373cc8/jackson-module-afterburner-2.9.10.jar) to method java.lang.ClassLoader.findLoadedClass(java.lang.String)
    WARNING: Please consider reporting this to the maintainers of com.fasterxml.jackson.module.afterburner.util.MyClassLoader
    WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
    WARNING: All illegal access operations will be denied in a future release
    WARN  [2020-03-12 01:16:05,775] org.grobid.core.main.GrobidHomeFinder: No Grobid property was provided. Attempting to find Grobid home in the current directory...
    WARN  [2020-03-12 01:16:05,777] org.grobid.core.main.GrobidHomeFinder: ***************************************************************
    WARN  [2020-03-12 01:16:05,777] org.grobid.core.main.GrobidHomeFinder: *** USING GROBID HOME: /Users/lfoppiano/development/projects/grobid/grobid-superconductors/../grobid-home
    WARN  [2020-03-12 01:16:05,777] org.grobid.core.main.GrobidHomeFinder: ***************************************************************
    INFO  [2020-03-12 01:16:05,788] org.grobid.core.main.LibraryLoader: Loading external native sequence labelling library
    INFO  [2020-03-12 01:16:05,793] org.grobid.core.main.LibraryLoader: Loading Wapiti native library...
    INFO  [2020-03-12 01:16:05,794] org.grobid.core.main.LibraryLoader: Loading JEP native library for DeLFT... /Users/lfoppiano/development/projects/grobid/grobid-home/lib/mac-64
    INFO  [2020-03-12 01:16:05,817] org.grobid.core.main.LibraryLoader: Configuring python environment: /Users/lfoppiano/opt/anaconda3/envs/delft
    INFO  [2020-03-12 01:16:05,817] org.grobid.core.main.LibraryLoader: Adding library paths [/Users/lfoppiano/opt/anaconda3/envs/delft/lib, /Users/lfoppiano/opt/anaconda3/envs/delft/lib/python3.7/site-packages/jep]
    INFO  [2020-03-12 01:16:05,826] org.grobid.core.main.LibraryLoader: Native library for sequence labelling loaded
    INFO  [2020-03-12 01:16:05,828] org.grobid.core.lexicon.Lexicon: Initiating dictionary
    INFO  [2020-03-12 01:16:05,828] org.grobid.core.lexicon.Lexicon: End of Initialization of dictionary
    INFO  [2020-03-12 01:16:05,828] org.grobid.core.lexicon.Lexicon: Initiating names
    INFO  [2020-03-12 01:16:05,828] org.grobid.core.lexicon.Lexicon: End of initialization of names
    INFO  [2020-03-12 01:16:06,228] org.grobid.core.lexicon.Lexicon: Initiating country codes
    INFO  [2020-03-12 01:16:06,228] org.grobid.core.lexicon.Lexicon: End of initialization of country codes
    INFO  [2020-03-12 01:16:06,328] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for affiliation-address...
    running thread: 1
    INFO  [2020-03-12 01:16:06,330] org.grobid.core.jni.JEPThreadPool: Creating JEP instance for thread 19
    Using TensorFlow backend.
    /Users/lfoppiano/opt/anaconda3/envs/delft/lib/python3.7/site-packages/sklearn/externals/joblib/__init__.py:15: DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
      warnings.warn(msg, category=DeprecationWarning)
    WARNING:tensorflow:From /Users/lfoppiano/development/projects/grobid/grobid-superconductors/../../delft/delft/sequenceLabelling/preprocess.py:15: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.
    WARNING:tensorflow:From /Users/lfoppiano/development/projects/grobid/grobid-superconductors/../../delft/delft/utilities/bert/optimization.py:87: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.
    WARNING:tensorflow:From /Users/lfoppiano/opt/anaconda3/envs/delft/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
    WARNING:tensorflow:From /Users/lfoppiano/opt/anaconda3/envs/delft/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:74: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
    WARNING:tensorflow:From /Users/lfoppiano/opt/anaconda3/envs/delft/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.
    WARNING:tensorflow:From /Users/lfoppiano/opt/anaconda3/envs/delft/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:133: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.
    WARNING:tensorflow:From /Users/lfoppiano/opt/anaconda3/envs/delft/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
    Instructions for updating:
    Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
    WARNING:tensorflow:From /Users/lfoppiano/opt/anaconda3/envs/delft/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.
    WARNING:tensorflow:From /Users/lfoppiano/opt/anaconda3/envs/delft/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:181: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.
    WARNING:tensorflow:From /Users/lfoppiano/opt/anaconda3/envs/delft/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:186: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
    WARNING:tensorflow:From /Users/lfoppiano/opt/anaconda3/envs/delft/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
    WARNING:tensorflow:From /Users/lfoppiano/opt/anaconda3/envs/delft/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.
    WARNING:tensorflow:From /Users/lfoppiano/opt/anaconda3/envs/delft/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.
    running thread: 1
    INFO  [2020-03-12 01:16:14,800] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for name-header...
    running thread: 1
    INFO  [2020-03-12 01:16:16,254] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for name-citation...
    running thread: 1
    INFO  [2020-03-12 01:16:17,956] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for header...
    running thread: 1
    INFO  [2020-03-12 01:16:19,877] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for date...
    running thread: 1
    INFO  [2020-03-12 01:16:22,067] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for citation...
    INFO  [2020-03-12 01:16:24,403] org.grobid.core.jni.WapitiModel: Loading model: /Users/lfoppiano/development/projects/grobid/grobid-superconductors/../grobid-home/models/fulltext/model.wapiti (size: 22415891)
    [Wapiti] Loading model: "/Users/lfoppiano/development/projects/grobid/grobid-superconductors/../grobid-home/models/fulltext/model.wapiti"
    Model path: /Users/lfoppiano/development/projects/grobid/grobid-superconductors/../grobid-home/models/fulltext/model.wapiti
    [Wapiti] Loading model: "/Users/lfoppiano/development/projects/grobid/grobid-superconductors/../grobid-home/models/segmentation/model.wapiti"
    INFO  [2020-03-12 01:16:26,730] org.grobid.core.jni.WapitiModel: Loading model: /Users/lfoppiano/development/projects/grobid/grobid-superconductors/../grobid-home/models/segmentation/model.wapiti (size: 17244068)
    Model path: /Users/lfoppiano/development/projects/grobid/grobid-superconductors/../grobid-home/models/segmentation/model.wapiti
    running thread: 1
    INFO  [2020-03-12 01:16:28,686] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for reference-segmenter...
    running thread: 1
    INFO  [2020-03-12 01:16:31,208] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for figure...
    running thread: 1
    INFO  [2020-03-12 01:16:33,897] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for table...
    INFO  [2020-03-12 01:16:36,899] org.grobid.core.factory.GrobidPoolingFactory: Number of Engines in pool active/max: 1/10
    running thread: 1
    INFO  [2020-03-12 01:16:37,077] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for quantities...
    running thread: 1
    INFO  [2020-03-12 01:16:40,552] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for units...
    running thread: 1
    INFO  [2020-03-12 01:16:43,959] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for values...
    running thread: 1
    INFO  [2020-03-12 01:16:47,472] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for superconductors...
    INFO  [2020-03-12 01:16:51,295] org.grobid.core.engines.training.SuperconductorsParserTrainingData: 39 files to be processed.
    ERROR [2020-03-12 01:17:08,340] org.grobid.core.jni.DeLFTModel: DeLFT model reference_segmenter labelling failed
    ! java.lang.IndexOutOfBoundsException: Index 300 out of bounds for length 300
    ! at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
    ! at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
    ! at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
    ! at java.base/java.util.Objects.checkIndex(Objects.java:372)
    ! at java.base/java.util.ArrayList.get(ArrayList.java:458)
    ! at org.grobid.core.jni.DeLFTModel$LabelTask.call(DeLFTModel.java:128)
    ! at org.grobid.core.jni.DeLFTModel$LabelTask.call(DeLFTModel.java:64)
    ! at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    ! at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    ! at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    ! at java.base/java.lang.Thread.run(Thread.java:834)
    ! Causing: java.util.concurrent.ExecutionException: java.lang.IndexOutOfBoundsException: Index 300 out of bounds for length 300
    ! at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
    ! at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
    ! at org.grobid.core.jni.JEPThreadPool.call(JEPThreadPool.java:162)
    ! at org.grobid.core.jni.DeLFTModel.label(DeLFTModel.java:155)
    ! at org.grobid.core.engines.tagging.DeLFTTagger.label(DeLFTTagger.java:29)
    ! at org.grobid.core.engines.AbstractParser.label(AbstractParser.java:42)
    ! at org.grobid.core.engines.ReferenceSegmenterParser.extract(ReferenceSegmenterParser.java:90)
    ! at org.grobid.core.engines.ReferenceSegmenterParser.extract(ReferenceSegmenterParser.java:74)
    ! at org.grobid.core.engines.ReferenceSegmenterParser.extract(ReferenceSegmenterParser.java:69)
    ! at org.grobid.core.engines.CitationParser.processingReferenceSection(CitationParser.java:185)
    ! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:222)
    ! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:113)
    ! at org.grobid.core.engines.Engine.fullTextToTEIDoc(Engine.java:477)
    ! at org.grobid.core.engines.training.SuperconductorsParserTrainingData.createTrainingPDF(SuperconductorsParserTrainingData.java:95)
    ! at org.grobid.core.engines.training.SuperconductorsParserTrainingData.createTrainingBatch(SuperconductorsParserTrainingData.java:234)
    ! at org.grobid.service.command.TrainingGenerationCommand.run(TrainingGenerationCommand.java:104)
    ! at org.grobid.service.command.TrainingGenerationCommand.run(TrainingGenerationCommand.java:24)
    ! at io.dropwizard.cli.ConfiguredCommand.run(ConfiguredCommand.java:87)
    ! at io.dropwizard.cli.Cli.run(Cli.java:78)
    ! at io.dropwizard.Application.run(Application.java:93)
    ! at org.grobid.service.GrobidSuperconductorsApplication.main(GrobidSuperconductorsApplication.java:27)

    UPDATE: For BERT, the input is truncated. See def convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer): at https://github.com/kermitt2/delft/blob/6c4a71817e8d8af7ea0291044cf76eb9429f8c35/delft/sequenceLabelling/models.py#L574

    and https://github.com/kermitt2/delft/blob/6c4a71817e8d8af7ea0291044cf76eb9429f8c35/delft/sequenceLabelling/models.py#L576

    I'm not sure what the best way to fix it is. We could pass a very high value for max_sequence_length, or the length of the longest input, but that means we would need to pre-process all the inputs, and I'm not sure that is efficient enough.

    My solution here would be to use sys.maxsize:

    input_features, input_tokens = convert_examples_to_features(input_examples, self.labels, sys.maxsize, self.tokenizer)
    results = self.loaded_estimator.predict(input_features, sys.maxsize, self.predict_batch_size)
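
The "Index 300 out of bounds for length 300" error above is consistent with the caller receiving labels only for the first 300 tokens. As an alternative to passing sys.maxsize, here is a minimal sketch (not DeLFT's code) that predicts in fixed-size chunks so that the number of returned labels always matches the number of input tokens. The names predict_in_chunks, predict_fn and truncating_model are hypothetical and only serve the illustration.

```python
from typing import Callable, List


def predict_in_chunks(tokens: List[str],
                      predict_fn: Callable[[List[str]], List[str]],
                      max_len: int = 300) -> List[str]:
    """Label a long sequence by calling predict_fn on chunks of at most max_len tokens."""
    labels: List[str] = []
    for start in range(0, len(tokens), max_len):
        chunk = tokens[start:start + max_len]
        chunk_labels = predict_fn(chunk)
        # Fail early instead of letting the caller index past the end of the labels.
        if len(chunk_labels) != len(chunk):
            raise ValueError("model returned %d labels for %d tokens"
                             % (len(chunk_labels), len(chunk)))
        labels.extend(chunk_labels)
    return labels


def truncating_model(chunk: List[str], limit: int = 300) -> List[str]:
    """Stand-in for a model that silently truncates its input to `limit` tokens."""
    return ["O"] * min(len(chunk), limit)


tokens = ["tok%d" % i for i in range(750)]
truncated = truncating_model(tokens)                                # labels for the first 300 tokens only
chunked = predict_in_chunks(tokens, truncating_model, max_len=300)  # one label per token
```

Chunking at arbitrary positions loses context at the chunk boundaries, so this trades some accuracy near the cuts for bounded memory use.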
    opened by lfoppiano 8
  • Missing wikipedia-and-pmc embeddings location in starting guide

    Hey, I tried to run DeLFT with the embeddings configured (this worked previously), but the latest changes introduced new embeddings whose source I could not find:

    Compiling embeddings... (this is done only one time per embeddings at first launch)
    FileNotFoundError: [Errno 2] No such file or directory: '/media/lopez/T5/embeddings/wikipedia-pubmed-and-PMC-w2v.vec'

    I guess they are from here? http://evexdb.org/pmresources/vec-space-models/

    opened by lfoppiano 7
  • Suggestion: save preprocessor in a more transferable format

    The pickle format is meant mainly for short term storage.

    It comes with a few drawbacks:

    • "The pickle module is not secure."
    • It ties the model to the current implementation: refactoring the code or upgrading a library might break unpickling
    • It's Python only and opaque: it would be difficult to use a trained model and serve it separately
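
To make the suggestion concrete, here is a minimal sketch of a preprocessor whose state is plain vocabularies stored as JSON. SimplePreprocessor is a hypothetical stand-in, not DeLFT's actual preprocessor class, and it assumes the whole state can be expressed as string-to-integer mappings.

```python
import json


class SimplePreprocessor:
    """Hypothetical preprocessor holding only JSON-friendly state."""

    PAD, UNK = "<PAD>", "<UNK>"

    def __init__(self, char_vocab=None, label_vocab=None):
        self.char_vocab = char_vocab or {self.PAD: 0, self.UNK: 1}
        self.label_vocab = label_vocab or {self.PAD: 0}

    def fit(self, sentences, labels):
        # Assign the next free index to every unseen character and label.
        for sentence in sentences:
            for token in sentence:
                for ch in token:
                    self.char_vocab.setdefault(ch, len(self.char_vocab))
        for sequence in labels:
            for label in sequence:
                self.label_vocab.setdefault(label, len(self.label_vocab))
        return self

    def to_json(self):
        state = {"char_vocab": self.char_vocab, "label_vocab": self.label_vocab}
        return json.dumps(state, ensure_ascii=False, sort_keys=True)

    @classmethod
    def from_json(cls, payload):
        state = json.loads(payload)
        return cls(state["char_vocab"], state["label_vocab"])


original = SimplePreprocessor().fit([["Hello", "world"]], [["O", "O"]])
restored = SimplePreprocessor.from_json(original.to_json())
```

Because the file contains data only, loading it cannot execute code, and it can be read from other languages or after a refactoring of the Python classes.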

    /cc @kermitt2 @lfoppiano

    opened by de-code 7
  • Training grobidTagger without embeddings

    Hi @kermitt2, should training without embeddings work?

    From the readme:

    Reduce model size, in particular by removing word embeddings from them. For instance, the model for the toxic comment classifier went down from a size of 230 MB with embeddings to 1.8 MB. In practice the size of all the models of DeLFT is less than 2 MB, except for Ontonotes 5.0 NER model which is 4.7 MB.

    When I try to set embeddings_name to None, it falls over soon after. I tried to fix the next two issues but there are more. Which makes me think maybe it's not meant to work?

    opened by de-code 7
  • LMDB Embeddings

    Hi, thanks for the good work!

    Loving the idea of using LMDB to store and query the embeddings! I wrote a standalone package for this here: https://github.com/ThoughtRiver/lmdb-embeddings/tree/master. I thought it could be of use to embed it here thereby separating the logic out, but also to use in other settings. Is it worth me doing some work on this?

    Let me know what you think! Thanks

    opened by DomHudson 7
  • Bert sequence labelling

    Add BERT architecture for sequence labelling.

    As noted here, the original CoNLL-2003 NER results reported in the Google Research paper are far from reproducible; they probably reported token-level metrics instead of entity-level metrics (as done by conlleval and previous works). In general, generic pre-trained transformer models appear to perform poorly on information extraction and NER tasks (whether fine-tuned or used as contextual embedding features) compared to ELMo.

    Still, it's a good exercise, and using SciBERT/BioBERT for scientific text gives very good results, faster, even compared to ELMo+BidLSTM-CRF.

    As with BERT for text classification in DeLFT, we use a data generator to feed BERT when predicting (instead of the file-based input function of the original BERT implementation), which avoids reloading the whole TF graph for each batch. This is made possible by the FastPredict class in model.py, adapted from https://github.com/marcsto/rl/blob/master/src/fast_predict2.py by Marc Stogaitis.

    Using an NVIDIA GeForce 1080 GPU, we can process around 1000 tokens per second with this approach, which is 3 times faster than BiLSTM-CRF+ELMo, but 30 times slower than a BiLSTM-CRF (and 100 times slower than what we get with a Wapiti CRF model on a modern workstation ;).
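
The gap between token-level and entity-level metrics mentioned above can be illustrated with a small sketch. It assumes BIO tags and exact-match entity scoring in the spirit of conlleval, with one simplification: a stray I- tag that does not continue an entity is ignored. The function names are illustrative and are not DeLFT's evaluation code.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans encoded by a BIO tag sequence."""
    entities = set()
    start, etype = None, None
    # A trailing "O" sentinel closes an entity that runs to the end of the sequence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            entities.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def entity_f1(gold, pred):
    """F1 over entities, counting an entity as correct only if span and type match exactly."""
    gold_entities, pred_entities = extract_entities(gold), extract_entities(pred)
    correct = len(gold_entities & pred_entities)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_entities)
    recall = correct / len(gold_entities)
    return 2 * precision * recall / (precision + recall)


gold = ["B-PER", "I-PER", "I-PER", "O", "B-LOC", "O"]
pred = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]

accuracy = token_accuracy(gold, pred)  # 5 of 6 tags are correct
f1 = entity_f1(gold, pred)             # only the LOC entity matches exactly
```

A single wrong tag costs one token out of six at token level, but it invalidates a whole entity at entity level, which is why the two kinds of scores are not comparable.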

    opened by kermitt2 7
  • Add evaluation only command

    This PR adds the eval command, which performs evaluation on the file provided via --input.

    With it, I also tried to make the command line more parameterizable, inspired by @de-code's Science beam delft trainer.

    opened by lfoppiano 6
  • add parameter to optionally output raw results in the evaluation

    When we perform n-fold cross-validation or holdout evaluation, we would like to be able to output the raw results (in a separate file), as we do in Grobid. That way we can compare what is expected and what is predicted for each evaluation task.

    Components to be updated:

    • [ ] sequence Labelling
      • [ ] Grobid
      • [ ] BERT
      • [ ] NER
    • [ ] classification
    opened by lfoppiano 8
  • update imports for fastext - needed when using binary embeddings

    opened by lfoppiano 3
  • sequence labelling, n-fold training should use separate preprocessors

    (Not sure if this was discussed before.) It seems that the train_nfold method allows the preprocessor to see the whole dataset. It may be more correct to use a separate preprocessor for each split. The difference would be in how the model handles unseen characters, feature tokens, etc. (How likely the validation split is to contain characters absent from the training split may depend on the dataset.)
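
A minimal sketch of the difference, using a character vocabulary as a stand-in for the preprocessor. The helpers build_char_vocab, nfold_splits and unseen_chars are hypothetical and do not reflect how train_nfold is implemented.

```python
def build_char_vocab(sentences):
    """Map each character seen in the sentences to an index (0 and 1 are reserved)."""
    vocab = {"<PAD>": 0, "<UNK>": 1}
    for sentence in sentences:
        for token in sentence:
            for ch in token:
                vocab.setdefault(ch, len(vocab))
    return vocab


def nfold_splits(data, n_folds):
    """Yield (train, valid) pairs; fold k validates on every n_folds-th item starting at k."""
    for k in range(n_folds):
        valid = data[k::n_folds]
        train = [item for i, item in enumerate(data) if i % n_folds != k]
        yield train, valid


def unseen_chars(vocab, sentences):
    """Characters in the sentences that the vocabulary would map to <UNK>."""
    return {ch for sentence in sentences for token in sentence for ch in token
            if ch not in vocab}


data = [["alpha"], ["beta"], ["gamma"], ["délta"]]

# One preprocessor fitted on everything: validation characters are never unknown.
global_vocab = build_char_vocab(data)
global_unseen = unseen_chars(global_vocab, data)

# One preprocessor per fold, fitted on the training part only.
per_fold_unseen = []
for train, valid in nfold_splits(data, n_folds=4):
    fold_vocab = build_char_vocab(train)
    per_fold_unseen.append(unseen_chars(fold_vocab, valid))
```

With per-fold fitting, each validation split exercises the unknown-character path, as it would on genuinely new data; with a global fit it never does.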

    /cc @kermitt2 @lfoppiano

    opened by de-code 1
  • Allow multiple tokens per feature data row

    This is carried over from https://github.com/kermitt2/delft/issues/90#issuecomment-606691994

    Since the segmentation data is using the first two tokens of a line, it would make sense to have an option to be able to use that in DeLFT. Currently it would only use the first one.

    Potential solution:

    • an option to specify the columns with the tokens (similar to the features)
    • concatenate the word embeddings and other token related vectors

    Probably need to change a few places that expect a single token as an input.

    /cc @kermitt2 @lfoppiano

    opened by de-code 3
  • Training callbacks are ignored when using a BERT architecture for sequence labelling

    In Trainer.train

            if 'bert' not in self.model_config.model_type.lower():
                self.model = self.train_model(self.model, x_train, y_train, x_valid, y_valid,
                                              self.training_config.max_epoch, callbacks=callbacks)
            else:
                # for BERT architectures, directly call the model trainer
                if self.training_config.early_stop:
                    self.model.train(np.concatenate([x_train,x_valid]), np.concatenate([y_train,y_valid]))

    Would it be possible to support the callbacks also with BERT?

    opened by oterrier 0
  • Implement sliding window

    I thought it might be better to discuss the sliding window in a separate issue.


    I was just considering whether we need sliding windows to not have to use a really large max_sequence_length.


    As you can see, it's not related to a sliding window as we have in CRF. With CRF, we always have a contextual window centered on the current "target" token which will be labelled, and the CRF template is used to determine the size of the window.

    With the DL approach, we have a prediction on a complete sequence, without a sliding window, and the whole sequence is involved when training the weights or outputting something. For very large input sequences, as with the header model, this is of course an issue (the input can exceed 1000 tokens in the worst cases), but it is potentially also where it is interesting, because the "recursive" aspect of the RNN lets backpropagation potentially affect the complete sequence.

    It would be indeed interesting to compare the "traditional" global full sequence network approach and a local sliding-window network, though I am not sure how to do it. It would require some review to see how it was approached by other people.
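
One possible shape for a local sliding-window network at prediction time is sketched below. It is only an assumption about how this could be approached, not an existing DeLFT feature: the sequence is cut into overlapping windows, each window is labelled independently, and every token keeps the label from the window in which it is farthest from an edge. The names sliding_windows, label_with_windows and predict_fn are hypothetical.

```python
def sliding_windows(n_tokens, window, stride):
    """Yield (start, end) spans of at most `window` tokens that cover range(n_tokens)."""
    if not 0 < stride <= window:
        raise ValueError("stride must be in 1..window so that no token is skipped")
    if n_tokens <= window:
        yield 0, n_tokens
        return
    start = 0
    while True:
        end = min(start + window, n_tokens)
        yield start, end
        if end == n_tokens:
            break
        start += stride


def label_with_windows(tokens, predict_fn, window=100, stride=50):
    """Label tokens window by window, preferring predictions made far from a window edge."""
    best = [None] * len(tokens)  # per token: (distance to nearest window edge, label)
    for start, end in sliding_windows(len(tokens), window, stride):
        labels = predict_fn(tokens[start:end])
        last = end - start - 1
        for offset, label in enumerate(labels):
            margin = min(offset, last - offset)
            position = start + offset
            if best[position] is None or margin > best[position][0]:
                best[position] = (margin, label)
    return [label for _, label in best]


def toy_predict(chunk):
    """Stand-in model: the 'label' of a token is its upper-cased form."""
    return [token.upper() for token in chunk]


tokens = ["t%d" % i for i in range(230)]
spans = list(sliding_windows(len(tokens), window=100, stride=50))
labels = label_with_windows(tokens, toy_predict, window=100, stride=50)
```

Comparing this against full-sequence prediction on the header model would show how much the long-range context actually contributes.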

    /cc @kermitt2 @lfoppiano

    opened by de-code 6
  • grobidTagger: make model an optional argument

    It might be good if the first argument was the "action", which is effectively a sub command.

    The model could be made optional, allowing a path to a model to be passed in. Then you could for example just evaluate a model or use the tagger by pointing to the model path. e.g. compare different versions of a header model.

    opened by de-code 0
  • Switch to TensorFlow 2.0 and tf.keras

    From the Keras site:

    The current release is Keras 2.3.0, which makes significant API changes and add support for TensorFlow 2.0. The 2.3.0 release will be the last major release of multi-backend Keras. Multi-backend Keras is superseded by tf.keras.

    Maybe it's time to upgrade?

    opened by de-code 10
  • Feature-based approach with BERT for seq. labelling is super slow

    We are currently using keras-bert for the feature-based approach with BERT for seq. labelling, and this is super slow: 56 tokens per second (using the concatenation of the top four hidden layers of the pre-trained transformer, as in the original paper), compared to ~300 tokens/s with ELMo and, more relevant, around 1000 tokens per second when using a fine-tuned BERT model.

    I think there is no reason for the pre-trained transformer to be so much slower than the fine-tuned model, so we should use our own BERT integration rather than keras-bert for the feature-based approach too (as a bonus, this removes that dependency).

    opened by kermitt2 0
  • Automatically download embeddings

    opened by kermitt2 1
  • v0.2.6(Dec 26, 2020)

  • v0.2.5(Dec 21, 2020)

    • fix serialization of models with feature preprocessor (PR #110)
    • update grobid models with features
    • some other models and score updates
    • add "software was used" classification model for software citations
    • update tensorflow dependency
  • v0.2.4(Sep 12, 2020)

    • generic support for feature channel in sequence labeling, test with Grobid training data
    • fix issues #40 #44 #48 #50 #52 #54 #56 #66 #69 #71 #94 #100 #103
    • update eval (average field level n-fold cross-validation)
    • dataseer and software use classification models
    • review and improvement for BERT sequence labeling and classification (unicode, binary/multi-label, test SciBERT, bioBERT, ...)
    • force split lemonde corpus evaluation (to be compared with some publication results using this)
    • fixing truncation in sequence labeling
    • more documentation
    • various bug fixing
Patrice Lopez
TextField: Learning A Deep Direction Field for Irregular Scene Text Detection (TIP 2019)

TextField: Learning A Deep Direction Field for Irregular Scene Text Detection Introduction The code and trained models of: TextField: Learning A Deep

Yukang Wang 99 Nov 10, 2021
A curated list of resources for text detection/recognition (optical character recognition ) with deep learning methods.

awesome-deep-text-detection-recognition A curated list of awesome deep learning based papers on text detection and recognition. Text Detection Papers

null 2.3k Nov 26, 2021
Generate text images for training deep learning ocr model

New version release:https://github.com/oh-my-ocr/text_renderer Text Renderer Generate text images for training deep learning OCR model (e.g. CRNN). Su

Qing 1k Nov 21, 2021
Text recognition (optical character recognition) with deep learning methods.

What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis | paper | training and evaluation data | failure cases and cle

Clova AI Research 2.6k Nov 23, 2021
Deskew is a command line tool for deskewing scanned text documents. It uses Hough transform to detect "text lines" in the image. As an output, you get an image rotated so that the lines are horizontal.

Deskew by Marek Mauder https://galfar.vevb.net/deskew https://github.com/galfar/deskew v1.30 2019-06-07 Overview Deskew is a command line tool for des

Marek Mauder 95 Nov 16, 2021
An Implementation of the algorithm in paper IncepText: A New Inception-Text Module with Deformable PSROI Pooling for Multi-Oriented Scene Text Detection

InceptText-Tensorflow An Implementation of the algorithm in paper IncepText: A New Inception-Text Module with Deformable PSROI Pooling for Multi-Orien

GeorgeJoe 118 Nov 8, 2021
Code for the paper STN-OCR: A single Neural Network for Text Detection and Text Recognition

STN-OCR: A single Neural Network for Text Detection and Text Recognition This repository contains the code for the paper: STN-OCR: A single Neural Net

Christian Bartz 487 Nov 9, 2021
text detection mainly based on ctpn model in tensorflow, id card detect, connectionist text proposal network

text-detection-ctpn Scene text detection based on ctpn (connectionist text proposal network). It is implemented in tensorflow. The origin paper can be

Shaohui Ruan 3.2k Nov 23, 2021
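Keras reimplementation of the scene text detection network CPTN: "Detecting Text in Natural Image with Connectionist Text Proposal Network"; you are welcome to try it, follow it, and report issues...

keras-ctpn [TOC] Description Prediction Training Examples 4.1 ICDAR2015 4.1.1 With side refinement 4.1.2 Without side refinement 4.1.3 Data augmentation: horizontal flip 4.2 ICDAR2017 4.3 Other datasets toDoList Summary Description This project is a Keras implementation of CPTN: Detecti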

mick.yi 99 Nov 11, 2021
Detecting Text in Natural Image with Connectionist Text Proposal Network (ECCV'16)

Detecting Text in Natural Image with Connectionist Text Proposal Network The codes are used for implementing CTPN for scene text detection, described

Tian Zhi 1.3k Nov 21, 2021
huoyijie 1.2k Nov 25, 2021
OCR system for Arabic language that converts images of typed text to machine-encoded text.

Arabic OCR OCR system for Arabic language that converts images of typed text to machine-encoded text. The system currently supports only letters (29 l

Hussein Youssef 84 Nov 22, 2021
OCR, Scene-Text-Understanding, Text Recognition

Scene-Text-Understanding Survey [2015-PAMI] Text Detection and Recognition in Imagery: A Survey paper [2014-Front.Comput.Sci] Scene Text Detection and

Alan Tang 349 Nov 17, 2021
Total Text Dataset. It consists of 1555 images with more than 3 different text orientations: Horizontal, Multi-Oriented, and Curved, one of a kind.

Total-Text-Dataset (Official site) Updated on April 29, 2020 (Detection leaderboard is updated - highlighted E2E methods. Thank you shine-lcy.) Update

Chee Seng Chan 624 Nov 29, 2021
Code related to "Have Your Text and Use It Too! End-to-End Neural Data-to-Text Generation with Semantic Fidelity" paper

DataTuner You have just found the DataTuner. This repository provides tools for fine-tuning language models for a task. See LICENSE.txt for license de

null 56 Nov 21, 2021
Handwritten Text Recognition (HTR) system implemented with TensorFlow (TF) and trained on the IAM off-line HTR dataset. This Neural Network (NN) model recognizes the text contained in the images of segmented words.

Handwritten-Text-Recognition Handwritten Text Recognition (HTR) system implemented with TensorFlow (TF) and trained on the IAM off-line HTR dataset. T

null 14 Nov 22, 2021
Code for generating synthetic text images as described in "Synthetic Data for Text Localisation in Natural Images", Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, CVPR 2016.

SynthText Code for generating synthetic text images as described in "Synthetic Data for Text Localisation in Natural Images", Ankush Gupta, Andrea Ved

Ankush Gupta 1.7k Nov 29, 2021
TextBoxes: A Fast Text Detector with a Single Deep Neural Network https://github.com/MhLiao/TextBoxes A text detection algorithm improved from SSD; textBoxes_note contains previously compiled notes.

TextBoxes: A Fast Text Detector with a Single Deep Neural Network Introduction This paper presents an end-to-end trainable fast scene text detector, n

zhangjing1 25 Jan 9, 2020