A set of workflows for corpus building through OCR, post-correction and normalisation

Overview

Project Status: Inactive - The project has reached a stable, usable state but is no longer being actively developed; support/maintenance will be provided as time allows

PICCL: Philosophical Integrator of Computational and Corpus Libraries

PICCL offers a workflow for corpus building and builds on a variety of tools. The primary component of PICCL is TICCL, a Text-induced Corpus Clean-up system, which performs spelling correction and OCR post-correction (normalisation of spelling variants etc.).

PICCL and TICCL constitute original research by Martin Reynaert (Tilburg University & Radboud University Nijmegen), and are currently developed in the scope of the CLARIAH project.

This repository hosts the relevant workflows that constitute PICCL, powered by Nextflow. These will be shipped as part of our LaMachine software distribution. The combination of these enables the PICCL workflow to be portable and scalable; it can be executed across multiple computing nodes on a high performance cluster or cloud platform, such as SGE, LSF, SLURM, PBS, HTCondor, Kubernetes and Amazon AWS. Parallelisation is handled automatically. Consult the Nextflow documentation for details.

All the modules that make up TICCL are part of the TicclTools collection, and are not part of the current repository. Certain other required components are in the FoLiA-Utils collection. There is no need to install either of these or other dependencies manually.

PICCL makes extensive use of the FoLiA format, a rich XML-based format for linguistic annotation.

Important Note: This is beta software still in development; for the old and deprecated version consult this repository.

Installation

PICCL is shipped as a part of LaMachine, although you need to explicitly select it for installation using lamachine-add piccl && lamachine-update (from inside a LaMachine installation). Once inside LaMachine, the command line interface can be invoked by directly specifying one of the workflows:

$ ocr.nf

Or

$ ticcl.nf

If you are using a LaMachine installation, you can skip the rest of this section. If not, you can install Nextflow and Docker manually and then run the following to obtain the latest development release of PICCL:

$ nextflow pull LanguageMachines/PICCL

In this case, make sure to always run it with the -with-docker proycon/lamachine:piccl parameter, which lets Nextflow manage your LaMachine Docker container (this is not tested as much as running from inside the container directly):

$ nextflow run LanguageMachines/PICCL -with-docker proycon/lamachine:piccl

We have prepared PICCL for work in many languages, mainly on the basis of open source lexicons available from Aspell. These data files serve as the input for TICCL and have to be downloaded once, as follows:

$ nextflow run LanguageMachines/PICCL/download-data.nf -with-docker proycon/lamachine:piccl

This will generate a data/ directory in your current directory, which is referenced in the usage examples in the next section. In a LaMachine environment, this directory is already available in $LM_PREFIX/opt/PICCL/data.
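
Since the data directory lives in a different place depending on the setup, a small helper can pick the right one. The following is a minimal sketch, not part of PICCL: the function name piccl_data_dir is hypothetical, and it assumes that a set $LM_PREFIX means you are inside a LaMachine environment.

```shell
# Hypothetical helper (not shipped with PICCL): print the directory that
# holds the PICCL data files.
# Assumption: $LM_PREFIX is only set inside a LaMachine environment.
piccl_data_dir() {
    if [ -n "${LM_PREFIX:-}" ]; then
        # Inside LaMachine the data ships with the installation.
        printf '%s\n' "$LM_PREFIX/opt/PICCL/data"
    else
        # Otherwise use the data/ directory created by download-data.nf
        # in the current directory.
        printf '%s\n' "data"
    fi
}
```

For example, ls "$(piccl_data_dir)/int" would list the available language directories once the data has been downloaded.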

In addition, you can also download example corpora (>300MB), which will be placed in a corpora/ directory:

$ nextflow run LanguageMachines/PICCL/download-examples.nf -with-docker proycon/lamachine:piccl

Architecture

PICCL consists of two workflows: one for optical character recognition using Tesseract, and a TICCL workflow for OCR post-correction and normalisation. In addition, PICCL provides a webservice that ties both workflows together and integrates two further workflows from aNtiLoPe: one for tokenisation (using ucto) and one for Dutch linguistic enrichment (using frog).

The architecture of the PICCL webservice, and its two integral workflows, is visualised schematically as follows:

PICCL Architecture

Usage

Command line interface

PICCL encompasses two workflows (in webservice form it also integrates two more from aNtiLoPe):

  • ocr.nf - A pipeline for Optical Character Recognition using Tesseract; takes PDF documents or images of scanned pages and produces FoLiA documents.
  • ticcl.nf - The Text-induced Corpus Clean-up system: performs OCR-postcorrection, takes as input the result from ocr.nf, or standalone text or PDF (text; no OCR), and produces further enriched FoLiA documents.

If you are inside LaMachine, you can invoke these directly. If you let Nextflow manage LaMachine through docker, then you have to invoke them like nextflow run LanguageMachines/PICCL/ocr.nf -with-docker proycon/lamachine:piccl. This applies to all examples in this section.
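
The two invocation styles can be wrapped in one helper so that the same examples work in both setups. This is a sketch under assumptions: the function name piccl_cmd is hypothetical, and a set $LM_PREFIX is taken as the sign that you are inside LaMachine. It only prints the command line, it does not run anything.

```shell
# Hypothetical helper (not shipped with PICCL): print the command line for a
# PICCL workflow such as ocr.nf or ticcl.nf, followed by its arguments.
# Assumption: $LM_PREFIX is only set inside a LaMachine environment.
piccl_cmd() {
    piccl_workflow=$1
    shift
    if [ -n "${LM_PREFIX:-}" ]; then
        # Inside LaMachine the workflows can be invoked directly.
        set -- "$piccl_workflow" "$@"
    else
        # Otherwise let Nextflow manage the LaMachine Docker container.
        set -- nextflow run "LanguageMachines/PICCL/$piccl_workflow" \
            -with-docker proycon/lamachine:piccl "$@"
    fi
    # Joined with spaces for display; arguments containing spaces lose
    # their quoting here.
    printf '%s\n' "$*"
}
```

Outside LaMachine, piccl_cmd ocr.nf --help prints the nextflow run form shown above; inside LaMachine it prints the plain ocr.nf --help.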

Running a workflow with the --help parameter, or without any parameters, will output usage information.

$ ocr.nf --help
--------------------------
OCR Pipeline
--------------------------
Usage:
  ocr.nf [PARAMETERS]

Mandatory parameters:
  --inputdir DIRECTORY     Input directory
  --language LANGUAGE      Language (iso-639-3)

Optional parameters:
  --inputtype STR          Specify input type, the following are supported:
          pdf (extension *.pdf)  - Scanned PDF documents (image content) [default]
          tif ($document-$sequencenumber.tif)  - Images per page (adhere to the naming convention!)
          jpg ($document-$sequencenumber.jpg)  - Images per page
          png ($document-$sequencenumber.png)  - Images per page
          gif ($document-$sequencenumber.gif)  - Images per page
          djvu (extension *.djvu)
          (The hyphen delimiter may optionally be changed using --seqdelimiter)
  --outputdir DIRECTORY    Output directory (FoLiA documents)
  --virtualenv PATH        Path to Python Virtual Environment to load (usually path to LaMachine)
  --pdfhandling reassemble Reassemble/merge all PDFs with the same base name and a number suffix; this can
                           for instance reassemble a book that has its chapters in different PDFs.
                           Input PDFs must adhere to a $document-$sequencenumber.pdf convention.
                           (The hyphen delimiter may optionally be changed using --seqdelimiter)
  --seqdelimiter           Sequence delimiter in input files (defaults to: _)
  --seqstart               What input field is the sequence number (may be a negative number to count from the end), default: -2


$ ticcl.nf --help
--------------------------
TICCL Pipeline
--------------------------
Usage:
  ticcl.nf [OPTIONS]

Mandatory parameters:
  --inputdir DIRECTORY     Input directory (FoLiA documents with an OCR text layer)
  --lexicon FILE           Path to lexicon file (*.dict)
  --alphabet FILE          Path to alphabet file (*.chars)
  --charconfus FILE        Path to character confusion list (*.confusion)

Optional parameters:
  --outputdir DIRECTORY    Output directory (FoLiA documents)
  --language LANGUAGE      Language
  --extension STR          Extension of FoLiA documents in input directory (default: folia.xml)
  --inputclass CLASS       FoLiA text class to use for input, defaults to 'current' for FoLiA input; must be set to 'OCR' for FoLiA documents produced by ocr.nf
  --inputtype STR          Input type can be either 'folia' (default), 'text', or 'pdf' (i.e. pdf with text; no OCR)
  --virtualenv PATH        Path to Virtual Environment to load (usually path to LaMachine)
  --artifrq INT            Default value for missing frequencies in the validated lexicon (default: 10000000)
  --distance INT           Levenshtein/edit distance (default: 2)
  --clip INT               Limit the number of variants per word (default: 10)
  --corpusfreqlist FILE    Corpus frequency list (skips the first step that would compute one for you)
  --low INT                skip entries from the anagram file shorter than 'low' characters. (default=5)
  --high INT               skip entries from the anagram file longer than 'high' characters. (default=35)
  --chainclean BOOLINT     enable chain clean or not (1 = on, 0 = off, default)

An example of invoking the OCR workflow for English is provided below; it assumes the sample data are installed in the corpora/ directory. It OCRs the OllevierGeets.pdf file, which contains scanned image data, so we choose the pdfimages input type.

$ ocr.nf --inputdir corpora/PDF/ENG/ --inputtype pdfimages --language eng

Alternative input types are images per page, in which case inputtype is set to either tif, jpg, gif or png. These input files should be placed in the designated input directory and follow the naming convention $documentname-$sequencenumber.$extension, for example harrypotter-032.png. An example invocation on Dutch scanned pages in the example collection would be:

$ ocr.nf --inputdir corpora/TIFF/NLD/ --inputtype tif --language nld
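
Because page images that do not follow the naming convention are easy to overlook, it can help to check file names before starting the workflow. The sketch below is not part of PICCL; the function name split_page_name is hypothetical, and it assumes the hyphen delimiter and a purely numeric sequence number as in the harrypotter-032.png example.

```shell
# Hypothetical helper (not shipped with PICCL): split a page-image file name
# of the form $documentname-$sequencenumber.$extension.
# $1: file name (a leading directory is ignored)
# $2: delimiter (assumed default: hyphen)
# Prints "document<TAB>sequencenumber" and returns 0 if the name fits the
# convention; prints nothing and returns 1 otherwise.
split_page_name() {
    spn_delim=${2:--}
    spn_base=${1##*/}           # drop any leading directory
    spn_ext=${spn_base##*.}     # text after the last dot
    spn_stem=${spn_base%.*}     # text before the last dot
    case $spn_ext in
        tif|jpg|gif|png) ;;     # the per-page image types listed above
        *) return 1 ;;
    esac
    case $spn_stem in
        *"$spn_delim"*) ;;      # the delimiter must be present
        *) return 1 ;;
    esac
    spn_doc=${spn_stem%"$spn_delim"*}    # everything before the last delimiter
    spn_seq=${spn_stem##*"$spn_delim"}   # everything after the last delimiter
    case $spn_seq in
        ''|*[!0-9]*) return 1 ;;         # sequence number must be digits only
    esac
    [ -n "$spn_doc" ] || return 1
    printf '%s\t%s\n' "$spn_doc" "$spn_seq"
}
```

A loop such as for f in corpora/TIFF/NLD/*.tif; do split_page_name "$f" >/dev/null || echo "unexpected name: $f"; done would then report files that do not fit the convention.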

In the first example, the result will be a file OllevierGeets.folia.xml in the ocr_output/ directory. This in turn can serve as input for the TICCL workflow, which will attempt to correct OCR errors. Note that the --inputclass OCR parameter is mandatory if you want to use the FoLiA output of ocr.nf as input for TICCL:

$ ticcl.nf --inputdir ocr_output/ --inputclass OCR --lexicon $LM_PREFIX/opt/PICCL/data/int/eng/eng.aspell.dict --alphabet $LM_PREFIX/opt/PICCL/data/int/eng/eng.aspell.dict.lc.chars --charconfus $LM_PREFIX/opt/PICCL/data/int/eng/eng.aspell.dict.c0.d2.confusion

Note that here we pass a language-specific lexicon file, alphabet file, and character confusion file from the data files obtained by download-data.nf. The result will be a file OllevierGeets.folia.ticcl.xml in the ticcl_output/ directory, containing enriched corrections. The second example, on the Dutch corpus data, can be run as follows:

$ ticcl.nf --inputdir ocr_output/ --inputclass OCR --lexicon $LM_PREFIX/opt/PICCL/data/int/nld/nld.aspell.dict --alphabet $LM_PREFIX/opt/PICCL/data/int/nld/nld.aspell.dict.lc.chars --charconfus $LM_PREFIX/opt/PICCL/data/int/nld/nld.aspell.dict.c20.d2.confusion
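
The two invocations above differ only in the language code and in the name of the confusion file (c0.d2 for English, c20.d2 for Dutch). The following sketch builds such a command line from a data directory and a language code. It is not part of PICCL: the function name ticcl_cmd_for_lang is hypothetical, and it assumes the int/<language>/<language>.aspell.dict layout seen in the examples. It prints the command rather than running it.

```shell
# Hypothetical helper (not shipped with PICCL): print a ticcl.nf invocation
# for OCR output in a given language.
# $1: data directory, $2: language code (e.g. eng, nld), $3: input directory
# Assumption: data files follow the int/<lang>/<lang>.aspell.dict* layout
# used in the examples above.
ticcl_cmd_for_lang() {
    tc_dir=$1/int/$2
    tc_lexicon=$tc_dir/$2.aspell.dict
    tc_alphabet=$tc_lexicon.lc.chars
    # The confusion file name differs per language (c0 for eng, c20 for nld),
    # so look it up instead of guessing; the first match is used.
    tc_confusion=
    for tc_f in "$tc_lexicon".c*.d2.confusion; do
        if [ -f "$tc_f" ]; then
            tc_confusion=$tc_f
            break
        fi
    done
    # Refuse to print a command that points at missing files.
    for tc_f in "$tc_lexicon" "$tc_alphabet" "$tc_confusion"; do
        if [ -z "$tc_f" ] || [ ! -f "$tc_f" ]; then
            echo "missing data file for language '$2' in $tc_dir" >&2
            return 1
        fi
    done
    echo "ticcl.nf --inputdir $3 --inputclass OCR --lexicon $tc_lexicon --alphabet $tc_alphabet --charconfus $tc_confusion"
}
```

For instance, ticcl_cmd_for_lang "$LM_PREFIX/opt/PICCL/data" nld ocr_output/ should print the Dutch command shown above, provided the data files are present.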

Webapplication / RESTful webservice

Installation

PICCL is also available as a webapplication and RESTful webservice, powered by CLAM. If you are in LaMachine with PICCL, the webservice is already installed, but you may need to run lamachine-start-webserver if it is not already running.

For production environments, you will want to adapt the CLAM configuration. To this end, copy $LM_PREFIX/etc/piccl.config.yml to $LM_PREFIX/etc/piccl.$HOST.yml, where $HOST corresponds with your hostname and edit the file with your host specific settings. Always enable authentication if your server is world-accessible (consult the CLAM documentation to read how).
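
The copy step can be scripted so that an already edited host-specific file is never overwritten. This is a minimal sketch, not part of PICCL or CLAM: the function name piccl_host_config is hypothetical, and falling back to the output of the hostname command is an assumption about what $HOST should be.

```shell
# Hypothetical helper (not shipped with PICCL): create a host-specific copy
# of the PICCL webservice configuration, unless one already exists.
# $1: LaMachine prefix (i.e. $LM_PREFIX)
# $2: host name (assumption: defaults to the output of `hostname`)
# Prints the path of the host-specific file to edit.
piccl_host_config() {
    phc_src=$1/etc/piccl.config.yml
    phc_dst=$1/etc/piccl.${2:-$(hostname)}.yml
    if [ ! -f "$phc_src" ]; then
        echo "template not found: $phc_src" >&2
        return 1
    fi
    if [ -e "$phc_dst" ]; then
        # Never overwrite settings that were already adapted.
        echo "keeping existing $phc_dst" >&2
    else
        cp "$phc_src" "$phc_dst"
    fi
    printf '%s\n' "$phc_dst"
}
```

Running piccl_host_config "$LM_PREFIX" prints the file to edit; the authentication settings themselves still have to be filled in by hand, following the CLAM documentation.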

Technical Details & Contributing

Please see CONTRIBUTE.md for technical details and information on how to contribute.

Comments
  • Autosearch forwarder gives server error

    Clicking the Autosearch forwarder (for one file) on a file, gives a 500 internal server error, caused by Python: https://pastebin.ubuntu.com/p/4HRCH4HsRM/

    bug ready 
    opened by peterdekker 18
  • "Process `ticclunk (1)` terminated with an error exit status (134)" from ticcl.nf

    I am getting an error when I try to run ticcl.nf with a folia.xml file I got from ocr.nf. I'm being led to believe this is an issue with the corpus.wordfreqlist.tsv file. When I omit the optional parameter --corpusfreqlist I get

    lamachine@0085222b6173:~$ ticcl.nf --inputdir /home/lamachine --lexicon /data/int/eng/eng.aspell.dict --alphabet /data/int/eng/eng.aspell.dict.lc.chars --charconfus /data/int/eng/eng.aspell.dict.c0.d2.confusion
    N E X T F L O W  ~  version 0.30.2
    Launching `/usr/local/bin/ticcl.nf` [dreamy_colden] - revision: 3bd4e988b7
    --------------------------
    TICCL Pipeline
    --------------------------
    [warm up] executor > local
    [05/df5926] Submitted process > corpusfrequency (1)
    [ef/85c258] Submitted process > corpusfrequency (2)
    [af/61c201] Submitted process > ticclunk (1)
    ERROR ~ Error executing process > 'ticclunk (1)'
    
    Caused by:
      Process `ticclunk (1)` terminated with an error exit status (134)
    
    Command executed:
    
      set +u
      if [ ! -z "" ]; then
          source /bin/activate
      fi
      set -u
    
      TICCL-unk --background "eng.aspell.dict" --artifrq 10000000 "corpus.wordfreqlist.tsv"
    
    Command exit status:
      134
    
    Command output:
      (empty)
    
    Command error:
      terminate called after throwing an instance of 'std::runtime_error'
        what():  creating UniFilter: default_filter failed
      error in rules, line=-1 at postion: -1
      .command.sh: line 8:   551 Aborted                 TICCL-unk --background "eng.aspell.dict" --artifrq 10000000 "corpus.wordfreqlist.tsv"
    
    Work dir:
      /home/lamachine/work/af/61c20118a7986818b0285e321ea881
    
    Tip: when you have fixed the problem you can continue the execution appending to the nextflow command line the option `-resume`
    
     -- Check '.nextflow.log' file for details
    

    and when I include --corpusfreqlist I get

    lamachine@0085222b6173:~$ ticcl.nf --inputdir /home/lamachine --lexicon /data/int/eng/eng.aspell.dict --alphabet /data/int/eng/eng.aspell.dict.lc.chars --charconfus /data/int/eng/eng.aspell.dict.c0.d2.confusion --corpusfreqlist /home/lamachine/corpus.wordfreqlist.tsv
    N E X T F L O W  ~  version 0.30.2
    Launching `/usr/local/bin/ticcl.nf` [grave_dalembert] - revision: 3bd4e988b7
    --------------------------
    TICCL Pipeline
    --------------------------
    [warm up] executor > local
    [27/88234a] Submitted process > ticclunk (1)
    ERROR ~ Error executing process > 'ticclunk (1)'
    
    Caused by:
      Process `ticclunk (1)` terminated with an error exit status (134)
    
    Command executed:
    
      set +u
      if [ ! -z "" ]; then
          source /bin/activate
      fi
      set -u
    
      TICCL-unk --background "eng.aspell.dict" --artifrq 10000000 "corpus.wordfreqlist.tsv"
    
    Command exit status:
      134
    
    Command output:
      (empty)
    
    Command error:
      terminate called after throwing an instance of 'std::runtime_error'
        what():  creating UniFilter: default_filter failed
      error in rules, line=-1 at postion: -1
      .command.sh: line 8:   637 Aborted                 TICCL-unk --background "eng.aspell.dict" --artifrq 10000000 "corpus.wordfreqlist.tsv"
    
    Work dir:
      /home/lamachine/work/27/88234a21c792567fc01b34424bc2e3
    
    Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
    
     -- Check '.nextflow.log' file for details
    

    The error appears to be rooted in the corpus.wordfreqlist.tsv still. My few guesses are that eng.aspell.dict doesn't include some words that appear in corpus.wordfreqlist.tsv, but I'm not sure what kind of clean up program wouldn't account for incorrectly spelled words so I don't think that would be the case.

    Another issue I see is that some of the words I have in my wordfreqlist begin with numbers and the program doesn't know how to account for cases like "13" where there was one "3" in the PDF read into ocr.nf, this, among several other similar cases, could be problematic.

    A final idea that I have is that when I got the wordfreqlist I began to look through it just to see what it was, and the most frequent word in the file "the" did not have a frequency attached to it. It should have appeared "9the". When I noticed this I attempted to fix it and input the corrected (or so I am led to believe) corpus.wordfreqlist.tsv with the --corpusfreqlist parameter. This was the second example I included and it still didn't work so I don't know what's going wrong.

    question 
    opened by willstout 17
  • Please make test book available in the PICCL workflow

    Please make the following test book version available in the PICCL workflow:

    [mreynaert@scootaloo:~]$ ls -l /vol/tensusers/mreynaert/DPO35tiff.tar.gz
    -rw-rw-r-- 1 mreynaert mreynaert 1304172529 Feb 5 16:07 /vol/tensusers/mreynaert/DPO35tiff.tar.gz

    ready test 
    opened by martinreynaert 17
  • Plain text processing does not work as expected?

    Moved from proycon/LaMachine#37, by @mathias3

    nextflow run LanguageMachines/PICCL/ticcl.nf --inputdir /home/projects/Kaggle_denoise/ICDAR-2017-Post-OCR-Correction/text/ --lexicon data/int/pol/pol.aspell.dict --alphabet data/int/pol/pol.aspell.dict.lc.chars --charconfus data/int/pol/pol.aspell.dict.c0.d2.confusion --inputtype 'text'
    
    N E X T F L O W ~ version 0.27.4
    Launching LanguageMachines/PICCL [mad_legentil] - revision: a006ed747c [master]
    NOTE: Your local project version looks outdated - a different revision is available in the remote repository [e9754ef2e1]
    
    TICCL Pipeline
    
    [warm up] executor > local
    [f4/ce2b3e] Submitted process > txt2folia (1)
    [3f/e17ce8] Submitted process > corpusfrequency (1)
    ERROR ~ Error executing process > 'corpusfrequency (1)'
    
    Caused by:
    Missing output file(s) corpus.wordfreqlist.tsv expected by process corpusfrequency (1)
    
    Command executed:
    
    set +u
    if [ ! -z "" ]; then
    source /bin/activate
    fi
    set -u
    
    FoLiA-stats --class "OCR" -s -t 1 -e folia.xml --lang=none --ngram 1 -o corpus .
    
    Command exit status:
    0
    
    Command output:
    start processing of 1 files
    done processsing directory '.'
    start calculating the results
    in total 0 n-grams were found.
    
    Command error:
    
    XML-error: PCDATA invalid Char value 12
    
    FoLiA-stats: failed to load document './doc.folia.xml'
    FoLiA-stats: reason: XML error: No XML document read
    
    Work dir:
    /usr/src/LaMachine/work/3f/e17ce8cb5ee9d497957609b34fdc29
    
    Tip: view the complete command output by changing to the process work dir and entering the command cat .command.out
    
    -- Check '.nextflow.log' file for details
    

    @martinreynaert Can you replicate this problem and suggest a remedy?
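
Char value 12 in the XML error above is a form feed (0x0C), a control character that XML 1.0 does not allow in character data. One possible workaround, offered as a sketch and not as a confirmed fix for this issue, is to strip such characters from the plain-text input before it is converted to FoLiA. The function name strip_xml_invalid is hypothetical.

```shell
# Hypothetical pre-processing filter (not part of PICCL): remove control
# characters that XML 1.0 does not allow in character data, i.e. everything
# below 0x20 except tab (\011), line feed (\012) and carriage return (\015).
# Reads standard input, writes standard output. Bytes >= 0x80 (UTF-8
# multi-byte sequences) are left untouched.
strip_xml_invalid() {
    LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}
```

It could be applied per file, for example strip_xml_invalid < page.txt > cleaned/page.txt (file names here are placeholders). Note that a form feed is deleted outright, so the text on both sides of it is joined.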

    bug enhancement investigate 
    opened by proycon 15
  • Tesseract produces garbage output without warning

    We found out the hard way that new versions (since 3.05, also 4) of Tesseract produce garbage without warning. This is because the location and file names of helper files and the config file have changed.

    In 3.04 the command line was:

    export TESSDATA_PREFIX="/usr/share/tesseract/tessdata/"; /usr/local/bin/tesseract $doc $hocrdir/$last -l nld /usr/share/tesseract/tessdata/tools/config.hocr

    For 3.05 and 4 this should be like:

    export TESSDATA_PREFIX="/roaming/tesseract/local/share/tessdata/langfiles"; /roaming/tesseract/local/bin/tesseract $doc $hocrdir/$last -l nld /roaming/tesseract/local/share/tessdata/configs/hocr

    Note that each Linux distribution may be installing a different Tesseract version by default. LaMachine currently relies on the Linux distro for installing Tesseract.

    MRE

    expired 
    opened by martinreynaert 13
  • PICCL pipelines need to do better input validation and provide better error/warning messages to the user + general lack of documentation needs to improve

    I'm trying to run ocr.nf with docker and I'm not sure how the parameters are meant to be used.

    So for like the --inputdir parameter, we're only supposed to give the folder that contains the images? Does this mean that what ever image files are within that folder are going to be run through the pipeline? And is this file system our normal file system or our docker file system?

    So if I want to run a pdf, that's sitting in my desktop folder, through the pipeline, I would run "ocr.nf --inputdir C:\Users\willstout\Desktop --language eng"? Or would I first need to add it to a docker container

    ocr.nf is quite confusing to work with because there isn't a lot of documentation on the whole program. In fact running "ocr.nf --help" does the exact same thing as "ocr.nf". Additionally if I wanted to purposefully run something wrong just to see what error I would be given, the program will run the same as if nothing is wrong. For instance running "ocr.nf --inputdir" and not giving it a specified directory sends me back to the starting point of the OCR pipeline. Running with a specified directory just tells me

    N E X T F L O W ~ version 0.30.2
    Launching /usr/local/bin/ocr.nf [desperate_newton] - revision: 76d7839f83

    OCR Pipeline

    [warm up] executor > local
    lamachine@eab8a83a33ea:~$

    And running this with a directory that doesn't exist gives that same output. So there's no way to tell if what I am doing is correct.
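
Until the pipelines validate their input themselves, a check like the following could be run before calling ocr.nf. It is only a sketch and not part of PICCL: the function name check_inputdir and its return codes are hypothetical, and it has to be run in the same environment (for example inside the container) in which the workflow will see the directory.

```shell
# Hypothetical pre-flight check (not part of PICCL): verify that an input
# directory exists and holds at least one file with the given extension.
# $1: input directory, $2: extension without dot (assumed default: pdf)
# Returns 0 if usable, 1 if the directory is missing, 2 if no files match.
check_inputdir() {
    ci_dir=$1
    ci_ext=${2:-pdf}
    if [ ! -d "$ci_dir" ]; then
        echo "input directory does not exist: $ci_dir" >&2
        return 1
    fi
    for ci_f in "$ci_dir"/*."$ci_ext"; do
        # An unmatched glob stays literal, so test that the file is real.
        if [ -f "$ci_f" ]; then
            return 0
        fi
    done
    echo "no *.$ci_ext files found in $ci_dir" >&2
    return 2
}
```

A guarded call would then look like check_inputdir mydir pdf && ocr.nf --inputdir mydir --language eng, where mydir is a placeholder.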

    enhancement question expired 
    opened by willstout 12
  • Tested Full OCR & TICCL & Frog pipeline in web version

    I ran project 'DPO35maybe'. This is using the Dutch Martinet book.

    • There was no option whatsoever to select an available lexicon. Lexicons for the various languages are available in the data/int/ internal dirs.
    • In the output I see no evidence whatsoever that TICCL actually ran, apart from there being two versions of the paragraphs. This may be a consequence of no lexicon having been available to the system.
    • Frog seems to have run the dependency parser. That is very slow and certainly not required by default.
    enhancement ready test 
    opened by martinreynaert 11
  • TICCL fails on empty unknown words file

    I invoke TICCL via the command line, which normally works for data generated by ocr.nf. Now I am using data external to TICCL, the beaufort dataset from Huygens ING.

    When running TICCL, the FoLiA-correct step gives an error about the unknown words file:

    mei-29 14:25:28.982 [main] DEBUG nextflow.cli.Launcher - $> /vol1/lamachine/bin/nextflow /vol1/lamachine/bin/ticcl.nf --inputdir beaufort/ --lexicon /vol1/lamachine/piccldata/data/int/nld/nld.aspell.dict --alphabet /vol1/lamachine/piccldata/data/int/nld/nld.aspell.dict.lc.chars --charconfus /vol1/lamachine/piccldata/data/int/nld/nld.aspell.dict.c20.d2.confusion --inputclass default --extension xml
    mei-29 14:25:29.227 [main] INFO  nextflow.cli.CmdRun - N E X T F L O W  ~  version 0.29.1
    mei-29 14:25:29.252 [main] INFO  nextflow.cli.CmdRun - Launching `/vol1/lamachine/bin/ticcl.nf` [clever_ampere] - revision: be4c5985bb
    mei-29 14:25:29.441 [main] DEBUG nextflow.Session - Session uuid: 450091c9-b699-4c50-8c3a-898c7691bad4
    mei-29 14:25:29.441 [main] DEBUG nextflow.Session - Run name: clever_ampere
    mei-29 14:25:29.445 [main] DEBUG nextflow.Session - Executor pool size: 2
    mei-29 14:25:29.471 [main] DEBUG nextflow.cli.CmdRun - 
      Version: 0.29.1 build 4804
      Modified: 10-05-2018 07:47 UTC (09:47 CEST)
      System: Linux 3.10.0-693.21.1.el7.x86_64
      Runtime: Groovy 2.4.15 on OpenJDK 64-Bit Server VM 1.8.0_171-b10
      Encoding: UTF-8 (UTF-8)
      Process: [email protected] [172.16.4.88]
      CPUs: 2 - Mem: 7,6 GB (1,6 GB) - Swap: 820 MB (819,5 MB)
    mei-29 14:25:29.606 [main] DEBUG nextflow.Session - Work-dir: /home/piccl/work [xfs]
    mei-29 14:25:29.606 [main] DEBUG nextflow.Session - Script base path does not exist or is not a directory: /vol1/lamachine/bin/bin
    mei-29 14:25:29.797 [main] DEBUG nextflow.Session - Session start invoked
    mei-29 14:25:29.812 [main] DEBUG nextflow.processor.TaskDispatcher - Dispatcher > start
    mei-29 14:25:29.813 [main] DEBUG nextflow.script.ScriptRunner - > Script parsing
    mei-29 14:25:30.572 [main] DEBUG nextflow.script.ScriptRunner - > Launching execution
    mei-29 14:25:30.577 [main] INFO  nextflow.Nextflow - --------------------------
    mei-29 14:25:30.577 [main] INFO  nextflow.Nextflow - TICCL Pipeline
    mei-29 14:25:30.577 [main] INFO  nextflow.Nextflow - --------------------------
    mei-29 14:25:30.809 [main] DEBUG nextflow.Channel - files for syntax: glob; folder: beaufort/; pattern: **.xml; options: null
    mei-29 14:25:30.992 [main] DEBUG nextflow.processor.ProcessFactory - << taskConfig executor: null
    mei-29 14:25:30.992 [main] DEBUG nextflow.processor.ProcessFactory - >> processorType: 'local'
    mei-29 14:25:31.003 [main] DEBUG nextflow.executor.Executor - Initializing executor: local
    mei-29 14:25:31.007 [main] INFO  nextflow.executor.Executor - [warm up] executor > local
    mei-29 14:25:31.017 [main] DEBUG n.processor.LocalPollingMonitor - Creating local task monitor for executor 'local' > cpus=2; memory=7,6 GB; capacity=2; pollInterval=100ms; dumpInterval=5m
    mei-29 14:25:31.024 [main] DEBUG nextflow.processor.TaskDispatcher - Starting monitor: LocalPollingMonitor
    mei-29 14:25:31.025 [main] DEBUG n.processor.TaskPollingMonitor - >>> barrier register (monitor: local)
    mei-29 14:25:31.028 [main] DEBUG nextflow.executor.Executor - Invoke register for executor: local
    mei-29 14:25:31.110 [main] DEBUG nextflow.Session - >>> barrier register (process: corpusfrequency)
    mei-29 14:25:31.117 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > corpusfrequency -- maxForks: 2
    mei-29 14:25:31.571 [main] DEBUG nextflow.processor.ProcessFactory - << taskConfig executor: null
    mei-29 14:25:31.571 [main] DEBUG nextflow.processor.ProcessFactory - >> processorType: 'local'
    mei-29 14:25:31.572 [main] DEBUG nextflow.executor.Executor - Initializing executor: local
    mei-29 14:25:31.573 [main] DEBUG nextflow.Session - >>> barrier register (process: ticclunk)
    mei-29 14:25:31.575 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > ticclunk -- maxForks: 2
    mei-29 14:25:31.603 [main] DEBUG nextflow.processor.ProcessFactory - << taskConfig executor: null
    mei-29 14:25:31.604 [main] DEBUG nextflow.processor.ProcessFactory - >> processorType: 'local'
    mei-29 14:25:31.604 [main] DEBUG nextflow.executor.Executor - Initializing executor: local
    mei-29 14:25:31.605 [main] DEBUG nextflow.Session - >>> barrier register (process: anahash)
    mei-29 14:25:31.606 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > anahash -- maxForks: 2
    mei-29 14:25:31.642 [main] DEBUG nextflow.processor.ProcessFactory - << taskConfig executor: null
    mei-29 14:25:31.642 [main] DEBUG nextflow.processor.ProcessFactory - >> processorType: 'local'
    mei-29 14:25:31.643 [main] DEBUG nextflow.executor.Executor - Initializing executor: local
    mei-29 14:25:31.648 [main] DEBUG nextflow.Session - >>> barrier register (process: indexer)
    mei-29 14:25:31.650 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > indexer -- maxForks: 2
    mei-29 14:25:31.667 [main] DEBUG nextflow.processor.ProcessFactory - << taskConfig executor: null
    mei-29 14:25:31.668 [main] DEBUG nextflow.processor.ProcessFactory - >> processorType: 'local'
    mei-29 14:25:31.668 [main] DEBUG nextflow.executor.Executor - Initializing executor: local
    mei-29 14:25:31.669 [main] DEBUG nextflow.Session - >>> barrier register (process: resolver)
    mei-29 14:25:31.670 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > resolver -- maxForks: 2
    mei-29 14:25:31.679 [main] DEBUG nextflow.processor.ProcessFactory - << taskConfig executor: null
    mei-29 14:25:31.679 [main] DEBUG nextflow.processor.ProcessFactory - >> processorType: 'local'
    mei-29 14:25:31.679 [main] DEBUG nextflow.executor.Executor - Initializing executor: local
    mei-29 14:25:31.680 [main] DEBUG nextflow.Session - >>> barrier register (process: rank)
    mei-29 14:25:31.681 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > rank -- maxForks: 2
    mei-29 14:25:31.696 [main] DEBUG nextflow.processor.ProcessFactory - << taskConfig executor: null
    mei-29 14:25:31.696 [main] DEBUG nextflow.processor.ProcessFactory - >> processorType: 'local'
    mei-29 14:25:31.697 [main] DEBUG nextflow.executor.Executor - Initializing executor: local
    mei-29 14:25:31.697 [main] DEBUG nextflow.Session - >>> barrier register (process: foliacorrect)
    mei-29 14:25:31.698 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > foliacorrect -- maxForks: 2
    mei-29 14:25:31.701 [main] DEBUG nextflow.script.ScriptRunner - > Await termination 
    mei-29 14:25:31.701 [main] DEBUG nextflow.Session - Session await
    mei-29 14:25:31.765 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:31.776 [Task submitter] INFO  nextflow.Session - [35/95f629] Submitted process > corpusfrequency (1)
    mei-29 14:25:31.809 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:31.818 [Task submitter] INFO  nextflow.Session - [d0/95c1ba] Submitted process > corpusfrequency (2)
    mei-29 14:25:32.115 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 1; name: corpusfrequency (1); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/35/95f6295ae371c0ee11dbc3b62a0a9a]
    mei-29 14:25:32.121 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:32.121 [Task submitter] INFO  nextflow.Session - [64/b1a200] Submitted process > corpusfrequency (3)
    mei-29 14:25:32.137 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 2; name: corpusfrequency (2); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/d0/95c1bafdc68c7f914026cf8a790677]
    mei-29 14:25:32.140 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:32.141 [Task submitter] INFO  nextflow.Session - [e6/040b4c] Submitted process > corpusfrequency (4)
    mei-29 14:25:32.241 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 3; name: corpusfrequency (3); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/64/b1a200dfebab837c97732685ccef4c]
    mei-29 14:25:32.245 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:32.245 [Task submitter] INFO  nextflow.Session - [ce/4fe930] Submitted process > corpusfrequency (5)
    mei-29 14:25:32.309 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 4; name: corpusfrequency (4); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/e6/040b4c09057517748e283e727398e8]
    mei-29 14:25:32.318 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:32.319 [Task submitter] INFO  nextflow.Session - [d7/db7579] Submitted process > corpusfrequency (7)
    mei-29 14:25:32.376 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 7; name: corpusfrequency (7); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/d7/db75799ab3cf0bf5537bb67c50c265]
    mei-29 14:25:32.378 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:32.378 [Task submitter] INFO  nextflow.Session - [9a/17b174] Submitted process > corpusfrequency (6)
    mei-29 14:25:32.472 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 5; name: corpusfrequency (5); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/ce/4fe930213e9c0149e2f87efb747701]
    mei-29 14:25:32.477 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:32.478 [Task submitter] INFO  nextflow.Session - [ab/792e6e] Submitted process > corpusfrequency (9)
    mei-29 14:25:32.560 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 9; name: corpusfrequency (9); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/ab/792e6eec823c42d6a5f7b670b48c87]
    mei-29 14:25:32.572 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:32.572 [Task submitter] INFO  nextflow.Session - [9f/ed8dce] Submitted process > corpusfrequency (8)
    mei-29 14:25:32.627 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 6; name: corpusfrequency (6); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/9a/17b174dd6792613a968f68cc0ed7dd]
    mei-29 14:25:32.635 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:32.635 [Task submitter] INFO  nextflow.Session - [5a/75d941] Submitted process > corpusfrequency (11)
    mei-29 14:25:32.710 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 8; name: corpusfrequency (8); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/9f/ed8dceb048b78dfb7642e5fe29a17f]
    mei-29 14:25:32.715 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:32.715 [Task submitter] INFO  nextflow.Session - [5e/6b15ba] Submitted process > corpusfrequency (10)
    mei-29 14:25:32.741 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 11; name: corpusfrequency (11); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/5a/75d941ab54286c3d3e60e23a395e3c]
    mei-29 14:25:32.745 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:32.746 [Task submitter] INFO  nextflow.Session - [4f/051b72] Submitted process > corpusfrequency (12)
    mei-29 14:25:32.798 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 10; name: corpusfrequency (10); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/5e/6b15ba01737b46f738a0edb9741758]
    mei-29 14:25:32.805 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:32.805 [Task submitter] INFO  nextflow.Session - [6e/2e32b3] Submitted process > corpusfrequency (13)
    mei-29 14:25:32.884 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 12; name: corpusfrequency (12); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/4f/051b72dcb44e2a50a4b61b421a2960]
    mei-29 14:25:32.891 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:32.891 [Task submitter] INFO  nextflow.Session - [44/c70447] Submitted process > corpusfrequency (14)
    mei-29 14:25:32.901 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 13; name: corpusfrequency (13); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/6e/2e32b37a27e56780ad99088f466a2e]
    mei-29 14:25:32.909 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:32.909 [Task submitter] INFO  nextflow.Session - [6e/89fbdf] Submitted process > corpusfrequency (16)
    mei-29 14:25:33.005 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 14; name: corpusfrequency (14); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/44/c704479bd73ad3efcff2953baf3d01]
    mei-29 14:25:33.028 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:33.028 [Task submitter] INFO  nextflow.Session - [44/216282] Submitted process > corpusfrequency (17)
    mei-29 14:25:33.049 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 16; name: corpusfrequency (16); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/6e/89fbdfa28b42c0349a96857589dd3a]
    mei-29 14:25:33.056 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:33.057 [Task submitter] INFO  nextflow.Session - [10/dd9193] Submitted process > corpusfrequency (15)
    mei-29 14:25:33.257 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 17; name: corpusfrequency (17); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/44/2162826d13bd1d2fe62b79385ea0d7]
    mei-29 14:25:33.268 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:33.268 [Task submitter] INFO  nextflow.Session - [98/dda69e] Submitted process > corpusfrequency (18)
    mei-29 14:25:33.442 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 15; name: corpusfrequency (15); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/10/dd9193efd9de05eb0c52a85dbe6044]
    mei-29 14:25:33.445 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:33.446 [Task submitter] INFO  nextflow.Session - [74/bb4e7d] Submitted process > corpusfrequency (19)
    mei-29 14:25:33.568 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 18; name: corpusfrequency (18); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/98/dda69e9b7fab2a4b3d042835e593e5]
    mei-29 14:25:33.574 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:33.574 [Task submitter] INFO  nextflow.Session - [17/79f987] Submitted process > corpusfrequency (20)
    mei-29 14:25:33.687 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 19; name: corpusfrequency (19); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/74/bb4e7d089b05dff76a04932fca0379]
    mei-29 14:25:33.691 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:33.691 [Task submitter] INFO  nextflow.Session - [77/fec8b6] Submitted process > corpusfrequency (22)
    mei-29 14:25:33.743 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 22; name: corpusfrequency (22); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/77/fec8b6ede99ff37c2a386c3c69f78b]
    mei-29 14:25:33.746 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:33.747 [Task submitter] INFO  nextflow.Session - [9c/78d28b] Submitted process > corpusfrequency (23)
    mei-29 14:25:33.793 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 23; name: corpusfrequency (23); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/9c/78d28b7ab2e5ffd9af2e3f2a96864c]
    mei-29 14:25:33.797 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:33.797 [Task submitter] INFO  nextflow.Session - [fc/9db4f5] Submitted process > corpusfrequency (21)
    mei-29 14:25:33.847 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 20; name: corpusfrequency (20); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/17/79f987e34e64b80620637b1e30b689]
    mei-29 14:25:33.849 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:33.849 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 21; name: corpusfrequency (21); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/fc/9db4f59790b4eb0318a04ef4248455]
    mei-29 14:25:33.850 [Task submitter] INFO  nextflow.Session - [9a/e7386c] Submitted process > corpusfrequency (24)
    mei-29 14:25:33.856 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:33.856 [Task submitter] INFO  nextflow.Session - [4f/69ad19] Submitted process > corpusfrequency (25)
    mei-29 14:25:33.963 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 24; name: corpusfrequency (24); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/9a/e7386cc11f6005d8e83ff34c989839]
    mei-29 14:25:33.968 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:33.968 [Task submitter] INFO  nextflow.Session - [04/0474d4] Submitted process > corpusfrequency (26)
    mei-29 14:25:34.012 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 25; name: corpusfrequency (25); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/4f/69ad19f8d2732107b6e8b12ad297d3]
    mei-29 14:25:34.017 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:34.017 [Task submitter] INFO  nextflow.Session - [de/681753] Submitted process > corpusfrequency (27)
    mei-29 14:25:34.020 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 26; name: corpusfrequency (26); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/04/0474d4e504ebd5747f901f93d1824d]
    mei-29 14:25:34.023 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:34.023 [Task submitter] INFO  nextflow.Session - [27/900d1a] Submitted process > corpusfrequency (28)
    mei-29 14:25:34.063 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 27; name: corpusfrequency (27); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/de/6817533aabbee545e152e282d454d2]
    mei-29 14:25:34.067 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:34.067 [Task submitter] INFO  nextflow.Session - [4a/c75387] Submitted process > corpusfrequency (29)
    mei-29 14:25:34.074 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 28; name: corpusfrequency (28); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/27/900d1a6c2ca38af50d95b89fca3d3c]
    mei-29 14:25:34.079 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:34.079 [Task submitter] INFO  nextflow.Session - [9b/1b2278] Submitted process > corpusfrequency (31)
    mei-29 14:25:34.113 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 29; name: corpusfrequency (29); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/4a/c75387ce9f69b2e057be1da29a4775]
    mei-29 14:25:34.119 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:34.119 [Task submitter] INFO  nextflow.Session - [45/3bf2ef] Submitted process > corpusfrequency (30)
    mei-29 14:25:34.163 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 30; name: corpusfrequency (30); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/45/3bf2ef1471bd4342ff51a7c592f13a]
    mei-29 14:25:34.166 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:34.166 [Task submitter] INFO  nextflow.Session - [dc/f0c188] Submitted process > corpusfrequency (33)
    mei-29 14:25:34.183 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 31; name: corpusfrequency (31); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/9b/1b2278bed14b4c5c93cf289a1013bf]
    mei-29 14:25:34.186 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:34.186 [Task submitter] INFO  nextflow.Session - [2c/be9d26] Submitted process > corpusfrequency (34)
    mei-29 14:25:34.209 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 33; name: corpusfrequency (33); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/dc/f0c18884fb7ed18b00326bc62b13b1]
    mei-29 14:25:34.213 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:34.213 [Task submitter] INFO  nextflow.Session - [32/519b97] Submitted process > corpusfrequency (35)
    mei-29 14:25:34.275 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 34; name: corpusfrequency (34); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/2c/be9d269e1641c6624b1d86bcb3e223]
    mei-29 14:25:34.280 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:34.280 [Task submitter] INFO  nextflow.Session - [52/80e50c] Submitted process > corpusfrequency (36)
    mei-29 14:25:34.368 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 36; name: corpusfrequency (36); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/52/80e50c0b7eac98cf1c7bc1af0e659e]
    mei-29 14:25:34.371 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 35; name: corpusfrequency (35); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/32/519b97179d7101204722b0b519886c]
    mei-29 14:25:34.371 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:34.371 [Task submitter] INFO  nextflow.Session - [38/164078] Submitted process > corpusfrequency (37)
    mei-29 14:25:34.376 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:34.377 [Task submitter] INFO  nextflow.Session - [6a/33a3e2] Submitted process > corpusfrequency (32)
    mei-29 14:25:34.496 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 32; name: corpusfrequency (32); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/6a/33a3e2f134c2678dd939c60f436d2e]
    mei-29 14:25:34.499 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:34.499 [Task submitter] INFO  nextflow.Session - [67/005432] Submitted process > corpusfrequency (38)
    mei-29 14:25:34.538 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 37; name: corpusfrequency (37); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/38/164078d39e27f1b6c55ae4efe7c759]
    mei-29 14:25:34.540 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:34.540 [Task submitter] INFO  nextflow.Session - [08/f377ee] Submitted process > corpusfrequency (39)
    mei-29 14:25:34.544 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 38; name: corpusfrequency (38); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/67/00543217b1c13e39867cf60046a1f5]
    mei-29 14:25:34.546 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:34.546 [Task submitter] INFO  nextflow.Session - [21/26d1af] Submitted process > ticclunk (1)
    mei-29 14:25:34.684 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 39; name: corpusfrequency (39); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/08/f377ee4f0aa106bdc06003480f0362]
    mei-29 14:25:34.688 [Actor Thread 8] DEBUG nextflow.Session - <<< barrier arrive (process: corpusfrequency)
    mei-29 14:25:36.023 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 40; name: ticclunk (1); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/21/26d1afd131f0c6f49e00d9bcc50028]
    mei-29 14:25:36.037 [Actor Thread 4] DEBUG nextflow.Session - <<< barrier arrive (process: ticclunk)
    mei-29 14:25:36.039 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:36.039 [Task submitter] INFO  nextflow.Session - [28/32b2cc] Submitted process > anahash (1)
    mei-29 14:25:38.126 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 41; name: anahash (1); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/28/32b2cc0b91c32bf9994aef92448f4c]
    mei-29 14:25:38.133 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:25:38.133 [Task submitter] INFO  nextflow.Session - [93/e871ac] Submitted process > indexer (1)
    mei-29 14:25:38.141 [Actor Thread 4] DEBUG nextflow.Session - <<< barrier arrive (process: anahash)
    mei-29 14:26:09.720 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 42; name: indexer (1); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/93/e871ac20ef0ac0582b77b439c06e88]
    mei-29 14:26:09.721 [Actor Thread 4] DEBUG nextflow.Session - <<< barrier arrive (process: indexer)
    mei-29 14:26:09.731 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:26:09.731 [Task submitter] INFO  nextflow.Session - [84/3b9d83] Submitted process > resolver (1)
    mei-29 14:26:12.843 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 43; name: resolver (1); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/84/3b9d838be9ee5773fb568c81dfe7ab]
    mei-29 14:26:12.846 [Actor Thread 11] DEBUG nextflow.Session - <<< barrier arrive (process: resolver)
    mei-29 14:26:12.853 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:26:12.853 [Task submitter] INFO  nextflow.Session - [d1/30a9b3] Submitted process > rank (1)
    mei-29 14:26:14.102 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 44; name: rank (1); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/d1/30a9b3be980c2e7644439c754fab9a]
    mei-29 14:26:14.105 [Actor Thread 4] DEBUG nextflow.Session - <<< barrier arrive (process: rank)
    mei-29 14:26:14.121 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
    mei-29 14:26:14.122 [Task submitter] INFO  nextflow.Session - [d2/8bc290] Submitted process > foliacorrect (1)
    mei-29 14:26:16.669 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 45; name: foliacorrect (1); status: COMPLETED; exit: 0; error: -; workDir: /home/piccl/work/d2/8bc290a0290ea0078ae4111368446c]
    mei-29 14:26:16.703 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'foliacorrect (1)'
    
    Caused by:
      Missing output file(s) `*.ticcl.folia.xml` expected by process `foliacorrect (1)`
    
    Command executed:
    
      set +u
      if [ ! -z "/vol1/lamachine" ]; then
          source /vol1/lamachine/bin/activate
      fi
      set -u
      
      #some bookkeeping
      mkdir outputdir
      
      FoLiA-correct --class current --nums 10 -e xml -O outputdir/ --unk "corpus.wordfreqlist.tsv.unk" --punct "corpus.wordfreqlist.tsv.punct" --rank "corpus.wordfreqlist.tsv.clean.ldcalc.ranked"  -t 1 .
      mv outputdir/*.xml .
    
    Command exit status:
      0
    
    Command output:
      Processed :HuygensING-beaufort-1-1_44578908-29a3-43c0-bebb-23977a955d60.xml into outputdir/HuygensING-beaufort-1-1_44578908-29a3-43c0-bebb-23977a955d60.ticcl.xml still 29 files to go.
      Processed :HuygensING-beaufort-1-1_462e11ec-3ac9-48fe-8e91-15804dedd176.xml into outputdir/HuygensING-beaufort-1-1_462e11ec-3ac9-48fe-8e91-15804dedd176.ticcl.xml still 28 files to go.
      Processed :HuygensING-beaufort-1-1_584e4c68-f09f-4720-bb24-af23d9464c2e.xml into outputdir/HuygensING-beaufort-1-1_584e4c68-f09f-4720-bb24-af23d9464c2e.ticcl.xml still 27 files to go.
      Processed :HuygensING-beaufort-1-1_5c47ac91-3c67-4859-9903-eee6f7269c68.xml into outputdir/HuygensING-beaufort-1-1_5c47ac91-3c67-4859-9903-eee6f7269c68.ticcl.xml still 26 files to go.
      Processed :HuygensING-beaufort-1-1_6ed36b83-8652-4503-b15c-64236ccd2c51.xml into outputdir/HuygensING-beaufort-1-1_6ed36b83-8652-4503-b15c-64236ccd2c51.ticcl.xml still 25 files to go.
      Processed :HuygensING-beaufort-1-1_76386a9e-5db0-4746-9ad6-c40d92cc23a2.xml into outputdir/HuygensING-beaufort-1-1_76386a9e-5db0-4746-9ad6-c40d92cc23a2.ticcl.xml still 24 files to go.
      Processed :HuygensING-beaufort-1-1_7817ff25-4019-40bd-99ae-d0dc82a2cdb7.xml into outputdir/HuygensING-beaufort-1-1_7817ff25-4019-40bd-99ae-d0dc82a2cdb7.ticcl.xml still 23 files to go.
      Processed :HuygensING-beaufort-1-1_7f3c12f4-d64c-49c3-858d-b326b65beac2.xml into outputdir/HuygensING-beaufort-1-1_7f3c12f4-d64c-49c3-858d-b326b65beac2.ticcl.xml still 22 files to go.
      Processed :HuygensING-beaufort-1-1_7f956afd-1f58-426b-855e-a03e0e4b0546.xml into outputdir/HuygensING-beaufort-1-1_7f956afd-1f58-426b-855e-a03e0e4b0546.ticcl.xml still 21 files to go.
      Processed :HuygensING-beaufort-1-1_83a543ad-dbcf-4063-8032-c0ea30b2ac6f.xml into outputdir/HuygensING-beaufort-1-1_83a543ad-dbcf-4063-8032-c0ea30b2ac6f.ticcl.xml still 20 files to go.
      Processed :HuygensING-beaufort-1-1_92defa6d-d10e-4d15-8634-cad01e12429e.xml into outputdir/HuygensING-beaufort-1-1_92defa6d-d10e-4d15-8634-cad01e12429e.ticcl.xml still 19 files to go.
      Processed :HuygensING-beaufort-1-1_945564c3-7c16-4d4a-a052-3bcf929aad22.xml into outputdir/HuygensING-beaufort-1-1_945564c3-7c16-4d4a-a052-3bcf929aad22.ticcl.xml still 18 files to go.
      Processed :HuygensING-beaufort-1-1_97c8917b-d224-409e-96d4-5980dca08ff4.xml into outputdir/HuygensING-beaufort-1-1_97c8917b-d224-409e-96d4-5980dca08ff4.ticcl.xml still 17 files to go.
      Processed :HuygensING-beaufort-1-1_9b382aa1-3afc-4ba2-98a3-471e79e0c656.xml into outputdir/HuygensING-beaufort-1-1_9b382aa1-3afc-4ba2-98a3-471e79e0c656.ticcl.xml still 16 files to go.
      Processed :HuygensING-beaufort-1-1_9b7f029f-707e-4c99-9f39-b67513ff1d6d.xml into outputdir/HuygensING-beaufort-1-1_9b7f029f-707e-4c99-9f39-b67513ff1d6d.ticcl.xml still 15 files to go.
      Processed :HuygensING-beaufort-1-1_9b83cb93-36fc-4db1-ab99-7bc07fa9af58.xml into outputdir/HuygensING-beaufort-1-1_9b83cb93-36fc-4db1-ab99-7bc07fa9af58.ticcl.xml still 14 files to go.
      Processed :HuygensING-beaufort-1-1_a257257c-dc23-4c04-a36c-f23f15e67484.xml into outputdir/HuygensING-beaufort-1-1_a257257c-dc23-4c04-a36c-f23f15e67484.ticcl.xml still 13 files to go.
      Processed :HuygensING-beaufort-1-1_a6bae747-589c-46fa-8ccb-b6779dbfec8b.xml into outputdir/HuygensING-beaufort-1-1_a6bae747-589c-46fa-8ccb-b6779dbfec8b.ticcl.xml still 12 files to go.
      Processed :HuygensING-beaufort-1-1_b8420dd3-5bbd-444d-a1a4-8eb2c1e33c8f.xml into outputdir/HuygensING-beaufort-1-1_b8420dd3-5bbd-444d-a1a4-8eb2c1e33c8f.ticcl.xml still 11 files to go.
      Processed :HuygensING-beaufort-1-1_bdcde7b4-08ef-4abb-931b-37cc3756ca10.xml into outputdir/HuygensING-beaufort-1-1_bdcde7b4-08ef-4abb-931b-37cc3756ca10.ticcl.xml still 10 files to go.
      Processed :HuygensING-beaufort-1-1_c65c2ee6-598a-43f4-98da-8273f4714fa1.xml into outputdir/HuygensING-beaufort-1-1_c65c2ee6-598a-43f4-98da-8273f4714fa1.ticcl.xml still 9 files to go.
      Processed :HuygensING-beaufort-1-1_d3905d0e-e4c8-470f-906f-e7d115a2ec91.xml into outputdir/HuygensING-beaufort-1-1_d3905d0e-e4c8-470f-906f-e7d115a2ec91.ticcl.xml still 8 files to go.
      Processed :HuygensING-beaufort-1-1_da7b43e6-7023-41c7-86b6-f6732acf4b38.xml into outputdir/HuygensING-beaufort-1-1_da7b43e6-7023-41c7-86b6-f6732acf4b38.ticcl.xml still 7 files to go.
      Processed :HuygensING-beaufort-1-1_e21e7be8-9702-41be-b6e1-097c9d2b11bb.xml into outputdir/HuygensING-beaufort-1-1_e21e7be8-9702-41be-b6e1-097c9d2b11bb.ticcl.xml still 6 files to go.
      Processed :HuygensING-beaufort-1-1_e8805c41-9238-4b67-bdd7-b778a7f48750.xml into outputdir/HuygensING-beaufort-1-1_e8805c41-9238-4b67-bdd7-b778a7f48750.ticcl.xml still 5 files to go.
      Processed :HuygensING-beaufort-1-1_eac9b47c-7fe9-4cf1-acd5-076f7adbc434.xml into outputdir/HuygensING-beaufort-1-1_eac9b47c-7fe9-4cf1-acd5-076f7adbc434.ticcl.xml still 4 files to go.
      Processed :HuygensING-beaufort-1-1_ebfc8986-bc4b-45aa-b235-3057c85d886d.xml into outputdir/HuygensING-beaufort-1-1_ebfc8986-bc4b-45aa-b235-3057c85d886d.ticcl.xml still 3 files to go.
      Processed :HuygensING-beaufort-1-1_f01dc651-d6df-401e-939c-baf9697794e9.xml into outputdir/HuygensING-beaufort-1-1_f01dc651-d6df-401e-939c-baf9697794e9.ticcl.xml still 2 files to go.
      Processed :HuygensING-beaufort-1-1_f03474ce-85b7-41ef-8ed8-bd20ec6a904f.xml into outputdir/HuygensING-beaufort-1-1_f03474ce-85b7-41ef-8ed8-bd20ec6a904f.ticcl.xml still 1 files to go.
      done processsing directory '.'
      edit statistics: 
      	edit	 count
      	11	17310
      	TOKENS	438468
    
    Command error:
      no unknown words!
    
    Work dir:
      /home/piccl/work/d2/8bc290a0290ea0078ae4111368446c
    

    The unknown words file generated by the TICCL-unk step is empty. Is that expected, or does it point to a problem with my data? If an empty file is legitimate, is this a bug, and should FoLiA-correct accept the empty file without complaining?
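
One detail visible in the log above: the process declares `*.ticcl.folia.xml` as its expected output, while FoLiA-correct reports writing files whose names end in `.ticcl.xml`. The following is a minimal sketch, assuming glob semantics comparable to Python's `fnmatch` (Nextflow's own matching may differ in details), that shows such a name does not satisfy the declared pattern. The helper `unmatched_outputs` and the file name `example.ticcl.folia.xml` are hypothetical and used only for illustration.

```python
from fnmatch import fnmatch

# Output pattern declared by the `foliacorrect` process (taken from the error message).
EXPECTED_PATTERN = "*.ticcl.folia.xml"


def unmatched_outputs(filenames, pattern=EXPECTED_PATTERN):
    """Return the file names that do NOT match the expected output pattern."""
    return [name for name in filenames if not fnmatch(name, pattern)]


# A file name as reported in the FoLiA-correct output above.
produced = ["HuygensING-beaufort-1-1_44578908-29a3-43c0-bebb-23977a955d60.ticcl.xml"]

# Hypothetical name, shown only for comparison: a name the pattern would accept.
accepted = ["example.ticcl.folia.xml"]

print(unmatched_outputs(produced))  # the produced name is listed as unmatched
print(unmatched_outputs(accepted))  # nothing is listed as unmatched
```

Whether this naming difference is the actual cause of the failure is not confirmed by the log; the `no unknown words!` message on stderr may be a separate matter.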

    bug investigate 
    opened by peterdekker 10
  • frog.nf cannot find frog xml output

    I am running Frog as part of the LaMachine distribution. When I run the following command:

    $ nextflow run LanguageMachines/PICCL/frog.nf --inputdir ticcl_output/ --inputformat folia --extension folia.xml --skip=acmpn --outputdir frog_output

    (The result is the same without --inputformat and --outputdir, or with --extension xml.)

    I get the following error:

    N E X T F L O W  ~  version 0.29.0
    Launching `LanguageMachines/PICCL` [disturbed_watson] - revision: c12599e479 [master]
    WARN: The config file defines settings for an unknown process: indexer
    ----------------------------------
    Frog pipeline
    ----------------------------------
    WARN: `params.inputformat` is defined multiple times -- Assignments following the first are ignored
    [warm up] executor > local
    [7b/3e1554] Submitted process > frog_folia2folia (1)
    ERROR ~ Error executing process > 'frog_folia2folia (1)'
    
    Caused by:
      Missing output file(s) `*.xml` expected by process `frog_folia2folia (1)`
    
    Command executed:
    
      set +u
            if [ ! -z "/vol1/lamachine" ]; then
                source /vol1/lamachine/bin/activate
            fi
            set -u
      
            opts=""
            if [ ! -z "acmpn" ]; then
      	opts="--skip=acmpn"
      fi
      
            #move input files to separate staging directory
            mkdir input
            mv *.xml input/
      
            #output will be in cwd
            frog $opts --inputclass "current" --outputclass "current" --xmldir "." --threads 1 --nostdout --testdir input/ -x
    
    Command exit status:
      0
    
    Command output:
      (empty)
    
    Command error:
      frog-:Mon May  7 15:00:27 2018 done with sentence[6574]
      frog-:Mon May  7 15:00:27 2018 done with sentence[6575]
      frog-:Mon May  7 15:00:27 2018 done with sentence[6576]
      frog-:Mon May  7 15:00:27 2018 done with sentence[6577]
      frog-:Mon May  7 15:00:27 2018 done with sentence[6578]
      frog-:Mon May  7 15:00:27 2018 done with sentence[6579]
      frog-:Mon May  7 15:00:27 2018 done with sentence[6580]
      frog-:tokenisation took:  21 seconds, 78 milliseconds and 152 microseconds
      frog-:CGN tagging took:   300 seconds, 614 milliseconds and 635 microseconds
      frog-:Mblem took:         4 seconds, 876 milliseconds and 835 microseconds
      frog-:Frogging in total took: 308 seconds, 694 milliseconds and 783 microseconds
      frog-:resulting FoLiA doc saved in ./img.ticcl.folia.xml
      frog-:Mon May  7 15:00:37 2018 Frogging input/img_de_nederlander_1850_ddd_000013854.ticcl.folia.xml
      frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
      frog-:Mon May  7 15:00:37 2018 process 29 sentences
      frog-:Mon May  7 15:00:37 2018 done with sentence[1]
      frog-:Mon May  7 15:00:37 2018 done with sentence[2]
      frog-:Mon May  7 15:00:37 2018 done with sentence[3]
      frog-:Mon May  7 15:00:37 2018 done with sentence[4]
      frog-:Mon May  7 15:00:38 2018 done with sentence[5]
      frog-:Mon May  7 15:00:38 2018 done with sentence[6]
      frog-:Mon May  7 15:00:38 2018 done with sentence[7]
      frog-:Mon May  7 15:00:38 2018 done with sentence[8]
      frog-:Mon May  7 15:00:38 2018 done with sentence[9]
      frog-:Mon May  7 15:00:38 2018 done with sentence[10]
      frog-:Mon May  7 15:00:38 2018 done with sentence[11]
      frog-:Mon May  7 15:00:38 2018 done with sentence[12]
      frog-:Mon May  7 15:00:38 2018 done with sentence[13]
      frog-:Mon May  7 15:00:38 2018 done with sentence[14]
      frog-:Mon May  7 15:00:38 2018 done with sentence[15]
      frog-:Mon May  7 15:00:38 2018 done with sentence[16]
      frog-:Mon May  7 15:00:38 2018 done with sentence[17]
      frog-:Mon May  7 15:00:38 2018 done with sentence[18]
      frog-:Mon May  7 15:00:38 2018 done with sentence[19]
      frog-:Mon May  7 15:00:38 2018 done with sentence[20]
      frog-:Mon May  7 15:00:38 2018 done with sentence[21]
      frog-:Mon May  7 15:00:38 2018 done with sentence[22]
      frog-:Mon May  7 15:00:39 2018 done with sentence[23]
      frog-:Mon May  7 15:00:39 2018 done with sentence[24]
      frog-:Mon May  7 15:00:39 2018 done with sentence[25]
      frog-:Mon May  7 15:00:39 2018 done with sentence[26]
      frog-:Mon May  7 15:00:39 2018 done with sentence[27]
      frog-:Mon May  7 15:00:39 2018 done with sentence[28]
      frog-:Mon May  7 15:00:40 2018 done with sentence[29]
      frog-:tokenisation took:  0 seconds, 89 milliseconds and 363 microseconds
      frog-:CGN tagging took:   2 seconds, 537 milliseconds and 471 microseconds
      frog-:Mblem took:         0 seconds, 16 milliseconds and 267 microseconds
      frog-:Frogging in total took: 2 seconds, 562 milliseconds and 107 microseconds
      frog-:resulting FoLiA doc saved in ./img_de_nederlander_1850_ddd_000013854.ticcl.folia.xml
      frog-:Mon May  7 15:00:40 2018 Frog finished
    
    Work dir:
      /home/piccl/work/7b/3e1554352860cce59cbca95673db69
    
    Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
    
     -- Check '.nextflow.log' file for details
    

    It seems that the Nextflow script cannot find the XML output from frog. This seems to go wrong in lines 72 and 117 of frog.nf (https://github.com/LanguageMachines/PICCL/blob/master/frog.nf#L72), where the output is defined using a wildcard. When I run an earlier version of frog.nf, where the output is defined more explicitly, it runs without errors: https://github.com/LanguageMachines/PICCL/commit/b4e05a044d6ae4037c7e435fe26dbb5f6c700f72#diff-b1623eb35be7cba58a6c27b0a3e54453R57
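
To see what a wildcard such as `*.xml` would pick up, one can inspect the process work dir by hand. The following is a minimal sketch and not part of PICCL or Nextflow: the function name `count_top_level_matches` is hypothetical, and it only mirrors the staging layout of the command above (inputs moved to `input/`, outputs expected in the current directory). It does not reproduce how Nextflow itself resolves output globs.

```sh
#!/bin/sh
# Hypothetical helper (not part of PICCL or Nextflow): count regular files
# directly inside a directory whose names match a glob pattern.
# Files in subdirectories such as input/ are deliberately not counted,
# and symlinks are skipped because of -type f.
count_top_level_matches() {
    dir=$1
    pattern=$2
    find "$dir" -maxdepth 1 -type f -name "$pattern" | wc -l | tr -d ' '
}
```

Called from the work dir reported in the error, for instance as `count_top_level_matches . '*.xml'`, a result of 0 would be consistent with the "Missing output file(s)" message, while a non-zero result would suggest the problem lies elsewhere.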

    bug 
    opened by peterdekker 10
  • Conduct extensive tests of the latest release (v0.7)

    • [ ] Verify once more that all desired functionality that was implemented in @martinreynaert's prototype at https://github.com/martinreynaert/TICCL/blob/master/TICCLops.PICCL.pl (commit state 62a398f) is present. (This is what my entire Nextflow rewrite was based on after all. Some exceptions and all Nederlab work aside, I didn't implement things not in there that were not explicitly requested)

    You might want to repeat the following items for multiple languages, and with/without post-processing options such as tokenisation (ucto) and linguistic enrichment (Frog). You may also want to test both from the interface as well as from the command line:

    • [ ] Verify OCR & TICCL pipeline from PDF containing scanned images
    • [ ] Verify OCR & TICCL pipeline of individual images with PDF reassembly option enabled
    • [ ] Verify OCR pipeline without TICCL
    • [ ] Verify TICCL pipeline on plaintext (without OCR)
    • [ ] Verify TICCL pipeline on PDF that contains text (without OCR)

    In testing all of the above, consider a) whether it runs (i.e. no crashes) and b) whether results are actually sensible/correct.

    For most of these scenarios, an integration test for the backend is already available in test.sh, but those tests are limited to aspect (a). There are no tests that invoke the front-end and wrapper. Please suggest (small) data and improvements for further automated tests where applicable.
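
As one possible small step beyond aspect (a), an automated test could at least check that the expected output files exist and are not empty. The sketch below is only an illustration under assumptions: `outputs_look_sane` is a hypothetical helper, not something defined in test.sh, and a non-empty file says nothing yet about whether its content is correct.

```sh
#!/bin/sh
# Hypothetical helper: succeed only if the directory contains at least one
# regular file matching the pattern and none of the matches is empty.
outputs_look_sane() {
    dir=$1
    pattern=$2
    total=$(find "$dir" -maxdepth 1 -type f -name "$pattern" | wc -l | tr -d ' ')
    # -size 0 selects files that occupy zero blocks, i.e. empty files
    empty=$(find "$dir" -maxdepth 1 -type f -name "$pattern" -size 0 | wc -l | tr -d ' ')
    if [ "$total" -eq 0 ]; then
        echo "no files matching $pattern in $dir" >&2
        return 1
    fi
    if [ "$empty" -gt 0 ]; then
        echo "$empty empty file(s) matching $pattern in $dir" >&2
        return 1
    fi
    return 0
}
```

A test script could call this after a pipeline run, for example `outputs_look_sane outputdir '*.folia.xml' || exit 1`, where `outputdir` stands for whatever output directory the run used.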

    Regarding the webservice-end of things:

    • [ ] Verify whether the provided parameters and profiles (input templates and output templates) cover all you want to offer (taking into account the fact that we currently merely offer a generic CLAM interface rather than a tailored front-end for end-users).
    • [ ] Verify whether the desired intermediate output is published (this goes for any of the scenarios)

    Regarding software dependencies:

    • [ ] Verify whether the software versions offered by the latest Debian/Ubuntu/CentOS suffice and whether PICCL is capable of sufficiently handling the inevitable version discrepancies that occur between distributions. Are some versions must-haves and are some versions show-stoppers?

    Please always report problems or feature requests in separate new issues (not in this thread), but do confirm things that are working properly here (just tick the boxes) so we have verification. Ideally, wherever they can't be automated yet, these tests should be repeated when a major release update occurs.

    test expired 
    opened by proycon 9
  • FoLiA alignments in OCR output

    This may be more of a Ticcltools or foliautils issue, but I'll post it here as it is the outcome of the pipeline. When running a document through OCR, we obtain very verbose untokenised FoLiA output as follows:

    <p xml:id="FH-OllevierGeets-001-000.tif.text.par_1_10">
     <t class="OCR">
       <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_13">DISEASES</t-str>
       <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_14">OF</t-str>
       <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_15">AQUATIC</t-str>
       <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_16">ORGANISMS</t-str>
       <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_17">Dis.</t-str>
       <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_18">aquat.</t-str>
       <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_19">Org.</t-str>
     </t>
     <str annotator="folia-hocr" datetime="2018-11-19T20:47:13" xml:id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_13"><t class="OCR" offset="0">DISEASES</t>
       <alignment xlink:href="FH-OllevierGeets-001-000.tif" xlink:type="simple">
        <aref id="word_1_13" type="str"/>
     </alignment>
    </str>
    

    My question is about the alignments here. They refer to tif images and mention an ID. I realize you want to tie each word to its occurrence in the image. But I don't think the TIF file contains this information (being just a bitmap afaik). Shouldn't this link to the hOCR output instead? (or is ALTO XML still involved here and should it be that?). (@kosloot I'd suggest adding a format attribute on the alignment to make clear to what kind of file (mimetype) it links)

    Moreover, is this intermediate output that the PICCL OCR pipeline should publish as output for the user? Because it currently doesn't. And linking to something you don't output seems fairly useless.

    During our last meeting @kdepuydt lamented that the FoLiA XML output of TICCL was not very human-readable. She has a point, but it is also kind of inevitable if you want to include all this higher-order information. The question is whether everybody wants that. A possible suggestion here could also be to make outputting certain information optional (such as the substrings and alignments). Still, I'd rather include too much information than too little.
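
To get a quick overview of what the alignments in a given FoLiA file actually point to, a crude text match can help. This is a minimal sketch under assumptions: `alignment_targets` is a hypothetical helper, it uses a plain `sed` match rather than a real XML parse, and it assumes at most one `<alignment>` start tag per line, as in the excerpt above.

```sh
#!/bin/sh
# Hypothetical helper: print the distinct xlink:href targets of <alignment>
# elements in a FoLiA file. Plain text matching only, not an XML parser;
# closing tags (</alignment>) are not matched by the pattern.
alignment_targets() {
    sed -n 's/.*<alignment[^>]*xlink:href="\([^"]*\)".*/\1/p' "$1" | sort -u
}
```

For a file containing only the excerpt above, this would print the single target `FH-OllevierGeets-001-000.tif`, which makes it easy to compare against the files the pipeline actually publishes.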

    test expired 
    opened by proycon 7
  • Add text markup information after FoLiA-correct

    FoLiA-correct doesn't add text markup information, but FLAT relies on this to properly display TICCL output (proycon/flat#92). Add foliatextcontent (proycon/foliatools#32) to the end of the PICCL pipeline if no further tokenisation or linguistic enrichments are selected. This tool will automatically add the necessary text markup linking to the strings, with support for the corrections.

    enhancement 
    opened by proycon 2
  • Pipeline is slower for files which are combined (input files with same prefix)

    When running the pipeline for input files with the same prefix, the files are combined into one output file. Now that I am testing on many-core hardware, it becomes apparent that combining files makes the pipeline much slower, especially in the OCR step. This probably happens because parallelization across multiple CPU cores cannot be applied.

    This is not a problem in itself, but I think it is good to notify users that files with the same prefix will be combined and will take longer to process, or to give them a choice to enable/disable combination. As it stands, a small difference in input gives a large difference in processing time.

    EDIT: I saw there is a "Reassamble PDF" option in the web interface, but same-prefix files were also combined when I disabled this option.
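
Before starting a run, users could check which input files would end up grouped together. The sketch below is only an illustration: `prefix_counts` is a hypothetical helper, and it ASSUMES the prefix is the part of the file name before the first hyphen; the rule PICCL actually applies may differ.

```sh
#!/bin/sh
# Hypothetical helper: report how many of the given input files share each
# prefix. ASSUMPTION: the prefix is the base name up to the first hyphen;
# PICCL's actual grouping rule may differ. File names containing spaces
# are not handled.
prefix_counts() {
    for f in "$@"; do
        name=$(basename "$f")
        printf '%s\n' "${name%%-*}"
    done | sort | uniq -c | awk '{ print $2 ": " $1 }'
}
```

For example, `prefix_counts scans/*.tif` prints one line per prefix with the number of files sharing it; any prefix with a count above 1 marks files that, under the stated assumption, would be combined and processed as a single document.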

    investigate 
    opened by peterdekker 0
Releases(v0.9.5)
  • v0.9.5(Dec 11, 2020)

    Added a string linking stage to ticcl. This adds extra markup information (t-str/t-correction) using the foliatextcontent tool, which is in turn needed by FLAT for proper visualisation.

    Source code(tar.gz)
    Source code(zip)
  • v0.9.4(Oct 1, 2020)

  • v0.9.3(Oct 1, 2020)

    Minor update: Added an --outputclass parameter for ticcl.nf to choose the output text class and provide extra flexibility. Set either that or --inputclass.

    Source code(tar.gz)
    Source code(zip)
  • v0.9.2(Oct 1, 2020)

    • added a clearer error message with explanation in case the indexNT file is empty (related to LanguageMachines/lexiconenrichment#1)
    • removed explicit flat url (let LaMachine handle it)
    • minor README update
    Source code(tar.gz)
    Source code(zip)
  • v0.9.1(Aug 19, 2020)

  • v0.9.0(Apr 15, 2020)

    This PICCL release builds upon the long awaited TICCLtools v0.7:

    Ticcl:

    • Fixed chaining
    • Implemented chainclean and made it optional
    • Changed default separator to underscore
    • TICCL-rank invocation changed
    • changed skipcols
    • added --low --high and --ngrams parameter
    • added alphabet file to TICCL-unk

    General:

    • Migrated to nextflow process selectors, solved deprecation warnings (#57)
    • verify output files have non-zero size
    • Added schematic figures to document the architecture of the pipelines

    Webservice:

    • Added inputtemplate for custom lexicon #56
    Source code(tar.gz)
    Source code(zip)
  • v0.8.2(Aug 25, 2019)

  • v0.8.1(Aug 25, 2019)

  • v0.8.0(Jun 14, 2019)

    • Several workflows that used to be part of PICCL have been split off into separate projects. This concerns:
      • The nederlab pipeline for enrichment of historical Dutch, now in https://github.com/proycon/nederlab-pipeline
      • The frog, ucto and folia validation pipelines, now in https://github.com/proycon/aNtiLoPe; PICCL now depends on this new aNtiLoPe project
      • This is an organisational change in favour of more modularity, clarity and better maintainability; it does not affect the functionality or installation of PICCL!
    • Allow unsetting flaturl in external yaml configuration to disable flat viewers (proycon/clam#75)
    • Propagate existing input textclass option to PICCL and assume a default of 'current' (rather than 'OCR') if OCR is skipped (#48) and change TICCL inputclass default to 'current' instead of 'OCR' when dealing with FoLiA input
    • Delete zero byte input files prior to FoLiA-correct (artefact of earlier patchy error ignore strategy) #49
    Source code(tar.gz)
    Source code(zip)
  • v0.7.6(Mar 5, 2019)

    • Another fix for plain text input and no ocr AND no ticcl scenario (addressed in #43)
    • Clean up in the wrapper script (it's becoming too convoluted)
    Source code(tar.gz)
    Source code(zip)
  • v0.7.5.1(Feb 28, 2019)

  • v0.7.5(Feb 27, 2019)

  • v0.7.4(Feb 11, 2019)

  • v0.7.3.1(Jan 16, 2019)

  • v0.7.3(Dec 12, 2018)

    At least for some PDFs, the PDF-to-image converter in PICCL, i.e. pdfimages, created spurious image files. These sometimes resulted in 'pages' of garbage. Also, when we started building this pipeline, pdfimages did not yet convert straight into TIFF format, so we also used 'convert'.

    Both have now been replaced by pdftoppm, which seems to produce exactly as many output TIFF files as regular PDF viewers report pages.

    Source code(tar.gz)
    Source code(zip)
  • v0.7.2(Dec 11, 2018)

    • fixed --help flag for ocr and ticcl (was broken)
    • fix for text input when skipping TICCL
    • Use 300dpi instead of the default 72dpi when converting bitmaps to TIF, should reduce garbage output #45
    • Minor logging improvements: output tesseract version to standard output (#45) + feedback on why frog is enabled
    Source code(tar.gz)
    Source code(zip)
  • v0.7.1(Nov 21, 2018)

  • v0.7.0(Nov 19, 2018)

    • added a document CONTRIBUTE.md with contributor guidelines and technical details
    • added and expanded comments to aid @martinreynaert in understanding the Nextflow pipelines
    • Restructured the webservice profiles (CLAM):
      • Publish relevant output of intermediate stages for the end-user, not just a single final end-result.
      • Less duplication
      • Some small fixes
      • Removed obsolete/implicit tokeniser option for Frog
    • Fixes in the wrapper script
      • Fixes for text input
      • Fix: Output did not show up for download when only OCR is enabled #40
    • Updated the startserver* scripts for the piccl webservice, made them more LaMachine-aware
    • Prevent accidentally feeding Nextflow's trace.txt log as input
    • Report input files to stdout for some pipelines (ticcl,frog, tokenize)
    • Fix in nederlab pipeline, allow untokenised folia input and add --tok option to force tokenisation
    • README fix. #41

    See also https://github.com/proycon/clam/issues/69

    Source code(tar.gz)
    Source code(zip)
  • v0.6.3(Jul 12, 2018)

  • v0.6.2(Jun 6, 2018)

  • v0.6.1(Jun 6, 2018)

    • more verbose output from clam wrapper
    • added debug option
    • force en_US.UTF-8 locale in CLAM wrapper (solves LanguageMachines/ticcltools#18)
    • added a test
    Source code(tar.gz)
    Source code(zip)
  • v0.6(Jun 5, 2018)

    This is an important bugfix release with some new features as well:

    • New features:
      • Added TICCL-chainer
      • Propagate alphabet file to resolver (TICCL-LDcalc)
    • Fixes:
      • Fixed and refactored integration tests, Travis works again #16
      • Some fixes for running ticcl.nf for folia files with different extension #32 (but not extensively tested yet)
      • frog.nf couldn't find frog xml output #29
      • Use new inputclass/outputclass parameters for FoLiA-correct #34
      • ocr.nf could not find FoLiA-hocr output files #30

    This release depends upon the new releases (released today) of ticcltools (v0.6) and foliautils (v0.9.2).

    Source code(tar.gz)
    Source code(zip)
  • v0.5.3(May 23, 2018)

  • v0.5.2(May 19, 2018)

  • v0.5.1(Apr 13, 2018)

  • v0.5(Apr 5, 2018)

    • Use CLAM 2.3 and the new external configuration file capability
    • More detailed output from nextflow in webservice error.log
    • Some improvements in handling quotes/spaces, but not complete yet (#24), usage of spaces in input filenames is still not supported!
    • Fix in tokeniser invocation from webservice
    • Compatibility with LaMachine v2 (this may break LaMachine v1 compatibility), installation instructions updated accordingly
    Source code(tar.gz)
    Source code(zip)
  • v0.4.4(Mar 9, 2018)

    First careful release of the current state of PICCL so we can at least differentiate production from development (releases are mandatory for LaMachine v2 inclusion). Some issues are still awaiting further testing from @martinreynaert, so the stability of the ticcl pipeline is still uncertain.

    The DBNL pipeline for Nederlab is functional and this version corresponds with the delivered corpus enriched documents.

    Source code(tar.gz)
    Source code(zip)
Owner
Language Machines
NLP Research group at Centre for Language Studies, Radboud University Nijmegen
Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)

English | 简体中文 Introduction PaddleOCR aims to create multilingual, awesome, leading, and practical OCR tools that help users train better models and a

null 27.5k Jan 8, 2023
A tool for extracting text from scanned documents (via OCR), with user-defined post-processing.

The project is based on older versions of tesseract and other tools, and is now superseded by another project which allows for more granular control o

Maxim 32 Jul 24, 2022
It is an image OCR tool using the Tesseract-OCR engine with the pytesseract package and has a GUI.

OCR-Tool It is an image OCR tool made in Python using the Tesseract-OCR engine with the pytesseract package and has a GUI. This is my second ever pytho

Khant Htet Aung 4 Jul 11, 2022
Indonesian ID Card OCR using tesseract OCR

KTP OCR Indonesian ID Card OCR using tesseract OCR KTP OCR is python-flask with tesseract web application to convert Indonesian ID Card to text / JSON

Revan Muhammad Dafa 5 Dec 6, 2021
Provides OCR (Optical Character Recognition) services through web applications

OCR4all As suggested by the name one of the main goals of OCR4all is to allow basically any given user to independently perform OCR on a wide variety

null 174 Dec 31, 2022
Steve Tu 71 Dec 30, 2022
Multi-choice answer sheet correction system using computer vision with opencv & python.

Multi choice answer correction 5 answer sheet samples with a specific solution for detecting answers and sheet correction. By running the soluti

Reza Firouzi 7 Mar 7, 2022
A post-processing tool for scanned sheets of paper.

unpaper Originally written by Jens Gulden — see AUTHORS for more information. Licensed under GNU GPL v2 — see COPYING for more information. Overview u

null 27 Dec 7, 2022
scantailor - Scan Tailor is an interactive post-processing tool for scanned pages.

Scan Tailor - scantailor.org This project is no longer maintained, and has not been maintained for a while. About Scan Tailor is an interactive post-p

null 1.5k Dec 28, 2022
Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc.

EasyOCR Ready-to-use OCR with 80+ languages supported including Chinese, Japanese, Korean and Thai. What's new 1 February 2021 - Version 1.2.3 Add set

Jaided AI 16.7k Jan 3, 2023
Turn images of tables into CSV data. Detect tables from images and run OCR on the cells.

Table of Contents Overview Requirements Demo Modules Overview This python package contains modules to help with finding and extracting tabular data fr

Eric Ihli 311 Dec 24, 2022
Code for the paper STN-OCR: A single Neural Network for Text Detection and Text Recognition

STN-OCR: A single Neural Network for Text Detection and Text Recognition This repository contains the code for the paper: STN-OCR: A single Neural Net

Christian Bartz 496 Jan 5, 2023
A pure pytorch implemented ocr project including text detection and recognition

ocr.pytorch A pure pytorch implemented ocr project. Text detection is based CTPN and text recognition is based CRNN. More detection and recognition me

coura 444 Dec 30, 2022
MXNet OCR implementation. Including text recognition and detection.

insightocr Text Recognition Accuracy on Chinese dataset by caffe-ocr Network LSTM 4x1 Pooling Gray Test Acc SimpleNet N Y Y 99.37% SE-ResNet34 N Y Y 9

Deep Insight 99 Nov 1, 2022
ISI's Optical Character Recognition (OCR) software for machine-print and handwriting data

VistaOCR ISI's Optical Character Recognition (OCR) software for machine-print and handwriting data Publications "How to Efficiently Increase Resolutio

ISI Center for Vision, Image, Speech, and Text Analytics 21 Dec 8, 2021
Python-based tools for document analysis and OCR

ocropy OCRopus is a collection of document analysis programs, not a turn-key OCR system. In order to apply it to your documents, you may need to do so

OCRopus 3.2k Dec 31, 2022
CTPN + DenseNet + CTC based end-to-end Chinese OCR implemented using tensorflow and keras

简介 基于Tensorflow和Keras实现端到端的不定长中文字符检测和识别 文本检测:CTPN 文本识别:DenseNet + CTC 环境部署 sh setup.sh 注:CPU环境执行前需注释掉for gpu部分,并解开for cpu部分的注释 Demo 将测试图片放入test_images

Yang Chenguang 2.6k Dec 29, 2022