A Python client for the Softcite software mention recognizer server

Overview

Softcite software mention recognizer client

Python client for using the Softcite software mention recognition service. It can be applied to

  • individual PDF files

  • recursively to a local directory, processing all the encountered PDF

  • to a collection of documents harvested by biblio-glutton-harvester and article-dataset-builder, with the benefit of re-using the collection manifest for injectng metadata and keeping track of progress. The collection can be stored locally or on a S3 storage.

Requirements

The client has been tested with Python 3.5-3.7.

The client requires a working Softcite software mention recognition service. Service host and port can be changed in the config.json file of the client.

Install

cd software_mention_client/

It is advised to setup first a virtual environment to avoid falling into one of these gloomy python dependency marshlands:

virtualenv --system-site-packages -p python3 env

source env/bin/activate

Install the dependencies, use:

pip3 install -r requirements.txt

Usage and options

usage: software_mention_client.py [-h] [--repo-in REPO_IN] [--file-in FILE_IN]
                                  [--file-out FILE_OUT]
                                  [--data-path DATA_PATH] [--config CONFIG]
                                  [--reprocess] [--reset] [--load]
                                  [--diagnostic] [--scorched-earth]

Softcite software mention recognizer client

optional arguments:
  -h, --help            show this help message and exit
  --repo-in REPO_IN     path to a directory of PDF files to be processed by
                        the Softcite software mention recognizer
  --file-in FILE_IN     a single PDF input file to be processed by the
                        Softcite software mention recognizer
  --file-out FILE_OUT   path to a single output the software mentions in JSON
                        format, extracted from the PDF file-in
  --data-path DATA_PATH
                        path to the resource files created/harvested by
                        biblio-glutton-harvester
  --config CONFIG       path to the config file, default is ./config.json
  --reprocess           reprocessed failed PDF
  --reset               ignore previous processing states and re-init the
                        annotation process from the beginning
  --load                load json files into the MongoDB instance, the --repo-
                        in parameter must indicate the path to the directory
                        of resulting json files to be loaded
  --diagnostic          perform a full count of annotations and diagnostic
                        using MongoDB regarding the harvesting and
                        transformation process
  --scorched-earth      remove a PDF file after its successful processing in
                        order to save storage space, careful with this!

The logs are written by default in a file ./client.log, but the location of the logs can be changed in the configuration file (default ./config.json).

Processing local PDF files

For processing a single file., the resulting json being written as file at the indicated output path:

python3 software_mention_client.py --file-in toto.pdf --file-out toto.json

For processing recursively a directory of PDF files, the results will be:

  • written to a mongodb server and database indicated in the config file

  • and in the directory of PDF files, as json files, together with each processed PDF

python3 software_mention_client.py --repo-in /mnt/data/biblio/pmc_oa_dir/

The default config file is ./config.json, but could also be specified via the parameter --config:

python3 software_mention_client.py --repo-in /mnt/data/biblio/pmc_oa_dir/ --config ./my_config.json

Processing a collection of PDF harvested by biblio-glutton-harvester

biblio-glutton-harvester and article-dataset-builder creates a collection manifest as a LMDB database to keep track of the harvesting of large collection of files. Storage of the resource can be located on a local file system or on a AWS S3 storage. The software-mention client will use the collection manifest to process these harvested documents.

  • locally:

python3 software_mention_client.py --data-path /mnt/data/biblio-glutton-harvester/data/

--data-path indicates the path to the repository of data harvested by biblio-glutton-harvester.

The resulting JSON files will be enriched by the metadata records of the processed PDF and will be stored together with each processed PDF in the data repository.

If the harvested collection is located on a S3 storage, the access information must be indicated in the configuration file of the client config.json. The extracted software mention will be written in a file with extension .software.json, for example:

-rw-rw-r-- 1 lopez lopez 1.1M Aug  8 03:26 0100a44b-6f3f-4cf7-86f9-8ef5e8401567.pdf
-rw-rw-r-- 1 lopez lopez  485 Aug  8 03:41 0100a44b-6f3f-4cf7-86f9-8ef5e8401567.software.json

If a MongoDB server access information is indicated in the configuration file config.json, the extracted information will additionally be written in MongoDB.

License and contact

Distributed under Apache 2.0 license. The dependencies used in the project are either themselves also distributed under Apache 2.0 license or distributed under a compatible license.

Main author and contact: Patrice Lopez ([email protected])

You might also like...
A bot to get Statistics like the Playercount from your Minecraft-Server on your Discord-Server

Hey Thanks for reading me. Warning: My English is not the best I have programmed this bot to show me statistics about the player numbers and ping of m

This Server Cloner can clone the server you want with all the perms of roles in every particular channel.

Server-Cloner-with-perms 🚀 This Server Cloner can clone the server you want with all the perms of roles in every particular channel. Features Clone C

A Discord Server Cloner Which Can Clone Any Discord Server In Just Few Minutes
A Discord Server Cloner Which Can Clone Any Discord Server In Just Few Minutes

A Discord Server Cloner Which Can Clone Any Discord Server In Just Few Minutes.

A simple Spamming software made in python
A simple Spamming software made in python

Spam-qlk Warning!!! 'I' am not responsible for the 'damage or harm' caused by this 'Software'!!! Use at your own risk!!! Input the message. After you

This is my personal version of Pac-Man using python, which is the first assignment of EA Software Engineering Virtual Experience Program from Forage.com
This is my personal version of Pac-Man using python, which is the first assignment of EA Software Engineering Virtual Experience Program from Forage.com

Vac-Man in Python This is my personal version of Vax-man game using python, which is the first task of EA Software Engineering Virtual Experience Prog

A collective list of free APIs for use in software and web development.

Public APIs A collective list of free APIs for use in software and web development. A public API for this project can be found here! For information o

Simple software that can send WhatsApp message to a single or multiple users (including unsaved number**)

wp-automation Info: this is a simple automation software that sends WhatsApp message to single or multiple users. Key feature: -Sends message to multi

This software's intent is to automate all activities related to manage Axie Infinity Scholars. It is specially aimed to mangers with large scholar roasters.

Axie Scholars Utilities This software's intent is to automate all activities related to manage Scholars. It is specially aimed to mangers with large s

Primeira etapa do processo seletivo para a bolsa de migração de conteúdo de Design de Software.

- Este processo já foi concluído. Obrigado pelo seu interesse! Processo Seletivo para a bolsa de migração de conteúdo de Design de Software Primeirame

Comments
  • BSO PR

    BSO PR

    J'ai :

    • ajouté un init.py pour pouvoir installer le package.
    • ajouté la possibilité d'envoyer des pdf gzippés (.pdf.gz) sur software_mentions.
    • changé les paramètres de config (data_path -> lmdb_path) et (software_mention_url -> software_mention_host + software_mention_port)
    opened by antoinebres 3
Owner
null
Dns-Client-Server - Dns Client Server For Python

Dns-client-server DNS Server: supporting all types of queries and replies. Shoul

Nishant Badgujar 1 Feb 15, 2022
(@Tablada32BOT is my bot in twitter) This is a simple bot, its main and only function is to reply to tweets where they mention their bot with their @

Remember If you are going to host your twitter bot on a page where they can read your code, I recommend that you create an .env file and put your twit

null 3 Jun 4, 2021
Discord-Mass-Mention - Yup the title says it all

Protocol - Mass Mention (i havent tested this with any token other than my own t

Mallowies 14 Nov 6, 2022
ALIEN: idA Local varIables rEcogNizer

ALIEN: idA Local varIables rEcogNizer ALIEN is an IDA Pro plugin that allows the user to get more information about ida local variables with the help

null 16 Nov 26, 2022
A python client for the Software-Challenge Germany.

sc-client-python A python client for the Software-Challenge Germany. Creating a new project (Optional) Install virtualenv virtualenv is a tool that cr

rpkak 3 Jan 22, 2022
Raphtory-client - The python client for the Raphtory project

Raphtory Client This is the python client for the Raphtory project Install via p

Raphtory 5 Apr 28, 2022
Drcom-pt-client - Drcom Pt version client with refresh timer

drcom-pt-client Drcom Pt version client with refresh timer Dr.com Pt版本客户端 可用于网页认

null 4 Nov 16, 2022
Python Client Library to interface with the Phoenix Realtime Server

supabase-realtime-client Python Client Library to interface with the Phoenix Realtime Server This is a fork of the supabase community realtime client

Anand 2 May 24, 2022
Python Client for MLflow Tracking Server

Python Client for MLflow Python client for MLflow REST API. Features: Unlike MLflow Tracking client all REST API methods are exposed to user. All clas

MTS 35 Dec 23, 2022
A discord Server Bot made with Python, This bot helps people feel better by inspiring them with motivational quotes or by responding with a great message, also the users of the server can create custom messages by telling the bot with Commands.

A discord Server Bot made with Python, This bot helps people feel better by inspiring them with motivational quotes or by responding with a great message, also the users of the server can create custom messages by telling the bot with Commands.

Aran 1 Oct 13, 2021