A machine learning software for extracting information from scholarly documents

Patrice Lopez

Last update: Jan 8, 2023

Related tags

Computer Vision metadata pdf machine-learning deep-learning crf fulltext scientific-articles bibliographical-references hamburger-to-cow

Overview

GROBID

GROBID documentation

Visit the GROBID documentation for more detailed information.

Summary

GROBID (or Grobid, but not GroBid nor GroBiD) means GeneRation Of BIbliographic Data.

GROBID is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured XML/TEI encoded documents, with a particular focus on technical and scientific publications. Development started in 2008 as a hobby. In 2011 the tool was made available as open source. Work on GROBID has been steady as a side project since the beginning and is expected to continue as such.

The following functionalities are available:

Header extraction and parsing from article in PDF format. The extraction here covers the usual bibliographical information (e.g. title, abstract, authors, affiliations, keywords, etc.).
References extraction and parsing from articles in PDF format, around .87 f-score against an independent PubMed Central set of 1943 PDF containing 90,125 references. All the usual publication metadata are covered (including DOI, PMID, etc.).
Citation contexts recognition and resolution to the full bibliographical references of the article. The accuracy of citation contexts resolution is above .76 f-score (which corresponds to both the correct identification of the citation callout and its correct association with a full bibliographical reference).
Parsing of references in isolation (around .90 f-score at instance-level, .95 f-score at field level).
Parsing of names (e.g. person title, forenames, middlename, etc.), in particular author names in header, and author names in references (two distinct models).
Parsing of affiliation and address blocks.
Parsing of dates, ISO normalized day, month, year.
Full text extraction and structuring from PDF articles, including a model for the overall document segmentation and models for the structuring of the text body (paragraph, section titles, reference callout, figure, table, etc.).
Consolidation/resolution of the extracted bibliographical references using the biblio-glutton service or the CrossRef REST API. In both cases, DOI resolution performance is higher than 0.95 f-score from PDF extraction.
Extraction and parsing of patent and non-patent references in patent publications.
PDF coordinates for extracted information, making it possible to create "augmented" interactive PDF.

In a complete PDF processing, GROBID manages 55 final labels used to build relatively fine-grained structures, from traditional publication metadata (title, author first/last/middlenames, affiliation types, detailed address, journal, volume, issue, pages, doi, pmid, etc.) to full text structures (section title, paragraph, reference markers, head/foot notes, figure headers, etc.).

GROBID includes a comprehensive web service API, batch processing, a JAVA API, a Docker image, a generic evaluation framework (precision, recall, etc., n-fold cross-evaluation) and the semi-automatic generation of training data.
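The f-scores quoted above combine precision and recall. As a reminder of how the three numbers relate, here is a minimal Python sketch; the function name and the counts in the usage note are illustrative assumptions, not GROBID's evaluation code or results.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, f1) computed from raw counts.

    Hypothetical helper for illustration only; it is not part of GROBID.
    """
    predicted = true_positives + false_positives   # everything the system output
    expected = true_positives + false_negatives    # everything in the gold data
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / expected if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, with made-up counts of 80 correct fields, 10 spurious fields and 20 missed fields, `precision_recall_f1(80, 10, 20)` gives a precision of 80/90, a recall of 80/100 and an F1 of 160/190.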

GROBID can be considered production ready. Production deployments include ResearchGate, HAL Research Archive, INIST-CNRS, CERN (Invenio), scite.ai, and many more. The tool is designed for high scalability in order to address the full scientific literature corpus.

GROBID should run properly "out of the box" on Linux (64 bits) and macOS. We currently cannot ensure support for Windows as we did before (help welcome!).

GROBID optionally uses Deep Learning models relying on the DeLFT library, a task-agnostic Deep Learning framework for sequence labelling and text classification. The tool can run with feature-engineered CRF (default), Deep Learning architectures (with or without layout feature channels) or any mixture of CRF and DL to balance scalability and accuracy.

For more information on how the tool works, on its key features and benchmarking, visit the GROBID documentation.

Demo

For testing purposes, a public GROBID demo server is available at the following address: https://grobid.science-miner.com

The Web services are documented here.

Warning: Some quotas and query limitations apply to the demo server! Please be courteous and do not overload the demo server.
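As a rough illustration of what a call to the web services looks like, the Python sketch below builds (but does not send) a multipart POST request using only the standard library. The `/api/` prefix and the `input` form field are taken from the service paths and the curl command quoted elsewhere on this page; the function name, the default file name and the manual multipart encoding are assumptions made for the example, so check the web service documentation before relying on them.

```python
import urllib.request
import uuid


def build_grobid_request(base_url, service, pdf_bytes, filename="article.pdf"):
    """Build a multipart/form-data POST request for a GROBID service.

    Hypothetical helper: `service` is a name such as "processHeaderDocument"
    or "processFulltextDocument"; the PDF is sent in the "input" form field.
    """
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="input"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    url = f"{base_url.rstrip('/')}/api/{service}"
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

The returned object can be passed to `urllib.request.urlopen(request)` to obtain the response. For anything beyond a quick test, the dedicated clients described in the next section are the better choice, and the demo server quotas still apply.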

Clients

To help exploit the GROBID service at scale, we provide clients written in Python, Java and node.js that use the web services for parallel batch processing.

All these clients take advantage of multi-threading to scale the processing of large sets of PDF. As a consequence, they are much more efficient than the batch command lines (which use only one thread) and should be preferred.

We were recently able to run the complete full-text processing at around 10.6 PDF per second (around 915,000 PDF per day, around 20M pages per day) with the node.js client for one week on one 16-CPU machine (16 threads, 32GB RAM, no SSD, articles from mainstream publishers); 11.3M PDF were processed in 6 days by 2 servers without interruption.
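The throughput figures above can be cross-checked with a few lines of arithmetic. The Python sketch below only recomputes the numbers quoted in the paragraph; the helper names are made up for the example, and the pages-per-PDF value is merely what the quoted totals imply, not a measured figure.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def pdfs_per_day(pdfs_per_second):
    """Convert a sustained per-second rate into a per-day volume."""
    return pdfs_per_second * SECONDS_PER_DAY


def pdfs_per_second_per_server(total_pdfs, days, servers):
    """Average per-server rate for a run spread over several servers."""
    return total_pdfs / (days * servers * SECONDS_PER_DAY)


daily_volume = pdfs_per_day(10.6)                            # about 915,840 PDF per day
run_rate = pdfs_per_second_per_server(11_300_000, 6, 2)      # about 10.9 PDF per second
implied_pages_per_pdf = 20_000_000 / daily_volume            # about 22 pages per PDF
```

Both ways of looking at the run land close to the same rate of roughly 10 to 11 PDF per second per server.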

In addition, a Java example project is available to illustrate how to use GROBID as a Java library: https://github.com/kermitt2/grobid-example. The example project uses the GROBID Java API to extract header metadata and citations from a PDF and output the results in BibTeX format.

Finally, the following Python utility can be used to create structured full-text corpora of scientific articles simply by indicating a list of strong identifiers like DOI or PMID, performing the identification of online Open Access PDF, the harvesting, the metadata aggregation and the GROBID processing in one step at scale: article-dataset-builder

GROBID Modules

A series of additional modules have been developed for performing structure-aware text mining directly on scholarly PDF, reusing GROBID's PDF processing and sequence labelling weaponry:

grobid-ner: named entity recognition
grobid-quantities: recognition and normalization of physical quantities/measurements
software-mention: recognition of software mentions and attributes in scientific literature
grobid-astro: recognition of astronomical entities in scientific papers
grobid-bio: a bio-entity tagger using BioNLP/NLPBA 2004 dataset
grobid-dictionaries: structuring dictionaries in raw PDF format
grobid-superconductors: recognition of superconductor material and properties in scientific literature
entity-fishing, a tool for extracting Wikidata entities from text and documents, can also use GROBID to pre-process scientific articles in PDF, leading to more precise and relevant entity extraction and the capacity to annotate the PDF with an interactive layout.
dataseer-ml: identification of sections and sentences introducing a dataset in a scientific article, and classification of the type of this dataset.

Release and changes

See the Changelog.

License

GROBID is distributed under Apache 2.0 license.

The documentation is distributed under CC-0 license and the annotated data under CC-BY license.

If you contribute to GROBID, you agree to share your contribution following these licenses.

Main author and contact: Patrice Lopez ([email protected])

How to cite

If you want to cite this work, please refer to the present GitHub project, together with the Software Heritage project-level permanent identifier. For example, with BibTeX:

@misc{GROBID,
    title = {GROBID},
    howpublished = {\url{https://github.com/kermitt2/grobid}},
    publisher = {GitHub},
    year = {2008--2021},
    archivePrefix = {swh},
    eprint = {1:dir:dab86b296e3c3216e2241968f0d63b68e8209d3c}
}

See the GROBID documentation for more related resources.

Comments

Could the java documentation and the process of embedding grobid into Java project be updated?

Following the instructions on the GROBID site, I cannot embed GROBID into my Java project because the instructions for Gradle and Maven are poor. I also do not know how to use the API because the methods take different parameters from those in the Java documentation. Specifically, fullTextToTei takes different arguments than what is shown in the Java docs.
need help Windows-specific

opened by lucaspada894 40
Dropwizard service
Reworked how Grobid Home is found now

Search in classpath

Search via property (able to provide http/https path)

Locally around the working directory -Dropwizard integration

Adapted JS files

Dependency injection

Exception mapper so that it's not necessary to construct a response object for failed requests
opened by detonator413 32
Update to gradle 6.5.1 to support JDK 13 and 14
[x] Update to gradle 6.5

[x] remove duplicated build code from build.gradle

[x] fix problems due to jacoco and coveralls

[x] fix fatJar build

[x] gradle install on local maven repository

[x] fix incompatibilities between shadowJar and Gradle 6.5

~~[ ] bintray publication~~ can't really test it

enhancement
opened by lfoppiano 24
JEP support on macOS

I have changed the configuration to use delft models but I'm encountering an issue :

Mar 19, 2019 10:45:56 AM org.grobid.core.main.GrobidHomeFinder getGrobidHomePathOrLoadFromClasspath WARNING: No Grobid property was provided. Attempting to find Grobid home in the current directory...
Mar 19, 2019 10:45:56 AM org.grobid.core.main.GrobidHomeFinder findGrobidHomeOrFail WARNING: ***************************************************************
Mar 19, 2019 10:45:56 AM org.grobid.core.main.GrobidHomeFinder findGrobidHomeOrFail WARNING: *** USING GROBID HOME: /Users/azhar/work/grobid/grobid-home
Mar 19, 2019 10:45:56 AM org.grobid.core.main.GrobidHomeFinder findGrobidHomeOrFail WARNING: ***************************************************************
Mar 19, 2019 10:45:56 AM org.grobid.core.main.GrobidHomeFinder getGrobidHomePathOrLoadFromClasspath WARNING: No Grobid property was provided. Attempting to find Grobid home in the current directory...
Mar 19, 2019 10:45:56 AM org.grobid.core.main.GrobidHomeFinder findGrobidHomeOrFail WARNING: ***************************************************************
Mar 19, 2019 10:45:56 AM org.grobid.core.main.GrobidHomeFinder findGrobidHomeOrFail WARNING: *** USING GROBID HOME: /Users/azhar/work/grobid/grobid-home
Mar 19, 2019 10:45:56 AM org.grobid.core.main.GrobidHomeFinder findGrobidHomeOrFail WARNING: ***************************************************************
Mar 19, 2019 10:45:56 AM org.grobid.core.main.GrobidHomeFinder findGrobidPropertiesOrFail WARNING: Grobid property file location was not explicitly set via 'org.grobid.property' system variable, defaulting to: /Users/azhar/work/grobid/grobid-home/config/grobid.properties
Processing: /Users/azhar/work/grobid/testfiles/kano2012.pdf
Mar 19, 2019 10:45:57 AM org.grobid.core.engines.ProcessEngine inferOutputPath INFO: No path set for the output directory. Using: /Users/azhar/work/grobid/.
Mar 19, 2019 10:45:58 AM org.grobid.core.main.LibraryLoader load INFO: Loading external native sequence labelling library
Mar 19, 2019 10:45:58 AM org.grobid.core.main.LibraryLoader load INFO: Loading Wapiti native library...
Mar 19, 2019 10:45:58 AM org.grobid.core.main.LibraryLoader load INFO: Loading JEP native library for DeLFT... /Users/azhar/work/grobid/grobid-home/lib/mac-64
Mar 19, 2019 10:45:58 AM org.grobid.core.main.LibraryLoader load INFO: Native library for sequence labelling loaded
Mar 19, 2019 10:45:58 AM org.grobid.core.lexicon.Lexicon initDictionary INFO: Initiating dictionary
Mar 19, 2019 10:45:58 AM org.grobid.core.lexicon.Lexicon initDictionary INFO: End of Initialization of dictionary
Mar 19, 2019 10:45:58 AM org.grobid.core.lexicon.Lexicon initNames INFO: Initiating names
Mar 19, 2019 10:45:58 AM org.grobid.core.lexicon.Lexicon initNames INFO: End of initialization of names
Mar 19, 2019 10:45:58 AM org.grobid.core.lexicon.Lexicon initCountryCodes INFO: Initiating country codes
Mar 19, 2019 10:45:58 AM org.grobid.core.lexicon.Lexicon initCountryCodes INFO: End of initialization of country codes
Mar 19, 2019 10:45:59 AM org.grobid.core.jni.WapitiModel init INFO: Loading model: /Users/azhar/work/grobid/grobid-home/models/fulltext/model.wapiti (size: 21734019)
[Wapiti] Loading model: "/Users/azhar/work/grobid/grobid-home/models/fulltext/model.wapiti"
Model path: /Users/azhar/work/grobid/grobid-home/models/fulltext/model.wapiti
Mar 19, 2019 10:46:03 AM org.grobid.core.jni.WapitiModel init INFO: Loading model: /Users/azhar/work/grobid/grobid-home/models/segmentation/model.wapiti (size: 16788068)
[Wapiti] Loading model: "/Users/azhar/work/grobid/grobid-home/models/segmentation/model.wapiti"
Model path: /Users/azhar/work/grobid/grobid-home/models/segmentation/model.wapiti
Mar 19, 2019 10:46:07 AM org.grobid.core.jni.DeLFTModel header INFO: Loading DeLFT model for header...
running thread: 1
Mar 19, 2019 10:48:06 AM org.grobid.core.jni.DeLFTModel label SEVERE: DeLFT model header labelling failed
java.util.concurrent.ExecutionException: java.lang.UnsatisfiedLinkError: jep.Jep.init(Ljava/lang/ClassLoader;ZZ)J
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at org.grobid.core.jni.JEPThreadPool.call(JEPThreadPool.java:114)
    at org.grobid.core.jni.DeLFTModel.label(DeLFTModel.java:134)
    at org.grobid.core.engines.tagging.DeLFTTagger.label(DeLFTTagger.java:29)
    at org.grobid.core.engines.AbstractParser.label(AbstractParser.java:42)
    at org.grobid.core.engines.HeaderParser.processingHeaderBlock(HeaderParser.java:129)
    at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:136)
    at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:109)
    at org.grobid.core.engines.Engine.fullTextToTEIDoc(Engine.java:474)
    at org.grobid.core.engines.Engine.fullTextToTEI(Engine.java:465)
    at org.grobid.core.engines.ProcessEngine.processFullTextDirectory(ProcessEngine.java:183)
    at org.grobid.core.engines.ProcessEngine.processFullText(ProcessEngine.java:148)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.grobid.core.utilities.Utilities.launchMethod(Utilities.java:409)
    at org.grobid.core.main.batch.GrobidMain.main(GrobidMain.java:184)
Caused by: java.lang.UnsatisfiedLinkError: jep.Jep.init(Ljava/lang/ClassLoader;ZZ)J
    at jep.Jep.init(Native Method)
    at jep.Jep.<init>(Jep.java:252)
    at jep.Jep.<init>(Jep.java:228)
    at org.grobid.core.jni.JEPThreadPool.getJEPInstance(JEPThreadPool.java:80)
    at org.grobid.core.jni.DeLFTModel$LabelTask.call(DeLFTModel.java:76)
    at org.grobid.core.jni.DeLFTModel$LabelTask.call(DeLFTModel.java:64)
    at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
    at java.util.concurrent.FutureTask.run(FutureTask.java)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Mar 19, 2019 10:48:06 AM org.grobid.core.engines.ProcessEngine processFullTextDirectory SEVERE: An error occured while processing the file /Users/azhar/work/grobid/testfiles/kano2012.pdf. Continuing the process for the other files
org.grobid.core.exceptions.GrobidException: [GENERAL] An exception occurred while running Grobid.
    at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:296)
    at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:109)
    at org.grobid.core.engines.Engine.fullTextToTEIDoc(Engine.java:474)
    at org.grobid.core.engines.Engine.fullTextToTEI(Engine.java:465)
    at org.grobid.core.engines.ProcessEngine.processFullTextDirectory(ProcessEngine.java:183)
    at org.grobid.core.engines.ProcessEngine.processFullText(ProcessEngine.java:148)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.grobid.core.utilities.Utilities.launchMethod(Utilities.java:409)
    at org.grobid.core.main.batch.GrobidMain.main(GrobidMain.java:184)
Caused by: java.lang.NullPointerException
    at org.grobid.core.engines.tagging.GenericTaggerUtils.processLabeledResult(GenericTaggerUtils.java:64)
    at org.grobid.core.engines.tagging.GenericTaggerUtils.getTokensWithLabelsAndFeatures(GenericTaggerUtils.java:60)
    at org.grobid.core.tokenization.TaggingTokenSynchronizer.<init>(TaggingTokenSynchronizer.java:35)
    at org.grobid.core.tokenization.TaggingTokenSynchronizer.<init>(TaggingTokenSynchronizer.java:30)
    at org.grobid.core.tokenization.TaggingTokenClusteror.<init>(TaggingTokenClusteror.java:53)
    at org.grobid.core.data.BiblioItem.generalResultMapping(BiblioItem.java:4125)
    at org.grobid.core.engines.HeaderParser.resultExtraction(HeaderParser.java:1066)
    at org.grobid.core.engines.HeaderParser.processingHeaderBlock(HeaderParser.java:130)
    at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:136)
    ... 11 more

====================================================================================

Mar 19, 2019 10:48:06 AM org.grobid.core.jni.DeLFTModel close INFO: Close DeLFT model header... running thread: 1
enhancement

opened by Aazhar 24
pdfalto process error

Hello, when I process this PDF (https://arxiv.org/pdf/2204.12536.pdf), GROBID throws the following error:

org.grobid.core.process.ProcessPdfToXml: pdfalto process finished with error code: 143. [/opt/grobid/grobid-home/pdfalto/lin-64/pdfalto_server, -fullFontName, -noLineNumbers, -noImage, -annotation, -filesLimit, 2000, /opt/grobid/grobid-home/tmp/origin6001765469370931985.pdf, /opt/grobid/grobid-home/tmp/GwEkZWZaPN.lxml]
org.grobid.core.process.ProcessPdfToXml: pdfalto return message:

I was running version 0.7.1.
wontfix Windows-specific

opened by Yuxiang1995 22
DeLFT: packages are not available from conda channels

I have a question about creating a conda environment for DeLFT. The documentation provides the following command:

conda create --name grobidDelft --file requirements.conda.delft.cpu.txt

Running this command fails with an error.

It looks like those packages are not really available through any conda channels. It looks like such packages need to be installed using pip while having env activated. Am I right?

@lfoppiano I would appreciate your thought regarding the problem if possible. Thank you in advance for the info!

opened by andrei-volkau 22
I got 404 Not Found when I test the RESTFul API with curl
What should I do to fix this problem?

curl -v --form input=@./TestPDF.pdf attlaspj.ddns.net:8080/processHeaderDocument

Hostname was NOT found in DNS cache
Trying 171.97.255.227...
Connected to attlaspj.ddns.net (171.97.255.227) port 8080 (#0)
POST /processHeaderDocument HTTP/1.1
User-Agent: curl/7.38.0
Host: attlaspj.ddns.net:8080
Accept: */*
Content-Length: 155372
Expect: 100-continue
Content-Type: multipart/form-data; boundary=------------------------e8cd352b69b9f1cc

< HTTP/1.1 100 Continue
< HTTP/1.1 404 Not Found
< Date: Thu, 19 Jan 2017 19:54:19 GMT
Server Apache/2.4.10 (Raspbian) is not blacklisted
< Server: Apache/2.4.10 (Raspbian)
< Content-Length: 306
< Content-Type: text/html; charset=iso-8859-1
HTTP error before end of send, stop sending
<

404 Not Found
Not Found

The requested URL /processHeaderDocument was not found on this server.

Apache/2.4.10 (Raspbian) Server at attlaspj.ddns.net Port 8080
* Closing connection 0
opened by AhaAhaz 22
Docker with GPU

This PR replaces the cuda-10.1 image with the tensorflow/tensorflow:1.15.5-gpu image. It should solve the problem of the GPU not really being used when running with Docker.

This Python script can be used to test whether the GPU is really used and to troubleshoot when it is not (sometimes the GPU is not found, without any apparent reason):
enhancement

opened by lfoppiano 21
[wip] better integration with Delft via JEP
support for anaconda and virtualenv

support for mac and Linux (for Windows we need help)

automatic selection of python version and jep library (system/env installed or from grobid-home)

overall simplification of the installation process

documentation

enhancement
opened by lfoppiano 19
Error: [PDF2XML_CONVERSION_FAILURE] PDF to XML conversion failed on pdf

>>>>>>>> GROBID_HOME=C:\grobid-master\grobid-home [main] INFO org.grobid.core.main.LibraryLoader - Loading external native CRF library [main] INFO org.grobid.core.main.LibraryLoader - Loading Wapiti native library... [main] INFO org.grobid.core.main.LibraryLoader - Library crfpp loaded [main] INFO org.grobid.core.jni.WapitiModel - Loading model: C:\grobid-master\grobid-home\models\header\model.wapiti (size: 36094028) org.grobid.core.exceptions.GrobidException: [PDF2XML_CONVERSION_FAILURE] PDF to XML conversion failed on pdf file 1.pdf at org.grobid.core.document.DocumentSource.processPdf2XmlThreadMode(DocumentSource.java:184) at org.grobid.core.document.DocumentSource.pdf2xml(DocumentSource.java:133) at org.grobid.core.document.DocumentSource.fromPdf(DocumentSource.java:62) at org.grobid.core.document.DocumentSource.fromPdf(DocumentSource.java:49) at org.grobid.core.engines.HeaderParser.processing2(HeaderParser.java:84) at org.grobid.core.engines.Engine.processHeader(Engine.java:434) at org.grobid.core.engines.Engine.processHeader(Engine.java:410) at WeRe.Grobid.performFun(Grobid.java:25) at WeRe.MainClass.main(MainClass.java:12) [Wapiti] Loading model: "C:\grobid-master\grobid-home\models\header\model.wapiti" Model path: C:\grobid-master\grobid-home\models\header\model.wapiti

The above error shows up when I select a particular pdf. The same pdf gets processed for the header document over the web application. Can you inform as to what the error could be?
need help

opened by rathancage 19
Publishing grobid-core to Maven central

I am trying to integrate grobid into Apache Tika for metadata extraction. It would be nice to have grobid-core published to Maven Central to make adding the dependency in pom.xml easier.

opened by sujen1412 19
Grobid, get the page number for the references

Hi, I am trying to get the page number for references in sections and in citations as well. I turned on the TEI coordinates in process_fulltext_document, but I am not sure how to get the coordinates using Beautiful Soup.

    parsed_article = BeautifulSoup(r.content, 'lxml')
    if parsed_article.find('text') is not None:
        references = parsed_article.find('text').find('div', attrs={'type': 'references'})
        references = references.find_all('biblstruct') if references is not None else []
        reference_list = []
        for reference in references:
            print(reference['coords'])

When I try this, I get an error saying that the attribute is not there. Do you know how I can fix it?
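One defensive way to read the page numbers is sketched below with the standard library XML parser rather than Beautiful Soup. It rests on assumptions that should be checked against the GROBID documentation: that a `coords` attribute only appears on `biblStruct` elements when coordinates were requested for that element type, that the attribute holds one or more comma-separated boxes joined by `;`, and that the first value of each box is the page number. The function name is made up for the example.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by TEI documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def reference_pages(tei_xml):
    """Return one page number (or None) per bibliographical reference.

    Hypothetical helper: uses .get() so a missing "coords" attribute
    yields None instead of raising an error.
    """
    root = ET.fromstring(tei_xml)
    pages = []
    for bibl in root.iterfind(".//tei:listBibl/tei:biblStruct", TEI_NS):
        coords = bibl.get("coords")
        if coords is None:
            pages.append(None)          # coordinates were not produced for this element
            continue
        first_box = coords.split(";")[0]
        pages.append(int(first_box.split(",")[0]))  # assumed: first value is the page
    return pages
```

With the code from the question, `reference_pages(r.content)` would return a list such as one entry per reference, with `None` wherever no coordinates are present, which makes it easy to see whether the service returned coordinates at all.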
question

opened by rabia0001 1
Add additional pattern to match the mathematical dashes (hyphens)
I've been facing a small issue with the character \u2212, which is a MINUS SIGN and belongs to the mathematical symbols (https://stackoverflow.com/questions/57358321/why-unicode-character-minus-sign-u2212-is-not-in-regex-unicode-group-ppd#57363745); it is therefore not replaced by the Dash regex in UnicodeUtil.normalizeText().

The issue with this character is that the FormatNumber class fails to parse a negative number written with this symbol.

e.g.:

java.text.ParseException: Unparseable number: "−0.5341"

I was wondering if we could add an additional regex to replace this symbol with the classic - (HYPHEN-MINUS).
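The idea can be illustrated in a few lines of Python. This is only a sketch of the proposed normalization, not the Java code of UnicodeUtil.normalizeText(); the pattern and the function name are assumptions chosen for the example. It also shows why a dash-punctuation class misses the character: U+2212 has the Unicode category Sm (math symbol), whereas dashes such as U+2013 have the category Pd.

```python
import re
import unicodedata

# Dash punctuation U+2010..U+2015 (category Pd) plus MINUS SIGN U+2212 (category Sm).
DASH_LIKE = re.compile("[\u2010-\u2015\u2212]")


def normalize_dashes(text):
    """Replace dash-like characters with the ASCII hyphen-minus."""
    return DASH_LIKE.sub("-", text)


minus_category = unicodedata.category("\u2212")    # math symbol, not dash punctuation
en_dash_category = unicodedata.category("\u2013")  # dash punctuation
value = float(normalize_dashes("\u22120.5341"))    # parses once the sign is normalized
```

Adding the minus sign explicitly to the character class keeps the existing dash handling unchanged while covering the mathematical case.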
opened by lfoppiano 1
Grobid "processFulltextDocument" skipping some references

When processing the references of some reports (using processFulltextDocument), I noticed that Grobid seems to skip some pages.

For example, when the following file is processed, the extracted references seem to start after the third page (it jumps to references starting with the letter B):

FAO_report_biodiversityforFoodAgriculture.pdf

see results xref_raw_test.txt

it does also not help if one adds few pages in the front :

FAO_report_biodiversityforFoodAgriculture_1.pdf

see results xref_raw_test_1.txt

any idea on how to deal with this ?

opened by almugabo 0
Support for Apple ARM M1
This PR adds the following:

add the rebuilt binaries of pdfalto, wapiti and jep, for Mac ARM architecture

fix path construction to select the ARM architecture properly
opened by lfoppiano 4
about how to get the training and test datasets

Hi! I know I can get test datasets for end-to-end evaluation, but I am still confused about how to get training and test datasets for segmentation training, header training, citation training and so on. For now I take PDFs and create training data with GROBID. Then I fix the wrongly labeled files in tei.xml format and put them in the training datasets.
It takes a lot of time to correct the wrongly labeled files. I wonder whether this is the right way to obtain datasets. Is there a better way?

opened by majiajun0 1

Releases(0.7.2)

0.7.2(Nov 21, 2022)
Added

Explicit identification of data/code availability statements (#951) and funding statements (#959), including when they are located in the header

Link footnote and their "callout" marker in full text (#944)

Option to consolidate header only with DOI if a DOI is extracted (#742)

"Window" application of RNN model for reference-segmenter to cover long bibliographical sections

Add dynamic timeout on pdfalto_server (#926)

A modest Python script to help to find "interesting" error cases in a repo of JATS/PDF pairs, grobid-home/scripts/select_error_cases.py

Changed

Update to DeLFT version 0.3.2

Some more training data (authors in reference, segmentation, citation, reference-segmenter) (including #961, #864)

Update of some models, RNN with feature channels and CRF (segmentation, header, reference-segmenter, citation)

Review guidelines for segmentation model

Better URL matching, in particular by taking PDF URL annotations into account

Fixed

Fix unexpected figure and table labeling in short texts

When matching an ORCID to an author, prioritize Crossref info over extracted ORCID from the PDF (#838)

Annotation errors for acknowledgement and other minor stuff

Fix for Python library loading on Mac

Update docker file to support new CUDA key

Do not dehyphenize text in superscript or subscript

Allow absolute temporary paths

Fix redirected stderr from pdfalto not "gobbled" by the java ProcessBuilder call (#923)

Other minor fixes

0.7.1(Apr 16, 2022)
Added

Web services for training models (#778)

Some additional training data for bibliographical references from arXiv

Add a web service to process a list of reference strings, see https://grobid.readthedocs.io/en/processcitationlist/Grobid-service/#apiprocesscitationlist

Extended processHeaderDocument to get result in bibTeX

Changed

Update DeLFT to version 0.3.1 and TensorFlow 2.7, with many improvements, see https://github.com/kermitt2/delft/releases/tag/v0.3.0

Update of Deep Learning models

Update of JEP and add install script

Update to new biblio-glutton version 0.2, for improved and faster bibliographical reference matching

circleci to replace Travis

Update of processFulltextAssetDocument service to use the same parameters as processFulltextDocument

Pre-compile regex if not already done

Review features for header model

Fixed

Improved date normalization (#760)

Fix possible issue with coordinates related to reference markers (#908) and sentence (#811)

Fix path to bitmap/vector graphics (#836)

Fix possible catastrophic regex backtracking (#867)

Other minor fixes

0.7.0(Jul 17, 2021)
Added

New YAML configuration: all the settings are in one single yaml file, each model can be fully configured independently

Improvement of the segmentation and header models (for header, +1 F1-score for PMC evaluation, +4 F1-score for bioRxiv), improvements for body and citations

Add figure and table pop-up visualization on PDF in the console demo

Add PDF MD5 digest in the TEI results (service only)

Language support packages and xpdfrc file for pdfalto (support of CJK and exotic fonts)

Prometheus metrics

BidLSTM-CRF-FEATURES implementation available for more models

Addition of a "How GROBID works" page in the documentation

Changed

JitPack release (RIP jcenter)

Improved DOI cleaning

Speed improvement (around +10%), by factorizing some layout token manipulation

Update CrossRef requests implementation to align to the current usage of CrossRef's X-Rate-Limit-Limit response parameter

Fixed

Fix base url in demo console

Add missing pdfalto Graphics information when -noImage is used, fix graphics data path in TEI

Fix the tendency to merge tables when they are in close proximity

0.6.2(Mar 20, 2021)
Added

Docker image covering both Deep Learning and CRF models, with GPU detection and preloading of embeddings

For Deep Learning models, labeling is now done by batch: application of the citation DL model is 4 times faster for BidLSTM-CRF (with or without features) and 6 times faster for SciBERT

More tests for sentence segmentation

Add orcid of persons when available from the PDF or via consolidation (i.e. if in CrossRef metadata)

Add BidLSTM-CRF-FEATURES header model (with feature channel)

Add bioRxiv end-to-end evaluation

Bounding boxes for optional section titles coordinates

Changed

Reduce the size of docker images

Improve end-to-end evaluation: multithreaded processing of PDF, progress bar, output the evaluation report in markdown format

Update of several models covering CRF, BidLSTM-CRF and BidLSTM-CRF-FEATURES, mainly improving citation and author recognitions

OpenNLP is the default optional sentence segmenter (results similar to Pragmatic Segmenter for scholarly documents after benchmarking, but 30 times faster)

Refine sentence segmentation to exploit layout information and predicted reference callouts

Update jep version to 3.9.1

Fixed

Ignore invalid utf-8 sequences

Update CrossRef multithreaded calls to avoid using the unreliable time interval returned by the CrossRef REST API service, update usage of Crossref-Plus-API-Token and update the deprecated crossref field query.title

Missing last table or figure when generating training data for the fulltext model

Fix an error related to the feature value for the reference callout for the fulltext model

Review/correct DeLFT configuration documentation, with a step-by-step configuration documentation

Other minor fixes

0.6.1(Aug 12, 2020)
Added

Support of line number (typically in preprints)

End-to-end evaluation and benchmark for preprints using the bioRxiv 10k dataset

Check whether PDF annotation is orcid and add orcid to author in the TEI result

Configuration for making sequence labeling engine (CRF Wapiti or Deep Learning) specific to models

Add a developers guide and a FAQ section in the documentation

Visualization of formulas on PDF layout in the demo console

Feature for subscript/superscript style in fulltext model

Changed

New significantly improved header model: with new features, new training data (600 new annotated examples, old training data is entirely removed), new labels and updated data structures in line with the other models

Update of the segmentation models with more training data

Removal of heuristics related to the header

Update to gradle 6.5.1 to support JDK 13 and 14

TEI schemas

Windows is not supported in this release

Fixed

Preserve affiliations after consolidation of the authors

Environment variable config override for all properties

Infrequent duplication of the abstract in the TEI result

Incorrect merging of affiliations

Noisy parentheses in the bibliographical reference markers

In the console demo, fix the output filename wrongly taken from the input form when the text form is used

Synchronisation of the language detection singleton initialisation in multithreaded environments

Other minor fixes

0.6.0(Apr 24, 2020)
Added

Table content structuring (thanks to @Vitaliy-1), see PR #546

Support for application/x-bibtex at /api/processReferences and /api/processCitation (thanks to @koppor)

Optionally include raw affiliation string in the TEI result

Add a dummy model to facilitate testing in Grobid modules

Allow environment variables for config properties values to ease Docker config

ChangeLog
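
The BibTeX support listed above is selected through the HTTP Accept header. The following Python sketch shows how such a request to /api/processCitation could be built. It assumes a local service at http://localhost:8070 and a form field named "citations", as described in the GROBID REST API documentation; the helper name and the sample citation string are invented for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: a GROBID service running locally on its usual port.
GROBID_URL = "http://localhost:8070"


def build_bibtex_citation_request(citation: str, base_url: str = GROBID_URL) -> Request:
    """Build (but do not send) a POST request asking for a BibTeX result."""
    # The raw reference string is sent as a URL-encoded form field.
    body = urlencode({"citations": citation}).encode("utf-8")
    return Request(
        url=f"{base_url}/api/processCitation",
        data=body,
        headers={
            "Accept": "application/x-bibtex",  # ask for BibTeX instead of TEI XML
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


# Made-up reference string, used only as sample input.
req = build_bibtex_citation_request("Doe, J. (2020). An example title. Journal of Examples, 1(2), 3-4.")
print(req.get_method(), req.full_url)
```

With a running service, the request could be sent with urllib.request.urlopen(req) and the response body read as text.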

Changed

Improve CORS configuration #527 (thank you @lfoppiano)

Documentation improvements

Update of segmentation and fulltext model and training data

Better handling of affiliation block fragments

Improved DOI string recognition

More robust n-fold cross validation (case of shared grobid-home)
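
For readers unfamiliar with the n-fold cross validation mentioned above, here is a minimal Python sketch of the splitting step. It is a generic illustration with a hypothetical function name, not the GROBID trainer's own code: the examples are dealt into n folds, and each fold serves once as the evaluation set while the others form the training set.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def n_fold_splits(examples: Sequence[T], n: int) -> List[Tuple[List[T], List[T]]]:
    """Return n (train, test) pairs; every example is in exactly one test set."""
    if n < 2 or n > len(examples):
        raise ValueError("n must be between 2 and the number of examples")
    # Deal examples round-robin into n folds of near-equal size.
    folds = [list(examples[i::n]) for i in range(n)]
    splits = []
    for i in range(n):
        test = folds[i]
        # Training data is everything outside the held-out fold.
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        splits.append((train, test))
    return splits


splits = n_fold_splits(list(range(10)), n=5)
for train, test in splits:
    print(len(train), len(test))
```

Scores computed on the n held-out folds are then typically averaged to give a single evaluation figure.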

0.5.6(Oct 16, 2019)
Better abstract structuring (with citation contexts)

n-fold cross evaluation and better evaluation report (thanks to @lfoppiano)

Improved PMC ID and PMID recognition

Improved subscript/superscript and font style recognition (via pdfalto)

Improved JEP integration (support of python virtual environment for using DeLFT Deep Learning library, thanks @de-code and @lfoppiano)

Several bug fixes (thanks @de-code, @bnewbold, @Vitaliy-1 and @lfoppiano)

Improved dehyphenization (thanks to @lfoppiano)
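
Dehyphenization, mentioned above, means rejoining words that were split by a hyphen at a line break in the PDF. The Python sketch below shows a deliberately naive version of the idea; the regular expression and function name are assumptions made for illustration, and GROBID's own implementation is more elaborate.

```python
import re

# A letter, a hyphen at the end of a line, then a lowercase letter
# at the start of the next line (optionally after indentation).
_HYPHEN_BREAK = re.compile(r"([^\W\d_])-\n\s*([a-z])")


def dehyphenize(text: str) -> str:
    """Join words split across line breaks by a trailing hyphen."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)


joined = dehyphenize("a machine learn-\ning library for biblio-\ngraphic data")
print(joined)
```

This naive rule would also remove the hyphen of a genuine compound that happens to break across lines, which is one reason a robust implementation needs more context than a single pattern.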

0.5.5(May 28, 2019)
Using pdfalto instead of pdf2xml for the first PDF parsing stage, with many improvements in robustness, ICU support, unknown glyph/font normalization

Improvement and full review of the integration of consolidation services, supporting biblio-glutton (additional identifiers and Open Access links) and Crossref REST API (add specific user agent, email and token for Crossref Metadata Plus)

Fix bounding box issues for some PDFs (#330)

Updated lexicon #396

0.5.4(Feb 12, 2019)
Changes:

transparent usage of DeLFT deep learning models (usual BidLSTM-CRF) instead of Wapiti CRF models, native integration via JEP

support of biblio-glutton as DOI/metadata matching service, alternative to crossref REST API

improvement of citation context identification and matching (+9% recall with similar precision, for PMC sample 1943 articles, from 43.35 correct citation contexts per article to 49.98 correct citation contexts per article)

citation callout now in abstract, figure and table captions

structured abstract (including update of TEI schema)

bug fixes and some more parameters: by default using all available threads when training and possibility to load models at the start of the service

0.5.3(Dec 10, 2018)
Changes:

Improvement of consolidation options and processing (better handling of CrossRef API, but the best is coming soon ;)

Better recall for figure and table identification (thanks to @detonator413)

Support of proxy for calling crossref with Apache HttpClient

Minor bugfixing

grobid-core-0.5.3.jar(19.95 MB)
grobid-core-0.5.3-onejar.jar(46.90 MB)
grobid-trainer-0.5.3.jar(202.65 KB)
grobid-service-0.5.3.zip(78.53 MB)
0.5.2(Oct 17, 2018)
Changes:

Restored the correct status code from the REST API when no engine is available (503 is returned again to tell the client to wait; it had been removed by mistake in versions 0.5.0 and 0.5.1, for the PDF processing services only; see the REST API documentation)

Added metrics in the REST entrypoint (accessible via http://localhost:8071)

Added Grobid clients for Java, Python and NodeJS

Added counters for consolidation tasks and consolidation results

Add case sensitivity option in lexicon/FastMatcher

Updated documentation

Bug fixes: #339, #322, #300, and others
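
Since a 503 status tells the client to wait when no engine is available, a client can simply pause and retry. The Python sketch below illustrates this under stated assumptions: the function name, the retry count and the waiting time are all hypothetical, and the HTTP call is passed in as a plain callable so that the logic can be shown without a running service.

```python
import time
from typing import Callable, Tuple


def call_with_retry(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    wait_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns something other than 503, or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status, body = send()
    for _ in range(max_attempts - 1):
        if status != 503:
            break
        # 503: the service is busy, so wait before trying again.
        sleep(wait_seconds)
        status, body = send()
    return status, body


# Simulated service: busy twice, then a successful answer.
responses = iter([(503, ""), (503, ""), (200, "<TEI/>")])
waits = []
status, body = call_with_retry(lambda: next(responses), sleep=waits.append)
print(status, body, waits)
```

In a real client, send would wrap the actual HTTP request and return its status code and response body.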

grobid-core-0.5.2-onejar.jar(47.30 MB)
grobid-core-0.5.2.jar(19.94 MB)
grobid-service-0.5.2.zip(78.94 MB)
grobid-trainer-0.5.2.jar(188.33 KB)
0.5.1(Jan 29, 2018)

Bug fixes
0.5.0(Nov 9, 2017)
The latest stable release of GROBID is version 0.5.0. As compared to previous version 0.4.3, this version brings:

Migrate from maven to gradle for faster, more flexible and more stable build, release, etc.

Usage of Dropwizard for web services

Move the Grobid service manual to readthedocs

(thanks to @detonator413 and @lfoppiano for this release! future work in versions 0.5.* will focus again on improving PDF parsing and structuring accuracy)

grobid-parent-0.4.4(Oct 13, 2017)

Fixed an issue that broke the release build
grobid-parent-0.4.3(Oct 7, 2017)
The latest stable release of GROBID is version 0.4.3. As compared to previous version 0.4.2, this version brings:

New models: f-score improvement on the PubMed Central sample, bibliographical references +2.5%, header +7%

New training data and features for bibliographical references, in particular for covering HEP domain (INSPIRE), arXiv identifier, DOI and url (thanks @iorala and @michamos !)

Support for CrossRef REST API (instead of the slow OpenURL-style API which requires a CrossRef account), in particular for multithreading usage (thanks @Vi-dot)

Improve training data generation and documentation (thanks @jfix)

Unicode normalisation and more robust body extraction (thanks @aoboturov)

fixes, tests, documentation and update of the pdf2xml fork for Windows (thanks @lfoppiano)
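
The model improvements above are reported as f-score gains. As a reminder of how that figure is usually computed, here is a short Python sketch of the standard F1 formula (the harmonic mean of precision and recall). The function name and the counts are invented for illustration and are not GROBID evaluation results.

```python
def f_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    predicted = true_positives + false_positives
    expected = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / expected if expected else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented counts: 80 correct fields, 20 spurious, 20 missed.
score = f_score(80, 20, 20)
print(round(score, 3))
```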

grobid-parent-0.4.2(Aug 5, 2017)

Version 0.4.2 of GROBID
grobid-parent-0.4.1(Oct 2, 2016)

grobid-grobid-parent-0.4.1.zip(167.58 MB)
grobid-parent-0.3.9(Jan 11, 2016)

Latest stable version for versions 0.3.* of GROBID
grobid-grobid-parent-0.3.9.zip(414.57 MB)