LILLIE: Information Extraction and Database Integration Using Linguistics and Learning-Based Algorithms

Overview

LILLIE: Information Extraction and Database Integration Using Linguistics and Learning-Based Algorithms

Based on the work by Smith et al. (2021)

Querying both structured and unstructured data via a single common query interface such as SQL or natural language has been a long standing research goal. Moreover, as methods for extracting information from unstructured data become ever more powerful, the desire to integrate the output of such extraction processes with "clean", structured data grows. We are convinced that for successful integration into databases, such extracted information in the form of "triples" needs to be both 1) of high quality and 2) have the necessary generality to link up with varying forms of structured data. It is the combination of both these aspects, which heretofore have been usually treated in isolation, where our approach breaks new ground.

The cornerstone of our work is a novel, generic method for extracting open information triples from unstructured text, using a combination of linguistics and learning-based extraction methods, thus uniquely balancing both precision and recall. Our system called LILLIE (LInked Linguistics and Learning-Based Information Extractor) uses dependency tree modification rules to refine triples from a high-recall learning-based engine, and combines them with syntactic triples from a high-precision engine to increase effectiveness. In addition, our system features several augmentations, which modify the generality and the degree of granularity of the output triples. Even though our focus is on addressing both quality and generality simultaneously, our new method substantially outperforms current state-of-the-art systems on the two widely-used CaRB and Re-OIE16 benchmark sets for information extraction.

Installation

Requires Python 3.6.9.

  1. pip install -r requirements.txt
  2. python3 -m spacy download en_core_web_md
  3. Clone ClausIE to ./learning_based/pyclausie (https://github.com/AnthonyMRios/pyclausie)
  4. Install with: cd ./learning_based/pyclausie python3 setup.py install
  5. Clone OpenIE5 to ./learning_based/OpenIE-Standalone (https://github.com/dair-iitd/OpenIE-standalone)
  6. Run OIE5 with: cd ./learning_based/OpenIE-standalone java -Xmx16g -jar openie-assembly-5.0-SNAPSHOT.jar --httpPort 9000
  7. Download Stanford CoreNLP Server 3.9.2 to ./rule_based/parser (https://stanfordnlp.github.io/CoreNLP/history.html)
  8. Run the parser: java -mx6g -cp "./rule_based/parser/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 10000 -timeout 30000
  9. Run the learning-based extractor: python3 ./learning_based/paralleloie.py -i data/pubmedabstracts.json
  10. Run the rule-based extractor-refiner: python3 ./rule_based/extract_refine.py -i extracted_triples_learning.csv
You might also like...
Tools for Optuna, MLflow and the integration of both.
Tools for Optuna, MLflow and the integration of both.

HPOflow - Sphinx DOC Tools for Optuna, MLflow and the integration of both. Detailed documentation with examples can be found here: Sphinx DOC Table of

Uplift modeling and causal inference with machine learning algorithms
Uplift modeling and causal inference with machine learning algorithms

Disclaimer This project is stable and being incubated for long-term support. It may contain new experimental code, for which APIs are subject to chang

Metric learning algorithms in Python

metric-learn: Metric Learning in Python metric-learn contains efficient Python implementations of several popular supervised and weakly-supervised met

Machine Learning Algorithms

Machine-Learning-Algorithms In this project, the dataset was created through a survey opened on Google forms. The purpose of the form is to find the p

Machine learning algorithms implementation

Machine learning algorithms implementation This repository consisits of implementation of various machine learning algorithms. The algorithms implemen

Machine Learning Algorithms ( Desion Tree, XG Boost, Random Forest )
Machine Learning Algorithms ( Desion Tree, XG Boost, Random Forest )

implementation of machine learning Algorithms such as decision tree and random forest and xgboost on darasets then compare results for each and implement ant colony and genetic algorithms on tsp map, play blackjack game and robot in grid world and evaluate reward for it

Can a machine learning project be implemented to estimate the salaries of baseball players whose salary information and career statistics for 1986 are shared?

END TO END MACHINE LEARNING PROJECT ON HITTERS DATASET Can a machine learning project be implemented to estimate the salaries of baseball players whos

 ml4ir: Machine Learning for Information Retrieval
ml4ir: Machine Learning for Information Retrieval

ml4ir: Machine Learning for Information Retrieval | changelog Quickstart → ml4ir Read the Docs | ml4ir pypi | python ReadMe ml4ir is an open source li

Combines MLflow with a database (PostgreSQL) and a reverse proxy (NGINX) into a multi-container Docker application

Combines MLflow with a database (PostgreSQL) and a reverse proxy (NGINX) into a multi-container Docker application (with docker-compose).

Comments
  • Error in Importing AllenNLP's Predictor Class

    Error in Importing AllenNLP's Predictor Class

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    [<ipython-input-8-0c2b06a8c759>](https://localhost:8080/#) in <module>()
    ----> 1 from allennlp.predictors.predictor import Predictor
    
    14 frames
    [/usr/local/lib/python3.7/dist-packages/overrides/signature.py](https://localhost:8080/#) in ensure_return_type_compatibility(super_type_hints, sub_type_hints, method_name)
        286     if not _issubtype(sub_return, super_return) and super_return is not None:
        287         raise TypeError(
    --> 288             f"{method_name}: return type `{sub_return}` is not a `{super_return}`."
        289         )
    
    TypeError: ArrayField.empty_field: return type `None` is not a `<class 'allennlp.data.fields.field.Field'>`.
    

    I used the requirements.txt file given in the repository.

    opened by abheesht17 2
  • java.lang.ClassNotFoundException: edu.stanford.nlp.pipeline.StanfordCoreNLPServer

    java.lang.ClassNotFoundException: edu.stanford.nlp.pipeline.StanfordCoreNLPServer

    Error: Could not find or load main class edu.stanford.nlp.pipeline.StanfordCoreNLPServer Caused by: java.lang.ClassNotFoundException: edu.stanford.nlp.pipeline.StanfordCoreNLPServer . . . when trying to run command: java -mx6g -cp "./rule_based/parser/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 10000 -timeout 30000

    opened by nondefo 9
Owner
null
CD) in machine learning projectsImplementing continuous integration & delivery (CI/CD) in machine learning projects

CML with cloud compute This repository contains a sample project using CML with Terraform (via the cml-runner function) to launch an AWS EC2 instance

Iterative 19 Oct 3, 2022
A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

Light Gradient Boosting Machine LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed a

Microsoft 14.5k Jan 7, 2023
Python-based implementations of algorithms for learning on imbalanced data.

ND DIAL: Imbalanced Algorithms Minimalist Python-based implementations of algorithms for imbalanced learning. Includes deep and representational learn

DIAL | Notre Dame 220 Dec 13, 2022
Implemented four supervised learning Machine Learning algorithms

Implemented four supervised learning Machine Learning algorithms from an algorithmic family called Classification and Regression Trees (CARTs), details see README_Report.

Teng (Elijah)  Xue 0 Jan 31, 2022
A Python-based application demonstrating various search algorithms, namely Depth-First Search (DFS), Breadth-First Search (BFS), and A* Search (Manhattan Distance Heuristic)

A Python-based application demonstrating various search algorithms, namely Depth-First Search (DFS), Breadth-First Search (BFS), and the A* Search (using the Manhattan Distance Heuristic)

null 17 Aug 14, 2022
database for artificial intelligence/machine learning data

AIDB v0.0.1 database for artificial intelligence/machine learning data Overview aidb is a database designed for large dataset for machine learning pro

Aarush Gupta 1 Oct 24, 2021
Automatic extraction of relevant features from time series:

tsfresh This repository contains the TSFRESH python package. The abbreviation stands for "Time Series Feature extraction based on scalable hypothesis

Blue Yonder GmbH 7k Jan 6, 2023
flexible time-series processing & feature extraction

tsflex is a toolkit for flexible time-series processing & feature extraction, making few assumptions about input data. Useful links Documentation Exam

PreDiCT.IDLab 206 Dec 28, 2022
Kaggle Tweet Sentiment Extraction Competition: 1st place solution (Dark of the Moon team)

Kaggle Tweet Sentiment Extraction Competition: 1st place solution (Dark of the Moon team)

Artsem Zhyvalkouski 64 Nov 30, 2022
Breast-Cancer-Classification - Using SKLearn breast cancer dataset which contains 569 examples and 32 features classifying has been made with 6 different algorithms

Breast-Cancer-Classification - Using SKLearn breast cancer dataset which contains 569 examples and 32 features classifying has been made with 6 different algorithms

Mert Sezer Ardal 1 Jan 31, 2022