LILLIE: Information Extraction and Database Integration Using Linguistics and Learning-Based Algorithms

Last update: Aug 6, 2022

Overview

Based on the work by Smith et al. (2021)

Querying both structured and unstructured data through a single common interface, such as SQL or natural language, has been a long-standing research goal. Moreover, as methods for extracting information from unstructured data become more powerful, the desire to integrate the output of such extraction processes with "clean", structured data grows. We are convinced that, for successful integration into databases, extracted information in the form of "triples" needs to be 1) of high quality and 2) general enough to link up with varying forms of structured data. These two aspects have usually been treated in isolation; it is in combining them that our approach breaks new ground.

The cornerstone of our work is a novel, generic method for extracting open information triples from unstructured text. It combines linguistics-based and learning-based extraction methods to balance precision and recall. Our system, called LILLIE (LInked Linguistics and Learning-Based Information Extractor), uses dependency tree modification rules to refine triples from a high-recall learning-based engine, and combines them with syntactic triples from a high-precision engine to increase effectiveness. In addition, the system features several augmentations that modify the generality and granularity of the output triples. Even though our focus is on addressing quality and generality simultaneously, our method substantially outperforms current state-of-the-art systems on the two widely used CaRB and Re-OIE16 benchmark sets for information extraction.
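The idea of combining the output of a high-precision engine with that of a high-recall engine can be illustrated with a minimal sketch. This is not LILLIE's actual implementation: the function names `normalize` and `combine_triples`, the normalization rule, and the example triples are all hypothetical and only show one simple way to merge two triple sets while preferring the high-precision version of a duplicate.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def normalize(triple: Triple) -> Triple:
    """Lower-case and collapse whitespace so near-identical triples compare equal."""
    return tuple(" ".join(part.lower().split()) for part in triple)


def combine_triples(high_precision: Iterable[Triple],
                    high_recall: Iterable[Triple]) -> List[Triple]:
    """Merge both triple sets, keeping the high-precision variant on duplicates.

    Hypothetical helper: high-precision triples are visited first, so when the
    same (normalized) triple appears in both inputs, that version is kept.
    """
    seen = set()
    combined = []
    for triple in list(high_precision) + list(high_recall):
        key = normalize(triple)
        if key not in seen:
            seen.add(key)
            combined.append(triple)
    return combined


# Made-up example triples, for illustration only.
rule_based = [("Aspirin", "inhibits", "COX-1")]
learning_based = [("aspirin", "inhibits", "cox-1"),
                  ("Aspirin", "reduces", "fever")]

print(combine_triples(rule_based, learning_based))
# [('Aspirin', 'inhibits', 'COX-1'), ('Aspirin', 'reduces', 'fever')]
```

A real system would need a much richer notion of triple equivalence than case and whitespace folding; the sketch only shows the ordering that favors the high-precision source.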

Installation

Requires Python 3.6.9.

pip install -r requirements.txt
python3 -m spacy download en_core_web_md
Clone ClausIE to ./learning_based/pyclausie (https://github.com/AnthonyMRios/pyclausie)
Install with: cd ./learning_based/pyclausie && python3 setup.py install
Clone OpenIE5 to ./learning_based/OpenIE-Standalone (https://github.com/dair-iitd/OpenIE-standalone)
Run OIE5 with: cd ./learning_based/OpenIE-standalone && java -Xmx16g -jar openie-assembly-5.0-SNAPSHOT.jar --httpPort 9000
Download Stanford CoreNLP Server 3.9.2 to ./rule_based/parser (https://stanfordnlp.github.io/CoreNLP/history.html)
Run the parser: java -mx6g -cp "./rule_based/parser/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 10000 -timeout 30000
Run the learning-based extractor: python3 ./learning_based/paralleloie.py -i data/pubmedabstracts.json
Run the rule-based extractor-refiner: python3 ./rule_based/extract_refine.py -i extracted_triples_learning.csv
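Both extractors presumably depend on the two servers started above, so it can help to confirm that they are listening before launching the Python scripts. The following is a minimal sketch using only the standard library. The ports 9000 and 10000 come from the commands above; the host `localhost`, the helper name `port_is_open`, and the `SERVICES` mapping are assumptions made for illustration.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports taken from the installation commands; "localhost" is an assumption.
SERVICES = {"OpenIE5": 9000, "Stanford CoreNLP": 10000}

if __name__ == "__main__":
    for name, port in SERVICES.items():
        status = "reachable" if port_is_open("localhost", port) else "NOT reachable"
        print(f"{name} on port {port}: {status}")
```

An open port only shows that something is listening there. It does not verify that the server has finished loading its models.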


Comments

Error in Importing AllenNLP's Predictor Class

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-0c2b06a8c759> in <module>()
----> 1 from allennlp.predictors.predictor import Predictor

14 frames
/usr/local/lib/python3.7/dist-packages/overrides/signature.py in ensure_return_type_compatibility(super_type_hints, sub_type_hints, method_name)
    286     if not _issubtype(sub_return, super_return) and super_return is not None:
    287         raise TypeError(
--> 288             f"{method_name}: return type `{sub_return}` is not a `{super_return}`."
    289         )

TypeError: ArrayField.empty_field: return type `None` is not a `<class 'allennlp.data.fields.field.Field'>`.

I used the requirements.txt file given in the repository.

opened by abheesht17

java.lang.ClassNotFoundException: edu.stanford.nlp.pipeline.StanfordCoreNLPServer

Error: Could not find or load main class edu.stanford.nlp.pipeline.StanfordCoreNLPServer
Caused by: java.lang.ClassNotFoundException: edu.stanford.nlp.pipeline.StanfordCoreNLPServer

This occurs when trying to run the command:
java -mx6g -cp "./rule_based/parser/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 10000 -timeout 30000

opened by nondefo
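One possible cause, not confirmed in this thread, is that no CoreNLP JAR sits directly inside ./rule_based/parser, for example because the archive was unpacked into a subdirectory. A Java classpath entry ending in `/*` only picks up `.jar` and `.JAR` files located directly in that directory and does not search subdirectories. The sketch below lists what such a wildcard would match; the helper name `jars_on_wildcard_classpath` is hypothetical.

```python
from pathlib import Path
from typing import List


def jars_on_wildcard_classpath(directory: str) -> List[str]:
    """List the JAR files a Java classpath entry "directory/*" would include.

    Only files ending in .jar or .JAR directly inside the directory count;
    subdirectories are not searched.
    """
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir()
                  if p.is_file() and p.suffix in (".jar", ".JAR"))


if __name__ == "__main__":
    # Directory taken from the command in the report above.
    jars = jars_on_wildcard_classpath("./rule_based/parser")
    if jars:
        print("\n".join(jars))
    else:
        print("No JAR files directly inside ./rule_based/parser")
```

If the list is empty or contains no CoreNLP JAR, moving the JAR files up into ./rule_based/parser, or pointing `-cp` at the directory that actually holds them, would be the first thing to try.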
