This is the Assignment 1 code for the Web Data Processing System.

Last update: Dec 4, 2022

Overview

First Assignment - Entity Linking

Web Data Processing System Assignment 1 - 2021 - Group 26

Zhining Bai
Bowen Lyu
Tianshi Chen
Yiming Xu

Description

This is a Python program that performs entity linking on WARC files. We recognize entities in web pages and link them to a knowledge base (Wikidata). The pipeline is as follows:

Read WARC

Use pyspark to read large-scale WARC files, so the program can process them in parallel.
Extract the text from the HTML pages using BeautifulSoup.
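The two steps above can be sketched in a few lines. This is a minimal single-process illustration, not the project's code: it uses only the Python standard library (html.parser stands in for BeautifulSoup, and there is no Spark), and the names split_warc_records, html_to_text and the header name in the sample are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined with single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def split_warc_records(payload, key_name):
    """Yield (page_id, html) for every record that has a key_name header.

    Assumption: records are separated by the literal marker "WARC/1.0"
    and the HTML payload starts at the first "<html" in the record.
    """
    header = re.compile(r"^" + re.escape(key_name) + r":\s*(\S+)", re.MULTILINE)
    for record in payload.split("WARC/1.0"):
        match = header.search(record)
        start = record.lower().find("<html")
        if match is None or start == -1:
            continue  # no page ID or no HTML body: skip the record
        yield match.group(1), record[start:]


# Tiny in-memory sample; the header name is illustrative only.
sample = (
    "WARC/1.0\nWARC-Type: response\nWARC-TREC-ID: doc-1\n\n"
    "HTTP/1.1 200 OK\n\n"
    "<html><body><h1>Hi</h1><script>var a;</script><p>there</p></body></html>\n"
    "WARC/1.0\nWARC-Type: warcinfo\n\nno html here"
)
for page_id, html in split_warc_records(sample, "WARC-TREC-ID"):
    print(page_id, html_to_text(html))
```

In the actual pipeline the per-record work would be distributed by Spark; the sketch only shows what happens to a single file.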

Named entity recognition

Extract entities using the recognize_entities_bert pretrained model from Spark NLP (sparknlp).

Disambiguation and NIL

To disambiguate, we consider both the popularity of each candidate page and the semantic similarity between the sentence containing the entity and the candidate's description.

Popularity: Rank candidates by combining the Elasticsearch relevance score with the number of properties the candidate has in the knowledge graph.
Sentence similarity: Measure the difference between the sentence and the candidate description using the Levenshtein distance.

NIL: Keep only results with a Levenshtein distance < 40; mentions with no candidate below this threshold are treated as NIL (not linked).
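The disambiguation and NIL steps can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the edit distance is written in pure Python instead of calling python-Levenshtein, the candidate dictionaries and the link_mention helper are hypothetical, and using popularity only as a tie-breaker is one possible way to combine the two signals (the description above does not say how they are weighted).

```python
NIL_THRESHOLD = 40  # distance threshold taken from the description above


def levenshtein(a, b):
    """Edit distance where insertion, deletion and substitution each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def link_mention(sentence, candidates, threshold=NIL_THRESHOLD):
    """Pick the candidate whose description is closest to the sentence.

    candidates: list of dicts with hypothetical keys "id", "description"
    and "popularity". Assumption: lower distance wins, higher popularity
    breaks ties. Returns the chosen id, or None for NIL.
    """
    scored = [
        (levenshtein(sentence, c["description"]), -c["popularity"], c["id"])
        for c in candidates
    ]
    if not scored:
        return None
    distance, _, entity_id = min(scored)
    return entity_id if distance < threshold else None


candidates = [
    {"id": "cand-1", "description": "capital city of France", "popularity": 12.5},
    {"id": "cand-2", "description": "genus of plants", "popularity": 3.0},
]
print(link_mention("Paris is the capital city of France", candidates))
```

A mention whose closest candidate is still 40 or more edits away returns None, which corresponds to the NIL case.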

Prerequisites

The code runs on the DAS cluster at /var/scratch/wdps2106/wdps_2126, where result1 is a conda virtual environment that has already been created. The packages below must be installed to run the assignment.

# To install the packages with pip (pip for Python 3), use the following commands (Python 3.8)
pip install pyspark==3.1.2
pip install spark-nlp==3.3.3
pip install beautifulsoup4
pip install python-Levenshtein
pip install elasticsearch

# To install the packages with conda (recommended), use the following commands
conda create -n <env_name> python=3.8
conda install pyspark
conda install bs4
conda install elasticsearch
pip install python-Levenshtein
pip install sparknlp

Run

To run the program, use the command below. The Keyname parameter is the name of the page ID field in the WARC files, such as WARC_TREC_ID; you must pass it explicitly. Note that the result file will be renamed to result.tsv.

sh run.sh /path/to/warc/file.warc.gz /path/to/result/ Keyname

If you use the DAS cluster, run this command first:

export OPENBLAS_NUM_THREADS=10

To check the score of the result file, use the command below.

python3 score.py /sample/annotation/file/sample.tsv /generated/result/file/result.tsv

Result

We tested our entity linking code on sample.warc.gz. Since sample_annotations.tsv only contains entities from pages with page_id up to 92, our test results only include entity links with page_id <= 92. The F1 score on the sample data is 0.1122.

Metric	Value
Gold	500
Predicted	480
Correct	55
Precision	0.1145
Recall	0.11
F1 Score	0.1122
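The metrics in the table follow from the three counts. The sketch below shows the arithmetic using the standard definitions of precision, recall and F1; the function name precision_recall_f1 is hypothetical and this is not the code of score.py.

```python
def precision_recall_f1(gold, predicted, correct):
    """Standard precision, recall and F1 from raw counts."""
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Counts taken from the table above.
p, r, f1 = precision_recall_f1(gold=500, predicted=480, correct=55)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

With these counts, recall is 55/500 = 0.11 and F1 is 110/980, about 0.1122. Precision is 55/480, about 0.1146 when rounded, so the 0.1145 in the table appears to be truncated rather than rounded.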

Releases(wdps)

wdps(Jun 1, 2022)

This is a release test.
