Journalism AI – Quotes extraction for modular journalism

Related tags

Text Data & NLP quote-extraction

Overview

Journalism AI – Quotes extraction for modular journalism

This repo contains the code for the Guardian and AFP contribution for the JournalismAI Festival 2021.

Further reading can be found in our blog post.

The aim of the project is to extract quotes from news articles using Named Entity Recognition, add coreferencing information and format the results for an exploratory search tool.

The contribution consists of several self-contained pieces of work, namely:

a regular expression pipeline attempting to extract quotes by matching patterns
a rule set to define different types of quotes and guide the quote annotation
custom annotation recipes for the Prodigy software enabling quick and efficient data annotation
a post-processing pipeline for extracting quotes using a trained Spacy model and adding coreferencing information
example data and data schema for displaying the extracted quote information in a search tool

Repo structure

Each folder in this repo reflects one of the pieces of work mentioned above.

regex_pipeline/ – code to run the regular expression-based quote extraction
annotation_rules/ – document with rules and definitions to guide the quote annotation step
annotation_scripts/ – custom annotation scripts for Prodigy
coreference/ – proof of concept for rules-based coreferencing tool
schema/ – data output schema and example data

Each folder contains a separate README file with instructions to set up and run each piece of work.

An Open-Source Package for Neural Relation Extraction (NRE)

OpenNRE We have a DEMO website (http://opennre.thunlp.ai/). Try it out! OpenNRE is an open-source and extensible toolkit that provides a unified frame

3k Feb 17, 2021

SpikeX - SpaCy Pipes for Knowledge Extraction

SpikeX is a collection of pipes ready to be plugged in a spaCy pipeline. It aims to help in building knowledge extraction tools with almost-zero effort.

384 Dec 12, 2022

pytorch-kaldi is a project for developing state-of-the-art DNN/RNN hybrid speech recognition systems. The DNN part is managed by pytorch, while feature extraction, label computation, and decoding are performed with the kaldi toolkit.

The PyTorch-Kaldi Speech Recognition Toolkit PyTorch-Kaldi is an open-source repository for developing state-of-the-art DNN/HMM speech recognition sys

2.3k Dec 27, 2022

Datasets of Automatic Keyphrase Extraction

This repository contains 20 annotated datasets of Automatic Keyphrase Extraction made available by the research community. Following are the datasets and the original papers that proposed them. If you know more datasets, and want to contribute, please, notify me.

LIAAD - Laboratory of Artificial Intelligence and Decision Support

163 Dec 23, 2022

Comments

Data and models available?

In the spirit of open source, are the annotations you collected with prodigy or your trained spacy model available? Would be amazing to be able to help work on deep learning solutions in this space.

opened by ghomasHudson 0

Journalism AI – Quotes extraction for modular journalism

Related tags

Overview

Journalism AI – Quotes extraction for modular journalism

Repo structure

You might also like...

An Open-Source Package for Neural Relation Extraction (NRE)

SpikeX - SpaCy Pipes for Knowledge Extraction

pytorch-kaldi is a project for developing state-of-the-art DNN/RNN hybrid speech recognition systems. The DNN part is managed by pytorch, while feature extraction, label computation, and decoding are performed with the kaldi toolkit.

Datasets of Automatic Keyphrase Extraction

open-information-extraction-system, build open-knowledge-graph(SPO, subject-predicate-object) by pyltp(version==3.4.0)

Code for "Generating Disentangled Arguments with Prompts: a Simple Event Extraction Framework that Works"

Blue Brain text mining toolbox for semantic search and structured information extraction

Utilizing RBERT model for KLUE Relation Extraction task

Spert NLP Relation Extraction API deployed with torchserve for inference

Comments

Data and models available?

Owner

Journalism AI collab 2021

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)

Modular and extensible speech recognition library leveraging pytorch-lightning and hydra.

code for modular summarization work published in ACL2021 by Krishna et al

code for modular summarization work published in ACL2021 by Krishna et al

xFormers is a modular and field agnostic library to flexibly generate transformer architectures by interoperable and optimized building blocks.

Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.

Python implementation of TextRank for phrase extraction and summarization of text documents

An Open-Source Package for Neural Relation Extraction (NRE)

Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.

Python implementation of TextRank for phrase extraction and summarization of text documents