Journalism AI – Quotes extraction for modular journalism
This repo contains the code for the Guardian and AFP contribution to the JournalismAI Festival 2021.
Further reading can be found in our blog post.
The aim of the project is to extract quotes from news articles using Named Entity Recognition, add coreferencing information and format the results for an exploratory search tool.
The contribution consists of several self-contained pieces of work, namely:
- a regular expression pipeline attempting to extract quotes by matching patterns
- a rule set to define different types of quotes and guide the quote annotation
- custom annotation recipes for the Prodigy software enabling quick and efficient data annotation
- a post-processing pipeline for extracting quotes using a trained spaCy model and adding coreferencing information
- example data and data schema for displaying the extracted quote information in a search tool
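To give a feel for the pattern-matching approach in the first item, here is a minimal sketch in Python. It is not the code in `regex_pipeline/`: the pattern, the list of speech verbs and the `extract_quotes` helper are all assumptions made for illustration, and they only cover the simple case of a quoted span followed by a verb and a capitalised speaker name.

```python
import re

# Hypothetical pattern (not taken from this repo): a quoted span, then a
# speech verb, then one or more capitalised words treated as the speaker.
QUOTE_PATTERN = re.compile(
    r'["“](?P<quote>[^"”]+)["”],?\s+'
    r'(?P<verb>said|says|added|told)\s+'
    r'(?P<speaker>[A-Z][\w-]*(?:\s+[A-Z][\w-]*)*)'
)


def extract_quotes(text):
    """Return one dict per match with the quote, the verb and the speaker."""
    return [
        {
            # Drop surrounding whitespace and the comma that often ends a quote.
            "quote": m.group("quote").strip().rstrip(","),
            "verb": m.group("verb"),
            "speaker": m.group("speaker"),
        }
        for m in QUOTE_PATTERN.finditer(text)
    ]


sample = (
    '"We are ready," said Jane Doe. The minister left. '
    '“It was late,” added John Smith yesterday.'
)
for item in extract_quotes(sample):
    print(item)
```

A pattern like this misses many real cases, for example a speaker placed before the quote, a quote split across sentences, or a pronoun as the speaker, which is part of the motivation for the annotation rules and the trained model listed above.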
Repo structure
Each folder in this repo reflects one of the pieces of work mentioned above.
- regex_pipeline/ – code to run the regular expression-based quote extraction
- annotation_rules/ – document with rules and definitions to guide the quote annotation step
- annotation_scripts/ – custom annotation scripts for Prodigy
- coreference/ – proof of concept for rules-based coreferencing tool
- schema/ – data output schema and example data
Each folder contains a separate README file with instructions to set up and run each piece of work.