A Dataset for Direct Quotation Extraction and Attribution in News Articles.

THUNLP-MT

Last update: Sep 23, 2022


Overview

DirectQuote - A Dataset for Direct Quotation Extraction and Attribution in News Articles

DirectQuote is a corpus containing 19,760 paragraphs and 10,353 direct quotations manually annotated from online news media.

A quotation is a general notion that covers different kinds of speech, thought, and writing in text (Semino and Short, 2004). It is a prominent linguistic device for expressing opinions, statements, and assessments attributed to a speaker (Cappelen and Lepore, 2012). Among all kinds of quotations, a direct quotation (O’Keefe et al., 2013) is one whose entire content is enclosed in quotation marks, which means that what the speaker said is transcribed verbatim.

Task Definition

Quotation extraction is defined as extracting reported speech from a third party in the text; it is also known as reported speech extraction. Quotation attribution refers to determining the speaker of a quotation. When annotating speakers, we ensure that every valid speaker can be linked to a person entity in a named entity library. Quotations that follow simple patterns are removed to increase the diversity of the corpus.

Data

Region	Name	Numbers
U.S.	Associated Press	438
U.S.	Cable News Network	627
U.S.	American Broadcasting Company	240
U.S.	New York Times	5,642
U.S.	CBS Broadcasting	4,890
UK	British Broadcasting Corporation	926
UK	Reuters	5,836
UK	The Guardian	4,302
Canada	The Globe and Mail	1,955
Canada	The Star	13,769
New Zealand	NZ Herald	115
Australia	Australian Broadcasting Corporation	312
Australia	Sydney Morning Herald	93

We select multiple representative news sources across the political spectrum, comprising 13 well-known online news media outlets from five major English-speaking countries. The corpus follows a format consistent with CoNLL 2003 and uses IOB1 tagging. Raw text is tokenized with a whitespace tokenizer. Every word is assigned one of the following labels:

LeftSpeaker	Quotation whose speaker appears in the preceding text
RightSpeaker	Quotation whose speaker appears in the following text
Unknown	Quotation with no corresponding speaker
Speaker	Speaker of a quotation
Out	Neither a quotation nor a speaker
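
The label scheme above can be illustrated with a minimal Python sketch that reads CoNLL-style lines, groups IOB1 tags into spans, and pairs each quotation with a speaker. Several details here are assumptions rather than facts about the released files: that the token is the first column and the tag is the last, that tags are spelled `I-Speaker`, `B-LeftSpeaker`, and so on with a bare `O` for "Out", and that a quotation attaches to the nearest speaker span on the indicated side. The function names and the sample paragraph are hypothetical.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Span = Tuple[str, int, int]  # (label, start, end_exclusive)

# Hypothetical sample; the tag spellings are assumed, not taken from the corpus.
SAMPLE = [
    "Alice I-Speaker",
    "said O",
    ": O",
    '" I-LeftSpeaker',
    "We I-LeftSpeaker",
    "agree I-LeftSpeaker",
    ". I-LeftSpeaker",
    '" I-LeftSpeaker',
    "",
]


def read_conll(lines: Iterable[str]) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (tokens, tags) per paragraph; blank lines separate paragraphs."""
    tokens: List[str] = []
    tags: List[str] = []
    for line in lines:
        parts = line.split()
        if not parts:
            if tokens:
                yield tokens, tags
                tokens, tags = [], []
            continue
        tokens.append(parts[0])   # assumed: token is the first column
        tags.append(parts[-1])    # assumed: tag is the last column
    if tokens:
        yield tokens, tags


def iob1_spans(tags: List[str]) -> List[Span]:
    """Group IOB1 tags into spans. In IOB1 a span may start with I-;
    B- is only needed when a span directly follows one of the same type."""
    spans: List[Span] = []
    current: Optional[str] = None
    start = 0
    for i, tag in enumerate(tags):
        if tag == "O":
            prefix, label = "O", None
        else:
            prefix, _, label = tag.partition("-")
        # Close the open span on O, on an explicit B-, or on a label change.
        if current is not None and (prefix != "I" or label != current):
            spans.append((current, start, i))
            current = None
        if label is not None and current is None:
            current, start = label, i
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


def attribute(spans: List[Span]) -> List[Tuple[Span, Optional[Span]]]:
    """Pair each quotation with the nearest speaker span on the side its
    label names (a heuristic assumed for this sketch)."""
    speakers = [s for s in spans if s[0] == "Speaker"]
    pairs: List[Tuple[Span, Optional[Span]]] = []
    for span in spans:
        label, start, end = span
        if label == "LeftSpeaker":
            before = [s for s in speakers if s[2] <= start]
            pairs.append((span, before[-1] if before else None))
        elif label == "RightSpeaker":
            after = [s for s in speakers if s[1] >= end]
            pairs.append((span, after[0] if after else None))
        elif label == "Unknown":
            pairs.append((span, None))
    return pairs


if __name__ == "__main__":
    for tokens, tags in read_conll(SAMPLE):
        for quote, speaker in attribute(iob1_spans(tags)):
            who = " ".join(tokens[speaker[1]:speaker[2]]) if speaker else "?"
            print(who, "->", " ".join(tokens[quote[1]:quote[2]]))
```

Spans are returned as token offsets instead of strings so that the same result can be used for evaluation or mapped back to the whitespace-tokenized text.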

Statistics

	Numbers
News Article	39,153
Paragraph	19,760
Quotation	10,353
Time	2020.09-2021.03

Reference

DirectQuote: A Dataset for Direct Quotation Extraction and Attribution in News Articles, Yuanchi Zhang, Yang Liu

Comments

Suggestion of including source of each paragraph

Hello Sir / Ma'am,

I am highly interested in using this dataset for research purposes. I understand that the data provided has already been preprocessed; may I ask if it would be possible to include the news source (e.g. The New York Times, The Guardian) of each paragraph? This is because I am looking for articles from The New York Times specifically. Thank you so much!

Best regards

opened by leongjwm 0

