DirectQuote - A Dataset for Direct Quotation Extraction and Attribution in News Articles
DirectQuote is a corpus containing 19,760 paragraphs and 10,353 direct quotations manually annotated from online news media.
A quotation is a general notion that covers different kinds of speech, thought, and writing in text (Semino and Short, 2004). It is a prominent linguistic device for expressing opinions, statements, and assessments attributed to a speaker (Cappelen and Lepore, 2012). Among the different kinds of quotation, a direct quotation (O’Keefe et al., 2013) is enclosed entirely in quotation marks, which means that what the speaker said is transcribed verbatim.
Task Definition
Quotation extraction is defined as extracting reported speech from a third party in the text, and is also known as reported speech extraction. Quotation attribution refers to determining the speaker of a quotation. When annotating speakers, we ensure that every valid speaker can be linked to a person entity in a named entity library. Among the annotated instances, simple patterns are removed to increase the diversity of the corpus.
Data
Region | Name | Numbers |
---|---|---|
U.S. | Associated Press | 438 |
U.S. | Cable News Network | 627 |
U.S. | American Broadcasting Company | 240 |
U.S. | New York Times | 5,642 |
U.S. | CBS Broadcasting | 4,890 |
UK | British Broadcasting Corporation | 926 |
UK | Reuters | 5,836 |
UK | The Guardian | 4,302 |
Canada | The Globe and Mail | 1,955 |
Canada | The Star | 13,769 |
New Zealand | NZ Herald | 115 |
Australia | Australian Broadcasting Corporation | 312 |
Australia | Sydney Morning Herald | 93 |
We select multiple representative news sources across the political spectrum: 13 well-known online news media from five major English-speaking countries. The corpus follows the CoNLL 2003 format and uses IOB1 tagging. Raw texts are tokenized by a whitespace tokenizer. Every word is classified into one of the following labels:
- LeftSpeaker: Quotation, the corresponding speaker is in the preceding text
- RightSpeaker: Quotation, the corresponding speaker is in the following text
- Unknown: Quotation, no corresponding speaker
- Speaker: Speaker
- Out: Neither
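The format described above can be read with a few lines of Python. The following is a minimal sketch, not the authors' loader. It assumes one token per line with the tag in the last whitespace-separated column and blank lines between sequences, as in CoNLL 2003. The tag strings `I-Speaker` and `I-LeftSpeaker`, the sample sentence, and the function names `read_conll` and `iob1_spans` are hypothetical and chosen only for illustration; check the released files for the exact tag spelling. Note that in IOB1 a chunk normally starts with `I-`, and `B-` appears only when a chunk directly follows another chunk of the same type.

```python
from typing import Iterable, List, Tuple

Span = Tuple[str, int, int]  # (label, start index, end index exclusive)


def read_conll(lines: Iterable[str]) -> List[List[Tuple[str, str]]]:
    """Group CoNLL-style lines into sequences of (token, tag) pairs.

    Assumption: the token is the first column and the tag is the last.
    """
    sequences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A blank line (or document marker) ends the current sequence.
            if current:
                sequences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sequences.append(current)
    return sequences


def iob1_spans(tags: List[str]) -> List[Span]:
    """Decode IOB1 tags into labeled spans."""
    spans: List[Span] = []
    start, label = 0, None
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        if tag == "O":
            # Outside any chunk: close the open span, if there is one.
            if label is not None:
                spans.append((label, start, i))
            label = None
        elif prefix == "B" or name != label:
            # An explicit B- tag or a change of label starts a new chunk.
            if label is not None:
                spans.append((label, start, i))
            start, label = i, name
        # Otherwise an I- tag with the same label continues the open chunk.
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hypothetical sample; whitespace tokenization keeps punctuation attached.
SAMPLE = """\
Smith I-Speaker
said: O
"We I-LeftSpeaker
agree." I-LeftSpeaker
"""

sequences = read_conll(SAMPLE.splitlines())
tokens = [token for token, _ in sequences[0]]
tags = [tag for _, tag in sequences[0]]
for label, start, end in iob1_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

Decoding into spans rather than working on per-token tags makes it straightforward to pair each quotation span with the nearest Speaker span on the side that its label indicates.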
Statistics
Item | Numbers |
---|---|
News Article | 39,153 |
Paragraph | 19,760 |
Quotation | 10,353 |
Time | 2020.09-2021.03 |
Reference
DirectQuote: A Dataset for Direct Quotation Extraction and Attribution in News Articles, Yuanchi Zhang, Yang Liu