https://github.com/thunlp-mt/directquote


# DirectQuote - A Dataset for Direct Quotation Extraction and Attribution in News Articles

DirectQuote is a corpus containing 19,760 paragraphs and 10,353 direct quotations manually annotated from online news media.

A _quotation_ is a general notion that covers different kinds of speech, thought, and writing in text (Semino and Short, 2004). It is a prominent linguistic device for expressing opinions, statements, and assessments attributed to a speaker (Cappelen and Lepore, 2012). Among all kinds of quotations, the _direct quotation_ (O’Keefe et al., 2013) is the one whose entire content is enclosed in quotation marks, which means that what the speaker said is transcribed verbatim.

## Task Definition
Quotation extraction is defined as extracting reported speech from a third party in the text, also known as reported speech extraction. Quotation attribution refers to determining the speaker of the quotation. When annotating speakers, we require that a valid speaker can be linked to a person entity in a named entity library. Simple patterns are removed to increase the diversity of the corpus.

## Data


| Region | Name | Numbers |
| ----------- | ----------------------------------- | ------- |
| U.S. | Associated Press | 438 |
| | Cable News Network | 627 |
| | American Broadcasting Company | 240 |
| | New York Times | 5,642 |
| | CBS Broadcasting | 4,890 |
| UK | British Broadcasting Corporation | 926 |
| | Reuters | 5,836 |
| | The Guardian | 4,302 |
| Canada | The Globe and Mail | 1,955 |
| | The Star | 13,769 |
| New Zealand | NZ Herald | 115 |
| Australia | Australian Broadcasting Corporation | 312 |
| | Sydney Morning Herald | 93 |

We select multiple representative news sources across the political spectrum, comprising 13 well-known online news media outlets from five major English-speaking countries. The corpus follows the CoNLL 2003 format and uses IOB1 tagging. Raw text is tokenized with a whitespace tokenizer. Every word is assigned one of the following labels:

* `LeftSpeaker`: a quotation whose speaker appears in the preceding text
* `RightSpeaker`: a quotation whose speaker appears in the following text
* `Unknown`: a quotation with no corresponding speaker
* `Speaker`: a speaker
* `Out`: neither a quotation nor a speaker
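
The format described above can be read with a few lines of Python. The following is a minimal sketch, not the authors' loader. It assumes one token per line with the tag in the last whitespace-separated column, blank lines between paragraphs, and tag strings such as `I-LeftSpeaker`, `B-LeftSpeaker`, and `O` (or `Out`) for tokens outside any span. The function names `read_conll` and `iob1_spans` and the sample text are hypothetical; check the released files for the actual tag spelling.

```python
from typing import List, Tuple


def read_conll(text: str) -> List[Tuple[List[str], List[str]]]:
    """Split CoNLL-style text into (tokens, tags) pairs, one per block.

    Assumption: the token is the first column and the tag is the last
    column; blank lines separate blocks.
    """
    blocks, tokens, tags = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if tokens:
                blocks.append((tokens, tags))
                tokens, tags = [], []
            continue
        cols = line.split()
        tokens.append(cols[0])
        tags.append(cols[-1])
    if tokens:
        blocks.append((tokens, tags))
    return blocks


def iob1_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Decode IOB1 tags into (label, start, end) spans, end exclusive.

    In IOB1 a span normally starts with I-; B- is only used when a span
    directly follows another span of the same label.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        inside = prefix in ("I", "B") and name != ""
        # Close the open span on an outside tag, a B- tag, or a label change.
        if label is not None and (not inside or prefix == "B" or name != label):
            spans.append((label, start, i))
            start, label = None, None
        if inside and label is None:
            start, label = i, name
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hypothetical sample, not taken from the corpus.
SAMPLE = 'Smith I-Speaker\nsaid O\n"We I-LeftSpeaker\nagree." I-LeftSpeaker\n'

for tokens, tags in read_conll(SAMPLE):
    for label, start, end in iob1_spans(tags):
        print(label, " ".join(tokens[start:end]))
```

Here the quotation is tagged `LeftSpeaker` because its speaker, `Smith`, appears before it in the text.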

## Statistics
| | Numbers |
| ------------ | --------------- |
| News Article | 39,153 |
| Paragraph | 19,760 |
| Quotation | 10,353 |
| Time | 2020.09-2021.03 |

## Reference
_DirectQuote: A Dataset for Direct Quotation Extraction and Attribution in News Articles_, Yuanchi Zhang, Yang Liu