https://github.com/thunlp-mt/directquote

A Dataset for Direct Quotation Extraction and Attribution in News Articles.

Host: GitHub
URL: https://github.com/thunlp-mt/directquote
Owner: THUNLP-MT
Created: 2021-09-28T03:37:53.000Z (almost 5 years ago)
Default Branch: main
Last Pushed: 2021-09-28T03:40:27.000Z (almost 5 years ago)
Last Synced: 2025-07-18T06:56:02.155Z (about 1 year ago)
Size: 2.49 MB
Stars: 14
Watchers: 3
Forks: 3
Open Issues: 1
Metadata Files:
- Readme: README.md


# DirectQuote - A Dataset for Direct Quotation Extraction and Attribution in News Articles

DirectQuote is a corpus containing 19,760 paragraphs and 10,353 direct quotations manually annotated from online news media.

A _quotation_ is a general notion that covers different kinds of speech, thought, and writing in text (Semino and Short, 2004). It is a prominent linguistic device for expressing opinions, statements, and assessments attributed to the speaker (Cappelen and Lepore, 2012). Among all kinds of quotations, the entire content of a _direct quotation_ (O’Keefe et al., 2013) is in quotation marks, which means that what the speaker said is transcribed verbatim.

## Task Definition
Quotation extraction is defined as extracting reported speech from a third party in the text, also known as reported speech extraction. Quotation attribution refers to determining the speaker of the quotation. When annotating speakers, we ensure that every valid speaker can be linked to a person entity in a named entity library. Simple patterns are removed to increase the diversity of the corpus.

## Data

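| Region | Name | Numbers |
| ----------- | ----------------------------------- | ------- |
| U.S. | Associated Press | 438 |
| U.S. | Cable News Network | 627 |
| U.S. | American Broadcasting Company | 240 |
| U.S. | New York Times | 5,642 |
| U.S. | CBS Broadcasting | 4,890 |
| UK | British Broadcasting Corporation | 926 |
| UK | Reuters | 5,836 |
| UK | The Guardian | 4,302 |
| Canada | The Globe and Mail | 1,955 |
| Canada | The Star | 13,769 |
| New Zealand | NZ Herald | 115 |
| Australia | Australian Broadcasting Corporation | 312 |
| Australia | Sydney Morning Herald | 93 |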

We select multiple representative news sources across the political spectrum, including 13 well-known online news media from five major English-speaking countries. The corpus adopts a format consistent with CoNLL 2003 and uses the IOB1 tagging scheme. Raw texts are tokenized by a whitespace tokenizer. Every word is classified into one of the following labels:

* `LeftSpeaker` Quotation, the corresponding speaker is in the preceding text
* `RightSpeaker` Quotation, the corresponding speaker is in the following text
* `Unknown` Quotation, no corresponding speaker
* `Speaker` Speaker
* `Out` Neither
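
To make the tagging scheme concrete, here is a minimal sketch of how such a file could be read and turned into labeled spans. It is not part of the released code. It assumes one token per line with the tag in the last whitespace-separated column, a blank line between paragraphs, and tag strings such as `I-LeftSpeaker` or `B-Speaker`; the exact tag spelling in the data files may differ. The function names `read_paragraphs` and `iob1_spans` are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def read_paragraphs(lines: Iterable[str]) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (tokens, tags) per paragraph from CoNLL-style lines.

    Assumption: first column is the token, last column is the tag,
    and a blank line separates paragraphs.
    """
    tokens: List[str] = []
    tags: List[str] = []
    for line in lines:
        parts = line.split()
        if not parts:  # blank line closes the current paragraph
            if tokens:
                yield tokens, tags
                tokens, tags = [], []
            continue
        tokens.append(parts[0])
        tags.append(parts[-1])
    if tokens:
        yield tokens, tags


def iob1_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Return (label, start, end) spans with end exclusive.

    In IOB1 a span normally starts with an I- tag; B- is used only when
    a span directly follows another span of the same label.
    Any tag without a B-/I- prefix (e.g. "O" or "Out") counts as outside.
    """
    spans: List[Tuple[str, int, int]] = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        inside = prefix in ("B", "I") and label != ""
        # Close the open span on an outside tag, a label change, or a B- tag.
        if current is not None and (not inside or label != current or prefix == "B"):
            spans.append((current, start, i))
            start, current = None, None
        if inside and current is None:
            start, current = i, label
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans
```

Under these assumptions, calling `iob1_spans` on the tags of each paragraph returned by `read_paragraphs` gives the quotation and speaker spans as token index ranges, which can then be paired up for attribution.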

## Statistics
| | Numbers |
| ------------ | --------------- |
| News Article | 39,153 |
| Paragraph | 19,760 |
| Quotation | 10,353 |
| Time | 2020.09-2021.03 |

## Reference
_DirectQuote: A Dataset for Direct Quotation Extraction and Attribution in News Articles_, Yuanchi Zhang, Yang Liu
