https://github.com/thunlp-mt/directquote
A Dataset for Direct Quotation Extraction and Attribution in News Articles.
- Host: GitHub
- URL: https://github.com/thunlp-mt/directquote
- Owner: THUNLP-MT
- Created: 2021-09-28T03:37:53.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2021-09-28T03:40:27.000Z (over 4 years ago)
- Last Synced: 2025-07-18T06:56:02.155Z (11 months ago)
- Size: 2.49 MB
- Stars: 14
- Watchers: 3
- Forks: 3
- Open Issues: 1
# DirectQuote - A Dataset for Direct Quotation Extraction and Attribution in News Articles
DirectQuote is a corpus containing 19,760 paragraphs and 10,353 direct quotations manually annotated from online news media.
A _quotation_ is a general notion that covers different kinds of speech, thought, and writing in text (Semino and Short, 2004). It is a prominent linguistic device for expressing opinions, statements, and assessments attributed to a speaker (Cappelen and Lepore, 2012). A _direct quotation_ (O’Keefe et al., 2013) is a quotation whose entire content is enclosed in quotation marks, which means that what the speaker said is transcribed verbatim.
## Task Definition
Quotation extraction is defined as extracting reported speech from a third party in the text, and is also known as reported speech extraction. Quotation attribution refers to determining the speaker of the quotation. When annotating speakers, we ensure that every valid speaker can be linked to a person entity in a named entity library. Simple patterns are removed to increase the diversity of the corpus.
## Data
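| Region      | Name                                | Numbers |
| ----------- | ----------------------------------- | ------- |
| U.S.        | Associated Press                    | 438     |
| U.S.        | Cable News Network                  | 627     |
| U.S.        | American Broadcasting Company       | 240     |
| U.S.        | New York Times                      | 5,642   |
| U.S.        | CBS Broadcasting                    | 4,890   |
| UK          | British Broadcasting Corporation    | 926     |
| UK          | Reuters                             | 5,836   |
| UK          | The Guardian                        | 4,302   |
| Canada      | The Globe and Mail                  | 1,955   |
| Canada      | The Star                            | 13,769  |
| New Zealand | NZ Herald                           | 115     |
| Australia   | Australian Broadcasting Corporation | 312     |
| Australia   | Sydney Morning Herald               | 93      |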
We select multiple representative news sources across the political spectrum: 13 well-known online news media outlets from five major English-speaking countries. The corpus follows the CoNLL 2003 format and uses IOB1 tagging. Raw text is tokenized with a whitespace tokenizer. Every word is assigned one of the following labels:
* `LeftSpeaker` Quotation, the corresponding speaker is in the preceding text
* `RightSpeaker` Quotation, the corresponding speaker is in the following text
* `Unknown` Quotation, no corresponding speaker
* `Speaker` Speaker
* `Out` Neither
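
The sketch below illustrates how a CoNLL-style file with IOB1 tags could be read and grouped into quotation and speaker spans. It is not the authors' loader, and several details are assumptions rather than facts taken from this repository: that each line holds a token in the first column and a tag in the last, that paragraphs are separated by blank lines, that tags look like `I-LeftSpeaker` or `B-LeftSpeaker`, and that the outside tag is spelled `O` or `Out`. The function names `read_conll` and `iob1_spans` and the sample sentence are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

# Assumed spellings of the "neither" tag; adjust to match the actual files.
OUTSIDE = {"O", "Out"}


def read_conll(lines: Iterable[str]) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (tokens, tags) per paragraph; blank lines separate paragraphs.

    Assumes the token is the first column and the tag is the last column.
    """
    tokens: List[str] = []
    tags: List[str] = []
    for line in lines:
        parts = line.split()
        if not parts:
            if tokens:
                yield tokens, tags
                tokens, tags = [], []
            continue
        tokens.append(parts[0])
        tags.append(parts[-1])
    if tokens:
        yield tokens, tags


def iob1_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group IOB1 tags into (label, start, end) spans, end exclusive.

    In IOB1 a chunk normally starts with I-; B- is only used when a new
    chunk directly follows another chunk of the same label.
    """
    spans: List[Tuple[str, int, int]] = []
    label, start = None, 0
    for i, tag in enumerate(tags):
        if tag in OUTSIDE:
            prefix, cur = "O", None
        else:
            prefix, sep, cur = tag.partition("-")
            if not sep:  # bare label without an I-/B- prefix
                prefix, cur = "I", tag
        # Close the open span on an outside tag, a label change, or a B- prefix.
        if label is not None and (cur != label or prefix == "B"):
            spans.append((label, start, i))
            label = None
        if cur is not None and label is None:
            label, start = cur, i
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Made-up sample in the assumed layout (not taken from the corpus).
sample = """\
" I-RightSpeaker
We I-RightSpeaker
will I-RightSpeaker
win I-RightSpeaker
" I-RightSpeaker
said O
John I-Speaker
Smith I-Speaker
. O
""".splitlines()

paragraphs = list(read_conll(sample))
tokens, tags = paragraphs[0]
for label, start, end in iob1_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

Here the `RightSpeaker` span marks a quotation whose speaker appears in the following text, so pairing it with the next `Speaker` span in the same paragraph is one plausible way to recover the attribution.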
## Statistics
| | Numbers |
| ------------ | --------------- |
| News Article | 39,153 |
| Paragraph | 19,760 |
| Quotation | 10,353 |
| Time | 2020.09-2021.03 |
## Reference
_DirectQuote: A Dataset for Direct Quotation Extraction and Attribution in News Articles_, Yuanchi Zhang, Yang Liu