https://github.com/takelab/ashnee
Automatically Scraped Hard News Event Extraction dataset.
- Host: GitHub
- URL: https://github.com/takelab/ashnee
- Owner: TakeLab
- License: MIT
- Created: 2023-06-14T08:19:47.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2023-06-17T14:17:03.000Z (almost 3 years ago)
- Last Synced: 2025-05-31T16:42:44.323Z (10 months ago)
- Size: 2.74 MB
- Stars: 0
- Watchers: 4
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# ashnee
**WIP. Additional information and code will be added soon.**
**A**utomatically **S**craped **H**ard **N**ews **E**vent **E**xtraction dataset.
## Statistics
The dataset contains $2279$ articles in total, spread across $26$ hard-news
event types and an additional class *Other*. The table below shows the number of
documents for each event type.
| **Event Type** | **#Documents** | **Event Type** | **#Documents** |
| :-------------------------------- | :------------: | :-------------------------- | :------------: |
| Air crash | 55 | Mass Poisoning | 7 |
| Armed Conflict | 76 | Military Exercise | 70 |
| Bank Robbery | 7 | Mine Collapses | 4 |
| Disease Outbreaks | 59 | Mudslides | 21 |
| Droughts | 18 | Other | 1229 |
| Earthquakes | 56 | Protest_Online Condemnation | 68 |
| Environment Pollution | 39 | Regime Change | 2 |
| Famine | 12 | Riot | 16 |
| Financial Crisis | 27 | Road Crash | 86 |
| Fire | 77 | Shipwreck | 37 |
| Floods | 84 | Strike | 65 |
| Gas explosion | 23 | Train collisions | 6 |
| Hurricanes_Tornado_Storm_Blizzard | 98 | Tsunamis | 0 |
| Insect Disaster | 24 | Volcano Eruption | 13 |
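As a quick consistency check, the counts in the table can be summed and compared with the totals stated above. The snippet below is only a sketch: the `COUNTS` dictionary is typed in by hand from the table and is not part of the dataset files. It also shows that the figure of 26 event types matches the number of non-*Other* types that have at least one document (the *Tsunamis* row is empty).

```python
# Document counts copied by hand from the table above (not loaded from the dataset).
COUNTS = {
    "Air crash": 55,
    "Armed Conflict": 76,
    "Bank Robbery": 7,
    "Disease Outbreaks": 59,
    "Droughts": 18,
    "Earthquakes": 56,
    "Environment Pollution": 39,
    "Famine": 12,
    "Financial Crisis": 27,
    "Fire": 77,
    "Floods": 84,
    "Gas explosion": 23,
    "Hurricanes_Tornado_Storm_Blizzard": 98,
    "Insect Disaster": 24,
    "Mass Poisoning": 7,
    "Military Exercise": 70,
    "Mine Collapses": 4,
    "Mudslides": 21,
    "Other": 1229,
    "Protest_Online Condemnation": 68,
    "Regime Change": 2,
    "Riot": 16,
    "Road Crash": 86,
    "Shipwreck": 37,
    "Strike": 65,
    "Train collisions": 6,
    "Tsunamis": 0,
    "Volcano Eruption": 13,
}

# Total number of articles across all rows.
total = sum(COUNTS.values())

# Event types (excluding "Other") that have at least one document.
non_empty_types = [t for t, n in COUNTS.items() if n > 0 and t != "Other"]

# Fraction of the dataset that falls into the "Other" class.
other_share = COUNTS["Other"] / total

print(total)                 # 2279
print(len(non_empty_types))  # 26
print(f"{other_share:.1%}")  # 53.9%
```

Since roughly half of the articles are labelled *Other*, the class distribution is heavily imbalanced, which is worth keeping in mind when splitting or evaluating on the data.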
## Data sources
For the majority of articles, the URL is listed in the `ashnee_url.csv` file.
Articles were mainly scraped from the following portals/domains: *dailymail.co.uk*,
*thewest.com.au*, *bbc.com*, *allafrica.com*, *thetimes.co.uk*, *nzherald.co.nz*,
*indiatimes.com*, *sputniknews.com*, *independent.co.uk*, *9news.com.au*,
*inquirer.net*, *theguardian.com*, *mb.com.ph*, *punchng.com*, *thestar.com.my*,
*sott.net*, and *news.com.au*.
Most articles were published between 2019 and 2022.
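To see how the articles are distributed over source domains, the URLs can be grouped by host name. The following is a minimal sketch using only the Python standard library. The column layout of `ashnee_url.csv` is not documented here, so the header name `url` is an assumption, and the rows in `sample` are made up for illustration.

```python
import csv
import io
from collections import Counter
from urllib.parse import urlparse


def count_domains(csv_file, url_column="url"):
    """Count rows per domain. `url_column` is an assumed header name."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        host = urlparse(row.get(url_column) or "").netloc.lower()
        # Drop a leading "www." so www.bbc.com and bbc.com are counted together.
        if host.startswith("www."):
            host = host[4:]
        if host:
            counts[host] += 1
    return counts


# Made-up rows standing in for the real file, whose layout may differ.
sample = io.StringIO(
    "url\n"
    "https://www.bbc.com/news/example-1\n"
    "https://www.bbc.com/news/example-2\n"
    "https://www.theguardian.com/world/example\n"
)
domain_counts = count_domains(sample)
print(domain_counts.most_common())
```

For the real file, open it with `open("ashnee_url.csv", newline="", encoding="utf-8")` and pass the file object to `count_domains`, adjusting `url_column` to whatever header the file actually uses.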
## Models
We fine-tuned the following models for event detection: [roberta-base](https://huggingface.co/roberta-base), [roberta-large](https://huggingface.co/roberta-large), [deberta-v3-base](https://huggingface.co/microsoft/deberta-v3-base), [deberta-v3-large](https://huggingface.co/microsoft/deberta-v3-large), [distilroberta-base](https://huggingface.co/distilroberta-base), and [albert-base-v2](https://huggingface.co/albert-base-v2).
We fine-tuned the following models for argument extraction (each link points to the corresponding SQuAD 2.0 checkpoint on the Hugging Face Hub): [roberta-base](https://huggingface.co/deepset/roberta-base-squad2), [roberta-large](https://huggingface.co/deepset/roberta-large-squad2), [deberta-v3-base](https://huggingface.co/deepset/deberta-v3-base-squad2), [deberta-v3-large](https://huggingface.co/deepset/deberta-v3-large-squad2), [distilroberta-base](https://huggingface.co/squirro/distilroberta-base-squad_v2), and [albert-base-v2](https://huggingface.co/squirro/albert-base-v2-squad_v2).