https://github.com/zhijing-jin/webnlg_reader
This is a preprocessor/data-cleaner for the WebNLG dataset.
- Host: GitHub
- URL: https://github.com/zhijing-jin/webnlg_reader
- Owner: zhijing-jin
- Created: 2019-09-02T13:07:05.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2023-10-17T09:04:45.000Z (over 1 year ago)
- Last Synced: 2023-10-17T13:10:47.593Z (over 1 year ago)
- Language: Python
- Size: 56.6 KB
- Stars: 8
- Watchers: 2
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
This is an easy-to-use Python reader for the [enriched WebNLG](https://github.com/ThiagoCF05/webnlg) data.
### How to run
```bash
python data/webnlg/reader.py [--version x.x]
```
`--version` choices: `1.4` | `1.5` (default)

The resulting file structure looks like this:
```bash
.
├── data
│   └── webnlg
│       ├── reader.py
│       ├── utils.py
│       ├── raw/
│       ├── test.json
│       ├── train.json
│       └── valid.json
└── README.md
```

### Contributions
1. Decomposed the WebNLG dataset from document level to sentence level.
1. Created an easy-to-use Python reader for WebNLG dataset v1.5, runnable as of 2019-SEP-20. (Debugged and adapted from the reader in [chimera](https://github.com/AmitMY/chimera)'s repo.)
1. Manually fixed [spaCy](https://spacy.io/)'s sentence tokenization errors.
1. Deleted parts of sentences for which no corresponding triple exists.
1. Manually deleted irrelevant triples.
1. Manually fixed all wrong templates (e.g. `template.replace('AEGNT-1', 'AGENT-1')`), which makes the data convenient for template-based models.
1. Carefully replaced `-` with `_` in template names, such as `AGENT-1` to `AGENT_1`, which makes tokenization easier.

### Overview of dataset
- Dataset sizes: train 24526, valid 3019, test 6622
- Vocab of entities: 3227
- Vocab of ner: 12 (`['agent_1', 'bridge_1', 'bridge_2', 'bridge_3', 'bridge_4', 'patient_1', 'patient_2', 'patient_3', 'patient_4', 'patient_5', 'patient_6', 'patient_7']`)
- Vocab of relations: 726
- Vocab of txt: 6671
- Vocab of tgt: 1897
- Len(tgt): avg 11.5, max 42

### Todo
- "was selected by NASA" is a relation that spans several words; it should be made into one word in the triple.
- "(workedAt," is a relation that contains punctuation; it should be cleaned.
- There are still several hundred dirty, unaligned (stripleset, template) pairs. Align them by tracking the `self.cnt_dirty_data` variable when running `reader.py`.
- Fix spelling errors such as 'discrimina-tive training'.
- Fix misalignment errors using `grep -nriF '' ''`.
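
The first two Todo items (multi-word relations and relations containing punctuation) could be approached with a small normalization helper. The following is only a minimal sketch: `normalize_relation` is a hypothetical name that does not exist in this repo, and joining words with `_` is an assumption chosen to match the `AGENT_1` convention above.

```python
import string


def normalize_relation(relation: str) -> str:
    """Collapse a raw relation string into one punctuation-free token.

    Hypothetical helper, not part of reader.py.
    """
    # Remove ASCII punctuation, but keep underscores that may already join words.
    punctuation = string.punctuation.replace("_", "")
    cleaned = relation.translate(str.maketrans("", "", punctuation))
    # Join the remaining whitespace-separated words so the relation is one token.
    return "_".join(cleaned.split())


if __name__ == "__main__":
    print(normalize_relation("was selected by NASA"))
    print(normalize_relation("(workedAt,"))
```

Whether relations should keep their original casing, or whether some punctuation is meaningful in certain relation names, would need to be checked against the actual data before applying anything like this.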