https://github.com/rikeda71/kwdlc2nerdataset
making Japanese NER Dataset
- Host: GitHub
- URL: https://github.com/rikeda71/kwdlc2nerdataset
- Owner: rikeda71
- Created: 2019-04-29T07:33:11.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2019-07-09T06:51:24.000Z (over 5 years ago)
- Last Synced: 2024-12-02T19:00:06.226Z (about 2 months ago)
- Topics: dataset-generation, japanese, named-entity-recognition
- Language: Python
- Homepage:
- Size: 5.86 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.en.md
README
# KWDLC2NERDataset
Converts the [KWDLC](http://nlp.ist.i.kyoto-u.ac.jp/index.php?KWDLC) corpus into a Japanese NER dataset.
## Requirements
- Python 3
## Usage
1. Download KWDLC from [link](http://nlp.ist.i.kyoto-u.ac.jp/nl-resource/KWDLC/download\_kwdlc.cgi)
2. Do the following:
```
git clone https://github.com/s14t284/KWDLC2NERDataset.git
cd KWDLC2NERDataset/
python3 run.py -d /path/to/KWDLC-1.0.tar.bz2
```
## About Dataset
### NE types
- ORG: ORGANIZATION
- PSN: PERSON
- LOC: LOCATION
- ART: ARTIFACT
- DAT: DATE
- TIM: TIME
- MON: MONEY
- PER: PERCENT
### tagging scheme
- IOB2
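
As an illustration of how IOB2 combines with the NE types above, here is a minimal sketch, not code taken from this repository. In IOB2, the first token of every entity is tagged `B-<type>`, the remaining tokens of that entity `I-<type>`, and everything else `O`. The function name `spans_to_iob2`, the example sentence and its tokenization, and the tab-separated printing are assumptions made for illustration; the actual layout of the generated `dataset.txt` may differ.

```python
from typing import List, Tuple

# NE type labels listed in this README.
NE_TYPES = {"ORG", "PSN", "LOC", "ART", "DAT", "TIM", "MON", "PER"}


def spans_to_iob2(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Assign IOB2 tags to tokens (hypothetical helper, not part of run.py).

    Each span is (start, end, ne_type) over token indices, with end exclusive.
    Spans are assumed to be non-empty and non-overlapping.
    """
    tags = ["O"] * len(tokens)
    for start, end, ne_type in spans:
        if ne_type not in NE_TYPES:
            raise ValueError(f"unknown NE type: {ne_type}")
        tags[start] = f"B-{ne_type}"      # first token of an entity always gets B-
        for i in range(start + 1, end):
            tags[i] = f"I-{ne_type}"      # following tokens of the same entity get I-
    return tags


# Illustrative sentence and tokenization (assumed, not taken from KWDLC).
tokens = ["京都", "大学", "は", "1897", "年", "に", "創立", "された"]
spans = [(0, 2, "ORG"), (3, 5, "DAT")]

for token, tag in zip(tokens, spans_to_iob2(tokens, spans)):
    print(f"{token}\t{tag}")
```

Because every entity starts with a `B-` tag, two adjacent entities of the same type remain distinguishable, which is the main difference from the older IOB1 scheme.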
## make NER dataset script
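The options documented in this section map naturally onto Python's `argparse`. The sketch below is a reconstruction based only on the help text in this section, not the actual contents of `run.py`; the function name `build_parser` is hypothetical, and the real script may declare its options differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction from the documented help text;
    # the real run.py may differ.
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument(
        "-d", "--dataset",
        default="./KWDLC-1.0.tar.bz2",
        help="KWDLC tar file path. default ./KWDLC-1.0.tar.bz2",
    )
    parser.add_argument(
        "-f", "--file",
        default="./dataset.txt",
        help="generated dataset path. default ./dataset.txt",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["-d", "/path/to/KWDLC-1.0.tar.bz2"])
print(args.dataset, args.file)
```

Omitting `-f` falls back to the documented default output path, `./dataset.txt`.
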
```
usage: run.py [-h] [-d DATASET] [-f FILE]

optional arguments:
  -h, --help            show this help message and exit
  -d DATASET, --dataset DATASET
                        KWDLC tar file path. default ./KWDLC-1.0.tar.bz2
  -f FILE, --file FILE  generated dataset path. default ./dataset.txt
```
## References
- 萩行正嗣, 河原大輔, 黒橋禎夫.
  多様な文書の書き始めに対する意味関係タグ付きコーパスの構築とその分析,
  自然言語処理, Vol.21, No.2, pp.213-248, 2014.
- Daisuke Kawahara, Yuichiro Machida, Tomohide Shibata, Sadao Kurohashi, Hayato Kobayashi and Manabu Sassano.
  Rapid Development of a Corpus with Discourse Annotations using Two-stage Crowdsourcing,
  In Proceedings of the 25th International Conference on Computational Linguistics, pp.269-278, 2014.
- Masatsugu Hangyo, Daisuke Kawahara and Sadao Kurohashi.
  Building a Diverse Document Leads Corpus Annotated with Semantic Relations,
  In Proceedings of the 26th Pacific Asia Conference on Language Information and Computing, pp.535-544, 2012.
## Other
I assume no responsibility for any consequences of using this program.