https://github.com/qanastek/ner-mmtd
Named-entity recognition corpora for multilingual voice recognition in the music industry based on the Million Musical Tweets dataset
- Host: GitHub
- URL: https://github.com/qanastek/ner-mmtd
- Owner: qanastek
- License: mit
- Created: 2021-05-07T12:03:21.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2021-05-08T01:52:04.000Z (over 4 years ago)
- Last Synced: 2025-01-18T09:18:29.642Z (10 months ago)
- Topics: corpora, dataset, english, french, million-musical-tweets, mmtd, music, named-entity-recognition, ner, neural-network, recognition, voice
- Language: Python
- Homepage:
- Size: 1.68 MB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# NER-MMTD
Named-entity recognition corpora for multilingual recognition in the music industry based on the Million Musical Tweets dataset
## Steps
- Clone the project
- Download the MMTD corpora [here](http://www.cp.jku.at/datasets/MMTD/)
- Extract the files `artists.txt` and `track.txt` into `data/raw`
- To regenerate the corpora, run `python getInfos.py`
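
Before regenerating the corpora, it can help to confirm that the raw files sit where the steps above put them. The following is a minimal sketch, not part of the repository: the helper `missing_raw_files` is a hypothetical name, and the only facts taken from the steps are the file names and the `data/raw` directory.

```python
from pathlib import Path

# File names and directory come from the steps above;
# the helper itself is hypothetical and not part of the repository.
RAW_FILES = ("artists.txt", "track.txt")


def missing_raw_files(raw_dir="data/raw", names=RAW_FILES):
    """Return the expected raw files that are not present in raw_dir."""
    root = Path(raw_dir)
    return [name for name in names if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_raw_files()
    if missing:
        print("Missing from data/raw:", ", ".join(missing))
    else:
        print("All raw files found; `python getInfos.py` can be run.")
```

Run from the project root, this reports which of the two files still need to be extracted.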
## State-of-the-art
Using [Flair](https://github.com/flairNLP/flair) with a CRF, trained for 100 epochs (1.5 hours on an E5-2690 v1):
| | precision | recall | f1-score |
|------------------|-----------|--------|----------|
| ARTIST | 98.10% | 100% | 99.06% |
| PAUSE | 100% | 100% | 100% |
| START | 100% | 100% | 100% |
| STOP | 100% | 100% | 100% |
| TRACK | 99.42% | 99.42% | 99.42% |
| F1-score (macro) | | | 99.70% |
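
The macro F1-score in the last row is the unweighted mean of the five per-class F1-scores. As a quick check, the sketch below recomputes it from the values in the table; the function name `macro_f1` is chosen here for illustration and is not taken from the repository.

```python
# Per-class F1 scores (in percent), copied from the table above.
f1_by_class = {
    "ARTIST": 99.06,
    "PAUSE": 100.0,
    "START": 100.0,
    "STOP": 100.0,
    "TRACK": 99.42,
}


def macro_f1(scores):
    """Unweighted mean of the per-class F1 scores."""
    return sum(scores.values()) / len(scores)


print(f"{macro_f1(f1_by_class):.2f}%")  # prints 99.70%
```

Because the average is unweighted, each entity type counts equally regardless of how many examples it has in the test set.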
## Sources
- http://www.cp.jku.at/datasets/MMTD/
- http://www.diva-portal.se/smash/get/diva2:1010104/FULLTEXT01.pdf
## Citation
If you use these corpora in your research, please cite the following resources:
```BibTeX
@inproceedings{inproceedings,
author = {Hauger, David and Kosir, Andrej and Tkalčič, Marko and Schedl, Markus},
year = {2013},
month = {11},
pages = {},
title = {THE MILLION MUSICAL TWEETS DATASET: WHAT CAN WE LEARN FROM MICROBLOGS}
}
@misc{labrak_yanis_ner_mmtd,
author = {Labrak Yanis},
title = {Named-entity recognition corpora for multilingual voice recognition in the music industry},
month = may,
year = 2021,
version = {1.0},
publisher = {GitHub},
url = {https://github.com/qanastek/NER-MMTD}
}
```