Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/fbenites/TRANSLIT
https://github.com/fbenites/TRANSLIT
Last synced: 3 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/fbenites/TRANSLIT
- Owner: fbenites
- License: cc0-1.0
- Created: 2020-02-17T10:07:30.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T07:25:48.000Z (over 1 year ago)
- Last Synced: 2024-01-18T20:15:51.360Z (5 months ago)
- Language: Python
- Size: 59.6 KB
- Stars: 6
- Watchers: 1
- Forks: 2
- Open Issues: 7
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Lists
- tamil-nlp-catalog - TRANSLIT: A Large-scale Name Transliteration Resource - {2020, [Paper](https://www.aclweb.org/anthology/2020.lrec-1.399.pdf)} (**Datasets** / Transliteration)
- awesome-urdu - TRANSLIT: A Large-scale Name Transliteration Resource, 2020
README
# TRANSLIT: A Large Name Transliteration Resource
**TRANSLIT** is A Large Name Transliteration Resource. If you find this code useful in your research, please consider citing:
@inproceedings{benitesLREC2020,
Author = {Fernando Benites, Gilbert François Duivesteijn, Pius von Däniken, Mark Cieliebak}
Title = {Large Name Transliteration Resource},
booktitle = {Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2020)},
Year = {2020},
}## We merged together sources that now encompasses 3 Millions surfaces (names) of around 1.6 Million entities
We merged four data sources:
1. [JRC named entities](https://ec.europa.eu/jrc/en/language-technologies/jrc-names)
2. [Amazon Wiki-Names](https://github.com/steveash/NETransliteration-COLING2018)
3. [Google En-Ar transliterations](https://github.com/google/transliteration)
4. [Geonames](https://download.geonames.org/export/dump/alternateNamesV2.zip)We also searched for lang tags of wikipedia for transliterations (wiki-all).
We merged multiple names of an entity and assigned a UUID to it. We saved all the gathered names/entities in the file [TRANSLIT.json](https://github.com/fbenites/TRANSLIT/blob/master/artefacts/TRANSLIT.json), in the artefacts directory.
|Dataset |# entities|# name variations| mean length of chars per name|
|--------------|----------|-----------------|------------------------------|
|JRC |819'209 |1'338'463 |14.3 |
| Geonames | 139'549 | 758'274 | 10.6 |
| SubWikiLang | 609'420 | 1'376'446 | 10.3 |
| En-Ar | 15'858 | 31'716 | 4.4 |
| Wiki-lang-all | 122'180 | 144'588 | 17.0 |
| TRANSLIT (all) | 1'655'972 | 3'008'239 | 11.8 |
## ExperimentsThe experiments of the paper can be retraced with the use of the scripts abalation_study.py, classification_experiments.py and cnn_classification.py in the code directory. For their use, the data in artefact is used. To recreate this data, you need to download the original data (17G zipped) with download_data.sh. Afterward you should run run_preprare_data.sh.
## Troubleshooting
the artefacts are quite large, so git lfs needs to be installed:
$ sudo apt install git-lfs
$ git lfs install --local
$ git lfs fetch