https://github.com/M4t1ss/parallel-corpora-tools
Tools for filtering and cleaning parallel and monolingual corpora for machine translation and other natural language processing tasks.
- Host: GitHub
- URL: https://github.com/M4t1ss/parallel-corpora-tools
- Owner: M4t1ss
- License: MIT
- Created: 2017-12-05T19:39:57.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2023-12-19T05:50:20.000Z (7 months ago)
- Last Synced: 2024-03-03T17:37:18.191Z (4 months ago)
- Topics: cleaning, corpora, corpus-tools, data-processing, data-science, filtering, language, language-processing, machine, machine-translation, natural-language, natural-language-processing, neural, neural-machine-translation, nlp, nmt, translation
- Language: PHP
- Homepage:
- Size: 51.8 KB
- Stars: 40
- Watchers: 5
- Forks: 16
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Lists
- awesome-machine-translation - Corpora Cleaning Tools - Tools for filtering and cleaning parallel and monolingual corpora in order to train better (neural) machine translation systems. (Tools 🛠)
README
# Corpora Cleaning Tools
Tools for filtering and cleaning parallel and monolingual corpora
in order to train better (neural) machine translation systems.

Inspired by the Data Filtering and Data Pre-processing sections of
[Tilde's](http://tilde.com) [WMT17 paper](http://www.statmt.org/wmt17/pdf/WMT37.pdf).
This repository includes some of the more basic scripts that can help to get rid of
the majority of junk from parallel corpora.

Tools included
---------
* [parallel](https://github.com/M4t1ss/parallel-corpora-tools/blob/master/parallel) - tools for parallel corpora
* [mono](https://github.com/M4t1ss/parallel-corpora-tools/blob/master/mono) - tools for monolingual corpora

Requirements
---------
* Python with [langid.py](https://github.com/saffsd/langid.py)
* PHP
* [Moses scripts](https://github.com/moses-smt/mosesdecoder)
* [Subword NMT](https://github.com/rsennrich/subword-nmt)

```bash
pip install subword-nmt
pip install langid
```
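
To illustrate the kind of filtering these requirements support, here is a minimal Python sketch of a parallel-corpus filter. It is not taken from the scripts in this repository: the function names (`keep_pair`, `filter_corpus`, `langid_detector`), the specific checks, and the `max_ratio` threshold of 3.0 are all assumptions chosen for illustration. The only external call used is `langid.classify`, which returns a `(language, score)` tuple.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple

# Hypothetical type: a function mapping a sentence to a language code such as "en".
LangDetector = Callable[[str], str]


def keep_pair(
    src: str,
    tgt: str,
    src_lang: str,
    tgt_lang: str,
    detect: Optional[LangDetector] = None,
    max_ratio: float = 3.0,  # assumed threshold, not a value from the repository
) -> bool:
    """Return True if a sentence pair passes a few basic sanity checks."""
    src, tgt = src.strip(), tgt.strip()
    # Drop pairs where either side is empty.
    if not src or not tgt:
        return False
    # Drop untranslated pairs (source copied verbatim to target).
    if src == tgt:
        return False
    # Drop pairs whose token counts differ by more than max_ratio.
    src_len, tgt_len = len(src.split()), len(tgt.split())
    if max(src_len, tgt_len) / min(src_len, tgt_len) > max_ratio:
        return False
    # Optionally drop pairs whose detected languages do not match.
    if detect is not None:
        if detect(src) != src_lang or detect(tgt) != tgt_lang:
            return False
    return True


def filter_corpus(
    pairs: Iterable[Tuple[str, str]],
    src_lang: str,
    tgt_lang: str,
    detect: Optional[LangDetector] = None,
) -> Iterator[Tuple[str, str]]:
    """Yield only the (source, target) pairs that pass keep_pair."""
    for src, tgt in pairs:
        if keep_pair(src, tgt, src_lang, tgt_lang, detect):
            yield src.strip(), tgt.strip()


def langid_detector() -> LangDetector:
    """Build a detector backed by langid.py (requires `pip install langid`)."""
    import langid  # imported lazily so the rest of the sketch runs without it

    return lambda text: langid.classify(text)[0]
```

With langid installed, passing `detect=langid_detector()` to `filter_corpus` enables the language check; leaving `detect` as `None` applies only the empty-line, identical-pair, and length-ratio checks.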
Publications
---------

If you use this tool, please cite the following paper:
Matīss Rikters (2018). "[Impact of Corpora Quality on Neural Machine Translation.](https://arxiv.org/abs/1810.08392)" In Proceedings of the 8th Conference Human Language Technologies - The Baltic Perspective (Baltic HLT 2018).
```bibtex
@inproceedings{Rikters2018BalticHLT,
  author = {Rikters, Matīss},
  booktitle = {Proceedings of the 8th Conference Human Language Technologies - The Baltic Perspective (Baltic HLT 2018)},
  title = {{Impact of Corpora Quality on Neural Machine Translation}},
  address = {Tartu, Estonia},
  year = {2018}
}
```