https://github.com/Helsinki-NLP/OpusFilter
OpusFilter - Parallel corpus processing toolkit
- Host: GitHub
- URL: https://github.com/Helsinki-NLP/OpusFilter
- Owner: Helsinki-NLP
- License: mit
- Created: 2019-11-06T13:17:08.000Z (about 6 years ago)
- Default Branch: develop
- Last Pushed: 2025-11-12T10:58:19.000Z (2 months ago)
- Last Synced: 2025-11-12T11:22:13.774Z (2 months ago)
- Topics: corpus-processing, corpus-tools, machine-translation, natural-language-processing, nlp, parallel-corpus
- Language: Python
- Homepage:
- Size: 7.1 MB
- Stars: 111
- Watchers: 10
- Forks: 25
- Open Issues: 6
Metadata Files:
- Readme: README.md
- Contributing: docs/CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
- awesome-machine-translation - OpusFilter - A tool for filtering and combining parallel corpora. (Tools 🛠)
README
# OpusFilter
OpusFilter is a tool for filtering and combining parallel corpora.
Features:
* Corpus preprocessing pipelines configured with [YAML](https://yaml.org/)
* Simple downloading of parallel corpora from [OPUS](http://opus.nlpl.eu/) with [OpusTools](https://github.com/Helsinki-NLP/OpusTools)
* Implementations for many common text file operations on parallel files
* Memory-efficient processing of large files
* Built-in filters based on, e.g., language identification, word alignment, n-gram language models, and multilingual sentence embeddings
* Extendable with your own filters written in Python
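To illustrate the kind of check a parallel-corpus filter performs, here is a minimal plain-Python sketch of a length-ratio filter. The class `LengthRatioSketch`, its `score`/`accept`/`filter` methods, and the default threshold of 3 are assumptions made for this example; they are not OpusFilter's API, so consult the documentation for the actual base class and interface for custom filters.

```python
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]  # (source segment, target segment)


class LengthRatioSketch:
    """Hypothetical illustration only; not part of the OpusFilter package."""

    def __init__(self, threshold: float = 3.0) -> None:
        # Assumed default: reject pairs whose word-count ratio is 3 or more.
        self.threshold = threshold

    def score(self, pairs: Iterable[Pair]) -> Iterator[float]:
        """Yield the word-count ratio (longer / shorter) for each pair."""
        for src, tgt in pairs:
            shorter, longer = sorted((len(src.split()), len(tgt.split())))
            # An empty side gets an infinite ratio so it is always rejected.
            yield float("inf") if shorter == 0 else longer / shorter

    def accept(self, score: float) -> bool:
        """Keep a pair only if its ratio is below the threshold."""
        return score < self.threshold

    def filter(self, pairs: Iterable[Pair]) -> Iterator[Pair]:
        """Stream pairs one at a time, so large inputs are never held in memory."""
        for pair in pairs:
            ratio = next(self.score([pair]))
            if self.accept(ratio):
                yield pair


if __name__ == "__main__":
    sample = [
        ("Hello world", "Hei maailma"),
        ("Yes", "Tämä on aivan liian pitkä käännös"),
        ("", "tyhjä"),
    ]
    for src, tgt in LengthRatioSketch().filter(sample):
        print(src, "|||", tgt)
```

Separating scoring from the accept decision lets the same scores be reused for inspection or threshold tuning before anything is discarded.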
OpusFilter was presented in the [ACL 2020 system demonstrations](https://www.aclweb.org/anthology/2020.acl-demos.20).
## Installing
Install the latest release from PyPI:
* `pip install opusfilter`, or `pip install opusfilter[all]` to include the optional Python libraries
Install from source:
* `pip install .` or `python setup.py install`
### Troubleshooting
OpusFilter should generally work on Python 3.8 to 3.13. If you run into problems, try installing the exact versions listed in `requirements.txt`:
* `pip install -r requirements.txt`
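Before pinning dependencies, it can help to confirm that the interpreter falls within the range mentioned above. The following is a small sketch using only the standard library; the helper name `in_supported_range` is made up for this example, and the bounds are copied from the text above rather than read from the package metadata.

```python
import sys

# Bounds taken from the README text; treat them as approximate.
MIN_SUPPORTED = (3, 8)
MAX_SUPPORTED = (3, 13)


def in_supported_range(version=None):
    """Return True if (major, minor) lies within the stated range."""
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


print("Python", sys.version.split()[0], "in stated range:", in_supported_range())
```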
## Documentation
The complete OpusFilter documentation is available at [helsinki-nlp.github.io/OpusFilter](https://helsinki-nlp.github.io/OpusFilter/).
You can also build the documentation from source:
* `pip install -r docs/requirements.txt` or `pip install .[docs]`
* `sphinx-build docs docs-html`
## Changelog
A changelog is available in [docs/CHANGELOG.md](docs/CHANGELOG.md).
## Citing
If you use OpusFilter in your research, please cite our [ACL 2020 paper](https://www.aclweb.org/anthology/2020.acl-demos.20):
```bibtex
@inproceedings{aulamo-etal-2020-opusfilter,
title = "{O}pus{F}ilter: A Configurable Parallel Corpus Filtering Toolbox",
author = {Aulamo, Mikko and Virpioja, Sami and Tiedemann, J{\"o}rg},
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
month = jul,
year = "2020",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-demos.20",
doi = "10.18653/v1/2020.acl-demos.20",
pages = "150--156"
}
```
A full bibliography of the papers cited in the documentation and code is available in [docs/references.bib](docs/references.bib).
## Contributing
See [docs/CONTRIBUTING.md](docs/CONTRIBUTING.md).