https://github.com/proycon/colloquery
Web application for searching for phrases/collocations/synonyms in phrase translation tables
https://github.com/proycon/colloquery
computational-linguistics machine-translation mt natural-language-processing nlp
Last synced: 11 months ago
JSON representation
Web application for searching for phrases/collocations/synonyms in phrase translation tables
- Host: GitHub
- URL: https://github.com/proycon/colloquery
- Owner: proycon
- License: agpl-3.0
- Created: 2017-01-31T14:28:50.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2019-10-17T14:49:40.000Z (over 6 years ago)
- Last Synced: 2025-06-24T04:16:14.998Z (11 months ago)
- Topics: computational-linguistics, machine-translation, mt, natural-language-processing, nlp
- Language: Python
- Homepage:
- Size: 321 KB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 1
-
Metadata Files:
- Readme: README.rst
- License: LICENSE
- Codemeta: codemeta.json
Awesome Lists containing this project
README
.. image:: http://applejack.science.ru.nl/lamabadge.php/colloquery
:target: http://applejack.science.ru.nl/languagemachines/
.. image:: https://www.repostatus.org/badges/latest/inactive.svg
:alt: Project Status: Inactive – The project has reached a stable, usable state but is no longer being actively developed; support/maintenance will be provided as time allows.
:target: https://www.repostatus.org/#inactive
Colloquery
============
Colloquery is a web application to search for phrase translations, or
collocations, as well as synonyms,in bilingual phrase translation tables.
It is developed for `Van Dale `_ by the `Centre for Language
and Speech Technology `_, Radboud University Nijmegen, and is licensed under the
Affero GNU Public License.
.. image:: https://raw.github.com/proycon/colloquery/master/screenshot.jpg
:alt: Colloquery screenshot
:align: center
Installation
--------------
First, clone this repository and edit ``settings.py``.
Colloquery is not trivial to set-up and train, as it relies on numerous
external dependencies:
* Python 3
* `MongoDB `_
* `mongoengine `_
* `Django `_
On Debian/Ubuntu systems, these can be installed using ``sudo apt-get install
python3 mongodb python3-mongoengine python3-django``.
For the data generation step, the following additional dependencies are required:
* `colibri-core `_ (shipped as part of
`LaMachine `_)
* `colibri-mt `_
To create phrase translation-tables in the first place, use the Moses training
pipeline, which in turn invokes GIZA++:
* `Moses `_
* `GIZA++ `_
Data Generation
--------------------
* Prepare your parallel corpus files. A parallel corpus consists of two plain-text UTF8 encoded
files, one for the source language (``corpus.fr`` in our example) and one for the target
language (``corpus.en``). Make sure they are tokenised, lower-cased and
contain one sentence per line (you can use `ucto
`_ for this), sentences on the same line in the other file
are considering translations.
* Train a phrase translation table using Moses::
$ /path/to/moses/scripts/training/train-model.perl -external-bin-dir /path/to/moses/bin -root-dir . --parallel --corpus corpus --f fr --e en --first-step 1 --last-step 8
* Invoke the data generation pipeline of Colloquery, adjust the thresholds as
needed (see ``./manage.py generatedata --help``). This assumes a running
and properly configured MongoDB::
./manage.py generatedata --title "YourCorpus" --phrasetable corpus.fr-en.phrasetable --sourcelang fr --targetlang en --targetcorpus corpus.fr --sourcecorpus corpus.en --pst 0.2 --pts 0.2 --divergencethreshold 0.1 --freqthreshold 4
The Moses and data generation pipeline may take considerable time and system
resources (most notably memory). Set sane thresholds to prevent the data from
becoming unmanageably large.