https://github.com/proycon/colloquery

Web application for searching for phrases/collocations/synonyms in phrase translation tables

computational-linguistics machine-translation mt natural-language-processing nlp

Host: GitHub
URL: https://github.com/proycon/colloquery
Owner: proycon
License: agpl-3.0
Created: 2017-01-31T14:28:50.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2019-10-17T14:49:40.000Z (almost 7 years ago)
Last Synced: 2025-06-24T04:16:14.998Z (about 1 year ago)
Topics: computational-linguistics, machine-translation, mt, natural-language-processing, nlp
Language: Python
Homepage:
Size: 321 KB
Stars: 2
Watchers: 3
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.rst
- License: LICENSE
- Codemeta: codemeta.json

.. image:: http://applejack.science.ru.nl/lamabadge.php/colloquery
   :target: http://applejack.science.ru.nl/languagemachines/

.. image:: https://www.repostatus.org/badges/latest/inactive.svg
   :alt: Project Status: Inactive – The project has reached a stable, usable state but is no longer being actively developed; support/maintenance will be provided as time allows.
   :target: https://www.repostatus.org/#inactive

Colloquery
============

Colloquery is a web application to search for phrase translations, or
collocations, as well as synonyms, in bilingual phrase translation tables.

It is developed for `Van Dale `_ by the `Centre for Language
and Speech Technology `_, Radboud University Nijmegen, and is licensed under the
GNU Affero General Public License (AGPL-3.0).

.. image:: https://raw.github.com/proycon/colloquery/master/screenshot.jpg
   :alt: Colloquery screenshot
   :align: center

Installation
--------------

First, clone this repository and edit ``settings.py``.

Colloquery is not trivial to set up and train, as it relies on numerous
external dependencies:

* Python 3
* `MongoDB `_
* `mongoengine `_
* `Django `_

On Debian/Ubuntu systems, these can be installed using ``sudo apt-get install
python3 mongodb python3-mongoengine python3-django``.
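
As a quick sanity check after installing the packages above, the sketch below
reports which runtime dependencies cannot be found. It is not part of
Colloquery: the function name ``missing_dependencies`` is hypothetical, and the
sketch assumes that the Python modules are importable as ``mongoengine`` and
``django`` and that the MongoDB server binary is on the ``PATH`` as ``mongod``.
It only tests for presence, not for versions or for a running database.

```python
import importlib.util
import shutil


def missing_dependencies(modules=("mongoengine", "django"),
                         executables=("mongod",)):
    """Return descriptions of dependencies that could not be located.

    The default names are assumptions about a typical installation,
    not something Colloquery itself defines.
    """
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append("Python module: " + name)
    for name in executables:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(name) is None:
            missing.append("executable: " + name)
    return missing


if __name__ == "__main__":
    problems = missing_dependencies()
    if problems:
        print("Missing dependencies:")
        for problem in problems:
            print("  - " + problem)
    else:
        print("All runtime dependencies were found.")
```

An empty result only means the modules and the binary were located; MongoDB
still has to be started and ``settings.py`` still has to point at it.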

For the data generation step, the following additional dependencies are required:

* `colibri-core `_ (shipped as part of
  `LaMachine `_)
* `colibri-mt `_

To create phrase translation-tables in the first place, use the Moses training
pipeline, which in turn invokes GIZA++:

* `Moses `_
* `GIZA++ `_

Data Generation
--------------------

* Prepare your parallel corpus files. A parallel corpus consists of two
  plain-text, UTF-8 encoded files: one for the source language (``corpus.fr``
  in our example) and one for the target language (``corpus.en``). Make sure
  they are tokenised, lower-cased and contain one sentence per line (you can
  use `ucto `_ for this). Sentences on the same line in the two files are
  considered translations of each other.
* Train a phrase translation table using Moses::

      $ /path/to/moses/scripts/training/train-model.perl -external-bin-dir /path/to/moses/bin -root-dir . --parallel --corpus corpus --f fr --e en --first-step 1 --last-step 8

* Invoke Colloquery's data generation pipeline, adjusting the thresholds as
  needed (see ``./manage.py generatedata --help``). This assumes a running
  and properly configured MongoDB::

      ./manage.py generatedata --title "YourCorpus" --phrasetable corpus.fr-en.phrasetable --sourcelang fr --targetlang en --targetcorpus corpus.en --sourcecorpus corpus.fr --pst 0.2 --pts 0.2 --divergencethreshold 0.1 --freqthreshold 4

The Moses training and data generation steps may take considerable time and
system resources (most notably memory). Set sane thresholds to prevent the
data from becoming unmanageably large.
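
Because training is expensive, it can pay off to validate the parallel corpus
before starting Moses. The following is a minimal sketch, not part of
Colloquery, that checks the requirements listed above: both files decode as
UTF-8, have the same number of lines, contain no empty lines and are
lower-cased. The function name ``check_parallel_corpus`` is hypothetical, and
the sketch does not verify tokenisation.

```python
from itertools import zip_longest


def check_parallel_corpus(source_path, target_path):
    """Return a list of problems; an empty list means all checks passed.

    A file that is not valid UTF-8 raises UnicodeDecodeError.
    """
    problems = []
    with open(source_path, encoding="utf-8") as src, \
            open(target_path, encoding="utf-8") as tgt:
        # Read both files in lockstep so large corpora are never fully
        # loaded into memory.
        for number, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                problems.append(
                    "line %d: the files differ in length" % number)
                break
            for path, line in ((source_path, s), (target_path, t)):
                text = line.rstrip("\n")
                if not text.strip():
                    problems.append("%s:%d: empty line" % (path, number))
                elif text != text.lower():
                    problems.append("%s:%d: not lower-cased" % (path, number))
    return problems
```

With the file names from the example, calling
``check_parallel_corpus("corpus.fr", "corpus.en")`` returns an empty list when
the corpus passes these checks.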
