Host: GitHub
URL: https://github.com/rsennrich/multidomain_smt
Owner: rsennrich
License: gpl-2.0
Created: 2013-09-18T14:29:36.000Z (over 11 years ago)
Default Branch: master
Last Pushed: 2014-01-02T13:28:51.000Z (over 11 years ago)
Last Synced: 2025-01-10T16:54:00.705Z (5 months ago)
Language: Python
Size: 3.48 MB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

multidomain_smt
================

This project was developed at the Laboratoire d'Informatique de l'Université du Maine (http://www-lium.univ-lemans.fr), and the Institute of Computational Linguistics at the University of Zurich (http://www.cl.uzh.ch).

Project Homepage: http://github.com/rsennrich/multidomain_smt

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation.

ABOUT
-----

This repository is a sample implementation of the clustering method described in:

Rico Sennrich, Holger Schwenk and Walid Aransa. 2013. A Multi-Domain Translation Model Framework for Statistical Machine Translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), pp. 832-840.

REQUIREMENTS
------------

The program requires Python (2.6 or greater), GIZA++ and the Moses toolkit (compiled with XML-RPC-C and DLIB). Set the paths in `config.py`.

USAGE
-----

A number of options have to be set in `config.py`:

- paths to Moses binaries and GIZA++

- source-side language models (or parallel texts) for clustering: LM_TEXTS
- a parallel development set to be clustered: DEV_L1/DEV_L2
- K, the number of clusters in K-means clustering

- a test set TEST_SET
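
As a rough illustration of the options listed above, a `config.py` might look like the sketch below. The option names `LM_TEXTS`, `DEV_L1`, `DEV_L2`, `K` and `TEST_SET` come from this README; the value types, all paths, and the names `MOSES_BIN_DIR` and `GIZA_DIR` are assumptions made for illustration and may not match the `config.py` shipped with the repository.

```python
# Hypothetical config.py sketch. All paths are placeholders.

# Assumed names (not taken from the repository) for the tool locations.
MOSES_BIN_DIR = "/path/to/moses/bin"
GIZA_DIR = "/path/to/giza-pp"

# Option names mentioned in this README; the value types are assumptions.
LM_TEXTS = ["/path/to/domain1.lm", "/path/to/domain2.lm"]  # one entry per source-side model
DEV_L1 = "/path/to/dev.l1"    # one side of the parallel development set
DEV_L2 = "/path/to/dev.l2"    # the other side of the development set
K = 2                         # number of clusters for K-means
TEST_SET = "/path/to/test.l1" # text to be translated
```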

Also, the translation models to be combined need to be pre-trained, converted into the right format with `/path/to/moses/scripts/training/create_count_tables.py`, and referenced in MOSES_CFG.
See `demo/moses.ini` for an example config file, and http://www.statmt.org/moses/?n=Moses.AdvancedFeatures for documentation of the MultiModelCounts phrase table type.
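
To make the idea of combining count tables with instance weights more concrete, here is a minimal sketch, not the Moses implementation. It assumes each component model is a plain dictionary of phrase-pair counts, and it estimates p(target | source) from weighted counts. The function name `combine_counts` and the toy data are hypothetical.

```python
from collections import defaultdict

def combine_counts(models, weights):
    """Estimate p(tgt | src) from weighted phrase-pair counts.

    models:  list of dicts mapping (src, tgt) -> count (assumed format)
    weights: one non-negative weight per model
    """
    joint = defaultdict(float)     # weighted count of each (src, tgt) pair
    marginal = defaultdict(float)  # weighted count of each src phrase
    for counts, weight in zip(models, weights):
        for (src, tgt), count in counts.items():
            joint[(src, tgt)] += weight * count
            marginal[src] += weight * count
    # Skip pairs whose source phrase has zero weighted count.
    return {pair: joint[pair] / marginal[pair[0]]
            for pair in joint if marginal[pair[0]] > 0}

# Toy counts for two hypothetical component models.
model_a = {("haus", "house"): 8, ("haus", "home"): 2}
model_b = {("haus", "house"): 10, ("haus", "building"): 30}
probs = combine_counts([model_a, model_b], [0.9, 0.1])
```

Raising the weight of a model pulls the combined distribution towards that model's counts, which is why a separate weight vector per cluster can be useful.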

Executing the program:

    python main.py

will do the following:

- cluster the development set into K clusters using source side language models
- extract a set of phrase pairs for each cluster (using GIZA++ for word alignment and heuristic phrase extraction)
- for each cluster, optimize the instance weights of the component models in demo/moses.ini
- translate the test set; for each sentence:
    - assign it to the closest cluster
    - translate the sentence using the optimized instance weights that correspond to this cluster
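
The clustering and assignment steps above can be sketched as follows. This is a simplified stand-in rather than the code in `main.py`: it assumes each sentence has already been mapped to a numeric vector (for instance, one score per source-side language model), and the helper names `kmeans` and `closest` are hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def closest(vector, centroids):
    """Index of the centroid nearest to the given vector."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(vector, centroids[i]))

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain K-means; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)  # initialise from the data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[closest(v, centroids)].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids

# Toy development-set vectors (assumed: one score per language model).
dev_vectors = [[1.0, 5.0], [1.2, 4.8], [5.0, 1.0], [4.9, 1.3]]
centroids = kmeans(dev_vectors, k=2)

# At translation time, a new sentence is assigned to the nearest centroid,
# and the weights optimized for that cluster would then be used.
cluster_id = closest([1.1, 5.1], centroids)
```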

The script saves the clustering information and instance weights to files (`persistent_data.txt` and `persistent_weights.txt`), so that you can repeat the translation step with new texts.

CONTACT
-------

For questions and feedback, please contact [email protected] or use the GitHub repository.
