https://github.com/ayushidalmia/phrase-based-model

Implementation of Phrase Based Model to translate sentences from English to German and vice versa

Host: GitHub
URL: https://github.com/ayushidalmia/phrase-based-model
Owner: ayushidalmia
Created: 2014-05-23T10:01:52.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2014-05-23T10:51:29.000Z (over 10 years ago)
Last Synced: 2023-08-21T19:57:56.197Z (over 1 year ago)
Topics: giza, language-model, natural-language-processing, python, translation
Language: Python
Homepage:
Size: 156 KB
Stars: 12
Watchers: 3
Forks: 9
Open Issues: 1
Metadata Files:
- Readme: README.md

README

Phrase-Based-Translation
========================

This repository contains a project done as part of the course Natural Language Processing - Advanced, Spring 2014. The course was taught by [Dr. Dipti Misra Sharma](http://www.iiit.ac.in/people/faculty/dipti), [Dr. Ravi Jampani](http://www.cise.ufl.edu/~rjampani/index.html) and [Mr. Akula Arjun Reddy](http://web.iiit.ac.in/~arjunreddy.aug08/).

A detailed report is available here

##Requirements
* Python 2.6 or above
* GIZA++
* Language Model (IRSTLM)

##Problem
This project implements a phrase-based model. A phrase-based model is a simple model for machine translation that is based solely on the translation of phrases. This requires a dictionary that maps phrases from one language to another. We first find the word alignments. Next, using the bitext corpus, we train the model and calculate the translation probabilities. Along with the translation probabilities, we use a language model to reflect fluency in English.

The source folder contains the following modules:

###Main functions

* preprocess.py
This module takes as input the bitext corpora and the number of training sentences. It produces the training and testing datasets along with the sentence pairs.

Run the following command to create a random set of x sentences:

**python preprocess.py sourceCorpus targetCorpus numberOfSentencesForTraining**

It will generate four files:

* trainingSource.txt, trainingTarget.txt: contain the given number of sentences
* testingSource.txt, testingTarget.txt: contain 5 test sentences which are used later
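The random split described above can be sketched as follows. This is only an illustration, not the code in preprocess.py: the function name `split_corpus` and its parameters are hypothetical, and it assumes the two corpora are given as lists with one sentence per entry, aligned line by line.

```python
import random


def split_corpus(source_lines, target_lines, num_training, num_testing=5, seed=None):
    """Randomly pick aligned sentence pairs for training and testing.

    Hypothetical helper: the real preprocess.py may select sentences differently.
    """
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target corpora must have the same number of lines")
    if num_training + num_testing > len(source_lines):
        raise ValueError("not enough sentence pairs in the corpus")
    rng = random.Random(seed)
    # Sample the indices once so that source and target sentences stay aligned.
    indices = rng.sample(range(len(source_lines)), num_training + num_testing)
    train_idx = indices[:num_training]
    test_idx = indices[num_training:]
    training = [(source_lines[i], target_lines[i]) for i in train_idx]
    testing = [(source_lines[i], target_lines[i]) for i in test_idx]
    return training, testing
```

Sampling indices rather than sentences keeps each source sentence paired with its translation, and drawing both sets from one sample guarantees that no test pair also appears in the training data.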

Next, run the word alignment tool GIZA++ to obtain the alignments.

In order to run GIZA++ do the following:

**./plain2snt.out trainingSource.txt trainingTarget.txt**
**./GIZA++ -s trainingSource.vcb -t trainingTarget.vcb -c trainingSource_trainingTarget.snt**

If the previous step gives an error, run:

**./snt2cooc.out trainingSource.vcb trainingTarget.vcb trainingSource_trainingTarget.snt > cooc.cooc**
**./GIZA++ -s trainingSource.vcb -t trainingTarget.vcb -c trainingSource_trainingTarget.snt -CoocurrenceFile cooc.cooc**

This will generate several files. The word alignments are in the A3 file. Repeat this step with trainingSource.txt and trainingTarget.txt swapped to get the alignments in the other direction. Let sourceAlignment.txt and targetAlignment.txt be the two resulting alignment files. The phrases are then obtained as follows:

* phraseExtraction.py
This module reads the two files generated by GIZA++, containing the source-to-target and target-to-source alignments, and returns all possible phrases associated with them. Run the following command to get the phrases:

**python phraseExtraction.py sourceAlignment.txt targetAlignment.txt**
The phrases are generated in the file phrases.txt. Next we calculate the translation probability.
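As an illustration of what extracting phrases from an alignment involves, here is a minimal sketch of the standard consistent phrase-pair extraction used in phrase-based translation. It is not taken from phraseExtraction.py: the function name, the representation of the alignment as a set of 0-based `(source_index, target_index)` pairs, and the `max_source_len` limit are all assumptions.

```python
def extract_phrases(source_words, target_words, alignment, max_source_len=4):
    """Return the set of phrase pairs consistent with a word alignment.

    alignment: set of (source_index, target_index) pairs, 0-based (assumed format).
    """
    aligned_targets = set(j for _, j in alignment)
    phrases = set()
    for s_start in range(len(source_words)):
        s_stop = min(s_start + max_source_len, len(source_words))
        for s_end in range(s_start, s_stop):
            # Target positions linked to any word in the source span.
            linked = [j for i, j in alignment if s_start <= i <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            # Consistency: no word in the target span may align outside the source span.
            if any(t_start <= j <= t_end and not s_start <= i <= s_end
                   for i, j in alignment):
                continue
            source_phrase = " ".join(source_words[s_start:s_end + 1])
            # Also emit variants that absorb unaligned target words at the edges.
            lo = t_start
            while lo >= 0 and (lo == t_start or lo not in aligned_targets):
                hi = t_end
                while hi < len(target_words) and (hi == t_end or hi not in aligned_targets):
                    phrases.add((source_phrase, " ".join(target_words[lo:hi + 1])))
                    hi += 1
                lo -= 1
    return phrases
```

For a one-to-one alignment of a four-word sentence pair, every contiguous source span yields exactly one phrase pair, so ten pairs are extracted.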

* findTranslationProbability.py
After obtaining the consistent phrases from the phrase extraction algorithm, we compute the translation probabilities. This is done by calculating the relative frequency of each target phrase for a given source phrase, in both directions.

Run the following command:

**python findTranslationProbability.py phrases.txt**
It will generate two files:
translationProbabilitySourceGivenTarget.txt
translationProbabilityTargetGivenSource.txt
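The relative-frequency estimate can be sketched as below. The function name and the in-memory representation (a list of `(source_phrase, target_phrase)` tuples, one per extracted occurrence) are assumptions made for illustration; the layout of phrases.txt and of the two output files is not specified here.

```python
from collections import Counter


def relative_frequencies(phrase_pairs):
    """Estimate p(target | source) and p(source | target) by relative frequency.

    phrase_pairs: list of (source_phrase, target_phrase) tuples, with repeats
    (assumed input format).
    """
    pair_counts = Counter(phrase_pairs)
    source_counts = Counter(source for source, _ in phrase_pairs)
    target_counts = Counter(target for _, target in phrase_pairs)
    target_given_source = {}
    source_given_target = {}
    for (source, target), count in pair_counts.items():
        # count(source, target) / count(source)
        target_given_source[(source, target)] = count / float(source_counts[source])
        # count(source, target) / count(target)
        source_given_target[(source, target)] = count / float(target_counts[target])
    return target_given_source, source_given_target
```

For example, if "haus" was extracted twice with "house" and once with "home", the estimate of p("house" | "haus") is 2/3.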

* languageModelInput.py
This module formats the input file for the language model by removing all special characters. Run it as follows:

**python languageModelInput.py trainSource.txt trainS.txt**
**python languageModelInput.py trainTarget.txt trainT.txt**

Compress each output file with gzip (the commands below read trainS.gz and trainT.gz) to produce the input for the language model, which is run as follows:

**./ngt -i="gunzip -c trainS.gz" -n=3 -o=train.www -b=yes**
**./tlm -tr=train.www -n=3 -lm=wb -o=trainS.lm**
**./ngt -i="gunzip -c trainT.gz" -n=3 -o=train.www -b=yes**
**./tlm -tr=train.www -n=3 -lm=wb -o=trainT.lm**
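The cleaning step that precedes the language model build could look roughly like this. Which characters count as "special" is an assumption: this sketch keeps letters (including non-ASCII ones such as German umlauts), digits and whitespace, and drops everything else. The function name `clean_line` is hypothetical.

```python
import re


def clean_line(line):
    """Remove punctuation and other special characters from one sentence.

    Assumption: letters, digits and whitespace are kept; everything else is dropped.
    """
    # \w also matches "_", so underscores are removed in a second step.
    cleaned = re.sub(r"[^\w\s]", " ", line, flags=re.UNICODE)
    cleaned = cleaned.replace("_", " ")
    # Collapse the runs of whitespace left behind by removed characters.
    return " ".join(cleaned.split())
```

Replacing removed characters with a space before collapsing whitespace avoids gluing together two words that were separated only by punctuation.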

* finalScore.py

After the translation probabilities are obtained from the alignment matrix, this module combines them with the probability from the language model and returns the final translation probability.

Run the following command for both directions:
**python finalScore.py translationProbabilityTargetGivenSource.txt trainSource.lm finalTranslationProbabilityTargetGivenSource.txt**
**python finalScore.py translationProbabilitySourceGivenTarget.txt trainTarget.lm finalTranslationProbabilitySourceGivenTarget.txt**

It writes the final translation probabilities to the output file.
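How the two scores are combined is not spelled out above. One common choice, assumed here purely for illustration, is to multiply the translation probability by the language model probability of the target phrase, which is a sum in log space. In this sketch, `lm_log10_prob` is a hypothetical callable standing in for a lookup in the trained language model.

```python
import math


def final_scores(translation_probs, lm_log10_prob):
    """Combine translation and language model probabilities in log space.

    translation_probs: {(source_phrase, target_phrase): probability}
    lm_log10_prob: callable returning the log10 LM probability of a target
                   phrase (hypothetical stand-in for the trained model).
    Assumption: the combination is a plain product of the two probabilities.
    """
    scores = {}
    for (source, target), prob in translation_probs.items():
        if prob <= 0.0:
            continue  # log of zero is undefined; skip impossible pairs
        scores[(source, target)] = math.log10(prob) + lm_log10_prob(target)
    return scores
```

Working with log probabilities avoids numerical underflow when many small probabilities are multiplied during decoding.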

* stackDecoding.py
Once we have the final translation probabilities, we search for the best phrase translation. This module produces the translation of a given sentence using stack decoding with hypothesis recombination. Run the following command:

**python stackDecoding.py finalTranslationProbabilityTargetGivenSource.txt testingTarget.txt**
**python stackDecoding.py finalTranslationProbabilitySourceGivenTarget.txt testingSource.txt**
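To make the idea of stack decoding with hypothesis recombination concrete, here is a simplified sketch. It is not the algorithm in stackDecoding.py. The assumptions are: decoding is monotone (no reordering), the phrase table is a dictionary mapping a source phrase to a list of `(target_phrase, log_score)` options with non-empty target phrases, the language model is already folded into those scores, and two hypotheses are recombined when they cover the same source prefix and end in the same target word. All names are hypothetical.

```python
def stack_decode(source_words, phrase_table, max_phrase_len=3, stack_size=10):
    """Monotone stack decoder with recombination and histogram pruning.

    phrase_table: {source_phrase: [(target_phrase, log_score), ...]} (assumed format).
    Returns (translation, score), or None if the sentence cannot be covered.
    """
    n = len(source_words)
    # stacks[k] holds hypotheses covering the first k source words, keyed by
    # their last target word; each value is (score, list_of_target_phrases).
    stacks = [dict() for _ in range(n + 1)]
    stacks[0][""] = (0.0, [])
    for start in range(n):
        # Histogram pruning: expand only the best `stack_size` hypotheses.
        best = sorted(stacks[start].values(), key=lambda h: h[0], reverse=True)
        for score, output in best[:stack_size]:
            for end in range(start + 1, min(start + max_phrase_len, n) + 1):
                source_phrase = " ".join(source_words[start:end])
                for target_phrase, log_score in phrase_table.get(source_phrase, []):
                    key = target_phrase.split()[-1]
                    candidate = (score + log_score, output + [target_phrase])
                    # Recombination: keep only the better hypothesis per key.
                    if key not in stacks[end] or stacks[end][key][0] < candidate[0]:
                        stacks[end][key] = candidate
    if not stacks[n]:
        return None
    score, output = max(stacks[n].values(), key=lambda h: h[0])
    return " ".join(output), score
```

Recombination is safe here because two hypotheses with the same coverage and the same last target word can be extended in exactly the same ways, so the lower-scoring one can never lead to the best final translation.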

###Helper Function:
* alignment.py
This is a helper function which generates the word alignment matrix for a pair of sentences.
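A word alignment matrix for one sentence pair can be built along the following lines. This sketch assumes the alignment is given in the style of a GIZA++ A3 file, where each word is followed by the 1-based positions of its aligned words in the other sentence, for example `NULL ({ }) das ({ 1 }) haus ({ 2 })`. The function name and the returned structure are hypothetical and may differ from alignment.py.

```python
import re

# One entry per word: the word itself, then "({ positions })".
TOKEN_RE = re.compile(r"(\S+) \(\{([\d ]*)\}\)")


def alignment_matrix(aligned_line, other_sentence):
    """Build a 0/1 matrix with one row per aligned-line word and one column
    per word of the other sentence (assumed A3-style input)."""
    other_words = other_sentence.split()
    entries = TOKEN_RE.findall(aligned_line)
    # The leading NULL entry collects words that align to nothing; skip it.
    if entries and entries[0][0] == "NULL":
        entries = entries[1:]
    words = [word for word, _ in entries]
    matrix = [[0] * len(other_words) for _ in words]
    for row, (_, positions) in enumerate(entries):
        for position in positions.split():
            matrix[row][int(position) - 1] = 1  # positions are 1-based
    return words, other_words, matrix
```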

###Error Analysis
The module errorAnalysis.py takes input in a very specific format: the source sentence, the translated sentence and the actual translation, separated by newlines. It writes the precision and recall for the input file to evalution.txt.
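The exact definition of precision and recall used by errorAnalysis.py is not stated. A common word-level definition, assumed here for illustration, counts the words shared between the translated sentence and the actual translation. Both function names are hypothetical.

```python
from collections import Counter


def read_triples(lines):
    """Group non-empty lines into (source, translated, actual) triples."""
    lines = [line.strip() for line in lines if line.strip()]
    return [tuple(lines[i:i + 3]) for i in range(0, len(lines) - 2, 3)]


def precision_recall(translated, actual):
    """Word-level precision and recall of a translation against a reference.

    Assumption: words are compared as a bag of words, ignoring order.
    """
    hyp = Counter(translated.split())
    ref = Counter(actual.split())
    # Multiset intersection: each word counts at most as often as in both.
    overlap = sum((hyp & ref).values())
    precision = overlap / float(sum(hyp.values())) if hyp else 0.0
    recall = overlap / float(sum(ref.values())) if ref else 0.0
    return precision, recall
```

Precision is the share of translated words that appear in the actual translation, and recall is the share of words in the actual translation that the system produced.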