https://github.com/swabhs/transliteration
- Host: GitHub
- URL: https://github.com/swabhs/transliteration
- Owner: swabhs
- Created: 2014-02-11T22:38:46.000Z (over 11 years ago)
- Default Branch: master
- Last Pushed: 2014-02-25T17:43:39.000Z (about 11 years ago)
- Last Synced: 2025-01-19T07:43:16.423Z (4 months ago)
- Language: Python
- Size: 1.14 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README
README
===================================================================================================
To train the classifier, run:
pypy perceptron.py train.dat featlist.dat gaz.dat brown.dat x test.dat > weights.dat
train.dat - the training data file is a tab-separated file in this format:
de_word de_pos de_tag
featlist.dat - the feature list file is a list of preextracted features from
the training data.
Find it at /usr0/home/sswayamd/transliteration/featlist.dat
gaz.dat - file containing German/English words from the gazetteer.
Find it at /usr0/home/sswayamd/transliteration/gaz.de-en
brown.dat - file containing the Brown clusters.
Find it at /usr0/home/sswayamd/transliteration/brown.dat
x - the number of iterations you want to run
test.dat - the test data file, tab-separated, in the same format as the
training data
final.model - output file containing the feature names and feature weights, space-separated,
from every iteration. Feature weights from different iterations are separated
by a blank line.
===================================================================================================
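As a sketch of how the weights file described above could be consumed (the per-iteration dict structure here is an illustrative assumption, not part of the repo):

```python
def parse_weights(lines):
    """Parse 'feature weight' pairs, space-separated, one per line,
    with a blank line between iterations. Returns one dict per iteration."""
    iterations, current = [], {}
    for line in lines:
        line = line.strip()
        if not line:
            # blank line ends the current iteration's weights
            if current:
                iterations.append(current)
                current = {}
            continue
        # the weight is the last space-separated field
        name, weight = line.rsplit(" ", 1)
        current[name] = float(weight)
    if current:
        iterations.append(current)
    return iterations
```

For example, passing the lines of final.model gives a list whose last element is the final iteration's feature-to-weight mapping.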
To run the decoder:
pypy decode.py test.dat final.model gaz.dat brown.dat > output.dat
A fully trained model can be found at /usr0/home/sswayamd/transliteration/final.model
output.dat - contains the output tags in the same format as the test file
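Since output.dat mirrors the test file line for line, a quick token-level accuracy check is straightforward. This helper is illustrative, not part of the repo; it assumes the tag is the last tab-separated column and that blank lines separate sentences:

```python
def tag_accuracy(gold_lines, pred_lines):
    """Token-level tag accuracy over aligned tab-separated files."""
    correct = total = 0
    for g, p in zip(gold_lines, pred_lines):
        g, p = g.strip(), p.strip()
        if not g or not p:
            continue  # skip blank sentence separators
        total += 1
        # compare the tag column (assumed to be the last field)
        if g.split("\t")[-1] == p.split("\t")[-1]:
            correct += 1
    return correct / total if total else 0.0
```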
===================================================================================================
Preprocessing
Chris's script to convert parallel data into training data with BIO tags:
------------------------------------------------------------------------
python mnt.py inp > output1
My script to get tab-separated training data, for POS tagging:
-------------------------------------------------------------
python data_extract.py output1 > output2
Turboparser to tag the German data:
----------------------------------
./TurboTagger --test --evaluate \
--file_model=/usr2/home/sswayamd/wmt/TurboParser/models/german_tagger.model \
--file_test=output2 \
--file_prediction=output3 \
--logtostderr
More preprocessing:
------------------
paste output3 output2 > output4
awk '{print $1,$2,$4}' output4 > output5
sed 's/ /\t/g' output5 > output.dat
output.dat - contains the data in the format accepted by the classifier
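The paste/awk/sed pipeline above could equally be done in one Python pass. A minimal sketch, assuming output3 and output2 each hold whitespace-separated columns, one token per line, with blank lines between sentences:

```python
def merge_columns(tagged_lines, original_lines):
    """Mimic `paste output3 output2 | awk '{print $1,$2,$4}'` with tab-separated output."""
    out = []
    for t, o in zip(tagged_lines, original_lines):
        merged = t.split() + o.split()          # paste: join the two rows
        if len(merged) >= 4:
            # awk '{print $1,$2,$4}' keeps columns 1, 2 and 4
            out.append("\t".join((merged[0], merged[1], merged[3])))
        else:
            out.append("")                      # preserve blank sentence separators
    return out
```

One difference from the shell pipeline: this version emits tabs directly, so the final sed step is unnecessary.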
===================================================================================================