https://github.com/swabhs/transliteration
- Host: GitHub
- URL: https://github.com/swabhs/transliteration
- Owner: swabhs
- Created: 2014-02-11T22:38:46.000Z (over 11 years ago)
- Default Branch: master
- Last Pushed: 2014-02-25T17:43:39.000Z (about 11 years ago)
- Last Synced: 2025-01-19T07:43:16.423Z (4 months ago)
- Language: Python
- Size: 1.14 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README
README
===================================================================================================
To train the classifier, run:
pypy perceptron.py train.dat featlist.dat gaz.dat brown.dat x test.dat > weights.dat
train.dat - the training data file is a tab-separated file in this format:
de_word de_pos de_tag
featlist.dat - the feature list file is a list of preextracted features from
the training data.
Find it at /usr0/home/sswayamd/transliteration/featlist.dat
gaz.dat - file containing German/English words from the gazetteer.
Find it at /usr0/home/sswayamd/transliteration/gaz.de-en
brown.dat - file containing the Brown clusters.
Find it at /usr0/home/sswayamd/transliteration/brown.dat
x - the number of iterations you want to run
test.dat - the test data file, tab-separated, in the same format as the
training data
final.model - output file containing the feature names and feature weights, space-separated,
from every iteration. Feature weights from different iterations are separated
by a blank line.
===================================================================================================
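As a sketch of how the weights file described above could be consumed (the per-iteration dict structure here is an illustrative assumption, not part of the repo):

```python
def parse_weights(lines):
    """Parse 'feature weight' pairs, space-separated, one per line,
    with a blank line between iterations. Returns one dict per iteration."""
    iterations, current = [], {}
    for line in lines:
        line = line.strip()
        if not line:
            # blank line ends the current iteration's weights
            if current:
                iterations.append(current)
                current = {}
            continue
        # the weight is the last space-separated field
        name, weight = line.rsplit(" ", 1)
        current[name] = float(weight)
    if current:
        iterations.append(current)
    return iterations
```

For example, passing the lines of final.model gives a list whose last element is the final iteration's feature-to-weight mapping.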
To run the decoder:
pypy decode.py test.dat final.model gaz.dat brown.dat > output.dat
A fully trained model can be found at /usr0/home/sswayamd/transliteration/final.model
output.dat - contains the output tags in the same format as the test file
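Since output.dat mirrors the test file line for line, a quick token-level accuracy check is straightforward. This helper is illustrative, not part of the repo; it assumes the tag is the last tab-separated column and that blank lines separate sentences:

```python
def tag_accuracy(gold_lines, pred_lines):
    """Token-level tag accuracy over aligned tab-separated files."""
    correct = total = 0
    for g, p in zip(gold_lines, pred_lines):
        g, p = g.strip(), p.strip()
        if not g or not p:
            continue  # skip blank sentence separators
        total += 1
        # compare the tag column (assumed to be the last field)
        if g.split("\t")[-1] == p.split("\t")[-1]:
            correct += 1
    return correct / total if total else 0.0
```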
===================================================================================================
Preprocessing
Chris's script to convert parallel data into training data with BIO tags:
------------------------------------------------------------------------
python mnt.py inp > output1
My script to get tab-separated training data, for POS tagging:
-------------------------------------------------------------
python data_extract.py output1 > output2
Turboparser to tag the German data:
----------------------------------
./TurboTagger --test --evaluate \
--file_model=/usr2/home/sswayamd/wmt/TurboParser/models/german_tagger.model \
--file_test=output2 \
--file_prediction=output3 \
--logtostderr
More preprocessing:
------------------
paste output3 output2 > output4
awk '{print $1,$2,$4}' output4 > output5
sed 's/ /\t/g' output5 > output.dat
output.dat - contains the data in the format accepted by the classifier
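The paste/awk/sed pipeline above could equally be done in one Python pass. A minimal sketch, assuming output3 and output2 each hold whitespace-separated columns, one token per line, with blank lines between sentences:

```python
def merge_columns(tagged_lines, original_lines):
    """Mimic `paste output3 output2 | awk '{print $1,$2,$4}'` with tab-separated output."""
    out = []
    for t, o in zip(tagged_lines, original_lines):
        merged = t.split() + o.split()          # paste: join the two rows
        if len(merged) >= 4:
            # awk '{print $1,$2,$4}' keeps columns 1, 2 and 4
            out.append("\t".join((merged[0], merged[1], merged[3])))
        else:
            out.append("")                      # preserve blank sentence separators
    return out
```

One difference from the shell pipeline: this version emits tabs directly, so the final sed step is unnecessary.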
===================================================================================================