
https://github.com/menzenski/razmetka

Train and test part-of-speech taggers and automated morphological segmenters in Python.
Host: GitHub
URL: https://github.com/menzenski/razmetka
Owner: menzenski
License: other
Created: 2015-10-29T12:40:36.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2015-12-12T03:08:29.000Z (about 9 years ago)
Last Synced: 2023-08-11T14:13:35.011Z (over 1 year ago)
Language: Python
Homepage:
Size: 42 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE.md

README

# Razmetka

This repository contains a Python utility for training and testing
part-of-speech taggers from a provided training file.

## Training Files

This package assumes a training file structured according to the following
rules:

* Each line contains one sentence (i.e., sentences are separated by the
  newline character `\n`).
* Each sentence is whitespace-delimited: a space should precede every
  punctuation mark.
* Each token in each sentence consists of three parts: the word or punctuation
  mark itself, the separator character, and the associated tag. For example,
  here's a breakdown of the token `Men_PN1s`:
  * `Men` -- the word as it would appear in an untagged sentence.
  * `_` -- the separator character (the underscore `_` is the default, but
    other separators may be specified; the slash `/` is also common).
  * `PN1s` -- the (part-of-speech) tag, here indicating a first-person singular
    pronoun.
* UTF-8 is the default encoding, but other encodings may be specified.

```
Men_PN1s besh_NU minut_N usul_N oynidim_Vt-PST.dir-1s1 ._PUNCT
Sen_PN2si poluni_N-ACC yéding_Vt-PST.dir-2si2 dédi_Vt-PST.dir-3s2 Tursun_Npr ._PUNCT
Xinjiangda_Ntop-LOC turghan_Vi-REL.PST méning_PN1s.GEN ayalim_N-POSS1s qaytip_Vi-CNV keldi_Vdirc-PST.dir-3s2 ._PUNCT
```
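
The token format described above can be read back with a few lines of plain Python. This is a minimal sketch and not part of the `razmetka` API: the function name `parse_tagged_sentence` and its `sep` parameter are hypothetical, chosen only for illustration. It splits each token on the *last* occurrence of the separator, so a word that itself contains the separator character keeps its full form.

```python
def parse_tagged_sentence(line, sep="_"):
    """Split one training-file line into (word, tag) pairs.

    Hypothetical helper for illustration; not part of razmetka.
    """
    pairs = []
    for token in line.split():
        # rpartition splits on the last separator, so only the final
        # segment of the token is treated as the tag.
        word, _, tag = token.rpartition(sep)
        if not word or not tag:
            raise ValueError("token has no word{}tag structure: {!r}".format(sep, token))
        pairs.append((word, tag))
    return pairs


sample = "Men_PN1s besh_NU minut_N usul_N oynidim_Vt-PST.dir-1s1 ._PUNCT"
pairs = parse_tagged_sentence(sample)
for word, tag in pairs:
    print(word, tag)
```

A token without a separator raises `ValueError` here, which makes malformed lines in a training file easy to spot before training begins.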

## Provided Files

This repository includes a sample file, `uyghurtagger.train`, structured
according to the standards described above. The Uyghur sentences in this
file are taken from the public online corpus of the
[**Uyghur Light Verbs Project**](https://uyghur.ittc.ku.edu/uylvs.html)
(PI Arienne M. Dwyer, NSF BCS-1053152).

## Usage

Train a Brill tagger on a provided training file:

```Python
import razmetka.tag

btt = razmetka.tag.TTBrillTaggerTrainer(file_name='uyghurtagger.train',
                                        language='Uyghur')
btt.train(verbose=True)
```

Train and test Stanford log-linear taggers from a provided training file
using ten-fold cross-validation:

```Python
import razmetka.tag

tst = razmetka.tag.TaggerTester(file_name='uyghurtagger.train',
                                language='Uyghur')
tst.split_groups()
tst.estimate_tagger_accuracy()
tst.print_results()
```
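
For readers unfamiliar with ten-fold cross-validation, the general idea can be sketched in plain Python: the sentences are divided into ten groups, and each group in turn serves as the test set while the other nine form the training set. This is an illustration of the technique only. It does not show how `TaggerTester` is implemented, and the name `ten_fold_splits` and its `k` and `seed` parameters are hypothetical.

```python
import random


def ten_fold_splits(sentences, k=10, seed=0):
    """Yield (train, test) pairs for k-fold cross-validation.

    Hypothetical helper for illustration; not part of razmetka.
    """
    shuffled = list(sentences)
    # A seeded shuffle keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    # Deal the sentences into k folds of nearly equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


# Stand-in data: 25 placeholder "sentences".
sentences = ["sentence {}".format(n) for n in range(25)]
splits = list(ten_fold_splits(sentences))
for train, test in splits:
    print(len(train), len(test))
```

Each sentence appears in exactly one test set, so averaging the ten accuracy scores gives an estimate that uses every sentence for evaluation once.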

Repeat the entire ten-fold cross-validation process multiple times:

```Python
import razmetka.tag

razmetka.tag.repeat_tagger_tests(fname='uyghurtagger.train',
                                 number_of_tests=3, language='Uyghur')
```

## Requirements

The `Razmetka` package requires NLTK 3.0+.

## TODOs

* Use `with nltk.compat.TemporaryDirectory() as tempdir:` for storing the
  properties files and training files generated when using the Stanford NLP
  POS tagger.
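
The pattern this TODO describes might look roughly like the sketch below. It uses the standard library's `tempfile.TemporaryDirectory` (available in Python 3) in place of `nltk.compat.TemporaryDirectory`. The helper `write_temp_files`, the file names `fold.train` and `fold.props`, and the single `tagSeparator` entry are assumptions made for illustration; the properties that `razmetka` actually writes are not shown in this README.

```python
import os
import tempfile


def write_temp_files(tempdir, training_lines, properties):
    """Write a training file and a properties file into tempdir.

    Hypothetical helper for illustration; file names are placeholders.
    """
    train_path = os.path.join(tempdir, "fold.train")
    props_path = os.path.join(tempdir, "fold.props")
    with open(train_path, "w", encoding="utf-8") as f:
        f.write("\n".join(training_lines) + "\n")
    with open(props_path, "w", encoding="utf-8") as f:
        for key, value in properties.items():
            f.write("{} = {}\n".format(key, value))
    return train_path, props_path


with tempfile.TemporaryDirectory() as tempdir:
    train_path, props_path = write_temp_files(
        tempdir,
        ["Men_PN1s besh_NU minut_N usul_N oynidim_Vt-PST.dir-1s1 ._PUNCT"],
        {"tagSeparator": "_"},  # illustrative entry only
    )
    # The tagger would be trained and tested here, while the files exist.
    existed_inside = os.path.exists(train_path) and os.path.exists(props_path)

# Leaving the with-block removes the directory and everything in it.
existed_after = os.path.exists(tempdir)
```

Because cleanup happens when the `with` block exits, generated files do not accumulate in the working directory across repeated cross-validation runs.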

## Support

This Python package is being written to support the work of the **Annotating
Turki Manuscripts Online** project (Principal Investigators: Arienne M. Dwyer
and C.M. Sperberg-McQueen), sponsored by the
[Luce Foundation](http://www.hluce.org). The support of the Luce Foundation
is gratefully acknowledged.