Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/cltk/latin_training_set_sentence_cltk

Training sets and tokenizer for the Latin language, for use with CLTK
https://github.com/cltk/latin_training_set_sentence_cltk

Last synced: 6 days ago
JSON representation

Training sets and tokenizer for the Latin language, for use with CLTK

Host: GitHub
URL: https://github.com/cltk/latin_training_set_sentence_cltk
Owner: cltk
License: mit
Created: 2014-05-09T23:58:48.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2017-05-26T04:19:21.000Z (over 7 years ago)
Last Synced: 2023-08-09T23:13:28.441Z (over 1 year ago)
Language: Python
Homepage:
Size: 1.56 MB
Stars: 4
Watchers: 7
Forks: 4
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

        CLTK Latin sentence tokenizer

=============================

About

-----

This repository contains a training set and rule set for tokenizing sentences for Latin, for use with the [Classical Language Toolkit](https://github.com/kylepjohnson/cltk). Unless you want to create a new training set for Latin sentences, there is nothing you need from this repository.

To tokenize Latin sentences with the CLTK, first [import it and use according to the docs here](http://docs.cltk.org/en/latest/import_corpora.html#cltk-sentence-tokenizer-latin) and then see [instructions on tokenizing Latin sentences](http://docs.cltk.org/en/latest/classical_latin.html#sentence-tokenization).

`training_sentences.txt` is comprised Cicero's *Catilinarians* and is 12,245.

Use

---

To create a new training set, manually add tokenized sentences (with each sentence starting a new line) to `training_sentences.txt` and run `train_sentence_tokenizer.py`. The script outputs `latin.pickle`. To use this new file, copy it to your local CLTK data directory at `~/cltk_data/compiled/sentence_tokens_latin/`.

```shell

$ python train_sentence_tokenizer.py 

  Abbreviation: [2.4650] d

  Abbreviation: [12.9953] m

  Abbreviation: [0.9068] sp

  Abbreviation: [49.2998] c

  Abbreviation: [41.9048] p

  Abbreviation: [12.3250] q

  Abbreviation: [2.4650] n

  Abbreviation: [54.2298] l

  Abbreviation: [0.3336] ser

  Abbreviation: [1.8136] ti

  Abbreviation: [0.3336] mam

  Abbreviation: [1.8136] cn

  Abbreviation: [0.9068] ap

  Abbreviation: [4.9300] t

  Abbreviation: [0.3336] kal

  Abbreviation: [0.3336] app

  Abbreviation: [2.4650] k

  Abbreviation: [0.9068] pl

  Sent Starter: [60.3538] 'quodsi'

  Sent Starter: [34.5304] 'itaque'

  Sent Starter: [69.1987] 'nam'

  Sent Starter: [35.8925] 'sed'

  Sent Starter: [45.4471] 'nunc'

  Sent Starter: [56.4065] 'etenim'

```

If you think your training set and tokenizer is an improvement over the CLTK's current, please submit a pull request.

LICENSE

-------

This software is, like the rest of the CLTK, licensed under the MIT license (see LICENSE). The texts for the training sentences comes [from the Latin Library](https://github.com/kylepjohnson/corpus_latin_library) and are their copyright now resides in the public domain [explained here](http://thelatinlibrary.com/about.html).