{"id":18541839,"url":"https://github.com/cltk/latin_training_set_sentence_cltk","last_synced_at":"2025-08-24T16:12:58.898Z","repository":{"id":16868875,"uuid":"19629185","full_name":"cltk/latin_training_set_sentence_cltk","owner":"cltk","description":"Training sets and tokenizer for the Latin language, for use with CLTK","archived":false,"fork":false,"pushed_at":"2017-05-26T04:19:21.000Z","size":1641,"stargazers_count":3,"open_issues_count":0,"forks_count":4,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-03-24T10:12:39.303Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cltk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-05-09T23:58:48.000Z","updated_at":"2023-10-19T10:27:34.000Z","dependencies_parsed_at":"2022-08-25T12:00:49.436Z","dependency_job_id":null,"html_url":"https://github.com/cltk/latin_training_set_sentence_cltk","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cltk%2Flatin_training_set_sentence_cltk","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cltk%2Flatin_training_set_sentence_cltk/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cltk%2Flatin_training_set_sentence_cltk/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cltk%2Flatin_training_set_sentence_cltk/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cltk","download_url":"https://codeload.github.c
om/cltk/latin_training_set_sentence_cltk/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248087702,"owners_count":21045570,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T20:06:28.276Z","updated_at":"2025-04-09T18:31:13.440Z","avatar_url":"https://github.com/cltk.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"CLTK Latin sentence tokenizer\n=============================\n\nAbout\n-----\nThis repository contains a training set and rule set for tokenizing Latin sentences, for use with the [Classical Language Toolkit](https://github.com/kylepjohnson/cltk). Unless you want to create a new training set for Latin sentences, there is nothing you need from this repository.\n\nTo tokenize Latin sentences with the CLTK, first [import it and use it according to the docs here](http://docs.cltk.org/en/latest/import_corpora.html#cltk-sentence-tokenizer-latin) and then see the [instructions on tokenizing Latin sentences](http://docs.cltk.org/en/latest/classical_latin.html#sentence-tokenization).\n\n`training_sentences.txt` comprises Cicero's *Catilinarians* and is 12,245.\n\nUse\n---\n\nTo create a new training set, manually add tokenized sentences (with each sentence starting a new line) to `training_sentences.txt` and run `train_sentence_tokenizer.py`. The script outputs `latin.pickle`. 
To use this new file, copy it to your local CLTK data directory at `~/cltk_data/compiled/sentence_tokens_latin/`.\n\n```shell\n$ python train_sentence_tokenizer.py \n  Abbreviation: [2.4650] d\n  Abbreviation: [12.9953] m\n  Abbreviation: [0.9068] sp\n  Abbreviation: [49.2998] c\n  Abbreviation: [41.9048] p\n  Abbreviation: [12.3250] q\n  Abbreviation: [2.4650] n\n  Abbreviation: [54.2298] l\n  Abbreviation: [0.3336] ser\n  Abbreviation: [1.8136] ti\n  Abbreviation: [0.3336] mam\n  Abbreviation: [1.8136] cn\n  Abbreviation: [0.9068] ap\n  Abbreviation: [4.9300] t\n  Abbreviation: [0.3336] kal\n  Abbreviation: [0.3336] app\n  Abbreviation: [2.4650] k\n  Abbreviation: [0.9068] pl\n  Sent Starter: [60.3538] 'quodsi'\n  Sent Starter: [34.5304] 'itaque'\n  Sent Starter: [69.1987] 'nam'\n  Sent Starter: [35.8925] 'sed'\n  Sent Starter: [45.4471] 'nunc'\n  Sent Starter: [56.4065] 'etenim'\n```\n\nIf you think your training set and tokenizer are an improvement over the CLTK's current ones, please submit a pull request.\n\nLICENSE\n-------\nThis software is, like the rest of the CLTK, licensed under the MIT license (see LICENSE). The texts for the training sentences come [from the Latin Library](https://github.com/kylepjohnson/corpus_latin_library) and are in the public domain, as [explained here](http://thelatinlibrary.com/about.html).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcltk%2Flatin_training_set_sentence_cltk","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcltk%2Flatin_training_set_sentence_cltk","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcltk%2Flatin_training_set_sentence_cltk/lists"}