{"id":18541892,"url":"https://github.com/cltk/ang_models_cltk","last_synced_at":"2025-04-23T19:16:13.287Z","repository":{"id":76838618,"uuid":"137978233","full_name":"cltk/ang_models_cltk","owner":"cltk","description":null,"archived":false,"fork":false,"pushed_at":"2021-02-01T18:43:57.000Z","size":28107,"stargazers_count":6,"open_issues_count":1,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-23T19:16:04.882Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cltk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-06-20T04:01:54.000Z","updated_at":"2024-10-26T04:54:58.000Z","dependencies_parsed_at":null,"dependency_job_id":"993aebd8-1c2e-4764-b5f1-3d6a1ef5b55e","html_url":"https://github.com/cltk/ang_models_cltk","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cltk%2Fang_models_cltk","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cltk%2Fang_models_cltk/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cltk%2Fang_models_cltk/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cltk%2Fang_models_cltk/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cltk","download_url":"https://codeload.github.com/cltk/ang_models_cltk/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"ht
tps://github.com","kind":"github","repositories_count":250496993,"owners_count":21440231,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T20:06:38.514Z","updated_at":"2025-04-23T19:16:13.256Z","avatar_url":"https://github.com/cltk.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Old English Models\n\nTrained morphological taggers for Old English (OE)\n\nTraining Set and Citations\n==========================\n\nBech, Kristin and Kristine Eide. 2014. The ISWOC corpus. Department of Literature, Area Studies and European Languages, University of Oslo. http://iswoc.github.io.\n\nBuilding the Training Set\n==========================\n\nTo download the ISWOC corpus and construct a tagged corpus for training the morphological tagger, run the following from the `english_models_cltk` directory:\n\n```bash\n$ ./scripts/make_corpus.bash oe\n```\n\nThis will yield files `oe.$FEATURE`, where `$FEATURE` is one of `{pos person number tense mood voice gender case degree strength inflection}`, in the `corpora/oe` subdirectory. It will also produce files `oe_train.$FEATURE` and `oe_test.$FEATURE`.  The latter are built from Orosius' history, while the former contain the rest of the texts in the ISWOC corpus.  
Validating on an unseen text provides a more realistic estimate of the tagger's accuracy on novel texts.\n\nThe POS and morphological tag scheme is given by the following XML snippet:\n\n```XML\n\t\u003cparts-of-speech\u003e\n      \u003cvalue tag=\"A-\" summary=\"adjective\"/\u003e\n      \u003cvalue tag=\"Df\" summary=\"adverb\"/\u003e\n      \u003cvalue tag=\"S-\" summary=\"article\"/\u003e\n      \u003cvalue tag=\"Ma\" summary=\"cardinal numeral\"/\u003e\n      \u003cvalue tag=\"Nb\" summary=\"common noun\"/\u003e\n      \u003cvalue tag=\"C-\" summary=\"conjunction\"/\u003e\n      \u003cvalue tag=\"Pd\" summary=\"demonstrative pronoun\"/\u003e\n      \u003cvalue tag=\"F-\" summary=\"foreign word\"/\u003e\n      \u003cvalue tag=\"Px\" summary=\"indefinite pronoun\"/\u003e\n      \u003cvalue tag=\"N-\" summary=\"infinitive marker\"/\u003e\n      \u003cvalue tag=\"I-\" summary=\"interjection\"/\u003e\n      \u003cvalue tag=\"Du\" summary=\"interrogative adverb\"/\u003e\n      \u003cvalue tag=\"Pi\" summary=\"interrogative pronoun\"/\u003e\n      \u003cvalue tag=\"Mo\" summary=\"ordinal numeral\"/\u003e\n      \u003cvalue tag=\"Pp\" summary=\"personal pronoun\"/\u003e\n      \u003cvalue tag=\"Pk\" summary=\"personal reflexive pronoun\"/\u003e\n      \u003cvalue tag=\"Ps\" summary=\"possessive pronoun\"/\u003e\n      \u003cvalue tag=\"Pt\" summary=\"possessive reflexive pronoun\"/\u003e\n      \u003cvalue tag=\"R-\" summary=\"preposition\"/\u003e\n      \u003cvalue tag=\"Ne\" summary=\"proper noun\"/\u003e\n      \u003cvalue tag=\"Py\" summary=\"quantifier\"/\u003e\n      \u003cvalue tag=\"Pc\" summary=\"reciprocal pronoun\"/\u003e\n      \u003cvalue tag=\"Dq\" summary=\"relative adverb\"/\u003e\n      \u003cvalue tag=\"Pr\" summary=\"relative pronoun\"/\u003e\n      \u003cvalue tag=\"G-\" summary=\"subjunction\"/\u003e\n      \u003cvalue tag=\"V-\" summary=\"verb\"/\u003e\n      \u003cvalue tag=\"X-\" summary=\"unassigned\"/\u003e\n    
\u003c/parts-of-speech\u003e\n    \u003cmorphology\u003e\n      \u003cfield tag=\"person\"\u003e\n        \u003cvalue tag=\"1\" summary=\"first person\"/\u003e\n        \u003cvalue tag=\"2\" summary=\"second person\"/\u003e\n        \u003cvalue tag=\"3\" summary=\"third person\"/\u003e\n        \u003cvalue tag=\"x\" summary=\"uncertain person\"/\u003e\n      \u003c/field\u003e\n      \u003cfield tag=\"number\"\u003e\n        \u003cvalue tag=\"s\" summary=\"singular\"/\u003e\n        \u003cvalue tag=\"d\" summary=\"dual\"/\u003e\n        \u003cvalue tag=\"p\" summary=\"plural\"/\u003e\n        \u003cvalue tag=\"x\" summary=\"uncertain number\"/\u003e\n      \u003c/field\u003e\n      \u003cfield tag=\"tense\"\u003e\n        \u003cvalue tag=\"p\" summary=\"present\"/\u003e\n        \u003cvalue tag=\"i\" summary=\"imperfect\"/\u003e\n        \u003cvalue tag=\"r\" summary=\"perfect\"/\u003e\n        \u003cvalue tag=\"s\" summary=\"resultative\"/\u003e\n        \u003cvalue tag=\"a\" summary=\"aorist\"/\u003e\n        \u003cvalue tag=\"u\" summary=\"past\"/\u003e\n        \u003cvalue tag=\"l\" summary=\"pluperfect\"/\u003e\n        \u003cvalue tag=\"f\" summary=\"future\"/\u003e\n        \u003cvalue tag=\"t\" summary=\"future perfect\"/\u003e\n        \u003cvalue tag=\"x\" summary=\"uncertain tense\"/\u003e\n      \u003c/field\u003e\n      \u003cfield tag=\"mood\"\u003e\n        \u003cvalue tag=\"i\" summary=\"indicative\"/\u003e\n        \u003cvalue tag=\"s\" summary=\"subjunctive\"/\u003e\n        \u003cvalue tag=\"m\" summary=\"imperative\"/\u003e\n        \u003cvalue tag=\"o\" summary=\"optative\"/\u003e\n        \u003cvalue tag=\"n\" summary=\"infinitive\"/\u003e\n        \u003cvalue tag=\"p\" summary=\"participle\"/\u003e\n        \u003cvalue tag=\"d\" summary=\"gerund\"/\u003e\n        \u003cvalue tag=\"g\" summary=\"gerundive\"/\u003e\n        \u003cvalue tag=\"u\" summary=\"supine\"/\u003e\n        \u003cvalue tag=\"x\" summary=\"uncertain mood\"/\u003e\n        
\u003cvalue tag=\"y\" summary=\"finiteness unspecified\"/\u003e\n        \u003cvalue tag=\"e\" summary=\"indicative or subjunctive\"/\u003e\n        \u003cvalue tag=\"f\" summary=\"indicative or imperative\"/\u003e\n        \u003cvalue tag=\"h\" summary=\"subjunctive or imperative\"/\u003e\n        \u003cvalue tag=\"t\" summary=\"finite\"/\u003e\n      \u003c/field\u003e\n      \u003cfield tag=\"voice\"\u003e\n        \u003cvalue tag=\"a\" summary=\"active\"/\u003e\n        \u003cvalue tag=\"m\" summary=\"middle\"/\u003e\n        \u003cvalue tag=\"p\" summary=\"passive\"/\u003e\n        \u003cvalue tag=\"e\" summary=\"middle or passive\"/\u003e\n        \u003cvalue tag=\"x\" summary=\"unspecified\"/\u003e\n      \u003c/field\u003e\n      \u003cfield tag=\"gender\"\u003e\n        \u003cvalue tag=\"m\" summary=\"masculine\"/\u003e\n        \u003cvalue tag=\"f\" summary=\"feminine\"/\u003e\n        \u003cvalue tag=\"n\" summary=\"neuter\"/\u003e\n        \u003cvalue tag=\"p\" summary=\"masculine or feminine\"/\u003e\n        \u003cvalue tag=\"o\" summary=\"masculine or neuter\"/\u003e\n        \u003cvalue tag=\"r\" summary=\"feminine or neuter\"/\u003e\n        \u003cvalue tag=\"q\" summary=\"masculine, feminine or neuter\"/\u003e\n        \u003cvalue tag=\"x\" summary=\"uncertain gender\"/\u003e\n      \u003c/field\u003e\n      \u003cfield tag=\"case\"\u003e\n        \u003cvalue tag=\"n\" summary=\"nominative\"/\u003e\n        \u003cvalue tag=\"a\" summary=\"accusative\"/\u003e\n        \u003cvalue tag=\"o\" summary=\"oblique\"/\u003e\n        \u003cvalue tag=\"g\" summary=\"genitive\"/\u003e\n        \u003cvalue tag=\"c\" summary=\"genitive or dative\"/\u003e\n        \u003cvalue tag=\"e\" summary=\"accusative or dative\"/\u003e\n        \u003cvalue tag=\"d\" summary=\"dative\"/\u003e\n        \u003cvalue tag=\"b\" summary=\"ablative\"/\u003e\n        \u003cvalue tag=\"i\" summary=\"instrumental\"/\u003e\n        \u003cvalue tag=\"l\" summary=\"locative\"/\u003e\n   
     \u003cvalue tag=\"v\" summary=\"vocative\"/\u003e\n        \u003cvalue tag=\"x\" summary=\"uncertain case\"/\u003e\n        \u003cvalue tag=\"z\" summary=\"no case\"/\u003e\n      \u003c/field\u003e\n      \u003cfield tag=\"degree\"\u003e\n        \u003cvalue tag=\"p\" summary=\"positive\"/\u003e\n        \u003cvalue tag=\"c\" summary=\"comparative\"/\u003e\n        \u003cvalue tag=\"s\" summary=\"superlative\"/\u003e\n        \u003cvalue tag=\"x\" summary=\"uncertain degree\"/\u003e\n        \u003cvalue tag=\"z\" summary=\"no degree\"/\u003e\n      \u003c/field\u003e\n      \u003cfield tag=\"strength\"\u003e\n        \u003cvalue tag=\"w\" summary=\"weak\"/\u003e\n        \u003cvalue tag=\"s\" summary=\"strong\"/\u003e\n        \u003cvalue tag=\"t\" summary=\"weak or strong\"/\u003e\n      \u003c/field\u003e\n      \u003cfield tag=\"inflection\"\u003e\n        \u003cvalue tag=\"n\" summary=\"non-inflecting\"/\u003e\n        \u003cvalue tag=\"i\" summary=\"inflecting\"/\u003e\n      \u003c/field\u003e\n    \u003c/morphology\u003e\n ```\n\n\nRunning Unit Tests\n==================\n\nFirst, install NLTK and download the `punkt` model (`import nltk; nltk.download('punkt')`). Then, to check that the system is sound, run:\n\n```bash\n$ python test/python/test_train_morpho_tagger.py\n```\n\n\nEvaluating the Tagging Models\n=============================\n\nThe following script trains four POS-only tagging models: Unigram, Backoff (Trigram, Bigram, Unigram), Conditional Random Field (CRF), and Averaged Perceptron.\n\nNote: you need to have an installation of `shuf` (Linux) or `gshuf` (Mac; `brew install coreutils`).\n\n```bash\n$ ./scripts/evaluate_all_models.bash oe\n```\n\nFor each model,\n1.  Ten-fold cross-validation is run on the whole corpus, where in each fold about 10% of the corpus is randomly selected as a test set, while the rest is used for training the tagger.\n2.  
The tagger is trained on `corpora/oe/oe_train.pos` and evaluated on `corpora/oe/oe_test.pos`.\n3.  The tagging speed of the trained tagger is measured as the time taken to tag the text of Beowulf.\n\nFor each model trained, the script outputs the average accuracy over the cross-validation folds, plus the accuracy and Cohen's kappa on the test set.  A confusion matrix for the test set is also produced for error analysis.\n\nSample output:\n```\n---------- unigram ----------\nCV fold 1 accuracy = 0.806 kappa = 0.757 in 0.757 seconds\nCV fold 2 accuracy = 0.82 kappa = 0.773 in 0.773 seconds\nCV fold 3 accuracy = 0.828 kappa = 0.784 in 0.784 seconds\nCV fold 4 accuracy = 0.803 kappa = 0.753 in 0.753 seconds\nCV fold 5 accuracy = 0.813 kappa = 0.767 in 0.767 seconds\nCV fold 6 accuracy = 0.81 kappa = 0.763 in 0.763 seconds\nCV fold 7 accuracy = 0.817 kappa = 0.773 in 0.773 seconds\nCV fold 8 accuracy = 0.833 kappa = 0.791 in 0.791 seconds\nCV fold 9 accuracy = 0.814 kappa = 0.769 in 0.769 seconds\nCV fold 10 accuracy = 0.829 kappa = 0.784 in 0.784 seconds\n10-fold validation of model unigram = 0.817\nTest of model unigram for feature pos on unseen text:\n        accuracy = 0.687\n        kappa = 0.627\n\nConfusion matrix (rows = gold):\n    A-   C-   DF  DU  G-   NB  NE   PD  PI   PP  PS  PX  PY   R-   V-\nA-  34    0    2   0   0    1   0    0   0    0   0   0   0    0    0\nC-   0  115    0   0   2    0   0    0   0    2   0   0   0    0    0\nDF   0    0  119   0   2    0   0    7   0    0   0   0   0    2    0\nDU   0    0    0   2   0    0   0    0   0    0   0   0   0    0    0\nG-   0    0    8   0  56    0   0    8   0    0   0   0   0    5    0\nNB   0    0    0   0   0  139   0    0   0    0   0   1   0    0    0\nNE   0    0    0   0   0    0   9    0   0    0   0   0   0    0    0\nPD   0    0   14   0  31    0   0  104   0    0   0   0   0    0    0\nPI   0    0    0   0   2    0   0    0   1    0   0   0   0    0    0\nPP   0    0    0   0   0    0   
0    0   0  130   2   0   0    0    0\nPS   0    0    0   0   0    0   0    0   0    0  28   0   0    0    1\nPX   0    0    0   0   0    6   0    0   0    0   0   5   0    0    0\nPY   0    1    0   0   0    1   0    0   0    0   0   0  88    0    0\nR-   0    0    5   0   0    0   0    0   0    0   0   0   4  160    0\nV-   0    0    0   0   0    0   0    0   0    0   0   0   0    9  178\n\nTime for model unigram to tag Beowulf = 0.039\n\n---------- backoff ----------\nCV fold 1 accuracy = 0.834 kappa = 0.795 in 0.795 seconds\nCV fold 2 accuracy = 0.833 kappa = 0.791 in 0.791 seconds\nCV fold 3 accuracy = 0.825 kappa = 0.781 in 0.781 seconds\nCV fold 4 accuracy = 0.832 kappa = 0.790 in 0.790 seconds\nCV fold 5 accuracy = 0.815 kappa = 0.771 in 0.771 seconds\nCV fold 6 accuracy = 0.826 kappa = 0.782 in 0.782 seconds\nCV fold 7 accuracy = 0.833 kappa = 0.793 in 0.793 seconds\nCV fold 8 accuracy = 0.837 kappa = 0.795 in 0.795 seconds\nCV fold 9 accuracy = 0.824 kappa = 0.781 in 0.781 seconds\nCV fold 10 accuracy = 0.832 kappa = 0.790 in 0.790 seconds\n10-fold validation of model backoff = 0.829\nTest of model backoff for feature pos on unseen text:\n        accuracy = 0.701\n        kappa = 0.644\n\nConfusion matrix (rows = gold):\n    A-   C-   DF  DU  G-   NB  NE   PD  PI   PP  PS  PX  PY   R-   V-\nA-  34    0    2   0   0    1   0    0   0    0   0   0   0    0    0\nC-   0  115    0   0   2    0   0    0   0    2   0   0   0    0    0\nDF   0    0  118   0   6    0   0    4   0    0   0   0   0    2    0\nDU   0    0    0   2   0    0   0    0   0    0   0   0   0    0    0\nG-   0    0    1   0  60    0   0   13   0    0   0   0   0    3    0\nNB   0    0    0   0   0  140   0    0   0    0   0   0   0    0    0\nNE   0    0    0   0   0    0   9    0   0    0   0   0   0    0    0\nPD   0    0    7   0  17    0   0  125   0    0   0   0   0    0    0\nPI   0    0    0   0   2    0   0    0   1    0   0   0   0    0    0\nPP   0    0    0   0   0    0   0    0 
  0  130   2   0   0    0    0\nPS   0    0    0   0   0    0   0    0   0    1  27   0   0    0    1\nPX   0    0    0   0   0    4   0    0   0    0   0   7   0    0    0\nPY   0    1    0   0   0    1   0    0   0    0   0   0  88    0    0\nR-   0    0    4   0   1    0   0    0   0    0   0   0   4  158    2\nV-   0    0    0   0   0    0   0    0   0    0   0   0   0    7  180\n\nTime for model backoff to tag Beowulf = 0.127\n\n---------- crf ----------\nCV fold 1 accuracy = 0.895 kappa = 0.869 in 0.869 seconds\nCV fold 2 accuracy = 0.901 kappa = 0.878 in 0.878 seconds\nCV fold 3 accuracy = 0.896 kappa = 0.871 in 0.871 seconds\nCV fold 4 accuracy = 0.895 kappa = 0.870 in 0.870 seconds\nCV fold 5 accuracy = 0.894 kappa = 0.870 in 0.870 seconds\nCV fold 6 accuracy = 0.907 kappa = 0.885 in 0.885 seconds\nCV fold 7 accuracy = 0.901 kappa = 0.875 in 0.875 seconds\nCV fold 8 accuracy = 0.91 kappa = 0.889 in 0.889 seconds\nCV fold 9 accuracy = 0.9 kappa = 0.874 in 0.874 seconds\nCV fold 10 accuracy = 0.897 kappa = 0.873 in 0.873 seconds\n10-fold validation of model crf = 0.900\nTest of model crf for feature pos on unseen text:\n        accuracy = 0.827\n        kappa = 0.794\n\nConfusion matrix (rows = gold):\n    A-   C-   DF  DU  G-   NB  NE   PD  PI   PP  PS  PX  PY   R-   V-\nA-  28    0    1   0   0   22   0    0   0    0   1   0   5    0   21\nC-   0  115    0   0   2    1   0    0   0    2   0   0   0    0    1\nDF   2    0  121   0   2    8   3    0   0    0   0   0   4    2   22\nDU   0    0    0   2   0    0   0    0   0    0   0   0   0    0    0\nG-   0    1    7   0  50    1   0    7   0    3   0   0   0    3    6\nNB   7    0    2   0   0  231   0    1   0    2   3   0   1    1   16\nNE   1    0    6   0   0    4  82    0   0    1   0   0   0    3    0\nPD   1    0    9   0   4    1   2  132   0    0   0   0   0    0    2\nPI   0    0    2   0   0    0   0    0   1    0   0   0   0    0    0\nPP   1    0    0   0   0    3   0    0   0  126   2   0   0  
  0    0\nPS   0    0    0   0   0    0   0    1   0    0  28   0   0    0    0\nPX   0    0    0   0   0    6   0    0   0    0   0   3   0    0    2\nPY   8    0    3   0   0   14  15    0   0    0   1   0  78    3    3\nR-   0    0    4   0   0    3   0    0   0    0   0   0   4  156    7\nV-   1    0    0   0   0   20   0    0   0    1   0   0   1    3  246\n\nTime for model crf to tag Beowulf = 0.350\n\n---------- perceptron ----------\nCV fold 1 accuracy = 0.924 kappa = 0.904 in 0.904 seconds\nCV fold 2 accuracy = 0.925 kappa = 0.905 in 0.905 seconds\nCV fold 3 accuracy = 0.923 kappa = 0.904 in 0.904 seconds\nCV fold 4 accuracy = 0.929 kappa = 0.911 in 0.911 seconds\nCV fold 5 accuracy = 0.916 kappa = 0.896 in 0.896 seconds\nCV fold 6 accuracy = 0.93 kappa = 0.912 in 0.912 seconds\nCV fold 7 accuracy = 0.928 kappa = 0.910 in 0.910 seconds\nCV fold 8 accuracy = 0.919 kappa = 0.900 in 0.900 seconds\nCV fold 9 accuracy = 0.918 kappa = 0.897 in 0.897 seconds\nCV fold 10 accuracy = 0.931 kappa = 0.914 in 0.914 seconds\n10-fold validation of model perceptron = 0.924\nTest of model perceptron for feature pos on unseen text:\n        accuracy = 0.857\n        kappa = 0.830\n\nConfusion matrix (rows = gold):\n    A-   C-   DF  DU  G-   NB  NE   PD  PI   PP  PS  PX  PY   R-   V-\nA-  34    0    2   0   0   17   1    0   0    0   0   0   2    0   22\nC-   0  117    1   0   2    1   0    0   0    0   0   0   0    0    0\nDF   4    1  127   0   4    8   1    6   0    0   0   0   1    0   12\nDU   0    0    0   2   0    0   0    0   0    0   0   0   0    0    0\nG-   0    0    4   0  59    1   0    9   2    0   0   0   0    2    1\nNB   2    0    2   0   1  244   0    1   0    0   0   1   2    0   11\nNE   0    0    2   0   1    7  87    0   0    0   0   0   0    0    0\nPD   1    0   10   0   6    1   0  133   0    0   0   0   0    0    0\nPI   0    2    0   0   0    0   0    0   1    0   0   0   0    0    0\nPP   1    0    0   0   0    3   0    0   0  126   2   0   0    
0    0\nPS   0    0    1   0   0    0   0    0   0    1  27   0   0    0    0\nPX   0    0    0   0   0    4   0    0   0    0   0   7   0    0    0\nPY   7    0    1   0   0   20  15    1   0    0   0   0  77    0    4\nR-   1    0    3   0   0    4   2    0   0    0   0   0   4  160    0\nV-   2    0    0   0   1   18   0    0   0    0   0   0   1    6  244\n\nTime for model perceptron to tag Beowulf = 1.886\n```\n\nWe see that the Perceptron tagger is the most accurate but also the slowest.\n\nTraining Taggers\n=================\n\nThe Python module at `src/python/train.py` is used to train POS taggers.\n\nRunning the script with `-h` prints its usage:\n\n```bash\nusage: train.py [-h] [-l LANGUAGE] [-u UNTAGGED_TEXT_FILE] [-v]\n                [-m {unigram,bigram,trigram,backoff,crf,perceptron,all}]\n                [-f {pos,person,number,tense,mood,gender,case,degree,strength,inflection}]\n                [-s SEMI_SUPERVISED_FILE] [-c SEMI_SUPERVISED_CONF]\n\nTrain morphological tagger(s).\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -l LANGUAGE, --language LANGUAGE\n                        train models for this language\n  -u UNTAGGED_TEXT_FILE, --untagged_text_file UNTAGGED_TEXT_FILE\n                        untagged text for testing\n  -v, --verbose\n  -m {unigram,bigram,trigram,backoff,crf,perceptron,all}, --model_type {unigram,bigram,trigram,backoff,crf,perceptron,all}\n                        model type to train\n  -f {pos,person,number,tense,mood,gender,case,degree,strength,inflection}, --feature {pos,person,number,tense,mood,gender,case,degree,strength,inflection}\n                        Morphological feature to train\n  -s SEMI_SUPERVISED_FILE, --semi-supervised_file SEMI_SUPERVISED_FILE\n                        Untagged text for semi-supervised training\n  -c SEMI_SUPERVISED_CONF, --semi-supervised-conf SEMI_SUPERVISED_CONF\n                        Confidence level of tags for semi-supervised training\n```\n\nIn the simplest 
case, all supported models are trained on the POS feature and stored in `taggers/oe/{FEATURE}`. Unless the `-v` flag is set, the script is silent.  The language defaults to `oe`.  \n\n```bash\npython src/python/train.py -m all\n\nls -l taggers/pos\ntotal 6296\n-rwxrwxrwx 1 jds jds  198414 Jun 29 01:11 backoff.pickle\n-rwxrwxrwx 1 jds jds  466397 Jun 29 01:11 bigram.pickle\n-rwxrwxrwx 1 jds jds  552396 Jun 29 01:11 crf.pickle\n-rwxrwxrwx 1 jds jds 4294112 Jun 29 01:12 perceptron.pickle\n-rwxrwxrwx 1 jds jds  764556 Jun 29 01:11 trigram.pickle\n-rwxrwxrwx 1 jds jds  163393 Jun 29 01:11 unigram.pickle\n```\n\nWith `-v` set, the output lists the location of the saved tagger and a sample of its output:\n\n```bash\npython src/python/train.py -m crf -v\nModel crf saved at taggers/pos/crf.pickle.  Training accuracy = 0.925\nSample tagging output: [('Hwæt', 'I-'), ('!', 'C-'), ('We', 'NE'), ('Gardena', 'NE'), ('in', 'R-'), ('geardagum', 'NB'), (',', 'C-'), ('þeodcyninga', 'NB'), (',', 'C-'), ('þrym', 'PY')]\n```\n\nTo train a tagger for a different feature, use the `-f` command-line argument:\n\n```bash\npython src/python/train.py -m perceptron -f case -v\nModel perceptron for feature case saved at taggers/case/perceptron.pickle. Training accuracy = 0.983\nSample tagging output: [('Hwæt', 'A'), ('!', '-'), ('We', 'N'), ('Gardena', '-'), ('in', '-'), ('geardagum', 'D'), (',', '-'), ('þeodcyninga', 'G'), (',', 'G'), ('þrym', 'G')]\n```\n\nLoading the Tagger\n==================\n\n[point to CLTK documentation elsewhere?]","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcltk%2Fang_models_cltk","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcltk%2Fang_models_cltk","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcltk%2Fang_models_cltk/lists"}