{"id":19725107,"url":"https://github.com/rsennrich/clevertagger","last_synced_at":"2025-04-29T23:30:35.987Z","repository":{"id":2566724,"uuid":"3546355","full_name":"rsennrich/clevertagger","owner":"rsennrich","description":"morphologically informed POS tagging for German","archived":false,"fork":false,"pushed_at":"2021-09-05T13:26:09.000Z","size":321,"stargazers_count":25,"open_issues_count":0,"forks_count":5,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-04-05T20:11:12.662Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rsennrich.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-02-25T18:05:48.000Z","updated_at":"2025-02-23T12:47:09.000Z","dependencies_parsed_at":"2022-08-29T02:41:41.927Z","dependency_job_id":null,"html_url":"https://github.com/rsennrich/clevertagger","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rsennrich%2Fclevertagger","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rsennrich%2Fclevertagger/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rsennrich%2Fclevertagger/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rsennrich%2Fclevertagger/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rsennrich","download_url":"https://codeload.github.com/rsennrich/clevertagger/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":
"github","repositories_count":251599690,"owners_count":21615570,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-11T23:28:11.104Z","updated_at":"2025-04-29T23:30:35.191Z","avatar_url":"https://github.com/rsennrich.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"clevertagger - morphologically informed POS tagging for German\n==============================================================\n\nABOUT\n-----\n\nclevertagger is a German part-of-speech tagger based on a CRF tool and SMOR.\nIts main component is a module that extracts features from SMOR's morphological analysis.\nThe combination of machine learning and FST-based morphological features promises a robust performance even for words that have not been observed during training,\nin particular morphologically complex (and rare) adjectives, verbs and nouns, which tend to have high error rates with conventional taggers.\n\n`smor_getpos.py` can also be used as a stand-alone script to convert the SMOR output into a list of possible part-of-speech tags in the STTS tagset.\n\nAUTHOR\n------\n\nRico Sennrich, Institute of Computational Linguistics, University of Zurich (http://www.cl.uzh.ch).\n\n\nLICENSE\n-------\n\nclevertagger is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License (see LICENSE).\n\n`tokenizer.perl` and `nonbreaking_prefix.de` are from the Moses toolkit and licensed under the LGPL (http://www.statmt.org/moses/)\n\n`preprocessing/sentence_splitter` is from the NLTK and licensed under the Apache 
License 2.0 (https://github.com/nltk/nltk)\n\n\nREQUIREMENTS\n------------\n\n- Linux (currently SFST is Unix/Linux only)\n- Python \u003e= 2.6\n- one of these CRF tools:\n  - Wapiti http://wapiti.limsi.fr/\n  - CRF++ http://crfpp.googlecode.com/svn/trunk/doc/index.html (no trained models available)\n- SFST \u003e= 1.3 http://www.ims.uni-stuttgart.de/projekte/gramotron/SOFTWARE/SFST.html\n\nOptional dependencies:\n\n- Perl (for tokenizer)\n\nINSTALLATION INSTRUCTIONS\n-------------------------\n\n1. Install the dependencies listed above.\n2. Obtain an SMOR transducer and a corresponding CRF model. Both are available at http://kitt.ifi.uzh.ch/kitt/zmorge/ .\n3. Set the options `SMOR_MODEL` and `CRF_MODEL` in `config.py` (and adjust other options if necessary).\n\n\nUSAGE\n-----\n\nAssuming that you have trained a CRF++/Wapiti model, you can call clevertagger like this:\n\n    ./clevertagger \u003c input_file\n\nFurther options are displayed through\n\n    ./clevertagger -h\n\nBy default, clevertagger expects tokenized input (one word per line; empty line for sentence boundaries);\nfor untokenized input, use the `--tokenize` option. A sentence splitter is included in `preprocessing`. To process raw text, call:\n\n    preprocess/sentence_splitter \u003c input_file | ./clevertagger --tokenize\n\nclevertagger also supports the n-best-tagging features of CRF++/Wapiti.\nUse the option `-n` to get multiple analyses for each sentence, and `-t` to get multiple analyses for each token.\n\nYou can also use clevertagger as a Python module with a persistent tagger class;\nit expects a list of tokenized sentences as input:\n\n    import clevertagger\n    tagger = clevertagger.Clevertagger()\n\n    for sentence in tagger.tag(['Das ist ein Test .', 'Das auch .']):\n        print sentence + '\\n'\n\n\n\nTRAINING INSTRUCTIONS\n---------------------\n\nA new CRF model can be trained with a training text in the format illustrated by `sample_training_file.txt`,\ni.e. 
one word per line, token and tag separated by spaces/tab; empty lines for sentence boundaries.\n\nThen, execute the following two commands.\nThe second one may take you several days, depending on corpus size and the number of cores (set the number of processes (-p) accordingly).\n\n    ./clevertagger -e \u003c training_file \u003e crf_training_file\n\nFor Wapiti, a typical training command is:\n\n    wapiti train --compact -p crf_config --nthread 10 crf_training_file crfmodel\n\nFor CRF++, a typical command is:\n\n    crf_learn -f 3 -c 1.5 -p 10 crf_config crf_training_file crfmodel\n\n\nFinally, change the option `CRF_MODEL` in `config.py` to point to the trained model, or move the trained model into this directory.\n\nPERFORMANCE\n-----------\n\nSome evaluation results from (Sennrich, Volk and Schneider 2013), with TnT/clevertagger models trained on TüBa-D/Z (and the standard TreeTagger model),\nand using Morphisto for morphological analysis:\n\nTagging accuracy (in %)\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003cth\u003eTagger\u003c/th\u003e\n    \u003cth\u003eTüBa-D/Z\u003c/th\u003e\n    \u003cth\u003eSofies Welt\u003c/th\u003e\n  \u003c/tr\u003e\n\n  \u003ctr\u003e\n    \u003ctd\u003eTreeTagger\u003c/td\u003e\n    \u003ctd\u003e94.9\u003c/td\u003e\n    \u003ctd\u003e95.0\u003c/td\u003e\n  \u003c/tr\u003e\n\n  \u003ctr\u003e\n    \u003ctd\u003eTnT\u003c/td\u003e\n    \u003ctd\u003e97.0\u003c/td\u003e\n    \u003ctd\u003e94.7\u003c/td\u003e\n  \u003c/tr\u003e\n\n  \u003ctr\u003e\n    \u003ctd\u003eclevertagger\u003c/td\u003e\n    \u003ctd\u003e97.6\u003c/td\u003e\n    \u003ctd\u003e96.6\u003c/td\u003e\n  \u003c/tr\u003e\n\n\u003c/table\u003e\n\nTagging performance depends on the quality of the morphological analysis, and is slightly better with the SMOR lexicon.\n\nA more indirect evaluation measuring parsing performance of [ParZu](https://github.com/rsennrich/ParZu) on a 3000-sentence test set using different taggers:\n\n\n\u003ctable\u003e\n  \u003ctr\u003e\n   
 \u003cth\u003eTagger\u003c/th\u003e\n    \u003cth\u003eprecision\u003c/th\u003e\n    \u003cth\u003erecall\u003c/th\u003e\n    \u003cth\u003ef-measure\u003c/th\u003e\n  \u003c/tr\u003e\n\n  \u003ctr\u003e\n    \u003ctd\u003eTreeTagger\u003c/td\u003e\n    \u003ctd\u003e85.6\u003c/td\u003e\n    \u003ctd\u003e83.7\u003c/td\u003e\n    \u003ctd\u003e84.6\u003c/td\u003e\n  \u003c/tr\u003e\n\n  \u003ctr\u003e\n    \u003ctd\u003eclevertagger\u003c/td\u003e\n    \u003ctd\u003e87.9\u003c/td\u003e\n    \u003ctd\u003e86.7\u003c/td\u003e\n    \u003ctd\u003e87.3\u003c/td\u003e\n  \u003c/tr\u003e\n\n  \u003ctr\u003e\n    \u003ctd\u003eclevertagger (50-best)\u003c/td\u003e\n    \u003ctd\u003e88.0\u003c/td\u003e\n    \u003ctd\u003e87.7\u003c/td\u003e\n    \u003ctd\u003e87.8\u003c/td\u003e\n  \u003c/tr\u003e\n\n  \u003ctr\u003e\n    \u003ctd\u003egold tags\u003c/td\u003e\n    \u003ctd\u003e89.8\u003c/td\u003e\n    \u003ctd\u003e89.3\u003c/td\u003e\n    \u003ctd\u003e89.5\u003c/td\u003e\n  \u003c/tr\u003e\n\n\u003c/table\u003e\n\n\nPUBLICATIONS\n------------\n\nThe tagger is described in:\n\nRico Sennrich, Martin Volk and Gerold Schneider (2013):\n   Exploiting Synergies Between Open Resources for German Dependency Parsing, POS-tagging, and Morphological Analysis.\n   In: Proceedings of the International Conference Recent Advances in Natural Language Processing 2013, Hissar, Bulgaria.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frsennrich%2Fclevertagger","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frsennrich%2Fclevertagger","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frsennrich%2Fclevertagger/lists"}