{"id":13740833,"url":"https://github.com/recski/HunTag","last_synced_at":"2025-05-08T20:32:47.157Z","repository":{"id":3576420,"uuid":"4639089","full_name":"recski/HunTag","owner":"recski","description":"a sequential tagger for NLP using Maximum Entropy Learning and Hidden Markov Models","archived":false,"fork":false,"pushed_at":"2016-01-18T08:52:12.000Z","size":621,"stargazers_count":22,"open_issues_count":6,"forks_count":10,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-04-02T13:17:22.781Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"lgpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/recski.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-06-12T15:25:11.000Z","updated_at":"2023-05-31T13:55:27.000Z","dependencies_parsed_at":"2022-08-26T13:12:03.833Z","dependency_job_id":null,"html_url":"https://github.com/recski/HunTag","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/recski%2FHunTag","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/recski%2FHunTag/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/recski%2FHunTag/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/recski%2FHunTag/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/recski","download_url":"https://codeload.github.com/recski/HunTag/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253145084,"
owners_count":21861184,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T04:00:52.734Z","updated_at":"2025-05-08T20:32:46.806Z","avatar_url":"https://github.com/recski.png","language":"Python","readme":"Huntag - a sequential tagger for NLP combining the linear classifier Liblinear and Hidden Markov Models\nBased on training data, Huntag can perform any kind of sequential sentence\ntagging and has been used for NP chunking and Named Entity Recognition.\n\n#Requirements\nHunTag uses the Liblinear package, which can be downloaded from:\n\nhttp://www.csie.ntu.edu.tw/~cjlin/liblinear/\n\nIn order for HunTag to work, Liblinear should be compiled with Python bindings and the directory containing the Python files `liblinearutil.py` and `liblinear.py` should be added to the environment variable PYTHONPATH.\n\nIMPORTANT: after installing Liblinear, the Python bindings must be patched by cd-ing to the python subdirectory of your liblinear installation and running\npatch \u003c (*path-to-HunTag*)/liblinear.patch\nThis allows liblinear to handle the more memory-efficient cType input used by HunTag.\n\n#Pre-trained models\nPre-trained models for Hungarian NP-chunking and NER are available from the [HunTag webpage](http://hlt.bme.hu/en/software/huntag)\n\n#Data format\n\nInput data must be a tab-separated file with one word per line and an empty\nline to mark sentence boundaries. Each line must contain the same number of\nfields and the last field must contain the correct tag for the word, which\nmay be in the BI format used at CoNLL shared tasks (e.g. 
B-NP to mark the\nfirst word of a noun phrase, I-NP to mark the rest and O to mark words\noutside an NP) or in the so-called BIE1 format which has a separate symbol\nfor words constituting a chunk themselves (1-NP) and one for the last words\nof multi-word phrases (E-NP). The first two characters of answer tags\nshould always conform to one of these two conventions; the rest may be any\nstring describing the category. \n\n#Features\n\nThe flexibility of Huntag comes from the fact that it will generate any kind\nof features from the input data given the appropriate Python functions.\nSeveral dozen features used regularly in NLP tasks are already\nimplemented in the file features.py; however, the user is encouraged to add\nany number of her own.\n\nOnce the desired features are implemented, a data set and a configuration\nfile containing the list of feature functions to be used are all Huntag\nneeds to perform training and tagging.\n\n#Config file\nThe configuration file lists the features that are to be used for a given task. The configuration file may start with a command specifying the default radius for features. This is optional. Example:\n!defaultRadius 5\n\nNext, it can give values to variables that shall be used by the featurizing methods.\nFor example, the following three lines set the parameters of the feature called krpatt:\n\nlet krpatt minLength 2\nlet krpatt maxLength 99\nlet krpatt lang hu\n\nThe second field specifies the name of the feature, the third a key, the fourth a numeric value. The dictionary of key-value pairs will be passed to the feature.\n\nAfter this come the actual assignments of feature names to features. Examples:\n\ntoken ngr ngrams 0\nsentence bwsamecases isBetweenSameCases 1\nlex street hunner/lex/streetname.lex 0\ntoken lemmalowered lemmaLowered 0,2\n\nThe first keyword can have three values: token, lex and sentence. 
For example, in the first example line above, the feature name ngr will be assigned to the Python function ngrams() that returns a feature string for the given token. The third argument is a column or comma-separated list of columns. It specifies which fields of the input should be passed to the feature function. Counting starts from zero.\n\nFor sentence features, the input is aggregated sentence-wise into a list, and this list is then passed to the feature function. This function should return a list consisting of one feature string for each of the tokens of the sentence.\n\nFor lex features, the second argument specifies a lexicon file rather than a Python function name. The specified token field is matched against this lexicon file.\n\n\n#Usage\nHunTag may be run in any of the following three modes:\n\n##train\nUsed to train a Liblinear model given a training corpus and a set of feature functions. When run in TRAIN mode, HunTag creates three files, one containing the liblinear model and two listing features and labels and the integers they are mapped to when passed to liblinear. 
With the --model option set to NAME, the three files will be stored under NAME.model, NAME.featureNumbers and NAME.labelNumbers respectively.\n\ncat TRAINING_DATA | python huntag.py train OPTIONS\n\nMandatory options:\n    -c FILE, --config-file=FILE\n        read feature configuration from FILE\n    -m NAME, --model=NAME\n        name of liblinear model and lists\n    -p PARAMS, --parameters=PARAMS\n        pass PARAMS to liblinear trainer\n\nNon-mandatory options:    \n    -f FILE, --feature-file=FILE\n        write training events to FILE\n\n\n##bigram-train\nUsed to train a bigram language model using a given field of the training data\n\ncat TRAINING_DATA | python huntag.py bigram-train OPTIONS\n\nMandatory options:\n    -b FILE, --bigram-model=FILE\n        name of bigram model file to be written\n    -t FIELD, --tag-field=FIELD\n        specify FIELD containing the tags to build bigram\n\n##tag\nUsed to tag input. Given a maxent model providing the value P(t|w) for all tags t and words (set of feature values) w, and a bigram language model supplying P(t|t0) for all pairs of tags, HunTag will assign to each sentence the most likely tag sequence.\n\ncat INPUT | python huntag.py tag OPTIONS\n\nMandatory options:\n    -m NAME, --model=NAME\n        name of liblinear model file and lists\n    -b FILE, --bigram-model=FILE\n        name of bigram model file\n    -c FILE, --config-file=FILE\n        read feature configuration from FILE\n\nNon-mandatory options:\n    -l L, --language-model-weight=L\n        set weight of the language model to L (default is 1)\n\n#Authors\n\nHuntag was created by Gábor Recski and Dániel Varga. It is a reimplementation and generalization of a Named Entity Recognizer built by Dániel Varga and Eszter Simon.\n\n#License\n\nHuntag is made available under the GNU Lesser General Public License v3.0. 
If you received Huntag in a package that also contains the Hungarian training corpora for Named Entity Recognition and chunking tasks, then please note that these corpora are derivative works based on the Szeged Treebank, and they are made available under the same restrictions that apply to the original Szeged Treebank.\n\n#Reference\n\nIf you use the tool, please cite the following paper:\n\nGábor Recski, Dániel Varga (2009): A Hungarian NP-chunker. In: *The Odd Yearbook. ELTE SEAS Undergraduate Papers in Linguistics*. Budapest: ELTE School of English and American Studies. pp. 87-93\n\n```\n@article{Recski:2009a,\n   author={Recski, G\\'abor and D\\'aniel Varga},\n   title={{A Hungarian NP Chunker}},\n   journal = {The Odd Yearbook. ELTE SEAS Undergraduate Papers in Linguistics},\n   publisher = {ELTE {S}chool of {E}nglish and {A}merican {S}tudies},\n   city = {Budapest},\n   pages= {87--93}, \n   year={2009}\n}\n```\n\nIf you use some specialized version for Hungarian, please also cite the following paper:\n\nDóra Csendes, János Csirik, Tibor Gyimóthy and András Kocsor (2005): The Szeged Treebank. In: *Text, Speech and Dialogue. Lecture Notes in Computer Science* Volume 3658/2005, Springer: Berlin. pp. 123-131.\n\n```\n@inproceedings{Csendes:2005,\n   author={Csendes, D{\\'o}ra and Csirik, J{\\'a}nos and Gyim{\\'o}thy, Tibor and Kocsor, Andr{\\'a}s},\n   title={The {S}zeged {T}reebank},\n   booktitle={Lecture Notes in Computer Science: Text, Speech and Dialogue},\n   year={2005},\n   pages={123-131},\n   publisher={Springer}\n}\n```\n","funding_links":[],"categories":["Software","Tools"],"sub_categories":["Utilities","Taggers / Chunkers"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frecski%2FHunTag","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frecski%2FHunTag","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frecski%2FHunTag/lists"}