{"id":24074981,"url":"https://github.com/pprett/nut","last_synced_at":"2025-09-29T06:32:05.656Z","repository":{"id":1116089,"uuid":"986315","full_name":"pprett/nut","owner":"pprett","description":"Natural language Understanding Toolkit","archived":false,"fork":false,"pushed_at":"2014-05-07T18:28:00.000Z","size":26131,"stargazers_count":118,"open_issues_count":1,"forks_count":25,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-01-09T18:24:51.477Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"hquery/query-composer","license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pprett.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2010-10-14T08:08:20.000Z","updated_at":"2024-05-26T19:41:18.000Z","dependencies_parsed_at":"2022-08-16T12:05:14.355Z","dependency_job_id":null,"html_url":"https://github.com/pprett/nut","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pprett%2Fnut","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pprett%2Fnut/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pprett%2Fnut/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pprett%2Fnut/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pprett","download_url":"https://codeload.github.com/pprett/nut/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":234597657,"owners_count":18857983,"icon_url":"https://github.com/github.png"
,"version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-09T18:25:09.031Z","updated_at":"2025-09-29T06:32:03.710Z","avatar_url":"https://github.com/pprett.png","language":"C","readme":"Natural language Understanding Toolkit\n======================================\n\nTOC\n---\n\n  * Requirements_\n  * Installation_\n  * Documentation_\n     - CLSCL_\n     - NER_\n  * References_\n\n.. _Requirements:\n\nRequirements\n------------\n\nTo install nut you need:\n\n   * Python 2.5 or 2.6\n   * Numpy (\u003e= 1.1)\n   * Sparsesvd (\u003e= 0.1.4) [#f1]_ (only CLSCL_)\n\n.. _Installation:\n\nInstallation\n------------\n\nTo clone the repository run, \n\n   git clone git://github.com/pprett/nut.git\n\nTo build the extension modules inplace run,\n\n   python setup.py build_ext --inplace\n\nAdd project to python path,\n\n   export PYTHONPATH=$PYTHONPATH:$HOME/workspace/nut\n\n.. _Documentation:\n\nDocumentation\n-------------\n\n.. _CLSCL:\n\nCLSCL\n~~~~~\n\nAn implementation of Cross-Language Structural Correspondence Learning (CLSCL). \nSee [Prettenhofer2010]_ for a detailed description and \n[Prettenhofer2011]_ for more experiments and enhancements.\n\nThe data for cross-language sentiment classification that has been used in the above\nstudy can be found here [#f2]_.\n\nclscl_train\n???????????\n\nTraining script for CLSCL. See `./clscl_train --help` for further details. 
\n\nUsage::\n\n    $ ./clscl_train en de cls-acl10-processed/en/books/train.processed cls-acl10-processed/en/books/unlabeled.processed cls-acl10-processed/de/books/unlabeled.processed cls-acl10-processed/dict/en_de_dict.txt model.bz2 --phi 30 --max-unlabeled=50000 -k 100 -m 450 --strategy=parallel\n\n    |V_S| = 64682\n    |V_T| = 106024\n    |V| = 170706\n    |s_train| = 2000\n    |s_unlabeled| = 50000\n    |t_unlabeled| = 50000\n    debug: DictTranslator contains 5012 translations.\n    mutualinformation took 5.624 sec\n    select_pivots took 7.197 sec\n    |pivots| = 450\n    create_inverted_index took 59.353 sec\n    Run joblib.Parallel\n    [Parallel(n_jobs=-1)]: Done   1 out of 450 |elapsed:    9.1s remaining: 67.8min\n    [Parallel(n_jobs=-1)]: Done   5 out of 450 |elapsed:   15.2s remaining: 22.6min\n    [..]\n    [Parallel(n_jobs=-1)]: Done 449 out of 450 |elapsed: 14.5min remaining:    1.9s\n    train_aux_classifiers took 881.803 sec\n    density: 0.1154\n    Ut.shape = (100,170706)\n    learn took 903.588 sec\n    project took 175.483 sec\n\n.. note:: If you have access to a Hadoop cluster, you can use `--strategy=hadoop` to train the pivot classifiers even faster; however, make sure that the Hadoop nodes have Bolt (feature-mask branch) [#f3]_ installed. \n\nclscl_predict\n?????????????\n\nPrediction script for CLSCL.\n\nUsage::\n\n    $ ./clscl_predict cls-acl10-processed/en/books/train.processed model.bz2 cls-acl10-processed/de/books/test.processed 0.01\n    |V_S| = 64682\n    |V_T| = 106024\n    |V| = 170706\n    load took 0.681 sec\n    load took 0.659 sec\n    classes = {negative,positive}\n    project took 2.498 sec\n    project took 2.716 sec\n    project took 2.275 sec\n    project took 2.492 sec\n    ACC: 83.05\n\n.. _NER:\n\nNamed-Entity Recognition\n~~~~~~~~~~~~~~~~~~~~~~~~\n\nA simple greedy left-to-right sequence labeling approach to named entity recognition (NER). 
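
The greedy left-to-right labeling scheme can be sketched in a few lines of Python. This is a toy illustration only: `greedy_tag` and `toy_score` are hypothetical stand-ins for the trained classifiers in nut, not its actual implementation:

```python
# Illustrative sketch of greedy left-to-right sequence labeling
# (a toy stand-in, not the actual nut implementation; the scoring
# function below is hypothetical).
def greedy_tag(tokens, tags, score):
    # Choose the best tag for each token given the tags chosen so far.
    history = []
    for token in tokens:
        best = max(tags, key=lambda tag: score(token, tag, history))
        history.append(best)
    return history

def toy_score(token, tag, history):
    # Toy scorer: capitalized tokens look like person names, and an
    # I-PER tag is only plausible directly after B-PER or I-PER.
    if tag == 'O':
        return 0.5
    if tag == 'B-PER' and token[:1].isupper():
        return 1.0
    if tag == 'I-PER' and token[:1].isupper() and history and history[-1] in ('B-PER', 'I-PER'):
        return 1.5
    return 0.0

tokens = 'Peter Prettenhofer lives in Austria .'.split()
print(greedy_tag(tokens, ['O', 'B-PER', 'I-PER'], toy_score))
# -> ['B-PER', 'I-PER', 'O', 'O', 'B-PER', 'O']
```

Because each decision conditions only on earlier tags, decoding is a single linear pass; the real tagger replaces `toy_score` with per-class linear models over rich features.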
\n\npre-trained models\n??????????????????\n\nWe provide pre-trained named entity recognizers for place, person, and organization names in English and German. To tag a sentence simply use::\n\n    \u003e\u003e\u003e from nut.io import compressed_load\n    \u003e\u003e\u003e from nut.util import WordTokenizer\n\n    \u003e\u003e\u003e tagger = compressed_load(\"model_demo_en.bz2\")\n    \u003e\u003e\u003e tokenizer = WordTokenizer()\n    \u003e\u003e\u003e tokens = tokenizer.tokenize(\"Peter Prettenhofer lives in Austria .\")\n\n    \u003e\u003e\u003e # see tagger.tag.__doc__ for input format\n    \u003e\u003e\u003e sent = [((token, \"\", \"\"), \"\") for token in tokens]\n    \u003e\u003e\u003e g = tagger.tag(sent)  # returns a generator over tags\n    \u003e\u003e\u003e print(\" \".join([\"/\".join(tt) for tt in zip(tokens, g)]))\n    Peter/B-PER Prettenhofer/I-PER lives/O in/O Austria/B-LOC ./O\n\nYou can also use the convenience demo script `ner_demo.py`::\n\n    $ python ner_demo.py model_en_v1.bz2\n\nThe feature detector modules for the pre-trained models are `en_best_v1.py` and `de_best_v1.py` and can be found in the package `nut.ner.features`.\nIn addition to baseline features (word presence, shape, pre-/suffixes) they use distributional features (brown clusters), non-local features (extended prediction history), and gazetteers (see [Ratinov2009]_). The models have been trained on CoNLL03 [#f4]_. Both models use neither syntactic features (e.g. part-of-speech tags, chunks) nor word lemmas, thus, minimizing the required pre-processing. 
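
The extended prediction history feature mentioned above can be sketched roughly as follows; this is a deliberately simplified toy with hypothetical names (`history_features`, `record`), not nut's actual feature detector:

```python
# Illustrative sketch of the extended-prediction-history idea
# (non-local feature; a simplified toy, not nut's implementation):
# remember how a token was tagged earlier in the document and expose
# that as a feature when the token reappears.
from collections import defaultdict

history = defaultdict(list)  # token -> tags assigned so far

def record(token, tag):
    # Store the tag the decoder just assigned to this token.
    history[token].append(tag)

def history_features(token):
    # One feature per distinct tag previously assigned to this token.
    return sorted('hist=' + tag for tag in set(history[token]))

record('Prettenhofer', 'I-PER')
print(history_features('Prettenhofer'))  # ['hist=I-PER']
print(history_features('Austria'))       # []
```

The intuition is that a token tagged as part of a name once is likely to be a name again later in the same document, which helps where local context is uninformative.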
Both models provide state-of-the-art performance on the CoNLL03 shared task benchmark for English [Ratinov2009]_::\n\n    processed 46435 tokens with 4946 phrases; found: 4864 phrases; correct: 4455.\n    accuracy:  98.01%; precision:  91.59%; recall:  90.07%; FB1:  90.83\n                  LOC: precision:  91.69%; recall:  90.53%; FB1:  91.11  1648\n                  ORG: precision:  87.36%; recall:  85.73%; FB1:  86.54  1630\n                  PER: precision:  95.84%; recall:  94.06%; FB1:  94.94  1586\n\nand German [Faruqui2010]_::\n\n    processed 51943 tokens with 2845 phrases; found: 2438 phrases; correct: 2168.\n    accuracy:  97.92%; precision:  88.93%; recall:  76.20%; FB1:  82.07\n                  LOC: precision:  87.67%; recall:  79.83%; FB1:  83.57  957\n                  ORG: precision:  82.62%; recall:  65.92%; FB1:  73.33  466\n                  PER: precision:  93.00%; recall:  78.02%; FB1:  84.85  1015\n\n\nTo evaluate the German model on the out-domain data provided by [Faruqui2010]_, use the raw flag (`-r`) to write raw predictions (without B- and I- prefixes)::\n\n    ./ner_predict -r model_de_v1.bz2 clner/de/europarl/test.conll - | clner/scripts/conlleval -r\n    loading tagger... [done]\n    use_eph:  True\n    use_aso:  False\n    processed input in 40.9214s sec.\n    processed 110405 tokens with 2112 phrases; found: 2930 phrases; correct: 1676.\n    accuracy:  98.50%; precision:  57.20%; recall:  79.36%; FB1:  66.48\n                  LOC: precision:  91.47%; recall:  71.13%; FB1:  80.03  563\n                  ORG: precision:  43.63%; recall:  83.52%; FB1:  57.32  1673\n                  PER: precision:  62.10%; recall:  83.85%; FB1:  71.36  694\n\n\nNote that the above results cannot be compared directly to the results of [Faruqui2010]_ since they use a slightly different setting (incl. the MISC entity).\n\nner_train\n?????????\n\nTraining script for NER. See ./ner_train --help for further details. 
\n\nTo train a conditional Markov model with a greedy left-to-right decoder, \nusing the feature templates and extended prediction history \nof [Ratinov2009]_, run::\n\n    ./ner_train clner/en/conll03/train.iob2 model_rr09.bz2 -f rr09 -r 0.00001 -E 100 --shuffle --eph\n    ________________________________________________________________________________\n    Feature extraction\n    \n    min count:  1\n    use eph:  True\n    build_vocabulary took 24.662 sec\n    feature_extraction took 25.626 sec\n    creating training examples... build_examples took 42.998 sec\n    [done]\n    ________________________________________________________________________________\n    Training\n    \n    num examples: 203621\n    num features: 553249\n    num classes: 9\n    classes:  ['I-LOC', 'B-ORG', 'O', 'B-PER', 'I-PER', 'I-MISC', 'B-MISC', 'I-ORG', 'B-LOC']\n    reg: 0.00001000\n    epochs: 100\n    9 models trained in 239.28 seconds. \n    train took 282.374 sec\n    \n\nner_predict\n???????????\n\nYou can use the prediction script to tag new sentences in CoNLL format \nand write the output to a file or to stdout. \nYou can pipe the output directly to `conlleval` to assess the model performance::\n\n    ./ner_predict model_rr09.bz2 clner/en/conll03/test.iob2 - | clner/scripts/conlleval\n    loading tagger... [done]\n    use_eph:  True\n    use_aso:  False\n    processed input in 11.2883s sec.\n    processed 46435 tokens with 5648 phrases; found: 5605 phrases; correct: 4799.\n    accuracy:  96.78%; precision:  85.62%; recall:  84.97%; FB1:  85.29\n                  LOC: precision:  87.29%; recall:  88.91%; FB1:  88.09  1699\n                 MISC: precision:  79.85%; recall:  75.64%; FB1:  77.69  665\n                  ORG: precision:  82.90%; recall:  78.81%; FB1:  80.80  1579\n                  PER: precision:  88.81%; recall:  91.28%; FB1:  90.03  1662\n\n.. _References:\nReferences\n----------\n\n.. [#f1] http://pypi.python.org/pypi/sparsesvd/0.1.4\n.. 
[#f2] http://www.webis.de/research/corpora/corpus-webis-cls-10/cls-acl10-processed.tar.gz\n.. [#f3] https://github.com/pprett/bolt/tree/feature-mask\n.. [#f4] For German we use the updated version of CoNLL03 by Sven Hartrumpf. \n\n.. [Prettenhofer2010] Prettenhofer, P. and Stein, B., `Cross-language text classification using structural correspondence learning \u003chttp://www.aclweb.org/anthology/P/P10/P10-1114.pdf\u003e`_. In Proceedings of ACL '10.\n\n.. [Prettenhofer2011] Prettenhofer, P. and Stein, B., `Cross-lingual adaptation using structural correspondence learning \u003chttp://tist.acm.org/papers/TIST-2010-06-0137.R1.html\u003e`_. ACM TIST (to appear). `[preprint] \u003chttp://arxiv.org/pdf/1008.0716v2\u003e`_\n\n.. [Ratinov2009] Ratinov, L. and Roth, D., `Design challenges and misconceptions in named entity recognition \u003chttp://www.aclweb.org/anthology/W/W09/W09-1119.pdf\u003e`_. In Proceedings of CoNLL '09.\n\n.. [Faruqui2010] Faruqui, M. and Padó, S., `Training and Evaluating a German Named Entity Recognizer with Semantic Generalization`. In Proceedings of KONVENS '10.\n\n.. _Developer Notes:\nDeveloper Notes\n---------------\n\n  * If you copy a new version of `bolt` into the `externals` directory, make sure to run Cython on the `*.pyx` files. If you fail to do so, you will get a `PickleError` in multiprocessing.\n","funding_links":[],"categories":["Python"],"sub_categories":["General-Purpose Machine Learning"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpprett%2Fnut","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpprett%2Fnut","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpprett%2Fnut/lists"}