{"id":15014128,"url":"https://github.com/bjascob/lemminflect","last_synced_at":"2025-04-04T13:11:43.204Z","repository":{"id":51387518,"uuid":"186178954","full_name":"bjascob/LemmInflect","owner":"bjascob","description":"A python module for English lemmatization and inflection.","archived":false,"fork":false,"pushed_at":"2023-09-14T13:17:58.000Z","size":4527,"stargazers_count":267,"open_issues_count":6,"forks_count":25,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-04T12:56:36.159Z","etag":null,"topics":["inflection","lemmatization","nlp","nlp-machine-learning","python","spacy","spacy-extensions"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bjascob.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-05-11T20:33:58.000Z","updated_at":"2025-04-03T08:12:21.000Z","dependencies_parsed_at":"2024-06-18T15:44:19.131Z","dependency_job_id":null,"html_url":"https://github.com/bjascob/LemmInflect","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bjascob%2FLemmInflect","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bjascob%2FLemmInflect/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bjascob%2FLemmInflect/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bjascob%2FLemmInflect/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/owners/bjascob","download_url":"https://codeload.github.com/bjascob/LemmInflect/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247182401,"owners_count":20897381,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["inflection","lemmatization","nlp","nlp-machine-learning","python","spacy","spacy-extensions"],"created_at":"2024-09-24T19:45:13.834Z","updated_at":"2025-04-04T13:11:43.187Z","avatar_url":"https://github.com/bjascob.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ![(icon)](docs/img/favicon.ico) \u0026nbsp; LemmInflect\n\n**A python module for English lemmatization and inflection.**\n\n\n## About\nLemmInflect uses a dictionary approach to lemmatize English words and inflect them into forms\nspecified by a user supplied [Universal Dependencies](https://universaldependencies.org/u/pos/)\nor [Penn Treebank](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)\ntag.  
The library works with out-of-vocabulary (OOV) words by applying neural network techniques\nto classify word forms and choose the appropriate morphing rules.\n\nThe system acts as a standalone module or as an extension to the [spaCy](https://spacy.io/) NLP system.\n\nThe dictionary and morphology rules are derived from the\n[NIH's SPECIALIST Lexicon](https://lhncbc.nlm.nih.gov/LSG/Projects/lexicon/current/web/index.html)\nwhich contains an extensive set of information on English word forms.\n\nA simpler, inflection-only system is available as [pyInflect](https://github.com/bjascob/pyInflect).\nLemmInflect was created to address some of the shortcomings of that project and add features, such as...\n\n* Independence from the spaCy lemmatizer\n* Neural nets to disambiguate out-of-vocabulary morphology\n* Unigrams to disambiguate spellings and multiple word forms\n\n\n## Documentation\nFor the latest documentation, see **[ReadTheDocs](https://lemminflect.readthedocs.io/en/latest/)**.\n\n\n## Accuracy of the Lemmatizer\nThe accuracy of LemmInflect and several other popular NLP utilities was tested using the\n[Automatically Generated Inflection Database (AGID)](http://wordlist.aspell.net/other) as a\nbaseline. This is not a \"gold\" standard dataset but it has an extensive list of\nlemmas and their corresponding inflections and appears to be generally a \"good\" set for testing.\nEach inflection was lemmatized by the test software and then compared to the original value in the\ncorpus. 
The test included 119,194 different inflected words.\n\n```\n| Package                      | Verb  |  Noun | ADJ/ADV | Overall |  Speed   |\n|-----------------------------------------------------------------------------|\n| LemmInflect 0.2.3            | 96.1% | 95.4% |  93.9%  |  95.6%  |  42.0 us |\n| Stanza 1.5.0 + CoreNLP 4.5.4 | 94.0% | 96.4% |  93.1%  |  95.5%  |  30.0 us |\n| spaCy 3.5.0                  | 79.5% | 88.9% |  60.5%  |  84.7%  | 393.0 us |\n| NLTK 3.8.1                   | 53.3% | 52.2% |  53.3%  |  52.6%  |  12.0 us |\n|-----------------------------------------------------------------------------|\n```\nSpeed is in microseconds per lemma, measured on an i9-7940x CPU. Note that since Stanza makes\ncalls to the Java CoreNLP software, all 120K test cases were grouped into a single call. For spaCy,\nall pipeline components were disabled except the lemmatizer. The high per-lemma time is probably\na reflection of the general overhead of the pipeline architecture.\n\n\n## Requirements and Installation\nThe only external requirement to run LemmInflect is `numpy`, which is used for the matrix math that drives the neural nets.  These nets are relatively small and don't require significant CPU power to run.\n\nTo install, run:\n\n`pip3 install lemminflect`\n\nThe project was built and tested under Python 3 and Ubuntu but should run on any Linux, Windows, or Mac system.  It is untested under Python 2 but may function in that environment with minimal or no changes.\n\nThe code base also includes library functions and scripts to create the various data files and neural nets.  This includes such things as...\n* Unigram extraction from the Gutenberg and Billion Word Corpora\n* Python scripts for loading and parsing the SPECIALIST Lexicon\n* Neural network training based on Keras and TensorFlow\n\nNone of these are required for run-time operation.  
However, if you want to modify the system, see the [documentation](https://lemminflect.readthedocs.io/en/latest/test_dev/) for more info.\n\n\n## Library Usage\nTo lemmatize a word, use the method `getLemma()`.  This takes a word and a Universal Dependencies tag and returns the lemmas as a tuple of possible spellings.  The dictionary system is used first, and if no lemma is found, the rules system is employed.\n```\n\u003e from lemminflect import getLemma\n\u003e getLemma('watches', upos='VERB')\n('watch',)\n```\nTo inflect words, use the method `getInflection`.   This takes a lemma and a Penn Treebank tag and returns a tuple of the specific inflection(s) associated with that tag.  As above, the dictionary is used first and then inflection rules are applied if needed.\n```\n\u003e from lemminflect import getInflection\n\u003e getInflection('watch', tag='VBD')\n('watched',)\n\n\u003e getInflection('xxwatch', tag='VBD')\n('xxwatched',)\n```\nThe library provides lower-level functions to access the dictionary and the OOV rules directly.  For a detailed description see [Lemmatizer](https://lemminflect.readthedocs.io/en/latest/lemmatizer/) or [Inflections](https://lemminflect.readthedocs.io/en/latest/inflections/).\n\n\n## Usage as a spaCy Extension\nTo use as an extension, you need spaCy version 2.0 or later.  Versions 1.9 and earlier do not support the extension methods used here.\n\nTo set up the extension, first import `lemminflect`.  This will create new `lemma` and `inflect` methods for each spaCy `Token`. 
These methods operate like the functions described above, except that each returns a string containing the most common spelling rather than a tuple.\n```\n\u003e import spacy\n\u003e import lemminflect\n\u003e nlp = spacy.load('en_core_web_sm')\n\u003e doc = nlp('I am testing this example.')\n\u003e doc[2]._.lemma()\ntest\n\n\u003e doc[4]._.inflect('NNS')\nexamples\n```\n\n## Issues\nIf you find a bug, please report it on the [GitHub issues list](https://github.com/bjascob/LemmInflect/issues).  However, be aware that when it comes to returning the correct inflection there are a number of different types of issues that can arise.  Some of these are not readily fixable.  Issues with inflected forms include...\n* Multiple spellings for an inflection (e.g. arthroplasties, arthroplastyes or arthroplastys)\n* Mass form and plural types (e.g. people vs peoples)\n* Forms that depend on context (e.g. further vs farther)\n* Inflections that are not fully specified by the tag (e.g. be/VBD can be \"was\" or \"were\")\n\nOne common issue is that some forms of the verb \"be\" are not completely specified by the Treebank tag.  For instance, be/VBD inflects to either \"was\" or \"were\" and be/VBP inflects to either \"am\" or \"are\".  In order to disambiguate these forms, other words in the sentence need to be inspected.  At this time, LemmInflect doesn't include this functionality.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbjascob%2Flemminflect","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbjascob%2Flemminflect","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbjascob%2Flemminflect/lists"}