{"id":18862114,"url":"https://github.com/yohasebe/lemmatizer","last_synced_at":"2025-04-06T00:09:40.796Z","repository":{"id":5244979,"uuid":"6422522","full_name":"yohasebe/lemmatizer","owner":"yohasebe","description":"Lemmatizer for text in English.  Inspired by Python's nltk.corpus.reader.wordnet.morphy","archived":false,"fork":false,"pushed_at":"2021-10-14T08:28:17.000Z","size":1967,"stargazers_count":108,"open_issues_count":2,"forks_count":15,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-03-29T23:09:43.414Z","etag":null,"topics":["lemmatizer","nlp","ruby","rubynlp","wordnet"],"latest_commit_sha":null,"homepage":null,"language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"showlowtech/azure-mobile-services","license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yohasebe.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-10-27T23:16:49.000Z","updated_at":"2024-09-15T01:44:53.000Z","dependencies_parsed_at":"2022-09-07T06:20:28.792Z","dependency_job_id":null,"html_url":"https://github.com/yohasebe/lemmatizer","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yohasebe%2Flemmatizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yohasebe%2Flemmatizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yohasebe%2Flemmatizer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yohasebe%2Flemmatizer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yohasebe","download_url":"https://codeload.github.com/yohasebe/lem
matizer/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247415967,"owners_count":20935387,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["lemmatizer","nlp","ruby","rubynlp","wordnet"],"created_at":"2024-11-08T04:33:20.770Z","updated_at":"2025-04-06T00:09:40.761Z","avatar_url":"https://github.com/yohasebe.png","language":"Ruby","funding_links":[],"categories":["Language Parsing Tools","NLP Pipeline Subtasks"],"sub_categories":["NLP / NLU","Lexical Processing"],"readme":"lemmatizer\n==========\nLemmatizer for text in English.  Inspired by Python's [nltk.corpus.reader.wordnet.morphy](orpusReader.morphy) package.\n\nBased on code posted by mtbr at his blog entry [WordNet-based lemmatizer](http://d.hatena.ne.jp/mtbr/20090303/prfrnlprubyWordNetbasedlemmatizer)\n\nVersion 0.2 has added functionality to add user supplied data at runtime \n\nInstallation\n------------\n    sudo gem install lemmatizer\n    \n\nUsage\n-----\n```ruby\nrequire \"lemmatizer\"\n  \nlem = Lemmatizer.new\n  \np lem.lemma(\"dogs\",    :noun ) # =\u003e \"dog\"\np lem.lemma(\"hired\",   :verb ) # =\u003e \"hire\"\np lem.lemma(\"hotter\",  :adj  ) # =\u003e \"hot\"\np lem.lemma(\"better\",  :adv  ) # =\u003e \"well\"\n  \n# when part-of-speech symbol is not specified as the second argument, \n# lemmatizer tries :verb, :noun, :adj, and :adv one by one in this order.\np lem.lemma(\"fired\")           # =\u003e \"fire\"\np lem.lemma(\"slow\")            # =\u003e \"slow\"\n```\n\nLimitations\n-----------\n```ruby\n# Lemmatizer leaves alone words 
that its dictionary does not contain.\n# This keeps proper names such as \"James\" intact.\np lem.lemma(\"MacBooks\", :noun) # =\u003e \"MacBooks\" \n  \n# If an inflected form is included as a lemma in the word index,\n# the lemmatizer may not return the expected result.\np lem.lemma(\"higher\", :adj) # =\u003e \"higher\" not \"high\"!\n\n# This happens because \"higher\" is itself an entry word listed in dict/index.adj.\n# To fix this, modify the original dict directly (lib/dict/index.{noun|verb|adj|adv}) \n# or supply custom dict files (recommended).\n```\n\nSupplying a user dict\n-----------\n```ruby\n# You can supply custom dict files consisting of lines in the format of \u003cpos\u003e\\s+\u003cform\u003e\\s+\u003clemma\u003e.\n# The data in user-supplied files overrides the preset data. Here is a sample.\n\n# --- sample.dict1.txt (don't include the hash symbol on the left) ---\n# adj   higher   high\n# adj   highest  high\n# noun  MacBooks MacBook\n# ---------------------------------------------------------------\n\nlem = Lemmatizer.new(\"sample.dict1.txt\")\n\np lem.lemma(\"higher\", :adj)     # =\u003e \"high\"\np lem.lemma(\"highest\", :adj)    # =\u003e \"high\"\np lem.lemma(\"MacBooks\", :noun)  # =\u003e \"MacBook\"\n\n# The argument to Lemmatizer.new can be either of the following:\n# 1) a path string to a dict file (e.g. \"/path/to/dict.txt\")\n# 2) an array of paths to dict files (e.g. [\"./dict/noun.txt\", \"./dict/verb.txt\"])\n```\n\nResolving abbreviations\n-----------\n```ruby\n# You can use the 'abbr' tag in user dicts to resolve abbreviations in text.\n\n# --- sample.dict2.txt (don't include the hash symbol on the left) ---\n# abbr  utexas   University of Texas\n# abbr  mit      Massachusetts Institute of Technology\n# ---------------------------------------------------------------\n\n# \u003cNOTE\u003e\n# 1. 
Expressions on the right (substitutes) can contain whitespace, \n#    while expressions in the middle (words to be replaced) cannot.\n# 2. Double or single quotation marks can be used with substitute expressions,\n#    but not with original expressions.\n\nlem = Lemmatizer.new(\"sample.dict2.txt\")\n\np lem.lemma(\"utexas\", :abbr) # =\u003e \"University of Texas\"\np lem.lemma(\"mit\", :abbr)    # =\u003e \"Massachusetts Institute of Technology\"\n```\n\nAuthor\n------\n\n* Yoichiro Hasebe \u003cyohasebe@gmail.com\u003e\n\nThanks for assistance and contributions:\n\n* Vladimir Ivic \u003chttp://vladimirivic.com\u003e\n\nLicense\n-------\nLicensed under the MIT license.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyohasebe%2Flemmatizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyohasebe%2Flemmatizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyohasebe%2Flemmatizer/lists"}