{"id":33146859,"url":"https://github.com/mikahama/uralicNLP","last_synced_at":"2025-11-15T21:01:14.366Z","repository":{"id":54791453,"uuid":"113453169","full_name":"mikahama/uralicNLP","owner":"mikahama","description":"An NLP library for Uralic languages such as Finnish, Skolt Sami, Moksha and so on. Also supporting some non-Uralic languages such as Spanish, French, Arabic, Swedish, Norwegian, Russian and English. LLMs, FSTs and More!","archived":false,"fork":false,"pushed_at":"2025-11-03T15:44:47.000Z","size":39830,"stargazers_count":84,"open_issues_count":0,"forks_count":7,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-11-09T09:02:34.313Z","etag":null,"topics":["clustering","conll-u","constraint-grammar","dutch","finnish","french","fst","german","large-language-model","lemmatizer","llm","moksha","morphological-analysis","morphological-generation","nlp-library","russian","sami","spanish","swedish","uralic-languages"],"latest_commit_sha":null,"homepage":"http://uralicnlp.com/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mikahama.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":["mikahama"]}},"created_at":"2017-12-07T13:18:45.000Z","updated_at":"2025-11-05T06:32:34.000Z","dependencies_parsed_at":"2023-01-31T19:01:45.463Z","dependency_job_id":"27ab2cce-29c8-471e-97b2-355e533d26b8","html_url":"https://github.com/mi
kahama/uralicNLP","commit_stats":{"total_commits":148,"total_committers":8,"mean_commits":18.5,"dds":"0.43243243243243246","last_synced_commit":"7cd5116ebdd473cb5228c570953eadf354ef3323"},"previous_names":[],"tags_count":24,"template":false,"template_full_name":null,"purl":"pkg:github/mikahama/uralicNLP","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mikahama%2FuralicNLP","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mikahama%2FuralicNLP/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mikahama%2FuralicNLP/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mikahama%2FuralicNLP/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mikahama","download_url":"https://codeload.github.com/mikahama/uralicNLP/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mikahama%2FuralicNLP/sbom","scorecard":{"id":644883,"data":{"date":"2025-08-11","repo":{"name":"github.com/mikahama/uralicNLP","commit":"3597b4c9f9d36d1f78ead4740ad2fd9b9e8676c6"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":4.7,"checks":[{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and 
uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Token-Permissions","score":0,"reason":"detected GitHub workflow tokens with excessive permissions","details":["Warn: no topLevel permission defined: .github/workflows/codeql-analysis.yml:1","Info: no jobLevel write permissions found"],"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Security-Policy","score":10,"reason":"security policy file detected","details":["Info: security policy file detected: SECURITY.md:1","Info: Found linked content: SECURITY.md:1","Info: Found disclosure, vulnerability, and/or timelines in security policy: SECURITY.md:1","Info: Found text in security policy: SECURITY.md:1"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"SAST","score":10,"reason":"SAST tool detected: CodeQL","details":["Info: SAST configuration detected: CodeQL","Warn: no pull requests merged into dev branch"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Dangerous-Workflow","score":10,"reason":"no dangerous workflow patterns detected","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Maintained","score":0,"reason":"1 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 
0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Pinned-Dependencies","score":0,"reason":"dependency not pinned by hash detected -- score normalized to 0","details":["Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/codeql-analysis.yml:28: update your workflow using https://app.stepsecurity.io/secureworkflow/mikahama/uralicNLP/codeql-analysis.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/codeql-analysis.yml:41: update your workflow using https://app.stepsecurity.io/secureworkflow/mikahama/uralicNLP/codeql-analysis.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/codeql-analysis.yml:48: update your workflow using https://app.stepsecurity.io/secureworkflow/mikahama/uralicNLP/codeql-analysis.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/codeql-analysis.yml:62: update your workflow using https://app.stepsecurity.io/secureworkflow/mikahama/uralicNLP/codeql-analysis.yml/master?enable=pin","Info:   0 out of   4 GitHub-owned GitHubAction dependencies pinned"],"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge 
detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: Apache License 2.0: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection 
settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}}]},"last_synced_at":"2025-08-21T11:53:07.533Z","repository_id":54791453,"created_at":"2025-08-21T11:53:07.533Z","updated_at":"2025-08-21T11:53:07.533Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":284552389,"owners_count":27024735,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-15T02:00:06.050Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clustering","conll-u","constraint-grammar","dutch","finnish","french","fst","german","large-language-model","lemmatizer","llm","moksha","morphological-analysis","morphological-generation","nlp-library","russian","sami","spanish","swedish","uralic-languages"],"created_at":"2025-11-15T13:00:39.926Z","updated_at":"2025-11-15T21:01:14.348Z","avatar_url":"https://github.com/mikahama.png","language":"Python","readme":"\u003ch1 align=\"center\"\u003eUralicNLP\u003c/h1\u003e\n\u003cp align=\"center\"\u003eNatural language processing for many languages\u003c/p\u003e\n\n[![Updates](https://pyup.io/repos/github/mikahama/uralicNLP/shield.svg)](https://pyup.io/repos/github/mikahama/uralicNLP/)  [![Downloads](https://static.pepy.tech/badge/uralicnlp)](https://pepy.tech/project/uralicnlp) 
[![DOI](https://joss.theoj.org/papers/10.21105/joss.01345/status.svg)](https://doi.org/10.21105/joss.01345)\n\n\nUralicNLP can produce **morphological analyses**, **generate morphological forms**, **lemmatize words** and **give lexical information** about words in Uralic and other languages. The languages we support include the following languages: Finnish, Russian, German, English, Norwegian, Swedish, Arabic, Ingrian, Meadow \u0026 Eastern Mari, Votic, Olonets-Karelian, Erzya, Moksha, Hill Mari, Udmurt, Tundra Nenets, Komi-Permyak, North Sami, South Sami and Skolt Sami. Currently, UralicNLP uses stable builds for the supported languages. \n\n[See the catalog of supported languages](http://models.uralicnlp.com/nightly/)\n\nSome of the supported languages: 🇸🇦 🇪🇸 🇮🇹 🇵🇹 🇩🇪 🇫🇷 🇳🇱 🇬🇧 🇷🇺 🇫🇮 🇸🇪 🇳🇴 🇩🇰 🇱🇻 🇪🇪\n\nCheck out [**UralicGUI** - a graphical user interface for UralicNLP](https://github.com/mikahama/uralicGUI).\n\n☕ Check out UralicNLP [official Java version](https://github.com/mikahama/uralicNLP-Java)\n\n♯ Check out UralicNLP [official C# version](https://github.com/mikahama/uralicNLP.net)\n\n## Installation\n\nThe library can be installed from [PyPi](https://pypi.python.org/pypi/uralicNLP/).\n\n    pip install uralicNLP\n   \nIf you want to use the Constraint Grammar features (*from uralicNLP.cg3 import Cg3*), you will also need to install VISL CG-3.\n\n## MCP\n\nWho said LLMs don't speak endangered languages? UralicNLP now supports MCP! Connect UralicNLP main functionality directly to your favorite MCP supporting LLM! [Read more in the UralicMCP wiki](https://github.com/mikahama/uralicNLP/wiki/UralicMCP).\n\n## Large language models (LLMs)\n\nUralicNLP supports a wide range of LLMs and it can even embed text in some endangered languages [Check out LLMs](https://github.com/mikahama/uralicNLP/wiki/Large-Language-Models).\n\nUralicNLP can cluster texts into semantically similar categories. 
[Learn more about clustering](https://github.com/mikahama/uralicNLP/wiki/Semantics).\n\n## List supported languages\nThe API is under constant development and new languages will be added to the nightly builds system. That's why UralicNLP provides a functionality for looking up the list of currently supported languages. The method returns 3 letter ISO codes for the languages.\n\n    from uralicNLP import uralicApi\n    uralicApi.supported_languages()\n    \u003e\u003e{'cg': ['vot', 'lav', 'izh', 'rus', 'lut', 'fao', 'est', 'nob', 'ron', 'olo', 'bxr', 'hun', 'crk', 'chr', 'vep', 'deu', 'mrj', 'gle', 'sjd', 'nio', 'myv', 'som', 'sma', 'sms', 'smn', 'kal', 'bak', 'kca', 'otw', 'ciw', 'fkv', 'nds', 'kpv', 'sme', 'sje', 'evn', 'oji', 'ipk', 'fit', 'fin', 'mns', 'rmf', 'liv', 'cor', 'mdf', 'yrk', 'tat', 'smj'], 'dictionary': ['vot', 'lav', 'rus', 'est', 'nob', 'ron', 'olo', 'hun', 'koi', 'chr', 'deu', 'mrj', 'sjd', 'myv', 'som', 'sma', 'sms', 'smn', 'kal', 'fkv', 'mhr', 'kpv', 'sme', 'sje', 'hdn', 'fin', 'mns', 'mdf', 'vro', 'udm', 'smj'], 'morph': ['vot', 'lav', 'izh', 'rus', 'lut', 'fao', 'est', 'nob', 'swe', 'ron', 'eng', 'olo', 'bxr', 'hun', 'koi', 'crk', 'chr', 'vep', 'deu', 'mrj', 'ara', 'gle', 'sjd', 'nio', 'myv', 'som', 'sma', 'sms', 'smn', 'kal', 'bak', 'kca', 'otw', 'ciw', 'fkv', 'nds', 'mhr', 'kpv', 'sme', 'sje', 'evn', 'oji', 'ipk', 'fit', 'fin', 'mns', 'rmf', 'liv', 'cor', 'mdf', 'yrk', 'vro', 'udm', 'tat', 'smj']}\n\nThe *dictionary* key lists the languages that are supported by the lexical lookup, whereas *morph* lists the languages that have morphological FSTs and *cg* lists the languages that have a CG.\n\n## Download models \n\nOn the command line:\n\n    python -m uralicNLP.download --languages fin eng\n\nFrom python code:\n\n    from uralicNLP import uralicApi\n    uralicApi.download(\"fin\")\n\nWhen models are installed, *generate()*, *analyze()* and *lemmatize()* methods will automatically use them instead of the server side API. 
[More information about the models](https://github.com/mikahama/uralicNLP/wiki/Models).\n\n## Lemmatize words\nA word form can be lemmatized with UralicNLP. This does not do any disambiguation but rather returns a list of all the possible lemmas.\n\n    from uralicNLP import uralicApi\n    uralicApi.lemmatize(\"вирев\", \"myv\")\n    \u003e\u003e['вирев', 'вирь']\n    uralicApi.lemmatize(\"luutapiiri\", \"fin\", word_boundaries=True)\n    \u003e\u003e['luuta|piiri', 'luu|tapiiri']\n  \nAn example of lemmatizing the word *вирев* in Erzya (myv). By default, a **descriptive** analyzer is used. Use *uralicApi.lemmatize(\"вирев\", \"myv\", descriptive=False)* for a non-descriptive analyzer. If *word_boundaries* is set to True, the lemmatizer will mark word boundaries with a |.\n\n## Morphological analysis\nApart from just getting the lemmas, it's also possible to perform a complete morphological analysis.\n\n    from uralicNLP import uralicApi\n    uralicApi.analyze(\"voita\", \"fin\")\n    \u003e\u003e[['voi+N+Sg+Par', 0.0], ['voi+N+Pl+Par', 0.0], ['voitaa+V+Act+Imprt+Prs+ConNeg+Sg2', 0.0], ['voitaa+V+Act+Imprt+Sg2', 0.0], ['voitaa+V+Act+Ind+Prs+ConNeg', 0.0], ['voittaa+V+Act+Imprt+Prs+ConNeg+Sg2', 0.0], ['voittaa+V+Act+Imprt+Sg2', 0.0], ['voittaa+V+Act+Ind+Prs+ConNeg', 0.0], ['vuo+N+Pl+Par', 0.0]]\n  \nAn example of analyzing the word *voita* in Finnish (fin). The default analyzer is **descriptive**. To use a normative analyzer instead, use *uralicApi.analyze(\"voita\", \"fin\", descriptive=False)*.\n\n## Morphological generation\nFrom a lemma and a morphological analysis, it's possible to generate the desired word form. \n\n    from uralicNLP import uralicApi\n    uralicApi.generate(\"käsi+N+Sg+Par\", \"fin\")\n    \u003e\u003e[['kättä', 0.0]]\n  \nAn example of generating the singular partitive form for the Finnish noun *käsi*. The result is *kättä*. The default generator is a **regular normative** generator. 
*uralicApi.generate(\"käsi+N+Sg+Par\", \"fin\", dictionary_forms=True)* uses a normative dictionary generator and *uralicApi.generate(\"käsi+N+Sg+Par\", \"fin\", descriptive=True)* a descriptive generator.\n\n## Morphological segmentation\nUralicNLP makes it possible to split a word form into morphemes. (Note: this does not work with all languages.)\n\n    from uralicNLP import uralicApi\n    uralicApi.segment(\"luutapiirinikin\", \"fin\")\n    \u003e\u003e[['luu', 'tapiiri', 'ni', 'kin'], ['luuta', 'piiri', 'ni', 'kin']]\n\nIn the example, the word _luutapiirinikin_ has two possible interpretations, luu|tapiiri and luuta|piiri, and the segmentation is done for both.\n\n## Disambiguation\n\nThis section has been moved to [UralicNLP wiki page on disambiguation](https://github.com/mikahama/uralicNLP/wiki/Disambiguation).\n\n## Dictionaries\n\nLearn more about dictionaries in [the wiki page on dictionaries](https://github.com/mikahama/uralicNLP/wiki/Dictionaries).\n\n## Parsing UD CoNLL-U annotated TreeBank data\n\nUralicNLP comes with tools for parsing and searching CoNLL-U formatted data. Please refer to [the Wiki for the UD parser documentation](https://github.com/mikahama/uralicNLP/wiki/UD-parser).\n\n\n## Other functionalities\n\n- [Machine Translation](https://github.com/mikahama/uralicNLP/wiki/Machine-Translation)\n- [Finnish Dependency Parsing](https://github.com/mikahama/uralicNLP/wiki/Dependency-parsing)\n- [ISO code to language name](https://github.com/mikahama/uralicNLP/wiki/uralicNLP.string_processing#iso_to_name)\n- [Tokenization](https://github.com/mikahama/uralicNLP/wiki/Tokenization)\n\n# Cite\n\nIf you use UralicNLP in an academic publication, please cite it as follows:\n\nHämäläinen, Mika. (2019). UralicNLP: An NLP Library for Uralic Languages. Journal of Open Source Software, 4(37), [1345]. 
https://doi.org/10.21105/joss.01345\n\n    @article{uralicnlp_2019, \n        title={{UralicNLP}: An {NLP} Library for {U}ralic Languages},\n        DOI={10.21105/joss.01345}, \n        journal={Journal of Open Source Software}, \n        author={Mika Hämäläinen}, \n        year={2019}, \n        volume={4},\n        number={37},\n        pages={1345}\n    }\n\nFor citing the FSTs and CGs, see *uralicApi.model_info(language)*.\n\nThe FST and CG tools and dictionaries come mostly from the [GiellaLT repositories](https://github.com/giellalt) and [Apertium](https://github.com/apertium).\n\n","funding_links":["https://github.com/sponsors/mikahama"],"categories":["Natural Language Processing","Uralic"],"sub_categories":["Internationalization and Localization (i18n/l10n)"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmikahama%2FuralicNLP","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmikahama%2FuralicNLP","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmikahama%2FuralicNLP/lists"}