{"id":14988051,"url":"https://github.com/apache/opennlp-models","last_synced_at":"2025-06-14T13:37:26.185Z","repository":{"id":240868846,"uuid":"734715688","full_name":"apache/opennlp-models","owner":"apache","description":"Apache OpenNLP Models","archived":false,"fork":false,"pushed_at":"2025-06-02T05:16:35.000Z","size":235,"stargazers_count":9,"open_issues_count":0,"forks_count":2,"subscribers_count":12,"default_branch":"main","last_synced_at":"2025-06-08T11:52:21.330Z","etag":null,"topics":["apache","compling","languagetechnology","nlp","opennlp","textprocessing"],"latest_commit_sha":null,"homepage":"https://opennlp.apache.org/","language":"Shell","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/apache.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":".github/CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-12-22T12:17:41.000Z","updated_at":"2025-06-02T05:16:34.000Z","dependencies_parsed_at":"2024-05-21T07:21:10.171Z","dependency_job_id":"25a93cfd-3ec3-448a-b3bc-49279fb153db","html_url":"https://github.com/apache/opennlp-models","commit_stats":{"total_commits":25,"total_committers":4,"mean_commits":6.25,"dds":"0.31999999999999995","last_synced_commit":"eddbca21b2b2c9f4099dae086cd54bfdef292426"},"previous_names":["apache/opennlp-models"],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apache%2Fopennlp-models","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apache%2Fopennlp-models/tags","releases_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/apache%2Fopennlp-models/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apache%2Fopennlp-models/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/apache","download_url":"https://codeload.github.com/apache/opennlp-models/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apache%2Fopennlp-models/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":258875793,"owners_count":22771411,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache","compling","languagetechnology","nlp","opennlp","textprocessing"],"created_at":"2024-09-24T14:16:00.561Z","updated_at":"2025-06-14T13:37:26.166Z","avatar_url":"https://github.com/apache.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!--\nLicensed to the Apache Software Foundation (ASF) under one or more\ncontributor license agreements.  See the NOTICE file distributed with\nthis work for additional information regarding copyright ownership.\nThe ASF licenses this file to You under the Apache License, Version 2.0\n(the \"License\"); you may not use this file except in compliance with\nthe License.  
You may obtain a copy of the License at\n\n    http://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License.\n--\u003e\n\nWelcome to Apache OpenNLP Models!\n===========\n\n[![GitHub license](https://img.shields.io/badge/license-Apache%202-blue.svg)](https://raw.githubusercontent.com/apache/opennlp-models/main/LICENSE)\n[![Maven Central](https://maven-badges.herokuapp.com/maven-central/org.apache.opennlp/opennlp-models/badge.svg)](https://maven-badges.herokuapp.com/maven-central/org.apache.opennlp/opennlp-models)\n[![Build Status](https://github.com/apache/opennlp-models/workflows/Java%20CI/badge.svg)](https://github.com/apache/opennlp-models/actions)\n[![Contributors](https://img.shields.io/github/contributors/apache/opennlp-models)](https://github.com/apache/opennlp-models/graphs/contributors)\n[![GitHub pull requests](https://img.shields.io/github/issues-pr-raw/apache/opennlp-models.svg)](https://github.com/apache/opennlp-models/pulls)\n[![Stack Overflow](https://img.shields.io/badge/stack%20overflow-opennlp-f1eefe.svg)](https://stackoverflow.com/questions/tagged/opennlp)\n\nThe Apache OpenNLP library provides binary models for processing natural language text. \nThis repository is intended for the distribution of model files as Maven artifacts.\n\n## Useful Links\n\nFor additional information, visit the [OpenNLP Home Page](https://opennlp.apache.org/models.html).\n\nYou can use OpenNLP with many languages. Additional demo models are provided [here](https://opennlp.sourceforge.net/models-1.5/).\n\nThe models are fully compatible with the latest [OpenNLP release](https://opennlp.apache.org/download.html). 
They can be used for testing or getting started.\n\n\u003e [!NOTE]  \n\u003e Please train your own models for all other, specialized use cases.\n\nDocumentation, including JavaDocs, code usage and command-line interface examples, is available [here](https://opennlp.apache.org/docs/).\n\nYou can also follow our [mailing lists](https://opennlp.apache.org/mailing-lists.html) for news and updates.\n\n## Overview\n\nWe provide **Tokenizer**, **Sentence Detector** and **Part-of-Speech Tagger** models for the following 32 languages:\n\n   - Armenian\n   - Basque\n   - Bulgarian\n   - Catalan\n   - Croatian\n   - Czech\n   - Danish\n   - Dutch\n   - English\n   - Estonian\n   - Finnish\n   - French\n   - Georgian\n   - German\n   - Greek\n   - Icelandic\n   - Italian\n   - Kazakh\n   - Korean\n   - Latvian\n   - Norwegian\n   - Polish\n   - Portuguese\n   - Romanian\n   - Russian\n   - Serbian\n   - Slovak\n   - Slovenian\n   - Spanish\n   - Swedish\n   - Turkish\n   - Ukrainian\n\nThese models are compatible with OpenNLP `\u003e= 1.0.0`. Further details are available on the [OpenNLP Models](https://opennlp.apache.org/models.html) \npage and in the [CHANGELOG](https://dist.apache.org/repos/dist/release/opennlp/models/ud-models-1.2/CHANGES).\n\nIn addition, we provide a **Language Detector**, which is able to detect 103 languages identified by their ISO 639-3 codes. \nIt works well with longer texts that contain at least two sentences in the same language. \n\nIt is compatible with OpenNLP `\u003e= 1.8.3`. Model details are available [here](https://downloads.apache.org/opennlp/models/langdetect/1.8.3/).\n\n## Getting Started\n\nThe [Universal Dependencies](https://universaldependencies.org) (UD) community provides a framework for consistent annotation of grammar across different human languages.\nThe project is developing cross-linguistically consistent treebank annotation for 150+ languages.
\n\n### Referencing published Models\n\nYou can import UD-based model artifacts directly via Maven, SBT or Gradle, for instance:\n\n#### Maven\n\n```\n\u003cdependency\u003e\n    \u003cgroupId\u003eorg.apache.opennlp\u003c/groupId\u003e\n    \u003cartifactId\u003eopennlp-models-pos-de\u003c/artifactId\u003e\n    \u003cversion\u003e${opennlp.models.version}\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\nArtifacts are available for all **32** supported languages, which are listed on the Apache OpenNLP [Model page](https://opennlp.apache.org/models.html).\n\nThe broader langdetect model can be referenced like this:\n\n```\n\u003cdependency\u003e\n    \u003cgroupId\u003eorg.apache.opennlp\u003c/groupId\u003e\n    \u003cartifactId\u003eopennlp-models-langdetect\u003c/artifactId\u003e\n    \u003cversion\u003e${opennlp.models.version}\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n#### SBT\n\n```\nlibraryDependencies += \"org.apache.opennlp\" % \"opennlp-models-langdetect\" % \"${opennlp.models.version}\"\n```\n\n#### Gradle\n\n```\nimplementation group: \"org.apache.opennlp\", name: \"opennlp-models-langdetect\", version: \"${opennlp.models.version}\"\n```\n\nFor more details, please check our [documentation](https://opennlp.apache.org/docs/).\n\n\n### Training Models\n\nAll released _sentence detection_, _tokenization_, _lemmatizer_, and _POS tagging_ models were trained, and can be retrained, via the `ud-train.sh` script.\nIt is located in the _opennlp-models-training-ud_ directory in this repository. \n\n#### Preparing the environment\n\nBefore training UD-based OpenNLP models, the execution environment needs the latest [OpenNLP release](https://opennlp.apache.org/download.html) and the latest set of [UD treebanks](https://universaldependencies.org/#download).\nDownload the corresponding archive files and extract both into the directory that contains the training script.\nRename the two extracted folders to match the `OPENNLP_HOME` and `UD_HOME` variables. 
\n\n\u003e [!IMPORTANT]\n\u003e Check the version string in both variables and adjust it to the versions you have actually downloaded. \n\n#### Selecting model types\n\nNext, select which types of models should be trained. By default, the script defines:\n\n```\nTRAIN_TOKENIZER=\"true\"\nTRAIN_POSTAGGER=\"true\"\nTRAIN_SENTDETECT=\"true\"\nTRAIN_LEMMATIZER=\"true\"\n```\n\nTo switch off a model type, set the corresponding variable to `false`.\n\n#### Selecting languages\n\nBy default, the treebanks of the 32 supported languages are included in the `MODELS` variable of the script.\nIf only a smaller or different (sub-)set is required, this variable can simply be edited.\nEach entry must follow the format `\u003cLanguage\u003e|\u003c2-digit-locale-code\u003e|\u003cUD treebank name\u003e`, for example: `English|en|EWT` or `Swedish|sv|Talbanken`.\n\n\u003e [!NOTE]\n\u003e The full list of supported languages and related treebanks is available [here](https://universaldependencies.org/#current-ud-languages).\n\u003e Yet, even for a treebank listed on the UD page, training OpenNLP models might not succeed. If it succeeds, check the evaluation logs (_*.eval_) to see whether the computed accuracy meets your expectations.\n\n#### Adjusting training parameters\n\nOnce you're done with the preparations, check the `ud-train.conf` file. With this config file, you can adjust the number of threads used for certain training steps. \nMoreover, it is possible to adjust the number of iterations (default: 150) to achieve (slightly) better model performance.\n\n#### Executing 'ud-train.sh'\n\nMake sure the `ud-train.sh` script is executable. 
\nOn Unix-like environments this can simply be achieved by setting the execute bit: `chmod 744 ud-train.sh`.\n\n\u003e [!TIP]\n\u003e As model training can be a long-running task, depending on CPU type and number of CPU cores,\n\u003e the script should be started inside a [`screen`](https://www.man7.org/linux/man-pages/man1/screen.1.html) instance.\n\nFinally, execute the script by invoking `./ud-train.sh` and start brewing and enjoying some :coffee:.\n\nThe script logs each training (and evaluation) step per selected language / treebank, thus allowing progress tracking. \n\n#### Evaluating trained Models\n\nAfter a training step succeeds, a corresponding evaluation step is executed. If you want to skip it, set `EVAL_AFTER_TRAINING` to `false`.\nIf the evaluation is run, the resulting performance (accuracy) is written to files ending with `.eval`.\n\n### Adding new Models\n\nWhen adding new models to the `pom.xml`, make sure to also add them to the `expected-models.txt` file located in `opennlp-models-test`.\nIn addition, make sure a sha256 hash is computed for each binary artifact. \nThe corresponding value must be set or updated correctly for each model type and language.\n\n## Contributing\n\nThe Apache OpenNLP project is developed by volunteers and is always looking for new contributors to work on all parts of the project. Every contribution is welcome and needed to make it better. 
A contribution can be anything from a small documentation typo fix to a new component.\n\nIf you would like to get involved, please follow the instructions [here](https://github.com/apache/opennlp/blob/main/.github/CONTRIBUTING.md).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapache%2Fopennlp-models","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fapache%2Fopennlp-models","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapache%2Fopennlp-models/lists"}