https://github.com/apache/opennlp-models
Apache OpenNLP Models
Apache OpenNLP Models
- Host: GitHub
- URL: https://github.com/apache/opennlp-models
- Owner: apache
- License: apache-2.0
- Created: 2023-12-22T12:17:41.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2025-06-02T05:16:35.000Z (4 months ago)
- Last Synced: 2025-06-08T11:52:21.330Z (4 months ago)
- Topics: apache, compling, languagetechnology, nlp, opennlp, textprocessing
- Language: Shell
- Homepage: https://opennlp.apache.org/
- Size: 229 KB
- Stars: 9
- Watchers: 12
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: .github/CONTRIBUTING.md
- License: LICENSE
Welcome to Apache OpenNLP Models!
===========

The Apache OpenNLP library provides binary models for processing of natural language text.
This repository is intended for the distribution of model files as Maven artifacts.

## Useful Links
For additional information, visit the [OpenNLP Home Page](https://opennlp.apache.org/models.html).
You can use OpenNLP with many languages. Additional demo models are provided [here](https://opennlp.sourceforge.net/models-1.5/).
The models are fully compatible with the latest [OpenNLP release](https://opennlp.apache.org/download.html). They can be used for testing or getting started.
> [!NOTE]
> Please train your own models for all other, specialized use cases.

Documentation, including JavaDocs, code usage, and command-line interface examples, is available [here](https://opennlp.apache.org/docs/).
You can also follow our [mailing lists](https://opennlp.apache.org/mailing-lists.html) for news and updates.
## Overview
We provide **Tokenizer**, **Sentence Detector** and **Part-of-Speech Tagger** models for the following 32 languages:
- Armenian
- Basque
- Bulgarian
- Catalan
- Croatian
- Czech
- Danish
- Dutch
- English
- Estonian
- Finnish
- French
- Georgian
- German
- Greek
- Icelandic
- Italian
- Kazakh
- Korean
- Latvian
- Norwegian
- Polish
- Portuguese
- Romanian
- Russian
- Serbian
- Slovak
- Slovenian
- Spanish
- Swedish
- Turkish
- Ukrainian

These models are compatible with OpenNLP `>= 1.0.0`. Further details are available at the [OpenNLP Models](https://opennlp.apache.org/models.html)
page and in the [CHANGELOG](https://dist.apache.org/repos/dist/release/opennlp/models/ud-models-1.2/CHANGES).

In addition, we provide a **Language Detector**, which is able to detect 103 languages, identified by their ISO 639-3 codes.
It works well with longer texts that contain at least two sentences in the same language.

It is compatible with OpenNLP `>= 1.8.3`. Model details are available [here](https://downloads.apache.org/opennlp/models/langdetect/1.8.3/).
## Getting Started
The [Universal Dependencies](https://universaldependencies.org) (UD) community provides a framework for consistent annotation of grammar across different human languages.
The project is developing cross-linguistically consistent treebank annotation for 150+ languages.

### Referencing published Models
You can import UD-based model artifacts directly via Maven, SBT or Gradle, for instance:
#### Maven
```
<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-models-pos-de</artifactId>
  <version>${opennlp.models.version}</version>
</dependency>
```
Such artifacts are available for all **32** supported languages, listed on the Apache OpenNLP [Model page](https://opennlp.apache.org/models.html).
The broader langdetect model can be referenced like this:
```
<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-models-langdetect</artifactId>
  <version>${opennlp.models.version}</version>
</dependency>
```
#### SBT
```
libraryDependencies += "org.apache.opennlp" % "opennlp-models-langdetect" % "${opennlp.version}"
```

#### Gradle
```
implementation group: "org.apache.opennlp", name: "opennlp-models-langdetect", version: "${opennlp.version}"
```

For more details, please check our [documentation](https://opennlp.apache.org/docs/).
### Training Models
All released _sentence detection_, _tokenization_, _lemmatizer_, and _POS tagging_ models were trained, and can be retrained, via the `ud-train.sh` script.
It is located in the _opennlp-models-training-ud_ directory in this repository.

#### Preparing the environment
Before training UD-based OpenNLP models, the execution environment needs the latest [OpenNLP release](https://opennlp.apache.org/download.html) and the latest set of [UD treebanks](https://universaldependencies.org/#download).
Download the corresponding archive files and uncompress them both in the same directory in which the training script resides.
Rename both folders according to the `OPENNLP_HOME` and `UD_HOME` variables.

> [!IMPORTANT]
> Check and adjust the version string in both variables so that they match the versions you have actually downloaded.

#### Selecting model types
Next, select what type of models should be trained. By default, the script defines:
```
TRAIN_TOKENIZER="true"
TRAIN_POSTAGGER="true"
TRAIN_SENTDETECT="true"
TRAIN_LEMMATIZER="true"
```

To switch off a certain model type, set the corresponding variable to `false`.
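
As a minimal sketch, a flag can also be flipped from the command line instead of opening the script in an editor. The helper name `set_flag` is hypothetical and not part of `ud-train.sh`, and the sketch assumes each flag is defined on its own line starting at column 0, as in the listing above.

```shell
# set_flag FILE NAME VALUE
# Hypothetical helper (not part of ud-train.sh): rewrites a line of the form
# NAME="..." to NAME="VALUE". Assumes the assignment starts at column 0.
# A backup copy of the original file is kept as FILE.bak.
set_flag() {
  sed -i.bak -E "s/^($2)=.*/\1=\"$3\"/" "$1"
}

# Example (assumes ud-train.sh is in the current directory):
#   set_flag ud-train.sh TRAIN_LEMMATIZER false
```

Keeping the `.bak` copy makes it easy to restore the defaults if a training run needs to be repeated with all model types enabled.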
#### Selecting languages
By default, treebanks of 32 supported languages are included in the `MODELS` variable of the script.
If only a smaller or different (sub-)set is required, this variable can simply be edited.
Each entry must follow the format `<language>|<2-digit-locale-code>|<treebank>`, for example: `English|en|EWT` or `Swedish|sv|Talbanken`.

> [!NOTE]
> The full list of supported languages and related treebanks is available [here](https://universaldependencies.org/#current-ud-languages).
> However, even for treebanks listed on the UD page, training OpenNLP models might not succeed. If it does succeed, check the evaluation logs (_*.eval_) to see whether the computed accuracy meets your expectations.
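
To illustrate the entry format, the following sketch splits a single entry into its three fields and rejects entries with a missing field. The helper `parse_model_entry` is hypothetical; how `ud-train.sh` itself parses the `MODELS` variable is not shown here and may differ.

```shell
# parse_model_entry ENTRY
# Hypothetical helper: splits an entry such as "English|en|EWT" on '|'
# into language name, locale code and treebank name.
parse_model_entry() {
  IFS='|' read -r language code treebank <<EOF
$1
EOF
  # All three fields must be present.
  if [ -z "$language" ] || [ -z "$code" ] || [ -z "$treebank" ]; then
    echo "malformed entry: $1" >&2
    return 1
  fi
  printf 'language=%s code=%s treebank=%s\n' "$language" "$code" "$treebank"
}

# Example:
#   parse_model_entry 'English|en|EWT'
```

Running such a check on every edited entry before a long training run can catch a missing separator early.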
#### Adjusting training parameters

Once you're done with the preparations, check the `ud-train.conf` file. With this config file, you can adjust the number of threads used for certain training steps.
Moreover, it is possible to adjust the number of iterations (default: 150) to achieve (slightly) better model performance.

#### Executing 'ud-train.sh'
Make sure the `ud-train.sh` script is executable.
On Unix-like environments, this can be achieved by setting the execute bit: `chmod 744 ud-train.sh`.

> [!TIP]
> As model training can be a long-running task, depending on CPU type and number of CPU cores,
> the script should be started inside a [`screen`](https://www.man7.org/linux/man-pages/man1/screen.1.html) instance.

Finally, execute the script by invoking `./ud-train.sh` and start brewing and enjoying some :coffee:.
The script logs each training (and evaluation) step per selected language / treebank, thus allowing progress tracking.
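
The steps above can be sketched as a small wrapper that sets the execute bit, runs the script, and keeps a copy of its output for later inspection. The function name `run_training` and the log file name `ud-train.log` are hypothetical, and the sketch assumes the script writes its progress to standard output or standard error.

```shell
# run_training [SCRIPT] [LOGFILE]
# Hypothetical wrapper: makes the training script executable, runs it, and
# copies everything it prints into a log file while still showing it live.
run_training() {
  script=${1:-./ud-train.sh}
  log=${2:-ud-train.log}
  chmod 744 "$script" || return 1
  "$script" 2>&1 | tee "$log"
}

# Example, started inside a screen session so the run survives a dropped
# terminal connection:
#   screen -S ud-train
#   run_training
```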
#### Evaluating trained Models
After a training step succeeds, a corresponding evaluation step is executed. If you want to skip it, set `EVAL_AFTER_TRAINING` to `false`.
If the evaluation is run, the resulting performance (accuracy) is written to files ending with `.eval`.

### Adding new Models
When adding new models to the `pom.xml`, make sure to also add them to the `expected-models.txt` file located in `opennlp-models-test`.
In addition, make sure a sha256 hash is computed on each binary artifact.
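
A minimal sketch of computing such hashes, assuming GNU coreutils' `sha256sum` is available. The helper name `print_sha256` is hypothetical, and where the resulting values have to be recorded in this repository is not shown here.

```shell
# print_sha256 FILE...
# Hypothetical helper: prints "<sha256>  <file>" for each given artifact.
print_sha256() {
  for artifact in "$@"; do
    if [ ! -f "$artifact" ]; then
      echo "not a file: $artifact" >&2
      return 1
    fi
    # sha256sum prints "<hash>  <file>"; keep only the hash field.
    digest=$(sha256sum "$artifact" | cut -d ' ' -f 1)
    printf '%s  %s\n' "$digest" "$artifact"
  done
}

# Example (the path is a placeholder for the model files you built):
#   print_sha256 path/to/model-files/*
```
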
The corresponding value must be set or updated correctly for each model type and language.

## Contributing
The Apache OpenNLP project is developed by volunteers and is always looking for new contributors to work on all parts of the project. Every contribution is welcome and needed to make it better. A contribution can be anything from a small documentation typo fix to a new component.
If you would like to get involved, please follow the instructions [here](https://github.com/apache/opennlp/blob/main/.github/CONTRIBUTING.md).