https://github.com/yahoojapan/vespa-kuromoji-linguistics
- Host: GitHub
- URL: https://github.com/yahoojapan/vespa-kuromoji-linguistics
- Owner: yahoojapan
- License: apache-2.0
- Created: 2018-03-08T05:17:44.000Z (about 7 years ago)
- Default Branch: main
- Last Pushed: 2024-04-03T07:04:32.000Z (about 1 year ago)
- Last Synced: 2025-03-26T16:11:57.675Z (2 months ago)
- Topics: kuromoji, vespa, vespa-linguistics
- Language: Java
- Size: 50.8 KB
- Stars: 15
- Watchers: 6
- Forks: 8
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
# Vespa Linguistics with Kuromoji Tokenizer
## Overview
This package provides a Japanese tokenizer for Vespa, built on Kuromoji.
Kuromoji is a well-known Japanese tokenizer. It is implemented in Java and used by services such as Solr and Elasticsearch.
For more details, see the official Kuromoji website.

* [Kuromoji](http://www.atilika.org/)
## Create Package
### Requirement
JDK (>= 11) and Maven are required to build the package.
### Build
Run the mvn command below. The built package is written to target/kuromoji-linguistics-${VERSION}-deploy.jar.
```
$ mvn package -Dvespa.version='7.594.36' # You can specify 7.594.36 or later.
```
## Use Package
### Deploy
Put the built package in the components directory of your service. If there is no components directory, create it. For example, with sampleapps the structure looks like this:

* sampleapps/search/music/
  * services.xml
  * components/
    * kuromoji-linguistics-${VERSION}-deploy.jar

### Configuration
Because the package is used by both the searcher and the indexer, it is recommended to define <component> in every <jdisc> section of services.xml.
```
search
true
```
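
As a rough sketch of how such a definition might be laid out in services.xml: the config name `language.lib.kuromoji.kuromoji` and the parameter names `mode` and `ignore_case` come from this README, but the component id is a placeholder and the bundle name is only assumed from the jar file name. Check the repository for the actual class and bundle names before using it.

```xml
<!-- Sketch only, not a verified configuration. -->
<jdisc id="container" version="1.0">
  <!-- ASSUMPTION: replace the id with the linguistics class shipped in the jar.
       ASSUMPTION: the bundle name is guessed from kuromoji-linguistics-${VERSION}-deploy.jar. -->
  <component id="REPLACE_WITH_LINGUISTICS_CLASS" bundle="kuromoji-linguistics">
    <!-- Optional: leave out <config> to use the default settings. -->
    <config name="language.lib.kuromoji.kuromoji">
      <mode>search</mode>
      <ignore_case>true</ignore_case>
    </config>
  </component>
  <!-- ... the rest of the container settings ... -->
</jdisc>
```

If the application has more than one <jdisc> section, the same <component> element would be repeated in each of them so that indexing and searching tokenize text the same way.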
You can optionally configure the package with <config name="language.lib.kuromoji.kuromoji">. The parameters and their default settings are listed below.
|parameter|type|default|description|
|:--------|:---|:------|:----------|
|mode|string|search|mode of Kuromoji (normal OR search OR extended)|
|kanji.length_threshold|int|2|length threshold above which kanji tokens are penalized during the Viterbi search (expert feature)|
|kanji.penalty|int|3000|additional cost for kanji tokens longer than the kanji length threshold (expert feature)|
|other.length_threshold|int|7|length threshold above which non-kanji tokens are penalized during the Viterbi search (expert feature)|
|other.penalty|int|1700|additional cost for non-kanji tokens longer than the non-kanji length threshold (expert feature)|
|nakaguro_split|bool|false|whether to split unknown words on the middle dot character (U+30FB KATAKANA MIDDLE DOT)|
|user_dict|string|-|path to the user dictionary|
|tokenlist_name|string|default|name of the target specialtokens list|
|all_language|bool|false|whether to apply the Kuromoji tokenizer to all languages (true) or only to Japanese (false)|
|ignore_case|bool|true|whether to ignore upper/lower case differences|

### Activate
Use the deploy command to activate the package. For example, with sampleapps the commands look like this:
```
$ vespa-deploy prepare sampleapps/search/music/
$ vespa-deploy activate
```
Now you can use the tokenizer with the "language=ja" option.
## License
Code licensed under the Apache 2.0 license. See LICENSE for terms.
## Contributor License Agreement
This project requires contributors to agree to a [Contributor License
Agreement (CLA)](https://gist.github.com/yahoojapanoss/9bf8afd6ea67f32d29b4082abf220340).
Note that, only for contributions to the vespa-kuromoji-linguistics repository on GitHub (https://github.com/yahoojapan/vespa-kuromoji-linguistics),
contributors shall be deemed to have agreed to the CLA without individual written agreements.