https://github.com/WorksApplications/elasticsearch-sudachi
The Japanese analysis plugin for elasticsearch
- Host: GitHub
- URL: https://github.com/WorksApplications/elasticsearch-sudachi
- Owner: WorksApplications
- License: apache-2.0
- Created: 2017-11-09T10:21:47.000Z (about 7 years ago)
- Default Branch: develop
- Last Pushed: 2024-06-14T07:52:10.000Z (5 months ago)
- Last Synced: 2024-06-17T13:33:15.330Z (5 months ago)
- Topics: elasticsearch-plugin, morphological-analyser
- Language: Kotlin
- Size: 1.3 MB
- Stars: 175
- Watchers: 15
- Forks: 41
- Open Issues: 4
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Funding: .github/FUNDING.yml
- License: LICENSE
README
# analysis-sudachi
analysis-sudachi is an Elasticsearch plugin for tokenizing Japanese text with Sudachi, the Japanese morphological analyzer.
![build](https://github.com/WorksApplications/elasticsearch-sudachi/workflows/build/badge.svg)
[![Quality Gate Status](https://sonarcloud.io/api/project_badges/measure?project=WorksApplications_elasticsearch-sudachi&metric=alert_status)](https://sonarcloud.io/dashboard?id=WorksApplications_elasticsearch-sudachi)

# What's new?
- [3.2.2]
  - Use `lazyTokenizeSentences` for the analysis to fix the problem of input chunking (#137).

Check the [changelog](./CHANGELOG.md) for more.
# Build (if necessary)
1. Build analysis-sudachi.
```
$ ./gradlew -PengineVersion=es:8.13.4 build
```

Use `-PengineVersion=os:2.14.0` for OpenSearch.
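If the build succeeds, the plugin zip should end up under Gradle's default distributions directory (an assumption based on standard Gradle conventions; the exact file name depends on the engine version you passed):

```
$ ls build/distributions/
# e.g. analysis-sudachi-8.13.4-X.Y.Z.zip (name is illustrative); this is the
# zip you can pass to `bin/elasticsearch-plugin install file:///...` below
```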
## Supported Elasticsearch versions
1. 8.0.* to 8.13.* - supported, with integration tests in CI
2. 7.17.* (latest patch version) - supported, with integration tests in CI
3. 7.11.* to 7.16.* - best-effort support, not tested in CI
4. 7.10.* - integration tests for the latest patch version
5. 7.9.* and below - not tested in CI at all, may be broken
6. 7.3.* and below - broken, not supported

## Supported OpenSearch versions
1. 2.6.* to 2.14.* - supported, with integration tests in CI
# Installation
1. Change the current directory to $ES_HOME
2. Install the plugin
a. Using the release package
```
$ bin/elasticsearch-plugin install https://github.com/WorksApplications/elasticsearch-sudachi/releases/download/v3.1.1/analysis-sudachi-8.13.4-3.1.1.zip
```
b. Using a self-built package
```
$ bin/elasticsearch-plugin install file:///path/to/analysis-sudachi-8.13.4-3.1.1.zip
```
(Specify the absolute path in URI format)
3. Download a Sudachi dictionary archive from https://github.com/WorksApplications/SudachiDict
4. Extract the .dic file and place it at config/sudachi/system_core.dic
(You must install system_core.dic at this location if you use Elasticsearch 7.6 or later)
5. Execute `bin/elasticsearch`
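Putting steps 1-5 together, a minimal sketch of the whole flow (the release URL is the example from step 2a; the dictionary path is illustrative):

```
$ cd $ES_HOME
$ bin/elasticsearch-plugin install https://github.com/WorksApplications/elasticsearch-sudachi/releases/download/v3.1.1/analysis-sudachi-8.13.4-3.1.1.zip
$ mkdir -p config/sudachi
$ cp /path/to/extracted/system_core.dic config/sudachi/system_core.dic
$ bin/elasticsearch
```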
## Update Sudachi
If you want to update the Sudachi version bundled in an installed plugin, do the following:
1. Download the latest version of Sudachi from [the release page](https://github.com/WorksApplications/Sudachi/releases).
2. Extract the Sudachi JAR file from the zip.
3. Delete the Sudachi JAR file in $ES_HOME/plugins/analysis-sudachi and replace it with the JAR file you extracted in step 2.
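A sketch of the swap (the JAR file names are illustrative; use the actual versions present on your machine):

```
$ cd $ES_HOME/plugins/analysis-sudachi
$ rm sudachi-0.7.3.jar                          # the currently bundled Sudachi JAR
$ cp /path/to/extracted/sudachi-0.7.4.jar .     # the JAR extracted in step 2
```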
# Analyzer
An analyzer named `sudachi` is provided. It is equivalent to the following custom analyzer:

```json
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"default_sudachi_analyzer": {
"type": "custom",
"tokenizer": "sudachi_tokenizer",
"filter": [
"sudachi_baseform",
"sudachi_part_of_speech",
"sudachi_ja_stop"
]
}
}
}
}
}
}
```

See the following sections for details on the tokenizer and each filter.
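As a quick check that the plugin is active, you can POST the request below to `/_analyze` with the built-in analyzer (a minimal sketch; the exact tokens returned depend on the filters above):

```json
{
  "analyzer": "sudachi",
  "text": "寿司がおいしいね"
}
```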
# Tokenizer
The `sudachi_tokenizer` tokenizer tokenizes input texts using Sudachi.
- split_mode: Selects the splitting mode of Sudachi (A, B, C). (string, default: C; see the example after this list)
  - C: Extracts named entities
    - Ex) 選挙管理委員会
  - B: Splits into middle-sized units
    - Ex) 選挙, 管理, 委員会
  - A: Splits into the shortest units, equivalent to the UniDic short unit
    - Ex) 選挙, 管理, 委員, 会
- discard\_punctuation: Whether to discard punctuation. (bool, default: true)
- settings\_path: Sudachi settings file path. The path may be absolute or relative; relative paths are resolved against the Elasticsearch config directory. (string, default: null)
- resources\_path: Sudachi dictionary path. The path may be absolute or relative; relative paths are resolved against the Elasticsearch config directory. (string, default: null)
- additional_settings: A JSON configuration string for Sudachi, which will be merged into the default configuration. If this property is set, `settings_path` will be overridden.
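For example, the split modes can be compared ad hoc with the `_analyze` API and an inline tokenizer definition (a sketch; per the list above, mode A should yield 選挙, 管理, 委員, 会 for this text, while the default mode C keeps 選挙管理委員会 whole):

```json
{
  "tokenizer": {
    "type": "sudachi_tokenizer",
    "split_mode": "A"
  },
  "text": "選挙管理委員会"
}
```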
## Dictionary
By default, `ES_HOME/config/sudachi/system_core.dic` is used.
You can specify a different dictionary either in the file referenced by `settings_path` or via `additional_settings`.
Due to the security manager, you need to put resources (the settings file, dictionaries, and so on) under the Elasticsearch config directory.
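For instance, you might place a minimal Sudachi settings file at `config/sudachi/sudachi.json` (the file name and content here are illustrative; the `systemDict` key follows the same schema as the `additional_settings` example below) and reference it with `"settings_path": "sudachi/sudachi.json"` in the tokenizer definition:

```json
{
  "systemDict": "system_full.dic"
}
```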
## Example
Tokenizer configuration:
```json
{
"settings": {
"index": {
"analysis": {
"tokenizer": {
"sudachi_tokenizer": {
"type": "sudachi_tokenizer",
"split_mode": "C",
"discard_punctuation": true,
"resources_path": "/etc/elasticsearch/config/sudachi"
}
},
"analyzer": {
"sudachi_analyzer": {
"type": "custom",
"tokenizer": "sudachi_tokenizer"
}
}
}
}
}
}
```

Dictionary settings:
```json
{
"settings": {
"index": {
"analysis": {
"tokenizer": {
"sudachi_tokenizer": {
"type": "sudachi_tokenizer",
"additional_settings": "{\"systemDict\":\"system_full.dic\",\"userDict\":[\"user.dic\"]}"
}
},
"analyzer": {
"sudachi_analyzer": {
"type": "custom",
"tokenizer": "sudachi_tokenizer"
}
}
}
}
}
}
```

# Filters
## sudachi\_split
The `sudachi_split` token filter works like the `mode` option of the kuromoji tokenizer.
- mode
  - "search": Additional segmentation useful for search. (Uses the C and A modes)
    - Ex) 関西国際空港, 関西, 国際, 空港 / アバラカダブラ
  - "extended": Similar to search mode, but also unigrams unknown words.
    - Ex) 関西国際空港, 関西, 国際, 空港 / アバラカダブラ, ア, バ, ラ, カ, ダ, ブ, ラ

Note: In a search query, split subwords are handled as a phrase (in the same way as multi-word synonyms). If you want to search with both the A and C units, use multiple tokenizers instead, as sketched below.
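One way to do that (a sketch; the analyzer and field names are illustrative, and each analyzer is assumed to be a custom analyzer whose `sudachi_tokenizer` uses the corresponding `split_mode`) is to index the same text under two analyzers via multi-fields:

```json
{
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "analyzer": "sudachi_c_analyzer",
        "fields": {
          "a_unit": {
            "type": "text",
            "analyzer": "sudachi_a_analyzer"
          }
        }
      }
    }
  }
}
```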
### PUT sudachi_sample
```json
{
"settings": {
"index": {
"analysis": {
"tokenizer": {
"sudachi_tokenizer": {
"type": "sudachi_tokenizer"
}
},
"analyzer": {
"sudachi_analyzer": {
"filter": ["my_searchfilter"],
"tokenizer": "sudachi_tokenizer",
"type": "custom"
}
},
"filter":{
"my_searchfilter": {
"type": "sudachi_split",
"mode": "search"
}
}
}
}
}
}
```

### POST sudachi_sample/_analyze
```json
{
"analyzer": "sudachi_analyzer",
"text": "関西国際空港"
}
```

Which responds with:
```json
{
"tokens" : [
{
"token" : "関西国際空港",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 0,
"positionLength" : 3
},
{
"token" : "関西",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "国際",
"start_offset" : 2,
"end_offset" : 4,
"type" : "word",
"position" : 1
},
{
"token" : "空港",
"start_offset" : 4,
"end_offset" : 6,
"type" : "word",
"position" : 2
}
]
}
```

## sudachi\_part\_of\_speech
The `sudachi_part_of_speech` token filter removes tokens that match a set of part-of-speech tags. It accepts the following setting:
`stoptags` is an array of part-of-speech and/or inflection tags to be removed. It defaults to the stoptags.txt file embedded in the lucene-analysis-sudachi.jar.
Sudachi POS information is a CSV list consisting of six items:
- 1-4 `part-of-speech hierarchy (品詞階層)`
- 5 `inflectional type (活用型)`
- 6 `inflectional form (活用形)`

With `stoptags`, you can filter out results using any of these forward-matching forms:
- 1 - e.g., `名詞`
- 1,2 - e.g., `名詞,固有名詞`
- 1,2,3 - e.g., `名詞,固有名詞,地名`
- 1,2,3,4 - e.g., `名詞,固有名詞,地名,一般`
- 5 - e.g., `五段-カ行`
- 6 - e.g., `終止形-一般`
- 5,6 - e.g., `五段-カ行,終止形-一般`

### PUT sudachi_sample
```json
{
"settings": {
"index": {
"analysis": {
"tokenizer": {
"sudachi_tokenizer": {
"type": "sudachi_tokenizer"
}
},
"analyzer": {
"sudachi_analyzer": {
"filter": ["my_posfilter"],
"tokenizer": "sudachi_tokenizer",
"type": "custom"
}
},
"filter":{
"my_posfilter":{
"type":"sudachi_part_of_speech",
"stoptags":[
"助詞",
"助動詞",
"補助記号,句点",
"補助記号,読点"
]
}
}
}
}
}
}
```

### POST sudachi_sample/_analyze
```json
{
"analyzer": "sudachi_analyzer",
"text": "寿司がおいしいね"
}
```

Which responds with:
```json
{
"tokens": [
{
"token": "寿司",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "おいしい",
"start_offset": 3,
"end_offset": 7,
"type": "word",
"position": 2
}
]
}
```

## sudachi\_ja\_stop
The `sudachi_ja_stop` token filter filters out Japanese stopwords (`_japanese_`) and any other custom stopwords specified by the user. This filter only supports the predefined `_japanese_` stopwords list. If you want to use a different predefined list, use the `stop` token filter instead.
### PUT sudachi_sample
```json
{
"settings": {
"index": {
"analysis": {
"tokenizer": {
"sudachi_tokenizer": {
"type": "sudachi_tokenizer"
}
},
"analyzer": {
"sudachi_analyzer": {
"filter": ["my_stopfilter"],
"tokenizer": "sudachi_tokenizer",
"type": "custom"
}
},
"filter":{
"my_stopfilter":{
"type":"sudachi_ja_stop",
"stopwords":[
"_japanese_",
"は",
"です"
]
}
}
}
}
}
}
```

### POST sudachi_sample/_analyze
```json
{
"analyzer": "sudachi_analyzer",
"text": "私は宇宙人です。"
}
```

Which responds with:
```json
{
"tokens": [
{
"token": "私",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 0
},
{
"token": "宇宙",
"start_offset": 2,
"end_offset": 4,
"type": "word",
"position": 2
},
{
"token": "人",
"start_offset": 4,
"end_offset": 5,
"type": "word",
"position": 3
}
]
}
```

## sudachi\_baseform
The `sudachi_baseform` token filter replaces terms with their Sudachi dictionary form. This acts as a lemmatizer for verbs and adjectives.
Its output will be overridden by the `sudachi_split`, `sudachi_normalizedform`, or `sudachi_readingform` token filters.
### PUT sudachi_sample
```json
{
"settings": {
"index": {
"analysis": {
"tokenizer": {
"sudachi_tokenizer": {
"type": "sudachi_tokenizer"
}
},
"analyzer": {
"sudachi_analyzer": {
"filter": ["sudachi_baseform"],
"tokenizer": "sudachi_tokenizer",
"type": "custom"
}
}
}
}
}
}
```

### POST sudachi_sample/_analyze
```json
{
"analyzer": "sudachi_analyzer",
"text": "飲み"
}
```

Which responds with:
```json
{
"tokens": [
{
"token": "飲む",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
}
]
}
```

## sudachi\_normalizedform
The `sudachi_normalizedform` token filter replaces terms with their Sudachi normalized form. This acts as a normalizer for spelling variants.
This filter lemmatizes verbs and adjectives too, so you don't need to use the `sudachi_baseform` filter together with it. Its output will be overridden by the `sudachi_split`, `sudachi_baseform`, or `sudachi_readingform` token filters.
### PUT sudachi_sample
```json
{
"settings": {
"index": {
"analysis": {
"tokenizer": {
"sudachi_tokenizer": {
"type": "sudachi_tokenizer"
}
},
"analyzer": {
"sudachi_analyzer": {
"filter": ["sudachi_normalizedform"],
"tokenizer": "sudachi_tokenizer",
"type": "custom"
}
}
}
}
}
}
```

### POST sudachi_sample/_analyze
```json
{
"analyzer": "sudachi_analyzer",
"text": "呑み"
}
```

Which responds with:
```json
{
"tokens": [
{
"token": "飲む",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
}
]
}
```

## sudachi\_readingform
The `sudachi_readingform` token filter replaces terms with their reading form, in either katakana or romaji.
Its output will be overridden by the `sudachi_split`, `sudachi_baseform`, or `sudachi_normalizedform` token filters.
Accepts the following setting:
- use_romaji
  - Whether the romaji reading form should be output instead of katakana. Defaults to false.

### PUT sudachi_sample
```json
{
"settings": {
"index": {
"analysis": {
"filter": {
"romaji_readingform": {
"type": "sudachi_readingform",
"use_romaji": true
},
"katakana_readingform": {
"type": "sudachi_readingform",
"use_romaji": false
}
},
"tokenizer": {
"sudachi_tokenizer": {
"type": "sudachi_tokenizer"
}
},
"analyzer": {
"romaji_analyzer": {
"tokenizer": "sudachi_tokenizer",
"filter": ["romaji_readingform"]
},
"katakana_analyzer": {
"tokenizer": "sudachi_tokenizer",
"filter": ["katakana_readingform"]
}
}
}
}
}
}
```

### POST sudachi_sample/_analyze
```json
{
"analyzer": "katakana_analyzer",
"text": "寿司"
}
```

Returns `スシ`.
```json
{
"analyzer": "romaji_analyzer",
"text": "寿司"
}
```

Returns `susi`.
# Synonym
There is a temporary way to use Sudachi Dictionary's synonym resource ([Sudachi 同義語辞書](https://github.com/WorksApplications/SudachiDict/blob/develop/docs/synonyms.md)) with Elasticsearch.
Please refer to [this document](docs/synonym.md) for details.
# License
Copyright (c) 2017-2024 Works Applications Co., Ltd.
Based on work originally from elasticsearch, https://www.elastic.co/jp/products/elasticsearch
Based on work originally from lucene, https://lucene.apache.org/