{"id":13676635,"url":"https://github.com/WorksApplications/elasticsearch-sudachi","last_synced_at":"2025-04-29T07:33:06.578Z","repository":{"id":37334831,"uuid":"110100447","full_name":"WorksApplications/elasticsearch-sudachi","owner":"WorksApplications","description":"The Japanese analysis plugin for elasticsearch","archived":false,"fork":false,"pushed_at":"2025-03-31T09:11:57.000Z","size":1397,"stargazers_count":196,"open_issues_count":7,"forks_count":42,"subscribers_count":13,"default_branch":"develop","last_synced_at":"2025-03-31T10:24:28.028Z","etag":null,"topics":["elasticsearch-plugin","morphological-analyser"],"latest_commit_sha":null,"homepage":null,"language":"Kotlin","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/WorksApplications.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":"WorksApplications"}},"created_at":"2017-11-09T10:21:47.000Z","updated_at":"2025-03-31T09:12:01.000Z","dependencies_parsed_at":"2023-12-21T12:40:58.802Z","dependency_job_id":"6765831c-c5b0-4e5e-a9e5-1838db611432","html_url":"https://github.com/WorksApplications/elasticsearch-sudachi","commit_stats":null,"previous_names":[],"tags_count":130,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WorksApplications%2Felasticsearch-sudachi","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WorksApplications%2Felasticsearch-sudachi/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WorksApplications%
2Felasticsearch-sudachi/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WorksApplications%2Felasticsearch-sudachi/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/WorksApplications","download_url":"https://codeload.github.com/WorksApplications/elasticsearch-sudachi/tar.gz/refs/heads/develop","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251456056,"owners_count":21592285,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["elasticsearch-plugin","morphological-analyser"],"created_at":"2024-08-02T13:00:30.725Z","updated_at":"2025-04-29T07:33:01.563Z","avatar_url":"https://github.com/WorksApplications.png","language":"Kotlin","readme":"# analysis-sudachi\n\nanalysis-sudachi is an Elasticsearch plugin for tokenization of Japanese text using Sudachi the Japanese morphological analyzer.\n\n![build](https://github.com/WorksApplications/elasticsearch-sudachi/workflows/build/badge.svg)\n[![Quality Gate Status](https://sonarcloud.io/api/project_badges/measure?project=WorksApplications_elasticsearch-sudachi\u0026metric=alert_status)](https://sonarcloud.io/dashboard?id=WorksApplications_elasticsearch-sudachi)\n\n# What's new?\n\n- [3.2.2]\n  - Use `lazyTokenizeSentences` for the analysis to fix the problem of input chunking (#137).\n\nCheck [changelog](./CHANGELOG.md) for more.\n\n# Build (if necessary)\n\n1. Build analysis-sudachi.\n```\n   $ ./gradlew -PengineVersion=es:8.13.4 build\n```\n\nUse `-PengineVersion=os:2.14.0` for OpenSearch.\n\n## Supported ElasticSearch versions\n\n1. 
8.0.* until 8.13.* - supported, integration tests in CI\n2. 7.17.* (latest patch version) - supported, integration tests in CI\n3. 7.11.* until 7.16.* - best effort support, not tested in CI\n4. 7.10.* - integration tests for the latest patch version\n5. 7.9.* and below - not tested in CI at all, may be broken\n6. 7.3.* and below - broken, not supported\n\n## Supported OpenSearch versions\n\n1. 2.6.* until 2.14.* - supported, integration tests in CI\n\n# Installation\n\n1. Move the current directory to $ES_HOME\n2. Install the plugin\n\n   a. Using the release package\n   ```\n   $ bin/elasticsearch-plugin install https://github.com/WorksApplications/elasticsearch-sudachi/releases/download/v3.1.1/analysis-sudachi-8.13.4-3.1.1.zip\n   ```\n   b. Using a self-built package\n   ```\n   $ bin/elasticsearch-plugin install file:///path/to/analysis-sudachi-8.13.4-3.1.1.zip\n   ```\n   (Specify the absolute path in URI format)\n3. Download the Sudachi dictionary archive from https://github.com/WorksApplications/SudachiDict\n4. Extract the dic file and place it at config/sudachi/system_core.dic\n   (You must install system_core.dic at this location if you use Elasticsearch 7.6 or later)\n5. Execute \"bin/elasticsearch\"\n\n## Update Sudachi\n\nIf you want to update the Sudachi version included in an installed plugin, do the following:\n\n1. Download the latest version of Sudachi from [the release page](https://github.com/WorksApplications/Sudachi/releases).\n2. Extract the Sudachi JAR file from the zip.\n3. 
Delete the Sudachi JAR file in $ES_HOME/plugins/analysis-sudachi and replace it with the JAR file you extracted in step 2.\n\n# Analyzer\n\nAn analyzer `sudachi` is provided.\nThis is equivalent to the following custom analyzer.\n\n```json\n{\n  \"settings\": {\n    \"index\": {\n      \"analysis\": {\n        \"analyzer\": {\n          \"default_sudachi_analyzer\": {\n            \"type\": \"custom\",\n            \"tokenizer\": \"sudachi_tokenizer\",\n            \"filter\": [\n              \"sudachi_baseform\",\n              \"sudachi_part_of_speech\",\n              \"sudachi_ja_stop\"\n            ]\n          }\n        }\n      }\n    }\n  }\n}\n```\n\nSee the following sections for details of the tokenizer and each filter.\n\n# Tokenizer\n\nThe `sudachi_tokenizer` tokenizer tokenizes input text using Sudachi.\n\n- split_mode: Select the splitting mode of Sudachi. (A, B, C) (string, default: C)\n  - C: Extracts named entities\n      - Ex) 選挙管理委員会\n  - B: Intermediate units\n      - Ex) 選挙,管理,委員会\n  - A: The shortest units, equivalent to the UniDic short unit\n      - Ex) 選挙,管理,委員,会\n- discard\\_punctuation: Whether to discard punctuation. (bool, default: true)\n- settings\\_path: Sudachi setting file path. The path may be absolute or relative; relative paths are resolved with respect to es\\_config. (string, default: null)\n- resources\\_path: Sudachi dictionary path. The path may be absolute or relative; relative paths are resolved with respect to es\\_config. (string, default: null)\n- additional_settings: A configuration JSON string for Sudachi. This JSON string will be merged into the default configuration. 
If this property is set, `settings_path` will be overridden.\n\n## Dictionary\n\nBy default, `ES_HOME/config/sudachi/system_core.dic` is used.\nYou can specify the dictionary either in the file specified by `settings_path` or by `additional_settings`.\nDue to the security manager, you need to put resources (setting file, dictionaries, and others) under the Elasticsearch config directory.\n\n## Example\n\nTokenizer configuration:\n\n```json\n{\n  \"settings\": {\n    \"index\": {\n      \"analysis\": {\n        \"tokenizer\": {\n          \"sudachi_tokenizer\": {\n            \"type\": \"sudachi_tokenizer\",\n            \"split_mode\": \"C\",\n            \"discard_punctuation\": true,\n            \"resources_path\": \"/etc/elasticsearch/config/sudachi\"\n          }\n        },\n        \"analyzer\": {\n          \"sudachi_analyzer\": {\n            \"type\": \"custom\",\n            \"tokenizer\": \"sudachi_tokenizer\"\n          }\n        }\n      }\n    }\n  }\n}\n```\n\nDictionary settings:\n\n```json\n{\n  \"settings\": {\n    \"index\": {\n      \"analysis\": {\n        \"tokenizer\": {\n          \"sudachi_tokenizer\": {\n            \"type\": \"sudachi_tokenizer\",\n            \"additional_settings\": \"{\\\"systemDict\\\":\\\"system_full.dic\\\",\\\"userDict\\\":[\\\"user.dic\\\"]}\"\n          }\n        },\n        \"analyzer\": {\n          \"sudachi_analyzer\": {\n            \"type\": \"custom\",\n            \"tokenizer\": \"sudachi_tokenizer\"\n          }\n        }\n      }\n    }\n  }\n}\n```\n\n# Filters\n\n## sudachi\\_split\n\nThe `sudachi_split` token filter works like `mode` of kuromoji.\n\n- mode\n  - \"search\": Additional segmentation useful for search. 
(Uses C and A modes)\n    - Ex) 関西国際空港, 関西, 国際, 空港 / アバラカダブラ\n  - \"extended\": Similar to search mode, but also unigrams unknown words.\n    - Ex) 関西国際空港, 関西, 国際, 空港 / アバラカダブラ, ア, バ, ラ, カ, ダ, ブ, ラ\n\nNote: In a search query, split subwords are handled as a phrase (in the same way as multi-word synonyms). If you want to search with both A and C units, use multiple tokenizers instead.\n\n### PUT sudachi_sample\n\n```json\n{\n  \"settings\": {\n    \"index\": {\n      \"analysis\": {\n        \"tokenizer\": {\n          \"sudachi_tokenizer\": {\n            \"type\": \"sudachi_tokenizer\"\n          }\n        },\n        \"analyzer\": {\n          \"sudachi_analyzer\": {\n            \"filter\": [\"my_searchfilter\"],\n            \"tokenizer\": \"sudachi_tokenizer\",\n            \"type\": \"custom\"\n          }\n        },\n        \"filter\":{\n          \"my_searchfilter\": {\n            \"type\": \"sudachi_split\",\n            \"mode\": \"search\"\n          }\n        }\n      }\n    }\n  }\n}\n```\n\n### POST sudachi_sample/_analyze\n\n```json\n{\n    \"analyzer\": \"sudachi_analyzer\",\n    \"text\": \"関西国際空港\"\n}\n```\n\nWhich responds with:\n\n```json\n{\n  \"tokens\" : [\n    {\n      \"token\" : \"関西国際空港\",\n      \"start_offset\" : 0,\n      \"end_offset\" : 6,\n      \"type\" : \"word\",\n      \"position\" : 0,\n      \"positionLength\" : 3\n    },\n    {\n      \"token\" : \"関西\",\n      \"start_offset\" : 0,\n      \"end_offset\" : 2,\n      \"type\" : \"word\",\n      \"position\" : 0\n    },\n    {\n      \"token\" : \"国際\",\n      \"start_offset\" : 2,\n      \"end_offset\" : 4,\n      \"type\" : \"word\",\n      \"position\" : 1\n    },\n    {\n      \"token\" : \"空港\",\n      \"start_offset\" : 4,\n      \"end_offset\" : 6,\n      \"type\" : \"word\",\n      \"position\" : 2\n    }\n  ]\n}\n```\n\n## sudachi\\_part\\_of\\_speech\n\nThe `sudachi_part_of_speech` token filter removes tokens that match a set of part-of-speech tags. 
It accepts the following setting:\n\n`stoptags` is an array of part-of-speech and/or inflection tags that should be removed. It defaults to the stoptags.txt file embedded in lucene-analysis-sudachi.jar.\n\nSudachi POS information is a CSV list consisting of 6 items:\n\n- 1-4 `part-of-speech hierarchy (品詞階層)`\n- 5 `inflectional type (活用型)`\n- 6 `inflectional form (活用形)`\n\nWith `stoptags`, you can filter out the results using any of these forward-matching forms:\n\n- 1 - e.g., `名詞`\n- 1,2 - e.g., `名詞,固有名詞`\n- 1,2,3 - e.g., `名詞,固有名詞,地名`\n- 1,2,3,4 - e.g., `名詞,固有名詞,地名,一般`\n- 5 - e.g., `五段-カ行`\n- 6 - e.g., `終止形-一般`\n- 5,6 - e.g., `五段-カ行,終止形-一般`\n\n### PUT sudachi_sample\n\n```json\n{\n  \"settings\": {\n    \"index\": {\n      \"analysis\": {\n        \"tokenizer\": {\n          \"sudachi_tokenizer\": {\n            \"type\": \"sudachi_tokenizer\"\n          }\n        },\n        \"analyzer\": {\n          \"sudachi_analyzer\": {\n            \"filter\": [\"my_posfilter\"],\n            \"tokenizer\": \"sudachi_tokenizer\",\n            \"type\": \"custom\"\n          }\n        },\n        \"filter\":{\n          \"my_posfilter\":{\n            \"type\":\"sudachi_part_of_speech\",\n            \"stoptags\":[\n              \"助詞\",\n              \"助動詞\",\n              \"補助記号,句点\",\n              \"補助記号,読点\"\n            ]\n          }\n        }\n      }\n    }\n  }\n}\n```\n\n### POST sudachi_sample/_analyze\n\n```json\n{\n  \"analyzer\": \"sudachi_analyzer\",\n  \"text\": \"寿司がおいしいね\"\n}\n```\n\nWhich responds with:\n\n```json\n{\n  \"tokens\": [\n    {\n      \"token\": \"寿司\",\n      \"start_offset\": 0,\n      \"end_offset\": 2,\n      \"type\": \"word\",\n      \"position\": 0\n    },\n    {\n      \"token\": \"おいしい\",\n      \"start_offset\": 3,\n      \"end_offset\": 7,\n      \"type\": \"word\",\n      \"position\": 2\n    }\n  ]\n}\n```\n\n## sudachi\\_ja\\_stop\n\nThe `sudachi_ja_stop` token filter filters out Japanese stopwords (_japanese_), and 
any other custom stopwords specified by the user. This filter only supports the predefined _japanese_ stopwords list. If you want to use a different predefined list, then use the stop token filter instead.\n\n### PUT sudachi_sample\n\n```json\n{\n  \"settings\": {\n    \"index\": {\n      \"analysis\": {\n        \"tokenizer\": {\n          \"sudachi_tokenizer\": {\n            \"type\": \"sudachi_tokenizer\"\n          }\n        },\n        \"analyzer\": {\n          \"sudachi_analyzer\": {\n            \"filter\": [\"my_stopfilter\"],\n            \"tokenizer\": \"sudachi_tokenizer\",\n            \"type\": \"custom\"\n          }\n        },\n        \"filter\":{\n          \"my_stopfilter\":{\n            \"type\":\"sudachi_ja_stop\",\n            \"stopwords\":[\n              \"_japanese_\",\n              \"は\",\n              \"です\"\n            ]\n          }\n        }\n      }\n    }\n  }\n}\n```\n\n### POST sudachi_sample/_analyze\n\n```json\n{\n  \"analyzer\": \"sudachi_analyzer\",\n  \"text\": \"私は宇宙人です。\"\n}\n```\n\nWhich responds with:\n\n```json\n{\n  \"tokens\": [\n    {\n      \"token\": \"私\",\n      \"start_offset\": 0,\n      \"end_offset\": 1,\n      \"type\": \"word\",\n      \"position\": 0\n    },\n    {\n      \"token\": \"宇宙\",\n      \"start_offset\": 2,\n      \"end_offset\": 4,\n      \"type\": \"word\",\n      \"position\": 2\n    },\n    {\n      \"token\": \"人\",\n      \"start_offset\": 4,\n      \"end_offset\": 5,\n      \"type\": \"word\",\n      \"position\": 3\n    }\n  ]\n}\n```\n\n## sudachi\\_baseform\n\nThe `sudachi_baseform` token filter replaces terms with their Sudachi dictionary form. 
This acts as a lemmatizer for verbs and adjectives.\n\nThis will be overridden by `sudachi_split`, `sudachi_normalizedform` or `sudachi_readingform` token filters.\n\n### PUT sudachi_sample\n```json\n{\n  \"settings\": {\n    \"index\": {\n      \"analysis\": {\n        \"tokenizer\": {\n          \"sudachi_tokenizer\": {\n            \"type\": \"sudachi_tokenizer\"\n          }\n        },\n        \"analyzer\": {\n          \"sudachi_analyzer\": {\n            \"filter\": [\"sudachi_baseform\"],\n            \"tokenizer\": \"sudachi_tokenizer\",\n            \"type\": \"custom\"\n          }\n        }\n      }\n    }\n  }\n}\n```\n\n### POST sudachi_sample/_analyze\n\n```json\n{\n  \"analyzer\": \"sudachi_analyzer\",\n  \"text\": \"飲み\"\n}\n```\n\nWhich responds with:\n\n```json\n{\n  \"tokens\": [\n    {\n      \"token\": \"飲む\",\n      \"start_offset\": 0,\n      \"end_offset\": 2,\n      \"type\": \"word\",\n      \"position\": 0\n    }\n  ]\n}\n```\n\n## sudachi\\_normalizedform\n\nThe `sudachi_normalizedform` token filter replaces terms with their Sudachi normalized form. This acts as a normalizer for spelling variants.\nThis filter lemmatizes verbs and adjectives too. 
You don't need to use `sudachi_baseform` filter with this filter.\n\nThis will be overridden by `sudachi_split`, `sudachi_baseform` or `sudachi_readingform` token filters.\n\n### PUT sudachi_sample\n\n```json\n{\n  \"settings\": {\n    \"index\": {\n      \"analysis\": {\n        \"tokenizer\": {\n          \"sudachi_tokenizer\": {\n            \"type\": \"sudachi_tokenizer\"\n          }\n        },\n        \"analyzer\": {\n          \"sudachi_analyzer\": {\n            \"filter\": [\"sudachi_normalizedform\"],\n            \"tokenizer\": \"sudachi_tokenizer\",\n            \"type\": \"custom\"\n          }\n        }\n      }\n    }\n  }\n}\n```\n\n### POST sudachi_sample/_analyze\n\n```json\n{\n  \"analyzer\": \"sudachi_analyzer\",\n  \"text\": \"呑み\"\n}\n```\n\nWhich responds with:\n\n```json\n{\n  \"tokens\": [\n    {\n      \"token\": \"飲む\",\n      \"start_offset\": 0,\n      \"end_offset\": 2,\n      \"type\": \"word\",\n      \"position\": 0\n    }\n  ]\n}\n```\n\n## sudachi\\_readingform\n\nThe `sudachi_readingform` token filter replaces the terms with their reading form in either katakana or romaji.\n\nThis will be overridden by `sudachi_split`, `sudachi_baseform` or `sudachi_normalizedform` token filters.\n\nAccepts the following setting:\n\n- use_romaji\n  - Whether romaji reading form should be output instead of katakana. 
Defaults to false.\n\n### PUT sudachi_sample\n\n```json\n{\n  \"settings\": {\n    \"index\": {\n      \"analysis\": {\n        \"filter\": {\n          \"romaji_readingform\": {\n            \"type\": \"sudachi_readingform\",\n            \"use_romaji\": true\n          },\n          \"katakana_readingform\": {\n            \"type\": \"sudachi_readingform\",\n            \"use_romaji\": false\n          }\n        },\n        \"tokenizer\": {\n          \"sudachi_tokenizer\": {\n            \"type\": \"sudachi_tokenizer\"\n          }\n        },\n        \"analyzer\": {\n          \"romaji_analyzer\": {\n            \"tokenizer\": \"sudachi_tokenizer\",\n            \"filter\": [\"romaji_readingform\"]\n          },\n          \"katakana_analyzer\": {\n            \"tokenizer\": \"sudachi_tokenizer\",\n            \"filter\": [\"katakana_readingform\"]\n          }\n        }\n      }\n    }\n  }\n}\n```\n\n### POST sudachi_sample/_analyze\n\n```json\n{\n  \"analyzer\": \"katakana_analyzer\",\n  \"text\": \"寿司\"\n}\n```\n\nReturns `スシ`.\n\n```json\n{\n  \"analyzer\": \"romaji_analyzer\",\n  \"text\": \"寿司\"\n}\n```\n\nReturns `susi`.\n\n\n# Synonym\n\nThere is a temporary way to use Sudachi Dictionary's synonym resource ([Sudachi 同義語辞書](https://github.com/WorksApplications/SudachiDict/blob/develop/docs/synonyms.md)) with Elasticsearch.\n\nPlease refer to [this document](docs/synonym.md) for the detail.\n\n\n# License\n\nCopyright (c) 2017-2024 Works Applications Co., Ltd.\nOriginally under elasticsearch, https://www.elastic.co/jp/products/elasticsearch\nOriginally under lucene, 
https://lucene.apache.org/\n","funding_links":["https://github.com/sponsors/WorksApplications"],"categories":["Kotlin"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FWorksApplications%2Felasticsearch-sudachi","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FWorksApplications%2Felasticsearch-sudachi","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FWorksApplications%2Felasticsearch-sudachi/lists"}