{"id":16819838,"url":"https://github.com/wanasit/kotori","last_synced_at":"2025-07-21T16:03:53.907Z","repository":{"id":66160075,"uuid":"262215376","full_name":"wanasit/kotori","owner":"wanasit","description":"A Japanese tokenizer and morphological analysis engine written in Kotlin","archived":false,"fork":false,"pushed_at":"2020-08-30T06:18:17.000Z","size":25313,"stargazers_count":54,"open_issues_count":0,"forks_count":6,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-04T03:22:49.347Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Kotlin","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wanasit.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-05-08T03:13:20.000Z","updated_at":"2024-12-25T04:22:28.000Z","dependencies_parsed_at":null,"dependency_job_id":"6324ddbd-9d59-4f93-a534-42352927fcf6","html_url":"https://github.com/wanasit/kotori","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/wanasit/kotori","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wanasit%2Fkotori","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wanasit%2Fkotori/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wanasit%2Fkotori/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wanasit%2Fkotori/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/wanasit","download_url":"https://codeload.github.com/w
anasit/kotori/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wanasit%2Fkotori/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266332369,"owners_count":23912660,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-21T11:47:31.412Z","response_time":64,"last_error":null,"robots_txt_status":null,"robots_txt_updated_at":null,"robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-13T10:54:45.901Z","updated_at":"2025-07-21T16:03:53.875Z","avatar_url":"https://github.com/wanasit.png","language":"Kotlin","funding_links":[],"categories":["人工智能"],"sub_categories":["自然语言处理"],"readme":"# Kotori\nA Japanese tokenizer and morphological analysis engine written in Kotlin\n\n### Usage\n\n```kotlin\nimport com.github.wanasit.kotori.Tokenizer\n\nfun main(args: Array\u003cString\u003e) {\n    val tokenizer = Tokenizer.createDefaultTokenizer()\n    val words = tokenizer.tokenize(\"お寿司が食べたい。\").map { it.text }\n\n    println(words) // [お, 寿司, が, 食べ, たい, 。]\n}\n```\n\n### Installation\n\nKotori packages are hosted by [bintray](https://bintray.com/beta/#/wanasit/maven/Kotori?tab=overview) and JCenter.\nYou can download and install it via Gradle or Maven.\n\nGradle:\n```groovy\nrepositories {\n    jcenter()\n}\n\ndependencies {\n    ...\n    implementation 'com.github.wanasit.kotori:kotori:0.0.3'\n}\n```\n\nMaven:\n```xml\n\u003cdependency\u003e\n  
\u003cgroupId\u003ecom.github.wanasit.kotori\u003c/groupId\u003e\n  \u003cartifactId\u003ekotori\u003c/artifactId\u003e\n  \u003cversion\u003eVERSION_NUMBER\u003c/version\u003e\n  \u003ctype\u003epom\u003c/type\u003e\n\u003c/dependency\u003e\n```\n\nYou can also install Kotori via [JitPack](https://jitpack.io/#wanasit/kotori). \n\n### Dictionary \n\nKotori has a built-in dictionary based on `mecab-ipadic-2.7.0-20070801`.\n\n```kotlin\nval dictionary = Dictionary.readDefaultFromResource()\nval tokenizer = Tokenizer.create(dictionary)\n\ntokenizer.tokenize(\"お寿司が食べたい。\")\n```\n\nHowever, it also works out of the box with any MeCab dictionary. For example:\n* IPADIC ([2.7.0-20070801](http://atilika.com/releases/mecab-ipadic/mecab-ipadic-2.7.0-20070801.tar.gz))\n* UniDic ([2.1.2](http://atilika.com/releases/unidic-mecab/unidic-mecab-2.1.2_src.zip))\n* JUMANDIC ([7.0-20130310](http://atilika.com/releases/mecab-jumandic/mecab-jumandic-7.0-20130310.tar.gz))\n\n```kotlin\nval dictionary = MeCabDictionary.readFromDirectory(\"~/Download/mecab-ipadic-2.7.0-20070801\")\nval tokenizer = Tokenizer.create(dictionary)\n\ntokenizer.tokenize(\"お寿司が食べたい。\")\n```\n\nNote: Support for [Sudachi](https://github.com/WorksApplications/Sudachi) dictionaries and plugins is under development.\n\n### Performance\n\nKotori is heavily inspired by [Kuromoji](https://github.com/atilika/kuromoji) and [Sudachi](https://github.com/WorksApplications/Sudachi), \nbut its tokenization is faster than both of these JVM-based tokenizers (based on our *probably unfair* benchmark).\n\nThe following statistics come from tokenizing Japanese sentences from [Tatoeba](https://tatoeba.org/eng/) \n(193,898 sentence entries, 3,561,854 total characters) on a MacBook Pro 2020 (2.4 GHz 8-Core Intel Core i9).\n\n|   |  Token Count  | Time (ns per document) |  Time (ns per token)  |\n|---|---:|---:|---:|\n|Kuromoji (IPADIC) | 2,264,560 | 10,095 | 864 |\n|**Kotori (IPADIC)**   | 2,264,705 | **8,190**| **701** |\n|Sudachi 
(sudachi-dictionary-20200330-small)  | 2,308,873 | 27,352 | 2,296 |\n|Kotori (sudachi-dictionary-20200330-small)   | 2,157,820 | 13,079 | 1,175 |\n\n#### (Speculative) What makes Kotori fast\n\n* **Minimal String.substring() usage**. [Since JDK 7](https://www.programcreek.com/2013/09/the-substring-method-in-jdk-6-and-jdk-7/), \nthe function copies the string and has O(n) overhead. Some tokenizers designed before the change (e.g. Kuromoji) still make many substring calls.\n\n* **A customized Trie data structure**. \n`TransitionArrayTrie` can be built quickly, just in time, when creating a tokenizer,\nyet it still performs well on Japanese text in UTF-16.\n\n#### (Speculative) What makes Kotori slow\n\n* **Kotori doesn't rely on any pre-built data structure** (e.g. `DoubleArrayTrie`). \nIt reads a dictionary in a list-of-terms format and builds the trie just in time.\nThis is a design decision that keeps Kotori open to multiple dictionary formats in exchange for some startup time.\n\n* Kotlin (as written by the inexperienced library author) is slower than Java, \nmostly because Kotlin's `Array\u003cT?\u003e` has some overhead compared to Java's native `T[]`.\n\n#### Benchmark\n\nThe benchmark can be run as a Gradle task.\n\n```bash\n./gradlew benchmark\n./gradlew benchmark --args='--tokenizer=kuromoji'\n./gradlew benchmark --args='--tokenizer=kotori --dictionary=sudachi-small'\n```\n\nCheck [the source code](https://github.com/wanasit/kotori/blob/master/kotori-benchmark/src/main/kotlin/com/github/wanasit/kotori/benchmark/Benchmark.kt) \nin the `kotori-benchmark` project for more details.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwanasit%2Fkotori","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwanasit%2Fkotori","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwanasit%2Fkotori/lists"}