{"id":47219100,"url":"https://github.com/buda-base/lucene-zh","last_synced_at":"2026-03-13T17:08:38.145Z","repository":{"id":37422317,"uuid":"122181343","full_name":"buda-base/lucene-zh","owner":"buda-base","description":"Simple Lucene analyzer for Traditional, Simplified and Pinyin","archived":false,"fork":false,"pushed_at":"2022-09-04T09:47:46.000Z","size":365,"stargazers_count":3,"open_issues_count":4,"forks_count":1,"subscribers_count":4,"default_branch":"master","last_synced_at":"2026-01-14T03:52:55.827Z","etag":null,"topics":["java","lucene-analyzer","pinyin","simplified-chinese","traditional-chinese"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/buda-base.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-02-20T10:01:56.000Z","updated_at":"2022-06-21T09:39:36.000Z","dependencies_parsed_at":"2022-08-19T15:31:15.661Z","dependency_job_id":null,"html_url":"https://github.com/buda-base/lucene-zh","commit_stats":null,"previous_names":["buddhistdigitalresourcecenter/lucene-zh"],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/buda-base/lucene-zh","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/buda-base%2Flucene-zh","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/buda-base%2Flucene-zh/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/buda-base%2Flucene-zh/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/buda-base%2Flucene-zh/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/buda-base
","download_url":"https://codeload.github.com/buda-base/lucene-zh/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/buda-base%2Flucene-zh/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30471145,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-13T11:00:43.441Z","status":"ssl_error","status_checked_at":"2026-03-13T11:00:23.173Z","response_time":60,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["java","lucene-analyzer","pinyin","simplified-chinese","traditional-chinese"],"created_at":"2026-03-13T17:08:37.311Z","updated_at":"2026-03-13T17:08:38.129Z","avatar_url":"https://github.com/buda-base.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Lucene Analyzers for Buddhist Chinese \n\nThis repository contains bricks to process Chinese in Lucene, mainly:\n- stopwords\n- SC/TC conversion\n- Hanzi variants\n- Hanzi to Pinyin conversion\n- Pinyin syllable tokenizer\n\n## Installation\n\nYou can install this analyzer from Maven:\n\n```xml\n    \u003cdependency\u003e\n      \u003cgroupId\u003eio.bdrc.lucene\u003c/groupId\u003e\n      \u003cartifactId\u003elucene-zh\u003c/artifactId\u003e\n      \u003cversion\u003e0.2.0\u003c/version\u003e\n    \u003c/dependency\u003e\n```\n\n## Building from source\n\nA 
compiled Trie is needed in order to build a complete jar, so the base command line to build a jar is:\n\n```\nmvn clean compile exec:java package\n```\n\nThe following options alter the packaging:\n\n- `-DincludeDeps=true` includes `io.bdrc.lucene:stemmer` in the produced jar file\n- `-DperformRelease=true` signs the jar file with gpg\n\n## Indexing Pipeline\n\n```\nTC ⟾ normalized TC ⟾ SC ⟾ normalized SC ⟾ normalized PY_strict ⟾ normalized PY_lazy\n```\n\n`normalized TC/SC`: any combination of the following treatments: synonyms, alternatives and stopwords.\n\n`normalized PY`: PY is lower-cased and split into syllables (to match the general policy of indexing individual ideograms). `PY_strict` denotes Pinyin with tone indication (diacritics or numbers) and `PY_lazy` denotes Pinyin with no tone indication.\n\n## Constructors\n\n```\nChineseAnalyzer(String indexEncoding, String inputEncoding, boolean stopwords, int variants)\n    indexEncoding - \"TC\", \"SC\", \"PY_strict\" or \"PY_lazy\"\n    inputEncoding - \"TC\", \"SC\", \"PY_strict\" or \"PY_lazy\"\n    stopWords     - true to filter stopwords, false otherwise\n    variants      - 0: no variant; 1: synonyms; 2: alternatives; 3: both\n```\n\n```\nChineseAnalyzer(String profile)\n```\n\n| Profiles            | inputEncoding | indexEncoding | stopWords | variants |\n| :------------------ | :------------ | :------------ | :-------- | :------: |\n| `exactTC`          | TC            | TC            | false     | 0        |\n| `TC`               | TC            | TC            | true      | 3        |\n| `TC2SC`            | TC            | SC            | true      | 3        |\n| `TC2PYstrict`     | TC            | PYstrict      | true      | 3        |\n| `TC2PYlazy`       | TC            | PYlazy        | true      | 3        |\n| `SC`               | SC            | SC            | true      | 3        |\n| `SC2PYstrict`     | SC            | PYstrict      | true      | 3        |\n| `SC2PYlazy`       | SC            | 
PYlazy        | true      | 3        |\n| `PYstrict`        | PYstrict      | PYstrict      | false     | 0        |\n| `PYstrict2PYlazy`| PYstrict      | PYlazy        | false     | 0        |\n| `PYlazy`          | PYlazy        | PYlazy        | false     | 0        |\n\n\n## Components\n\n### Tokenizers\n\n#### StandardTokenizer\n\nProduces ideogram-based tokens (it incorporates the historical Chinese Tokenizer).\n\n#### WhitespaceTokenizer\n\nUsed together with `PinyinSyllabifyingFilter` so that the filter is not given big strings.\n\n### Filters\n\n#### PinyinNormalizingFilter (MappingCharFilter)\n\nTODO: when we have more pinyin data or when we know how users type their queries, assess whether the normalization is sufficient.\n\n#### TC2SCFilter (TokenFilter)\n\nLeverages Unihan data to replace token content with the SC equivalent.\n\n#### ZhToPinyinFilter (TokenFilter)\n\nReplaces the token content (TC and SC) with the pinyin transcription.\n\n#### LazyPinyinFilter (TokenFilter)\n\nRemoves tone marks in Pinyin.\n\n#### LowerCaseFilter (MappingCharFilter)\n\nUsed as a pre-processing step for PY indexing.\n\n#### PinyinSyllabifyingFilter (TokenFilter)\n\nProduces syllable-based tokens using `PinyinAlphabetTokenizer`.\nSupports both strict and lazy pinyin.\n\n#### ZhSynonymsFilter (MappingCharFilter)\n\nLeverages Unihan's kSemanticVariant field to index the same variant for all synonyms.\n\n#### ZhAlternatesFilter (MappingCharFilter)\n\nLeverages Unihan's kZVariant field to index the same variant for stylistic variants of the same ideogram.\n\n## Sizes\n\nIn the Unihan database for Unicode 10, 88884 codepoints have an entry (counting both full ideograms and parts of surrogate pairs).\n82829 entries carry no TC or SC information, 3037 are specifically TC, 3007 are specifically SC and 11 have information about both TC and SC.\n\nThere are 1655 possible syllables in PY and 469 in PY with no diacritics.\n\n## Resources\n\n`src/main/resources` is the output of 
lucene-zh-data, generated by `make`, except for `pinyin-alphabet.dict`, which comes from [here](https://github.com/medcl/elasticsearch-analysis-pinyin/tree/master/src/main/resources).\n\n`src/main/resources/zh-stopwords.txt` is [this stop-list](https://github.com/stopwords-iso/stopwords-zh/blob/master/stopwords-zh.txt).\n\n`src/main/resources/zh-stopwords_analyzed.txt` is the same list as above with the corresponding SC, PYstrict and PYlazy strings. It was generated using `PrettyPrintResult.java`.\n\n## Licence\nThe code is Copyright 2018 Buddhist Digital Resource Center, and is provided under [Apache License 2.0](LICENSE).\n\nFiles in `src/main/resources` remain under [Unicode License](http://unicode.org/copyright.html), except `zh-stopwords.txt`, under [MIT Licence](https://opensource.org/licenses/MIT) and `pinyin-alphabet.dict`, under [Apache 2 Licence](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbuda-base%2Flucene-zh","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbuda-base%2Flucene-zh","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbuda-base%2Flucene-zh/lists"}