{"id":14981073,"url":"https://github.com/infinilabs/analysis-pinyin","last_synced_at":"2025-05-13T23:08:52.489Z","repository":{"id":3359723,"uuid":"4405468","full_name":"infinilabs/analysis-pinyin","owner":"infinilabs","description":"🛵 This Pinyin Analysis plugin is used to do conversion between Chinese characters and Pinyin.","archived":false,"fork":false,"pushed_at":"2025-04-25T03:24:36.000Z","size":33274,"stargazers_count":3026,"open_issues_count":118,"forks_count":555,"subscribers_count":113,"default_branch":"master","last_synced_at":"2025-05-06T23:35:30.618Z","etag":null,"topics":["analyzer","conversion","easysearch","elasticsearch","opensearch","pinyin","pinyin-analysis"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/infinilabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"patreon":"medcl","custom":["https://www.buymeacoffee.com/medcl"]}},"created_at":"2012-05-22T10:45:42.000Z","updated_at":"2025-05-06T13:38:27.000Z","dependencies_parsed_at":"2023-01-16T18:31:36.335Z","dependency_job_id":"242b2778-4224-4459-9efa-6f6e59d09453","html_url":"https://github.com/infinilabs/analysis-pinyin","commit_stats":{"total_commits":156,"total_committers":17,"mean_commits":9.176470588235293,"dds":0.1858974358974359,"last_synced_commit":"275cabd6979c1770d868d810b07ad83abc64550c"},"previous_names":["infinilabs/analysis-pinyin"],"tags_count":217,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories
/infinilabs%2Fanalysis-pinyin","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/infinilabs%2Fanalysis-pinyin/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/infinilabs%2Fanalysis-pinyin/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/infinilabs%2Fanalysis-pinyin/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/infinilabs","download_url":"https://codeload.github.com/infinilabs/analysis-pinyin/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253870865,"owners_count":21976613,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analyzer","conversion","easysearch","elasticsearch","opensearch","pinyin","pinyin-analysis"],"created_at":"2024-09-24T14:02:51.506Z","updated_at":"2025-05-13T23:08:47.450Z","avatar_url":"https://github.com/infinilabs.png","language":"Java","funding_links":["https://patreon.com/medcl","https://www.buymeacoffee.com/medcl"],"categories":["人工智能"],"sub_categories":[],"readme":"Pinyin Analysis for Elasticsearch and OpenSearch\n==================================\n\n![](./assets/banner.png)\n\nThis Pinyin Analysis plugin facilitates the conversion between Chinese characters and Pinyin. It supports major versions of Elasticsearch and OpenSearch. 
Maintained and supported with ❤️ by [INFINI Labs](https://infinilabs.com).\n\nThe plugin comprises an analyzer named `pinyin`, a tokenizer named `pinyin`, and a token filter named `pinyin`.\n\n# Optional Parameters\n\n- `keep_first_letter`: When enabled, retains only the first letter of each Chinese character. For example, `刘德华` becomes `ldh`. Default: true.\n\n- `keep_separate_first_letter`: When enabled, keeps the first letters of each Chinese character separately. For example, `刘德华` becomes `l`,`d`,`h`. Default: false. Note: This may increase query fuzziness due to term frequency.\n\n- `limit_first_letter_length`: Sets the maximum length of the first letter result. Default: 16.\n\n- `keep_full_pinyin`: When enabled, preserves the full Pinyin of each Chinese character. For example, `刘德华` becomes [`liu`,`de`,`hua`]. Default: true.\n\n- `keep_joined_full_pinyin`: When enabled, joins the full Pinyin of each Chinese character. For example, `刘德华` becomes [`liudehua`]. Default: false.\n\n- `keep_none_chinese`: Keeps non-Chinese letters or numbers in the result. Default: true.\n\n- `keep_none_chinese_together`: Keeps non-Chinese letters together. Default: true. For example, `DJ音乐家` becomes `DJ`,`yin`,`yue`,`jia`. When set to `false`, `DJ音乐家` becomes `D`,`J`,`yin`,`yue`,`jia`. Note: `keep_none_chinese` should be enabled first.\n\n- `keep_none_chinese_in_first_letter`: Keeps non-Chinese letters in the first letter. For example, `刘德华AT2016` becomes `ldhat2016`. Default: true.\n\n- `keep_none_chinese_in_joined_full_pinyin`: Keeps non-Chinese letters in joined full Pinyin. For example, `刘德华2016` becomes `liudehua2016`. Default: false.\n\n- `none_chinese_pinyin_tokenize`: Breaks non-Chinese letters into separate Pinyin terms if they are Pinyin. Default: true. For example, `liudehuaalibaba13zhuanghan` becomes `liu`,`de`,`hua`,`a`,`li`,`ba`,`ba`,`13`,`zhuang`,`han`. 
Note: `keep_none_chinese` and `keep_none_chinese_together` should be enabled first.\n\n- `keep_original`: When enabled, keeps the original input as well. Default: false.\n\n- `lowercase`: Lowercases non-Chinese letters. Default: true.\n\n- `trim_whitespace`: Default: true.\n\n- `remove_duplicated_term`: When enabled, removes duplicated terms to save index space. For example, `de的` becomes `de`. Default: false. Note: Position-related queries may be affected.\n\n- `ignore_pinyin_offset`: After version 6.0, offsets are strictly constrained, and overlapped tokens are not allowed. With this parameter, overlapped tokens will be allowed by ignoring the offset. Please note, all position-related queries or highlights will become incorrect. You should use multi-fields and specify different settings for different query purposes. If you need offsets, please set it to false. Default: true.\n\n\n# How to Install\n\nYou can download the packaged plugins from here: `https://release.infinilabs.com/`, \n\nor you can use the `plugin` CLI to install the plugin like this:\n\nFor Elasticsearch\n\n```\nbin/elasticsearch-plugin install https://get.infini.cloud/elasticsearch/analysis-pinyin/8.4.1\n```\n\nFor OpenSearch\n\n```\nbin/opensearch-plugin install https://get.infini.cloud/opensearch/analysis-pinyin/2.12.0\n```\n\n\nTip: replace the version number in the URL with the one that matches your Elasticsearch or OpenSearch version.\n\n\n# Getting Started\n\n1.Create an index with a custom pinyin analyzer\n\u003cpre\u003e\nPUT /medcl/ \n{\n    \"settings\" : {\n        \"analysis\" : {\n            \"analyzer\" : {\n                \"pinyin_analyzer\" : {\n                    \"tokenizer\" : \"my_pinyin\"\n                    }\n            },\n            \"tokenizer\" : {\n                \"my_pinyin\" : {\n                    \"type\" : \"pinyin\",\n                    \"keep_separate_first_letter\" : false,\n                    \"keep_full_pinyin\" : true,\n                    \"keep_original\" : true,\n           
         \"limit_first_letter_length\" : 16,\n                    \"lowercase\" : true,\n                    \"remove_duplicated_term\" : true\n                }\n            }\n        }\n    }\n}\n\u003c/pre\u003e\n\n2.Test Analyzer, analyzing a chinese name, such as 刘德华\n\u003cpre\u003e\nGET /medcl/_analyze\n{\n  \"text\": [\"刘德华\"],\n  \"analyzer\": \"pinyin_analyzer\"\n}\u003c/pre\u003e\n\u003cpre\u003e\n{\n  \"tokens\" : [\n    {\n      \"token\" : \"liu\",\n      \"start_offset\" : 0,\n      \"end_offset\" : 1,\n      \"type\" : \"word\",\n      \"position\" : 0\n    },\n    {\n      \"token\" : \"de\",\n      \"start_offset\" : 1,\n      \"end_offset\" : 2,\n      \"type\" : \"word\",\n      \"position\" : 1\n    },\n    {\n      \"token\" : \"hua\",\n      \"start_offset\" : 2,\n      \"end_offset\" : 3,\n      \"type\" : \"word\",\n      \"position\" : 2\n    },\n    {\n      \"token\" : \"刘德华\",\n      \"start_offset\" : 0,\n      \"end_offset\" : 3,\n      \"type\" : \"word\",\n      \"position\" : 3\n    },\n    {\n      \"token\" : \"ldh\",\n      \"start_offset\" : 0,\n      \"end_offset\" : 3,\n      \"type\" : \"word\",\n      \"position\" : 4\n    }\n  ]\n}\n\u003c/pre\u003e\n\n3.Create mapping\n\u003cpre\u003e\nPOST /medcl/_mapping \n{\n        \"properties\": {\n            \"name\": {\n                \"type\": \"keyword\",\n                \"fields\": {\n                    \"pinyin\": {\n                        \"type\": \"text\",\n                        \"store\": false,\n                        \"term_vector\": \"with_offsets\",\n                        \"analyzer\": \"pinyin_analyzer\",\n                        \"boost\": 10\n                    }\n                }\n            }\n        }\n    \n}\n\u003c/pre\u003e\n\n4.Indexing\n\u003cpre\u003e\nPOST /medcl/_create/andy\n{\"name\":\"刘德华\"}\n\u003c/pre\u003e\n\n5.Let's search\n\n\u003cpre\u003e\n\ncurl http://localhost:9200/medcl/_search?q=name:%E5%88%98%E5%BE%B7%E5%8D%8E\ncurl 
http://localhost:9200/medcl/_search?q=name.pinyin:%e5%88%98%e5%be%b7\ncurl http://localhost:9200/medcl/_search?q=name.pinyin:liu\ncurl http://localhost:9200/medcl/_search?q=name.pinyin:ldh\ncurl http://localhost:9200/medcl/_search?q=name.pinyin:de+hua\n\n\u003c/pre\u003e\n\n6.Using Pinyin-TokenFilter\n\u003cpre\u003e\nPUT /medcl1/ \n{\n    \"settings\" : {\n        \"analysis\" : {\n            \"analyzer\" : {\n                \"user_name_analyzer\" : {\n                    \"tokenizer\" : \"whitespace\",\n                    \"filter\" : \"pinyin_first_letter_and_full_pinyin_filter\"\n                }\n            },\n            \"filter\" : {\n                \"pinyin_first_letter_and_full_pinyin_filter\" : {\n                    \"type\" : \"pinyin\",\n                    \"keep_first_letter\" : true,\n                    \"keep_full_pinyin\" : false,\n                    \"keep_none_chinese\" : true,\n                    \"keep_original\" : false,\n                    \"limit_first_letter_length\" : 16,\n                    \"lowercase\" : true,\n                    \"trim_whitespace\" : true,\n                    \"keep_none_chinese_in_first_letter\" : true\n                }\n            }\n        }\n    }\n}\n\u003c/pre\u003e\n\nToken Test:刘德华 张学友 郭富城 黎明 四大天王\n\u003cpre\u003e\nGET /medcl1/_analyze\n{\n  \"text\": [\"刘德华 张学友 郭富城 黎明 四大天王\"],\n  \"analyzer\": \"user_name_analyzer\"\n}\n\u003c/pre\u003e\n\u003cpre\u003e\n{\n  \"tokens\" : [\n    {\n      \"token\" : \"ldh\",\n      \"start_offset\" : 0,\n      \"end_offset\" : 3,\n      \"type\" : \"word\",\n      \"position\" : 0\n    },\n    {\n      \"token\" : \"zxy\",\n      \"start_offset\" : 4,\n      \"end_offset\" : 7,\n      \"type\" : \"word\",\n      \"position\" : 1\n    },\n    {\n      \"token\" : \"gfc\",\n      \"start_offset\" : 8,\n      \"end_offset\" : 11,\n      \"type\" : \"word\",\n      \"position\" : 2\n    },\n    {\n      \"token\" : \"lm\",\n      \"start_offset\" : 12,\n      
\"end_offset\" : 14,\n      \"type\" : \"word\",\n      \"position\" : 3\n    },\n    {\n      \"token\" : \"sdtw\",\n      \"start_offset\" : 15,\n      \"end_offset\" : 19,\n      \"type\" : \"word\",\n      \"position\" : 4\n    }\n  ]\n}\n\u003c/pre\u003e\n\n\n7.Used in phrase query\n\n- option 1\n\n\u003cpre\u003e\nPUT /medcl2/\n{\n    \"settings\" : {\n        \"analysis\" : {\n            \"analyzer\" : {\n                \"pinyin_analyzer\" : {\n                    \"tokenizer\" : \"my_pinyin\"\n                    }\n            },\n            \"tokenizer\" : {\n                \"my_pinyin\" : {\n                    \"type\" : \"pinyin\",\n                    \"keep_first_letter\":false,\n                    \"keep_separate_first_letter\" : false,\n                    \"keep_full_pinyin\" : true,\n                    \"keep_original\" : false,\n                    \"limit_first_letter_length\" : 16,\n                    \"lowercase\" : true\n                }\n            }\n        }\n    }\n}\nGET /medcl2/_search\n{\n  \"query\": {\"match_phrase\": {\n    \"name.pinyin\": \"刘德华\"\n  }}\n}\n\n\u003c/pre\u003e\n\n- option 2\n\n\u003cpre\u003e\n \nPUT /medcl3/\n{\n   \"settings\" : {\n       \"analysis\" : {\n           \"analyzer\" : {\n               \"pinyin_analyzer\" : {\n                   \"tokenizer\" : \"my_pinyin\"\n                   }\n           },\n           \"tokenizer\" : {\n               \"my_pinyin\" : {\n                   \"type\" : \"pinyin\",\n                   \"keep_first_letter\":true,\n                   \"keep_separate_first_letter\" : true,\n                   \"keep_full_pinyin\" : true,\n                   \"keep_original\" : false,\n                   \"limit_first_letter_length\" : 16,\n                   \"lowercase\" : true\n               }\n           }\n       }\n   }\n}\n   \nPOST /medcl3/_mapping \n{\n  \"properties\": {\n      \"name\": {\n          \"type\": \"keyword\",\n          \"fields\": {\n              
\"pinyin\": {\n                  \"type\": \"text\",\n                  \"store\": false,\n                  \"term_vector\": \"with_offsets\",\n                  \"analyzer\": \"pinyin_analyzer\",\n                  \"boost\": 10\n              }\n          }\n      }\n  }\n}\n  \n   \nGET /medcl3/_analyze\n{\n   \"text\": [\"刘德华\"],\n   \"analyzer\": \"pinyin_analyzer\"\n}\n \nPOST /medcl3/_create/andy\n{\"name\":\"刘德华\"}\n\nGET /medcl3/_search\n{\n \"query\": {\"match_phrase\": {\n   \"name.pinyin\": \"刘德h\"\n }}\n}\n\nGET /medcl3/_search\n{\n \"query\": {\"match_phrase\": {\n   \"name.pinyin\": \"刘dh\"\n }}\n}\n\nGET /medcl3/_search\n{\n \"query\": {\"match_phrase\": {\n   \"name.pinyin\": \"liudh\"\n }}\n}\n\nGET /medcl3/_search\n{\n \"query\": {\"match_phrase\": {\n   \"name.pinyin\": \"liudeh\"\n }}\n}\n\nGET /medcl3/_search\n{\n \"query\": {\"match_phrase\": {\n   \"name.pinyin\": \"liude华\"\n }}\n}\n\n\u003c/pre\u003e\n\n8.That's all, have fun.\n\n\n# Community\n\nFeel free to join the Discord server to discuss anything around this project: \n\n[https://discord.gg/4tKTMkkvVX](https://discord.gg/4tKTMkkvVX)\n\n# License\n\nCopyright ©️ INFINI Labs.\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not use this file except in compliance with the License.\nYou may obtain a copy of the License at\n\n    http://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the 
License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finfinilabs%2Fanalysis-pinyin","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Finfinilabs%2Fanalysis-pinyin","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finfinilabs%2Fanalysis-pinyin/lists"}