{"id":13652997,"url":"https://github.com/wittawatj/jtcc","last_synced_at":"2025-04-23T06:31:01.559Z","repository":{"id":28312188,"uuid":"31824955","full_name":"wittawatj/jtcc","owner":"wittawatj","description":"Java library to tokenize Thai text into a list of TCCs","archived":false,"fork":false,"pushed_at":"2017-05-30T12:24:58.000Z","size":363,"stargazers_count":18,"open_issues_count":2,"forks_count":8,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-11-10T04:36:19.081Z","etag":null,"topics":["java","natural-language-processing","thai-nlp"],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wittawatj.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-03-07T19:46:09.000Z","updated_at":"2024-10-21T13:18:29.000Z","dependencies_parsed_at":"2022-09-01T03:00:35.463Z","dependency_job_id":null,"html_url":"https://github.com/wittawatj/jtcc","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wittawatj%2Fjtcc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wittawatj%2Fjtcc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wittawatj%2Fjtcc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wittawatj%2Fjtcc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/wittawatj","download_url":"https://codeload.github.com/wittawatj/jtcc/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github"
,"repositories_count":250384804,"owners_count":21421794,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["java","natural-language-processing","thai-nlp"],"created_at":"2024-08-02T02:01:04.725Z","updated_at":"2025-04-23T06:30:59.773Z","avatar_url":"https://github.com/wittawatj.png","language":"Java","funding_links":[],"categories":["Uncategorized","Libraries/Services","自然語言處理-泰語","Word \u0026 Syllable Segmentation"],"sub_categories":["Uncategorized","Thai Character Cluster","函式庫"],"readme":"# JTCC\n\n**JTCC** is a Java library to tokenize Thai text into a list of TCCs. The rules\nused to determine TCC boundaries are implemented as a grammar using [ANTLR](http://www.antlr.org/).\n\n\n\n## What is a TCC?\n\nA TCC, or *Thai Character Cluster* (proposed in [Character Cluster Based Thai Information Retrieval](http://portal.acm.org/citation.cfm?id=355225)), is a group of inseparable Thai characters. This\ninseparability derives from the Thai writing system and is independent of any\ncontext. As a result, TCCs can be determined by a simple list of rules\ndescribing, e.g., which characters must follow or precede other characters. \n\n\n## TCC Examples \n\n * Input: ฉันฝากขวดขี้ผึ้งใส่ถุงให้เศรษฐี\n * Output TCCs: ฉัน|ฝา|ก|ข|ว|ด|ขี้|ผึ้|ง|ใส่|ถุ|ง|ให้|เศ|ร|ษ|ฐี|\n\n * Input: สะช้ะมาบ้ากิถิ้บีดี้ขึงทึ่งรือขื่อกุตุ้บสูตู่เละเส๊ะเขเป้\n * Output TCCs: สะ|ช้ะ|มา|บ้า|กิ|ถิ้|บี|ดี้|ขึ|ง|ทึ่|ง|รือ|ขื่อ|กุ|ตุ้|บ|สู|ตู่|เละ|เส๊ะ|เข|เป้|\n\nNote that the delimiter is placed only at the end of each TCC. \n\n## Applications of TCCs \nA TCC by itself is of no direct use to end users. 
TCCs are mostly used within a larger\nnatural language processing system, as the first step of processing\ninput text. An obvious merit of TCCs is that they can be used to eliminate\nimpossible word boundary positions in running text. \n\n## Program Usage \n\nCalling JTCC from the command line is as simple as calling a normal executable\nJAR file. Command-line JTCC has 3 modes:\n * Tokenize input from stdin\n * Tokenize the content of a file\n * Tokenize the string passed as a command-line argument\n\nThe general usage format is \n\n    java -jar JTCC-x.x.jar \u003cmode_keyword\u003e [argument] \n\nReplace x.x with the version of JTCC in use.\n\n### Tokenize input from stdin\n\nExample:\n\n    echo \"Some input here\" | java -jar JTCC-x.x.jar stdin\n\nThis tokenizes the input passed from stdin and writes the result to stdout\n(the screen, by default). \n\n### Tokenize the content of a file\n\nExample:\n\n    java -jar JTCC-x.x.jar file C:/thaitext.txt\n\nThis tokenizes the content of the file at the path C:/thaitext.txt and outputs to the screen. \n\n### Tokenize a specified input string\n\nExample:\n\n    java -jar JTCC-x.x.jar content \"ตรงนี้เป็นเนื้อหาที่ต้องการตัด TCC. Content to tokenize into TCCs here.\" \n\nThis tokenizes whatever string comes after the keyword \"content\" and outputs to the screen.\n\n## Note \nJTCC is not a mature project, nor does it provide a standard way of grouping\ninseparable Thai characters. \n\nThe term _inseparable_ is, in fact, ambiguous in some cases. For example, given\nthe input \"ถุงให้\", the original definition of TCC yields the output TCCs\n\"ถุ|ง|ให้|\". However, some might argue that the delimiter after \"ถุ\" can\nbe removed without much effort, giving \"ถุง|ให้|\". One way to do so\nmight be to look ahead one more character. In this case, it is \"ใ\". 
Since \"ใ\"\ncannot be grouped with \"ง\" (i.e., it is impossible to have \"งใ\"), it might\nbe tempting to group \"ง\" with the previous TCC, thus forming \"ถุง\".\n\nI agree that this argument makes sense. But be reminded that the goal of this\nproject is to create a library capable of tokenizing an input text into TCCs.\nThe idea mentioned above seems to go beyond TCC (probably to the syllable level).\nTherefore, we will stick with the global context-independent TCC tokenizing\nrules for now. At least, the look-ahead strategy described above will not be\nimplemented in the near future.\n\n## License \n![GPL v3](http://www.gnu.org/graphics/gplv3-127x51.png \"GPL v3\")\n\n    JTCC is a Java package for tokenizing Thai text into a list of TCCs.\n    Copyright (C) 2010 Wittawat Jitkrittum\n\n    JTCC is free software: you can redistribute it and/or modify\n    it under the terms of the GNU General Public License as published by\n    the Free Software Foundation, either version 3 of the License, or\n    (at your option) any later version.\n\n    This program is distributed in the hope that it will be useful,\n    but WITHOUT ANY WARRANTY; without even the implied warranty of\n    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n    GNU General Public License for more details.\n\n    You should have received a copy of the GNU General Public License\n    along with this program.  If not, see http://www.gnu.org/licenses/.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwittawatj%2Fjtcc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwittawatj%2Fjtcc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwittawatj%2Fjtcc/lists"}