{"id":13652989,"url":"https://github.com/rkcosmos/deepcut","last_synced_at":"2025-05-16T15:03:02.878Z","repository":{"id":57418029,"uuid":"95091660","full_name":"rkcosmos/deepcut","owner":"rkcosmos","description":"A Thai word tokenization library using Deep Neural Network","archived":false,"fork":false,"pushed_at":"2020-10-23T10:36:02.000Z","size":11815,"stargazers_count":426,"open_issues_count":6,"forks_count":98,"subscribers_count":31,"default_branch":"master","last_synced_at":"2025-05-06T06:12:44.360Z","etag":null,"topics":["deep-learning","deep-neural-networks","keras","keras-tensorflow","python","segmentation","tensorflow","thai"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rkcosmos.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-06-22T08:17:45.000Z","updated_at":"2025-05-04T02:56:47.000Z","dependencies_parsed_at":"2022-09-03T08:51:43.654Z","dependency_job_id":null,"html_url":"https://github.com/rkcosmos/deepcut","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rkcosmos%2Fdeepcut","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rkcosmos%2Fdeepcut/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rkcosmos%2Fdeepcut/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rkcosmos%2Fdeepcut/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rkcosmos","download_url":"https://codeload.github.com/rkcosmos/deepcut/tar.gz/refs/
heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252631386,"owners_count":21779427,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","deep-neural-networks","keras","keras-tensorflow","python","segmentation","tensorflow","thai"],"created_at":"2024-08-02T02:01:04.667Z","updated_at":"2025-05-16T15:03:02.830Z","avatar_url":"https://github.com/rkcosmos.png","language":"Python","readme":"# Deepcut\n\n[![License](https://img.shields.io/badge/license-MIT-blue.svg?style=flat)](https://github.com/rkcosmos/deepcut/blob/master/LICENSE) [![DOI](https://zenodo.org/badge/95091660.svg)](https://zenodo.org/badge/latestdoi/95091660)\n\nA Thai word tokenization library using Deep Neural Network.\n\n![model_structure](https://user-images.githubusercontent.com/1214890/58486992-14c1d880-8191-11e9-9122-8385750e06bd.png)\n\n## What's new\n\n* `v0.7.0` Migrated from Keras to TensorFlow 2.0\n* `v0.6.0` Allowed excluding stop words and adding a custom dictionary; updated weights with semi-supervised learning\n* `v0.5.2` Better pretrained weight matrix\n* `v0.5.1` Faster tokenization through code refactoring\n* `examples` folder provides starter scripts for a Thai text classification problem\n* `DeepcutJS`: try tokenizing Thai text in your web browser [here](https://rkcosmos.github.io/deepcut/)\n\n## Performance\n\nThe convolutional neural network is trained on 90% of NECTEC's BEST corpus (consisting of four sections: article, news, novel, and encyclopedia) and tested on the remaining 10%. It is a binary classification model that predicts whether each character is the beginning of a word. The results below are calculated on the 'true' class only.\n\n| Precision | Recall |   F1   |\n| --------- | ------ | ------ |\n| 97.8%     | 98.5%  | 98.1%  |\n\n## Installation\n\nInstall the stable release (TensorFlow 2.0) using `pip`,\n\n``` bash\npip install deepcut\n```\n\nFor the latest development release (recommended),\n\n``` bash\npip install git+https://github.com/rkcosmos/deepcut.git\n```\n\nIf you want to use TensorFlow 1.x and standalone Keras, you will need\n\n``` bash\npip install deepcut==0.6.1\n```\n\n### Docker\n\nFirst, install and run [`docker`](https://www.docker.com/get-started) on your machine. Then, you can build and run `deepcut` as follows\n\n``` bash\ndocker build -t deepcut:dev . # build the docker image\ndocker run --rm -it deepcut:dev # run docker; -it makes it interactive, --rm cleans up the container and removes its file system on exit\n```\n\nThis will open a shell for us to play with `deepcut`.\n\n## Usage\n\n``` python\nimport deepcut\ndeepcut.tokenize('ตัดคำได้ดีมาก')\n```\n\nThe output is a list of tokens\n\n``` bash\n['ตัดคำ','ได้','ดี','มาก']\n```\n\n### Bag-of-words transformation\n\nWe implemented a tokenizer that works similarly to [`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) from `scikit-learn`. Here is an example usage:\n\n``` python\nfrom deepcut import DeepcutTokenizer\ntokenizer = DeepcutTokenizer(ngram_range=(1, 1),\n                             max_df=1.0, min_df=0.0)\nX = tokenizer.fit_tranform(['ฉันบินได้', 'ฉันกินข้าว', 'ฉันอยากบิน']) # 3 x 6 CSR sparse matrix\nprint(tokenizer.vocabulary_) # {'บิน': 0, 'ได้': 1, 'ฉัน': 2, 'อยาก': 3, 'ข้าว': 4, 'กิน': 5}, column indices of the sparse matrix\n\nX_test = tokenizer.transform(['ฉันกิน', 'ฉันไม่อยากบิน']) # use the fitted tokenizer's vocabulary to transform new text\nprint(X_test.shape) # 2 x 6 CSR sparse matrix\n\ntokenizer.save_model('tokenizer.pickle') # save the tokenizer for later use\n```\n\nYou can load the saved tokenizer later\n\n``` python\ntokenizer = deepcut.load_model('tokenizer.pickle')\nX_sample = tokenizer.transform(['ฉันกิน', 'ฉันไม่อยากบิน'])\nprint(X_sample.shape) # the same 2 x 6 CSR sparse matrix as X_test\n```\n\n### Custom Dictionary\n\nYou can add a custom dictionary by providing the path to a `.txt` file with one word per line, like the following.\n\n``` bash\nขี้เกียจ\nโรงเรียน\nดีมาก\n```\n\nThe file path can be passed as the `custom_dict` argument of the `tokenize` function, e.g.\n\n``` python\ndeepcut.tokenize('ตัดคำได้ดีมาก', custom_dict='/path/to/custom_dict.txt')\ndeepcut.tokenize('ตัดคำได้ดีมาก', custom_dict=['ดีมาก']) # alternatively, you can provide a list of custom words\n```\n\n## Notes\n\nSome texts might not be segmented as we would expect (e.g. 'โรงเรียน' -\u003e ['โรง', 'เรียน']). This happens because\n\n* the BEST corpus (training data) tokenizes words this way (it uses 'compound words' as a criterion for segmentation), or\n* they are unseen/new words. Ideally, this would be cured by having a better corpus, but that is not very practical, so we are considering semi-supervised learning to incorporate new examples.\n\nAny suggestions and comments are welcome; please post them in the issue section.\n\n## Contributors\n\n* [Rakpong Kittinaradorn](https://github.com/rkcosmos)\n* [Korakot Chaovavanich](https://github.com/korakot)\n* [Titipat Achakulvisut](https://github.com/titipata)\n* [Chanwit Kaewkasi](https://github.com/chanwit)\n\n## Citations\n\nIf you use `deepcut` in your project or publication, please cite the library as follows\n\n``` bash\nRakpong Kittinaradorn, Titipat Achakulvisut, Korakot Chaovavanich, Kittinan Srithaworn,\nPattarawat Chormai, Chanwit Kaewkasi, Tulakan Ruangrong, Krichkorn Oparad.\n(2019, September 23). DeepCut: A Thai word tokenization library using Deep Neural Network. Zenodo. http://doi.org/10.5281/zenodo.3457707\n```\n\nor the BibTeX entry:\n\n``` bib\n@misc{Kittinaradorn2019,\n    author       = {Rakpong Kittinaradorn and Titipat Achakulvisut and Korakot Chaovavanich and Kittinan Srithaworn and Pattarawat Chormai and Chanwit Kaewkasi and Tulakan Ruangrong and Krichkorn Oparad},\n    title        = {{DeepCut: A Thai word tokenization library using Deep Neural Network}},\n    month        = sep,\n    year         = 2019,\n    doi          = {10.5281/zenodo.3457707},\n    version      = {1.0},\n    publisher    = {Zenodo},\n    url          = {http://doi.org/10.5281/zenodo.3457707}\n}\n```\n\n## Partner Organizations\n\n* True Corporation\n\nWe are open to contribution and collaboration.\n","funding_links":[],"categories":["Text processor","Uncategorized","Libraries/Services","ไลบรารี่"],"sub_categories":["Uncategorized","Word Segmentation"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frkcosmos%2Fdeepcut","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frkcosmos%2Fdeepcut","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frkcosmos%2Fdeepcut/lists"}