{"id":13589160,"url":"https://github.com/seanghay/khmerpunctuate","last_synced_at":"2025-09-06T15:51:08.582Z","repository":{"id":225110648,"uuid":"693275082","full_name":"seanghay/khmerpunctuate","owner":"seanghay","description":"Punctuation Restoration for Khmer language","archived":false,"fork":false,"pushed_at":"2024-07-23T04:59:10.000Z","size":2820,"stargazers_count":4,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-07T03:47:27.050Z","etag":null,"topics":["khmer","khmer-language","khmer-punct","punctuation-restoration","sentence-segmentation","xlm-roberta"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/seanghay.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-09-18T17:41:56.000Z","updated_at":"2024-07-30T22:37:43.000Z","dependencies_parsed_at":"2024-11-06T08:35:07.100Z","dependency_job_id":"ecb5c76b-1215-4c26-b4e9-72e243c33e81","html_url":"https://github.com/seanghay/khmerpunctuate","commit_stats":null,"previous_names":["seanghay/khmerpunctuate"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/seanghay/khmerpunctuate","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seanghay%2Fkhmerpunctuate","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seanghay%2Fkhmerpunctuate/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seanghay%2Fkhmerpunctuate/releases","manifests_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/repositories/seanghay%2Fkhmerpunctuate/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/seanghay","download_url":"https://codeload.github.com/seanghay/khmerpunctuate/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seanghay%2Fkhmerpunctuate/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264379148,"owners_count":23598814,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["khmer","khmer-language","khmer-punct","punctuation-restoration","sentence-segmentation","xlm-roberta"],"created_at":"2024-08-01T16:00:24.288Z","updated_at":"2025-07-13T04:32:11.016Z","avatar_url":"https://github.com/seanghay.png","language":"Python","funding_links":[],"categories":["Awesome Khmer Language"],"sub_categories":["2. 
Toolkit"],"readme":"## Punctuation Restoration for Khmer language\n\nBuilt with [[xashru/punctuation-restoration]](https://github.com/xashru/punctuation-restoration) using [[xlm-roberta-khmer-small]](https://huggingface.co/seanghay/xlm-roberta-khmer-small) and then exported to `onnxruntime`\n\n### Features\n- Whitespaces Prediction\n- Sentence Segmentation\n- Punctuation Prediction\n- Number Entity Prediction\n\n### Install\n\n```shell\npip install khmerpunctuate\n\n# Or\npip install git+https://github.com/seanghay/khmerpunctuate.git\n```\n\n### Usage\n\nSupported token types are\n\n```python\n{\n  0: \"\",\n  1: \" \",\n  2: \"!\",\n  3: \"។\",\n  4: \"?\",\n  5: \"៖\",\n  6: \"។\\n\",\n  7: \"B-NUMBER\",\n  8: \"I-NUMBER\",\n  9: \"B-QUOTE\",\n  10: \"I-QUOTE\",\n}\n```\n\n```python\nfrom khmernormalizer import normalize\nfrom khmercut import tokenize\nfrom khmerpunctuate import punctuate\n\ntext = normalize(\"អយ្យការអមសាលាដំបូងរាជធានីភ្នំពេញបានព្រមានថានឹងចេញដីកាបញ្ជាឲ្យបង្ខំនិងឲ្យឃុំខ្លួនតាមនីតិវិធីប្រសិនបើលោករ៉ុងឈុនដែលបច្ចុប្បន្នជាទីប្រឹក្សាគណបក្សកម្លាំងជាតិមិនបានបង់ប្រាក់ពិន័យចំនួន២លានរៀលឲ្យបានមុនថ្ងៃទី០៤ខែមីនាឆ្នាំ២០២៤ទេនោះ\")\ntokens = tokenize(text)\n\noutput_text = \"\"\nfor token, punct, punct_id in punctuate(tokens):\n  # exclude special tokens like I-NUMBER, B-NUMBER, I-QUOTE and B-QUOTE\n  if punct_id \u003c 7:\n    output_text += token + punct\n  else:\n    output_text += token\n\nprint(output_text)\n```\n\n```\nអយ្យការអមសាលាដំបូងរាជធានីភ្នំពេញ បានព្រមានថា នឹងចេញដីកាបញ្ជាឱ្យបង្ខំ និងឱ្យឃុំខ្លួនតាមនីតិវិធី ប្រសិនបើលោក រ៉ុង ឈុន ដែលបច្ចុប្បន្នជាទីប្រឹក្សាគណបក្សកម្លាំងជាតិ មិនបានបង់ប្រាក់ពិន័យចំនួន២លានរៀល ឱ្យបានមុនថ្ងៃទី០៤ខែមីនា ឆ្នាំ២០២៤ទេនោះ \n```\n\n\n### Example\n\nThe example below is available on [[Google Colab]](https://colab.research.google.com/drive/18lHUdJGHD55TTklwWz4d6CNOVfRYMoFG?usp=sharing)\n\nModel file is hosted on [[HuggingFace]](https://huggingface.co/seanghay/khmer-punctuation-restore)\n\n\n### Evaluation\n\n**XLM RoBERTa Khmer: (49M 
params)**\n\n\n| Metric    | 1          | 2          | 3          | 4          | 5          | 6          | 7          | 8          | 9          | 10         | 11         | 12         |\n|-----------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|\n| Precision | 0.95528402 | 0.79168481 | 0.85507246 | 0.74523436 | 0.7877551  | 0.79452055 | 0.62296801 | 0.96415685 | 0.98617407 | 0.67324778 | 0.57505285 | 0.8240493  |\n| Recall    | 0.96957471 | 0.73475191 | 0.13947991 | 0.86194329 | 0.69010727 | 0.63736264 | 0.08452508 | 0.96852034 | 0.99192858 | 0.22035541 | 0.21068939 | 0.77592102 |\n| F1 score  | 0.96237631 | 0.76215662 | 0.2398374  | 0.79935128 | 0.73570521 | 0.70731707 | 0.14885353 | 0.96633367 | 0.98904296 | 0.33203505 | 0.30839002 | 0.79926129 |\n\nAccuracy: 0.930086988701306\n\n\n---\n\n**XLM RoBERTa Base (279M params)**\n\n| Metric    | 1          | 2          | 3          | 4          | 5          | 6          | 7          | 8          | 9          | 10         | 11         | 12         |\n|-----------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|\n| Precision | 0.96143204 | 0.82657744 | 0.88399072 | 0.79077633 | 0.82349285 | 0.85393258 | 0.55724225 | 0.96397178 | 0.98844483 | 0.72191436 | 0.67759563 | 0.8508466  |\n| Recall    | 0.97304725 | 0.77059714 | 0.45035461 | 0.90182234 | 0.78963051 | 0.83516484 | 0.18804696 | 0.97943409 | 0.99381541 | 0.46300485 | 0.43222308 | 0.81077656 |\n| F1 score  | 0.96720478 | 0.79760625 | 0.59671104 | 0.84265665 | 0.80620627 | 0.84444444 | 0.28120013 | 0.97164142 | 0.99112284 | 0.56417323 | 0.52778435 | 0.83032843 |\n| Accuracy  | 0.9399183767909306 |            |            |            |            |            |            |            |            |            |            |            |\n\n\n### License\n\n`MIT`\n\n\n### Citation\n\n```bibtex\n@inproceedings{alam-etal-2020-punctuation,\n    title = \"Punctuation Restoration using Transformer Models for High-and 
Low-Resource Languages\",\n    author = \"Alam, Tanvirul  and\n      Khan, Akib  and\n      Alam, Firoj\",\n    booktitle = \"Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)\",\n    month = nov,\n    year = \"2020\",\n    address = \"Online\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://www.aclweb.org/anthology/2020.wnut-1.18\",\n    pages = \"132--142\",\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fseanghay%2Fkhmerpunctuate","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fseanghay%2Fkhmerpunctuate","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fseanghay%2Fkhmerpunctuate/lists"}