{"id":23857451,"url":"https://github.com/ekaputra07/ina-sms-classifier","last_synced_at":"2025-10-11T06:32:11.514Z","repository":{"id":37059986,"uuid":"202867493","full_name":"ekaputra07/ina-sms-classifier","owner":"ekaputra07","description":"A project to create a ML classification model for Indonesian text/sms messages using Tensorflow.","archived":false,"fork":false,"pushed_at":"2024-05-14T22:17:41.000Z","size":26022,"stargazers_count":1,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-02-22T10:26:47.707Z","etag":null,"topics":["classification-model","machine-learning","tensorflow2"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ekaputra07.png","metadata":{"files":{"readme":"Readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-08-17T10:33:44.000Z","updated_at":"2021-10-06T23:31:13.000Z","dependencies_parsed_at":"2025-02-22T10:25:36.818Z","dependency_job_id":"b659a678-9c8d-43ca-ada9-7777c514d534","html_url":"https://github.com/ekaputra07/ina-sms-classifier","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ekaputra07/ina-sms-classifier","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ekaputra07%2Fina-sms-classifier","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ekaputra07%2Fina-sms-classifier/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ekaputra07%2Fina-sms-classifier/releases","manifests_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ekaputra07%2Fina-sms-classifier/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ekaputra07","download_url":"https://codeload.github.com/ekaputra07/ina-sms-classifier/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ekaputra07%2Fina-sms-classifier/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267324400,"owners_count":24069391,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-27T02:00:11.917Z","response_time":82,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["classification-model","machine-learning","tensorflow2"],"created_at":"2025-01-03T02:55:13.447Z","updated_at":"2025-10-11T06:32:06.478Z","avatar_url":"https://github.com/ekaputra07.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ina-sms-classifier\n\nA project to create a machine learning model that classifies Indonesian text/SMS messages using [Tensorflow](https://www.tensorflow.org) and its [Keras](https://keras.io) API.\n\nThe main purpose is **to detect the scam/fraud SMS messages that mobile phone users in Indonesia often receive from unknown senders; many people have been reported as victims of this kind of fraud**.\n\n_Future plan_: the model can be transformed into [Tensorflow 
Lite](https://www.tensorflow.org/lite) and deployed in a mobile app that classifies text messages in real time as users receive them. Messages would not need to be sent to a model-serving server, which avoids privacy issues.\n\nFor now, it classifies messages into 4 classes:\n\n- Scam (0)\n- Online gambling website promotion (1)\n- Online loans website promotion (2)\n- Others (3)\n\nThanks to [laporsms.com](https://laporsms.com) for their effort in collecting all the data used in this project.\n\n## Usage\n\n### Create text tokenizer\n```\n\u003e\u003e python create_tokenizer.py -h\n\nusage: create_tokenizer.py [-h] --input INPUT [--text-column TEXT_COLUMN] [--max-words MAX_WORDS] --output OUTPUT\n\nCreate tokenizer object file\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --input INPUT         Input file to read (must be CSV file)\n  --text-column TEXT_COLUMN\n                        Name of the text column\n  --max-words MAX_WORDS\n                        Maximum number of words to use when tokenize sentences (default: 20000)\n  --output OUTPUT       Where to store the tokenizer object\n```\n\nExample:\n```\npython create_tokenizer.py \\\n--input dataset/sms-row.csv \\\n--output model/tokenizer.pkl \\\n--text-column message\n```\n\n### Train and save the model\n\n```\n\u003e\u003e python create_model.py -h\n\nusage: create_model.py [-h] --tokenizer TOKENIZER --dataset DATASET [--text-column TEXT_COLUMN] [--label-column LABEL_COLUMN] [--max-words MAX_WORDS] [--maxlen MAXLEN] [--emb-dim EMB_DIM] [--class-num CLASS_NUM]\n                       [--val-split VAL_SPLIT] [--test-split TEST_SPLIT] [--epochs EPOCHS] [--batch-size BATCH_SIZE] --output OUTPUT\n\nTrain and save model\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --tokenizer TOKENIZER\n                 
       Path to saved tokenizer\n  --dataset DATASET     Path to dataset file (must be CSV)\n  --text-column TEXT_COLUMN\n                        Name of the text column (default: text)\n  --label-column LABEL_COLUMN\n                        Name of the label column (default: label)\n  --max-words MAX_WORDS\n                        Max. number of words in vocabulary (must match tokenizer max-words, default: 20000)\n  --maxlen MAXLEN       Max. number of words per message to use in training (default: 50)\n  --emb-dim EMB_DIM     Words embedding dimension (default: 8)\n  --class-num CLASS_NUM\n                        Number of output classes (default: 4)\n  --val-split VAL_SPLIT\n                        Ratio of validation split (default: 0.2)\n  --test-split TEST_SPLIT\n                        Ratio of test split (default: 0.2)\n  --epochs EPOCHS       Training epochs (default: 10)\n  --batch-size BATCH_SIZE\n                        Training batch size (default: 512)\n  --output OUTPUT       Where to store the model\n```\n\nExample:\n```\npython create_model.py \\\n--tokenizer model/tokenizer.pkl \\\n--dataset dataset/sms-labeled-3k-clean.csv \\\n--text-column message \\\n--output model/latest \\\n--epochs 75\n```\n\nAt the end of training you'll be asked whether you want to save the model; if you answer yes, the model is saved to the path given by `--output` (`model/latest` in the example above).\n\n### Model performance from latest training\n\n*NOTE: the results below are based on training with the 2700 data points labeled so far, out of a total of 18K (labeling is not finished yet).*\n```\n================== VALIDATION ===================\nLOSS            : 0.13091\nACCURACY        : 0.94737\nPRECISION       : 0.96234\nRECALL          : 0.93117\nAUC             : 0.99760\n\n\n================== TEST ===================\nLOSS            : 0.21565\nACCURACY        : 0.93091\nPRECISION       : 0.94424\nRECALL          : 0.92364\nAUC             : 0.99164\n\nCONFUSION MATRIX:\n[[128   1   2   0]\n [  1  30   0   0]\n [  0   2  
80   3]\n [  9   0   1  18]]\n\nCLASSIFICATION REPORT:\n              precision    recall  f1-score   support\n\n           0       0.93      0.98      0.95       131\n           1       0.91      0.97      0.94        31\n           2       0.96      0.94      0.95        85\n           3       0.86      0.64      0.73        28\n\n    accuracy                           0.93       275\n   macro avg       0.91      0.88      0.89       275\nweighted avg       0.93      0.93      0.93       275\n```\n\n![Plot LOSS](https://github.com/ekaputra07/ina-sms-classifier/blob/master/plot_loss.png?raw=true)\n![Plot ACC](https://github.com/ekaputra07/ina-sms-classifier/blob/master/plot_acc.png?raw=true)\n\n### Development\n\nI recommend using [Conda]() to install the following dependencies:\n```\ntensorflow\nscikit-learn\npandas\nnumpy\nmatplotlib\nseaborn\n```\n\n### License\n```\nCopyright (C) 2020  Eka Putra\n\nThis program is free software: you can redistribute it and/or modify\nit under the terms of the GNU General Public License as published by\nthe Free Software Foundation, either version 3 of the License, or\n(at your option) any later version.\n\nThis program is distributed in the hope that it will be useful,\nbut WITHOUT ANY WARRANTY; without even the implied warranty of\nMERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\nGNU General Public License for more details.\n\nYou should have received a copy of the GNU General Public License\nalong with this program.  If not, see \u003chttp://www.gnu.org/licenses/\u003e.\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fekaputra07%2Fina-sms-classifier","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fekaputra07%2Fina-sms-classifier","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fekaputra07%2Fina-sms-classifier/lists"}