Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/ekaputra07/ina-sms-classifier

A project to create a ML classification model for Indonesian text/sms messages using Tensorflow.
https://github.com/ekaputra07/ina-sms-classifier

classification-model machine-learning tensorflow2

Last synced: 5 days ago
JSON representation

A project to create a ML classification model for Indonesian text/sms messages using Tensorflow.

Host: GitHub
URL: https://github.com/ekaputra07/ina-sms-classifier
Owner: ekaputra07
Created: 2019-08-17T10:33:44.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2022-07-29T22:29:14.000Z (over 2 years ago)
Last Synced: 2023-03-25T21:08:01.295Z (almost 2 years ago)
Topics: classification-model, machine-learning, tensorflow2
Language: Jupyter Notebook
Homepage:
Size: 24.8 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: Readme.md

Awesome Lists containing this project

README

        # ina-sms-classifier

A project to create Machine Learning model to classify Indonesian text/sms messages using [Tensorflow](https://www.tensorflow.org) and its [Keras](https://keras.io) api.

The main puspose is **to be able to detect scam/fraud SMS that often received by mobile phone users in Indonesia from unknown person and many have been reported to be victims of this kind of fraud activity**.

_Future plan_: the model can be transformed into [Tensorflow Lite](https://www.tensorflow.org/lite) and can be deployed as a mobile app that classify text message in real-time as it received by users. No need to send the message to model serving server to avoid privacy issue.

For now, it will classify messages into 4 classes:

- Scam (0)

- Online gambling website promotion (1)

- Online loans website promotion (2)

- Others (3)

Thanks to [laporsms.com](https://laporsms.com) for their effort collecting all the data that I've been using in this project.

## Usage

### Create text tokenizer

```

>> python create_tokenizer.py -h

usage: create_tokenizer.py [-h] --input INPUT [--text-column TEXT_COLUMN] [--max-words MAX_WORDS] --output OUTPUT

Create tokenizer object file

optional arguments:

  -h, --help            show this help message and exit

  --input INPUT         Input file to read (must be CSV file)

  --text-column TEXT_COLUMN

                        Name of the text column

  --max-words MAX_WORDS

                        Maximum number of words to use when tokenize sentences (default: 20000)

  --output OUTPUT       Where to store the tokenizer object

```

Example:

```

python create_tokenizer.py \

--input dataset/sms-row.csv \

--output model/tokenizer.pkl \

--text-column message

```

### Train and save the model

```

>> python create_model.py -h                                                                                                                                  

usage: create_model.py [-h] --tokenizer TOKENIZER --dataset DATASET [--text-column TEXT_COLUMN] [--label-column LABEL_COLUMN] [--max-words MAX_WORDS] [--maxlen MAXLEN] [--emb-dim EMB_DIM] [--class-num CLASS_NUM]

                       [--val-split VAL_SPLIT] [--test-split TEST_SPLIT] [--epochs EPOCHS] [--batch-size BATCH_SIZE] --output OUTPUT

Train and save model

optional arguments:

  -h, --help            show this help message and exit

  --tokenizer TOKENIZER

                        Path to saved tokenizer

  --dataset DATASET     Path to dataset file (must be CSV)

  --text-column TEXT_COLUMN

                        Name of the text column (default: text)

  --label-column LABEL_COLUMN

                        Name of the label column (default: label)

  --max-words MAX_WORDS

                        Max. number of words in vocabulary (must match tokenizer max-words, default: 20000)

  --maxlen MAXLEN       Max. number of words per message to use in training (default: 50)

  --emb-dim EMB_DIM     Words embedding dimension (default: 8)

  --class-num CLASS_NUM

                        Number of output classes (default: 4)

  --val-split VAL_SPLIT

                        Ratio of validation split (default: 0.2)

  --test-split TEST_SPLIT

                        Ratio of test split (default: 0.2)

  --epochs EPOCHS       Training epochs (default: 10)

  --batch-size BATCH_SIZE

                        Training batch size (default: 512)

  --output OUTPUT       Where to store the model

```

Example:

```

python create_model.py \

--tokenizer model/tokenizer.pkl \

--dataset dataset/sms-labeled-3k-clean.csv \

--text-column message \

--output model/latest

--epochs 75

```

At the end of the training you'll be asked whether you want to save the model, if yes then the model will be saved to `/model/latest`

### Model performance from latest training

*NOTE: below results are based on training 2700 of datapoints that are labeled from total of 18K (labeling all of them not finish yet).*

```

================== VALIDATION ===================

LOSS            : 0.13091

ACCURACY        : 0.94737

PRECISION       : 0.96234

RECALL          : 0.93117

AUC             : 0.99760

================== TEST ===================

LOSS            : 0.21565

ACCURACY        : 0.93091

PRECISION       : 0.94424

RECALL          : 0.92364

AUC             : 0.99164

CONFUSION MATRIX:

[[128   1   2   0]

 [  1  30   0   0]

 [  0   2  80   3]

 [  9   0   1  18]]

CLASSIFICATION REPORT:

              precision    recall  f1-score   support

           0       0.93      0.98      0.95       131

           1       0.91      0.97      0.94        31

           2       0.96      0.94      0.95        85

           3       0.86      0.64      0.73        28

    accuracy                           0.93       275

   macro avg       0.91      0.88      0.89       275

weighted avg       0.93      0.93      0.93       275

```

![Plot LOSS](https://github.com/ekaputra07/ina-sms-classifier/blob/master/plot_loss.png?raw=true)

![Plot ACC](https://github.com/ekaputra07/ina-sms-classifier/blob/master/plot_acc.png?raw=true)

### Development

I recommends you to install all the dependencies using [Conda]() and install the following libraries:

```

tensorflow

scikit-learn

pandas

numpy

matplotlib

seaborn

```

### License

```

Copyright (C) 2020  Eka Putra

This program is free software: you can redistribute it and/or modify

it under the terms of the GNU General Public License as published by

the Free Software Foundation, either version 3 of the License, or

(at your option) any later version.

This program is distributed in the hope that it will be useful,

but WITHOUT ANY WARRANTY; without even the implied warranty of

MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

GNU General Public License for more details.

You should have received a copy of the GNU General Public License

along with this program.  If not, see .

```