# mitiBot
A graph-based machine learning approach to bot mitigation systems.

### Datasets

Just run `setup.sh` to download the `training` and `testing` datasets. They are downloaded into the `./datasets` folder.
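
For example, from the repository root (assuming a POSIX shell):

```
$ sh setup.sh
```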

### Using the data files

```
b = Build(['42.csv', '43.csv', '46.csv', '47.csv', '48.csv', '52.csv', '53.csv'])
```

Just pass the file names; `Build` will read the files from the `./datasets` directory and load the data.
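
For illustration only, a minimal sketch of what such a loader might do; the class name `BuildSketch` and its internals are assumptions, not the repo's actual `Build`:

```
# Hypothetical Build-like loader; the real Build class may differ.
# Assumes each file name refers to a CSV inside ./datasets.
import csv
import os

class BuildSketch:
    def __init__(self, filenames, data_dir='./datasets'):
        self.rows = []
        for name in filenames:
            with open(os.path.join(data_dir, name), newline='') as f:
                self.rows.extend(csv.reader(f))  # accumulate raw records

b = BuildSketch(['42.csv', '43.csv'])
print(len(b.rows))  # number of records loaded
```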

### e2e mode

To perform `training` followed by `testing`, use the `--e2e` flag.

```
python3 model.py --e2e
```

Configuration used in the e2e mode:

```
# Training dataset

b = Build(['42.csv', '43.csv', '46.csv', '47.csv', '48.csv', '52.csv', '53.csv'])
b.data = b.build_train_set(b.non_bot_tuples, b.bot_tuples)
b.preprocess()

train_p1()  # phase 1: unsupervised clustering
train_p2()  # phase 2: supervised classification

# Testing dataset

t = Build(['50.csv', '51.csv'])
t.data = t.build_test_set(t.non_bot_tuples, t.bot_tuples, 50)
t.preprocess()

test()
```

Total time:
```
Avg: ~45m
```

### k-fold mode

Perform K-fold cross-validation on the 9 datasets using the `--kfold` flag.

We use:

```
datasets = ['42.csv', '43.csv', '46.csv', '47.csv', '48.csv', '50.csv', '51.csv', '52.csv', '53.csv']
```

In each iteration, one dataset is held out for testing and the rest are used for training, as sketched below.
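
A hypothetical sketch of this leave-one-dataset-out loop (`run_fold` is a placeholder, not a function in `model.py`):

```
# Hypothetical sketch of the leave-one-dataset-out cross-validation above.
# run_fold() is a placeholder for the train-on-8 / test-on-1 cycle in model.py.
datasets = ['42.csv', '43.csv', '46.csv', '47.csv', '48.csv',
            '50.csv', '51.csv', '52.csv', '53.csv']

def run_fold(train_files, test_file):
    # placeholder: train both phases on train_files, then evaluate on test_file
    return 0.0

accuracies = [run_fold([d for d in datasets if d != held_out], held_out)
              for held_out in datasets]
print(sum(accuracies) / len(accuracies))  # average accuracy over the 9 folds
```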

```
$ python3 model.py --kfold
```

It takes about `8 hours` in total to complete! (Check the [logs](https://github.com/CodHeK/mitiBot/blob/master/kfold.logs).)

![kfold-output](screenshots/kfold-output.png)

At the end, it prints the average accuracy for Logistic Regression and for Naive Bayes, each using DBSCAN in phase 1.

DBSCAN + LR | DBSCAN + NB
:-------------------------:|:-------------------------:
97.46% | 97.29%

### Training

You can train the model in two ways, as training has two phases: PHASE 1 (UNSUPERVISED) and PHASE 2 (SUPERVISED).

This will perform both phases one after the other:
```
python3 model.py --train
```

If you want to perform the two phases separately (given that the feature vectors are already saved in `f.json` and `fvecs.json`):

```
python3 model.py --phase1
```

and

```
python3 model.py --phase2
```
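
To make the two phases concrete, here is a minimal hypothetical sketch of the idea, with scikit-learn standing in for whatever `model.py` actually uses and random arrays standing in for the real feature vectors:

```
# Hypothetical two-phase sketch: DBSCAN clustering (phase 1) feeding a
# Logistic Regression classifier (phase 2). Not the repo's actual pipeline.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.linear_model import LogisticRegression

X = np.random.rand(200, 4)        # stand-in feature vectors
y = np.random.randint(0, 2, 200)  # stand-in bot / non-bot labels

phase1 = DBSCAN(eps=0.4, min_samples=4).fit(X)  # phase 1: unsupervised
X2 = np.column_stack([X, phase1.labels_])       # cluster id as an extra feature

phase2 = LogisticRegression(max_iter=1000).fit(X2, y)  # phase 2: supervised
print(phase2.score(X2, y))
```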

Once trained, the model is pickled and saved in the `saved` folder, which is then used for testing.

NOTE:

You can directly use the saved feature vectors (stored in JSON format in the `saved_train` folder) and train only `phase2`, in order to speed up the training process!

The saved weights above were trained on the following data files: `['42.csv', '43.csv', '46.csv', '47.csv', '48.csv', '52.csv', '53.csv']`. If you want to modify this, you'll have to train `phase1` first; its weights, once trained, will be saved in the `saved` folder.

### Testing

The command below uses the pre-trained classifier saved as a pickle file in the `saved` folder.
```
python3 model.py --test
```
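
A minimal hypothetical sketch of reusing such a pickle; the file name `clf.pkl` is an assumption, so check the `saved` folder for the actual names:

```
# Hypothetical sketch of loading a pre-trained classifier from ./saved.
# 'clf.pkl' is an assumed file name, not necessarily what model.py writes.
import pickle

with open('saved/clf.pkl', 'rb') as f:
    clf = pickle.load(f)

# predictions = clf.predict(feature_vectors)  # then classify new test flows
```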

### Cluster size maps

KMeans (n_clusters=2, random_state=0) | DBSCAN (eps=0.4, min_samples=4)
:-------------------------:|:-------------------------:
![cluster_png](screenshots/kmeans.png) | ![cluster_png](screenshots/dbscan.png)

### DBSCAN + Naive Bayes Classifier

Tested on the data files `50.csv` and `51.csv`.

Test run:

![test50_51](screenshots/test_db_nb.png)

Test time:
```
Avg: ~7m
```

### DBSCAN + Logistic Regression Classifier

Tested on the data files `50.csv` and `51.csv`.

Test run:

![test50_51](screenshots/test_db_lr.png)

Test time:
```
Avg: ~6m
```

### Experimenting

Using only unsupervised learning as our learning technique, we get:

#### DBSCAN (eps=1.0, min_samples=4)

Testing the clustering on the data files `50.csv` and `51.csv`.

![dbscan_exp](screenshots/dbscan_exp.png)

### Using various numbers of clusters for KMeans

Testing the clustering on the data files `50.csv` and `51.csv`.

![kmeans_exp](screenshots/kmeans_exp.png)