{"id":15627444,"url":"https://github.com/cstub/ml-ids","last_synced_at":"2025-10-11T00:13:20.479Z","repository":{"id":54349559,"uuid":"196524941","full_name":"cstub/ml-ids","owner":"cstub","description":"A machine learning based Intrusion Detection System","archived":false,"fork":false,"pushed_at":"2019-12-11T13:36:25.000Z","size":36077,"stargazers_count":146,"open_issues_count":5,"forks_count":58,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-09-01T20:45:16.575Z","etag":null,"topics":["intrusion-detection-systems","machine-learning"],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cstub.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-07-12T06:51:03.000Z","updated_at":"2025-08-29T08:40:35.000Z","dependencies_parsed_at":"2022-08-13T13:00:18.697Z","dependency_job_id":null,"html_url":"https://github.com/cstub/ml-ids","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/cstub/ml-ids","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cstub%2Fml-ids","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cstub%2Fml-ids/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cstub%2Fml-ids/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cstub%2Fml-ids/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cstub","download_url":"https://codeload.github.com/cstub/ml-ids/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/cstub%2Fml-ids/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279005649,"owners_count":26083940,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-10T02:00:06.843Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["intrusion-detection-systems","machine-learning"],"created_at":"2024-10-03T10:16:58.039Z","updated_at":"2025-10-11T00:13:20.464Z","avatar_url":"https://github.com/cstub.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# A machine learning based approach towards building an Intrusion Detection System\n\n## Problem Description\nWith the rising amount of network enabled devices connected to the internet such as mobile phones, IOT appliances or vehicles the concern about the security implications of using these devices is growing. 
The increase in the number and variety of networked devices inevitably leads to a wider attack surface, while the impact of successful attacks is becoming increasingly severe as more critical responsibilities are assumed by these devices.\n\nTo identify and counter network attacks it is common to employ a combination of multiple systems in order to prevent attacks from happening or to detect and stop ongoing attacks if they cannot be prevented initially.\nThese systems usually comprise an intrusion prevention system such as a firewall as the first layer of security, with intrusion detection systems representing the second layer.\nShould the intrusion prevention system be unable to prevent a network attack, it is the task of the detection system to identify malicious network traffic in order to stop the ongoing attack and keep the recorded network traffic data for later analysis. This data can subsequently be used to update the prevention system to allow for the detection of the specific network attack in the future. The need for intrusion detection systems is rising as absolute prevention against attacks is not possible due to the rapid emergence of new attack types.\n\nEven though intrusion detection systems are an essential part of network security, many detection systems deployed today have a significant weakness: they rely on signature-based attack classification, which detects the most common known attack patterns but cannot detect novel attack types.\nTo overcome this limitation, research in intrusion detection systems is focusing on more dynamic approaches based on machine learning and anomaly detection methods. 
In these systems the normal network behaviour is learned by processing previously recorded benign data packets, which allows the system to identify new attack types by analyzing network traffic for anomalous data flows.\n\nThis project aims to implement a classifier capable of identifying network traffic as either benign or malicious based on machine learning and deep learning methodologies.\n\n## Data\nThe data used to train the classifier is taken from the [CSE-CIC-IDS2018](https://www.unb.ca/cic/datasets/ids-2018.html) dataset provided by the Canadian Institute for Cybersecurity. It was created by capturing all network traffic during ten days of operation inside a controlled network environment on AWS where realistic background traffic was generated and different attack scenarios were conducted.\nAs a result the dataset contains both benign network traffic and captures of the most common network attacks.\nThe dataset comprises the raw network captures in pcap format as well as csv files created by using [CICFlowMeter-V3](https://www.unb.ca/cic/research/applications.html#CICFlowMeter) containing 80 statistical features of the individual network flows combined with their corresponding labels.\nA network flow is defined as an aggregation of interrelated network packets identified by the following properties:\n* Source IP\n* Destination IP\n* Source port\n* Destination port\n* Protocol\n\nThe dataset contains approximately 16 million individual network flows and covers the following attack scenarios:\n* Brute Force\n* DoS\n* DDoS\n* Heartbleed\n* Web Attack\n* Infiltration\n* Botnet\n\n## Approach\nThe goal of this project is to create a classifier capable of categorising network flows as either benign or malicious.\nThe problem is understood as a supervised learning problem using the labels provided in the dataset which identify the network flows as either benign or malicious. 
Different approaches to classifying the data will be evaluated, formulating the problem either as a binary classification problem or as a multiclass classification problem that, in the latter case, differentiates between the individual classes of attacks provided in the dataset. A relevant subset of the features provided in the dataset will be used as predictors to classify individual network flows.\nMachine learning methods like k-nearest neighbours, random forest or SVM will be applied to the problem and evaluated in the first step in order to assess the feasibility of using traditional machine learning approaches.\nSubsequently deep learning models like convolutional neural networks, autoencoders or recurrent neural networks will be employed to create a competing classifier as recent research has shown that deep learning methods represent a promising application in the field of anomaly detection.\nThe results of both approaches will be compared to select the best performing classifier.\n\n## Deliverables\nThe classifier will be deployed and served via a REST API in conjunction with a simple web application providing a user interface to utilize the API.\n\nThe REST API will provide the following functionality:\n* an endpoint to submit network capture files in pcap format. Individual network flows are extracted from the capture files and analysed for malicious network traffic.\n* (optional) an endpoint to stream continuous network traffic captures which are analysed in near real-time combined with\n* (optional) an endpoint to register a web-socket in order to get notified upon detection of malicious network traffic.\n\nTo further showcase the project, a testbed could be created against which various attack scenarios can be performed. 
This testbed would be connected to the streaming API for near real-time detection of malicious network traffic.\n\n## Computational resources\nThe requirements regarding the computational resources to train the classifiers are given below:\n\n| Category      | Resource      |\n| ------------- | ------------- |\n| CPU | Intel Core i7 processor |\n| RAM | 32 GB                   |\n| GPU | 1 GPU, 8 GB RAM         |\n| HDD | 100 GB                  |\n\n\n## Classifier\n\nThe machine learning estimator created in this project follows a supervised approach and is trained using the [Gradient Boosting](https://en.wikipedia.org/wiki/Gradient_boosting) algorithm. Employing the [CatBoost](https://catboost.ai/) library a binary classifier is created, capable of classifying network flows as either benign or malicious. The chosen parameters of the classifier and its performance metrics can be examined in the following [notebook](https://github.com/cstub/ml-ids/blob/master/notebooks/07_binary_classifier_comparison/binary-classifier-comparison.ipynb).     \n\n## Deployment Architecture\n\nThe deployment architecture of the complete ML-IDS system is explained in detail in the [system architecture](https://docs.google.com/document/d/1s_EBMTid4gdrsQU_xOCAYK1BzxkhhnYl6wHFSZo_9Tw/edit?usp=sharing).\n\n## Model Training and Deployment\n\nThe model can be trained and deployed either locally or via [Amazon SageMaker](https://aws.amazon.com/sagemaker/).     
\nIn each case the [MLflow](https://www.mlflow.org/docs/latest/index.html) framework is utilized to train the model and create the model artifacts.\n\n### Installation\n\nTo install the necessary dependencies, check out the project and create a new Anaconda environment from the environment.yml file.\n\n```\nconda env create -f environment.yml\n```\n\nAfterwards activate the environment and install the project resources.\n\n```\nconda activate ml-ids\n\npip install -e .\n```\n\n### Dataset Creation\n\nTo create the dataset for training use the following command:\n\n```\nmake split_dataset \\\n  DATASET_PATH={path-to-source-dataset}\n```\n\nThis command will read the source dataset and split the dataset into separate train/validation/test sets with a sample ratio of 80%/10%/10%. The specified source dataset should be a folder containing multiple `.csv` files.    \nYou can use the [CIC-IDS-2018 dataset](https://www.unb.ca/cic/datasets/ids-2018.html) provided via [Google Drive](https://drive.google.com/open?id=1HrTPh0YRSZ4T9DLa_c47lubheKUcPl0r) for this purpose.    
\nOnce the command completes, a new folder `dataset` is created that contains the split datasets in `.h5` format.\n\n### Local Mode\n\nTo train the model in local mode, using the default parameters and dataset locations created by `split_dataset`, use the following command:\n\n```\nmake train_local\n```\n\nIf the datasets are stored in a different location or you want to specify different training parameters, you can optionally supply the dataset locations and a training parameter file:\n\n```\nmake train_local \\\n  TRAIN_PATH={path-to-train-dataset} \\\n  VAL_PATH={path-to-val-dataset} \\\n  TEST_PATH={path-to-test-dataset} \\\n  TRAIN_PARAM_PATH={path-to-param-file}\n```\n\nUpon completion of the training process the model artifacts can be found in the `build/models/gradient_boost` directory.\n\nTo deploy the model locally the MLflow CLI can be used.\n\n```\nmlflow models serve -m build/models/gradient_boost -p 5000\n```\n\nThe model can also be deployed as a Docker container using the following commands:\n\n```\nmlflow models build-docker -m build/models/gradient_boost -n ml-ids-classifier:1.0\n\ndocker run -p 5001:8080 ml-ids-classifier:1.0\n```\n\n### Amazon SageMaker\n\nTo train the model on Amazon SageMaker the following command sequence is used:\n\n```\n# build a new docker container for model training\nmake sagemaker_build_image \\\n  TAG=1.0\n\n# upload the container to AWS ECR\nmake sagemaker_push_image \\\n  TAG=1.0\n\n# execute the training container on Amazon SageMaker\nmake sagemaker_train_aws \\\n  SAGEMAKER_IMAGE_NAME={ecr-image-name}:1.0 \\\n  JOB_ID=ml-ids-job-0001\n```\n\nThese commands require a valid AWS account with the appropriate permissions to be configured locally via the [AWS CLI](https://aws.amazon.com/cli/). 
Furthermore, [AWS ECR](https://aws.amazon.com/ecr/) and Amazon SageMaker must be configured for the account.\n\nUsing this repository, the manual invocation of the aforementioned commands is not necessary as training on Amazon SageMaker is supported via a [GitHub workflow](https://github.com/cstub/ml-ids/blob/master/.github/workflows/train.yml) that is triggered upon creation of a new tag of the form `m*` (e.g. `m1.0`).\n\nTo deploy a trained model on Amazon SageMaker a [GitHub Deployment request](https://developer.github.com/v3/repos/deployments/) using the GitHub API must be issued, specifying the tag of the model.\n\n```\n{\n  \"ref\": \"refs/tags/m1.0\",\n  \"payload\": {},\n  \"description\": \"Deploy request for model version m1.0\",\n  \"auto_merge\": false\n}\n```\n\nThis deployment request triggers a [GitHub workflow](https://github.com/cstub/ml-ids/blob/master/.github/workflows/deployment.yml), deploying the model to SageMaker.\nAfter successful deployment the model is accessible via the SageMaker HTTP API.\n\n## Using the Classifier\n\nThe classifier deployed on Amazon SageMaker is not directly available publicly, but can be accessed using the [ML-IDS REST API](https://github.com/cstub/ml-ids-api).  
\n\n### REST API\n\nTo invoke the REST API the following command can be used to submit a prediction request for a given network flow:\n\n```\ncurl -X POST \\\n  http://ml-ids-cluster-lb-1096011980.eu-west-1.elb.amazonaws.com/api/predictions \\\n  -H 'Accept: */*' \\\n  -H 'Content-Type: application/json; format=pandas-split' \\\n  -H 'Host: ml-ids-cluster-lb-1096011980.eu-west-1.elb.amazonaws.com' \\\n  -H 'cache-control: no-cache' \\\n  -d '{\"columns\":[\"dst_port\",\"protocol\",\"timestamp\",\"flow_duration\",\"tot_fwd_pkts\",\"tot_bwd_pkts\",\"totlen_fwd_pkts\",\"totlen_bwd_pkts\",\"fwd_pkt_len_max\",\"fwd_pkt_len_min\",\"fwd_pkt_len_mean\",\"fwd_pkt_len_std\",\"bwd_pkt_len_max\",\"bwd_pkt_len_min\",\"bwd_pkt_len_mean\",\"bwd_pkt_len_std\",\"flow_byts_s\",\"flow_pkts_s\",\"flow_iat_mean\",\"flow_iat_std\",\"flow_iat_max\",\"flow_iat_min\",\"fwd_iat_tot\",\"fwd_iat_mean\",\"fwd_iat_std\",\"fwd_iat_max\",\"fwd_iat_min\",\"bwd_iat_tot\",\"bwd_iat_mean\",\"bwd_iat_std\",\"bwd_iat_max\",\"bwd_iat_min\",\"fwd_psh_flags\",\"bwd_psh_flags\",\"fwd_urg_flags\",\"bwd_urg_flags\",\"fwd_header_len\",\"bwd_header_len\",\"fwd_pkts_s\",\"bwd_pkts_s\",\"pkt_len_min\",\"pkt_len_max\",\"pkt_len_mean\",\"pkt_len_std\",\"pkt_len_var\",\"fin_flag_cnt\",\"syn_flag_cnt\",\"rst_flag_cnt\",\"psh_flag_cnt\",\"ack_flag_cnt\",\"urg_flag_cnt\",\"cwe_flag_count\",\"ece_flag_cnt\",\"down_up_ratio\",\"pkt_size_avg\",\"fwd_seg_size_avg\",\"bwd_seg_size_avg\",\"fwd_byts_b_avg\",\"fwd_pkts_b_avg\",\"fwd_blk_rate_avg\",\"bwd_byts_b_avg\",\"bwd_pkts_b_avg\",\"bwd_blk_rate_avg\",\"subflow_fwd_pkts\",\"subflow_fwd_byts\",\"subflow_bwd_pkts\",\"subflow_bwd_byts\",\"init_fwd_win_byts\",\"init_bwd_win_byts\",\"fwd_act_data_pkts\",\"fwd_seg_size_min\",\"active_mean\",\"active_std\",\"active_max\",\"active_min\",\"idle_mean\",\"idle_std\",\"idle_max\",\"idle_min\"],\"data\":[[80,17,\"21\\\\/02\\\\/2018 
10:15:06\",119759145,75837,0,2426784,0,32,32,32.0,0.0,0,0,0.0,0.0,20263.87212,633.2460039,1579.1859130859,31767.046875,920247,1,120000000,1579.1859130859,31767.046875,920247,1,0,0.0,0.0,0,0,0,0,0,0,606696,0,633.2460327148,0.0,32,32,32.0,0.0,0.0,0,0,0,0,0,0,0,0,0,32.0004234314,32.0,0.0,0,0,0,0,0,0,75837,2426784,0,0,-1,-1,75836,8,0.0,0.0,0,0,0.0,0.0,0,0]]}'\n```\n\n### ML-IDS API Clients\n\nFor convenience, the Python clients implemented in the [ML-IDS API Clients project](https://github.com/cstub/ml-ids-api-client) can be used to submit new prediction requests to the API and receive real-time notifications on detection of malicious network flows.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcstub%2Fml-ids","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcstub%2Fml-ids","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcstub%2Fml-ids/lists"}