PROJECT NOT UNDER ACTIVE MANAGEMENT

This project will no longer be maintained by Intel.

Intel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project.

Intel no longer accepts patches to this project.

If you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the open source software community, please create your own fork of this project.

Contact: webadmin@linux.intel.com

# Network Intrusion Detection

## Introduction
Cyberattacks are escalating at a staggering rate globally. Intrusion prevention systems continuously monitor network traffic for possible malicious incidents, contain detected threats, capture information about them, report that information to system administrators, and support preventative action.
With the changing patterns in network behavior, a dynamic approach is needed to detect and prevent such intrusions. A great deal of research has been devoted to this field, and it is widely accepted that static datasets do not capture real traffic compositions and interventions. A modifiable, reproducible, and extensible dataset is needed to learn from and tackle sophisticated attackers who can easily bypass basic intrusion detection systems (IDS).

The goal of this example is to use Intel® oneAPI packages and describe how we can leverage the [Intel® Distribution for Python*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-for-python.html) and [Intel® Extension for Scikit-Learn*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/scikit-learn.html) to build a Network Intrusion Detection model.

Check out more workflow examples in the [Developer Catalog](https://developer.intel.com/aireferenceimplementations).

## Solution Technical Overview
A network-based intrusion detection system (NIDS) is used to monitor and analyze network traffic to protect a system from network-based threats. A NIDS reads all inbound packets and searches for any suspicious patterns. When threats are discovered, the system can, depending on their severity, take action such as notifying administrators or barring the source IP (Internet Protocol) address from accessing the network.

The experiment aimed to build a Network Intrusion Detection System that detects network intrusions using a supervised learning algorithm. The main purpose of a NIDS is to alert a system administrator each time an intruder tries to access the network.
The goal is to train a model to classify the input data as benign, malicious, or outlier.

The solution contained in this repo uses the following Intel® packages:

* ***Intel® Distribution for Python\****

	The [Intel® Distribution for Python*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-for-python.html) provides:

	* Scalable performance using all available CPU cores on laptops, desktops, and powerful servers
	* Support for the latest CPU instructions
	* Near-native performance through acceleration of core numerical and machine learning packages with libraries like the Intel® oneAPI Math Kernel Library (oneMKL) and Intel® oneAPI Data Analytics Library
	* Productivity tools for compiling Python* code into optimized instructions
	* Essential Python* bindings for easing integration of Intel® native tools with your Python* project

* ***Intel® Extension for Scikit-Learn\****

	Using Scikit-Learn* with this extension, you can:

	* Speed up training and inference by up to 100x while maintaining equivalent mathematical accuracy.
	* Continue to use the open source Scikit-Learn* API.
	* Enable and disable the extension with a couple of lines of code or at the command line.

For more details, visit [Intel® Distribution for Python*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-for-python.html), [Intel® Extension for Scikit-Learn*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/scikit-learn.html), and the [Network Intrusion Detection](https://github.com/oneapi-src/network-intrusion-detection) GitHub repository.

## Solution Technical Details
As classification analysis is an exploratory task, an analyst will often run it on datasets of different sizes, producing different insights that they may use for decisions, all from the same raw dataset. The algorithm used for classification is the nu-support vector classifier (NuSVC).
NuSVC is similar to the support vector classifier (SVC), the only difference being that the NuSVC classifier has a *nu* parameter to control the number of support vectors. For training, we pass in 70% of the dataset, while the remaining 30% is used for batch inference.

The reference kit implementation is a reference solution to the described use case that includes:

  * An optimized reference end-to-end (E2E) architecture enabled with Intel® Extension for Scikit-learn*, available as part of the Intel® oneAPI AI toolkit optimizations

### Expected Input-Output

| **Input** | **Output** |
| :---: | :---: |
| Telemetry data records | For each type of intrusion (malignant, benign, outlier) $d$, the probability [0, 1] of the intrusion $d$ |

| **Example Input** | **Example Output** |
| :---: | :---: |
| Values for avg_ipt, bytes_in, bytes_out, dest_ip, dest_port, entropy, num_pkts_out, num_pkts_in, proto, src_ip, src_port, time_end, time_start, total_entropy, label, duration | {'Malignant': 0.778, 'Benign': 0.023, 'Outlier': 0.176} |

### Hyper-parameter Analysis
In realistic scenarios, an analyst will run the same classification algorithm multiple times on the same dataset, scanning across different hyper-parameters. To capture this, we measure the total amount of time it takes to generate classification results (F1-score) while looping over hyper-parameters for a fixed algorithm, which we define as hyper-parameter analysis.
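The 70/30 train-versus-batch-inference split and the NuSVC probability output described above can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic stand-in data, not the reference kit's actual `run_benchmarks.py` (which reads the LUFlow CSV); the feature values and class counts here are invented for demonstration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import NuSVC

# To enable Intel(R) Extension for Scikit-learn*, the reference kit would call:
#   from sklearnex import patch_sklearn; patch_sklearn()
# before the sklearn imports. Omitted here so the sketch runs on stock scikit-learn.

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))       # synthetic stand-in for the telemetry features
y = rng.integers(0, 3, size=300)    # 0 = benign, 1 = malicious, 2 = outlier

# 70% of the data for training, the remaining 30% for batch inference.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42)

# nu bounds the fraction of training errors and of support vectors.
model = NuSVC(nu=0.5, probability=True, random_state=42)
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)  # one [0, 1] probability per class; rows sum to 1
print(proba.shape)                   # (90, 3)
```

Setting `probability=True` is what yields the per-class probabilities shown in the example output table; it fits an internal calibration on top of the NuSVC decision function.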
In practice, the results of each hyper-parameter analysis provide the analyst with many different models that they can take and further analyze.

#### <a name="use-case-flow"></a>Optimized E2E architecture with Intel® oneAPI components
![Use_case_flow](assets/e2e_flow_optimized.png)

### Dataset

This reference kit is implemented to demonstrate an experiment with the LUFlow dataset from Kaggle*, which can be found at https://www.kaggle.com/datasets/mryanm/luflow-network-intrusion-detection-data-set (the 2021.02.17.csv file is downloaded, saved to the data folder, and used as the dataset in this reference kit).

LUFlow is a flow-based intrusion detection data set which contains telemetry of emerging attacks. Flows that could not be determined to be malicious but are not part of the normal telemetry profile are labelled as outliers.

Each row in the data set has values for:

| Name | Description |
| --- | --- |
| src_ip | The source IP address associated with the flow. This feature is anonymised to the corresponding Autonomous System |
| src_port | The source port number associated with the flow |
| dest_ip | The destination IP address associated with the flow. This feature is anonymised in the same manner as before |
| dest_port | The destination port number associated with the flow |
| protocol | The protocol number associated with the flow. For example, TCP is 6 |
| bytes_in | The number of bytes transmitted from source to destination |
| bytes_out | The number of bytes transmitted from destination to source |
| num_pkts_in | The packet count from source to destination |
| num_pkts_out | The packet count from destination to source |
| entropy | The entropy in bits per byte of the data fields within the flow. This number ranges from 0 to 8 |
| total_entropy | The total entropy in bytes over all of the bytes in the data fields of the flow |
| mean_ipt | The mean of the inter-packet arrival times of the flow |
| time_start | The start time of the flow in seconds since the epoch |
| time_end | The end time of the flow in seconds since the epoch |
| duration | The flow duration time, with microsecond precision |
| label | The label of the flow, as decided by Tangerine. Either benign, outlier, or malicious |

Based on these features, the Network Intrusion Detection System has been built to identify the type of intrusion. Rows with empty columns were deleted from the initial CSV file. Instructions for downloading the data can be found in the [Download the Dataset](#download-the-dataset) section.

> *Please see this data set's applicable license for terms and conditions. Intel® Corporation does not own the rights to this data set and does not confer any rights to it.*

## Validated Hardware Details
There are workflow-specific hardware and software setup requirements to run this use case.

| Recommended Hardware
| ----------------------------
| CPU: Intel® 2nd Gen Xeon® Platinum 8280 CPU @ 2.70 GHz or higher
| RAM: 187 GB
| Recommended Free Disk Space: 20 GB or more

Operating System: Ubuntu* 22.04 LTS.

## How it Works
As mentioned above, this Network Intrusion Detection System uses NuSVC from the Scikit-Learn* library to train an artificial intelligence (AI) model and generate labels by classification for the input data.

The use case can be summarized in three steps:
* Read and preprocess the data
* Perform training and predictions
* Hyperparameter tuning analysis

## Get Started
Start by **defining an environment variable** that will store the workspace path; this can be an existing directory or one to be created in further steps.
This environment variable will be used for all commands executed using absolute paths.

[//]: # (capture: baremetal)
```bash
export WORKSPACE=$PWD/network-intrusion-detection
```

Define `DATA_DIR` and `OUTPUT_DIR`.

[//]: # (capture: baremetal)
```bash
export DATA_DIR=$WORKSPACE/data
export OUTPUT_DIR=$WORKSPACE/output
```

### Download the Workflow Repository
Create a working directory for the workflow and clone the [Main Repository](https://github.com/oneapi-src/network-intrusion-detection) into your working directory.

[//]: # (capture: baremetal)
```bash
mkdir -p $WORKSPACE && cd $WORKSPACE
```

```bash
git clone https://github.com/oneapi-src/network-intrusion-detection $WORKSPACE
```

### Set Up Conda
To learn more, please visit [install anaconda on Linux](https://docs.anaconda.com/free/anaconda/install/linux/).

```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
```

### Set Up Environment
Install the libmamba solver and set it as the default solver. Run the following commands:

```bash
conda install -n base conda-libmamba-solver -y
conda config --set solver libmamba
```

The [env/intel_env.yml](./env/intel_env.yml) file contains all dependencies to create the Intel® environment.

| **Packages required in YAML file**| **Version**
| :---                              | :--
| python                            | 3.10
| intelpython3_full                 | 2024.0.0
| pandas                            | 2.1.3

Execute the next command to create the conda environment.

```bash
conda env create -f $WORKSPACE/env/intel_env.yml
```

During this setup, the `intrusion_detection_intel` conda environment will be created with the dependencies listed in the YAML configuration.
Use the following command to activate the environment created above:

```bash
conda activate intrusion_detection_intel
```

### Download the Dataset
To set up the data for running the workflow, do the following:

1. Install the [Kaggle\* API](https://github.com/Kaggle/kaggle-api) and configure your [credentials](https://github.com/Kaggle/kaggle-api#api-credentials) and [proxies](https://github.com/Kaggle/kaggle-api#set-a-configuration-value).

2. Download the data from https://www.kaggle.com/datasets/mryanm/luflow-network-intrusion-detection-data-set and save it to the data directory.

    ```bash
    cd $DATA_DIR
    kaggle datasets download -d mryanm/luflow-network-intrusion-detection-data-set
    ```

3. Extract the `2021.02.17.csv` file into the data directory.

    ```bash
    unzip -p luflow-network-intrusion-detection-data-set.zip "*/2021.02.17.csv" > 2021.02.17.csv
    ```

4. Remove the `luflow-network-intrusion-detection-data-set.zip` file from the data directory and return to the workspace path.

    ```bash
    rm luflow-network-intrusion-detection-data-set.zip
    cd $WORKSPACE
    ```

## Supported Runtime Environment
You can execute the reference pipelines using the following environments:
* Bare Metal

### Run Using Bare Metal
Follow these instructions to set up and run this workflow on your own development system.

#### Set Up System Software
Our examples use the ``conda`` package and environment on your local computer.
If you don't already have ``conda`` installed, go to [Set Up Conda](#set-up-conda) or see the [Conda* Linux installation instructions](https://docs.conda.io/projects/conda/en/stable/user-guide/install/linux.html).

#### Run Workflow
Once we create and activate the `intrusion_detection_intel` environment, we can run the next steps.

##### Dataset Preprocessing

To remove the rows with empty values from the downloaded CSV file, run the script below:

```shell
python src/data_prep.py -i inputfile [-o outputfile]
```

An example of using the above script:

[//]: # (capture: baremetal)
```shell
python $WORKSPACE/src/data_prep.py -i $DATA_DIR/2021.02.17.csv \
    -o $DATA_DIR/data.csv
```

##### Model building process with Intel® optimizations

As mentioned above, this Network Intrusion Detection System uses NuSVC from the Scikit-Learn* library to train an AI model and generate labels by classification for the input data. This process is captured within the `run_benchmarks.py` script. This script *reads and preprocesses the data*, and *performs training, predictions, and hyperparameter tuning analysis on NuSVC*, while also reporting the execution time of each of these steps. The script can also save each of the intermediate models for an in-depth analysis of the quality of fit.
The script takes the following arguments:

```shell
usage: src/run_benchmarks.py [-l LOGFILE] [--hptune] [-a {svc,nusvc,lr}]
    [-d DATASETSIZE] [-c CSVPATH] [-s SAVE_MODEL_DIR]

optional arguments:
  -l LOGFILE, --logfile LOGFILE
                        log file to output benchmarking results to (default: None)
  --hptune              activate hyper parameter tuning (default: False)
  -a {svc,nusvc,lr}, --algo {svc,nusvc,lr}
                        name of the algorithm to be used (default: svc)
  -d DATASETSIZE, --datasetsize DATASETSIZE
                        size of the dataset (default: 10000)
  -c CSVPATH, --csvpath CSVPATH
                        path to input csv (default: data/data.csv)
  -s SAVE_MODEL_DIR, --save_model_dir SAVE_MODEL_DIR
                        directory to save model to (default: models/)
```

As an example, we can run the following command to train and save a NuSVC model with the Intel® Distribution for Python* and Intel® technologies for a data size of 300K:

[//]: # (capture: baremetal)
```shell
python $WORKSPACE/src/run_benchmarks.py -d 300000 --algo nusvc -c $DATA_DIR/data.csv \
    -s $OUTPUT_DIR/models
```

In a realistic pipeline, this training process would follow the [Optimized E2E architecture](#use-case-flow), adding a human in the loop to judge the quality of the classification solution from each of the saved models/predictions in the `saved_models` directory, or better, while tuning the model.
The quality of a classification solution is highly dependent on the human analyst, who can not only tune hyper-parameters but also modify the features being used to find better solutions.

##### Running Classification Analysis/Predictions
The `inference.py` script performs predictions and takes the following arguments:

```bash
usage: src/inference.py [-h] [-l LOGFILE] [-c CSVPATH] -m MODELPATH [-d DATASETSIZE]

optional arguments:
  -l LOGFILE, --logfile LOGFILE
                        log file to output benchmarking results to (default: None)
  -c CSVPATH, --csvpath CSVPATH
                        path to input csv file (default: data/data.csv)
  -m MODELPATH, --modelpath MODELPATH
                        saved model path (default: None)
  -d DATASETSIZE, --datasetsize DATASETSIZE
                        size of the dataset (default: 10000)
```

To run the batch and real-time inference (using the saved model trained before), we would run:

[//]: # (capture: baremetal)
```shell
python $WORKSPACE/src/inference.py --modelpath $OUTPUT_DIR/models/NuSVC_model.sav \
    -c $DATA_DIR/data.csv -d 10000
```

##### Hyperparameter tuning

***Loop-Based Hyperparameter Tuning***: The fit method is applied repeatedly with different parameter values in a loop to find the best silhouette score and thereby a better-performing model.

The silhouette score is a metric that measures how well each data point fits into its predicted cluster.
This measure has a range of [-1, 1]:

* 1: Clusters are well apart from each other and clearly distinguished.

* 0: Clusters are indifferent, or the distance between clusters is not significant.

* -1: Clusters are assigned in the wrong way.

**Parameters Considered**
| **Parameter** | **Description** | **Values**
| :-- | :-- | :--
| `kernel` | Kernel type | rbf, poly
| `gamma` | Gamma value | 1e-4

To execute hyperparameter tuning, we would run:

[//]: # (capture: baremetal)
```shell
python $WORKSPACE/src/run_benchmarks.py --hptune -d 300000 --algo nusvc -c $DATA_DIR/data.csv \
    -s $OUTPUT_DIR/models
```

To run the batch and real-time inference (using the model saved above, created with hyperparameter tuning), we would run:

[//]: # (capture: baremetal)
```shell
python $WORKSPACE/src/inference.py --modelpath $OUTPUT_DIR/models/NUSVC_model_hp.sav \
    -c $DATA_DIR/data.csv -d 10000
```

#### Clean Up Bare Metal
Follow these steps to restore your `$WORKSPACE` directory to its initial state. Please note that all downloaded dataset files, the conda environment, and logs created by the workflow will be deleted. Back up your important files before executing the next steps.

```bash
# activate base environment
conda activate base
# delete conda environment created
conda env remove -n intrusion_detection_intel
```

[//]: # (capture: baremetal)
```bash
# delete all data generated
rm $DATA_DIR/data.csv
rm -rf $DATA_DIR/2021.02.17.csv
# delete all outputs generated
rm -rf $OUTPUT_DIR
```

### Expected Output
The `run_benchmarks.py` outputs are the input data row count, dataset size, data preprocessing time, and training time.
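The loop-based hyperparameter tuning described above can be sketched as follows: fit a model per candidate parameter value, score each predicted labelling with the silhouette score, and keep the best. This is a hypothetical illustration on synthetic blob data using the `kernel` and `gamma` values from the parameters table; the kit's actual `run_benchmarks.py` differs in its data and details.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.svm import NuSVC

# Synthetic stand-in for the telemetry data: three separable groups of points.
X, y = make_blobs(n_samples=300, centers=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42)

best_kernel, best_score = None, -2.0
for kernel in ("rbf", "poly"):      # candidate values from the parameters table
    model = NuSVC(kernel=kernel, gamma=1e-4).fit(X_train, y_train)
    labels = model.predict(X_test)
    # Silhouette score lies in [-1, 1]; it needs at least two distinct labels.
    score = silhouette_score(X_test, labels) if len(set(labels)) > 1 else -1.0
    if score > best_score:
        best_kernel, best_score = kernel, score

print(f"best kernel: {best_kernel}, silhouette score: {best_score:.3f}")
```

Each loop iteration refits the model from scratch, so the total tuning time grows with the number of candidate parameter values, which is exactly what the hyper-parameter analysis benchmark measures.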
For example, training the NuSVC model for a data size of 300K should return results similar to those shown below:

```bash
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
INFO:__main__:Loading intel libraries...
INFO:__main__:Input data rows: 592589
INFO:__main__:Dataset rows: 300000
INFO:__main__:data prep time is ----> 0.928395 secs
INFO:__main__:Training without HP tuning
INFO:__main__:Training with NuSVC
INFO:__main__:NUSVC training time w/o hp tuning is ----> 25.118885 secs
```

The `inference.py` outputs are the input data row count, dataset size, batch prediction time, and classification report:

```bash
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
INFO:__main__:Input data rows: 592589
INFO:__main__:Dataset rows: 10000
INFO:__main__:Batch Prediction time is ----> 0.168146 secs
INFO:__main__:Classification report 
              precision    recall  f1-score   support

      benign       0.08      0.99      0.14       526
   malicious       0.61      0.31      0.41      4811
     outlier       0.62      0.09      0.16      4663

    accuracy                           0.24     10000
   macro avg       0.44      0.46      0.24     10000
weighted avg       0.59      0.24      0.28     10000


INFO:__main__:Average Real Time inference time taken ---> 0.004962 secs
```

Running `run_benchmarks.py` with hyperparameter tuning with NuSVC for a data size of 300K, the expected outputs are:

```bash
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
INFO:__main__:Loading intel libraries...
INFO:__main__:Input data rows: 592589
INFO:__main__:Dataset rows: 300000
INFO:__main__:data prep time is ----> 0.928749 secs
INFO:__main__:Training with HP tuning
INFO:__main__:Training with NuSVC
Fitting 2 folds for each of 2 candidates, totalling 4 fits
[CV 2/2; 1/2] START gamma=0.0001, kernel=rbf....................................
[CV 1/2; 1/2] START gamma=0.0001, kernel=rbf....................................
[CV 1/2; 2/2] START gamma=0.0001, kernel=poly...................................
[CV 2/2; 2/2] START gamma=0.0001, kernel=poly...................................
[CV 1/2; 2/2] END ....gamma=0.0001, kernel=poly;, score=0.626 total time=   8.6s
[CV 2/2; 2/2] END ....gamma=0.0001, kernel=poly;, score=0.516 total time=   8.8s
[CV 2/2; 1/2] END .....gamma=0.0001, kernel=rbf;, score=0.782 total time=  11.1s
[CV 1/2; 1/2] END .....gamma=0.0001, kernel=rbf;, score=0.746 total time=  11.2s
INFO:__main__:Best params {'gamma': 0.0001, 'kernel': 'rbf'}
INFO:__main__:Best score 0.764057
INFO:__main__:NUSVC training time is ----> 32.227316 secs
INFO:__main__:NUSVC training time with best params is---------> 17.639800 secs
```

Running the inference using the saved model created with hyperparameter tuning should return results similar to those shown below:

```bash
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
INFO:__main__:Input data rows: 592589
INFO:__main__:Dataset rows: 10000
INFO:__main__:Batch Prediction time is ----> 0.152508 secs
INFO:__main__:Classification report 
              precision    recall  f1-score   support

      benign       0.05      0.99      0.10       526
   malicious       0.60      0.04      0.07      4811
     outlier       0.00      0.00      0.00      4663

    accuracy                           0.07     10000
   macro avg       0.22      0.34      0.06     10000
weighted avg       0.29      0.07      0.04     10000


INFO:__main__:Average Real Time inference time taken ---> 0.005621 secs
```

The machine learning models will be saved in ``$OUTPUT_DIR/models``:

```bash
NUSVC_model_hp.sav
NuSVC_model.sav
```

## Summary and Next Steps
We investigate the amount of time taken to perform hyper-parameter analysis over a combination of gamma (1e-4) and kernels (rbf, poly).

As classification analysis is an exploratory task, an analyst will often run it on datasets of different sizes, producing different insights that they may use for decisions, all from the same raw dataset.

To demonstrate the scaling of Intel® Extension for Scikit-learn*, we benchmark a full classification analysis using the 300k dataset size for training. The inference benchmark is run on the NuSVC model trained with the 300k dataset, using real-time inference and a batch size of 25k.

To build a Network Intrusion Detection System, data scientists need to train models on substantial datasets and run inference frequently. The ability to accelerate training allows them to train more frequently and achieve a better F1-score. Beyond training, faster inference allows them to provide Network Intrusion Detection in real-time scenarios as well as more frequently.
This reference kit implementation provides a performance-optimized guide for Network Intrusion Detection use cases that can be easily scaled across similar use cases.

## Learn More
For more information or to read about other relevant workflow examples, see these guides and software resources:

- [Intel® AI Analytics Toolkit (AI Kit)](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-analytics-toolkit.html)
- [Intel® Distribution for Python*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-for-python.html)
- [Intel® Distribution of Modin*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-of-modin.html)
- [Intel® Extension for Scikit-Learn*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/scikit-learn.html)

## Support
If you have questions or issues about this use case, want help with troubleshooting, or want to report a bug or submit an enhancement request, please submit a GitHub issue.

## Appendix
\*Names and brands that may be claimed as the property of others. [Trademarks](https://www.intel.com/content/www/us/en/legal/trademarks.html).

### Disclaimers
To the extent that any public or non-Intel datasets or models are referenced by or accessed using tools or code on this site, those datasets or models are provided by the third party indicated as the content source. Intel does not create the content and does not warrant its accuracy or quality. By accessing the public content, or using materials trained on or with such content, you agree to the terms associated with that content and that your use complies with the applicable license.

Intel expressly disclaims the accuracy, adequacy, or completeness of any such public content, and is not liable for any errors, omissions, or defects in the content, or for any reliance on the content. Intel is not liable for any liability or damages relating to your use of public content.