{"id":13337957,"url":"https://github.com/Timeless-zfqi/AS-DMF-framework","last_synced_at":"2025-03-11T08:32:07.559Z","repository":{"id":39653384,"uuid":"497316449","full_name":"Timeless-zfqi/AS-DMF-framework","owner":"Timeless-zfqi","description":"AS-DMF framework guide","archived":false,"fork":false,"pushed_at":"2022-08-01T10:03:07.000Z","size":623,"stargazers_count":2,"open_issues_count":3,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-10-23T20:11:28.275Z","etag":null,"topics":["encrypted-traffic-analysis","feature-reduction","feature-selection","lightweight","malware","python3","stacking-classifier","tls","wireshark","zat","zeek"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Timeless-zfqi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-05-28T12:49:54.000Z","updated_at":"2024-10-23T07:44:29.000Z","dependencies_parsed_at":"2022-09-20T06:50:51.263Z","dependency_job_id":null,"html_url":"https://github.com/Timeless-zfqi/AS-DMF-framework","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Timeless-zfqi%2FAS-DMF-framework","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Timeless-zfqi%2FAS-DMF-framework/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Timeless-zfqi%2FAS-DMF-framework/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Timeless-zfqi%2FAS-DMF-framework/manifests","owner_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/owners/Timeless-zfqi","download_url":"https://codeload.github.com/Timeless-zfqi/AS-DMF-framework/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243000835,"owners_count":20219751,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["encrypted-traffic-analysis","feature-reduction","feature-selection","lightweight","malware","python3","stacking-classifier","tls","wireshark","zat","zeek"],"created_at":"2024-07-29T19:15:16.471Z","updated_at":"2025-03-11T08:32:07.245Z","avatar_url":"https://github.com/Timeless-zfqi.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AS-DMF: A Lightweight TLS encrypted traffic detection framework\nAuthors:   \n## Contents\n- [Introduction](#Introduction)\n- [Setup](#Setup)\n- [Dataset and feature extraction](#Dataset-and-feature-extraction)\n- [Feature selection mechanism](#Feature-selection-mechanism)\n- [DMF classifier](#DMF-classifier)\n- [Query and training](#Query-and-training) \n- [Acknowledgement](#Acknowledgement) \n\n## Introduction  \nOur project combines active learning and feature selection to achieve lightweight detection of TLS encrypted malicious traffic. The aim is to keep the framework lightweight in both the data and feature dimensions.  
\n__Modules of the AS-DMF framework include:__\n* __Data pre-processing and feature extraction__.\nThis module pre-processes the captured pcap files, performs preliminary feature extraction, and selects the TLS encrypted flows to form the initial sample set.\n* __Feature selection mechanism__.\nThis module performs feature selection and is used to study feature-level lightweighting.\n* __DMF classifier__.\nThe DMF classifier is the model trained on the queried samples in the AS-DMF framework.\n* __Query and training__.\nThis module is the query and training process of AS-DMF. It uses a pool-based active learning framework with specific query strategies to query and label informative and representative instances, then trains the DMF classifier on the labeled instances.\n![DFM](https://github.com/Timeless-zfqi/AS-DMF-framework/blob/main/Figure/Framework.jpg)\n\n## Setup\nBefore you use this project, you must configure the following environment.  \n1. Requirements\n```\npython \u003e= 3.7\nlinux \u003e= Ubuntu 20.04\nzeek(LTS) \u003e= 4.0\nwireshark\n```  \n2. Basic Dependencies\n```\nscikit-learn\nzat\nzeek-flowmeter\nalipy\n```  \n3. Others  \nFor other packages used in the experiment, please refer to _impot.txt_\n## Dataset and feature extraction\nYou can run this module in _Data pre-processing.ipynb_. Details are shown below:   \n\n1. Dataset  \nWe use the open-source [CTU-13](https://www.stratosphereips.org/datasets-ctu13 \"CTU-13\") botnet dataset.\n\n2. How to merge pcap files?  \nYou need to execute the following commands from the command line:\n```\n\u003ecd wireshark\n\u003emergecap -w target_path/normal.pcap source_path/CTU-Normal/*.pcap\n```\n3. Initial feature extraction in zeek  \n```\nzeek flowmeter -C -r target pcap path/*.pcap (.pcapng is also accepted)\n```\n4. To Python  \nImport the extracted features into Python with zat and filter the TLS encrypted flows.  
\n\n## Feature selection mechanism\nUse ANOVA and MIC (maximal information coefficient) to rank the features and keep as many as you need.  \nYou can run this module in _feature selection mechanism.ipynb_.  \n```python\nfrom numpy import array\nfrom sklearn.feature_selection import SelectKBest\n\n# `mic`, `x_de` and `y` are assumed to be defined earlier in the notebook\nn = number  # number of features to keep\nX_selection = SelectKBest(lambda X, Y: tuple(map(tuple,array(list(map(lambda x:mic(x, Y), \n                    X.T))).T)),k=n).fit_transform(x_de,y)\n```\n## DMF classifier  \n### Structure  \nBased on the characteristics of the extracted features, we design a Random Forest classifier, an XGBoost classifier and a Gaussian Naive Bayes classifier. The three classifiers are combined with a stacking strategy to form the DMF classifier; the second-layer (meta) model is logistic regression.  \n\u003cdiv align=\"center\"\u003e\n\u003cimg src=https://github.com/Timeless-zfqi/AS-DMF-framework/blob/main/Figure/stacking.jpg width=50% /\u003e\n\u003c/div\u003e  \n  \n### Implement your own algorithm  \nThe DMF classifier places no restrictions on your implementation. The only requirement is that every model in {pipe1, pipe2, pipe3, meta_classifier} can output probabilities.  \n```python\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.naive_bayes import GaussianNB\nfrom sklearn.pipeline import make_pipeline\nfrom xgboost import XGBClassifier\n# ColumnSelector and StackingClassifier are used here with the mlxtend API\nfrom mlxtend.classifier import StackingClassifier\nfrom mlxtend.feature_selection import ColumnSelector\n\ndef model(num):\n    sclf = RandomForestClassifier(max_depth=12,n_estimators=100,oob_score=True,n_jobs=-1)\n    sxgb = XGBClassifier(eval_metric=['logloss','auc','error'],max_depth=12,n_estimators=120,n_jobs=-1)\n    sgnb = GaussianNB()\n    # Each base model sees the first `num` feature columns\n    pipe1 = make_pipeline(ColumnSelector(cols=range(num)),sclf)\n    pipe2 = make_pipeline(ColumnSelector(cols=range(num)),sxgb)\n    pipe3 = make_pipeline(ColumnSelector(cols=range(num)),sgnb)\n\n    stack = StackingClassifier(classifiers=[pipe1,pipe2,pipe3], meta_classifier=LogisticRegression(solver=\"lbfgs\"))\n    return stack\n```  \n## Query and training  \nAfter building the model, you can quickly set up an AS-DMF query framework using the ToolBox class in the ALiPy package. The framework uses a pool-based active learning approach and a specific query strategy for querying, labeling and training. 
You need to prepare a labeled training set L and a large pool of unlabeled samples U in advance. You can choose the sizes of L and U yourself.  \n```python\nalibox.split_AL(test_ratio=0.3, initial_label_rate=0.001, split_count=10)\n```\nALiPy provides a variety of query strategies, and you can also combine them or design new ones to suit your needs. Taking QBC (query by committee) as an example, a query loop can be implemented quickly as follows.  \n```python\nimport copy\n\nfrom alipy import ToolBox\n\nalibox = ToolBox(X=X, y=y, query_type='AllLabels', saving_path='.')\n\n# Split data\nalibox.split_AL(test_ratio=0.3, initial_label_rate=0.001, split_count=10)\n\n# Use the default Logistic Regression classifier\nmodel = alibox.get_default_model()\n\n# The cost budget is 50 queries\nstopping_criterion = alibox.get_stopping_criterion('num_of_queries', 50)\n\n# Use pre-defined strategy\nQBCStrategy = alibox.get_query_strategy(strategy_name='QueryInstanceQBC')\nQBC_result = []\n\nfor round in range(10):\n    # Get the data split of one fold experiment\n    train_idx, test_idx, label_ind, unlab_ind = alibox.get_split(round)\n    # Get intermediate results saver for one fold experiment\n    saver = alibox.get_stateio(round)\n\n    while not stopping_criterion.is_stop():\n        # Select a subset of Uind according to the query strategy\n        # Passing model=None to use the default model for evaluating the committees' disagreement\n        select_ind = QBCStrategy.select(label_ind, unlab_ind, model=None, batch_size=1)\n        label_ind.update(select_ind)\n        unlab_ind.difference_update(select_ind)\n\n        # Update model and calc performance according to the model you are using\n        model.fit(X=X[label_ind.index, :], y=y[label_ind.index])\n        pred = model.predict(X[test_idx, :])\n        accuracy = alibox.calc_performance_metric(y_true=y[test_idx],\n                                                  y_pred=pred,\n                                                  performance_metric='accuracy_score')\n\n        # Save intermediate 
results to file\n        st = alibox.State(select_index=select_ind, performance=accuracy)\n        saver.add_state(st)\n        saver.save()\n\n        # Passing the current progress to stopping criterion object\n        stopping_criterion.update_information(saver)\n    # Reset the progress in stopping criterion object\n    stopping_criterion.reset()\n    QBC_result.append(copy.deepcopy(saver))\n\nanalyser = alibox.get_experiment_analyser(x_axis='num_of_queries')\nanalyser.add_method(method_name='QBC', method_results=QBC_result)\nprint(analyser)\nanalyser.plot_learning_curves(title='Example of AL', std_area=True)\n```  \n## Acknowledgement\nThanks to these awesome resources, which were used during the development of the AS-DMF framework:  \n* https://www.stratosphereips.org/datasets-ctu13\n* https://www.wireshark.org/\n* https://zeek.org/\n* https://github.com/zeek-flowmeter/zeek-flowmeter\n* https://github.com/SuperCowPowers/zat\n* https://scikit-learn.org/stable/index.html\n* https://github.com/NUAA-AL/ALiPy\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FTimeless-zfqi%2FAS-DMF-framework","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FTimeless-zfqi%2FAS-DMF-framework","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FTimeless-zfqi%2FAS-DMF-framework/lists"}