{"id":20556520,"url":"https://github.com/somjit101/microsoft-malware-detection","last_synced_at":"2025-06-23T13:36:16.621Z","repository":{"id":179926692,"uuid":"387572641","full_name":"somjit101/Microsoft-Malware-Detection","owner":"somjit101","description":"A multi-class classification problem where the task is to classify a file to one of 9 types of Malware usually found in a Windows system, using information from the raw data and metadata of the file. ","archived":false,"fork":false,"pushed_at":"2021-07-20T20:48:47.000Z","size":13675,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-06T06:48:18.792Z","etag":null,"topics":["bag-of-words","cross-validation","data-mining","feature-engineering","feature-extraction","gradient-boosting","k-nearest-neighbors","logistic-regression","machine-learning","microsoft-malware","microsoft-malware-detection","multi-variate-analysis","multiprogramming","parallel-processing","parallelization","random-search","t-sne","univariate-analysis","xgboost"],"latest_commit_sha":null,"homepage":"","language":"Jupyter 
Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/somjit101.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-07-19T19:26:09.000Z","updated_at":"2021-07-20T20:48:49.000Z","dependencies_parsed_at":null,"dependency_job_id":"29a7d8bd-b0e6-4a86-a157-eddad43516a2","html_url":"https://github.com/somjit101/Microsoft-Malware-Detection","commit_stats":null,"previous_names":["somjit101/microsoft-malware-detection"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/somjit101/Microsoft-Malware-Detection","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/somjit101%2FMicrosoft-Malware-Detection","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/somjit101%2FMicrosoft-Malware-Detection/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/somjit101%2FMicrosoft-Malware-Detection/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/somjit101%2FMicrosoft-Malware-Detection/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/somjit101","download_url":"https://codeload.github.com/somjit101/Microsoft-Malware-Detection/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/somjit101%2FMicrosoft-Malware-Detection/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261487466,"owners_count":23166088,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-0
7-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bag-of-words","cross-validation","data-mining","feature-engineering","feature-extraction","gradient-boosting","k-nearest-neighbors","logistic-regression","machine-learning","microsoft-malware","microsoft-malware-detection","multi-variate-analysis","multiprogramming","parallel-processing","parallelization","random-search","t-sne","univariate-analysis","xgboost"],"created_at":"2024-11-16T03:28:43.334Z","updated_at":"2025-06-23T13:36:11.607Z","avatar_url":"https://github.com/somjit101.png","language":"Jupyter Notebook","readme":"# Microsoft Malware Detection\nA multi-class classification problem where the task is to classify a file into one of 9 types of malware usually found on a Windows system, using information from the raw data and metadata of the file. \n\n## Real-world Problem\n\n### What is Malware?\n\nThe term malware is a contraction of malicious software. Put simply, malware is any piece of software that was written with the intent of doing harm to data, devices or to people.\n[Source](https://www.avg.com/en/signal/what-is-malware)\n\n### Problem Statement\n\nIn the past few years, the malware industry has grown so rapidly that syndicates invest heavily in technologies to evade traditional protection, forcing anti-malware groups and communities to build more robust software to detect and terminate these attacks. A major part of protecting a computer system from a malware attack is to \u003cb\u003eidentify whether a given file or piece of software\u003c/b\u003e is malware. 
\n  \n[Kaggle Problem](https://www.kaggle.com/c/malware-classification)\n  \nMicrosoft has been very active in building anti-malware products over the years, and it runs its anti-malware utilities on over \u003cb\u003e150 million computers\u003c/b\u003e around the world. This generates tens of millions of daily data points to be analyzed as potential malware. To analyze and classify such large amounts of data effectively, we need to be able to group the files and identify their respective families. \n  \n**This dataset provided by Microsoft contains 9 classes of malware.**\n  \n### Real-world Objectives/Constraints\n  \n1. Minimize multi-class error.\n2. Produce multi-class probability estimates.\n3. Malware detection should not take hours and block the user's computer. It should finish in a few seconds or a minute.\n  \n## Machine Learning Problem \n  \n  ### Data\n  \n  #### Data Overview\n  \n  \u003cli\u003e Dataset link: https://www.kaggle.com/c/malware-classification/data \u003c/li\u003e\n\u003cli\u003e For every malware sample, we have two files: \u003col\u003e \u003cli\u003e .asm file (read more: https://www.reviversoft.com/file-extensions/asm) \u003c/li\u003e\u003cli\u003e.bytes file (the raw data contains the hexadecimal representation of the file's binary content, without the PE header)\u003c/li\u003e\u003c/ol\u003e\u003c/li\u003e \n    \n\u003cli\u003eThe total training dataset consists of 200 GB of data, of which 50 GB is .bytes files and 150 GB is .asm files.  \u003c/li\u003e\n\u003cli\u003e\u003cb\u003eThat is a lot of data for a single box/computer.\u003c/b\u003e \u003c/li\u003e\n\n\u003cli\u003eThere are 10,868 .bytes files and 10,868 .asm files, for a total of 21,736 files. \u003c/li\u003e\n\n\u003cli\u003eThere are 9 types of malware (9 classes) in the given data.\u003c/li\u003e\n\u003cli\u003e Types of Malware:\n    \u003col\u003e\n        \u003cli\u003e Ramnit \u003c/li\u003e\n        \u003cli\u003e Lollipop \u003c/li\u003e\n        
\u003cli\u003e Kelihos_ver3 \u003c/li\u003e\n        \u003cli\u003e Vundo \u003c/li\u003e\n        \u003cli\u003e Simda \u003c/li\u003e\n        \u003cli\u003e Tracur \u003c/li\u003e\n        \u003cli\u003e Kelihos_ver1 \u003c/li\u003e\n        \u003cli\u003e Obfuscator.ACY \u003c/li\u003e\n        \u003cli\u003e Gatak \u003c/li\u003e\n    \u003c/ol\u003e\n\u003c/li\u003e\n  \n  #### Example Data Point\n  \n  \u003cp style = \"font-size:18px\"\u003e\u003cb\u003e .asm file\u003c/b\u003e\u003c/p\u003e\n\u003cpre\u003e\n.text:00401000\t\t\t\t\t\t\t\t       assume es:nothing, ss:nothing, ds:_data,\tfs:nothing, gs:nothing\n.text:00401000 56\t\t\t\t\t\t\t       push    esi\n.text:00401001 8D 44 24\t08\t\t\t\t\t\t       lea     eax, [esp+8]\n.text:00401005 50\t\t\t\t\t\t\t       push    eax\n.text:00401006 8B F1\t\t\t\t\t\t\t       mov     esi, ecx\n.text:00401008 E8 1C 1B\t00 00\t\t\t\t\t\t       call    ??0exception@std@@QAE@ABQBD@Z ; std::exception::exception(char const * const \u0026)\n.text:0040100D C7 06 08\tBB 42 00\t\t\t\t\t       mov     dword ptr [esi],\toffset off_42BB08\n.text:00401013 8B C6\t\t\t\t\t\t\t       mov     eax, esi\n.text:00401015 5E\t\t\t\t\t\t\t       pop     esi\n.text:00401016 C2 04 00\t\t\t\t\t\t\t       retn    4\n.text:00401016\t\t\t\t\t\t       ; ---------------------------------------------------------------------------\n.text:00401019 CC CC CC\tCC CC CC CC\t\t\t\t\t       align 10h\n.text:00401020 C7 01 08\tBB 42 00\t\t\t\t\t       mov     dword ptr [ecx],\toffset off_42BB08\n.text:00401026 E9 26 1C\t00 00\t\t\t\t\t\t       jmp     sub_402C51\n.text:00401026\t\t\t\t\t\t       ; ---------------------------------------------------------------------------\n.text:0040102B CC CC CC\tCC CC\t\t\t\t\t\t       align 10h\n.text:00401030 56\t\t\t\t\t\t\t       push    esi\n.text:00401031 8B F1\t\t\t\t\t\t\t       mov     esi, ecx\n.text:00401033 C7 06 08\tBB 42 00\t\t\t\t\t       mov     dword ptr [esi],\toffset off_42BB08\n.text:00401039 E8 13 
1C\t00 00\t\t\t\t\t\t       call    sub_402C51\n.text:0040103E F6 44 24\t08 01\t\t\t\t\t\t       test    byte ptr\t[esp+8], 1\n.text:00401043 74 09\t\t\t\t\t\t\t       jz      short loc_40104E\n.text:00401045 56\t\t\t\t\t\t\t       push    esi\n.text:00401046 E8 6C 1E\t00 00\t\t\t\t\t\t       call    ??3@YAXPAX@Z    ; operator delete(void *)\n.text:0040104B 83 C4 04\t\t\t\t\t\t\t       add     esp, 4\n.text:0040104E\n.text:0040104E\t\t\t\t\t\t       loc_40104E:\t\t\t       ; CODE XREF: .text:00401043\u0018j\n.text:0040104E 8B C6\t\t\t\t\t\t\t       mov     eax, esi\n.text:00401050 5E\t\t\t\t\t\t\t       pop     esi\n.text:00401051 C2 04 00\t\t\t\t\t\t\t       retn    4\n.text:00401051\t\t\t\t\t\t       ; ---------------------------------------------------------------------------\n\u003c/pre\u003e\n\u003cp style = \"font-size:18px\"\u003e\u003cb\u003e .bytes file\u003c/b\u003e\u003c/p\u003e\n\u003cpre\u003e\n00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20\n00401010 00 00 20 09 2A 02 00 00 00 00 8E 10 41 0A 21 01\n00401020 40 00 02 01 00 90 21 00 32 40 00 1C 01 40 C8 18\n00401030 40 82 02 63 20 00 00 09 10 01 02 21 00 82 00 04\n00401040 82 20 08 83 00 08 00 00 00 00 02 00 60 80 10 80\n00401050 18 00 00 20 A9 00 00 00 00 04 04 78 01 02 70 90\n00401060 00 02 00 08 20 12 00 00 00 40 10 00 80 00 40 19\n00401070 00 00 00 00 11 20 80 04 80 10 00 20 00 00 25 00\n00401080 00 00 01 00 00 04 00 10 02 C1 80 80 00 20 20 00\n00401090 08 A0 01 01 44 28 00 00 08 10 20 00 02 08 00 00\n004010A0 00 40 00 00 00 34 40 40 00 04 00 08 80 08 00 08\n004010B0 10 00 40 00 68 02 40 04 E1 00 28 14 00 08 20 0A\n004010C0 06 01 02 00 40 00 00 00 00 00 00 20 00 02 00 04\n004010D0 80 18 90 00 00 10 A0 00 45 09 00 10 04 40 44 82\n004010E0 90 00 26 10 00 00 04 00 82 00 00 00 20 40 00 00\n004010F0 B4 00 00 40 00 02 20 25 08 00 00 00 00 00 00 00\n00401100 08 00 00 50 00 08 40 50 00 02 06 22 08 85 30 00\n00401110 00 80 00 80 60 00 09 00 04 20 00 00 00 00 00 00\n00401120 00 82 40 02 00 11 46 01 
4A 01 8C 01 E6 00 86 10\n00401130 4C 01 22 00 64 00 AE 01 EA 01 2A 11 E8 10 26 11\n00401140 4E 11 8E 11 C2 00 6C 00 0C 11 60 01 CA 00 62 10\n00401150 6C 01 A0 11 CE 10 2C 11 4E 10 8C 00 CE 01 AE 01\n00401160 6C 10 6C 11 A2 01 AE 00 46 11 EE 10 22 00 A8 00\n00401170 EC 01 08 11 A2 01 AE 10 6C 00 6E 00 AC 11 8C 00\n00401180 EC 01 2A 10 2A 01 AE 00 40 00 C8 10 48 01 4E 11\n00401190 0E 00 EC 11 24 10 4A 10 04 01 C8 11 E6 01 C2 00\n\n\u003c/pre\u003e\n  \n  ### Mapping the Real-world Problem to an ML Problem\n  \n  **There are nine different classes of malware into which we need to classify a given data point =\u003e Multi-class classification problem**\n  \n  \n  #### Performance Metrics\n  [Source](https://www.kaggle.com/c/malware-classification#evaluation)\n  \n  * Multi-class log-loss \n  * Confusion matrix \n  \n  #### ML Objectives/Constraints\n  \n  **Objective:** Predict the probability of each data point belonging to each of the nine classes.\n  \n  **Constraints:**\n  * Class probabilities are needed.\n  * Penalize the errors in class probabilities =\u003e Metric is log-loss.\n  * Some latency constraints.\n  \n  ### Train/Test Dataset\n  \n  We have split the dataset randomly into three parts: train, cross-validation, and test, with 64%, 16%, and 20% of the data respectively.\n  \n  ### Useful blogs, videos and reference research papers\n  \n  * [Novel Feature Extraction, Selection and Fusion for Effective Malware Family Classification](https://arxiv.org/pdf/1511.04317.pdf)\n  * [YouTube Video - Kaggle top solution](https://www.youtube.com/watch?v=VLQTRlLGz5Y)\n  * [Derek Chadwick's GitHub repo](https://github.com/dchad/malware-detection)\n  * [Useful Slides](https://vizsec.org/files/2011/Nataraj.pdf)\n  * [\"Cross validation is more trustworthy than domain knowledge.\"](https://www.dropbox.com/sh/gfqzv0ckgs4l1bf/AAB6EelnEjvvuQg2nu_pIB6ua?dl=0) by Nataraj\n  
\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsomjit101%2Fmicrosoft-malware-detection","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsomjit101%2Fmicrosoft-malware-detection","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsomjit101%2Fmicrosoft-malware-detection/lists"}