{"id":21315304,"url":"https://github.com/futurecomputing4ai/malconv2","last_synced_at":"2025-07-12T01:31:30.859Z","repository":{"id":63478450,"uuid":"322113840","full_name":"FutureComputing4AI/MalConv2","owner":"FutureComputing4AI","description":"Classifying Sequences of Extreme Length with Constant Memory Applied to Malware Detection","archived":false,"fork":false,"pushed_at":"2020-12-18T03:35:50.000Z","size":22187,"stargazers_count":62,"open_issues_count":4,"forks_count":12,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-06T12:11:16.653Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FutureComputing4AI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-12-16T22:08:57.000Z","updated_at":"2025-03-12T22:42:39.000Z","dependencies_parsed_at":"2022-11-19T18:01:47.317Z","dependency_job_id":null,"html_url":"https://github.com/FutureComputing4AI/MalConv2","commit_stats":null,"previous_names":["futurecomputing4ai/malconv2"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/FutureComputing4AI/MalConv2","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FutureComputing4AI%2FMalConv2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FutureComputing4AI%2FMalConv2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FutureComputing4AI%2FMalConv2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FutureComputing4AI%2FMalConv2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/o
wners/FutureComputing4AI","download_url":"https://codeload.github.com/FutureComputing4AI/MalConv2/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FutureComputing4AI%2FMalConv2/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264923080,"owners_count":23683716,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-21T18:18:55.649Z","updated_at":"2025-07-12T01:31:28.599Z","avatar_url":"https://github.com/FutureComputing4AI.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Classifying Sequences of Extreme Length with Constant Memory Applied to Malware Detection (a.k.a., MalConv2)\n\nThis is the PyTorch code implementing the approaches from our AAAI 2021 paper [Classifying Sequences of Extreme Length with Constant Memory Applied to Malware Detection](https://arxiv.org/abs/2012.09390). Using it, you can train the original MalConv model faster and with less memory \nthan before. You can also train our new MalConv with “Global Channel Gating” (GCG), which allows MalConv to learn feature interactions from across the entire input. \n\n## Code Organization\n\nThis is research-quality code that has gone through some quick edits before going online, and comes with no warranty. A rough outline of the files in this repo follows. \n\n### binaryLoader.py \n\n`binaryLoader.py` contains the functions we use for loading in a dataset of binaries, and supports un-gzipping them on the fly to reduce IO costs. 
It also includes a sampler that is used to create batches of similarly sized files to minimize excess \npadding used during training. This assumes the input dataset is already in sorted order by file size. \n\n### checkpoint.py\n\nThis contains code used to perform gradient checkpointing for reduced memory usage. This is optional and generally not necessary for our MalConv* models now, but was used during experimentation. \n\n### LowMemConv.py \n\nLowMemConv is the base class that implementations extend to obtain the fixed-memory pooling we introduced. This is provided by the `seq2fix` function, which does the work of applying the convolution in chunks, tracking the winners, and then grouping the \nwinning slices to re-run with gradient calculations. \n\nThe user extends `LowMemConvBase`, implementing the `processRange` function, which applies whatever convolutional strategy they desire to a range of bytes. The `determinRF` function is used to determine the receptive field size by iteratively testing \nfor the smallest input size that does not error, so that we know how to size our chunks later. \n\n\n### MalConvGCT_nocat.py \u0026 MalConvGCTTrain.py\n\nMalConvGCT_nocat implements the new contribution of our paper, using the GCT attention. An older file, MalConvGCT, uses this pooling but with a concatenation at the end. \n\n`MalConvGCTTrain.py` is the sister file that will train a `MalConvGCT` object. \n\nThe associated \"*Train.py\" files allow for training these models. AvastStyleConv implements the max pool version of the Avast architecture, and MalConvML implements a multi-layer version of MalConv; both were used in ablation testing. MalConv.py \nimplements the original MalConv using our new low memory approach. \n\n### malconvGCT_nocat.checkpoint\n\nThis file contains the weights for the GCT model from our paper’s results. It has some extra parameters that were never used due to some lines left commented in during model training. 
It also has an off-by-one “bug” that says it's the 21st epoch \ninstead of the 20th. \n\nTo load this file, use code like the following:\n\n```python\nimport torch\nfrom MalConvGCT_nocat import MalConvGCT\n\nmlgct = MalConvGCT(channels=256, window_size=256, stride=64)\nx = torch.load(\"malconvGCT_nocat.checkpoint\")\n# strict=False tolerates the extra, unused parameters stored in the checkpoint\nmlgct.load_state_dict(x['model_state_dict'], strict=False)\n```\n\n### AvastStyleConv.py\n\nThis implements a network in the style of Avast’s CNN from 2018, but replacing average pooling with our temporal max pooling for speed. \n\n### MalConv.py\n\nImplements the original MalConv network with our faster training/pooling. \n\n\n### MalConvML.py\n\nThis file contains an alternative experimental approach that trains with more layers, but it never worked well. \n\n### ContinueTraining.py\n\nThis file can be used to resume the training of models from a given checkpoint. \n\n### OptunaTrain.py \n\nThis file is used to do training with a hyper-parameter search. \n\n### Non-Neg options\n\nThe non-negative training currently present is faulty, as it allows you to do such training with a softmax output, which is technically incorrect. Please do not use it. \n\n\n## Citations\n\nIf you use the MalConv GCT algorithm or code, please cite our work! 
\n\n```\n@inproceedings{malconvGCT,\nauthor = {Raff, Edward and Fleshman, William and Zak, Richard and Anderson, Hyrum and Filar, Bobby and Mclean, Mark},\nbooktitle = {The Thirty-Fifth AAAI Conference on Artificial Intelligence},\ntitle = {{Classifying Sequences of Extreme Length with Constant Memory Applied to Malware Detection}},\nyear = {2021},\nurl={https://arxiv.org/abs/2012.09390},\n}\n```\n\n## Contact \n\nIf you have questions, please contact \n\nMark Mclean \u003cmrmclea@lps.umd.edu\u003e\nEdward Raff \u003cedraff@lps.umd.edu\u003e\nRichard Zak \u003crzak@lps.umd.edu\u003e\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffuturecomputing4ai%2Fmalconv2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffuturecomputing4ai%2Fmalconv2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffuturecomputing4ai%2Fmalconv2/lists"}