{"id":15913090,"url":"https://github.com/maksimekin/rfot","last_synced_at":"2025-10-04T10:08:59.916Z","repository":{"id":83019934,"uuid":"429332660","full_name":"MaksimEkin/RFoT","owner":"MaksimEkin","description":"Random Forest of Tensors (RFoT) is a tensor decomposition based ensemble semi-supervised classifier.","archived":false,"fork":false,"pushed_at":"2022-04-04T22:19:40.000Z","size":9085,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-02-08T17:14:38.290Z","etag":null,"topics":["canonical-polyadic","classification","clustering","cpd","cybersecurity","factorization","latent-variables","machine-learning","malware-analysis","semi-supervised-learning","tensor-decomposition","tensors"],"latest_commit_sha":null,"homepage":"https://maksimekin.github.io/RFoT/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MaksimEkin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-11-18T07:16:51.000Z","updated_at":"2023-03-14T08:01:27.000Z","dependencies_parsed_at":null,"dependency_job_id":"766e6e67-3d44-401d-96ca-07fe2256cebc","html_url":"https://github.com/MaksimEkin/RFoT","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MaksimEkin%2FRFoT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MaksimEkin%2FRFoT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MaksimE
kin%2FRFoT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MaksimEkin%2FRFoT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MaksimEkin","download_url":"https://codeload.github.com/MaksimEkin/RFoT/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246927844,"owners_count":20856198,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["canonical-polyadic","classification","clustering","cpd","cybersecurity","factorization","latent-variables","machine-learning","malware-analysis","semi-supervised-learning","tensor-decomposition","tensors"],"created_at":"2024-10-06T16:23:02.996Z","updated_at":"2025-10-04T10:08:54.871Z","avatar_url":"https://github.com/MaksimEkin.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Random Forest of Tensors (RFoT) \u003cimg align=\"left\" width=\"50\" height=\"50\" src=\"RFoT/RFoT.png\"\u003e\n\n\u003cdiv align=\"center\", style=\"font-size: 50px\"\u003e\n    \u003cimg src=\"https://img.shields.io/hexpm/l/plug\"\u003e\u003c/img\u003e\n    \u003cimg src=\"https://img.shields.io/badge/python-v3.8.5-blue\"\u003e\u003c/img\u003e\n\u003c/div\u003e\n\n\u003cbr\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cimg width=\"500\" src=\"RFoT/rfot_demo.png\"\u003e\n\u003c/p\u003e\n\nTensor decomposition is a powerful unsupervised Machine Learning method that enables the modeling of multi-dimensional data, including malware data. 
We introduce a novel ensemble semi-supervised classification algorithm, named Random Forest of Tensors (RFoT), that utilizes tensor decomposition to extract the complex and multi-faceted latent patterns from data. Our hybrid model leverages the strength of multi-dimensional analysis combined with clustering to capture the sample groupings in the latent components, whose combinations distinguish malware and benign-ware. The patterns extracted from a given dataset with tensor decomposition depend upon the configuration of the tensor, such as dimension, entry, and rank selection. To capture the unique perspectives of different tensor configurations, we employ the “wisdom of crowds” philosophy and make use of decisions made by the majority of a randomly generated ensemble of tensors with varying dimensions, entries, and ranks.\n\nAs the tensor decomposition backend, RFoT offers two CPD algorithms. First, the RFoT package includes a Python implementation of the **CP-ALS** (**[pyCP_ALS](https://github.com/MaksimEkin/pyCP_ALS)**) algorithm that was originally introduced in the [MATLAB Tensor Toolbox](https://www.tensortoolbox.org) [2,3,4,5]. The CP-ALS backend can also be used to **decompose each random tensor in a parallel manner**. RFoT can also be used with the Python implementation of the **CP-APR** (**[pyCP_APR](https://github.com/lanl/pyCP_APR)**) algorithm with **GPU capability** [1]. 
The CP-APR backend allows decomposing each random tensor configuration both in an **embarrassingly parallel fashion on a single GPU** and in a **multi-GPU parallel execution**.\n\n\u003cdiv align=\"center\", style=\"font-size: 50px\"\u003e\n\n### [:information_source: Documentation](https://maksimekin.github.io/RFoT/index.html) \u0026emsp; [:orange_book: Example Notebooks](examples/) \u0026emsp; [:bar_chart: Datasets](data/) \u0026emsp; [:page_facing_up: Abstract](https://www.maksimeren.com/abstract/Random_Forest_of_Tensors_RFoT_MTEM.pdf)  \u0026emsp; [:scroll: Poster](https://www.maksimeren.com/poster/Random_Forest_of_Tensors_RFoT_MTEM.pdf)\n\n\u003c/div\u003e\n\n\n## Installation\n\n```shell\nconda create --name RFoT python=3.8.5\nconda activate RFoT\npip install git+https://github.com/MaksimEkin/RFoT.git\n```\n\n## Example Usage\nIn the example below, we use a small sample from the [EMBER-2018](https://github.com/elastic/ember) dataset to classify malware and benign-ware:\n- Random tensors in the ensemble are decomposed in a multi-GPU parallel fashion using 2 GPUs (```n_jobs=2```, ```n_gpus=2```).\n- The CP-APR tensor decomposition backend is used with GPU (```decomp=\"cp_apr_gpu\"```).\n- 200 tensor configurations are randomly sampled (```n_estimators=200```).\n- Each tensor in the ensemble can have between 3 and 8 dimensions (```min_dimensions=3```, ```max_dimensions=8```).\n- Rank is between 2 and 10 
(```rank=\"random\"```, ```min_rank=2```, ```max_rank=10```).\n- A cluster uniformity threshold of 1.0 is used (```cluster_purity_tol=1.0```).\n- Patterns are captured with Mean Shift (MS) clustering (```clustering=\"ms\"```).\n- Feature values representing the tensor entry are not binned (```bin_entry=False```).\n- The maximum tensor dimension size representing any feature equals the total number of unique values for that feature, where the values are mapped to an index in the tensor dimension (```bin_scale=1```).\n\n```python\nimport pickle\nimport numpy as np\nfrom sklearn.metrics import f1_score\nfrom RFoT import RFoT\n\n# Load the example data\nwith open(\"data/example.p\", \"rb\") as f:\n    data = pickle.load(f)\nX = data[\"X\"]\ny_experiment = data[\"y_experiment\"]\ny_true = data[\"y_true\"]\n\n# Predict the unknown sample labels\nmodel = RFoT(\n    bin_scale=1,\n    min_dimensions=3,\n    max_dimensions=8,\n    cluster_purity_tol=1.0,\n    rank=\"random\",\n    min_rank=2,\n    max_rank=10,\n    n_estimators=200,\n    bin_entry=False,\n    decomp=\"cp_apr_gpu\",\n    clustering=\"ms\",\n    n_jobs=2,\n    n_gpus=2\n)\ny_pred = model.predict(X, y_experiment)\n\n# Results\nunknown_indices = np.argwhere(y_experiment == -1).flatten()\ndid_predict_indices = np.argwhere(y_pred[unknown_indices] != -1).flatten()\nabstaining_count = len(np.argwhere(y_pred == -1))\nf1 = f1_score(\n    y_true[unknown_indices][did_predict_indices],\n    y_pred[unknown_indices][did_predict_indices],\n    average=\"weighted\",\n)\n\nprint(\"Num. of Abstaining\", abstaining_count)\nprint(\"Percent Abstaining\", (abstaining_count / len(unknown_indices)) * 100, \"%\")\nprint(\"F1=\", f1)\n```\n**See the [examples](examples/) for more.**\n\n## How to Cite RFoT?\nIf you use RFoT, please cite it.\n\n```latex\n@MISC{Eren2022RFoT,\n  author = {M. E. {Eren} and C. {Nicholas} and E. {Raff} and R. {Yus} and J. S. {Moore} and B. S. 
{Alexandrov}},\n  title = {{RFoT}},\n  year = {2022},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https://github.com/MaksimEkin/RFoT}}\n}\n\n@MISC{eren2021RFoT,\n      title={Random Forest of Tensors (RFoT)}, \n      author={M. E. {Eren} and C. {Nicholas} and R. {McDonald} and C. {Hamer}},\n      year={2021},\n      note={Presented at the 12th Annual Malware Technical Exchange Meeting, Online, 2021}\n}\n```\n\n## Acknowledgments\n\nThis work was done as part of Maksim E. Eren's Master's Thesis at the University of Maryland, Baltimore County with the thesis committee members and collaborators Charles Nicholas, Edward Raff, Roberto Yus, Boian S. Alexandrov, and Juston S. Moore.\n\n## References\n[1] Eren, M.E., Moore, J.S., Skau, E.W., Bhattarai, M., Moore, E.A., and Alexandrov, B. 2022. General-Purpose Unsupervised Cyber Anomaly Detection via Non-Negative Tensor Factorization. Digital Threats: Research and Practice, 28 pages. DOI: https://doi.org/10.1145/3519602\n\n[2] General software, latest release: Brett W. Bader, Tamara G. Kolda and others, Tensor Toolbox for MATLAB, Version 3.2.1, www.tensortoolbox.org, April 5, 2021.\n\n[3] Dense tensors: B. W. Bader and T. G. Kolda, Algorithm 862: MATLAB Tensor Classes for Fast Algorithm Prototyping, ACM Trans. Mathematical Software, 32(4):635-653, 2006, http://dx.doi.org/10.1145/1186785.1186794.\n\n[4] Sparse, Kruskal, and Tucker tensors: B. W. Bader and T. G. Kolda, Efficient MATLAB Computations with Sparse and Factored Tensors, SIAM J. Scientific Computing, 30(1):205-231, 2007, http://dx.doi.org/10.1137/060676489.\n\n[5] M. E. Eren. pyCP_ALS. https://github.com/MaksimEkin/pyCP_ALS, 2022.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaksimekin%2Frfot","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaksimekin%2Frfot","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaksimekin%2Frfot/lists"}