{"id":28444435,"url":"https://github.com/futurecomputing4ai/ember2024","last_synced_at":"2026-03-07T18:31:34.480Z","repository":{"id":297540031,"uuid":"996433283","full_name":"FutureComputing4AI/EMBER2024","owner":"FutureComputing4AI","description":null,"archived":false,"fork":false,"pushed_at":"2025-08-22T20:16:42.000Z","size":57,"stargazers_count":91,"open_issues_count":12,"forks_count":20,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-02-14T01:39:39.726Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FutureComputing4AI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-06-05T00:20:21.000Z","updated_at":"2026-02-11T18:14:25.000Z","dependencies_parsed_at":"2025-06-06T02:32:24.553Z","dependency_job_id":"2ce8d7fc-7514-4f97-a7c5-915e2cb8e925","html_url":"https://github.com/FutureComputing4AI/EMBER2024","commit_stats":null,"previous_names":["futurecomputing4ai/ember2024"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/FutureComputing4AI/EMBER2024","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FutureComputing4AI%2FEMBER2024","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FutureComputing4AI%2FEMBER2024/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FutureComputing4AI%2FEMBER2024/releases","manifests_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FutureComputing4AI%2FEMBER2024/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/FutureComputing4AI","download_url":"https://codeload.github.com/FutureComputing4AI/EMBER2024/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FutureComputing4AI%2FEMBER2024/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30226246,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-07T18:12:09.766Z","status":"ssl_error","status_checked_at":"2026-03-07T18:11:58.786Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-06T09:09:15.311Z","updated_at":"2026-03-07T18:31:34.450Z","avatar_url":"https://github.com/FutureComputing4AI.png","language":"Python","funding_links":[],"categories":[":bookmark_tabs: Datasets"],"sub_categories":["Scientific Research"],"readme":"# EMBER2024\n\nEMBER2024 is an update to the [EMBER2017 and EMBER2018](https://github.com/elastic/ember/) datasets. It includes raw features and labels for 3.2 million malicious and benign files from 6 different file types (Win32, Win64, .NET, APK, ELF, and PDF). EMBER2024 is meant to allow researchers to explore a variety of common malware analysis classification tasks. 
The dataset includes 7 types of labels and tags that support malicious/benign detection, malware family classification, malware behavior prediction, and more.\n\nFor more details, check out our [paper](https://arxiv.org/abs/2506.05074)!\n\n\n## EMBER2024 Contents\n\nEMBER2024 includes features and labels for files that were first uploaded to VirusTotal between Sep. 24th, 2023 and Dec. 14th, 2024. There are exactly 50,500 files chosen from each week of that time period, with the first 52 weeks of files making up the training set and the last 12 going to the test set. This lets researchers simulate how effectively a classifier might detect malware that is newer than its training corpus. In total, the training set is 2,626,000 files and the test set is 606,000 files.\n\n#### File Statistics\n| File Type   | Malicious + Benign (Weekly) | Train Total | Test Total |\n| -------- | ------- | ------ | ------- |\n| Win32  | 30,000    | 1,560,000 | 360,000 |\n| Win64 | 10,000     | 520,000 | 120,000 |\n| .NET    | 5,000    | 260,000 | 60,000 |\n| APK  | 4,000    | 208,000 | 48,000 |\n| PDF | 1,000  | 52,000 | 12,000 |\n| ELF    | 500    | 26,000 | 6,000 |\n\n#### Challenge Set\n\nEMBER2024 also includes features and labels for 6,315 malicious files in a \"challenge set\". These files initially went undetected by ~70 antivirus products on VirusTotal but were later found to be malicious. The challenge set is an excellent resource for assessing how well a machine learning classifier detects evasive malware.\n\n\n## EMBER Feature Version 3\n\nThe previous EMBER feature versions were pinned to [LIEF](https://lief.re) version 0.9.0, which requires Python 3.6. EMBER feature version 3 (\"thrember\") is a re-implementation of the EMBER feature vector format that uses the [pefile](https://github.com/erocarrera/pefile) library instead. pefile is stable and has no dependencies, making it ideal going forward. 
We have also made several additions to the EMBER feature vector format, which now includes features from the DOS header, Rich header, PE data directories, Authenticode signatures, and warnings during PE parsing. Furthermore, we have added support for feature extraction from non-PE files using a subset of the EMBER feature version 3 format. We show that effective classifiers for APK, ELF, and PDF files can be trained using just features from general file info, byte statistics, and string statistics.\n\n## Installation\n\nTo clone the repository and install it using pip, run:\n```\ngit clone https://github.com/FutureComputing4AI/EMBER2024.git\ncd EMBER2024/\npip install .\n```\n\n\n## Download Models and Dataset\n\nThe EMBER2024 models, features, and labels are hosted on HuggingFace. To download them from the HuggingFace hub, launch a Python console, import thrember, and run the download_models() and/or download_dataset() functions:\n\n#### Downloading the models\n\nTo download the 14 benchmark LightGBM classifiers we trained on EMBER2024, run:\n\n```\nimport thrember\nthrember.download_models(\"/path/to/download/to/\")\n```\n\n\n#### Downloading the data\n\n```\nimport thrember\nthrember.download_dataset(\"/path/to/download/to/\")\n```\n\nYou can download smaller chunks of the dataset by passing different keyword arguments to download_dataset():\n\nDownload all PE (Win32, Win64, and .NET) files:\n```\nthrember.download_dataset(\"/path/to/download/to/\", file_type=\"PE\")\n```\n\nDownload just the APKs in the training set:\n```\nthrember.download_dataset(\"/path/to/download/to/\", file_type=\"APK\", split=\"train\")\n```\n\nDownload just the challenge set:\n\n```\nthrember.download_dataset(\"/path/to/download/to/\", split=\"challenge\")\n```\n\n\n\nThe sizes of the features and labels for each portion of EMBER2024 are shown below:\n\n| Subset | Total Size |\n| ------ | ------ |\n| Win32 train | 23.7 GB |\n| Win32 test  | 4.9 GB |\n| Win64 train | 12.9 GB |\n| Win64 test  | 2.5 GB |\n| 
.NET train | 1.8 GB |\n| .NET test | 425 MB |\n| APK train | 1.0 GB |\n| APK test | 234 MB |\n| PDF train | 197 MB |\n| PDF test | 46 MB |\n| ELF train | 100 MB |\n| ELF test | 24 MB |\n| challenge | 126 MB |\n\n\n## Vectorizing Raw Features\n\nDepending on which files you choose to download, you can vectorize the entire EMBER2024 dataset or just a part of it. The Python code below will create .dat files with feature vectors and malicious/benign labels for the train, test, and challenge sets.\n\n```\nimport thrember\nthrember.create_vectorized_features('/path/to/dataset/')\n```\n\nFamilies and tags were assigned to files using [ClarAVy](https://github.com/FutureComputing4AI/ClarAVy/). If you want to train a classifier on other types of labels or tags, pass the label_type keyword to the create_vectorized_features() function:\n\n```\nthrember.create_vectorized_features('/path/to/dataset/', label_type=\"family\")\nthrember.create_vectorized_features('/path/to/dataset/', label_type=\"behavior\")\nthrember.create_vectorized_features('/path/to/dataset/', label_type=\"file_property\")\nthrember.create_vectorized_features('/path/to/dataset/', label_type=\"packer\")\nthrember.create_vectorized_features('/path/to/dataset/', label_type=\"exploit\")\nthrember.create_vectorized_features('/path/to/dataset/', label_type=\"group\")\n```\n\nBy default, any families, behaviors, etc. that occur fewer than 10 times in EMBER2024 are ignored during vectorization. 
To adjust this, use the class_min keyword:\n\n```\nthrember.create_vectorized_features('/path/to/dataset/', label_type=\"family\", class_min=1)\n```\n\n## Reading EMBER Vectors\n\nOnce you've vectorized EMBER2024, you can read the data and labels into numpy ndarrays:\n\n```\nimport thrember\nX_train, y_train = thrember.read_vectorized_features('/path/to/dataset/', subset=\"train\")\nX_test, y_test = thrember.read_vectorized_features('/path/to/dataset/', subset=\"test\")\nX_challenge, y_challenge = thrember.read_vectorized_features('/path/to/dataset/', subset=\"challenge\")\n```\n\n## More Examples\n\nCheck out the ```examples/``` directory for more example code!\n\n```\nember2024-notebook.ipynb -- Explore the EMBER2024 dataset\ntrain_lgbm.py -- Train a LightGBM classifier\neval_lgbm.py -- Evaluate a classifier on the test and challenge sets\n```\n\n\n## Dataset Methodology\n\nTo learn more about how we built EMBER2024, check out our [vtpipeline-rs](https://github.com/FutureComputing4AI/vtpipeline-rs) repository!\n\n\n## Citing\n\nIf you use EMBER2024 in your own research, please cite it using:\n\n```\n@inproceedings{joyce2025ember,\n      title={EMBER2024 - A Benchmark Dataset for Holistic Evaluation of Malware Classifiers},\n      author={Robert J. Joyce and Gideon Miller and Phil Roth and Richard Zak and Elliott Zaresky-Williams and Hyrum Anderson and Edward Raff and James Holt},\n      year={2025},\n      booktitle={Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffuturecomputing4ai%2Fember2024","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffuturecomputing4ai%2Fember2024","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffuturecomputing4ai%2Fember2024/lists"}