{"id":17984953,"url":"https://github.com/rdspring1/mission","last_synced_at":"2026-03-09T02:32:01.131Z","repository":{"id":67479131,"uuid":"135932436","full_name":"rdspring1/MISSION","owner":"rdspring1","description":"MISSION: Ultra Large-Scale Feature Selection using Count-Sketches","archived":false,"fork":false,"pushed_at":"2019-10-06T16:02:27.000Z","size":58,"stargazers_count":13,"open_issues_count":2,"forks_count":6,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-04-06T10:11:54.215Z","etag":null,"topics":["compressive-sensing","count-sketches","dna-metagenomics","feature-extraction","hashing","large-scale-learning"],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rdspring1.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-06-03T19:06:12.000Z","updated_at":"2023-12-17T13:22:23.000Z","dependencies_parsed_at":null,"dependency_job_id":"853c2419-fe26-4945-bcc0-deeea0509d19","html_url":"https://github.com/rdspring1/MISSION","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/rdspring1/MISSION","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rdspring1%2FMISSION","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rdspring1%2FMISSION/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rdspring1%2FMISSION/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rdspring1%2FMISSIO
N/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rdspring1","download_url":"https://codeload.github.com/rdspring1/MISSION/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rdspring1%2FMISSION/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30280851,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-09T02:23:26.802Z","status":"ssl_error","status_checked_at":"2026-03-09T02:22:46.175Z","response_time":61,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["compressive-sensing","count-sketches","dna-metagenomics","feature-extraction","hashing","large-scale-learning"],"created_at":"2024-10-29T18:23:30.411Z","updated_at":"2026-03-09T02:32:01.114Z","avatar_url":"https://github.com/rdspring1.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MISSION\n[MISSION: Ultra Large-Scale Feature Selection using Count-Sketches](https://arxiv.org/abs/1806.04310)\n\nAn ICML 2018 paper by Amirali Aghazadeh\\*, Ryan Spring\\*, Daniel LeJeune, Gautam Dasarathy, Anshumali Shrivastava, Richard G. 
Baraniuk\n\n\\* These authors contributed equally and are listed alphabetically.\n\n# How-To-Run + Code Versions\n* All data files are formatted using the [VW input format](https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Input-format)\n0. Build executables by running Makefile\n1. Mission Logistic Regression\n```\n// Hyperparameters\n// Size of Top-K Heap\nconst size_t TOPK = (1 \u003c\u003c 14) - 1;\n\n// Size of Count-Sketch Array\nconst size_t D = (1 \u003c\u003c 18) - 1;\n\n// Number of Arrays in Count-Sketch\nconst size_t N = 3;\n\n// Learning Rate\nconst float LR = 5e-1;\n\n./mission_logistic train_data test_data\n```\n\n2. Fine-Grained Mission Softmax Regression\n```\n// Hyperparameters\n\n// Size of Top-K Heap\nconst size_t TOPK = (1 \u003c\u003c 20) - 1;\n\n// Number of Classes\nconst size_t K = 193;\n\n// Size of Count-Sketch Array\nconst size_t D = (1 \u003c\u003c 24) - 1;\n\n// Number of Arrays in Count-Sketch\nconst size_t N = 3;\n\n// Learning Rate\nconst float LR = 1e-2;\n\n// Length of String Feature Representation\nconst size_t LEN = 12;\n\n./fine_mission_softmax train_data test_data\n```\n\n3. Coarse-Grained Mission Softmax Regression\n```\n// Hyperparameters\n\n// Size of Top-K Heap\nconst size_t TOPK = (1 \u003c\u003c 22) - 1;\n\n// Number of Classes\nconst size_t K = 193;\n\n// Size of Count-Sketch Array\nconst size_t D = (1 \u003c\u003c 24) - 1;\n\n// Number of Arrays in Count-Sketch\nconst size_t N = 3;\n\n// Learning Rate\nconst float LR = 1e-1;\n\n// Length of String Feature Representation\nconst size_t LEN = 12;\n\n./coarse_mission_softmax [train_data_part_1 train_data_part_2 ... train_data_part_n] test_data\n```\n\n4. 
Feature Hashing Softmax Regression\n```\n// Hyperparameters\n\n// Number of Classes\nconst size_t K = 193;\n\n// Size of Count-Sketch Array\nconst size_t D = (1 \u003c\u003c 24) - 1;\n\n// Learning Rate\nconst float LR = 1e-2;\n\n// Length of String Feature Representation\nconst size_t LEN = 12;\n\n./softmax [train_data_part_1 train_data_part_2 ... train_data_part_n] test_data\n```\n\n# Optimizations\n\n* Mission streams the dataset in via memory-mapped I/O instead of loading everything into memory, which is necessary for tera-scale datasets.\n* AVX SIMD optimization for fast Softmax Regression\n* The code is currently optimized for the Splice-Site and DNA Metagenomics datasets.\n\n### Mission Softmax Regression\n0. Fine-Grained Feature Set - Each class maintains a separate feature set, so there is a top-k heap for each class.\n1. Coarse-Grained Feature Set - All the classes share a common set of features, so there is only one top-k heap. Each feature is scored by its L1 norm across all classes.\n2. Data Parallelism - Each worker maintains a separate heap, while all workers aggregate gradients in the same count-sketch.\n\n# Datasets\n1. [KDD 2012](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2012)\n2. [RCV1](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#rcv1.binary)\n3. [Webspam - Trigram](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#webspam)\n4. [DNA Metagenomics](http://projects.cbio.mines-paristech.fr/largescalemetagenomics/)\n5. [Criteo 1TB](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#criteo_tb)\n6. [Splice-Site 3.2TB](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#splice-site)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frdspring1%2Fmission","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frdspring1%2Fmission","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frdspring1%2Fmission/lists"}