{"id":50887462,"url":"https://github.com/software-engineering-and-security/loom-android-malware-detection","last_synced_at":"2026-06-15T18:00:58.666Z","repository":{"id":358011073,"uuid":"1239482030","full_name":"software-engineering-and-security/loom-android-malware-detection","owner":"software-engineering-and-security","description":"\"Loom: A Balanced String-Based Transformer for Android Malware Detection (ICICS 2026)\"","archived":false,"fork":false,"pushed_at":"2026-05-15T07:27:39.000Z","size":11326,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-15T09:34:07.233Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/software-engineering-and-security.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-15T06:15:44.000Z","updated_at":"2026-05-15T07:27:44.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/software-engineering-and-security/loom-android-malware-detection","commit_stats":null,"previous_names":["software-engineering-and-security/loom-android-malware-detection"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/software-engineering-and-security/loom-android-malware-detection","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/software-engineering-and-security%2Floom-android-malware-detection","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/software-engineering-and-security%2Floom-android-malware-detection/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/software-engineering-and-security%2Floom-android-malware-detection/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/software-engineering-and-security%2Floom-android-malware-detection/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/software-engineering-and-security","download_url":"https://codeload.github.com/software-engineering-and-security/loom-android-malware-detection/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/software-engineering-and-security%2Floom-android-malware-detection/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34374146,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-15T02:00:07.085Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-15T18:00:57.623Z","updated_at":"2026-06-15T18:00:58.654Z","avatar_url":"https://github.com/software-engineering-and-security.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LOOM: A Balanced String-Based Transformer for Android Malware Detection\n\nReference implementation of **LOOM** (ICICS 2026). LOOM fuses three\ncomplementary static feature views extracted from each APK — the Android\nmanifest, API calls, and Dalvik opcodes — into a single string token\nsequence under a fixed token budget, then fine-tunes a transformer\nclassifier on it. The repository also ships reproductions of several\npublished baselines for direct comparison.\n\n\u003e 📄 **Paper:** Hantang Zhang, Mojtaba Eshghie, Bruno Kreyssig, Tommy Löfstedt,\n\u003e Alexandre Bartel. *LOOM: A Balanced String-Based Transformer for Android\n\u003e Malware Detection.* ICICS 2026 (to appear). See [Citation](#citation).\n\n---\n\n## Highlights\n\n- **Three-channel static features** extracted with [androguard]:\n  manifest entities, API-call sequences, and opcode sequences.\n- **Token-budget-aware preprocessing** that allocates the limited\n  context window across the three channels by a configurable ratio\n  (default 1 : 4 : 5 for manifest : api : opcode), with cleaning,\n  third-party-library filtering (AndroLibZoo + LibD + common ad/util\n  libraries), TF-IDF / χ² / information-gain feature selection.\n- **Multiple transformer backbones:** BERT, BigBird, Longformer, ModernBERT.\n- **Model explanation** via LIME, plus a lightweight logistic-regression\n  shadow model for global feature importance.\n- **Reproduced baselines:**\n  [ImageDroid], [MalScan], [RevealDroid], and a multimodal-transformer\n  fusion baseline.\n\n---\n\n## Repository layout\n\n```\nloom-android-malware/\n├── apk_process/           # APK parsing \u0026 raw-feature extraction (androguard)\n├── datasets/              # SHA-256 lists of every APK used in the paper (see Datasets)\n├── docs/                  # Paper appendix and other supplementary documents\n├── feature_process/       # Preprocessing into BERT-friendly token sequences\n│   ├── manifest_process.py\n│   ├── apicall_process.py\n│   ├── opcode_process.py\n│   ├── features_process_final.py     # main preprocessing entry point\n│   ├── split_dataset.py              # leak-free train/val/test split\n│   ├── counter_process.py\n│   ├── dsfile_process.py\n│   ├── malbert.py\n│   ├── tfidf_feature_extractor.py\n│   ├── chi2_feature_extractor.py\n│   ├── information_gain_feature_extractor.py\n│   ├── filter/                       # third-party-library blacklists\n│   └── utils/\n├── model/                 # Transformer classifiers (BERT / BigBird / Longformer / ModernBERT)\n│   ├── bert.py\n│   ├── bert_with_count.py\n│   ├── bigbird_base.py\n│   ├── longformer_base.py\n│   ├── modern_bert.py\n│   └── model_download.py\n├── model_explanation/     # LIME + lightweight shadow model\n│   ├── lime_bert.py\n│   ├── lightweight_model.py\n│   ├── important_feature_process.py\n│   └── feature_association.py\n├── obfuscation/           # Obfuscation-related processing\n├── repro_baselines/       # Reproduced baselines\n│   ├── image_droid/\n│   ├── malscan/\n│   ├── reveal_droid/\n│   └── multimodal_transformer/\n├── UniXcoder/             # UniXcoder wrapper (used by parts of the pipeline)\n├── utils/                 # Dataset building, downloading, analysis helpers\n├── LICENSE\n├── README.md\n└── requirements.txt\n```\n\n---\n\n## Installation\n\nTested with **Python 3.11** on Linux.\n\n```bash\ngit clone https://github.com/HantangZhang/loom-android-malware.git\ncd loom-android-malware\npython -m venv .venv\nsource .venv/bin/activate\npip install -r requirements.txt\n```\n\nMain dependencies (see `requirements.txt` for exact pinned versions):\n\n- `androguard 4.1.3` — APK static analysis\n- `transformers 4.51.3`, `torch 2.7.0`, `datasets 3.5.0` — modeling\n- `scikit-learn`, `scipy`, `lime` — feature selection \u0026 explanation\n- `tqdm`, `pandas`, `numpy`, `loguru`\n\n---\n\n## Quick start\n\nEvery script under `apk_process/`, `feature_process/`, `model/`,\n`model_explanation/`, `obfuscation/`, `utils/`, and `repro_baselines/*/`\nis runnable via `python -m \u003cmodule\u003e` with a `--help`-driven `argparse`\nCLI. Run any module with `--help` to discover its full flag set.\n\nYou may also want to point the HuggingFace cache somewhere other than\n`~/.cache/`:\n\n```bash\nexport ANDROID_ML_CACHE=/path/to/big/disk/hf_cache\n```\n\n### 1. Extract raw features from APKs\n\nFor a full directory of APKs (manifest + API calls + opcode in one parse):\n\n```bash\npython -m apk_process.apk_extractor extract-all \\\n    --target /path/to/apks \\\n    --manifest-dir /out/manifest \\\n    --api-dir      /out/apicall \\\n    --opcode-dir   /out/opcode \\\n    --workers 16\n```\n\nFor an obfuscation sweep (one SHA list, many APK source directories):\n\n```bash\npython -m apk_process.apk_extractor extract-by-sha \\\n    --sha-list /path/to/sha_list.txt \\\n    --apk-dir  /path/to/apks/CID \\\n    --apk-dir  /path/to/apks/JUNK \\\n    --output-dir /out/obfuscation \\\n    --workers 16\n```\n\n### 2. Build a HuggingFace dataset under a token budget\n\n```bash\npython -m feature_process.features_process_final \\\n    --manifest-dir /out/manifest \\\n    --api-dir      /out/apicall \\\n    --opcode-dir   /out/opcode \\\n    --base-dir     /out/features \\\n    --sha-list     /path/to/sub_dataset_sha.txt \\\n    --csv-path     /path/to/apk_metadata.csv \\\n    --filter-method chi2_delta_idf \\\n    --manifest-limit 300 --api-limit 300 --opcode-limit 300 \\\n    --global-limit 512 --cap 2 --num-proc 16\n```\n\nBy default, this command **also** creates a stratified train/val/test\nSHA split under `\u003cbase-dir\u003e/\u003ctask-name\u003e/split/{train,val,test}.txt`\nand fits every label-aware score matrix on the training partition only,\nso val + test never leak label information into feature selection. See\n[Preventing data leakage](#preventing-data-leakage) for the full story\nand the flags that control it.\n\nThe individual preprocessors are also runnable separately:\n`feature_process.manifest_process`, `feature_process.apicall_process`,\n`feature_process.opcode_process`.\n\n### 3. Fine-tune a classifier\n\n```bash\npython -m model.bert \\\n    --model-path bert-base-uncased \\\n    --dataset-dir /out/features/.../final_ds_chi2_delta_idf_... \\\n    --split-dir   /out/features/.../split \\\n    --output-dir  ./malware-bert \\\n    --num-train-epochs 5 --batch-size 64 --learning-rate 2e-5\n```\n\nPass the **same** `--split-dir` that preprocessing wrote so the model's\ntrain / val / test partition matches the one used to fit the feature\nmatrices. Without `--split-dir` the model falls back to a random split\n(useful only for quick experiments — not for published numbers; see\n[Preventing data leakage](#preventing-data-leakage)).\n\nThe long-context variants share the same flag layout:\n`model.bigbird_base`, `model.longformer_base`, `model.modern_bert`,\n`model.bert_with_count`.\n\n### 4. Explain predictions\n\n```bash\npython -m model_explanation.lime_bert \\\n    --dataset-dir   /path/to/hf_dataset \\\n    --model-path    ./malware-bert/checkpoint-XXXX \\\n    --tokenizer-path bert-base-uncased \\\n    --output-dir    ./lime_html \\\n    --lime-csv      ./lime.csv \\\n    --sample-size 50 --num-features 20 --num-samples 100\n```\n\nThe companion shadow model lives in `model_explanation.lightweight_model`\n(L1 logistic regression over the union vocabulary of the three views) and\nthe LIME-CSV analysis tools in `model_explanation.feature_association`\nand `model_explanation.important_feature_process`.\n\n---\n\n## Reproducing baselines\n\nEach baseline lives under `repro_baselines/\u003cname\u003e/`. Every script in\nthose folders is an `argparse` CLI — run with `--help` to inspect flags.\n\n| Baseline                | Folder                                      | Entry points (use `--help` for each) |\n| ----------------------- | ------------------------------------------- | ------------------------------------ |\n| ImageDroid              | `repro_baselines/image_droid/`              | `extract_dex_image_features {extract,fix-labels}`, `model_training {kfold,train,predict}` |\n| MalScan                 | `repro_baselines/malscan/`                  | `malscan_json_features`, `malscan_json_features_fast`, `malscan_merge_features`, `malscan_train_eval` |\n| RevealDroid             | `repro_baselines/reveal_droid/`             | `extract_apicount`, `extract_packageAPI`, `intent_action`, `reflection_native`, `build_features {single,multi}`, `revealDroid_detector {train,eval}` |\n| Multimodal Transformer  | `repro_baselines/multimodal_transformer/`   | `apk_to_dex_images`, `extract_bm_features`, `sm_features {process,process-roots,merge,debug-apk}`, `fusion_classifier`, `fine_tune` |\n\n---\n\n## Preventing data leakage\n\nThe feature-selection step is **label-aware**: the delta-IDF,\nchi-square and information-gain matrices score every token by how its\ndistribution differs between benign and malicious documents. Fitting\nthose matrices on the full dataset — including the samples you later\nhold out for evaluation — silently leaks label information from val /\ntest back into the features of every other sample.\n\nTo prevent this, the pipeline now treats the train/val/test partition\nas a first-class artefact that is shared between preprocessing and\ntraining:\n\n1. `feature_process.split_dataset` produces a deterministic stratified\n   split of the input SHA list and writes\n   `train.txt` / `val.txt` / `test.txt` under a chosen directory.\n2. `feature_process.features_process_final` reads (or creates) that\n   split before doing anything else, and excludes every val + test SHA\n   when fitting the score matrices. The selected tokens for *every*\n   sample (train, val, test) are produced by matrices fitted on the\n   train SHAs only.\n3. The model classifiers (`model.bert`, `model.bert_with_count`,\n   `model.bigbird_base`, `model.longformer_base`, `model.modern_bert`)\n   accept the same `--split-dir`. When given, they filter the\n   preprocessed dataset by the same `train.txt` / `val.txt` /\n   `test.txt` files instead of doing an ad-hoc random split, so the\n   train / test partition the model sees is identical to the one used\n   to fit the feature matrices.\n\n### Default behaviour\n\nRunning the preprocessing pipeline without any new flags already does\nthe right thing:\n\n```bash\npython -m feature_process.features_process_final \\\n    --manifest-dir /out/manifest --api-dir /out/apicall \\\n    --opcode-dir   /out/opcode   --base-dir /out/features \\\n    --sha-list     /path/to/sub_dataset_sha.txt \\\n    --csv-path     /path/to/labels.csv \\\n    --filter-method chi2_delta_idf\n# -\u003e writes /out/features/\u003ctask\u003e/split/{train,val,test}.txt\n# -\u003e fits matrices on train only, applies them to all samples\n```\n\nThen point the model at the same split:\n\n```bash\npython -m model.bert \\\n    --dataset-dir /out/features/\u003ctask\u003e/final_ds_chi2_delta_idf_..._512 \\\n    --split-dir   /out/features/\u003ctask\u003e/split \\\n    --output-dir  ./malware-bert\n```\n\n### Flags\n\n| Flag                       | Default                                 | Notes                                                                 |\n| -------------------------- | --------------------------------------- | --------------------------------------------------------------------- |\n| `--split-dir`              | `\u003cbase-dir\u003e/\u003ctask-name\u003e/split/`         | Reused if it already contains `train.txt` / `val.txt` / `test.txt`.   |\n| `--train-ratio`            | `0.8`                                   |                                                                       |\n| `--val-ratio`              | `0.1`                                   |                                                                       |\n| `--test-ratio`             | `0.1`                                   | Must sum to 1.0 with the other two.                                   |\n| `--split-seed`             | `42`                                    | Determines the split assignment.                                      |\n| `--no-split`               | off                                     | Legacy behaviour (fits matrices on the full dataset; **leaky**).      |\n| `--exclude-matrix-sha-list`| –                                       | Legacy: explicit text file of SHAs to exclude (takes priority).       |\n\n### Producing a split standalone\n\nIf you want to share the same split across many preprocessing runs or\nacross baselines, build it once with the standalone CLI:\n\n```bash\npython -m feature_process.split_dataset \\\n    --sha-list   /path/to/sub_dataset_sha.txt \\\n    --labels-csv /path/to/labels.csv \\\n    --output-dir /path/to/split \\\n    --train-ratio 0.8 --val-ratio 0.1 --test-ratio 0.1 --seed 42\n```\n\nThen point both the preprocessing pipeline and every model run at\n`--split-dir /path/to/split`.\n\n---\n\n## Datasets\n\nThis repository **does not redistribute APKs.** Per AndroZoo's terms\nof use we only publish the SHA-256 hashes of every sample used in the\npaper; you can fetch the corresponding APKs from [AndroZoo] (or any\nother source you have access to).\n\nThe `datasets/` directory holds five hash lists, one SHA-256 per line:\n\n| File                                  | Samples | What it is                                                                                  |\n| ------------------------------------- | ------- | ------------------------------------------------------------------------------------------- |\n| `datasets/AndroAMD.txt`               | 20,000  | The main training + test set assembled for the paper (in-house AMD selection).              |\n| `datasets/PublicAMD.txt`              | 15,343  | A public-corpus reference set (no overlap engineering applied; useful as a cross-check).    |\n| `datasets/concept_drift_datasets2022.txt` |   877 | Concept-drift evaluation: APKs first seen in 2022.                                          |\n| `datasets/concept_drift_datasets2023.txt` |   914 | Concept-drift evaluation: APKs first seen in 2023.                                          |\n| `datasets/obfu_1k.txt`                |  1,000 | Obfuscation-robustness evaluation set; 500 benign + 500 malicious (50/50 stratified split). |\n\n### Getting the APKs\n\nOnce you have AndroZoo (or equivalent) credentials, you can download\neach list with the helper in `utils/`:\n\n```bash\npython -m utils.download_by_list download \\\n    --sha-list datasets/AndroAMD.txt \\\n    --apk-dir  /path/to/where/apks/go \\\n    --androzoo-api-key $ANDROZOO_KEY\n```\n\n### Labels\n\nLabels (`sha256,label` CSV, with `0` = benign / `1` = malware) are\n**not bundled** in the repo because they are derived from\nVirusTotal-detection counts that AndroZoo distributes under separate\nterms. After downloading the APKs you can either:\n\n- Pull each sample's `vt_detection` from AndroZoo's metadata CSV and\n  threshold it (the convention used in the paper is `vt_detection \u003e= 4`\n  → malware, `vt_detection == 0` → benign), or\n- Reuse the helpers in `utils/build_apk_market_features.py` /\n  `utils/build_datasets.py`, which automate that thresholding.\n\n### Other resources shipped with the code\n\n- Third-party-library filter lists under `feature_process/filter/`\n  (AndroLibZoo, LibD threshold-10, the `cl_91` / `ad_240` common-library\n  lists, and the API-call blacklist) used by the API-call preprocessor.\n\n---\n\n## Citation\n\n```bibtex\n@inproceedings{zhang2026loom,\n    author    = {Zhang, Hantang and Eshghie, Mojtaba and Kreyssig, Bruno and L\\\"ofstedt, Tommy and Bartel, Alexandre},\n    title     = {{Loom}: A Balanced String-Based Transformer for {Android} Malware Detection},\n    booktitle = {Proceedings of the 28th International Conference on Information and Communications Security ({ICICS} 2026)},\n    series    = {Lecture Notes in Computer Science},\n    publisher = {Springer},\n    address   = {Fukui, Japan},\n    year      = {2026},\n    month     = oct,\n    note      = {To appear}\n}\n```\n\n---\n\n## License\n\nThis project is released under the [MIT License](LICENSE).\n\n## Issues / contact\n\nPlease open a [GitHub Issue](https://github.com/HantangZhang/loom-android-malware/issues)\nfor bug reports, questions, or reproduction trouble.\n\n[androguard]: https://github.com/androguard/androguard\n[ImageDroid]: https://example.com/imagedroid\n[MalScan]: https://example.com/malscan\n[RevealDroid]: https://example.com/revealdroid\n[AndroZoo]: https://androzoo.uni.lu/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoftware-engineering-and-security%2Floom-android-malware-detection","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsoftware-engineering-and-security%2Floom-android-malware-detection","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoftware-engineering-and-security%2Floom-android-malware-detection/lists"}