https://github.com/software-engineering-and-security/loom-android-malware-detection
"Loom: A Balanced String-Based Transformer for Android Malware Detection (ICICS 2026)"
- Host: GitHub
- URL: https://github.com/software-engineering-and-security/loom-android-malware-detection
- Owner: software-engineering-and-security
- License: mit
- Created: 2026-05-15T06:15:44.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-05-15T07:27:39.000Z (about 1 month ago)
- Last Synced: 2026-05-15T09:34:07.233Z (about 1 month ago)
- Language: Python
- Size: 10.8 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# LOOM: A Balanced String-Based Transformer for Android Malware Detection
Reference implementation of **LOOM** (ICICS 2026). LOOM fuses three
complementary static feature views extracted from each APK (the Android
manifest, API calls, and Dalvik opcodes) into a single string token
sequence under a fixed token budget, then fine-tunes a transformer
classifier on it. The repository also ships reproductions of several
published baselines for direct comparison.
> **Paper:** Hantang Zhang, Mojtaba Eshghie, Bruno Kreyssig, Tommy Löfstedt,
> Alexandre Bartel. *LOOM: A Balanced String-Based Transformer for Android
> Malware Detection.* ICICS 2026 (to appear). See [Citation](#citation).
---
## Highlights
- **Three-channel static features** extracted with [androguard]:
manifest entities, API-call sequences, and opcode sequences.
- **Token-budget-aware preprocessing** that allocates the limited
context window across the three channels by a configurable ratio
(default 1 : 4 : 5 for manifest : api : opcode), with cleaning,
third-party-library filtering (AndroLibZoo + LibD + common ad/util
libraries), and TF-IDF / χ² / information-gain feature selection.
- **Multiple transformer backbones:** BERT, BigBird, Longformer, ModernBERT.
- **Model explanation** via LIME, plus a lightweight logistic-regression
shadow model for global feature importance.
- **Reproduced baselines:**
[ImageDroid], [MalScan], [RevealDroid], and a multimodal-transformer
fusion baseline.
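
As a rough illustration of the token-budget idea in the highlights above, the sketch below splits a global limit across the three channels by a ratio and then fuses the truncated channels into one sequence. It is not the repository's implementation: `allocate_budget`, `fuse`, and the largest-remainder rounding rule are hypothetical names and choices made for this example.

```python
from typing import Dict, List, Sequence

CHANNELS = ("manifest", "api", "opcode")


def allocate_budget(global_limit: int, ratio: Sequence[int] = (1, 4, 5)) -> Dict[str, int]:
    """Split `global_limit` tokens across the channels proportionally to `ratio`."""
    if len(ratio) != len(CHANNELS) or any(r < 0 for r in ratio) or sum(ratio) == 0:
        raise ValueError("ratio must hold three non-negative weights, not all zero")
    total = sum(ratio)
    # Floor shares first, in integer arithmetic to avoid float rounding surprises.
    shares = [global_limit * r // total for r in ratio]
    remainders = [global_limit * r % total for r in ratio]
    leftover = global_limit - sum(shares)
    # Largest-remainder rule (an assumption): spare tokens go to the channels
    # that lost the most to flooring.
    order = sorted(range(len(ratio)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return dict(zip(CHANNELS, shares))


def fuse(tokens_by_channel: Dict[str, List[str]], budget: Dict[str, int]) -> List[str]:
    """Truncate each channel to its share and concatenate in a fixed order."""
    fused: List[str] = []
    for name in CHANNELS:
        fused.extend(tokens_by_channel.get(name, [])[: budget[name]])
    return fused


budget = allocate_budget(512)  # 1 : 4 : 5 over a 512-token window
```

A channel that has fewer tokens than its share simply contributes fewer; this sketch does not redistribute the unused budget to the other channels.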
---
## Repository layout
```
loom-android-malware/
├── apk_process/                  # APK parsing & raw-feature extraction (androguard)
├── datasets/                     # SHA-256 lists of every APK used in the paper (see Datasets)
├── docs/                         # Paper appendix and other supplementary documents
├── feature_process/              # Preprocessing into BERT-friendly token sequences
│   ├── manifest_process.py
│   ├── apicall_process.py
│   ├── opcode_process.py
│   ├── features_process_final.py # main preprocessing entry point
│   ├── split_dataset.py          # leak-free train/val/test split
│   ├── counter_process.py
│   ├── dsfile_process.py
│   ├── malbert.py
│   ├── tfidf_feature_extractor.py
│   ├── chi2_feature_extractor.py
│   ├── information_gain_feature_extractor.py
│   ├── filter/                   # third-party-library blacklists
│   └── utils/
├── model/                        # Transformer classifiers (BERT / BigBird / Longformer / ModernBERT)
│   ├── bert.py
│   ├── bert_with_count.py
│   ├── bigbird_base.py
│   ├── longformer_base.py
│   ├── modern_bert.py
│   └── model_download.py
├── model_explanation/            # LIME + lightweight shadow model
│   ├── lime_bert.py
│   ├── lightweight_model.py
│   ├── important_feature_process.py
│   └── feature_association.py
├── obfuscation/                  # Obfuscation-related processing
├── repro_baselines/              # Reproduced baselines
│   ├── image_droid/
│   ├── malscan/
│   ├── reveal_droid/
│   └── multimodal_transformer/
├── UniXcoder/                    # UniXcoder wrapper (used by parts of the pipeline)
├── utils/                        # Dataset building, downloading, analysis helpers
├── LICENSE
├── README.md
└── requirements.txt
```
---
## Installation
Tested with **Python 3.11** on Linux.
```bash
git clone https://github.com/HantangZhang/loom-android-malware.git
cd loom-android-malware
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
Main dependencies (see `requirements.txt` for exact pinned versions):
- `androguard 4.1.3`: APK static analysis
- `transformers 4.51.3`, `torch 2.7.0`, `datasets 3.5.0`: modeling
- `scikit-learn`, `scipy`, `lime`: feature selection & explanation
- `tqdm`, `pandas`, `numpy`, `loguru`
---
## Quick start
Every script under `apk_process/`, `feature_process/`, `model/`,
`model_explanation/`, `obfuscation/`, `utils/`, and `repro_baselines/*/`
is runnable via `python -m <module>` with a `--help`-driven `argparse`
CLI. Run any module with `--help` to discover its full flag set.
You may also want to point the HuggingFace cache somewhere other than
`~/.cache/`:
```bash
export ANDROID_ML_CACHE=/path/to/big/disk/hf_cache
```
### 1. Extract raw features from APKs
For a full directory of APKs (manifest + API calls + opcode in one parse):
```bash
python -m apk_process.apk_extractor extract-all \
--target /path/to/apks \
--manifest-dir /out/manifest \
--api-dir /out/apicall \
--opcode-dir /out/opcode \
--workers 16
```
For an obfuscation sweep (one SHA list, many APK source directories):
```bash
python -m apk_process.apk_extractor extract-by-sha \
--sha-list /path/to/sha_list.txt \
--apk-dir /path/to/apks/CID \
--apk-dir /path/to/apks/JUNK \
--output-dir /out/obfuscation \
--workers 16
```
### 2. Build a HuggingFace dataset under a token budget
```bash
python -m feature_process.features_process_final \
--manifest-dir /out/manifest \
--api-dir /out/apicall \
--opcode-dir /out/opcode \
--base-dir /out/features \
--sha-list /path/to/sub_dataset_sha.txt \
--csv-path /path/to/apk_metadata.csv \
--filter-method chi2_delta_idf \
--manifest-limit 300 --api-limit 300 --opcode-limit 300 \
--global-limit 512 --cap 2 --num-proc 16
```
By default, this command **also** creates a stratified train/val/test
SHA split under `//split/{train,val,test}.txt`
and fits every label-aware score matrix on the training partition only,
so val + test never leak label information into feature selection. See
[Preventing data leakage](#preventing-data-leakage) for the full story
and the flags that control it.
The individual preprocessors are also runnable separately:
`feature_process.manifest_process`, `feature_process.apicall_process`,
`feature_process.opcode_process`.
### 3. Fine-tune a classifier
```bash
python -m model.bert \
--model-path bert-base-uncased \
--dataset-dir /out/features/.../final_ds_chi2_delta_idf_... \
--split-dir /out/features/.../split \
--output-dir ./malware-bert \
--num-train-epochs 5 --batch-size 64 --learning-rate 2e-5
```
Pass the **same** `--split-dir` that preprocessing wrote so the model's
train / val / test partition matches the one used to fit the feature
matrices. Without `--split-dir` the model falls back to a random split
(useful only for quick experiments, not for published numbers; see
[Preventing data leakage](#preventing-data-leakage)).
The long-context variants share the same flag layout:
`model.bigbird_base`, `model.longformer_base`, `model.modern_bert`,
`model.bert_with_count`.
### 4. Explain predictions
```bash
python -m model_explanation.lime_bert \
--dataset-dir /path/to/hf_dataset \
--model-path ./malware-bert/checkpoint-XXXX \
--tokenizer-path bert-base-uncased \
--output-dir ./lime_html \
--lime-csv ./lime.csv \
--sample-size 50 --num-features 20 --num-samples 100
```
The companion shadow model lives in `model_explanation.lightweight_model`
(L1 logistic regression over the union vocabulary of the three views) and
the LIME-CSV analysis tools in `model_explanation.feature_association`
and `model_explanation.important_feature_process`.
---
## Reproducing baselines
Each baseline lives under `repro_baselines/<name>/`. Every script in
those folders is an `argparse` CLI; run with `--help` to inspect flags.
| Baseline | Folder | Entry points (use `--help` for each) |
| ----------------------- | ------------------------------------------- | ------------------------------------ |
| ImageDroid | `repro_baselines/image_droid/` | `extract_dex_image_features {extract,fix-labels}`, `model_training {kfold,train,predict}` |
| MalScan | `repro_baselines/malscan/` | `malscan_json_features`, `malscan_json_features_fast`, `malscan_merge_features`, `malscan_train_eval` |
| RevealDroid | `repro_baselines/reveal_droid/` | `extract_apicount`, `extract_packageAPI`, `intent_action`, `reflection_native`, `build_features {single,multi}`, `revealDroid_detector {train,eval}` |
| Multimodal Transformer | `repro_baselines/multimodal_transformer/` | `apk_to_dex_images`, `extract_bm_features`, `sm_features {process,process-roots,merge,debug-apk}`, `fusion_classifier`, `fine_tune` |
---
## Preventing data leakage
The feature-selection step is **label-aware**: the delta-IDF,
chi-square and information-gain matrices score every token by how its
distribution differs between benign and malicious documents. Fitting
those matrices on the full dataset, including the samples you later
hold out for evaluation, silently leaks label information from val /
test back into the features of every other sample.
To prevent this, the pipeline now treats the train/val/test partition
as a first-class artefact that is shared between preprocessing and
training:
1. `feature_process.split_dataset` produces a deterministic stratified
split of the input SHA list and writes
`train.txt` / `val.txt` / `test.txt` under a chosen directory.
2. `feature_process.features_process_final` reads (or creates) that
split before doing anything else, and excludes every val + test SHA
when fitting the score matrices. The selected tokens for *every*
sample (train, val, test) are produced by matrices fitted on the
train SHAs only.
3. The model classifiers (`model.bert`, `model.bert_with_count`,
`model.bigbird_base`, `model.longformer_base`, `model.modern_bert`)
accept the same `--split-dir`. When given, they filter the
preprocessed dataset by the same `train.txt` / `val.txt` /
`test.txt` files instead of doing an ad-hoc random split, so the
train / test partition the model sees is identical to the one used
to fit the feature matrices.
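
The core of step 2 can be sketched in a few lines: score tokens from the training SHAs only, then apply those frozen scores to every sample. The example uses a smoothed delta-IDF as a stand-in; the exact formula, the add-one smoothing, and the names `fit_delta_idf` and `select_tokens` are assumptions for illustration, not the repository's code.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def fit_delta_idf(docs: Dict[str, List[str]], labels: Dict[str, int],
                  train_shas: Iterable[str]) -> Dict[str, float]:
    """Score tokens from training documents only (1 = malware, 0 = benign)."""
    df = {0: Counter(), 1: Counter()}
    n = {0: 0, 1: 0}
    for sha in train_shas:  # val/test SHAs are never touched here
        y = labels[sha]
        n[y] += 1
        df[y].update(set(docs[sha]))  # document frequency, not term frequency
    vocab = set(df[0]) | set(df[1])
    # Add-one smoothing (an assumption) keeps scores finite for one-class tokens.
    return {
        t: math.log2((n[1] + 1) / (df[1][t] + 1)) - math.log2((n[0] + 1) / (df[0][t] + 1))
        for t in vocab
    }


def select_tokens(tokens: List[str], scores: Dict[str, float], limit: int) -> List[str]:
    """Keep the `limit` most discriminative known tokens, preserving order."""
    known = [t for t in dict.fromkeys(tokens) if t in scores]
    keep = set(sorted(known, key=lambda t: abs(scores[t]), reverse=True)[:limit])
    return [t for t in tokens if t in keep]


docs = {
    "a": ["sendTextMessage", "getDeviceId"],
    "b": ["sendTextMessage"],
    "c": ["onCreate", "getDeviceId"],
    "d": ["onCreate"],
    "t": ["sendTextMessage", "onCreate", "neverSeen"],  # held-out sample
}
train_labels = {"a": 1, "b": 1, "c": 0, "d": 0}  # no label needed for "t"
scores = fit_delta_idf(docs, train_labels, ["a", "b", "c", "d"])
held_out = select_tokens(docs["t"], scores, limit=2)
```

Because the held-out sample's label is never consulted, a token that appears only in val or test documents (`neverSeen` here) has no score and is dropped rather than influencing selection.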
### Default behaviour
Running the preprocessing pipeline without any new flags already does
the right thing:
```bash
python -m feature_process.features_process_final \
--manifest-dir /out/manifest --api-dir /out/apicall \
--opcode-dir /out/opcode --base-dir /out/features \
--sha-list /path/to/sub_dataset_sha.txt \
--csv-path /path/to/labels.csv \
--filter-method chi2_delta_idf
# -> writes /out/features//split/{train,val,test}.txt
# -> fits matrices on train only, applies them to all samples
```
Then point the model at the same split:
```bash
python -m model.bert \
--dataset-dir /out/features//final_ds_chi2_delta_idf_..._512 \
--split-dir /out/features//split \
--output-dir ./malware-bert
```
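
On the consuming side, honouring a shared split amounts to reading the three SHA files and routing each sample by hash. The sketch below shows that idea on plain dictionaries; the `sha256` record key, the case normalisation, and the helper names are assumptions, and the actual model scripts may filter a HuggingFace dataset differently.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Set

SPLITS = ("train", "val", "test")


def read_split(split_dir: str) -> Dict[str, Set[str]]:
    """Load train.txt / val.txt / test.txt and verify they do not overlap."""
    parts: Dict[str, Set[str]] = {}
    for name in SPLITS:
        lines = Path(split_dir, f"{name}.txt").read_text().splitlines()
        parts[name] = {line.strip().lower() for line in lines if line.strip()}
    for i, first in enumerate(SPLITS):
        for second in SPLITS[i + 1:]:
            overlap = parts[first] & parts[second]
            if overlap:
                raise ValueError(f"{len(overlap)} SHA(s) appear in both {first} and {second}")
    return parts


def partition(records: Iterable[dict], parts: Dict[str, Set[str]]) -> Dict[str, List[dict]]:
    """Route each record to its split; records outside every split are dropped."""
    out: Dict[str, List[dict]] = {name: [] for name in SPLITS}
    for rec in records:
        sha = rec["sha256"].lower()  # hypothetical key holding the APK hash
        for name in SPLITS:
            if sha in parts[name]:
                out[name].append(rec)
                break
    return out
```

Failing loudly on overlapping files is a deliberate choice here: a SHA present in both `train.txt` and `test.txt` would reintroduce the leak the split is meant to prevent.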
### Flags
| Flag | Default | Notes |
| -------------------------- | --------------------------------------- | --------------------------------------------------------------------- |
| `--split-dir` | `//split/` | Reused if it already contains `train.txt` / `val.txt` / `test.txt`. |
| `--train-ratio` | `0.8` | |
| `--val-ratio` | `0.1` | |
| `--test-ratio` | `0.1` | Must sum to 1.0 with the other two. |
| `--split-seed` | `42` | Determines the split assignment. |
| `--no-split` | off | Legacy behaviour (fits matrices on the full dataset; **leaky**). |
| `--exclude-matrix-sha-list`| (none)                                  | Legacy: explicit text file of SHAs to exclude (takes priority).       |
### Producing a split standalone
If you want to share the same split across many preprocessing runs or
across baselines, build it once with the standalone CLI:
```bash
python -m feature_process.split_dataset \
--sha-list /path/to/sub_dataset_sha.txt \
--labels-csv /path/to/labels.csv \
--output-dir /path/to/split \
--train-ratio 0.8 --val-ratio 0.1 --test-ratio 0.1 --seed 42
```
Then point both the preprocessing pipeline and every model run at
`--split-dir /path/to/split`.
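
For readers who want to see what a deterministic stratified split involves, here is a minimal standard-library sketch. It is an illustration under assumptions (sorting before shuffling, rounding per class, the `stratified_split` name), not the logic of `feature_process.split_dataset` itself.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def stratified_split(labels: Dict[str, int],
                     ratios: Sequence[float] = (0.8, 0.1, 0.1),
                     seed: int = 42) -> Dict[str, List[str]]:
    """Split SHAs into train/val/test while keeping the class balance."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("train/val/test ratios must sum to 1.0")
    rng = random.Random(seed)  # local RNG: same seed, same split
    by_label: Dict[int, List[str]] = defaultdict(list)
    for sha in sorted(labels):  # sort so input order cannot change the result
        by_label[labels[sha]].append(sha)
    split: Dict[str, List[str]] = {"train": [], "val": [], "test": []}
    for label in sorted(by_label):
        shas = by_label[label]
        rng.shuffle(shas)
        n_train = round(len(shas) * ratios[0])
        n_val = round(len(shas) * ratios[1])
        split["train"] += shas[:n_train]
        split["val"] += shas[n_train:n_train + n_val]
        split["test"] += shas[n_train + n_val:]  # remainder goes to test
    return split


# Toy input: 100 fake hashes, alternating benign (0) and malware (1).
toy_labels = {f"{i:064x}": i % 2 for i in range(100)}
toy_split = stratified_split(toy_labels)
```

Splitting each class separately is what makes the result stratified; with the toy input, every partition ends up with equal numbers of benign and malicious hashes.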
---
## Datasets
This repository **does not redistribute APKs.** Per AndroZoo's terms
of use we only publish the SHA-256 hashes of every sample used in the
paper; you can fetch the corresponding APKs from [AndroZoo] (or any
other source you have access to).
The `datasets/` directory holds five hash lists, one SHA-256 per line:
| File | Samples | What it is |
| ------------------------------------- | ------- | ------------------------------------------------------------------------------------------- |
| `datasets/AndroAMD.txt` | 20,000 | The main training + test set assembled for the paper (in-house AMD selection). |
| `datasets/PublicAMD.txt` | 15,343 | A public-corpus reference set (no overlap engineering applied; useful as a cross-check). |
| `datasets/concept_drift_datasets2022.txt` | 877 | Concept-drift evaluation: APKs first seen in 2022. |
| `datasets/concept_drift_datasets2023.txt` | 914 | Concept-drift evaluation: APKs first seen in 2023. |
| `datasets/obfu_1k.txt` | 1,000 | Obfuscation-robustness evaluation set; 500 benign + 500 malicious (50/50 stratified split). |
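
Since every list is plain text with one SHA-256 per line, a small loader is enough to validate a file before starting a long download or extraction job. The helper below is a hypothetical convenience, not part of the repository; normalising to upper case is an assumption made so that hashes compare consistently.

```python
import re
from pathlib import Path
from typing import List

SHA256_RE = re.compile(r"^[0-9a-fA-F]{64}$")


def load_sha_list(path: str) -> List[str]:
    """Read one SHA-256 per line, skipping blanks and duplicate entries."""
    shas: List[str] = []
    seen = set()
    for lineno, raw in enumerate(Path(path).read_text().splitlines(), 1):
        line = raw.strip()
        if not line:
            continue
        if not SHA256_RE.match(line):
            raise ValueError(f"{path}:{lineno}: not a SHA-256 hex digest: {line!r}")
        key = line.upper()
        if key not in seen:  # keep first occurrence, preserve file order
            seen.add(key)
            shas.append(key)
    return shas
```

Comparing `len(load_sha_list(...))` against the sample counts in the table above is a quick sanity check that a list was fetched intact.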
### Getting the APKs
Once you have AndroZoo (or equivalent) credentials, you can download
each list with the helper in `utils/`:
```bash
python -m utils.download_by_list download \
--sha-list datasets/AndroAMD.txt \
--apk-dir /path/to/where/apks/go \
--androzoo-api-key $ANDROZOO_KEY
```
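
If you prefer to script the download yourself, the sketch below shows roughly what fetching one APK by hash looks like. It assumes AndroZoo's documented download endpoint with `apikey` and `sha256` query parameters; the function names and the `<SHA256>.apk` file naming are hypothetical and may differ from what `utils.download_by_list` does.

```python
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlretrieve

# Assumed endpoint, taken from AndroZoo's public API documentation.
ANDROZOO_DOWNLOAD = "https://androzoo.uni.lu/api/download"


def androzoo_url(sha256: str, api_key: str) -> str:
    """Build the download URL for a single APK."""
    query = urlencode({"apikey": api_key, "sha256": sha256.upper()})
    return f"{ANDROZOO_DOWNLOAD}?{query}"


def download_apk(sha256: str, api_key: str, apk_dir: str) -> Path:
    """Fetch one APK into `apk_dir`, skipping files that already exist."""
    target = Path(apk_dir) / f"{sha256.upper()}.apk"
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(androzoo_url(sha256, api_key), str(target))
    return target
```

This sketch does no retrying, rate limiting, or integrity checking; for a 20,000-sample list you would want to re-hash each file after download and compare it with the requested SHA-256.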
### Labels
Labels (`sha256,label` CSV, with `0` = benign / `1` = malware) are
**not bundled** in the repo because they are derived from
VirusTotal-detection counts that AndroZoo distributes under separate
terms. After downloading the APKs you can either:
- Pull each sample's `vt_detection` from AndroZoo's metadata CSV and
threshold it (the convention used in the paper is `vt_detection >= 4`
→ malware, `vt_detection == 0` → benign), or
- Reuse the helpers in `utils/build_apk_market_features.py` /
`utils/build_datasets.py`, which automate that thresholding.
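
The thresholding convention above can be sketched as follows. This is an illustration, not the code in `utils/`: it assumes the metadata CSV has `sha256` and `vt_detection` columns, and it drops samples with 1 to 3 detections (or no scan result) because the stated convention assigns them to neither class.

```python
import csv
from typing import Dict, Iterable, Optional, TextIO


def label_from_vt(vt_detection: Optional[int], malware_min: int = 4) -> Optional[int]:
    """Map a VirusTotal detection count to 1 (malware), 0 (benign) or None."""
    if vt_detection is None:
        return None  # never scanned: no label
    if vt_detection >= malware_min:
        return 1
    if vt_detection == 0:
        return 0
    return None  # 1 .. malware_min - 1 detections: ambiguous, left unlabeled


def build_labels(metadata_csv: TextIO, wanted: Iterable[str]) -> Dict[str, int]:
    """Collect labels for the requested SHAs from an open metadata CSV."""
    wanted_set = {sha.upper() for sha in wanted}
    labels: Dict[str, int] = {}
    for row in csv.DictReader(metadata_csv):
        sha = row["sha256"].upper()
        if sha not in wanted_set:
            continue
        raw = (row.get("vt_detection") or "").strip()
        label = label_from_vt(int(raw) if raw else None)
        if label is not None:
            labels[sha] = label
    return labels
```

The resulting mapping can be written out as the `sha256,label` CSV that the preprocessing commands expect through `--csv-path`.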
### Other resources shipped with the code
- Third-party-library filter lists under `feature_process/filter/`
(AndroLibZoo, LibD threshold-10, the `cl_91` / `ad_240` common-library
lists, and the API-call blacklist) used by the API-call preprocessor.
---
## Citation
```bibtex
@inproceedings{zhang2026loom,
author = {Zhang, Hantang and Eshghie, Mojtaba and Kreyssig, Bruno and L\"ofstedt, Tommy and Bartel, Alexandre},
title = {{Loom}: A Balanced String-Based Transformer for {Android} Malware Detection},
booktitle = {Proceedings of the 28th International Conference on Information and Communications Security ({ICICS} 2026)},
series = {Lecture Notes in Computer Science},
publisher = {Springer},
address = {Fukui, Japan},
year = {2026},
month = oct,
note = {To appear}
}
```
---
## License
This project is released under the [MIT License](LICENSE).
## Issues / contact
Please open a [GitHub Issue](https://github.com/HantangZhang/loom-android-malware/issues)
for bug reports, questions, or reproduction trouble.
[androguard]: https://github.com/androguard/androguard
[ImageDroid]: https://example.com/imagedroid
[MalScan]: https://example.com/malscan
[RevealDroid]: https://example.com/revealdroid
[AndroZoo]: https://androzoo.uni.lu/