https://github.com/software-engineering-and-security/loom-android-malware-detection

"Loom: A Balanced String-Based Transformer for Android Malware Detection (ICICS 2026)"

Host: GitHub
URL: https://github.com/software-engineering-and-security/loom-android-malware-detection
Owner: software-engineering-and-security
License: mit
Created: 2026-05-15T06:15:44.000Z (about 1 month ago)
Default Branch: main
Last Pushed: 2026-05-15T07:27:39.000Z (about 1 month ago)
Last Synced: 2026-05-15T09:34:07.233Z (about 1 month ago)
Language: Python
Size: 10.8 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# LOOM: A Balanced String-Based Transformer for Android Malware Detection

Reference implementation of **LOOM** (ICICS 2026). LOOM fuses three
complementary static feature views extracted from each APK — the Android
manifest, API calls, and Dalvik opcodes — into a single string token
sequence under a fixed token budget, then fine-tunes a transformer
classifier on it. The repository also ships reproductions of several
published baselines for direct comparison.

> 📄 **Paper:** Hantang Zhang, Mojtaba Eshghie, Bruno Kreyssig, Tommy Löfstedt,
> Alexandre Bartel. *LOOM: A Balanced String-Based Transformer for Android
> Malware Detection.* ICICS 2026 (to appear). See [Citation](#citation).

---

## Highlights

- **Three-channel static features** extracted with [androguard]:
manifest entities, API-call sequences, and opcode sequences.
- **Token-budget-aware preprocessing** that allocates the limited
context window across the three channels by a configurable ratio
(default 1 : 4 : 5 for manifest : api : opcode), combined with cleaning,
third-party-library filtering (AndroLibZoo + LibD + common ad/util
libraries), and TF-IDF / χ² / information-gain feature selection.
- **Multiple transformer backbones:** BERT, BigBird, Longformer, ModernBERT.
- **Model explanation** via LIME, plus a lightweight logistic-regression
shadow model for global feature importance.
- **Reproduced baselines:**
[ImageDroid], [MalScan], [RevealDroid], and a multimodal-transformer
fusion baseline.
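
As a rough illustration of ratio-based budget allocation, the sketch below splits a token budget across the three channels in proportion 1 : 4 : 5. The function name `allocate_budget` and the largest-remainder rounding are assumptions made for this example; the repository's own preprocessing may round differently or reserve room for special tokens.

```python
def allocate_budget(total, ratio=(1, 4, 5), names=("manifest", "api", "opcode")):
    """Split `total` tokens across channels in proportion to `ratio`."""
    if total < 0 or any(r < 0 for r in ratio) or sum(ratio) == 0:
        raise ValueError("total and ratio must be non-negative, ratio not all zero")
    weight_sum = sum(ratio)
    # Integer share per channel; a few tokens may be lost to floor division.
    shares = [total * r // weight_sum for r in ratio]
    leftover = total - sum(shares)
    # Give the lost tokens back, largest fractional remainder first.
    order = sorted(
        range(len(ratio)),
        key=lambda i: (total * ratio[i]) % weight_sum,
        reverse=True,
    )
    for i in order[:leftover]:
        shares[i] += 1
    return dict(zip(names, shares))


print(allocate_budget(512))
# {'manifest': 51, 'api': 205, 'opcode': 256}
```

The shares always sum to the requested budget, so no part of the context window is left unassigned.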

---

## Repository layout

```
loom-android-malware/
├── apk_process/ # APK parsing & raw-feature extraction (androguard)
├── datasets/ # SHA-256 lists of every APK used in the paper (see Datasets)
├── docs/ # Paper appendix and other supplementary documents
├── feature_process/ # Preprocessing into BERT-friendly token sequences
│ ├── manifest_process.py
│ ├── apicall_process.py
│ ├── opcode_process.py
│ ├── features_process_final.py # main preprocessing entry point
│ ├── split_dataset.py # leak-free train/val/test split
│ ├── counter_process.py
│ ├── dsfile_process.py
│ ├── malbert.py
│ ├── tfidf_feature_extractor.py
│ ├── chi2_feature_extractor.py
│ ├── information_gain_feature_extractor.py
│ ├── filter/ # third-party-library blacklists
│ └── utils/
├── model/ # Transformer classifiers (BERT / BigBird / Longformer / ModernBERT)
│ ├── bert.py
│ ├── bert_with_count.py
│ ├── bigbird_base.py
│ ├── longformer_base.py
│ ├── modern_bert.py
│ └── model_download.py
├── model_explanation/ # LIME + lightweight shadow model
│ ├── lime_bert.py
│ ├── lightweight_model.py
│ ├── important_feature_process.py
│ └── feature_association.py
├── obfuscation/ # Obfuscation-related processing
├── repro_baselines/ # Reproduced baselines
│ ├── image_droid/
│ ├── malscan/
│ ├── reveal_droid/
│ └── multimodal_transformer/
├── UniXcoder/ # UniXcoder wrapper (used by parts of the pipeline)
├── utils/ # Dataset building, downloading, analysis helpers
├── LICENSE
├── README.md
└── requirements.txt
```

---

## Installation

Tested with **Python 3.11** on Linux.

```bash
git clone https://github.com/HantangZhang/loom-android-malware.git
cd loom-android-malware
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Main dependencies (see `requirements.txt` for exact pinned versions):

- `androguard 4.1.3` — APK static analysis
- `transformers 4.51.3`, `torch 2.7.0`, `datasets 3.5.0` — modeling
- `scikit-learn`, `scipy`, `lime` — feature selection & explanation
- `tqdm`, `pandas`, `numpy`, `loguru`

---

## Quick start

Every script under `apk_process/`, `feature_process/`, `model/`,
`model_explanation/`, `obfuscation/`, `utils/`, and `repro_baselines/*/`
is runnable via `python -m <module>` and exposes an `argparse` CLI.
Run any module with `--help` to discover its full flag set.

You may also want to point the HuggingFace cache somewhere other than
`~/.cache/`:

```bash
export ANDROID_ML_CACHE=/path/to/big/disk/hf_cache
```

### 1. Extract raw features from APKs

For a full directory of APKs (manifest + API calls + opcode in one parse):

```bash
python -m apk_process.apk_extractor extract-all \
--target /path/to/apks \
--manifest-dir /out/manifest \
--api-dir /out/apicall \
--opcode-dir /out/opcode \
--workers 16
```

For an obfuscation sweep (one SHA list, many APK source directories):

```bash
python -m apk_process.apk_extractor extract-by-sha \
--sha-list /path/to/sha_list.txt \
--apk-dir /path/to/apks/CID \
--apk-dir /path/to/apks/JUNK \
--output-dir /out/obfuscation \
--workers 16
```
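
Before launching a sweep like this, it can help to check which SHAs are actually present in each source directory. The sketch below does that with the standard library only. The helper name `resolve_apks` and the `<sha>.apk` file-naming scheme are assumptions for illustration; check how your APK directories are laid out before relying on it.

```python
from pathlib import Path


def resolve_apks(sha_list_path, apk_dirs):
    """Look up every SHA from the list in every APK source directory."""
    lines = Path(sha_list_path).read_text().splitlines()
    shas = [line.strip() for line in lines if line.strip()]
    found, missing = {}, []
    for apk_dir in map(Path, apk_dirs):
        for sha in shas:
            candidate = apk_dir / f"{sha}.apk"  # assumed naming scheme
            if candidate.is_file():
                found[(apk_dir.name, sha)] = candidate
            else:
                missing.append((apk_dir.name, sha))
    return found, missing
```

A non-empty `missing` list tells you which (directory, SHA) pairs the extractor would not be able to process.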

### 2. Build a HuggingFace dataset under a token budget

```bash
python -m feature_process.features_process_final \
--manifest-dir /out/manifest \
--api-dir /out/apicall \
--opcode-dir /out/opcode \
--base-dir /out/features \
--sha-list /path/to/sub_dataset_sha.txt \
--csv-path /path/to/apk_metadata.csv \
--filter-method chi2_delta_idf \
--manifest-limit 300 --api-limit 300 --opcode-limit 300 \
--global-limit 512 --cap 2 --num-proc 16
```

By default, this command **also** creates a stratified train/val/test
SHA split under `<base-dir>/.../split/{train,val,test}.txt`
and fits every label-aware score matrix on the training partition only,
so val + test never leak label information into feature selection. See
[Preventing data leakage](#preventing-data-leakage) for the full story
and the flags that control it.

The individual preprocessors are also runnable separately:
`feature_process.manifest_process`, `feature_process.apicall_process`,
`feature_process.opcode_process`.

### 3. Fine-tune a classifier

```bash
python -m model.bert \
--model-path bert-base-uncased \
--dataset-dir /out/features/.../final_ds_chi2_delta_idf_... \
--split-dir /out/features/.../split \
--output-dir ./malware-bert \
--num-train-epochs 5 --batch-size 64 --learning-rate 2e-5
```

Pass the **same** `--split-dir` that preprocessing wrote so the model's
train / val / test partition matches the one used to fit the feature
matrices. Without `--split-dir` the model falls back to a random split
(useful only for quick experiments — not for published numbers; see
[Preventing data leakage](#preventing-data-leakage)).

The other classifiers, including the long-context variants, share the
same flag layout: `model.bigbird_base`, `model.longformer_base`,
`model.modern_bert`, `model.bert_with_count`.

### 4. Explain predictions

```bash
python -m model_explanation.lime_bert \
--dataset-dir /path/to/hf_dataset \
--model-path ./malware-bert/checkpoint-XXXX \
--tokenizer-path bert-base-uncased \
--output-dir ./lime_html \
--lime-csv ./lime.csv \
--sample-size 50 --num-features 20 --num-samples 100
```

The companion shadow model lives in `model_explanation.lightweight_model`
(L1 logistic regression over the union vocabulary of the three views).
The LIME-CSV analysis tools live in `model_explanation.feature_association`
and `model_explanation.important_feature_process`.

---

## Reproducing baselines

Each baseline lives under `repro_baselines/<name>/`. Every script in
those folders is an `argparse` CLI; run it with `--help` to inspect flags.

| Baseline | Folder | Entry points (use `--help` for each) |
| ----------------------- | ------------------------------------------- | ------------------------------------ |
| ImageDroid | `repro_baselines/image_droid/` | `extract_dex_image_features {extract,fix-labels}`, `model_training {kfold,train,predict}` |
| MalScan | `repro_baselines/malscan/` | `malscan_json_features`, `malscan_json_features_fast`, `malscan_merge_features`, `malscan_train_eval` |
| RevealDroid | `repro_baselines/reveal_droid/` | `extract_apicount`, `extract_packageAPI`, `intent_action`, `reflection_native`, `build_features {single,multi}`, `revealDroid_detector {train,eval}` |
| Multimodal Transformer | `repro_baselines/multimodal_transformer/` | `apk_to_dex_images`, `extract_bm_features`, `sm_features {process,process-roots,merge,debug-apk}`, `fusion_classifier`, `fine_tune` |

---

## Preventing data leakage

The feature-selection step is **label-aware**: the delta-IDF,
chi-square and information-gain matrices score every token by how its
distribution differs between benign and malicious documents. Fitting
those matrices on the full dataset — including the samples you later
hold out for evaluation — silently leaks label information from val /
test back into the features of every other sample.
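
To make "label-aware" concrete, here is a minimal sketch of a chi-square presence score computed from a 2×2 contingency table (token present/absent × malicious/benign). The function name `chi2_scores` and the toy documents are hypothetical, and the repository's own extractors may compute their matrices differently. What matters is that the scores depend on the labels of whatever documents are passed in, which is why only training documents should be passed.

```python
from collections import Counter


def chi2_scores(docs, labels):
    """Chi-square score of each token's presence against a binary label."""
    n = len(docs)
    n_mal = sum(labels)
    n_ben = n - n_mal
    present_mal, present_ben = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        target = present_mal if label == 1 else present_ben
        target.update(set(tokens))  # count each token once per document
    scores = {}
    for tok in set(present_mal) | set(present_ben):
        a = present_mal[tok]  # malicious docs containing tok
        b = present_ben[tok]  # benign docs containing tok
        c = n_mal - a         # malicious docs without tok
        d = n_ben - b         # benign docs without tok
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[tok] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


# Toy training partition only: 1 = malware, 0 = benign.
train_docs = [["SEND_SMS", "INTERNET"], ["SEND_SMS"], ["INTERNET"], ["INTERNET", "CAMERA"]]
train_labels = [1, 1, 0, 0]
print(chi2_scores(train_docs, train_labels)["SEND_SMS"])  # 4.0
```

If held-out documents and their labels were included in `docs`, their label distribution would shift every score, and with it the tokens selected for all samples.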

To prevent this, the pipeline now treats the train/val/test partition
as a first-class artefact that is shared between preprocessing and
training:

1. `feature_process.split_dataset` produces a deterministic stratified
split of the input SHA list and writes
`train.txt` / `val.txt` / `test.txt` under a chosen directory.
2. `feature_process.features_process_final` reads (or creates) that
split before doing anything else, and excludes every val + test SHA
when fitting the score matrices. The selected tokens for *every*
sample (train, val, test) are produced by matrices fitted on the
train SHAs only.
3. The model classifiers (`model.bert`, `model.bert_with_count`,
`model.bigbird_base`, `model.longformer_base`, `model.modern_bert`)
accept the same `--split-dir`. When given, they filter the
preprocessed dataset by the same `train.txt` / `val.txt` /
`test.txt` files instead of doing an ad-hoc random split, so the
train / val / test partition the model sees is identical to the one used
to fit the feature matrices.
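
The shared-split idea in the steps above can be sketched with plain Python: read the three SHA files once, check that they do not overlap, and partition any collection of records by them. The helper names `load_split` and `partition` and the `sha256` record key are assumptions for this example, not the repository's API.

```python
from pathlib import Path


def load_split(split_dir):
    """Read train.txt / val.txt / test.txt into sets of SHAs."""
    split = {}
    for name in ("train", "val", "test"):
        lines = (Path(split_dir) / f"{name}.txt").read_text().splitlines()
        split[name] = {line.strip() for line in lines if line.strip()}
    # A SHA appearing in two partitions would reintroduce leakage.
    names = list(split)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if split[first] & split[second]:
                raise ValueError(f"{first} and {second} partitions overlap")
    return split


def partition(records, split, key="sha256"):
    """Group records by partition; records outside the split are dropped."""
    return {
        name: [rec for rec in records if rec[key] in shas]
        for name, shas in split.items()
    }
```

Because preprocessing and training would both call `load_split` on the same directory, they cannot disagree about which samples are held out.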

### Default behaviour

Running the preprocessing pipeline without any new flags already does
the right thing:

```bash
python -m feature_process.features_process_final \
--manifest-dir /out/manifest --api-dir /out/apicall \
--opcode-dir /out/opcode --base-dir /out/features \
--sha-list /path/to/sub_dataset_sha.txt \
--csv-path /path/to/labels.csv \
--filter-method chi2_delta_idf
# -> writes /out/features/.../split/{train,val,test}.txt
# -> fits matrices on train only, applies them to all samples
```

Then point the model at the same split:

```bash
python -m model.bert \
--dataset-dir /out/features/.../final_ds_chi2_delta_idf_..._512 \
--split-dir /out/features/.../split \
--output-dir ./malware-bert
```

### Flags

| Flag | Default | Notes |
| -------------------------- | --------------------------------------- | --------------------------------------------------------------------- |
| `--split-dir`              | `<base-dir>/.../split/`                 | Reused if it already contains `train.txt` / `val.txt` / `test.txt`.   |
| `--train-ratio` | `0.8` | |
| `--val-ratio` | `0.1` | |
| `--test-ratio` | `0.1` | Must sum to 1.0 with the other two. |
| `--split-seed` | `42` | Determines the split assignment. |
| `--no-split` | off | Legacy behaviour (fits matrices on the full dataset; **leaky**). |
| `--exclude-matrix-sha-list`| – | Legacy: explicit text file of SHAs to exclude (takes priority). |

### Producing a split standalone

If you want to share the same split across many preprocessing runs or
across baselines, build it once with the standalone CLI:

```bash
python -m feature_process.split_dataset \
--sha-list /path/to/sub_dataset_sha.txt \
--labels-csv /path/to/labels.csv \
--output-dir /path/to/split \
--train-ratio 0.8 --val-ratio 0.1 --test-ratio 0.1 --seed 42
```

Then point both the preprocessing pipeline and every model run at
`--split-dir /path/to/split`.
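
For readers who want to see what a deterministic stratified split involves, here is a minimal standard-library sketch. It is not necessarily the algorithm `feature_process.split_dataset` uses; the function name `stratified_split` and the rule that rounding remainders go to the test partition are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split a {sha: label} mapping into train/val/test, class by class."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("train, val and test ratios must sum to 1.0")
    by_label = defaultdict(list)
    for sha, label in labels.items():
        by_label[label].append(sha)
    rng = random.Random(seed)  # local RNG: same seed, same split
    out = {"train": [], "val": [], "test": []}
    for label in sorted(by_label):
        shas = sorted(by_label[label])  # sort first so input order is irrelevant
        rng.shuffle(shas)
        n_train = int(len(shas) * ratios[0])
        n_val = int(len(shas) * ratios[1])
        out["train"] += shas[:n_train]
        out["val"] += shas[n_train:n_train + n_val]
        out["test"] += shas[n_train + n_val:]  # remainder goes to test
    return out
```

Sorting before shuffling makes the result depend only on the set of SHAs, their labels, and the seed, so two runs on reordered input files produce the same partition.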

---

## Datasets

This repository **does not redistribute APKs.** Per AndroZoo's terms
of use we only publish the SHA-256 hashes of every sample used in the
paper; you can fetch the corresponding APKs from [AndroZoo] (or any
other source you have access to).

The `datasets/` directory holds five hash lists, one SHA-256 per line:

| File | Samples | What it is |
| ------------------------------------- | ------- | ------------------------------------------------------------------------------------------- |
| `datasets/AndroAMD.txt` | 20,000 | The main training + test set assembled for the paper (in-house AMD selection). |
| `datasets/PublicAMD.txt` | 15,343 | A public-corpus reference set (no overlap engineering applied; useful as a cross-check). |
| `datasets/concept_drift_datasets2022.txt` | 877 | Concept-drift evaluation: APKs first seen in 2022. |
| `datasets/concept_drift_datasets2023.txt` | 914 | Concept-drift evaluation: APKs first seen in 2023. |
| `datasets/obfu_1k.txt` | 1,000 | Obfuscation-robustness evaluation set; 500 benign + 500 malicious (50/50 stratified split). |
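
A small sanity check on a hash list can catch truncated or malformed lines before a long download or extraction run. The sketch below assumes only the "one SHA-256 per line" format stated above; the helper name `read_sha_list` and the choice to normalize to upper case and drop duplicates are assumptions for illustration.

```python
import re
from pathlib import Path

SHA256_RE = re.compile(r"[0-9a-fA-F]{64}")


def read_sha_list(path):
    """Return the unique SHA-256 digests in `path`, in file order."""
    shas, seen = [], set()
    for lineno, line in enumerate(Path(path).read_text().splitlines(), start=1):
        sha = line.strip()
        if not sha:
            continue  # tolerate blank lines
        if not SHA256_RE.fullmatch(sha):
            raise ValueError(f"{path}:{lineno}: not a SHA-256 hex digest: {sha!r}")
        key = sha.upper()  # assumed normalization
        if key not in seen:
            seen.add(key)
            shas.append(key)
    return shas
```

For example, `len(read_sha_list("datasets/obfu_1k.txt"))` should match the sample count listed in the table if the file contains no duplicates.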

### Getting the APKs

Once you have AndroZoo (or equivalent) credentials, you can download
each list with the helper in `utils/`:

```bash
python -m utils.download_by_list download \
--sha-list datasets/AndroAMD.txt \
--apk-dir /path/to/where/apks/go \
--androzoo-api-key $ANDROZOO_KEY
```
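
If you prefer to script downloads yourself, the sketch below builds a per-sample request URL and streams the response to disk. The endpoint and its `apikey` / `sha256` query parameters follow AndroZoo's public API documentation as best I recall it; treat them as an assumption and verify against the current documentation. The helper names and the `<sha>.apk` output naming are also hypothetical and are not taken from `utils.download_by_list`.

```python
import shutil
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; confirm against AndroZoo's API documentation.
ANDROZOO_DOWNLOAD = "https://androzoo.uni.lu/api/download"


def androzoo_url(sha256, api_key):
    """Build the download URL for one sample."""
    return f"{ANDROZOO_DOWNLOAD}?{urlencode({'apikey': api_key, 'sha256': sha256})}"


def download_apk(sha256, api_key, apk_dir):
    """Download one APK unless it is already present; return its path."""
    target = Path(apk_dir) / f"{sha256}.apk"  # assumed naming scheme
    if target.exists():
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    partial = target.with_suffix(".part")
    with urlopen(androzoo_url(sha256, api_key), timeout=60) as resp, open(partial, "wb") as out:
        shutil.copyfileobj(resp, out)
    partial.replace(target)  # rename only after a complete download
    return target
```

Writing to a `.part` file first means an interrupted download is never mistaken for a finished APK on the next run.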

### Labels

Labels (`sha256,label` CSV, with `0` = benign / `1` = malware) are
**not bundled** in the repo because they are derived from
VirusTotal-detection counts that AndroZoo distributes under separate
terms. After downloading the APKs you can either:

- Pull each sample's `vt_detection` from AndroZoo's metadata CSV and
threshold it (the convention used in the paper is `vt_detection >= 4`
→ malware, `vt_detection == 0` → benign), or
- Reuse the helpers in `utils/build_apk_market_features.py` /
`utils/build_datasets.py`, which automate that thresholding.
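
The thresholding convention can be sketched as follows. The `vt_detection` thresholds come from the text above; everything else is an assumption for illustration: the helper names, the `sha256` column name, and the decision to leave samples with 1 to 3 detections (or no value) unlabeled, since the text does not say how they are handled.

```python
import csv


def label_from_vt(vt_detection, malware_threshold=4):
    """Map a VirusTotal detection count to 1 (malware), 0 (benign) or None."""
    if vt_detection >= malware_threshold:
        return 1
    if vt_detection == 0:
        return 0
    return None  # 1..3 detections: ambiguous, left unlabeled (assumption)


def build_labels(metadata_csv, wanted_shas):
    """Return {sha256: label} for the wanted samples found in the metadata."""
    wanted = {sha.upper() for sha in wanted_shas}
    labels = {}
    for row in csv.DictReader(metadata_csv):
        sha = row["sha256"].upper()  # assumed column name
        if sha not in wanted or row["vt_detection"] == "":
            continue
        label = label_from_vt(int(row["vt_detection"]))
        if label is not None:
            labels[sha] = label
    return labels
```

`metadata_csv` is any open text file or iterable of lines, so the same function works on a downloaded metadata file opened with `open(path, newline="")`.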

### Other resources shipped with the code

- Third-party-library filter lists under `feature_process/filter/`
(AndroLibZoo, LibD threshold-10, the `cl_91` / `ad_240` common-library
lists, and the API-call blacklist) used by the API-call preprocessor.

---

## Citation

```bibtex
@inproceedings{zhang2026loom,
author = {Zhang, Hantang and Eshghie, Mojtaba and Kreyssig, Bruno and L\"ofstedt, Tommy and Bartel, Alexandre},
title = {{Loom}: A Balanced String-Based Transformer for {Android} Malware Detection},
booktitle = {Proceedings of the 28th International Conference on Information and Communications Security ({ICICS} 2026)},
series = {Lecture Notes in Computer Science},
publisher = {Springer},
address = {Fukui, Japan},
year = {2026},
month = oct,
note = {To appear}
}
```

---

## License

This project is released under the [MIT License](LICENSE).

## Issues / contact

Please open a [GitHub Issue](https://github.com/HantangZhang/loom-android-malware/issues)
for bug reports, questions, or reproduction trouble.

[androguard]: https://github.com/androguard/androguard
[ImageDroid]: https://example.com/imagedroid
[MalScan]: https://example.com/malscan
[RevealDroid]: https://example.com/revealdroid
[AndroZoo]: https://androzoo.uni.lu/
