An open API service indexing awesome lists of open source software.

https://github.com/rshariffdeen/hercule

An inter-package analysis techinque for supply chain protection, that combines three analyses to identify malicious packages with high precision and high recall
https://github.com/rshariffdeen/hercule

malware-detection program-analysis supply-chain supply-chain-security

Last synced: about 2 months ago
JSON representation

An inter-package analysis techinque for supply chain protection, that combines three analyses to identify malicious packages with high precision and high recall

Awesome Lists containing this project

README

        

# Hercule
[![Docker Pulls](https://img.shields.io/docker/pulls/rshariffdeen/hercule.svg)](https://hub.docker.com/r/rshariffdeen/hercule) [![DOI](https://zenodo.org/badge/670418954.svg)](https://doi.org/10.5281/zenodo.14607268)

![Artifact Available](https://www.acm.org/binaries/content/gallery/acm/publications/replication-badges/artifacts_available_dl.jpg) ![Artifact Functional](https://github.com/user-attachments/assets/f38134ea-964b-4784-b05d-a4faf99eff14) ![Artifact Reusable](https://www.acm.org/binaries/content/gallery/acm/publications/replication-badges/artifacts_evaluated_reusable_dl.jpg)

An inter-package analysis technique for supply chain protection, that combines three analyses
to identify malicious packages with high precision and high recall. Our approach incorporates an integrity check
based on AST differentiation analysis, that can identify discrepancies between the distributed artifacts and
the source repository. We then utilize CodeQL to detect malicious behavior using data-flow analysis. Lastly,
we implement a transitive dependency analysis to identify malicious packages installed as part of the dependency resolution.

Hercule is a supply chain protection tool for Python packages that can accurately identify benign and malicious packages. Hercule is:

* Accurate: Achieves a f1-score of 0.95 scanning nearly 5000 packages
* Extensible: Hercule is designed so that it can be easily extended to support advanced behavior analysis using CodeQL queries
* Scalable: Hercule can analyze large Python packages including but not limited to scipy, numpy, pandas etc

Note: See the [Experiment Guide](experiments/ICSE25/README.md) to replicate the experiments for ICSE'25.

## Dependencies
* Python - 3.8.10+
* Docker - 27.3.1+
* CodeQL - 2.15.0+
* GumTree - 3.1.0+
* PythonParser - 1.4+
* Parso - 0.8.3+

## Build using Dockerfile

Building Hercule is easy with the provided Dockerfile, which has all the build dependencies and libraries required. We provide two Dockerfiles

* Dockerfile: This will build the environment necessary to run the stand-alone tool
* experiments/ICSE25/Dockerfile: This environment includes all necessary dependencies to reproduce the experiments

You can use the following command to build hercule image:

```bash
cd Hercule
docker build -t rshariffdeen/hercule .
# start docker
docker run -it rshariffdeen/hercule /bin/bash
```

The experiments/ICSE25/Dockerfile depends on the hercule base image. The instructions to build and execute experiments/ICSE25/Dockerfile can be found [here](./experiments/ICSE25).

# Example Usage

## Single Package

Hercule supports analyzing packages in different formats, i.e., wheels, zips, .tar.gz files, by passing the flag `--file/-F` . To invoke the tool, call
```bash
hercule -F target_package
```

## Batch Mode

Hercule can be invoked on a collection of collected packages using the `--dir/-D` flag to allow for batch mode processing, e.g.
```bash
hercule -D directory_of_packages
```

To allow for a comparison with similar tools, Hercule has two additional modes - Bandit4Mal (enabled using the flag `--banditmal`) and LastPyMile(enabled using the flag `--lastpymile`), which act as the respective tools.
See [Example Page](docs/Examples.md) for a detailed description

# Limitations #
* Current implementation is based on CodeQL, and thus inherits the limitations of that framework. Please note that CodeQL license is limited to academic research purpose only.
* Hercule does not perform dynamic analysis

## Bugs ##
Hercule should be considered alpha-quality software. Bugs can be reported here:

https://github.com/rshariffdeen/Hercule/issues

## Documentation ##

* [Getting Started](docs/GetStarted.md)
* [Example Usage](docs/Examples.md)
* [ICSE'25 Experiment Replication](experiments/ICSE25/README.md)
* [Manual](docs/Manual.md)

# Contributions
We welcome contributions to improve this work, see [details](docs/Contributing.md)

## Publication ##
**Detecting Python Malware in the Software Supply Chain with Program Analysis**

Ridwan Shariffdeen, Behnaz Hassanshahi, Martin Mirchev, Ali El Husseini, Abhik Roychoudhury

47th International Conference on Software Engineering (ICSE), Software Engineering in Practice track (SEIP)

## Acknowledgements ##
This research was supported by the National Research Foundation, Singapore, and Cyber Security Agency of Singapore under its National Cybersecurity R&D Programme (Fuzz Testing NRF-NCR25-Fuzz-0001). Any opinions, findings and conclusions, or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore, and Cyber Security Agency of Singapore.

# License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details

## Citation

If you use Hercule in your research work, we would highly appreciate it if you
cited the [following paper](https://rshariffdeen.com/paper/ICSE25-SEIP.pdf):

```
@inproceedings{hercule,
author = {Shariffdeen, Ridwan and Hassanshahi, Behnaz and Mirchev, Martin and Husseini, Ali El and Roychoudhury, Abhik},
title = {Detecting Python Malware in the Software Supply Chain with Program Analysis},
year = {2025},
doi = {},
booktitle = {2025 IEEE/ACM 47th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion)},
pages = {},
numpages = {},
keywords = {Supply Chain Protection, Program Analysis, Malware Detection},
location = {Ottawa, Canada},
series = {ICSE SEIP 2025}
}
```