https://github.com/rshariffdeen/hercule
An inter-package analysis techinque for supply chain protection, that combines three analyses to identify malicious packages with high precision and high recall
https://github.com/rshariffdeen/hercule
malware-detection program-analysis supply-chain supply-chain-security
Last synced: about 2 months ago
JSON representation
An inter-package analysis techinque for supply chain protection, that combines three analyses to identify malicious packages with high precision and high recall
- Host: GitHub
- URL: https://github.com/rshariffdeen/hercule
- Owner: rshariffdeen
- License: mit
- Created: 2023-07-25T02:41:53.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2025-04-09T03:47:02.000Z (2 months ago)
- Last Synced: 2025-04-09T04:27:48.567Z (2 months ago)
- Topics: malware-detection, program-analysis, supply-chain, supply-chain-security
- Language: Java
- Homepage:
- Size: 7.06 MB
- Stars: 4
- Watchers: 3
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Contributing: docs/Contributing.md
- License: LICENSE
Awesome Lists containing this project
README
# Hercule
[](https://hub.docker.com/r/rshariffdeen/hercule) [](https://doi.org/10.5281/zenodo.14607268)  
An inter-package analysis technique for supply chain protection, that combines three analyses
to identify malicious packages with high precision and high recall. Our approach incorporates an integrity check
based on AST differentiation analysis, that can identify discrepancies between the distributed artifacts and
the source repository. We then utilize CodeQL to detect malicious behavior using data-flow analysis. Lastly,
we implement a transitive dependency analysis to identify malicious packages installed as part of the dependency resolution.Hercule is a supply chain protection tool for Python packages that can accurately identify benign and malicious packages. Hercule is:
* Accurate: Achieves a f1-score of 0.95 scanning nearly 5000 packages
* Extensible: Hercule is designed so that it can be easily extended to support advanced behavior analysis using CodeQL queries
* Scalable: Hercule can analyze large Python packages including but not limited to scipy, numpy, pandas etcNote: See the [Experiment Guide](experiments/ICSE25/README.md) to replicate the experiments for ICSE'25.
## Dependencies
* Python - 3.8.10+
* Docker - 27.3.1+
* CodeQL - 2.15.0+
* GumTree - 3.1.0+
* PythonParser - 1.4+
* Parso - 0.8.3+## Build using Dockerfile
Building Hercule is easy with the provided Dockerfile, which has all the build dependencies and libraries required. We provide two Dockerfiles
* Dockerfile: This will build the environment necessary to run the stand-alone tool
* experiments/ICSE25/Dockerfile: This environment includes all necessary dependencies to reproduce the experimentsYou can use the following command to build hercule image:
```bash
cd Hercule
docker build -t rshariffdeen/hercule .
# start docker
docker run -it rshariffdeen/hercule /bin/bash
```The experiments/ICSE25/Dockerfile depends on the hercule base image. The instructions to build and execute experiments/ICSE25/Dockerfile can be found [here](./experiments/ICSE25).
# Example Usage
## Single Package
Hercule supports analyzing packages in different formats, i.e., wheels, zips, .tar.gz files, by passing the flag `--file/-F` . To invoke the tool, call
```bash
hercule -F target_package
```## Batch Mode
Hercule can be invoked on a collection of collected packages using the `--dir/-D` flag to allow for batch mode processing, e.g.
```bash
hercule -D directory_of_packages
```To allow for a comparison with similar tools, Hercule has two additional modes - Bandit4Mal (enabled using the flag `--banditmal`) and LastPyMile(enabled using the flag `--lastpymile`), which act as the respective tools.
See [Example Page](docs/Examples.md) for a detailed description# Limitations #
* Current implementation is based on CodeQL, and thus inherits the limitations of that framework. Please note that CodeQL license is limited to academic research purpose only.
* Hercule does not perform dynamic analysis## Bugs ##
Hercule should be considered alpha-quality software. Bugs can be reported here:https://github.com/rshariffdeen/Hercule/issues
## Documentation ##
* [Getting Started](docs/GetStarted.md)
* [Example Usage](docs/Examples.md)
* [ICSE'25 Experiment Replication](experiments/ICSE25/README.md)
* [Manual](docs/Manual.md)# Contributions
We welcome contributions to improve this work, see [details](docs/Contributing.md)## Publication ##
**Detecting Python Malware in the Software Supply Chain with Program Analysis**
Ridwan Shariffdeen, Behnaz Hassanshahi, Martin Mirchev, Ali El Husseini, Abhik Roychoudhury
47th International Conference on Software Engineering (ICSE), Software Engineering in Practice track (SEIP)## Acknowledgements ##
This research was supported by the National Research Foundation, Singapore, and Cyber Security Agency of Singapore under its National Cybersecurity R&D Programme (Fuzz Testing NRF-NCR25-Fuzz-0001). Any opinions, findings and conclusions, or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore, and Cyber Security Agency of Singapore.# License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details## Citation
If you use Hercule in your research work, we would highly appreciate it if you
cited the [following paper](https://rshariffdeen.com/paper/ICSE25-SEIP.pdf):```
@inproceedings{hercule,
author = {Shariffdeen, Ridwan and Hassanshahi, Behnaz and Mirchev, Martin and Husseini, Ali El and Roychoudhury, Abhik},
title = {Detecting Python Malware in the Software Supply Chain with Program Analysis},
year = {2025},
doi = {},
booktitle = {2025 IEEE/ACM 47th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion)},
pages = {},
numpages = {},
keywords = {Supply Chain Protection, Program Analysis, Malware Detection},
location = {Ottawa, Canada},
series = {ICSE SEIP 2025}
}
```