# On the Feasibility of Cross-Language Detection of Malicious Packages in npm and PyPI - Paper Artifacts

[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE.txt) [![REUSE status](https://api.reuse.software/badge/github.com/SAP-samples/cross-language-detection-artifacts)](https://api.reuse.software/info/github.com/SAP-samples/cross-language-detection-artifacts)

## Description

This supplementary material constitutes the artifact for our paper titled "On the Feasibility of Cross-Language Detection of Malicious Packages in npm and PyPI".
Specifically:
- `Labelled_Dataset.csv` (1.8 MB):
  - This file contains the labelled dataset used in our evaluation process. It holds the data employed for evaluating the features, assessing the models, and training the classifiers. The dataset can be used to gain further insight into how features were selected, to evaluate the models' performance, and to understand the training process.
- `.pkl` files (cross-language (125.9 kB), JavaScript mono-language (73.7 kB), and Python mono-language (110.1 kB)):
  - These files, saved in pickle format, encapsulate the best-performing classifiers that emerged from our model evaluation. They can be imported into Python using the scikit-learn library (see the joblib example below) and used directly for classification tasks.
- `Malicious_Packages_Discovered.csv`:
  - This file comprises the complete list of malicious packages detected during our real-world experiment, detailing their characteristics, behavior types, and obfuscation techniques.
- `scripts/*`:
  - This folder contains various scripts, including those to reproduce the evaluation and training of the models presented in the paper.

## Requirements and Run

(1) Install the dependencies via:

```
$ pip install -r ./scripts/requirements.txt
```

(2) Run the script corresponding to the model you want to reproduce (i.e., cross-language or mono-language). For example, for the cross-language XGBoost model, run the following from the main folder:

```
$ python ./scripts/Crosslanguage_XGBoost_train_test.py
```

The produced classifier can then be imported in Python using the joblib function `load()`, for example:

```
import joblib
classifier_XGBoost = joblib.load(model_path)
```
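
As an illustration, the loaded classifier can be applied to rows of `Labelled_Dataset.csv`. The sketch below is only a sketch: the `.pkl` file name is a placeholder, and it assumes the classifier expects exactly the feature columns of the dataset (i.e., everything except the `Malicious`, `Package Repository`, and `Package Name` columns described in the next section):

```
import joblib
import pandas as pd

# Placeholder paths: point them to the downloaded artifacts.
model_path = "Crosslanguage_XGBoost.pkl"
dataset_path = "Labelled_Dataset.csv"

classifier_XGBoost = joblib.load(model_path)
data = pd.read_csv(dataset_path)

# Keep only the feature columns; drop the label and package identification.
features = data.drop(columns=["Malicious", "Package Repository", "Package Name"])
labels = data["Malicious"]

predictions = classifier_XGBoost.predict(features)
print(f"Flagged {int(predictions.sum())} of {len(predictions)} packages as malicious")
```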
***************************************************************************
## Features Explanation

`Labelled_Dataset.csv` is a labelled dataset containing packages from two distinct public repositories: NPM and PyPI. It addresses the task of classifying malicious packages while taking the class-imbalance problem into account.

In this scenario, we assume that the estimated percentage of malicious packages among all packages is approximately 10%.
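
As a quick sanity check of this assumption, the class ratio of the shipped dataset can be read directly from the `Malicious` column documented below:

```
import pandas as pd

data = pd.read_csv("Labelled_Dataset.csv")

# `Malicious` is 1 for malicious packages and 0 for benign ones,
# so the column mean is the fraction of malicious samples.
malicious_ratio = data["Malicious"].mean()
print(f"Malicious packages: {malicious_ratio:.1%} of {len(data)} rows")
```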

The columns of the CSV file are explained below.

#### Package information

- `Malicious`: encodes the target value, i.e., whether a package is benign (value of 0) or malicious (value of 1)
- `Package Repository`: specifies which package repository the package comes from (i.e., NPM or PyPI)
- `Package Name`: specifies the package name

#### Features extracted from the source code files (.js, .py)
- `Number of Words in source code` : count of words in the source code files
- `Number of lines in source code` : count of lines in the source code files
- `Number of sospicious token in source code` : count of suspicious words (e.g., bash commands, paths to sensitive files) in source code files
- `Number of URLs in source code` : count of URLs in source code files
- `Number of base64 chunks in source code` : number of detected valid base64 strings in source code
- `Number of IP adress in source code` : count of IP addresses in source code files
- `bracker ratio mean` : Mean of the ratio (no. of square brackets/byte size) among source code files
- `bracker ratio std`: Standard Deviation of the ratio (no. of square brackets/byte size) among source code files
- `bracker ratio max` : Maximum of the ratio (no. of square brackets/byte size) among source code files
- `bracker ratio q3`: Third quartile of the ratio (no. of square brackets/byte size) among source code files
- `eq ratio mean` : Mean of the ratio (equal signs/byte size) among source code files
- `eq ratio std` : Standard Deviation of the ratio (equal signs/byte size) among source code files
- `eq ratio max` : Maximum of the ratio (equal signs/byte size) among source code files
- `eq ratio q3` : Third quartile of the ratio (equal signs/byte size) among source code files
- `plus ratio mean` : Mean of the ratio (plus signs/byte size) among source code files
- `plus ratio std` : Standard Deviation of the ratio (plus signs/byte size) among source code files
- `plus ratio max` : Maximum of the ratio (plus signs/byte size) among source code files
- `plus ratio q3` : Third quartile of the ratio (plus signs/byte size) among source code files
- `shannon mean ID source code` : Shannon Entropy's mean computed on identifiers, after applying the generalization language (see the entropy sketch after this list)
- `shannon std ID source code` : Shannon Entropy's standard deviation computed on identifiers, after applying the generalization language
- `shannon max ID source code` : Shannon Entropy's maximum value computed on identifiers, after applying the generalization language
- `shannon q3 ID source code` : Shannon Entropy's third quartile computed on identifiers, after applying the generalization language
- `shannon mean string source code` : Shannon Entropy's mean computed on strings, after applying the generalization language
- `shannon std string source code` : Shannon Entropy's standard deviation computed on strings, after applying the generalization language
- `shannon max string source code` : Shannon Entropy's maximum value computed on strings, after applying the generalization language
- `shannon q3 string source code` : Shannon Entropy's third quartile computed on strings, after applying the generalization language
- `homogeneous identifiers in source code` : Number of homogeneous identifiers, i.e., identifiers having all characters equal after transforming them through the generalization language
- `homogeneous strings in source code` : Number of homogeneous strings, i.e., strings having all characters equal after transforming them through the generalization language
- `heterogeneous identifiers in source code` : Number of heterogeneous identifiers, i.e., identifiers containing more than one distinct symbol after transforming them through the generalization language
- `heterogeneous strings in source code` : Number of heterogeneous strings, i.e., strings containing more than one distinct symbol after transforming them through the generalization language
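
The entropy features above are standard Shannon entropy values computed over the characters of each identifier or string after the generalization step (which is not reproduced here); the exact aggregation conventions are defined by the scripts in `scripts/`. A minimal sketch of the per-token entropy and of the aggregations used for the dataset columns (mean, standard deviation, maximum, third quartile):

```
import math
import statistics
from collections import Counter

def shannon_entropy(token):
    """Shannon entropy (in bits) of the character distribution of a token."""
    if not token:
        return 0.0
    counts = Counter(token)
    total = len(token)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Illustrative tokens; the real features are computed on the generalized
# identifiers/strings extracted from the package files.
tokens = ["getUser", "aGVsbG8gd29ybGQ=", "xK9fQ2_tmp"]
entropies = [shannon_entropy(t) for t in tokens]

print("mean:", statistics.mean(entropies))
print("std:", statistics.pstdev(entropies))
print("max:", max(entropies))
print("q3:", statistics.quantiles(entropies, n=4)[2])  # third quartile
```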
#### Metadata file (NPM: package.json, PyPI: setup.py)
The following features are extracted from the 'metadata' file of the packages, i.e., the `package.json` file for NPM and the `setup.py` file for PyPI:
- `Number of Words in metadata` : count of words in the metadata files (i.e., `package.json` file for NPM and the `setup.py` file for PyPI)
- `Number of lines in metadata` : count of lines in the metadata files (i.e., `package.json` file for NPM and the `setup.py` file for PyPI)
- `Number of sospicious token in metadata` : count of suspicious words (e.g., bash commands, paths to sensitive files) in the metadata files (i.e., `package.json` file for NPM and the `setup.py` file for PyPI)
- `Number of URLs in metadata` : count of URLs in the metadata files
- `Number of base64 chunks in metadata` : number of detected valid base64 strings in the metadata files (i.e., `package.json` file for NPM and the `setup.py` file for PyPI)
- `Number of IP adress in metadata` : count of IP addresses in the metadata files
- `presence of installation script`: boolean for the presence of an installation script (PyPI: install script; NPM: presence of the keys `postinstall`, `preinstall`, `install`); see the sketch after this list
- `shannon mean ID metadata` : Shannon Entropy's mean computed on identifiers, after applying the generalization language
- `shannon std ID metadata` : Shannon Entropy's standard deviation computed on identifiers, after applying the generalization language
- `shannon max ID metadata` : Shannon Entropy's maximum value computed on identifiers, after applying the generalization language
- `shannon q3 ID metadata` : Shannon Entropy's third quartile computed on identifiers, after applying the generalization language
- `shannon mean metadata` : Shannon Entropy's mean computed on strings, after applying the generalization language
- `shannon std string metadata` : Shannon Entropy's standard deviation computed on strings, after applying the generalization language
- `shannon max string metadata` : Shannon Entropy's maximum value computed on strings, after applying the generalization language
- `shannon q3 string metadata` : Shannon Entropy's third quartile computed on strings, after applying the generalization language
- `homogeneous identifiers in metadata` : Number of homogeneous identifiers, i.e., identifiers having all characters equal after transforming them through the generalization language
- `homogeneous strings in metadata` : Number of homogeneous strings, i.e., strings having all characters equal after transforming them through the generalization language
- `heterogeneous identifiers in metadata` : Number of heterogeneous identifiers, i.e., identifiers containing more than one distinct symbol after transforming them through the generalization language
- `heterogeneous strings in metadata` : Number of heterogeneous strings, i.e., strings containing more than one distinct symbol after transforming them through the generalization language
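
For instance, the `presence of installation script` flag for an NPM package can be approximated by checking `package.json` for the keys listed above. This is a simplified sketch, not the exact extraction code used to build the dataset:

```
import json
from pathlib import Path

NPM_INSTALL_HOOKS = {"preinstall", "install", "postinstall"}

def has_install_script(package_json_path):
    """Return True if the npm manifest declares any install hook."""
    manifest = json.loads(Path(package_json_path).read_text(encoding="utf-8"))
    scripts = manifest.get("scripts", {})
    return bool(NPM_INSTALL_HOOKS & set(scripts))

# Example: True for a package declaring a `postinstall` script.
print(has_install_script("package.json"))
```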
#### Structural features of the package
The following features count the number of files per selected extension:

```'bat', 'bz2', 'c', 'cert', 'conf', 'cpp', 'crt', 'css', 'csv', 'deb', 'erb', 'gemspec', 'gif', 'gz', 'h', 'html', 'ico', 'ini', 'jar', 'java', 'jpg', 'js', 'json', 'key', 'm4v', 'markdown', 'md', 'pdf', 'pem', 'png', 'ps', 'py', 'rb', 'rpm', 'rst', 'sh', 'svg', 'toml', 'ttf', 'txt', 'xml', 'yaml', 'yml', 'eot', 'exe', 'jpeg', 'properties', 'sql', 'swf', 'tar', 'woff', 'woff2', 'aac', 'bmp', 'cfg', 'dcm', 'dll', 'doc', 'flac', 'flv', 'ipynb', 'm4a', 'mid', 'mkv', 'mp3', 'mp4', 'mpg', 'ogg', 'otf', 'pickle', 'pkl', 'psd', 'pxd', 'pxi', 'pyc', 'pyx', 'r', 'rtf', 'so', 'sqlite', 'tif', 'tp', 'wav', 'webp', 'whl', 'xcf', 'xz', 'zip', 'mov', 'wasm', 'webm'.```
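
A minimal sketch of how such per-extension counts can be computed on an unpacked package directory (the actual feature extraction is part of the paper's pipeline and is not reproduced here):

```
from collections import Counter
from pathlib import Path

# Illustrative subset of the selected extensions listed above.
SELECTED_EXTENSIONS = {"js", "py", "json", "sh", "exe", "png", "zip"}

def count_files_per_extension(package_dir):
    """Count files per selected extension in an unpacked package."""
    counts = Counter()
    for path in Path(package_dir).rglob("*"):
        if path.is_file():
            ext = path.suffix.lstrip(".").lower()
            if ext in SELECTED_EXTENSIONS:
                counts[ext] += 1
    return counts

# Placeholder path to an unpacked npm or PyPI package.
print(count_files_per_extension("./some-unpacked-package"))
```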

**************************************************

## Time and Space Cost

The cost values reported below for training our models were computed with the following configuration:

`MacOS 13.5.1; CPU: 2 GHz Quad-Core Intel Core i5; RAM: 16 GB 3733 MHz LPDDR4X`

These costs cover the whole process of selecting the best hyperparameters and training the models.

For XGBoost we have the following estimations:
- Cross-language model:
  - Estimated train time: `271.04s`
  - Estimated space cost: `64.08 MB`
- Mono-language model (JavaScript):
  - Estimated train time: `198.24s`
  - Estimated space cost: `57.06 MB`
- Mono-language model (Python):
  - Estimated train time: `195.50s`
  - Estimated space cost: `60.98 MB`

For Decision Tree (DT) we have the following estimations:
- Cross-language model:
  - Estimated train time: `86.76s`
  - Estimated space cost: `118.24 MB`
- Mono-language model (JavaScript):
  - Estimated train time: `51.01s`
  - Estimated space cost: `108.03 MB`
- Mono-language model (Python):
  - Estimated train time: `46.86s`
  - Estimated space cost: `125.35 MB`

For Random Forest (RF) we have the following estimations:
- Cross-language model:
  - Estimated train time: `696.67s`
  - Estimated space cost: `193.31 MB`
- Mono-language model (JavaScript):
  - Estimated train time: `544.89s`
  - Estimated space cost: `125.66 MB`
- Mono-language model (Python):
  - Estimated train time: `550.65s`
  - Estimated space cost: `122.81 MB`

## How to obtain support
[Create an issue](https://github.com/SAP-samples/cross-language-detection-artifacts/issues) in this repository if you find a bug or have questions about the content.

For additional support, [ask a question in SAP Community](https://answers.sap.com/questions/ask.html).

## Contributing
If you wish to contribute code, offer fixes or improvements, please send a pull request. Due to legal reasons, contributors will be asked to accept a DCO when they create the first pull request to this project. This happens in an automated fashion during the submission process. SAP uses [the standard DCO text of the Linux Foundation](https://developercertificate.org/).

## License
Copyright (c) 2023 SAP SE or an SAP affiliate company. All rights reserved. This project is licensed under the Apache Software License, version 2.0 except as noted otherwise in the [LICENSE](LICENSE) file.