https://github.com/sophos/yaraml_rules

Security ML models encoded as Yara rules

# Sophos AI YaraML Rules Repository
*Questions, concerns, ideas, results, and feedback are appreciated; please email [email protected]*

YaraML is a tool that automatically generates Yara rules from training data by translating scikit-learn logistic regression and random forest binary classifiers into the Yara language. Give YaraML a directory of malware files and a directory of benign files, in any format, and it will extract substring features, downselect the feature space, train a model, and then "compile" the model into a textual Yara rule. To get a feel for what this looks like, see the logistic regression Powershell detector below, which was generated by YaraML.

```
rule Generic_Powershell_Detector
{
strings:
...
$s4 = "DownloadFile" fullword // weight: 3.257
$s5 = "WOW64" fullword // weight: 3.232
$s6 = "bypass" fullword // weight: 3.021
$s7 = "meMoRYSTrEaM" fullword // weight: 2.68
$s8 = "obJEct" fullword // weight: 2.679
$s9 = "OBJecT" fullword // weight: 2.659
$s10 = "ReGeX" fullword // weight: 2.592
$s11 = "samratashok" fullword // weight: 2.548
$s12 = "Dependencies" fullword // weight: 2.494
$s13 = "TVqQAAMAAAAEAAAA" fullword // weight: 2.428
$s14 = "CompressionMode" fullword // weight: 2.366
...
condition:
...
((#s0 * 5.567) + (#s1 * 4.122) + (#s2 * 3.904) + (#s3 * 3.820) +
(#s4 * 3.257) + (#s5 * 3.232) + (#s6 * 3.021) + (#s7 * 2.680) +
(#s8 * 2.679) + (#s9 * 2.659) + (#s10 * 2.592) + (#s11 * 2.548) +
...
> 0
}
```
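To make the "compile" step more concrete, here is a minimal Python sketch of how a linear model's weights might be rendered as a rule shaped like the one above. This is not YaraML's actual code. The function names (`escape_yara`, `logreg_to_yara`, `score`), the dict input format, and the intercept of -4.0 are assumptions made for illustration; only the two weights come from the example rule. Placing the intercept in the condition as a constant term is also an assumption, since that part of the rule is elided above. The `> 0` threshold matches how a logistic regression classifies: the weighted sum plus intercept is positive exactly when the predicted probability exceeds 0.5.

```python
import re


def escape_yara(text):
    """Escape backslashes and double quotes for a Yara text string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def logreg_to_yara(rule_name, weights, intercept=0.0):
    """Render substring weights and an intercept as Yara rule text.

    `weights` maps substring -> coefficient (a hypothetical input format).
    """
    # List the strongest malicious indicators first, as in the example rule.
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    strings = [
        f'        $s{i} = "{escape_yara(s)}" fullword // weight: {w:.3f}'
        for i, (s, w) in enumerate(ranked)
    ]
    # "#sN" is Yara's match count for string $sN; weight each count and sum.
    terms = [f"(#s{i} * {w:.3f})" for i, (_, w) in enumerate(ranked)]
    terms.append(f"({intercept:.3f})")  # assumed placement of the intercept
    lines = [f"rule {rule_name}", "{", "    strings:"]
    lines += strings
    lines += ["    condition:", "        (" + " + ".join(terms) + ") > 0", "}"]
    return "\n".join(lines)


def score(text, weights, intercept=0.0):
    """Approximate the rule's condition in Python for a quick sanity check."""
    total = intercept
    for substring, weight in weights.items():
        # Rough stand-in for "fullword": no alphanumeric character on either side.
        pattern = r"(?<![A-Za-z0-9])" + re.escape(substring) + r"(?![A-Za-z0-9])"
        total += weight * len(re.findall(pattern, text))
    return total


weights = {"DownloadFile": 3.257, "bypass": 3.021}  # taken from the rule above
intercept = -4.0  # hypothetical value, not from the README

rule_text = logreg_to_yara("Generic_Powershell_Detector", weights, intercept)
print(rule_text)

suspicious = "powershell -ep bypass (New-Object Net.WebClient).DownloadFile($u, $p)"
benign = "Get-ChildItem | Sort-Object Length"
print(score(suspicious, weights, intercept) > 0)
print(score(benign, weights, intercept) > 0)
```

The `score` helper only approximates Yara's matching semantics, so treat it as a way to reason about the weighted sum rather than as a substitute for running the rule with Yara itself.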

## How do I get started?

Clone this repo and install it with `python setup.py install`. Please use Python 3.6 or above; the tool has been tested on OSX, Ubuntu, and Redhat, and your mileage may vary on Windows. Once installed, invoke the tool as `yaraml`.

Here's an example invocation, assuming you have malicious Powershell scripts in *powershell_malware/* (or any of its subdirectories) and benign Powershell scripts in *powershell_benign/* (or any of its subdirectories):

```
# Positional arguments, in order: the malware directory, the benign directory,
# the directory where the resulting rule will be written, and the name of your Yara rule.
# --max_benign_files / --max_malicious_files optionally cap the number of files to train on.
# --model_type is either logisticregression or randomforest (sklearn default hyperparams).
yaraml powershell_malware/ powershell_benign/ \
    powershell_model \
    powershell_detector \
    --max_benign_files=100 --max_malicious_files=100 \
    --model_type="logisticregression"

# N.B.: to set hyperparams, use --model_instantiation instead of --model_type
# and call the appropriate sklearn constructor, for example:
#   --model_instantiation="LogisticRegression(penalty='l1',solver='liblinear')"
```
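The two input directories and the optional file caps feed the first stages described earlier: extracting substring features and downselecting the feature space. As a rough illustration only, the sketch below shows one way those stages could look. It is not YaraML's implementation. The token definition, the `document_frequency` and `select_features` helpers, and the ranking by difference in per-class document frequency are all assumptions chosen to keep the example short.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path

# Hypothetical token definition: runs of 4+ word characters in the raw bytes.
TOKEN = re.compile(rb"[A-Za-z0-9_]{4,}")


def document_frequency(directory, max_files=None):
    """Return (token -> number of files containing it, number of files read)."""
    # rglob also walks subdirectories, as the invocation text describes.
    files = sorted(p for p in Path(directory).rglob("*") if p.is_file())
    if max_files is not None:
        files = files[:max_files]
    counts = Counter()
    for path in files:
        # Use a set so each file counts a token at most once.
        counts.update(set(TOKEN.findall(path.read_bytes())))
    return counts, len(files)


def select_features(malware_dir, benign_dir, top_k=10, max_files=None):
    """Keep the top_k tokens whose document frequency differs most by class."""
    mal, n_mal = document_frequency(malware_dir, max_files)
    ben, n_ben = document_frequency(benign_dir, max_files)

    def gap(token):
        # Counter returns 0 for tokens never seen in a class.
        return abs(mal[token] / max(n_mal, 1) - ben[token] / max(n_ben, 1))

    vocabulary = set(mal) | set(ben)
    # Break ties alphabetically so the result is deterministic.
    return sorted(vocabulary, key=lambda t: (-gap(t), t))[:top_k]


# Tiny demo on throwaway files; the file contents are made up.
with tempfile.TemporaryDirectory() as root:
    malware_dir = Path(root, "powershell_malware")
    benign_dir = Path(root, "powershell_benign")
    (malware_dir / "nested").mkdir(parents=True)
    benign_dir.mkdir()
    (malware_dir / "a.ps1").write_text("IEX DownloadFile bypass")
    (malware_dir / "nested" / "b.ps1").write_text("DownloadFile hidden")
    (benign_dir / "c.ps1").write_text("Write-Host hello world")
    (benign_dir / "d.ps1").write_text("Write-Host hidden")
    features = select_features(malware_dir, benign_dir, top_k=3)
    print(features)
```

The selected tokens would then become the columns of a feature matrix for the classifier. Note that ranking by absolute difference keeps strong benign indicators as well as malicious ones, which a linear model can use through negative weights.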

## Why YaraML?

Because sometimes we want to use ML models for blue team work but only Yara is available. And sometimes writing hand-crafted rules is too time-consuming, or we want an ML alternative to relying solely on our own rule-writing judgment.

## How well maintained is this code base?

We're providing research code here but will happily respond to questions and bug reports. We want your feedback and we want to make this tool useful to the community.

## How do I cite YaraML?

@misc{Saxe2020,
author = {Saxe, Joshua},
title = {YaraML},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/sophos-ai/yaraml_rules/}}
}