Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/futurecomputing4ai/kilograms
KiloGram algorithm for finding the top-k most frequent n-grams for large values of n quickly with fixed memory.
https://github.com/futurecomputing4ai/kilograms
feature-extraction machine-learning malware n-grams
Last synced: 3 months ago
JSON representation
KiloGram algorithm for finding the top-k most frequent n-grams for large values of n quickly with fixed memory.
- Host: GitHub
- URL: https://github.com/futurecomputing4ai/kilograms
- Owner: FutureComputing4AI
- Created: 2020-02-24T22:33:58.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2020-10-13T19:49:04.000Z (over 4 years ago)
- Last Synced: 2024-03-23T16:34:21.559Z (11 months ago)
- Topics: feature-extraction, machine-learning, malware, n-grams
- Language: Java
- Homepage:
- Size: 34.2 KB
- Stars: 9
- Watchers: 5
- Forks: 3
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# KiloGrams
This is the java code implementing the KiloGrams algorithm, from out paper [_KiloGrams: Very Large N-Grams for Malware Classification_](https://arxiv.org/abs/1908.00200). Using it, you can extract the top-_k_ largest _n_-grams from a corpus using a fixed amount of memory, for large values of _k_ and n. In our original paper, we tested with _k_ up to 8192, which took the same time or less than processing _k_=6 grams.
This is research code, and comes with no warranty or support.
## Quick Start
You can use this code to create a dataset based on the top-_k_ _n_-grams. To do so, after building the KiloGrams code, you can run a comand like this:
```
java -Xmx10G -jar Kilograms-1.0-jar-with-dependencies.jar NGram -n 8 -k 1000 -g -b -o grams.dat
```
The top-_k_ ngrams are saved in grams.dat, a binary formated file. See NGram.java or Featurizer.java source code for the nature of the binary format and how to parse it if you want to know the n-grams. If you use a value of _n_ > 8, we recommend you add the hashing-stride option with `-hs`. For example, if you want _n_=1024 grams, we would use `-hs 256`.To create a dataset from the above code, you can use the following command:
```
java -Xmx10G -jar Kilograms-1.0-jar-with-dependencies.jar DATASET -g -b -h grams.dat -o data.libsvm
```By default, this will produce a file using the libsvm format. Scikit-learn can [read this](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html).
If you have a machine with a very large number of cores or very large files, you may want to increase the max memory for Java, depending on your JVM used.
The folders given as input do not have to be executables, or even benign/malicious. They can be any kind of files, and the code will process byte n-grams. The `DATASET` creation step also supports multi-class problems by using the `-mc ... ` flag instead of `-b` and `-g`.
## Citations
If you use the Kilogram algorithm or code, please cite our work!
```
@inproceedings{Kilograms_2019,
author = {Raff, Edward and Fleming, William and Zak, Richard and Anderson, Hyrum and Finlayson, Bill and Nicholas, Charles K. and Mclean, Mark},
booktitle = {Proceedings of KDD 2019 Workshop on Learning and Mining for Cybersecurity (LEMINCS'19)},
title = {{KiloGrams: Very Large N-Grams for Malware Classification}},
url = {https://arxiv.org/abs/1908.00200},
year = {2019}
}
```## Contact
If you have questions, please contact
Mark Mclean
Edward Raff
Richard Zak