https://github.com/uds-se/mudflow-statistical-analysis
The repository contains scripts that are needed for statistical analysis of data flows
https://github.com/uds-se/mudflow-statistical-analysis
Last synced: 6 months ago
JSON representation
The repository contains scripts that are needed for statistical analysis of data flows
- Host: GitHub
- URL: https://github.com/uds-se/mudflow-statistical-analysis
- Owner: uds-se
- Created: 2016-06-01T12:54:00.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2016-06-01T18:07:18.000Z (over 9 years ago)
- Last Synced: 2025-03-23T17:45:50.481Z (7 months ago)
- Language: R
- Homepage: https://www.st.cs.uni-saarland.de/appmining/mudflow/
- Size: 4.05 MB
- Stars: 2
- Watchers: 2
- Forks: 4
- Open Issues: 3
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Statistical Analysis Scripts of a MUDFLOW prototype
The repository contains scripts that are needed for statistical analysis of data flows
These scripts implement MUDFLOW approach.
* Modify `scripts/conf.R` such that `` and `` point to your data. Use only base names. All paths are relative to the root application folder.
* Use `` to set an id for a particular classification (can be empty).
* `` file is optional. It is used only when a set of features used in maliciogram construction is being restricted to data flows belonging to a particular susi category.
* Try running `Rscript scripts/make_all.R`* You need the following R libraries:
* e1071,
* kernlab,
* ROCR(optionally),
* parallel,
* iterators,
* foreach
* These scripts were tested with R 3.1.1
* You can make several configurations and launch Mudflow using particular one via `Rscript make_all.R `
* It's possible to launch stages step by step. Please see `stage` variable in the `conf.R`
* If you have a lot of data it's reasonable to convert CSV files to RDS at first (like RDS files in `data` folder). Then you can use `` flag to increase the speed of data loading
* You can find sample data files in `data` folder. Main file contains number of flows found in each application in rows. Currently Mudflow accepts only nonnegative integers as feature values. Though, for jaccard distance only binary values are considered (i.e. zero and nonzero)
* Each pair `name;category` in `` file means that an app `name` has a source associated with susi category `category`.
* It's better to surround each value in files with double quotes.
* Classification results are stored in file `Results.txt`.The Mudflow process consists of the following stages:
* Prepare data for orca. For each susi category it constitutes training and testing sets from apps having at least one flow from a source that belongs to this category. Depending on settings all data flows or only ones from sources associated with a category are considered.
* Perform outlierness score calculation with help of orca;
* Aggregate scores into a maliciogram and calculate feature weights;
* Do one-class SVM classification using maliciograms as features.