https://github.com/anacletolab/parsmurf
High Performance Computing imbalance-aware machine learning tool for the genome-wide detection of pathogenic variants
- Host: GitHub
- URL: https://github.com/anacletolab/parsmurf
- Owner: AnacletoLAB
- License: gpl-3.0
- Created: 2019-04-02T15:26:31.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2020-12-08T10:17:06.000Z (over 5 years ago)
- Last Synced: 2025-06-02T07:28:28.694Z (about 1 year ago)
- Language: Python
- Homepage:
- Size: 3.8 MB
- Stars: 7
- Watchers: 4
- Forks: 2
- Open Issues: 3
Metadata Files:
- Readme: readme.MD
- License: LICENSE
# parSMURF
This package contains parSMURF, a High Performance Computing imbalance-aware machine learning tool for the genome-wide detection of pathogenic variants.
---
### Table of Contents
- Overview
- Requirements
- Downloading and compiling
- General architecture
- Running parSMURF
  - Command line options
  - Running parSMURF1
  - Running parSMURFn
  - Running the Bayesian optimizer
  - Configuration file
    - name
    - exec
    - data
    - simulate
    - folds
    - params
    - autogp_params
- Data Format
  - Data file format
  - Label file format
  - Fold file format
  - Output file format
- Random dataset generation
- Examples
- License
---
### Overview
parSMURF is a fast and scalable C++ implementation of the HyperSMURF algorithm - hyper-ensemble of SMOTE Undersampled Random Forests - an ensemble approach explicitly designed to deal with the huge imbalance between deleterious and neutral variants.
The algorithm is outlined in the following papers:\
A. Petrini, M. Mesiti, M. Schubach, M. Frasca, D. Danis, M. Re, G. Grossi, L. Cappelletti, T. Castrignanò, P. N. Robinson, and G. Valentini, "parSMURF, a high-performance computing tool for the genome-wide detection of pathogenic variants", GigaScience, vol. 9, 05 2020. giaa052.\
https://doi.org/10.1093/gigascience/giaa052 \
Max Schubach, Matteo Re, Peter N. Robinson & Giorgio Valentini, "Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants", Scientific Reports, 2017/06/07\
https://www.nature.com/articles/s41598-017-03011-5
Two variants of parSMURF are currently available in this repository:
- "parSMURF1" is a fast multi-threaded implementation of the algorithm and is meant to be run on a single machine
- "parSMURFn" is a multi-threaded and parallel implementation (under the MPI programming paradigm) and is meant to be run on a single machine or on a cluster
Both versions share the same design and functionalities outlined in the paper, in particular:
- fast, optimized and scalable C++ implementation
- auto tuning of the learning parameters by grid search or by means of a Bayesian optimizer
---
### Requirements
parSMURF is designed for x86-64 and Intel Xeon Phi architectures running Linux OSes.\
This software is distributed as source code.
A compiler which supports the C++11 language specification is required. It has been tested with GCC (vers. >= 5) and Intel CC (2015, 2017 and 2019).\
Code is also optimized for Intel XeonPhi architectures, and it has been successfully tested on Knights Landing family processors.
Multithreading and multiprocessing are managed differently in parSMURF1 and parSMURFn: the former is a multithread-only implementation and thread management is performed through OpenMP APIs. Any reasonably recent compiler has its specification already built-in, hence this requirement is usually met. parSMURFn, instead, is a multiprocess and multithread implementation of the algorithm. Thread management is performed by the Linux built-in pthread library and multiprocessing is performed through the MPI APIs. Hence, for compilation and running, parSMURFn requires an implementation of the MPI standard. It has been tested with OpenMPI 1.10.3, OpenMPI 2.0, IntelMPI 2016, IntelMPI 2017 and IntelMPI 2019.
Notice that MPI is not required for parSMURF1, hence if no MPI libraries are found on the target system, it is still possible to compile and run this version of the software.
On Ubuntu, it is possible to install the OpenMPI library via apt package manager:
```
sudo apt-get install openmpi-bin openmpi-common libopenmpi-dev
```
Makefiles are generated by the cmake (vers. >= 2.8) utility. On Ubuntu it is possible to install this package via apt:
```
sudo apt-get install cmake
```
Bayesian optimization is done by the Spearmint package. This package requires Python 2 and depends on several Python packages. The best way to use this feature is to create and configure a Python virtual environment and install the required Python packages there. On Ubuntu:
```
sudo apt-get install virtualenv
virtualenv parSMURFvenv -p /usr/bin/python2 #This command creates a parSMURFvenv directory
source parSMURFvenv/bin/activate #This command activates the virtual environment
pip install numpy==1.13.0 #The following commands install the required packages in the virtual environment
pip install scipy==1.2.1
pip install weave==0.17.0
pip install six==1.12.0
pip install protobuf==3.7.1
deactivate #Deactivate the virtual environment
```
parSMURF uses several external libraries that are included as source code in this repository or are automatically downloaded and compiled. In particular, the following libraries are included:
- ANN: A Library for Approximate Nearest Neighbor Searching, by David M. Mount and Sunil Arya, Version 1.1.2. The modified version is supplied in the src/ann_1.1.2 directory. This version has been adapted for multi-thread execution, since the original package available at https://www.cs.umd.edu/~mount/ANN/ is not thread safe and is not compatible with this package.
- Ranger: A Fast Implementation of Random Forests, by Marvin N. Wright, Version 0.11.2. The modified version, stripped of the R code, is supplied in the src/ranger directory. The main codebase is located at https://github.com/imbs-hl/ranger
- Spearmint, a Python package to perform Bayesian optimization, by Jasper Snoek. The original version at https://github.com/JasperSnoek/spearmint seems no longer maintained and needed a few updates to run on parSMURF.
The following libraries are not included in this code repository, but are automatically downloaded during the compilation process:
- easylogging++: A single header C++ logging library, by Zuhd Web Services. Automatically cloned in src/easyloggingpp and compiled from https://github.com/zuhd-org/easyloggingpp
- jsoncons: A C++, header-only library for constructing JSON and JSON-like text and binary data formats, by Daniel Parker. Automatically cloned in src/jsoncons and compiled from https://github.com/danielaparker/jsoncons
- zlib: A massively spiffy yet delicately unobtrusive compression library, by Jean-loup Gailly and Mark Adler. Automatically cloned in src/zlib and compiled from https://github.com/madler/zlib
All the libraries have been modified and redistributed according to their own licenses. For each included library, a copy of the associated license is contained in each library folder.
---
### Downloading and compiling
Download the latest version from this page or clone the git repository altogether:
```
git clone https://github.com/anacletolab/parSMURF
```
Once the package has been downloaded, move to the main directory, create a build dir, invoke cmake and build the software (`-j n` make option enables multithread compilation over n threads):
```
cd parSMURF
mkdir build
cd build
cmake ../src
make -j 4
```
This will generate two executables: "parSMURF1" and "parSMURFn".
For a quick test, launch the following command from the build directory:
```
./parSMURF1 --cfg ../cfgEx/simulCV.json
```
---
### General architecture
While both versions strictly follow the paper and its original R implementation (available on the CRAN repository https://cran.r-project.org/web/packages/hyperSMURF/index.html), the novelties of this package reside in the fast C++ code and in the parallel execution, which lead to a dramatic decrease in computing time while keeping the same prediction quality as the original implementation. It also features two different approaches for automatically finding the best learning parameters.
Hence, execution roughly follows this scheme:
```
- data reading from file(s) (or random dataset generation)
- folds and partitions generation [by index!]
- for each fold
---- for each partition in the current fold
---- ---- over-sampling of the minority class and under-sampling of the majority class
---- ---- random forest training
---- ---- random forest test
---- prediction accumulation
- prediction averaging
```
Results are evaluated according to an n-fold validation process. Folds can be randomly generated (the user is free to specify the number of folds) or can be read from a file. When randomly generated, folds are stratified, i.e. the generation algorithm tries to evenly distribute the number of positive examples amongst the folds.
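As a rough illustration of what stratified fold generation means, the following Python sketch deals the shuffled indices of each class round-robin over the folds. The function name `stratified_folds` and the round-robin strategy are assumptions made for this example; the generator implemented in parSMURF may differ in detail.

```python
import random

def stratified_folds(labels, n_folds, seed=None):
    """Assign each example to a fold so that positives are spread evenly.

    Illustrative only: not taken from the parSMURF source code.
    """
    rng = random.Random(seed)
    folds = [0] * len(labels)
    # Handle each class separately so that both are balanced across folds.
    for cls in (1, 0):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices round-robin over the folds.
        for pos, i in enumerate(idx):
            folds[i] = pos % n_folds
    return folds

# Highly imbalanced toy labelling: 6 positives out of 100 examples.
labels = [1] * 6 + [0] * 94
folds = stratified_folds(labels, n_folds=3, seed=1)
positives_per_fold = [
    sum(1 for f, y in zip(folds, labels) if f == k and y == 1)
    for k in range(3)
]
```

With 6 positives and 3 folds, each fold receives exactly 2 positives regardless of the seed.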
Parallelization happens at partition level: since the SMOTE algorithm and the subsequent RF train and test stages are almost embarrassingly parallel inside each fold (i.e. they require the same operations to be performed on different data, with no synchronization points or data communication involved), these steps can be executed concurrently for each partition belonging to the same fold.
In parSMURF1, this process is parallelized by means of multi-threading. As an example, if the user specifies x partitions and y processing threads, each thread is assigned x/y partitions, which it processes sequentially. If enough CPU cores are available, the threads execute concurrently, leading to an almost linear speed-up, especially on CPUs characterized by a high number of cores, like the Intel Xeon Phi family of processors.
Parallelization in parSMURFn follows the same model, which is further expanded to exploit the computational power of several processing nodes in a cluster. The execution scheme follows a simple master-slave model, where a single master MPI process reads the data from file and delegates the processing of each partition (SMOTE and RF steps) to k working MPI processes. The master process also manages the collection and accumulation of the predictions from the working processes. Moreover, as in parSMURF1, processing of the partitions in each working process is parallelized by means of multi-threading.
As an example, suppose that the user specifies x partitions, k working processes and y processing threads for each working process.
The master process assigns "chunks" of x/k partitions to each working process and sends them the relevant data for the computation. Inside each working process, each chunk is further divided amongst the thread pool, and each thread is assigned (x/k)/y partitions. Predictions for each chunk are locally accumulated inside each working process and are sent back to the master process only once the work for the chunk is finished.
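The two-level split described above can be sketched in a few lines of Python. The helper names `split_evenly` and `assign_partitions` are hypothetical, and giving the remainder to the first bins is an assumption of this sketch; the README does not say how parSMURF handles partition counts that do not divide evenly.

```python
def split_evenly(items, n_bins):
    """Split items into n_bins contiguous groups whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_bins)
    bins, start = [], 0
    for b in range(n_bins):
        # The first `extra` bins take one additional item (assumption).
        size = base + (1 if b < extra else 0)
        bins.append(items[start:start + size])
        start += size
    return bins

def assign_partitions(n_parts, n_workers, n_threads):
    """Map partition ids to worker processes, then to threads inside each worker."""
    per_worker = split_evenly(list(range(n_parts)), n_workers)
    return [split_evenly(chunk, n_threads) for chunk in per_worker]

# x = 100 partitions, k = 4 working processes, y = 6 threads per process.
plan = assign_partitions(n_parts=100, n_workers=4, n_threads=6)
threads_of_first_worker = [len(t) for t in plan[0]]
```

Here each worker receives 25 partitions, and inside a worker the threads receive 5, 4, 4, 4, 4 and 4 partitions.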
Several strategies have been used to minimize latencies due to data transmission or broadcasting between the master and working processes, not limited to:
- the master process sends only the data strictly needed for the computation of each partition; moreover, it is sent as a single big array with a header, instead of several small chunks.
- sends and receives in the master process are managed in two different threads, hence interleaving data preparation and transmission with data reception.
- sends in the master process can be single- or multi-threaded: in the latter case, the master process spawns a number of threads equal to the number of working processes, and each of these threads prepares the data and sends it to the corresponding worker, concurrently. This is the default operation mode, but it might be memory consuming, therefore a configuration option to disable this feature is provided.
parSMURF features two subsystems for the automatic fine tuning of the learning parameters, aimed at maximizing the prediction performance of the algorithm. The first strategy performs an exhaustive grid search: given a set of values for each hyper-parameter, the set of all possible combinations of hyper-parameters is calculated, and each combination is evaluated through internal cross validation. The other strategy is Bayesian optimization: given a range for each hyper-parameter, the Bayesian optimizer generates a sequence of candidates that tends towards a probable global maximum. A high-level view of the execution is given by this pseudo-code snippet:
```
iter = 0
- while (iter < maxIter) and (error > tolerance):
-- BO generates a new possible candidate of hyper-parameters h
-- evaluation of h in a context of internal cross validation
-- submit (h, AUPRC(h)) to the BO
-- iter <- iter + 1
```
Both strategies are performed in a context of internal cross validation, hence they are repeated for each fold of the external CV.
The output of the procedure is the set of best learning parameters for each fold of the external cross validation.
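To make the grid search strategy more concrete, the Python sketch below enumerates every combination of a set of candidate values. The function name `grid_combinations` and all the values are hypothetical; they were chosen so that three parameters vary and 18 combinations result, which is the same size as the grid described for the `gridTune.json` example.

```python
from itertools import product

def grid_combinations(params):
    """Yield one dict per combination of the listed hyper-parameter values."""
    names = list(params)
    for values in product(*(params[name] for name in names)):
        yield dict(zip(names, values))

# Hypothetical candidate values: lists with a single entry stay fixed.
params = {
    "nParts": [10, 20, 30],
    "fp": [1, 2, 3],
    "ratio": [1],
    "k": [5],
    "nTrees": [10],
    "mtry": [2, 5],
}
combos = list(grid_combinations(params))
n_combos = len(combos)  # 3 * 3 * 1 * 1 * 1 * 2 = 18
```

Each of these combinations would then be scored in the internal cross validation, and the best one kept for the corresponding external fold.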
---
### Running parSMURF
parSMURF is a command line executable.\
All the options are submitted to the main executable through a configuration file written in JSON format.
#### Command line options
Only two command line options are available, since every other parameter or option is defined by json configuration files.\
`--cfg ` specifies the configuration file for the run\
`--help` prints a brief help screen
#### Running parSMURF1
parSMURF1 does not require anything special to run, besides a proper configuration file. Hence, it can be launched as follows:
```
./parSMURF1 --cfg
```
#### Running parSMURFn
parSMURFn requires MPI to be installed on the target system or in all the nodes of a cluster. It must be invoked with `mpirun` or, depending on the scheduling system installed on the cluster, with a proper mpirun wrapper.\
The `-n` option of `mpirun` also specifies how many processes have to be launched. parSMURFn requires at least two processes, one as master and one as worker. As an example:
```
mpirun -n 5 ./parSMURFn --cfg
```
launches an instance of parSMURFn over 5 processes (one master and four workers).\
As of now, the number of master processes is limited to one.
#### Running the Bayesian optimizer
Using the Bayesian optimizer requires more effort, but we are working on making the whole procedure more user friendly.\
As noted in the "Requirements" section, it may be preferable to set up a Python virtual environment and launch parSMURF1 or parSMURFn from there.\
Also, the entire `src/spearmint` folder must be copied to the same directory where the parSMURF executable is.\
As a final requirement, the environment variable PYTHONPATH must contain the path to the Spearmint folder.\
As an example, assume that the git repository has been copied to `/home/user01/git/parSMURF` and the package has been successfully compiled in the `/home/user01/git/parSMURF/build` directory. Also, assume that a Python virtual environment has been created and is located at `/home/user01/pythonVenvs/parSMURFvenv`.\
To prepare a folder containing everything parSMURF needs to run, do the following:
```
mkdir /home/user01/parSMURFexp
cd /home/user01/parSMURFexp
cp /home/user01/git/parSMURF/build/parSMURF1 .
cp /home/user01/git/parSMURF/build/parSMURFn .
cp -r /home/user01/git/parSMURF/src/spearmint .
```
Now for launching an experiment with the Bayesian optimizer, do the following:
```
cd /home/user01/parSMURFexp
export PYTHONPATH=$PYTHONPATH:/home/user01/parSMURFexp/spearmint/spearmint:/home/user01/parSMURFexp/spearmint/spearmint/spearmint
source /home/user01/pythonVenvs/parSMURFvenv/bin/activate
deactivate
```
---
#### Configuration file
parSMURF1 and parSMURFn use configuration files in json format for setting the parameters of each run.\
Examples of configuration files are available in the cfgEx folder of the repository.
A configuration file is composed of seven top-level fields:
```
{
"name": ...,
"exec": {...},
"data": {...},
"folds": {...},
"simulate": {...},
"params": {...},
"autogp_params": {...}
}
```
Depending on the configuration itself, some dictionaries are not mandatory and can be left out.
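As a rough illustration, the following Python sketch assembles a configuration for a simulated cross-validation run and serializes it to JSON, using only keys documented in this README. The experiment name, the output file name and the parameter values are made up for this example, and placing `"optimizer"` inside `"exec"` is an assumption based on where that field is described; the files in the cfgEx folder remain the authoritative reference.

```python
import json

# Hypothetical configuration for a simulated 10-fold cross validation.
config = {
    "name": "simulated_cv_example",          # free-form label (made up)
    "exec": {
        "ensThrd": 4,                        # threads for partition processing
        "rfThrd": 1,                         # threads for random forest train/test
        "seed": 1,
        "verboseLevel": 1,
        "printCfg": True,
        "mode": "cv",
        "optimizer": "no",                   # placement in "exec" is an assumption
    },
    "data": {"outFile": "predictions.txt"},  # made-up output file name
    "simulate": {"simulation": True, "prob": 0.02, "n": 1200, "m": 25},
    "folds": {"nFolds": 10},
    "params": {                              # every value is passed as an array
        "nParts": [10],
        "fp": [1],
        "ratio": [1],
        "k": [5],
        "nTrees": [10],
        "mtry": [5],
    },
}

config_text = json.dumps(config, indent=4)
```

Writing `config_text` to a file gives a document that can be passed to the executable through the `--cfg` option.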
##### "name"
```
"name": string
```
Mandatory: no\
Exec: parSMURF1 / parSMURFn\
A string for labeling the name of the experiment
##### "exec"
```
"exec": {
"name": string,
"nProcs": int,
"ensThrd": int,
"rfThrd": int,
"noMtSender": bool,
"seed": int,
"verboseLevel": int,
"verboseMPI": bool,
"saveTime": bool,
"timeFile": string,
"printCfg": bool,
"mode": string
},
```
Mandatory: yes\
Exec: parSMURF1 / parSMURFn\
General configuration of the run.
```
"name": string
```
Mandatory: No\
Exec: parSMURF1 / parSMURFn\
Label used for marking the name of the executable (parSMURF1 or parSMURFn). It does not affect the computation itself, since this field is ignored by the json parser
```
"nProcs": int
```
Mandatory: No\
Exec: parSMURFn\
Label used for marking the number of processes for a run of parSMURFn. It does not affect the computation itself, since the total number of processes is detected at runtime by the MPI APIs.
```
"ensThrd": int
```
Mandatory: Yes\
Exec: parSMURF1 / parSMURFn\
Number of threads assigned to perform the partition processing.
```
"rfThrd": int
```
Mandatory: Yes\
Exec: parSMURF1 / parSMURFn\
Number of threads assigned to perform the random forest train and test.
```
"noMtSender": bool
```
Mandatory: No\
Exec: parSMURFn\
This option disables multithreading in the master process. It may affect performances, but it may be necessary when processing particularly large datasets.
```
"seed": int
```
Mandatory: No\
Exec: parSMURF1 / parSMURFn\
Optional seed for the random number generators. If unspecified, a random seed is generated.
```
"verboseLevel": int
```
Mandatory: No\
Exec: parSMURF1 / parSMURFn\
Level of verbosity on stdout and on the logfile of the computational task. Range is 0-3 (default: 0).
```
"verboseMPI": bool
```
Mandatory: No\
Exec: parSMURFn\
Logs the calls to the MPI APIs on stdout and in the logfile. (Default: false)
```
"saveTime": bool
```
Mandatory: No\
Exec: parSMURF1 / parSMURFn\
Option for saving a report of the computation time of the run (Default: false)
```
"timeFile": string
```
Mandatory: Yes, if "saveTime" is set to true\
Exec: parSMURF1 / parSMURFn\
File name for saving the execution time report
```
"printCfg": bool
```
Mandatory: No\
Exec: parSMURF1 / parSMURFn\
Option for printing a detailed description of the run before it starts (Default: false)
```
"mode": string
```
Mandatory: Yes\
Exec: parSMURF1 / parSMURFn\
Execution mode. Allowed strings are:\
"cv": Dataset is split in folds, and evaluated in a process of k-fold cross validation. The run returns a set of predictions (default).\
"train": The whole dataset is treated as training set. The run returns a folder of trained models for later usage.\
"test": The whole dataset is treated as test set. It is mandatory to submit a directory of trained models to perform the evaluation. The run returns a set of predictions.\
Note that the autotuning of the learning parameters is available only for "cv" mode
```
"optimizer": string
```
Mandatory: Yes\
Exec: parSMURF1 / parSMURFn\
Hyper-parameter optimization strategy. Allowed strings are:\
"no": external cross-validation only (default)\
"grid": automatic tuning of the learning parameters by grid search in the internal cross validation loop\
"autogp": automatic tuning of the learning parameters by Bayesian optimization (Gaussian process) in the internal cross validation loop
##### "data"
```
"data": {
"dataFile": string,
"foldFile": string,
"labelFile": string,
"outFile": string,
"forestDir": string
}
```
Mandatory: yes\
Exec: parSMURF1 / parSMURFn\
This field contains all the required information for reading data from and writing data to the filesystem.
```
"dataFile": string
```
Mandatory: Yes (No if simulation mode is enabled)\
Exec: parSMURF1 / parSMURFn\
Input data file
```
"foldFile": string
```
Mandatory: No\
Exec: parSMURF1 / parSMURFn\
Optional input file containing the fold division of the dataset
```
"labelFile": string
```
Mandatory: Yes (No, if simulation mode is enabled)\
Exec: parSMURF1 / parSMURFn\
Input file containing the labels of the examples of the dataset
```
"outFile": string
```
Mandatory: Yes (No, if in train mode)\
Exec: parSMURF1 / parSMURFn\
Output file containing the output predictions
```
"forestDir": string
```
Mandatory: No (Yes, if in train mode)\
Exec: parSMURF1 / parSMURFn\
Output directory for saving the trained models. Must be a valid directory on the filesystem.
##### "simulate"
```
"simulate": {
"simulation": bool,
"prob": float,
"n": int,
"m": int
},
```
Mandatory: no\
Exec: parSMURF1 / parSMURFn\
This field contains all the required information for enabling the internal dataset generator
```
"simulation": bool
```
Mandatory: No\
Exec: parSMURF1 / parSMURFn\
On true, it enables the internal dataset generator. The fields "dataFile", "foldFile" and "labelFile" are ignored and a random dataset is generated.
```
"prob": float
```
Mandatory: Yes if simulation mode is enabled\
Exec: parSMURF1 / parSMURFn\
This field represents the probability of generating a positive example. Must be a float in the [0,1] range, possibly very small for simulating highly unbalanced datasets
```
"n": int
```
Mandatory: Yes if simulation mode is enabled\
Exec: parSMURF1 / parSMURFn\
Number of examples to be generated
```
"m": int
```
Mandatory: Yes if simulation mode is enabled\
Exec: parSMURF1 / parSMURFn\
Number of features to be generated
##### "folds"
```
"folds": {
"nFolds": int,
"startingFold": int,
"endingFold": int
}
```
Mandatory: Yes (No, if "foldFile" specified)\
Exec: parSMURF1 / parSMURFn\
This section specifies the fold subdivision and the folds on which the run is executed.
```
"nFolds": int
```
Mandatory: Yes (No, if "foldFile" specified)\
Exec: parSMURF1 / parSMURFn\
This field specifies how many folds the dataset should be subdivided into. Ignored if "foldFile" has been declared.
```
"startingFold": int,
"endingFold": int
```
Mandatory: No\
Exec: parSMURF1 / parSMURFn\
These fields specify the starting and ending fold that parSMURF has to evaluate. This is useful for parallelizing runs across different folds. If unspecified, parSMURF performs the evaluation of the predictions on all folds.
##### "params"
```
"params": {
"nParts": array of int,
"fp": array of int,
"ratio": array of int,
"k": array of int,
"nTrees": array of int,
"mtry": array of int
},
```
Mandatory: Yes\
Exec: parSMURF1 / parSMURFn\
This field contains the learning parameters for the run. All values must be passed as arrays.\
When "optimizer" is set to "no", only one combination is used for the run.\
When "optimizer" is set to "grid", parSMURF generates all the possible hyper-parameter combinations and evaluates them in the internal CV loop.\
For a deeper explanation of each parameter, please refer to the article.
```
"nParts": array of int
```
Mandatory: Yes\
Exec: parSMURF1 / parSMURFn\
Number of partitions (ensembles)\
Default: 10
```
"fp": array of int
```
Mandatory: Yes\
Exec: parSMURF1 / parSMURFn\
Over-sampling factor (0 disables over-sampling)\
Default: 1
```
"ratio": array of int
```
Mandatory: Yes\
Exec: parSMURF1 / parSMURFn\
Under-sampling factor (0 disables under-sampling)\
Default: 1
```
"k": array of int
```
Mandatory: Yes\
Exec: parSMURF1 / parSMURFn\
Number of the nearest neighbors for SMOTE oversampling of the minority class\
Default: 5
```
"nTrees": array of int
```
Mandatory: Yes\
Exec: parSMURF1 / parSMURFn\
Number of trees in each ensemble\
Default: 10
```
"mtry": array of int
```
Mandatory: Yes\
Exec: parSMURF1 / parSMURFn\
mtry random forest parameter\
Default: sqrt(m)
##### "autogp_params"
```
"autogp_params": {
"nParts" : {
"name": "nParts",
"type": "int",
"min": int,
"max": int,
"size": 1
},
"fp" : {
"name": "fp",
"type": "int",
"min": int,
"max": int,
"size": 1
},
"ratio" : {
"name": "ratio",
"type": "int",
"min": int,
"max": int,
"size": 1
},
"k" : {
"name": "k",
"type": "int",
"min": int,
"max": int,
"size": 1
},
"numTrees" : {
"name": "numTrees",
"type": "int",
"min": int,
"max": int,
"size": 1
},
"mtry" : {
"name": "mtry",
"type": "int",
"min": int,
"max": int,
"size": 1
}
}
```
Mandatory: No (Yes, if "optimizer" is set to "autogp")\
Exec: parSMURF1 / parSMURFn\
This section is used for defining the search space of the Bayesian optimizer. It is composed of six sub-fields, each one defining the search space of one learning parameter. The only parts that can be modified are the "min" and "max" fields of each parameter.\
Every sub-field is mandatory. If the user needs to perform a partial search (i.e. tuning only some of the six parameters), please set the "min" and "max" values of the fixed parameters to the same value.
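Since only the "min" and "max" values change between runs, this section lends itself to being generated. The Python sketch below builds it from a mapping of ranges and shows a partial search in which four parameters are pinned by giving them equal "min" and "max". The helper name `autogp_params` and all the range values are hypothetical; the key names follow the listing above.

```python
import json

# Sub-field names as they appear in the "autogp_params" listing.
PARAM_NAMES = ["nParts", "fp", "ratio", "k", "numTrees", "mtry"]

def autogp_params(ranges):
    """Build the "autogp_params" section from a {name: (min, max)} mapping."""
    missing = [name for name in PARAM_NAMES if name not in ranges]
    if missing:
        raise ValueError("every sub-field is mandatory, missing: %s" % missing)
    section = {}
    for name in PARAM_NAMES:
        low, high = ranges[name]
        if low > high:
            raise ValueError("min must not exceed max for %s" % name)
        section[name] = {"name": name, "type": "int",
                         "min": low, "max": high, "size": 1}
    return section

# Partial search: tune nParts and fp, keep the other four parameters fixed.
section = autogp_params({
    "nParts": (10, 50),
    "fp": (1, 3),
    "ratio": (1, 1),
    "k": (5, 5),
    "numTrees": (10, 10),
    "mtry": (5, 5),
})
fragment = json.dumps({"autogp_params": section}, indent=4)
```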
---
### Data format
As previously stated, data is provided to the application in two or three files.
##### Data file
This file should contain the main data needed for computing the predictions. It consists of an n x m matrix of doubles, where n is the number of examples and m the number of features. The matrix is read row-wise, i.e.:
```
| m1 m2 m3 m4 ...
---------------------------
n1 | ------------>
n2 | ------------>
n3 |
n4 |
. |
. |
. |
```
Most, if not all, data files are already in this format, so just be sure that the number of features for each row is consistent across the samples.\
The number of features is detected from the file itself - actually, from the number of items read in the first row.\
All input files must be HEADERLESS.
##### Label file
This file should contain the labelling of the examples. It consists of n space or tab separated values, where n is the number of examples. It can also be a column vector file, i.e. newline separated values.\
It is a plain text file where each positive example is marked with "1" and negative examples with "0".
##### Fold file
This optional file should contain the fold subdivision. If specified, examples will be divided in folds as specified in this file. If not, a random stratified division will be performed. This file consists of n space or tab separated integer values, where n is the number of examples. It can also be a column vector file, i.e. newline separated values.\
It is a plain text file where each number represents the fold to which each example is assigned. Fold numbering starts from "0" (zero).
Note that specifying the fold file name overrides the "nFolds" option in the configuration file.
The following code snippet converts two R vectors into the corresponding labelling and folding files for proper use with this package:
```
write(vectorOfLabels, file = "labels.txt", sep = "\n")
write(vectorOfFolds, file = "folds.txt", sep = "\n")
```
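For users who prepare their data in Python rather than R, an equivalent can be sketched as follows. The helper names `write_vector` and `read_vector` are made up for this example; the files are written as headerless column vectors, one integer per line, as described above.

```python
import os
import tempfile

def write_vector(values, path):
    """Write one integer per line, with no header."""
    with open(path, "w") as handle:
        for value in values:
            handle.write("%d\n" % int(value))

def read_vector(path):
    """Read space-, tab- or newline-separated integers."""
    with open(path) as handle:
        return [int(token) for token in handle.read().split()]

# Round trip in a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    label_path = os.path.join(tmp, "labels.txt")
    fold_path = os.path.join(tmp, "folds.txt")
    write_vector([0, 1, 0, 0], label_path)   # 1 = positive, 0 = negative
    write_vector([0, 1, 2, 0], fold_path)    # fold numbering starts from 0
    labels_back = read_vector(label_path)
    folds_back = read_vector(fold_path)
```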
##### Output file
Predictions will be saved as plain text file.\
The output file consists of two columns of tab separated double values. For each sample, the probabilities of belonging to each class are saved: each value in the first column represents the probability of the associated sample being in the minority class, while each value in the second column represents the probability of it being in the majority class.
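A prediction file in this format can be post-processed with a few lines of Python. The sketch below reads the first column and scores it against known labels with average precision, used here as a simple stand-in for the AUPRC; this is an assumption of the example and not necessarily how parSMURF computes its own AUPRC. The function names and the toy predictions are made up, and ties between scores are not treated specially.

```python
import os
import tempfile

def read_minority_scores(path):
    """Return the first column (minority-class probability) of a prediction file."""
    scores = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if fields:  # skip blank lines
                scores.append(float(fields[0]))
    return scores

def average_precision(scores, labels):
    """Mean of the precision measured at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Tiny demo with made-up predictions for three samples.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "predictions.txt")
    with open(path, "w") as handle:
        handle.write("0.9\t0.1\n0.8\t0.2\n0.1\t0.9\n")
    scores = read_minority_scores(path)

ap = average_precision(scores, [1, 0, 1])
```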
Note about dimensionality:\
When reading data from file, parSMURF1 and parSMURFn automatically detect the number of samples and features, following these rules:
- at first, the number of samples is detected from the label file.
- then, the number of features is detected from the data file, by counting the items in the first row of the data file.
Hence, the sizes of these files should be consistent, otherwise a warning message is printed to the console.\
Also, the number of folds is detected from the fold file if specified. In this case, the option "nFolds" in the configuration file is ignored, and the total number of folds will be equal to the number of the total unique elements of the fold file.
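The detection rules above can be mirrored in a small pre-flight check, useful for catching inconsistent input files before launching a long run. This is a sketch only: the function name `detect_dimensions` is hypothetical, and whitespace-separated values in the data file are an assumption, since the README does not state the delimiter.

```python
import os
import tempfile

def detect_dimensions(data_path, label_path, fold_path=None):
    """Return (n_samples, n_features, n_folds) following the rules above."""
    with open(label_path) as handle:
        n_samples = len(handle.read().split())        # one label per sample
    with open(data_path) as handle:
        n_features = len(handle.readline().split())   # items in the first row
    n_folds = None
    if fold_path is not None:
        with open(fold_path) as handle:
            n_folds = len(set(handle.read().split())) # unique fold ids
    return n_samples, n_features, n_folds

# Toy files: 3 samples, 2 features, 2 distinct folds.
with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "data.txt")
    label_path = os.path.join(tmp, "labels.txt")
    fold_path = os.path.join(tmp, "folds.txt")
    with open(data_path, "w") as handle:
        handle.write("0.1 0.2\n0.3 0.4\n0.5 0.6\n")
    with open(label_path, "w") as handle:
        handle.write("0\n1\n0\n")
    with open(fold_path, "w") as handle:
        handle.write("0 1 0\n")
    dims = detect_dimensions(data_path, label_path, fold_path)
```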
---
### Random dataset generation
parSMURF1 and parSMURFn are provided with a random dataset generator for testing purposes.\
When enabled, a random dataset will be created according to two normal distributions having the same variance but different mean values, depending on whether an example falls in the positive or negative class.\
The user enables this mode by using the `"simulation": true` option in the `"simulate"` section of the configuration file.\
The user must also specify the probability that an example belongs to the minority class (`"prob": float`) and the dimensionality of the dataset with the `"n": int` and `"m": int` options.\
An additional column will be added to the output file, containing the labelling that has been randomly generated according to the `"prob"` value.
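The idea behind the generator can be sketched in Python as follows. The function name `simulate_dataset`, the two mean values and the standard deviation are hypothetical, since the README does not state the values used internally; only the overall scheme (labels drawn with probability `prob`, features drawn from one of two normal distributions sharing the same variance) follows the description above.

```python
import random

def simulate_dataset(n, m, prob, seed=None,
                     pos_mean=1.0, neg_mean=0.0, sigma=1.0):
    """Generate n examples with m features and a 0/1 label per example.

    pos_mean, neg_mean and sigma are illustrative values, not parSMURF's.
    """
    rng = random.Random(seed)
    # Each example is positive with probability `prob`.
    labels = [1 if rng.random() < prob else 0 for _ in range(n)]
    # Same variance for both classes, different mean depending on the label.
    data = [
        [rng.gauss(pos_mean if label else neg_mean, sigma) for _ in range(m)]
        for label in labels
    ]
    return data, labels

# Sizes matching the simulCV.json description: 1200 examples, 25 features.
data, labels = simulate_dataset(n=1200, m=25, prob=0.02, seed=1)
```

Fixing the seed makes the generated dataset reproducible across runs of this sketch.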
---
### Examples
Folder cfgEx of the repository contains several examples of configuration files to be used either with parSMURF1 or parSMURFn.
- `simulCV.json` (for parSMURF1): it generates a random dataset of 1200 examples and 25 features; probability of a positive example is very low (0.02). Execute a 10-fold cross validation with random stratified fold sub-division. Learning parameters are fixed to: nParts = 10, fp = ratio = 1, k = 5, nTrees = 10, mtry = 5. Results are saved into the "predicitons.txt" file. Also a report of the execution time is generated in the timeout.txt. Seed fixed at 1. parSMURF spawns 4 threads for partition processing, and for each one of them it spawns another thread for random forest train and test.
- `simulCVn.json` (for parSMURFn): it generates a random dataset of 12000 examples and 75 features; probability of a positive example is very low (0.025). It executes a 10-fold cross validation with random stratified fold sub-division. Learning parameters are fixed to: nParts = 100, fp = 1, ratio = 2, k = 3, nTrees = 100, mtry = 9. Results are saved into the "predicitons.txt" file. A report of the execution time is also generated in timeout.txt. Seed fixed at 1. It must be launched as `mpirun -n 5 ./parSMURFn --cfg simulCVn.json`, so that 4 worker processes are spawned, each one with 6 threads for partition processing, and for each of them 2 threads for random forest train and test are spawned. This run also logs all the MPI API calls to stdout.
- `dataFromFile.json` (for parSMURF1): execute a 10 fold cross validation over the dataset read from file. Fold subdivision is specified in the "folds.txt" file. No hyper-parameters autotuning.
- `gridTune.json` (for parSMURF1): data and labels are read from file. Folds are randomly generated. It executes a partial automatic tuning of the learning parameters over a 5-fold cross validation. Parameters to be tuned are: nParts, fp and mtry. This configuration generates 18 possible hyper-parameter combinations that are tested in the internal cross validation. AUPRC results for each combination are saved in the files "fold0.dat" to "fold4.dat". It also generates a prediction file containing the predictions for each fold obtained by the best hyper-parameter combination for that fold.
- `gridTune2.json` (for parSMURF1): as in `gridTune.json`, but the whole procedure is executed over folds 3 and 4 only.
- `train.json` (for parSMURFn): data and labels are read from file. Parameter "nFolds" is ignored. It treats the whole dataset as training set and generates a trained model. The model is saved in the "/home/user01/models/trainedModel/" folder. It must be launched as `mpirun -n 2 ./parSMURFn --cfg train.json`. 1 worker process, with 3 threads for partition processing and 4 for random forest train and test. Logs are more verbose than in the previous examples. Multithreading in the master process is disabled.
- `autoGpTune.json` (for parSMURFn): full auto-tuning of the learning parameters via Bayesian Optimization. Data, labels and fold sub-division are read from file. Parameter "nFolds" is ignored. "params" section of the config file is ignored as well. The parameter search space is defined as follows: nParts in [10, 50], fp in [1, 3], ratio in [1, 3], k in [2, 6], nTrees in [5, 10], mtry in [2, 5].
---
### License
This package is distributed under the GNU GPLv3 license. Please see the http://github.com/anacletolab/parSMURF/LICENSE file for the complete version of the license.
parSMURF includes several third-party libraries which are distributed with their own license. In particular, source code of the following libraries is included in this package:
**ANN: Approximate Nearest Neighbor Searching**\
David M. Mount and Sunil Arya\
Version 1.1.2\
(https://www.cs.umd.edu/~mount/ANN/) \
Modified and redistributed under the GNU Lesser General Public License v2.1\
Copy of the license is available in the src/ann_1.1.2 directory
**Ranger: A Fast Implementation of Random Forests**\
Marvin N. Wright\
Version 0.11.1\
(https://github.com/imbs-hl/ranger) \
Modified and redistributed under the MIT license\
Copy of the license is available in the src/ranger folder
**Spearmint**\
Jasper Snoek, Hugo Larochelle and Ryan P. Adams\
(https://github.com/JasperSnoek/spearmint/) \
Modified and redistributed under the GNU General Public License v3\
Copy of the license is available in the src/spearmint/spearmint folder
Also, parSMURF uses several libraries whose source code is not included in the package, but it is automatically downloaded at compile time. These libraries are:
**Easylogging++**\
Zuhd Web Services\
(https://github.com/zuhd-org/easyloggingpp) \
Distributed under the MIT license\
Copy of the license is available at the project homepage
**Jsoncons**\
Daniel Parker\
(https://github.com/danielaparker/jsoncons) \
Distributed under the Boost license\
Copy of the license is available at the project homepage
**zlib**\
Jean-loup Gailly and Mark Adler\
(https://github.com/madler/zlib) \
Distributed under the zlib license\
Copy of the license is available at the project homepage