{"id":28326049,"url":"https://github.com/anacletolab/parsmurf","last_synced_at":"2025-06-23T15:30:38.417Z","repository":{"id":60721389,"uuid":"179106878","full_name":"AnacletoLAB/parSMURF","owner":"AnacletoLAB","description":"High Performance Computing imbalance-aware machine learning tool for the genome-wide detection of pathogenic variants","archived":false,"fork":false,"pushed_at":"2020-12-08T10:17:06.000Z","size":3986,"stargazers_count":7,"open_issues_count":3,"forks_count":2,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-06-02T07:28:28.694Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AnacletoLAB.png","metadata":{"files":{"readme":"readme.MD","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-04-02T15:26:31.000Z","updated_at":"2025-02-05T00:45:27.000Z","dependencies_parsed_at":"2022-10-03T20:00:27.610Z","dependency_job_id":null,"html_url":"https://github.com/AnacletoLAB/parSMURF","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/AnacletoLAB/parSMURF","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AnacletoLAB%2FparSMURF","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AnacletoLAB%2FparSMURF/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AnacletoLAB%2FparSMURF/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AnacletoLAB%2FparSMURF/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AnacletoLAB","download_url":"https://codeload.github.com/A
nacletoLAB/parSMURF/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AnacletoLAB%2FparSMURF/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261503961,"owners_count":23168732,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-05-25T22:09:39.393Z","updated_at":"2025-06-23T15:30:38.402Z","avatar_url":"https://github.com/AnacletoLAB.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# parSMURF\n\nThis package contains parSMURF, a High Performance Computing imbalance-aware machine learning tool for the genome-wide detection of pathogenic variants.\n\n---\n\n### Table of Contents\n\u003cpre\u003e\n\u003ca href=\"#Overview\"\u003eOverview\u003c/a\u003e\n\u003ca href=\"#Requirements\"\u003eRequirements\u003c/a\u003e\n\u003ca href=\"#Downloading-and-compiling\"\u003eDownloading and compiling\u003c/a\u003e\n\u003ca href=\"#General-architecture\"\u003eGeneral architecture\u003c/a\u003e\n\u003ca href=\"#Running-parSMURF\"\u003eRunning parSMURF\u003c/a\u003e\n\t\u003ca href=\"#Command line options\"\u003eCommand line options\u003c/a\u003e\n\t\u003ca href=\"#Running-parSMURF1\"\u003eRunning parSMURF1\u003c/a\u003e\n\t\u003ca href=\"#Running-parSMURFn\"\u003eRunning parSMURFn\u003c/a\u003e\n\t\u003ca href=\"#Running-the-Bayesian-optimizer\"\u003eRunning the Bayesian optimizer\u003c/a\u003e\n\t\u003ca href=\"#Configuration-file\"\u003eConfiguration file\u003c/a\u003e\n\t\t\u003ca href=\"#name\"\u003ename\u003c/a\u003e\n\t\t\u003ca 
href=\"#exec\"\u003eexec\u003c/a\u003e\n\t\t\u003ca href=\"#data\"\u003edata\u003c/a\u003e\n\t\t\u003ca href=\"#simulate\"\u003esimulate\u003c/a\u003e\n\t\t\u003ca href=\"#folds\"\u003efolds\u003c/a\u003e\n\t\t\u003ca href=\"#params\"\u003eparams\u003c/a\u003e\n\t\t\u003ca href=\"#autogp_params\"\u003eautogp_params\u003c/a\u003e\n\u003ca href=\"#Data-format\"\u003eData Format\u003c/a\u003e\n\t\u003ca href=\"#Data-file\"\u003eData file format\u003c/a\u003e\n\t\u003ca href=\"#Label-file\"\u003eLabel file format\u003c/a\u003e\n\t\u003ca href=\"#Fold-file\"\u003eFold file format\u003c/a\u003e\n\t\u003ca href=\"#Output-file\"\u003eOutput file format\u003c/a\u003e\n\u003ca href=\"#Random-dataset-generation\"\u003eRandom dataset generation\u003c/a\u003e\n\u003ca href=\"#Examples\"\u003eExamples\u003c/a\u003e\n\u003ca href=\"#Licensa\"\u003eLicense\u003c/a\u003e\n\u003c/pre\u003e\n\n---\n\n### Overview\nparSMURF is a fast and scalable C++ implementation of the HyperSMURF algorithm (hyper-ensemble of SMOTE Undersampled Random Forests), an ensemble approach explicitly designed to deal with the huge imbalance between deleterious and neutral variants.\n\nThe algorithm is outlined in the following papers:\\\nA. Petrini, M. Mesiti, M. Schubach, M. Frasca, D. Danis, M. Re, G. Grossi, L. Cappelletti, T. Castrignanò, P. N. Robinson, and G. Valentini, \"parSMURF, a high-performance computing tool for the genome-wide detection of pathogenic variants\", GigaScience, vol. 9, 05 2020. giaa052.\\\nhttps://doi.org/10.1093/gigascience/giaa052\n\nMax Schubach, Matteo Re, Peter N. 
Robinson \u0026 Giorgio Valentini, \"Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants\", Scientific Reports, 2017/06/07\\\nhttps://www.nature.com/articles/s41598-017-03011-5\n\nTwo variants of parSMURF are currently available in this repository:\n- \"parSMURF1\" is a fast multi-threaded implementation of the algorithm and is meant to be run on a single machine\n- \"parSMURFn\" is a multi-threaded and parallel implementation (under the MPI programming paradigm) and is meant to be run on a single machine or on a cluster\n\nBoth versions share the same design and functionalities outlined in the paper, in particular:\n- fast, optimized and scalable C++ implementation\n- automatic tuning of the learning parameters by grid search or by means of a Bayesian optimizer\n\n---\n\n### Requirements\n\nparSMURF is designed for x86-64 and Intel Xeon Phi architectures running Linux.\\\nThis software is distributed as source code.\n\nA compiler which supports the C++11 language specification is required. It has been tested with GCC (vers. \u003e= 5) and Intel CC (2015, 2017 and 2019).\\\nCode is also optimized for Intel Xeon Phi architectures, and it has been successfully tested on Knights Landing family processors.\n\nMultithreading and multiprocessing are managed differently in parSMURF1 and parSMURFn: the former is a multithread-only implementation and thread management is performed through OpenMP APIs. Any reasonably recent compiler has OpenMP support already built in, hence this requirement is usually met. parSMURFn, instead, is a multiprocess and multithread implementation of the algorithm. Thread management is performed by the Linux built-in pthread library and multiprocessing is performed through the MPI APIs. Hence, for compilation and running, parSMURFn requires an implementation of the MPI standard. 
It has been tested with OpenMPI 1.10.3, OpenMPI 2.0, IntelMPI 2016, IntelMPI 2017 and IntelMPI 2019.\n\nNotice that MPI is not required for parSMURF1, hence if no MPI libraries are found on the target system, it is still possible to compile and run this version of the software.\n\nOn Ubuntu, it is possible to install the OpenMPI library via the apt package manager:\n```\nsudo apt-get install openmpi-bin openmpi-common libopenmpi-dev\n```\nMakefiles are generated by the cmake (vers. \u003e= 2.8) utility. On Ubuntu it is possible to install this package via apt:\n```\nsudo apt-get install cmake\n```\nBayesian optimization is done by the Spearmint package. This package requires Python 2 and depends on several Python packages. The best way to use this feature is by creating and configuring a Python virtual environment and installing the required Python packages there. On Ubuntu:\n```\nsudo apt-get install virtualenv\n\u003cmove to an appropriate folder\u003e\nvirtualenv parSMURFvenv -p /usr/bin/python2\t#This command creates a parSMURFvenv directory\nsource parSMURFvenv/bin/activate\t\t#This command activates the virtual environment\npip install numpy==1.13.0\t\t\t#The following commands install the required packages in the virtual environment\npip install scipy==1.2.1\npip install weave==0.17.0\npip install six==1.12.0\npip install protobuf==3.7.1\ndeactivate\t\t\t\t\t#Deactivate the virtual environment\n```\n\nparSMURF uses several external libraries that are included as source code in this repository or are automatically downloaded and compiled. In particular, the following libraries are included:\n- ANN: A Library for Approximate Nearest Neighbor Searching, by David M. Mount and Sunil Arya, Version 1.1.2. The modified version is supplied in the src/ann_1.1.2 directory. 
This version has been adapted for multi-thread execution, since the original package available at https://www.cs.umd.edu/~mount/ANN/ is not thread safe and is not compatible with this package.\n- Ranger: A Fast Implementation of Random Forests, by Marvin N. Wright, Version 0.11.2. The modified version, stripped of the R code, is supplied in the src/ranger directory. The main codebase is located at https://github.com/imbs-hl/ranger\n- Spearmint: a Python package to perform Bayesian optimization, by Jasper Snoek. The original version at https://github.com/JasperSnoek/spearmint seems to be no longer maintained and needed a few updates to run with parSMURF.\n\nThe following libraries are not included in this code repository, but are automatically downloaded during the compilation process:\n- easylogging++: A single header C++ logging library, by Zuhd Web Services. Automatically cloned into src/easyloggingpp and compiled from https://github.com/zuhd-org/easyloggingpp\n- jsoncons: A C++, header-only library for constructing JSON and JSON-like text and binary data formats, by Daniel Parker. Automatically cloned into src/jsoncons and compiled from https://github.com/danielaparker/jsoncons\n- zlib: A massively spiffy yet delicately unobtrusive compression library, by Jean-loup Gailly and Mark Adler. Automatically cloned into src/zlib and compiled from https://github.com/madler/zlib\n\nAll the libraries have been modified and redistributed according to their own licenses. 
For each included library, a copy of the associated license is contained in the corresponding library folder.\n\n---\n\n### Downloading and compiling\n\nDownload the latest version from this page or clone the git repository:\n\n\tgit clone https://github.com/anacletolab/parSMURF\n\n\nOnce the package has been downloaded, move to the main directory, create a build dir, invoke cmake and build the software (the `-j n` make option enables multithread compilation over n threads):\n\n\tcd parSMURF\n\tmkdir build\n\tcd build\n\tcmake ../src\n\tmake -j 4\n\nThis will generate two executables: \"parSMURF1\" and \"parSMURFn\".\n\nFor a quick test, launch the following command from the build directory:\n```\n./parSMURF1 --cfg ../cfgEx/simulCV.json\n```\n\n---\n\n### General architecture\n\nWhile both versions strictly follow the HyperSMURF paper (see the Overview section) and its original R implementation (available on the CRAN repository https://cran.r-project.org/web/packages/hyperSMURF/index.html), the novelty of this package lies in the fast C++ code and in the parallel execution, which lead to a dramatic decrease of the computing time while keeping the same results, in terms of quality of prediction, as the original implementation. Also, it features two different approaches for automatically finding the best learning parameters.\n\nExecution roughly follows this scheme:\n```\n- data reading from file(s) (or random dataset generation)\n- folds and partitions generation [by index!]\n- for each fold\n---- for each partition in the current fold\n---- ---- over-sampling of the minority class and under-sampling of the majority class\n---- ---- random forest training\n---- ---- random forest test\n---- prediction accumulation\n- prediction averaging\n```\n\nResults are evaluated according to an n-fold cross validation process. Folds can be randomly generated (the user is free to specify the number of folds) or can be read from a file. When randomly generated, folds are stratified, i.e. 
the generation algorithm tries to evenly distribute the number of positive examples amongst the folds.\n\nParallelization happens at partition level: since the SMOTE algorithm and the subsequent RF train and test stages are almost embarrassingly parallel inside each fold (i.e. they require the same operations to be performed on different data, with no synchronization points or data communication involved), these steps can be executed concurrently for each partition belonging to the same fold.\n\nIn parSMURF1, this process is parallelized by means of multi-threading. As an example, if the user specifies x partitions and y processing threads, each thread is assigned x/y partitions, which it processes sequentially. If enough CPU cores are available, all threads execute concurrently, leading to an almost linear speed-up, especially on CPUs characterized by a high number of cores, like the Intel Xeon Phi family of processors.\n\nParallelization in parSMURFn follows the same model, which is further expanded for exploiting the computational power of several processing nodes in a cluster. The execution scheme follows a simple master-worker model, where a single master MPI process reads the data from file and delegates the processing of each partition (SMOTE and RF steps) to k working MPI processes. The master process also manages the collection and accumulation of the predictions from the working processes. Moreover, as in parSMURF1, processing of the partitions in each working process is parallelized by means of multi-threading.\n\nAs an example, suppose that the user specifies x partitions, k working processes and y processing threads for each working process.\nThe master process assigns a \"chunk\" of x/k partitions to each working process and sends it the relevant data for the computation. Inside each working process, each chunk is further divided amongst the thread pool, and each thread is assigned (x/k)/y partitions. 
Predictions for each chunk are locally accumulated inside each working process and are sent back to the master process only once the work for the chunk is finished.\n\nSeveral strategies have been used to minimize latencies due to data transmission or broadcasting between the master and working processes, including:\n- the master process sends only the data strictly needed for the computation of each partition; moreover, it is sent as a single big array with a header, instead of several small chunks.\n- sends and receives in the master process are managed in two different threads, hence interleaving data preparation and transmission with data reception.\n- sends in the master process can be single- or multi-threaded: in the latter case, the master process spawns a number of threads equal to the number of working processes, and each of these threads prepares the data and sends it to the corresponding worker, concurrently. This is the default operation mode, but it might be memory consuming, therefore a configuration option (\"noMtSender\") to disable this feature is provided.\n\nparSMURF features two subsystems for the automatic fine tuning of the learning parameters, aimed at maximizing the prediction performance of the algorithm. The first strategy is an exhaustive grid search: given a set of values for each hyper-parameter, the set of all the possible combinations of hyper-parameters is calculated, and each combination is evaluated through internal cross validation. The other strategy is Bayesian optimization: given a range for each hyper-parameter, the Bayesian optimizer generates a sequence of candidates that tends toward a probable global maximum. 
A high-level view of the execution is given by this pseudo-code snippet:\n\n```\niter = 0\n- while (iter \u003c maxIter) and (error \u003e tolerance):\n-- BO generates a new possible candidate of hyper-parameters h\n-- evaluation of h in a context of internal cross validation\n-- submit (h, AUPRC(h)) to the BO\n-- iter \u003c- iter + 1\n```\n\nBoth strategies are performed in a context of internal cross validation, hence they are performed for each fold of the external CV.\nThe output of the procedure is the set of best learning parameters for each fold of the external cross validation.\n\n---\n\n### Running parSMURF\nparSMURF is a command line executable.\\\nAll the options are submitted to the main executable through a configuration file written in JSON format.\n\n#### Command line options\nOnly two command line options are available, since every other parameter or option is defined by JSON configuration files.\\\n`--cfg \u003cfilename\u003e` specifies the configuration file for the run\\\n`--help` prints a brief help screen\n\n#### Running parSMURF1\nparSMURF1 does not require anything special to run, besides a proper configuration file. Hence, it can be launched as follows:\n```\n./parSMURF1 --cfg \u003cconfigFile.json\u003e\n```\n\n#### Running parSMURFn\nparSMURFn requires MPI to be installed on the target system or on all the nodes of a cluster. It must be invoked with `mpirun` or, depending on the scheduling system installed on the cluster, with a proper mpirun wrapper.\\\nThe `-n` option of `mpirun` specifies how many processes have to be launched. parSMURFn requires at least two processes, one as master and one as worker. 
As an example:\n```\nmpirun -n 5 ./parSMURFn --cfg \u003cconfigFile.json\u003e\n```\nlaunches an instance of parSMURFn over 5 processes (one master and four workers).\\\nFor now, the number of master processes is limited to one.\n\n#### Running the Bayesian optimizer\nUsing the Bayesian optimizer requires more effort, but we are working on making the whole procedure more user friendly.\\\nAs noted in the \"Requirements\" section, it may be preferable to set up a Python virtual environment and launch parSMURF1 or parSMURFn from there.\\\nAlso, the entire `src/spearmint` folder must be copied into the same directory as the parSMURF executable.\\\nAs a final requirement, the environment variable PYTHONPATH must contain the path to the Spearmint folder.\n\nAs an example, assume that the git repository has been copied to `/home/user01/git/parSMURF` and the package has been successfully compiled in the `/home/user01/git/parSMURF/build` directory. Also, assume that a Python virtual environment has been created and is located at `/home/user01/pythonVenvs/parSMURFvenv`.\\\nTo prepare a folder containing everything parSMURF needs to run, do the following:\n```\nmkdir /home/user01/parSMURFexp\ncd /home/user01/parSMURFexp\ncp /home/user01/git/parSMURF/build/parSMURF1 .\ncp /home/user01/git/parSMURF/build/parSMURFn .\ncp -r /home/user01/git/parSMURF/src/spearmint .\n```\nNow, to launch an experiment with the Bayesian optimizer, do the following:\n```\ncd /home/user01/parSMURFexp\nexport PYTHONPATH=$PYTHONPATH:/home/user01/parSMURFexp/spearmint/spearmint:/home/user01/parSMURFexp/spearmint/spearmint/spearmint\nsource /home/user01/pythonVenvs/parSMURFvenv/bin/activate\n\u003claunch parSMURF1 or parSMURFn as stated earlier\u003e\ndeactivate\n```\n\n---\n\n#### Configuration file\n\nparSMURF1 and parSMURFn use configuration files in JSON format for setting the parameters of each run.\\\nExamples of configuration files are available in the cfgEx 
folder of the repository.\n\nA configuration file is composed of seven top-level fields:\n```\n{\n\t\"name\": ...,\n\t\"exec\": {...},\n\t\"data\": {...},\n\t\"folds\": {...},\n\t\"simulate\": {...},\n\t\"params\": {...},\n\t\"autogp_params\": {...}\n}\n```\nDepending on the configuration itself, some fields are not mandatory and can be left out.\n\n##### \"name\"\n```\n\t\"name\": string\n```\nMandatory: No\\\nExec: parSMURF1 / parSMURFn\\\nA string for labeling the experiment.\n\n##### \"exec\"\n```\n\t\"exec\": {\n\t\t\"name\": string,\n\t\t\"nProcs\": int,\n\t\t\"ensThrd\": int,\n\t\t\"rfThrd\": int,\n\t\t\"noMtSender\": bool,\n\t\t\"seed\": int,\n\t\t\"verboseLevel\": int,\n\t\t\"verboseMPI\": bool,\n\t\t\"saveTime\": bool,\n\t\t\"timeFile\": string,\n\t\t\"printCfg\": bool,\n\t\t\"mode\": string,\n\t\t\"optimizer\": string\n\t},\n```\nMandatory: Yes\\\nExec: parSMURF1 / parSMURFn\\\nGeneral configuration of the run.\n\n\n```\n\t\"name\": string\n```\nMandatory: No\\\nExec: parSMURF1 / parSMURFn\\\nLabel used for marking the name of the executable (parSMURF1 or parSMURFn). It does not affect the computation itself, since this field is ignored by the JSON parser.\n\n\n```\n\t\"nProcs\": int\n```\nMandatory: No\\\nExec: parSMURFn\\\nLabel used for marking the number of processes for a run of parSMURFn. It does not affect the computation itself, since the total number of processes is detected at runtime by the MPI APIs.\n\n\n```\n\t\"ensThrd\": int\n```\nMandatory: Yes\\\nExec: parSMURF1 / parSMURFn\\\nNumber of threads assigned to perform the partition processing.\n\n\n```\n\t\"rfThrd\": int\n```\nMandatory: Yes\\\nExec: parSMURF1 / parSMURFn\\\nNumber of threads assigned to perform the random forest train and test.\n\n\n```\n\t\"noMtSender\": bool\n```\nMandatory: No\\\nExec: parSMURFn\\\nThis option disables multithreaded sends in the master process. 
It may affect performance, but it may be necessary when processing particularly large datasets.\n\n\n```\n\t\"seed\": int\n```\nMandatory: No\\\nExec: parSMURF1 / parSMURFn\\\nOptional seed for the random number generators. If unspecified, a random seed is generated.\n\n\n```\n\t\"verboseLevel\": int\n```\nMandatory: No\\\nExec: parSMURF1 / parSMURFn\\\nLevel of verbosity on stdout and in the logfile of the computational task. Range is 0-3 (default: 0).\n\n\n```\n\t\"verboseMPI\": bool\n```\nMandatory: No\\\nExec: parSMURFn\\\nLogs the calls to MPI APIs on stdout and in the logfile (Default: false)\n\n\n```\n\t\"saveTime\": bool\n```\nMandatory: No\\\nExec: parSMURF1 / parSMURFn\\\nOption for saving a report of the computation time of the run (Default: false)\n\n\n```\n\t\"timeFile\": string\n```\nMandatory: Yes, if \"saveTime\" is set to true\\\nExec: parSMURF1 / parSMURFn\\\nFile name for saving the execution time report\n\n\n```\n\t\"printCfg\": bool\n```\nMandatory: No\\\nExec: parSMURF1 / parSMURFn\\\nOption for printing a detailed description of the run before it starts (Default: false)\n\n\n```\n\t\"mode\": string\n```\nMandatory: Yes\\\nExec: parSMURF1 / parSMURFn\\\nExecution mode. Allowed strings are:\\\n\"cv\": The dataset is split into folds and evaluated by k-fold cross validation. The run returns a set of predictions (default).\\\n\"train\": The whole dataset is treated as training set. The run returns a folder of trained models for later usage.\\\n\"test\": The whole dataset is treated as test set. It is mandatory to submit a directory of trained models to perform the evaluation. The run returns a set of predictions.\\\nNote that the autotuning of the learning parameters is available only in \"cv\" mode.\n\n\n```\n\t\"optimizer\": string\n```\nMandatory: Yes\\\nExec: parSMURF1 / parSMURFn\\\nHyper-parameter optimization mode. 
Allowed strings are:\\\n\"no\": external cross-validation only (default)\\\n\"grid\": automatic tuning of the learning parameters by grid search in the internal cross validation loop\\\n\"autogp\": automatic tuning of the learning parameters by Bayesian optimization (Gaussian process) in the internal cross validation loop\n\n\n##### \"data\"\n```\n\"data\": {\n\t\"dataFile\": string,\n\t\"foldFile\": string,\n\t\"labelFile\": string,\n\t\"outFile\": string,\n\t\"forestDir\": string\n}\n```\nMandatory: Yes\\\nExec: parSMURF1 / parSMURFn\\\nThis field contains all the information required for reading input data and writing output data.\n\n\n```\n\t\"dataFile\": string\n```\nMandatory: Yes (No, if simulation mode is enabled)\\\nExec: parSMURF1 / parSMURFn\\\nInput data file\n\n\n```\n\t\"foldFile\": string\n```\nMandatory: No\\\nExec: parSMURF1 / parSMURFn\\\nOptional input file containing the fold division of the dataset\n\n\n```\n\t\"labelFile\": string\n```\nMandatory: Yes (No, if simulation mode is enabled)\\\nExec: parSMURF1 / parSMURFn\\\nInput file containing the labels of the examples of the dataset\n\n\n```\n\t\"outFile\": string\n```\nMandatory: Yes (No, if in train mode)\\\nExec: parSMURF1 / parSMURFn\\\nOutput file containing the output predictions\n\n\n```\n\t\"forestDir\": string\n```\nMandatory: No (Yes, if in train mode)\\\nExec: parSMURF1 / parSMURFn\\\nOutput directory for saving the trained models. Must be a valid directory on the filesystem.\n\n\n##### \"simulate\"\n```\n\"simulate\": {\n\t\"simulation\": bool,\n\t\"prob\": float,\n\t\"n\": int,\n\t\"m\": int\n},\n```\nMandatory: No\\\nExec: parSMURF1 / parSMURFn\\\nThis field contains all the information required for enabling the internal dataset generator.\n\n\n```\n\t\"simulation\": bool\n```\nMandatory: No\\\nExec: parSMURF1 / parSMURFn\\\nOn true, it enables the internal dataset generator. 
The fields \"dataFile\", \"foldFile\" and \"labelFile\" are ignored and a random dataset is generated.\n\n\n```\n\t\"prob\": float\n```\nMandatory: Yes, if simulation mode is enabled\\\nExec: parSMURF1 / parSMURFn\\\nThis field represents the probability of generating a positive example. Must be a float in the [0,1] range, possibly very small for simulating highly unbalanced datasets.\n\n\n```\n\t\"n\": int\n```\nMandatory: Yes, if simulation mode is enabled\\\nExec: parSMURF1 / parSMURFn\\\nNumber of examples to be generated\n\n\n```\n\t\"m\": int\n```\nMandatory: Yes, if simulation mode is enabled\\\nExec: parSMURF1 / parSMURFn\\\nNumber of features to be generated\n\n\n##### \"folds\"\n```\n\"folds\": {\n\t\"nFolds\": int,\n\t\"startingFold\": int,\n\t\"endingFold\": int\n}\n```\nMandatory: Yes (No, if \"foldFile\" specified)\\\nExec: parSMURF1 / parSMURFn\\\nThis section specifies the fold subdivision and the folds on which the run is executed.\n\n\n```\n\t\"nFolds\": int\n```\nMandatory: Yes (No, if \"foldFile\" specified)\\\nExec: parSMURF1 / parSMURFn\\\nThis field specifies into how many folds the dataset should be subdivided. Ignored if \"foldFile\" has been declared.\n\n\n```\n\t\"startingFold\": int,\n\t\"endingFold\": int\n```\nMandatory: No\\\nExec: parSMURF1 / parSMURFn\\\nThese fields specify the starting and ending fold that parSMURF has to evaluate. This is useful for parallelizing runs across different folds. If unspecified, parSMURF performs the evaluation of the predictions on all folds.\n\n\n##### \"params\"\n```\n\"params\": {\n\t\"nParts\": array of int,\n\t\"fp\": array of int,\n\t\"ratio\": array of int,\n\t\"k\": array of int,\n\t\"nTrees\": array of int,\n\t\"mtry\": array of int\n},\n```\nMandatory: Yes\\\nExec: parSMURF1 / parSMURFn\\\nThis field contains the learning parameters for the run. 
All values must be passed as arrays.\\\nWhen \"optimizer\" is set to \"no\", only one combination is used for the run.\\\nWhen \"optimizer\" is set to \"grid\", parSMURF generates all the possible hyper-parameter combinations and evaluates them in the internal CV loop.\\\nFor a deeper explanation of each parameter, please refer to the article.\n\n\n```\n\t\"nParts\": array of int\n```\nMandatory: Yes\\\nExec: parSMURF1 / parSMURFn\\\nNumber of partitions (ensembles)\\\nDefault: 10\n\n\n```\n\t\"fp\": array of int\n```\nMandatory: Yes\\\nExec: parSMURF1 / parSMURFn\\\nOver-sampling factor (0 disables over-sampling)\\\nDefault: 1\n\n\n```\n\t\"ratio\": array of int\n```\nMandatory: Yes\\\nExec: parSMURF1 / parSMURFn\\\nUnder-sampling factor (0 disables under-sampling)\\\nDefault: 1\n\n\n```\n\t\"k\": array of int\n```\nMandatory: Yes\\\nExec: parSMURF1 / parSMURFn\\\nNumber of nearest neighbors for SMOTE over-sampling of the minority class\\\nDefault: 5\n\n\n```\n\t\"nTrees\": array of int\n```\nMandatory: Yes\\\nExec: parSMURF1 / parSMURFn\\\nNumber of trees in each ensemble\\\nDefault: 10\n\n\n```\n\t\"mtry\": array of int\n```\nMandatory: Yes\\\nExec: parSMURF1 / parSMURFn\\\nmtry random forest parameter\\\nDefault: sqrt(m)\n\n\n##### \"autogp_params\"\n```\n\t\"autogp_params\":\n\t\t\"nParts\" : {\n\t\t\t\"name\": \"nParts\",\n\t\t\t\"type\": \"int\",\n\t\t\t\"min\": int,\n\t\t\t\"max\": int,\n\t\t\t\"size\": 1\n\t\t},\n\t\t\"fp\" : {\n\t\t\t\"name\": \"fp\",\n\t\t\t\"type\": \"int\",\n\t\t\t\"min\": int,\n\t\t\t\"max\": int,\n\t\t\t\"size\": 1\n\t\t},\n\t\t\"ratio\" : {\n\t\t\t\"name\": \"ratio\",\n\t\t\t\"type\": \"int\",\n\t\t\t\"min\": int,\n\t\t\t\"max\": int,\n\t\t\t\"size\": 1\n\t\t},\n\t\t\"k\" : {\n\t\t\t\"name\": \"k\",\n\t\t\t\"type\": \"int\",\n\t\t\t\"min\": int,\n\t\t\t\"max\": int,\n\t\t\t\"size\": 1\n\t\t},\n\t\t\"numTrees\" : {\n\t\t\t\"name\": \"numTrees\",\n\t\t\t\"type\": \"int\",\n\t\t\t\"min\": int,\n\t\t\t\"max\": int,\n\t\t\t\"size\": 
1\n\t\t},\n\t\t\"mtry\" : {\n\t\t\t\"name\": \"mtry\",\n\t\t\t\"type\": \"int\",\n\t\t\t\"min\": int,\n\t\t\t\"max\": int,\n\t\t\t\"size\": 1\n\t\t}\n```\nMandatory: No (Yes, if \"optimizer\" is set to \"autogp\")\\\nExec: parSMURF1 / parSMURFn\\\nThis section is used for defining the search space of the Bayesian optimizer. It is composed of six sub-fields, each one defining the search space of one learning parameter. The only parts that can be modified are the \"min\" and \"max\" fields of each parameter.\\\nEvery sub-field is mandatory. To perform a partial search (i.e. tuning only some of the six parameters), set the \"min\" and \"max\" values of the fixed parameters to the same value.\n\n---\n\n### Data format\n\nAs previously stated, data is provided to the application in two or three files.\n\n##### Data file\nThis file should contain the main data needed for computing the predictions. It consists of an n x m matrix of doubles, where n is the number of examples and m the number of features. The matrix is read row-wise, i.e.:\n```\n   | m1   m2   m3   m4 ...\n---------------------------\nn1 | ------------\u003e\nn2 | ------------\u003e\nn3 |\nn4 |\n.  |\n.  |\n.  |\n```\nMost, if not all, data files are in this format, so just be sure that the number of features for each row is consistent across the samples.\\\nThe number of features is detected from the file itself, specifically from the number of items read in the first row.\\\nAll input files must be HEADERLESS.\n\n##### Label file\nThis file should contain the labelling of the examples. It consists of n space or tab separated values, where n is the number of examples. It can also be a column vector file, i.e. newline separated values.\\\nIt is a plain text file where each positive example is marked with \"1\" and each negative example with \"0\".\n\n##### Fold file\nThis optional file should contain the fold subdivision. If specified, examples will be divided into folds as specified in this file. 
If not, a random stratified division will be performed. This file consists of n space or tab separated integer values, where n is the number of examples. It can also be a column vector file, i.e. newline separated values.\\\nIt is a plain text file where each number represents the fold to which each example is assigned. Fold numbering starts from \"0\" (zero).\nNote that specifying the fold file name overrides the \"nFolds\" option in the configuration file.\n\nThe following code snippet converts two R vectors into the corresponding labelling and folding files for proper use with this package:\n```\nwrite(vectorOfLabels, file = \"labels.txt\", sep = \"\\n\")\nwrite(vectorOfFolds, file = \"folds.txt\", sep = \"\\n\")\n```\n##### Output file\nPredictions will be saved as a plain text file.\\\nThe output file consists of two columns of tab separated double values. For each sample, the probabilities of belonging to each class are saved: each value in the first column represents the probability that the associated sample is in the minority class, while each value in the second column represents the probability that it is in the majority class.\n\nNote about dimensionality:\\\nWhen reading data from file, parSMURF1 and parSMURFn automatically detect the number of samples and features, following these rules:\n- at first, the number of samples is detected from the label file.\n- then, the number of features is detected from the data file, by counting the items in the first row of the data file.\n\nHence, the sizes of these files should be consistent, otherwise a warning message is printed to the console.\\\nAlso, the number of folds is detected from the fold file if specified. 
In this case, the option \"nFolds\" in the configuration file is ignored, and the total number of folds will be equal to the number of unique values in the fold file.\n\n---\n\n### Random dataset generation\n\nparSMURF1 and parSMURFn are provided with a random dataset generator for testing purposes.\\\nWhen enabled, a random dataset is drawn from two normal distributions with the same variance but different means, depending on whether an example falls in the positive or negative class.\\\nThe user enables this mode by using the `\"simulate\": true` option in the configuration file.\\\nThe user must also specify the probability that an example belongs to the minority class (`\"prob\": float`) and the dimensionality of the dataset with the `\"n\": float` and `\"m\": int` options.\\\nAn additional column will be added to the output file, containing the labelling that has been randomly generated according to the `\"prob\"` value.\n\n---\n\n### Examples\nThe cfgEx folder of the repository contains several example configuration files to be used with either parSMURF1 or parSMURFn.\n- `simulCV.json` (for parSMURF1): it generates a random dataset of 1200 examples and 25 features; the probability of a positive example is very low (0.02). It executes a 10-fold cross validation with random stratified fold subdivision. Learning parameters are fixed to: nParts = 10, fp = ratio = 1, k = 5, nTrees = 10, mtry = 5. Results are saved into the \"predicitons.txt\" file. A report of the execution time is also written to timeout.txt. The seed is fixed at 1. parSMURF spawns 4 threads for partition processing, and each of them spawns one more thread for random forest training and testing.\n- `simulCVn.json` (for parSMURFn): it generates a random dataset of 12000 examples and 75 features; the probability of a positive example is very low (0.025). It executes a 10-fold cross validation with random stratified fold subdivision. 
Learning parameters are fixed to: nParts = 100, fp = 1, ratio = 2, k = 3, nTrees = 100, mtry = 9. Results are saved into the \"predicitons.txt\" file. A report of the execution time is also written to timeout.txt. The seed is fixed at 1. It must be launched as `mpirun -n 5 ./parSMURFn --cfg simulCVn.json`, so that 4 worker processes are spawned, each with 6 threads for partition processing, each of which spawns 2 threads for random forest training and testing. This execution also logs all the MPI API calls to stdout.\n- `dataFromFile.json` (for parSMURF1): it executes a 10-fold cross validation over the dataset read from file. The fold subdivision is specified in the \"folds.txt\" file. No hyper-parameter auto-tuning is performed.\n- `gridTune.json` (for parSMURF1): data and labels are read from file. Folds are randomly generated. It executes a partial automatic tuning of the learning parameters over a 5-fold cross validation. The parameters to be tuned are: nParts, fp and mtry. This configuration generates 18 possible hyper-parameter combinations that are tested in the internal cross validation. AUPRC results for each combination are saved in the files \"fold0.dat\" to \"fold4.dat\". It also generates a prediction file containing the predictions for each fold obtained by the best hyper-parameter combination for the respective fold.\n- `gridTune2.json` (for parSMURF1): as in `gridTune.json`, but the whole procedure is executed over folds 3 and 4 only.\n- `train.json` (for parSMURFn): data and labels are read from file. Parameter \"nFolds\" is ignored. It treats the whole dataset as the training set and generates a trained model. The model is saved in the \"/home/user01/models/trainedModel/\" folder. It must be launched as `mpirun -n 2 ./parSMURFn --cfg train.json`. It uses 1 worker process, with 3 threads for partition processing and 4 for random forest training and testing. Logs are more verbose than in the previous examples. 
Multithreading in the master process is disabled.\n- `autoGpTune.json` (for parSMURFn): full auto-tuning of the learning parameters via Bayesian Optimization. Data, labels and fold subdivision are read from file. Parameter \"nFolds\" is ignored. The \"params\" section of the config file is ignored as well. The parameter search space is defined as follows: nParts in [10, 50], fp in [1, 3], ratio in [1, 3], k in [2, 6], nTrees in [5, 10], mtry in [2, 5].\n\n---\n\n### License\n\nThis package is distributed under the GNU GPLv3 license. Please see the http://github.com/anacletolab/parSMURF/LICENSE file for the complete version of the license.\n\nparSMURF includes several third-party libraries which are distributed under their own licenses. In particular, source code of the following libraries is included in this package:\n\n**ANN: Approximate Nearest Neighbor Searching**\\\nDavid M. Mount and Sunil Arya\\\nVersion 1.1.2\\\n(https://www.cs.umd.edu/~mount/ANN/) \\\nModified and redistributed under the GNU Lesser General Public License v2.1\\\nCopy of the license is available in the src/ann_1.1.2 directory\n\n**Ranger: A Fast Implementation of Random Forests**\\\nMarvin N. Wright\\\nVersion 0.11.1\\\n(https://github.com/imbs-hl/ranger) \\\nModified and redistributed under the MIT license\\\nCopy of the license is available in the src/ranger folder\n\n**Spearmint**\\\nJasper Snoek, Hugo Larochelle and Ryan P. Adams\\\n(https://github.com/JasperSnoek/spearmint/) \\\nModified and redistributed under the GNU General Public License v3\\\nCopy of the license is available in the src/spearmint/spearmint folder\n\nAlso, parSMURF uses several libraries whose source code is not included in the package, but is automatically downloaded at compile time. 
These libraries are:\n\n**Easylogging++**\\\nZuhd Web Services\\\n(https://github.com/zuhd-org/easyloggingpp) \\\nDistributed under the MIT license\\\nCopy of the license is available at the project homepage\n\n**Jsoncons**\\\nDaniel Parker\\\n(https://github.com/danielaparker/jsoncons) \\\nDistributed under the Boost license\\\nCopy of the license is available at the project homepage\n\n**zlib**\\\nJean-loup Gailly and Mark Adler\\\n(https://github.com/madler/zlib) \\\nDistributed under the zlib license\\\nCopy of the license is available at the project homepage\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fanacletolab%2Fparsmurf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fanacletolab%2Fparsmurf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fanacletolab%2Fparsmurf/lists"}