{"id":15624764,"url":"https://github.com/xuanxu/nimbus","last_synced_at":"2025-04-13T10:44:27.308Z","repository":{"id":56885575,"uuid":"1471352","full_name":"xuanxu/nimbus","owner":"xuanxu","description":"Nimbus is a Ruby gem to implement Random Forest algorithms in a genomic selection context","archived":false,"fork":false,"pushed_at":"2024-04-06T17:46:15.000Z","size":570,"stargazers_count":58,"open_issues_count":0,"forks_count":10,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-04-11T12:11:49.406Z","etag":null,"topics":["random-forest"],"latest_commit_sha":null,"homepage":"https://github.com/xuanxu/nimbus","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/xuanxu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"MIT-LICENSE.txt","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATION.bib","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2011-03-12T11:45:48.000Z","updated_at":"2024-10-24T16:59:49.000Z","dependencies_parsed_at":"2024-06-11T20:31:57.681Z","dependency_job_id":"a950bd02-135d-4b8d-b31e-fa531a4f9719","html_url":"https://github.com/xuanxu/nimbus","commit_stats":{"total_commits":198,"total_committers":3,"mean_commits":66.0,"dds":"0.010101010101010055","last_synced_commit":"ec4760579ea4602756069cc8e49d2698953641c0"},"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xuanxu%2Fnimbus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xuanxu%2Fnimbus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xuanxu%2Fnimbus/re
leases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xuanxu%2Fnimbus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/xuanxu","download_url":"https://codeload.github.com/xuanxu/nimbus/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248604456,"owners_count":21131978,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["random-forest"],"created_at":"2024-10-03T10:00:49.448Z","updated_at":"2025-04-13T10:44:27.283Z","avatar_url":"https://github.com/xuanxu.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Nimbus\nRandom Forest algorithm for genomic selection.\n\n[![Build Status](https://github.com/xuanxu/nimbus/actions/workflows/tests.yml/badge.svg)](https://github.com/xuanxu/nimbus/actions/workflows/tests.yml)\n[![Gem Version](https://badge.fury.io/rb/nimbus.png)](http://badge.fury.io/rb/nimbus)\n[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/xuanxu/nimbus/blob/master/MIT-LICENSE.txt)\n[![DOI](http://joss.theoj.org/papers/10.21105/joss.00351/status.svg)](https://doi.org/10.21105/joss.00351)\n\n## Random Forest\n\nThe [random forest algorithm](http://en.wikipedia.org/wiki/Random_forest) is a classifier consisting in many random decision trees. It is based on choosing random subsets of variables for each tree and using the most frequent, or the averaged tree output as the overall classification. 
In machine learning terms, it is an ensemble classifier, so it uses multiple models to obtain better predictive performance than could be obtained from any of the constituent models.\n\nThe forest outputs the mean or the mode (in regression problems) or the majority class (in classification problems) of the outputs of the individual trees.\n\n## Genomic selection context\n\nNimbus is a Ruby gem implementing Random Forest in a genomic selection context, meaning every input file is expected to contain genotype and/or phenotype data from a sample of individuals.\n\nOther than the ids of the individuals, Nimbus handles the data as genotype values for [single-nucleotide polymorphisms](http://en.wikipedia.org/wiki/SNPs) (SNPs), so the variables in the classifier must have values of 0, 1 or 2, corresponding to the SNP classes AA, AB and BB.\n\nNimbus can be used to:\n\n* Create a random forest using a training sample of individuals with phenotype data.\n* Use an existing random forest to get predictions for a testing sample.\n\n## Learning algorithm\n\n**Training**: Each tree in the forest is constructed using the following algorithm:\n\n1. Let the number of training cases be N, and the number of variables (SNPs) in the classifier be M.\n1. Let m (the mtry parameter) be the number of input variables used to determine the decision at a node of the tree; m should be much less than M.\n1. Choose a training set for this tree by choosing N times with replacement from all N available training cases (i.e. take a bootstrap sample). Use the rest of the cases (Out Of Bag sample) to estimate the error of the tree, by predicting their classes.\n1. For each node of the tree, randomly choose m SNPs on which to base the decision at that node. Calculate the best split based on these m SNPs in the training set.\n1. Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier).\n1. 
When a node has no SNP split that minimizes the general loss function of the node, or the number of individuals in the node is less than the minimum node size, label the node with the average phenotype value of the individuals in the node.\n\n**Testing**: For prediction, a sample is pushed down the tree. It is assigned the label of the training sample in the terminal node it ends up in. This procedure is iterated over all trees in the ensemble, and the average vote of all trees is reported as the random forest prediction.\n\n## Regression and Classification\n\nNimbus can be used with both regression and classification problems.\n\n**Regression**: the default mode.\n\n* The split of nodes uses quadratic loss as the loss function.\n* Nodes are labeled by averaging the phenotype values of the individuals in the node.\n\n**Classification**: activated by declaring `classes` in the configuration file.\n\n* The split of nodes uses the Gini index as the loss function.\n* Nodes are labeled by finding the majority phenotype class of the individuals in the node.\n\n## Variable importances\n\nBy default Nimbus will estimate SNP importances every time a training file is run to create a forest.\n\nYou can disable this behaviour (and speed up the training process) by setting the parameter `var_importances: No` in the configuration file.\n\n## Installation\n\nYou need to have [Ruby](https://www.ruby-lang.org) (2.6 or higher) with Rubygems installed on your computer. 
Then install Nimbus with:\n\n````shell\n\u003e gem install nimbus\n````\n\nNo extra dependencies are needed.\n\n## Getting Started\n\nOnce you have nimbus installed on your system, you can run the gem using the `nimbus` executable:\n\n````shell\n\u003e nimbus\n````\n\nIt will look for these files in the directory where Nimbus is running:\n\n* `training.data`: If found it will be used to build a random forest.\n* `testing.data`: If found it will be pushed down the forest to obtain predictions for every individual in the file.\n* `random_forest.yml`: If found it will be the forest used for the testing instead of building one.\n* `config.yml`: A file detailing random forest parameters and datasets. If not found default values will be used.\n\nSo, to train a forest, a training file is needed. To do the testing you need two files: the testing file and one of the other two, the training file OR the random_forest file, because Nimbus needs a forest from which to obtain the predictions.\n\n## Configuration (config.yml)\n\nThe names for the input data files and the forest parameters can be specified in the `config.yml` file, which should be located in the directory where you are running `nimbus`.\n\nThe `config.yml` has the following structure and parameters:\n\n    #Input files\n    input:\n      training: training_regression.data\n      testing: testing_regression.data\n      forest: my_forest.yml\n      classes: [0, 1]\n\n    #Forest parameters\n    forest:\n      forest_size: 10 #how many trees\n      SNP_sample_size_mtry: 60 #mtry\n      SNP_total_count: 200\n      node_min_size: 5\n\n### Under the input chapter:\n\n * `training`: specify the path to the training data file (optional, if specified `nimbus` will create a random forest).\n * `testing`: specify the path to the testing data file (optional, if specified `nimbus` will traverse this data through a random forest).\n * `forest`: specify the path to a file containing a random forest structure 
(optional, if there is also a testing file, this will be the forest used for the testing).\n * `classes`: **optional (needed only for classification problems)**. Specify the list of classes in the input files as a comma-separated list between square brackets, e.g. `[A, B]`.\n\n### Under the forest chapter:\n\n * `forest_size`: number of trees for the forest.\n * `SNP_sample_size_mtry`: size of the random sample of SNPs to be used in every tree node.\n * `SNP_total_count`: total count of SNPs in the training and/or testing files.\n * `node_min_size`: minimum number of individuals in a tree node to make a split.\n * `var_importances`: **optional**. If set to `No` Nimbus will not calculate SNP importances.\n\n### Default values\n\nIf there is no config.yml file present, Nimbus will use these default values:\n\n````yaml\nforest_size:          300\ntree_SNP_sample_size: 60\ntree_SNP_total_count: 200\ntree_node_min_size:   5\ntraining_file: 'training.data'\ntesting_file:  'testing.data'\nforest_file:   'forest.yml'\n````\n\n## Input files\n\nThe three input files you can use with Nimbus should have the proper format:\n\n**The training file** has any number of rows, each representing data for an individual, with these columns:\n\n1. A column with the phenotype for the individual\n1. A column with the ID of the individual\n1. M columns (where M = SNP_total_count in `config.yml`) with values 0, 1 or 2, representing the genotype of the individual.\n\n**The testing file** has any number of rows, each representing data for an individual, similar to the training file but without the phenotype column:\n\n1. A column with the ID of the individual\n1. M columns (where M = SNP_total_count in `config.yml`) with values 0, 1 or 2, representing the genotype of the individual.\n\n**The forest file** contains the structure of a forest in YAML format. 
It is the output file of a nimbus training run.\n\n## Output files\n\nNimbus will generate the following output files:\n\nAfter training:\n\n * `random_forest.yml`: A file defining the structure of the computed Random Forest. It can be used as the input forest file.\n * `generalization_errors.txt`: A file with the generalization error for every tree in the forest.\n * `training_file_predictions.txt`: A file with predictions for every individual from the training file.\n * `snp_importances.txt`: A file with the computed importance for every SNP. _(unless `var_importances` is set to `No` in the config file)_\n\nAfter testing:\n\n * `testing_file_predictions.txt`: A file detailing the predicted results for the testing dataset.\n\n## Example usage\n\n### Sample files\n\nSample files are located in the `/spec/fixtures` directory, both for regression and classification problems. They can be used as a starting point to tweak your own configurations.\n\nDepending on the kind of problem you want to test, different files are needed:\n\n### Regression\n\n**Test with a Random Forest created from a training data set**\n\nDownload/copy the `config.yml`, `training.data` and `testing.data` files from the [regression folder](./tree/master/spec/fixtures/regression).\n\nThen run nimbus:\n\n````shell\n\u003e nimbus\n````\n\nIt should output a `random_forest.yml` file with the nodes and structure of the resulting random forest, the `generalization_errors` and `snp_importances` files, and the predictions for both training and testing datasets (`training_file_predictions.txt` and `testing_file_predictions.txt` files).\n\n**Test with a Random Forest previously created**\n\nDownload/copy the `config.yml`, `testing.data` and `random_forest.yml` files from the [regression folder](./tree/master/spec/fixtures/regression).\n\nEdit the `config.yml` file to comment out or remove the training entry.\n\nThen use nimbus to run the testing:\n\n````shell\n\u003e nimbus\n````\n\nIt should output a 
`testing_file_predictions.txt` file with the resulting predictions for the testing dataset using the given random forest.\n\n### Classification\n\n**Test with a Random Forest created from a training data set**\n\nDownload/copy the `config.yml`, `training.data` and `testing.data` files from the [classification folder](./tree/master/spec/fixtures/classification).\n\nThen run nimbus:\n\n````shell\n\u003e nimbus\n````\n\nIt should output a `random_forest.yml` file with the nodes and structure of the resulting random forest, the `generalization_errors` file, and the predictions for both training and testing datasets (`training_file_predictions.txt` and `testing_file_predictions.txt` files).\n\n**Test with a Random Forest previously created**\n\nDownload/copy the `config.yml`, `testing.data` and `random_forest.yml` files from the [classification folder](./tree/master/spec/fixtures/classification).\n\nEdit the `config.yml` file to comment/remove the training entry.\n\nThen use nimbus to run the testing:\n\n````shell\n\u003e nimbus\n````\n\nIt should output a `testing_file_predictions.txt` file with the resulting predictions for the testing dataset using the given random forest.\n\n\n## Test suite\n\nNimbus includes a test suite located in the `spec` directory. The current state of the build is [publicly tracked by Travis CI](https://travis-ci.org/xuanxu/nimbus). 
You can run the specs locally if you clone the code to your local machine and run the default rake task:\n\n````shell\n\u003e git clone git://github.com/xuanxu/nimbus.git\n\u003e cd nimbus\n\u003e bundle install\n\u003e rake\n````\n\n## Resources\n\n* [Source code](http://github.com/xuanxu/nimbus) – Fork the code\n* [Issues](http://github.com/xuanxu/nimbus/issues) – Bugs and feature requests\n* [Online rdocs](http://rubydoc.info/gems/nimbus/frames)\n* [Nimbus at rubygems.org](https://rubygems.org/gems/nimbus)\n* [Random Forest at Wikipedia](http://en.wikipedia.org/wiki/Random_forest)\n* [RF Leo Breiman page](http://www.stat.berkeley.edu/~breiman/RandomForests/)\n\n\n## Contributing\n\nContributions are welcome. We encourage you to contribute to the Nimbus codebase.\n\nPlease read the [CONTRIBUTING](CONTRIBUTING.md) file.\n\n\n## Credits and DOI\n\nIf you use Nimbus, please cite our [JOSS paper: http://dx.doi.org/10.21105/joss.00351](http://dx.doi.org/10.21105/joss.00351)\n\nYou can find the citation info in [Bibtex format here](CITATION.bib).\n\n**Cite as:**  \n*Bazán et al., (2017), Nimbus: a Ruby gem to implement Random Forest algorithms in a genomic selection context, Journal of Open Source Software, 2(16), 351, doi:10.21105/joss.00351*\n\nNimbus was developed by [Juanjo Bazán](http://twitter.com/xuanxu) in collaboration with Oscar González-Recio.\n\n\n## LICENSE\n\nCopyright © Juanjo Bazán, released under the [MIT license](MIT-LICENSE.txt)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxuanxu%2Fnimbus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxuanxu%2Fnimbus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxuanxu%2Fnimbus/lists"}