{"id":13413343,"url":"https://github.com/ryanbressler/CloudForest","last_synced_at":"2025-03-14T19:31:58.853Z","repository":{"id":5168440,"uuid":"6339453","full_name":"ryanbressler/CloudForest","owner":"ryanbressler","description":"Ensembles of decision trees in go/golang.","archived":false,"fork":false,"pushed_at":"2022-02-05T06:54:29.000Z","size":1743,"stargazers_count":736,"open_issues_count":34,"forks_count":92,"subscribers_count":44,"default_branch":"master","last_synced_at":"2024-07-31T20:52:15.360Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ryanbressler.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-10-22T17:38:16.000Z","updated_at":"2024-06-16T08:09:04.000Z","dependencies_parsed_at":"2022-09-09T18:12:11.774Z","dependency_job_id":null,"html_url":"https://github.com/ryanbressler/CloudForest","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ryanbressler%2FCloudForest","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ryanbressler%2FCloudForest/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ryanbressler%2FCloudForest/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ryanbressler%2FCloudForest/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ryanbressler","download_url":"https://codeload.github.com/ryanbressler/CloudForest/tar.gz/refs/heads/master","host":{"name":"GitHub","url
":"https://github.com","kind":"github","repositories_count":243635368,"owners_count":20322927,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-30T20:01:38.230Z","updated_at":"2025-03-14T19:31:58.536Z","avatar_url":"https://github.com/ryanbressler.png","language":"Go","funding_links":[],"categories":["Machine Learning","Go","机器学习","機器學習","Relational Databases","Decision Trees","\u003cspan id=\"机器学习-machine-learning\"\u003e机器学习 Machine Learning\u003c/span\u003e"],"sub_categories":["Search and Analytic Databases","Advanced Console UIs","检索及分析资料库","Tools","高級控制台界面","[Tools](#tools-1)","Speech Recognition","Vector Database","Middlewares","\u003cspan id=\"高级控制台用户界面-advanced-console-uis\"\u003e高级控制台用户界面 Advanced Console UIs\u003c/span\u003e","SQL 查询语句构建库","高级控制台界面","交流"],"readme":"CloudForest\n==============\n\n[![Build Status](https://travis-ci.org/ryanbressler/CloudForest.png?branch=master)](https://travis-ci.org/ryanbressler/CloudForest) \n[![GoDoc](https://godoc.org/github.com/ryanbressler/CloudForest?status.svg)](https://godoc.org/github.com/ryanbressler/CloudForest)\n\n[Google Group](https://groups.google.com/forum/#!forum/cloudforest-dev)\n\nFast, flexible, multi-threaded ensembles of decision trees for machine\nlearning in pure Go (golang). \n\nCloudForest allows for a number of related algorithms for classification, regression, feature selection \nand structure analysis on heterogeneous numerical / categorical data with missing values. 
These include:\n\n* Breiman and Cutler's Random Forest for Classification and Regression\n* Adaptive Boosting (AdaBoost) Classification \n* Gradient Boosting Tree Regression and Two Class Classification\n* Hellinger Distance Trees for Classification\n* Entropy, Cost driven and Class Weighted classification\n* L1/Absolute Deviance Decision Tree regression\n* Improved Feature Selection via artificial contrasts with ensembles (ACE)\n* Roughly Balanced Bagging for Unbalanced Data\n* Improved robustness using out of bag cases and artificial contrasts.\n* Support for missing values via bias correction or three way splitting.\n* Proximity/Affinity Analysis suitable for manifold learning\n* A number of experimental splitting criteria\n\nThe design prioritizes:\n\n* Training speed\n* Performance on high dimensional heterogeneous datasets (e.g. genetic and clinical data).\n* An optimized set of core functionality. \n* The flexibility to quickly implement new impurities and algorithms using the common core.\n* The ability to natively handle non numerical data types and missing values.\n* Use in a multi core or multi machine environment.\n\nIt can achieve quicker training times than many other popular implementations on some datasets. \nThis is the result of CPU cache friendly memory utilization well suited to modern processors and\nseparate, optimized paths to learn splits from binary, numerical and categorical data.\n\n![Benchmarks](benchmark.png \"Benchmarks on heterogeneous clinical data.\")\n\nCloudForest offers good general accuracy and the alternative and augmented algorithms it implements can \noffer a reduced error rate for specific use cases, especially recovering a signal from noisy, \nhigh dimensional data prone to over-fitting and predicting rare events and unbalanced classes\n(both of which are typical in genetic studies of diseases). 
These methods should be included in \nparameter sweeps to maximize accuracy.\n\n![Error](error.png \"Balanced error rates of different augmented algorithms on an example dataset.\") \n\n(Work on benchmarks and optimization is ongoing; if you find a slow use case please raise an issue.)\n \nCommand line utilities to grow, apply and analyze forests and do cross validation are provided, or \nCloudForest can be used as a library in Go programs.\n\nThis document covers command line usage, file formats and some algorithmic background.\n\nDocumentation for coding against CloudForest has been generated with godoc and can be viewed live at:\nhttp://godoc.org/github.com/ryanbressler/CloudForest\n\nPull requests, spelling corrections and bug reports are welcome; the code repo and issue tracker can be found at:\nhttps://github.com/ryanbressler/CloudForest\n\nA Google discussion group can be found at:\nhttps://groups.google.com/forum/#!forum/cloudforest-dev\n\nCloudForest was created in the Shmulevich Lab at the Institute for Systems\nBiology.\n\n([Build status](https://travis-ci.org/ryanbressler/CloudForest.png?branch=master) includes accuracy tests on \niris and Boston housing price datasets and multiple Go versions.)\n\nInstallation\n-------------\nWith [go installed](http://golang.org/doc/install):\n\n```bash\ngo get github.com/ryanbressler/CloudForest\ngo install github.com/ryanbressler/CloudForest/growforest\ngo install github.com/ryanbressler/CloudForest/applyforest\n\n#optional utilities\ngo install github.com/ryanbressler/CloudForest/leafcount\ngo install github.com/ryanbressler/CloudForest/utils/nfold\ngo install github.com/ryanbressler/CloudForest/utils/toafm\n```\n\nTo update to the latest version use go get with the -u flag, which also reinstalls each utility:\n```bash\ngo get -u github.com/ryanbressler/CloudForest\ngo get -u github.com/ryanbressler/CloudForest/growforest\ngo get -u github.com/ryanbressler/CloudForest/applyforest\n\n#optional utilities\ngo get -u 
github.com/ryanbressler/CloudForest/leafcount\ngo get -u github.com/ryanbressler/CloudForest/utils/nfold\ngo get -u github.com/ryanbressler/CloudForest/utils/toafm\n```\n\n\n\nQuick Start\n-------------\n\nData can be provided in a tsv based annotated feature matrix or in arff or libsvm formats with\n\".arff\" or \".libsvm\" extensions. Details are discussed in the [Data File Formats](#data-file-formats) section below \nand a few example data sets are included in the \"data\" directory.\n\n```bash\n#grow a predictor forest with default parameters and save it to forest.sf\ngrowforest -train train.fm -rfpred forest.sf -target B:FeatureName\n\n#grow a 1000 tree forest using 16 cores and report out of bag error \n#with minimum leafSize 8 \ngrowforest -train train.fm -rfpred forest.sf -target B:FeatureName -oob \\\n-nCores 16 -nTrees 1000 -leafSize 8\n\n#grow a 1000 tree forest evaluating half the features as candidates at each \n#split and reporting out of bag error after each tree to watch for convergence\ngrowforest -train train.fm -rfpred forest.sf -target B:FeatureName -nTrees 1000 -mTry .5 -progress \n\n#growforest with weighted random forest\ngrowforest -train train.fm -rfpred forest.sf -target B:FeatureName \\\n-rfweights '{\"true\":2,\"false\":0.5}'\n\n#report all growforest options\ngrowforest -h\n\n#Print the (balanced for classification, least squares for regression) error \n#rate on test data to standard out\napplyforest -fm test.fm -rfpred forest.sf\n\n#Apply the forest, report error rate and save predictions\n#Predictions are output in a tsv as:\n#CaseLabel\tPredicted\tActual\napplyforest -fm test.fm -rfpred forest.sf -preds predictions.tsv\n\n#Calculate counts of case vs case (leaves) and case vs feature (branches) proximity.\n#Leaves are reported as:\n#Case1 Case2 Count\n#Branches are reported as:\n#Case Feature Count\nleafcount -fm train.fm -rfpred forest.sf -leaves leaves.tsv -branches branches.tsv\n\n#Generate training and testing folds\nnfold -fm 
data.fm\n\n#growforest with internal training and testing\ngrowforest -train train_0.fm -target N:FeatureName -test test_0.fm\n\n#growforest with internal training and testing, 10 ace feature selection permutations and\n#testing performed only using significant features\ngrowforest -train train_0.fm -target N:FeatureName -test test_0.fm -ace 10 -cutoff .05\n\n```\n\nGrowforest Utility\n--------------------\n\ngrowforest trains a forest using the following parameters, which can be listed with -h.\n\nParameters are parsed with Go's flag parser so that boolean parameters can be\nset to true with a simple flag:\n    \n    #the following are equivalent\n    growforest -oob\n    growforest -oob=true\n\nEquals signs and quotes are optional for other parameters:\n\t\n    #the following are equivalent\n\tgrowforest -train featurematrix.afm\n\tgrowforest -train=\"featurematrix.afm\"\n\n\n### Basic options ###\n\n ```\n   -target=\"\": The row header of the target in the feature matrix.\n   -train=\"featurematrix.afm\": AFM formated feature matrix containing training data.\n   -rfpred=\"rface.sf\": File name to output predictor forest in sf format.\n   -leafSize=\"0\": The minimum number of cases on a leaf node. If \u003c=0 will be inferred to 1 for classification 4 for regression.\n   -maxDepth=0: Maximum tree depth. Ignored if 0.\n   -mTry=\"0\": Number of candidate features for each split as a count (ex: 10) or portion of total (ex: .5). Ceil(sqrt(nFeatures)) if \u003c=0.\n   -nSamples=\"0\": The number of cases to sample (with replacement) for each tree as a count (ex: 10) or portion of total (ex: .5). 
If \u003c=0 set to total number of cases.\n   -nTrees=100: Number of trees to grow in the predictor.\n  \n   -importance=\"\": File name to output importance.\n \n   -oob=false: Calculate and report oob error.\n  \n ```\n\n### Advanced Options ###\n\n ```\n   -blacklist=\"\": A list of feature id's to exclude from the set of predictors.\n   -includeRE=\"\": Filter features that DON'T match this RE.\n   -blockRE=\"\": A regular expression to identify features that should be filtered out.\n   -force=false: Force at least one non constant feature to be tested for each split as in scikit-learn.\n   -impute=false: Impute missing values to feature mean/mode before growth.\n   -nCores=1: The number of cores to use.\n   -progress=false: Report tree number and running oob error.\n   -oobpreds=\"\": Calculate and report oob predictions in the file specified.\n   -cpuprofile=\"\": write cpu profile to file\n   -multiboost=false: Allow multi-threaded boosting which may have unexpected results. (highly experimental)\n   -nobag=false: Don't bag samples for each tree.\n   -evaloob=false: Evaluate potential splitting features on OOB cases after finding split value in bag.\n   -selftest=false: Test the forest on the data and report accuracy.\n   -splitmissing=false: Split missing values onto a third branch at each node (experimental).\n   -test=\"\": Data to test the model on after training.\n ```\n\n### Regression Options ###\n\n ```\n   -gbt=0: Use gradient boosting with the specified learning rate.\n   -l1=false: Use l1 norm regression (target must be numeric).\n   -ordinal=false: Use ordinal regression (target must be numeric).\n ```\n\n### Classification Options ###\n\n ```\n   -adaboost=false: Use Adaptive boosting for classification.\n   -balanceby=\"\": Roughly balanced bag the target within each class of this feature.\n   -balance=false: Balance bagging of samples by target class for unbalanced classification.\n   -cost=\"\": For categorical targets, a json string to float 
map of the cost of falsely identifying each category.\n   -entropy=false: Use entropy minimizing classification (target must be categorical).\n   -hellinger=false: Build trees using hellinger distance.\n   -positive=\"True\": Positive class to output probabilities for.\n   -rfweights=\"\": For categorical targets, a json string to float map of the weights to use for each category in Weighted RF.\n   -NP=false: Do approximate Neyman-Pearson classification.\n   -NP_a=0.1: Constraint on percision in NP classification [0,1]\n   -NP_k=100: Weight of constraint in NP classification [0,Inf+)\n   -NP_pos=\"1\": Class label to constrain percision in NP classification.\n ```\n\nNote: rfweights and cost should use json to specify the weights and/or costs per class, using the strings that represent each class in the boolean or categorical feature:\n\n```\n   growforest -rfweights '{\"true\":2,\"false\":0.5}'\n```\n### Randomizing Data and Artificial Contrasts ###\n\n Randomly shuffling parts of the data or including shuffled \"Artificial Contrasts\" can be useful to establish baselines for comparison.\n\n The \"vet\" option extends the principle to tree growth. When evaluating potential splitters it subtracts the impurity decrease of the best \n split a candidate splitter can make on a shuffled target from the impurity decrease of its actual best split. This is intended to penalize \n certain types of features that contribute to over-fitting, including unique identifiers and sparse features.\n\n ```\n   -ace=0: Number ace permutations to do. 
Output ace style importance and p values.\n   -permute: Permute the target feature (to establish random predictive power).\n   -contrastall=false: Include a shuffled artificial contrast copy of every feature.\n   -nContrasts=0: The number of randomized artificial contrast features to include in the feature matrix.\n   -shuffleRE=\"\": A regular expression to identify features that should be shuffled.\n   -vet=false: Penalize potential splitter impurity decrease by subtracting the best split of a permuted target.\n ```\n\n\n\n\nApplyforest Utility\n----------------------\n\napplyforest applies a forest to the specified feature matrix and outputs predictions as a two column\n(caselabel\tpredictedvalue) tsv.\n\n```\nUsage of applyforest:\n  -expit=false: Expit (inverst logit) transform data (for gradient boosting classification).\n  -fm=\"featurematrix.afm\": AFM formated feature matrix containing data.\n  -mean=false: Force numeric (mean) voting.\n  -mode=false: Force categorical (mode) voting.\n  -preds=\"\": The name of a file to write the predictions into.\n  -rfpred=\"rface.sf\": A predictor forest.\n  -sum=false: Force numeric sum voting (for gradient boosting etc).\n  -votes=\"\": The name of a file to write categorical vote totals to.\n```\n\nLeafcount Utility\n-------------------\n\nleafcount outputs counts of case-by-case co-occurrence on leaf nodes (leaves.tsv, Breiman's proximity) and counts of the\nnumber of times a feature is used to split a node containing each case (branches.tsv, a measure of relative/local\nimportance).\n\n```\nUsage of leafcount:\n  -branches=\"branches.tsv\": a case by feature sparse matrix of leaf co-occurrence in tsv format\n  -fm=\"featurematrix.afm\": AFM formated feature matrix to use.\n  -leaves=\"leaves.tsv\": a case by case sparse matrix of leaf co-occurrence in tsv format\n  -rfpred=\"rface.sf\": A predictor forest.\n```\n\nNfold Utility\n--------------\n\nnfold is a utility for generating cross validation folds. 
It can read in and output any of the supported formats.\nYou can specify a categorical target feature to do stratified sampling, which will balance the classes between the folds.\n\nIf no target feature is specified, a numerical target feature is specified or the -unstratified option is provided,\nunstratified sampling will be used.\n\n```\nUsage of nfold:\n  -fm=\"featurematrix.afm\": AFM formated feature matrix containing data.\n  -folds=5: Number of folds to generate.\n  -target=\"\": The row header of the target in the feature matrix.\n  -test=\"test_%v.fm\": Format string for testing fms.\n  -train=\"train_%v.fm\": Format string for training fms.\n  -unstratified=false: Force unstratified sampeling of categorical target.\n  -writeall=false: Output all three formats.\n  -writearff=false: Output arff.\n  -writelibsvm=false: Output libsvm.\n```\n\nImportance\n----------\n\nVariable importance in CloudForest is based on the mean decrease in impurity over all of\nthe splits made using a feature. 
It is output in a tsv as:\n\n0       | 1                | 2         | 3                 | 4                      | 5               | 6\n--------|------------------|-----------|-------------------|------------------------|-----------------|-------------------\nFeature | Decrease Per Use | Use Count | Decrease Per Tree | Decrease Per Tree Used | Tree Used Count | Mean Minimal Depth\n\nDecrease per tree (column 3, counting from 0) is the most common definition of importance in other implementations and \nis calculated over all trees, not just the ones the feature was used in.\n\nEach of these scores has different properties:\n* Per-use and per-tree-used scores may be more resistant to feature redundancy.\n* Per-tree-used and per-tree scores may better pick out complex effects.\n* Mean Minimal Depth has been proposed (see \"Random Survival Forests\") as an alternative importance.\n\nTo provide a baseline for evaluating importance, artificial contrast features can be used by\nincluding shuffled copies of existing features (-nContrasts, -contrastall).\n\nA feature that performs well when randomized (or when the target has been randomized) may be causing\nover-fitting. \n\nThe option to permute the target (-permute) will establish a minimum random baseline. Using a \nregular expression (-shuffleRE) to shuffle part of the data can be useful in teasing out the contributions of \ndifferent subsets of features.\n\nImportance with P-Values Via Artificial Contrasts/ACE\n-----------------------------------------------------\nP-values can be established for importance scores by comparing the importance score for each feature to that of a\nshuffled copy of itself or an artificial contrast over a number of runs. 
This algorithm is described in Tuv et al.'s \n\"Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination.\"\n\nFeature selection based on these p-values can increase the model's resistance to issues including\n over-fitting from high cardinality features.\n\nIn CloudForest these p-values are produced with a Welch's t-test and the null hypothesis that the mean importance\nof a feature's contrasts is greater than that of the feature itself over all of the forests. To use this method \nspecify the number of forests/repeats to perform using the \"-ace\" option and provide a file name for importance \nscores via the -importance option. Importance scores will be the mean decrease per tree over all of the forests.\n\n```\ngrowforest -train housing.arff -target class -ace 10 -importance bostanimp.tsv\n```\n\nThe output will be a tsv with the following columns:\n\n0      | 1         | 2               | 3\n-------|-----------|-----------------|--------\ntarget | predictor | p-value         | mean importance\n\nThis method is often combined with the -evaloob method described below.\n\n```\ngrowforest -train housing.arff -target class -ace 10 -importance bostanimp.tsv -evaloob\n``` \n\n\nImproved Feature Selection \n--------------------------\n\nGenomic data frequently has many noisy, high cardinality, uninformative features which can lead to in bag over-fitting. To combat this, \nCloudForest implements some methods designed to help better filter out uninformative features.\n\nThe -evaloob method evaluates potential best splitting features on the oob data after learning the split value for\neach splitter as normal from the in bag/branch data. Importance scores are also calculated using OOB cases. 
\nThis idea is discussed in Eugene Tuv, Alexander Borisov, George Runger and Kari Torkkola's paper \"Feature Selection with\nEnsembles, Artificial Variables, and Redundancy Elimination.\"\n\nThe -vet option penalizes the impurity decrease of each potential best split by subtracting the best split it can make after\nthe target values of the cases on which the split is being evaluated have been shuffled.\n\nIn testing so far evaloob provides better performance and is less computationally intensive. These options can be used together, which\nmay provide the best performance in very noisy data. When used together vetting is also done on the out of bag cases. \n\n\nData With Unbalanced Classes\n----------------------------\n\nGenomic studies also frequently have unbalanced target classes. For example, you might be interested in a rare disease but have \nsamples drawn from the general population. CloudForest implements three methods for dealing with such studies: roughly \nbalanced bagging (-balance), cost weighted classification (-cost) and weighted gini impurity driven classification \n(-rfweights). See the references below for a discussion of these options.\n\n\nMissing Values\n----------------\n\nBy default CloudForest uses a fast heuristic for missing values. 
When proposing a split on a feature\nwith missing data the missing cases are removed and the impurity value is corrected to use three way impurity\nwhich reduces the bias towards features with lots of missing data:\n\n                I(split) = p(l)I(l)+p(r)I(r)+p(m)I(m)\n\nMissing values in the target variable are left out of impurity calculations.\n\nThis provides generally good results at a fraction of the computational cost of imputing data.\n\nOptionally, -impute can be called before forest growth to impute missing values to the feature mean/mode which Breiman \nsuggests as a fast method for imputing values.\n\nThis forest could also be analyzed for proximity (using leafcount or tree.GetLeaves) to do the\nmore accurate proximity weighted imputation Breiman describes.\n\nExperimental support (-splitmissing) is provided for 3 way splitting which splits missing cases onto a third branch.\nThis has so far yielded mixed results in testing.\n\n\n\n\nData File Formats\n------------------\n\nData files in CloudForest are assumed to be in our Annotated Feature Matrix tsv based format unless a .libsvm or .arff file extension is used.\n\n### Annotated Feature Matrix Tsv Files ###\n\nCloudForest borrows the annotated feature matrix (.afm) and stochastic forest (.sf) file formats\nfrom Timo Erkkila's rf-ace which can be found at https://code.google.com/p/rf-ace/\n\nAn annotated feature matrix (.afm) file is a tab delimited file with column and row headers. By default columns represent cases and rows represent features/variables, though the transpose (rows as cases/observations) is also detected and supported. \n\nA row header / feature id includes a prefix to specify the feature type. These prefixes are also used to detect column vs row orientation.\n\n```\n\"N:\" Prefix for numerical feature id.\n\"C:\" Prefix for categorical feature id.\n\"B:\" Prefix for boolean feature id.\n```\n\nCategorical and boolean features use strings for their category labels. 
Missing values are represented\nby \"?\",\"nan\",\"na\", or \"null\" (case insensitive). A short example:\n\n```\nfeatureid\tcase1\tcase2\tcase3\nN:NumF1\t0.0\t.1\tna\nC:CatF2\tred\tred\tgreen\n```\n\nSome sample feature matrix data files are included in the \"data\" directory.\n\n### ARFF Data Files ###\n\nCloudForest also supports limited import of weka's ARFF format. This format will be detected via the \".arff\" file extension. Only numeric and nominal/categorical attributes are supported; all other attribute types will be assumed to be categorical and should usually be removed or blacklisted. There is no support for spaces in feature names, quoted strings or sparse data. Trailing space or comments after the data field may cause odd behavior. \n\nThe ARFF format also provides an easy way to annotate a csv file with information about the supplied fields:\n\n```\n@relation data\n\n@attribute NumF1 numeric\n@attribute CatF2 {red,green}\n\n@data\n0.0,red\n.1,red\n?,green\n```\n\n### LibSvm/Svm Light Data Files ###\n\nThere is also basic support for sparse numerical data in libsvm's file format. This format will be detected by the \".libsvm\" file extension and has some limitations. A simple libsvm file might look like:\n\n```\n24.0 1:0.00632 2:18.00 3:2.310 4:0\n21.6 1:0.02731 2:0.00 3:7.070 7:78.90\n34.7 1:0.02729 2:0.00 5:0.4690\n```\n\nThe target field will be given the designation \"0\" and be in the \"0\" position of the matrix and you will need to use \"-target 0\" as an option with growforest. No other feature can have this designation.\n\nThe categorical or numerical nature of the target variable will be detected from the value of the first line. If it is an integer value like 0, 1 or 1200 the target will be parsed as categorical and classification performed. If it is a floating point value including a decimal place like 1.0, 1.7 etc. the target will be parsed as numerical and regression performed. There is currently no way to override this behavior. 
\n\nModels - Stochastic Forest Files\n--------------------------------\n\nA stochastic forest (.sf) file contains a forest of decision trees. The main advantage of this\nformat as opposed to an established format like json is that an sf file can be written iteratively\ntree by tree and multiple .sf files can be combined with minimal logic required, allowing for\nmassively parallel growth of forests with low memory use.\n\nAn .sf file consists of lines each of which is a comma separated list of key value pairs. Lines can\ndesignate either a FOREST, TREE, or NODE. Each tree belongs to the preceding forest and each node to\nthe preceding tree. Nodes must be written in order of increasing depth.\n\nCloudForest generates fewer fields than rf-ace but requires the following. Other fields will be\nignored.\n\nForest requires forest type (only RF currently), target and ntrees:\n\n\tFOREST=RF|GBT|..,TARGET=\"$feature_id\",NTREES=int\n\nTree requires only an int and the value is ignored, though the line is needed to designate a new tree:\n\n\tTREE=int\n\nNode requires a path encoded so that the root node is specified by \"*\" and each split left or right as \"L\" or \"R\".\nLeaf nodes should also define PRED such as \"PRED=1.5\" or \"PRED=red\". 
Splitter nodes should define SPLITTER with\na feature id inside of double quotes, SPLITTERTYPE=[CATEGORICAL|NUMERICAL] and an LVALUES term which can be either\na float inside of double quotes representing the highest value sent left or a \":\" separated list of categorical\nvalues sent left.\n\n\tNODE=$path,PRED=[float|string],SPLITTER=\"$feature_id\",SPLITTERTYPE=[CATEGORICAL|NUMERICAL],LVALUES=\"[float|: separated list]\"\n\nAn example .sf file:\n\n\tFOREST=RF,TARGET=\"N:CLIN:TermCategory:NB::::\",NTREES=12800\n\tTREE=0\n\tNODE=*,PRED=3.48283,SPLITTER=\"B:SURV:Family_Thyroid:F::::maternal\",SPLITTERTYPE=CATEGORICAL,LVALUES=\"false\"\n\tNODE=*L,PRED=3.75\n\tNODE=*R,PRED=1\n\nCloudForest can parse and apply .sf files generated by at least some versions of rf-ace.\n\nCompiling for Speed\n----------------------\n\nWhen compiled with Go 1.1 CloudForest achieves running times similar to implementations in\nother languages. Using gccgo (4.8.0 at least) results in longer running times and is not\nrecommended. This may change as gccgo adopts the Go 1.1 way of implementing closures. \n\n\nReferences\n-------------\n\nThe idea for (and trademark of the term) Random Forests originated with Leo Breiman and\nAdele Cutler. 
Their code and papers can be found at:\n\nhttp://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm\n\nAll code in CloudForest is original but some ideas for methods and optimizations were inspired by\nTimo Erkkila's rf-ace and Andy Liaw and Matthew Wiener's randomForest R package based on Breiman and\nCutler's code:\n\nhttps://code.google.com/p/rf-ace/\nhttp://cran.r-project.org/web/packages/randomForest/index.html\n\nThe idea for Artificial Contrasts is based on:\nEugene Tuv and Kari Torkkola's \"Feature Filtering with Ensembles Using Artificial Contrasts\"\nhttp://enpub.fulton.asu.edu/workshop/FSDM05-Proceedings.pdf#page=74\nand\nEugene Tuv, Alexander Borisov, George Runger and Kari Torkkola's \"Feature Selection with\nEnsembles, Artificial Variables, and Redundancy Elimination\"\nhttp://www.researchgate.net/publication/220320233_Feature_Selection_with_Ensembles_Artificial_Variables_and_Redundancy_Elimination/file/d912f5058a153a8b35.pdf\n\nThe idea for growing trees to minimize categorical entropy comes from Ross Quinlan's ID3:\nhttp://en.wikipedia.org/wiki/ID3_algorithm\n\n\"The Elements of Statistical Learning\" 2nd edition by Trevor Hastie, Robert Tibshirani and Jerome Friedman\nwas also consulted during development.\n\nMethods for classification from unbalanced data are covered in several papers:\nhttp://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf\nhttp://www.ncbi.nlm.nih.gov/pmc/articles/PMC3163175/\nhttp://www.biomedcentral.com/1471-2105/11/523\nhttp://bib.oxfordjournals.org/content/early/2012/03/08/bib.bbs006\nhttp://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0067863\n\nDensity Estimating Trees/Forests are discussed in:\nhttp://users.cis.fiu.edu/~lzhen001/activities/KDD2011Program/docs/p627.pdf\nhttp://research.microsoft.com/pubs/158806/CriminisiForests_FoundTrends_2011.pdf\nThe latter also introduces the idea of manifold forests which can be learned using downstream analysis of the\noutputs of leafcount to find 
the Fiedler vectors of the graph laplacian. \n\n\n\n    \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fryanbressler%2FCloudForest","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fryanbressler%2FCloudForest","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fryanbressler%2FCloudForest/lists"}