{"id":13401120,"url":"https://github.com/szilard/benchm-ml","last_synced_at":"2025-05-15T15:07:02.417Z","repository":{"id":29478563,"uuid":"33015554","full_name":"szilard/benchm-ml","owner":"szilard","description":"A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).","archived":false,"fork":false,"pushed_at":"2022-09-16T14:01:14.000Z","size":1133,"stargazers_count":1881,"open_issues_count":12,"forks_count":334,"subscribers_count":146,"default_branch":"master","last_synced_at":"2025-04-07T20:11:15.519Z","etag":null,"topics":["data-science","deep-learning","gradient-boosting-machine","h2o","machine-learning","python","r","random-forest","spark","xgboost"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/szilard.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-03-28T00:34:18.000Z","updated_at":"2025-04-05T09:35:45.000Z","dependencies_parsed_at":"2023-01-14T15:15:19.744Z","dependency_job_id":null,"html_url":"https://github.com/szilard/benchm-ml","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/szilard%2Fbenchm-ml","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/szilard%2Fbenchm-ml/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/szilard%2Fbenchm
-ml/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/szilard%2Fbenchm-ml/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/szilard","download_url":"https://codeload.github.com/szilard/benchm-ml/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254364270,"owners_count":22058878,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","deep-learning","gradient-boosting-machine","h2o","machine-learning","python","r","random-forest","spark","xgboost"],"created_at":"2024-07-30T19:00:58.919Z","updated_at":"2025-05-15T15:06:57.401Z","avatar_url":"https://github.com/szilard.png","language":"R","readme":"\n## Simple/limited/incomplete benchmark for scalability, speed and accuracy of machine learning libraries for classification\n\n_**All benchmarks are wrong, but some are useful**_\n\nThis project aims at a *minimal* benchmark for scalability, speed and accuracy of commonly used implementations\nof a few machine learning algorithms. The target of this study is binary classification with numeric and categorical inputs (of \nlimited cardinality i.e. not very sparse) and no missing data, perhaps the most common problem in business\napplications (e.g. credit scoring, fraud detection or churn prediction). If the input matrix is of *n* x *p*, *n* is \nvaried as 10K, 100K, 1M, 10M, while *p* is ~1K (after expanding the categoricals into dummy \nvariables/one-hot encoding). 
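To make the dummy-variable expansion concrete: each categorical column (e.g. airline carrier) becomes one 0/1 column per distinct category, which is how *p* grows to ~1K here. A minimal, dependency-free sketch (the function name and toy data are illustrative, not from this repo):

```python
def one_hot_encode(values):
    # Expand a list of categorical values into 0/1 dummy columns.
    # Returns (sorted category names, encoded rows). Illustrative only.
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

carriers = ['AA', 'UA', 'AA', 'DL']
cats, encoded = one_hot_encode(carriers)
# cats == ['AA', 'DL', 'UA']; each encoded row has exactly one 1
```

In practice tools such as pandas (`get_dummies`) or scikit-learn do this, but the memory cost is the same idea: *n* rows times one column per distinct category.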
This particular type of data structure/size (the largest) stems from this author's interest in \nsome particular business applications.\n\n**A large part of this benchmark was done in 2015, with a number of updates later on as things have changed. Make sure you read \nthe summary at the [end](https://github.com/szilard/benchm-ml#summary) of this repo of how the focus has changed over time,\nand why instead of updating this benchmark I started a new one (and where to find it).**\n\nThe algorithms studied are \n- linear (logistic regression, linear SVM)\n- random forest\n- boosting \n- deep neural network\n\nin various commonly used open source implementations like \n- R packages\n- Python scikit-learn\n- Vowpal Wabbit\n- H2O \n- xgboost\n- lightgbm (added in 2017)\n- Spark MLlib.\n\n**Update (June 2015):** It turns out these are indeed the [most popular tools](https://github.com/szilard/list-ml-tools)\nused for machine learning. If your software tool of choice is not here, you can do a minimal benchmark\nwith little work using the [following instructions](z-other-tools).\n\nRandom forest, boosting and more recently deep neural networks are the algos expected to perform the best on the structure/sizes\ndescribed above (e.g. vs alternatives such as *k*-nearest neighbors, naive-Bayes, decision trees, linear models etc). \nNon-linear SVMs are also among the best in accuracy in general, but become slow/cannot scale for the larger *n*\nsizes we want to deal with. The linear models are less accurate in general and are used here only \nas a baseline (but they can scale better and some of them can deal with very sparse features, so they are great in other use cases). \n\nBy scalability we mean here that the algos are able to complete (in decent time) for the given data sizes with \nthe main constraint being RAM (a given algo/implementation will crash if running out of memory). 
Some \nof the algos/implementations can work in a distributed setting, although the largest dataset in this\nstudy *n* = 10M is less than 1GB, so scaling out to multiple machines should not be necessary and\nis not the focus of this current study. (Also, some of the algos perform relatively poorly speedwise in the multi-node setting, where \ncommunication is over the network rather than via updating shared memory.)\nSpeed (in the single node setting) is determined by computational\ncomplexity but also by whether the algo/implementation can use multiple processor cores.\nAccuracy is measured by AUC. The interpretability of models is not a concern in this project.\n\nIn summary, we are focusing on which algos/implementations can be used to train relatively accurate binary classifiers for data\nwith millions of observations and thousands of features processed on commodity hardware (mainly one machine with decent RAM and several cores).\n\n## Data\n\nTraining datasets of sizes 10K, 100K, 1M, 10M are [generated](0-init/2-gendata.txt) from the well-known airline dataset (using years 2005 and 2006). \nA test set of size 100K is generated from the same source (using year 2007). The task is to predict whether a flight will\nbe delayed by more than 15 minutes. While we study primarily the scalability of algos/implementations, it is also interesting\nto see how much more information and consequently accuracy the same model can obtain with more data (more observations).\n\n## Setup \n\nThe tests have been carried out on an Amazon EC2 c3.8xlarge instance (32 cores, 60GB RAM). The tools are freely available and \ntheir [installation](0-init/1-install.md) is trivial ([version information here](0-init/1a-versions.txt)). For some\nof the models that ran out of memory for the larger data sizes an r3.8xlarge instance (32 cores, 250GB RAM) has been used\noccasionally. 
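Since AUC is the accuracy metric used throughout, here is a minimal illustration of its rank-based definition: the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counting half. This helper is illustrative, not code from this benchmark:

```python
def auc(labels, scores):
    # Rank-based AUC: rank all scores ascending (ties get the average
    # rank of their group), then rescale the rank sum of the positives.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # 1-based average rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 0.5 corresponds to random scoring and 1.0 to perfect separation, which is why the ~66-81 values below (reported as percentages) span weak-to-strong models.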
For deep learning on GPUs, a p2.xlarge (1 GPU with 12GB video memory, 4 CPU cores, 60GB RAM) instance has been used.\n\n**Update (January 2018):** A more modern approach would use Docker for fully automated installation of all ML software and automated\ntiming/running of tests (which would make it easier to rerun the tests on new versions of the tools, make them more reproducible etc.).\nThis approach has actually been used in a successor of this benchmark focusing only on the top performing GBM implementations, see \n[here](https://github.com/szilard/GBM-perf).\n\n## Results\n\nFor each algo/tool and each size *n* we observe the following: training time, maximum memory usage during training, CPU usage on the cores, \nand AUC as a measure for predictive accuracy. \nTimes to read the data, pre-process the data and score the test data are also observed but not\nreported (they are not the bottleneck).\n\n### Linear Models\n\nThe linear models are not the primary focus of this study because of their relatively poor accuracy vs\nthe more complex models (on this type of data). \nThey are analyzed here only to provide a baseline.\n\nThe R glm function (the basic R tool for logistic regression) is very slow, 500 seconds on *n* = 0.1M (AUC 70.6).\nTherefore, for R the glmnet package is used. For Python/scikit-learn, LogisticRegression\n(based on the LIBLINEAR C++ library) has been used.\n\nTool    | *n*  |   Time (sec)  | RAM (GB) | AUC\n--------|------|---------------|----------|--------\nR       | 10K  |      0.1      |   1      | 66.7\n.       | 100K |      0.5      |   1      | 70.3\n.       | 1M   |      5        |   1      | 71.1\n.       | 10M  |      90       |   5      | 71.1\nPython  | 10K  |      0.2      |   2      | 67.6\n.       | 100K |       2       |   3      | 70.6\n.       | 1M   |       25      |   12     | 71.1\n.       | 10M  |  crash/360    |          | 71.1\nVW      | 10K  |     0.3 (/10) |          | 66.6\n.       
| 100K |      3 (/10)  |          | 70.3\n.       | 1M   |      10 (/10) |          | 71.0\n.       | 10M  |     15        |          | 71.0\nH2O     | 10K  |      1        |   1      | 69.6\n.       | 100K |      1        |   1      | 70.3\n.       | 1M   |      2        |   2      | 70.8\n.       | 10M  |      5        |   3      | 71.0\nSpark   | 10K  |      1        |   1      | 66.6\n.       | 100K |      2        |   1      | 70.2\n.       | 1M   |      5        |   2      | 70.9\n.       | 10M  |      35       |   10     | 70.9\n\nPython crashes on the 60GB machine, but completes\nwhen RAM is increased to 250GB (using a [sparse format](https://github.com/szilard/benchm-ml/issues/27) \nwould help with memory footprint\nand likely runtime as well).\nThe Vowpal Wabbit (VW) running times are reported in the table for 10 passes (online learning) \nover the data for \nthe smaller sizes. While VW can be run on multiple cores (as multiple processes communicating with each\nother), it has been run here in \nthe simplest possible way (1 core). Also keep in mind that VW reads the data on the fly while for the other tools\nthe times reported exclude reading the data into memory.\n\nOne can play with various parameters (such as regularization) and even do some search in the parameter space with\ncross-validation to get better accuracy. However, very quick experimentation shows that at least for the larger\nsizes regularization does not increase accuracy significantly (which is expected since *n* \u003e\u003e *p*).\n\n![plot-time](1-linear/x-plot-time.png)\n![plot-auc](1-linear/x-plot-auc.png)\n\nThe main conclusion here is that **it is trivial to train linear models even for *n* = 10M rows virtually in\nany of these tools** on a single machine in a matter of seconds. \nH2O and VW are the most memory efficient (VW needs only 1 observation in memory\nat a time therefore is the ultimately scalable solution). 
H2O and VW are also the fastest (for VW the time reported\nincludes the time to read the data as it is read on the fly).\nAgain, the differences in memory efficiency and speed will start to really matter only for\nlarger sizes that are beyond the scope of this study.\n\n\n#### Learning Curve of Linear vs Non-Linear Models\n\n\u003ca name=\"rf-vs-linear\"\u003e\u003c/a\u003e\nFor *this dataset* the accuracy of the linear\nmodel tops off at moderate sizes while the accuracy of non-linear models (e.g. random forest) \ncontinues to increase with increasing data size.\nThis is because a simple linear structure can be extracted already from \na smaller dataset and having more data points will not change the classification boundary significantly.\nOn the other hand, more complex models such as random forests can improve further with increasing \ndata size by further refining the classification boundary.\n\nThis means that having more data (\"big data\") does not improve further the accuracy of the *linear* model\n(at least for this dataset).\n\nNote also that the random forest model is more accurate than the linear one for any size, and \ncontrary to the conventional wisdom of \"more data beats better algorithms\", \nthe random forest model \non 1% of the data (100K records) beats the linear model on all the data (10M records). \n\n![plot-auc](1-linear/z-auc-lin-rf.png)\n\nSimilar behavior can be observed in other *non-sparse* datasets, e.g. the \n[Higgs dataset](x1-data-higgs). Contact me (e.g. submit a [github issue](https://github.com/szilard/benchm-ml/issues)) \nif you have learning curves for linear vs non-linear models on other datasets (dense or sparse).\n\nOn the other hand, there is certainly a price for higher accuracy in terms of larger required training (CPU) time.\n\nUltimately, there is a data size - algo (complexity) - cost (CPU time) - accuracy tradeoff \n(to be studied in more detail later). 
Some quick results for H2O:\n\nn     |  Model  |  Time (sec) |   AUC \n------|---------|-------------|--------\n10M   |  Linear |    5        |   71.0  \n0.1M  |  RF     |    150      |   72.5\n10M   |  RF     |    4000     |   77.8\n\n\n### Random Forest\n\n**Note:** The random forests results have been published in a more organized and self-contained form\nin [this blog post](http://datascience.la/benchmarking-random-forest-implementations/).\n\nRandom forests with 500 trees have been trained in each tool choosing the default of square root of *p* as the number of\nvariables to split on.\n\nTool    | *n*  |   Time (sec)  | RAM (GB) | AUC\n-------------------------|------|---------------|----------|--------\nR       | 10K  |      50       |   10     | 68.2\n.       | 100K |     1200      |   35     | 71.2\n.       | 1M   |     crash     |          |\nPython  | 10K  |      2        |   2      | 68.4\n.       | 100K |     50        |   5      | 71.4\n.       | 1M   |     900       |   20     | 73.2\n.       | 10M  |     crash     |          |\nH2O     | 10K  |      15       |   2      | 69.8\n.       | 100K |      150      |   4      | 72.5\n.       | 1M   |      600      |    5     | 75.5\n.       | 10M  |     4000      |   25     | 77.8\nSpark   | 10K  |      50       |   10     | 69.1\n.       | 100K |      270      |   30     | 71.3\n.       | 1M   |  crash/2000   |          | 71.4\nxgboost | 10K  |     4         |    1     | 69.9\n.       | 100K |    20         |    1     | 73.2\n.       | 1M   |    170        |    2     | 75.3\n.       | 10M  |    3000       |    9     | 76.3\n\n![plot-time](2-rf/x-plot-time.png)\n![plot-auc](2-rf/x-plot-auc.png)\n\nThe [R](2-rf/1.R) implementation (randomForest package) is slow and inefficient in memory use. \nIt cannot cope by default with a large number of categories, therefore the data had\nto be one-hot encoded. 
The implementation uses 1 processor core, but with 2 lines of extra code\nit is easy to build\nthe trees in parallel using all the cores and combine them at the end. However, it runs out\nof memory already for *n* = 1M. I have to emphasize this has nothing to do with R per se (and I still stand by\narguing R is the best data science platform esp. when it comes to data munging of structured data or\nvisualization), it is just this\nparticular (C and Fortran) RF implementation used by the randomForest package that is inefficient.\n\nThe [Python](2-rf/2.py) (scikit-learn) implementation is faster, more memory efficient and uses all the cores.\nVariables needed to be one-hot encoded (which is more involved than for R) \nand for *n* = 10M doing this exhausted all the memory. Even when using a larger machine\nwith 250GB of memory (and 140GB free for RF after transforming all the data) the Python implementation\nruns out of memory and crashes for this larger size. The algo \n[finished successfully](https://github.com/szilard/benchm-ml/issues/1) \nthough when run on the larger box with simple integer encoding (which\nfor some datasets/cases might actually be a good approximation/choice).\n\nThe [H2O](2-rf/4-h2o.R) implementation is fast, memory efficient and uses all cores. It deals\nwith categorical variables automatically. It is also more accurate than the studied R/Python packages, \nwhich may be because\nof dealing properly with the categorical variables, i.e. internally in the algo\nrather than working from a previously 1-hot encoded dataset (where the link between the dummies \nbelonging to the same original variable is lost).\n\nThe [Spark](2-rf/5b-spark.txt) (MLlib) implementation is slower and has a larger memory footprint.\nIt runs out of memory already at *n* = 1M (with 250GB of RAM it finishes for *n* = 1M, \nbut it crashes for *n* = 10M). 
However, as Spark\ncan run on a cluster one can throw in even more RAM by using more nodes.\nI also tried providing the categorical\nvariables encoded simply as integers and passing the `categoricalFeaturesInfo` parameter, but that made\ntraining much slower.\nA convenience issue: reading the data takes more than one line of code, and at the start of this benchmark project\nSpark did not provide a one-hot encoder\nfor the categorical data (therefore I used R for that). This has been amended since, thanks to @jkbradley\nfor the native 1-hot encoding [code](https://github.com/szilard/benchm-ml/blob/a04f7136438598ce700c3adbb0fee2efa29488f3/z-other-tools/5xa-spark-1hot.txt).\nIn earlier versions of this benchmark there was an issue of Spark random forests having\nlow prediction accuracy vs the other methods. This was due to aggregating votes rather than probabilities\nand it has been addressed by @jkbradley in this \n[code](https://github.com/szilard/benchm-ml/blob/master/2-rf/5b-spark.txt#L64) (to be included in the next Spark release).\nThere is still an open issue on the accuracy for *n* = 1M (see the breaking trend in the AUC graph).\nTo get more insights on the issues above see\n[more comments](http://datascience.la/benchmarking-random-forest-implementations/#comment-53599) \nby Joseph Bradley @jkbradley of the Databricks/Spark project (thanks, Joseph).\n\n**Update (September 2016):** Spark 2.0 introduces a new API (Pipelines/\"Spark ML\" vs \"Spark MLlib\") and the \n[code](https://github.com/szilard/benchm-ml/blob/406a00e9e501405589d234607e56f64a35ab1ddf/z-other-tools/5xb-spark-trainpred--sp20.txt) becomes significantly simpler.\nFurthermore, Spark 1.5, 1.6 and 2.0 introduced several optimizations (\"Tungsten\") that have significantly improved, for example, the speed of queries (SparkSQL).\nHowever, there is no speed improvement for random forests; they actually got a bit 
\n[slower](https://github.com/szilard/benchm-ml/tree/master/z-other-tools#how-to-benchmark-your-tool-of-choice-with-minimal-work).\n\nI also tried [xgboost](2-rf/6-xgboost.R), a popular library for boosting which is capable of building \nrandom forests as well. It is fast, memory efficient and of high accuracy. Note the different shapes of the\nAUC and runtime vs dataset size curves for H2O and xgboost; some discussion \n[here](https://github.com/szilard/benchm-ml/issues/14).\n\nBoth H2O and xgboost have interfaces from R and Python.\n\nA few other RF implementations (open source and commercial as well) \nhave been benchmarked quickly on 1M records; runtime and AUC are \n[reported here](z-other-tools).\n\nIt would be nice to study the dependence of running time and accuracy as a function of\nthe (hyper)parameter values of the algorithm, but a quick idea can be obtained easily for the\nH2O implementation from this table (*n* = 10M on 250GB RAM):\n\nntree    | depth  |   nbins  | mtries  | Time (hrs)   |  AUC\n---------|--------|----------|---------|--------------|--------\n500      |  20    |    20    | -1 (2)  |      1.2     |  77.8 \n500      |  50    |    200   | -1 (2)  |      4.5     |  78.9\n500      |  50    |    200   |   3     |      5.5     |  78.9\n5000     |  50    |    200   | -1 (2)  |      45      |  79.0\n500      |  100   |   1000   | -1 (2)  |      8.3     |  80.1\n\nOther hyperparameters are the sample rate (at each tree), the min number of observations in nodes and the impurity\nfunction.\n\nOne can see that the AUC could be improved further; the best AUC from this dataset with random forests\nseems to be around 80 (the best AUC from linear models seems to be around 71, and we will compare\nwith boosting and deep learning later).\n\n\n\n### Boosting (Gradient Boosted Trees/Gradient Boosting Machines)\n\nCompared to random forests, GBMs have a more complex relationship between hyperparameters\nand accuracy (and also runtime). 
The main hyperparameters are the learning (shrinkage) rate, number of trees and \nmax depth of trees, while some others are the number of bins, sample rate (at each tree) and min number of \nobservations in nodes. To add to the complexity, GBMs can overfit in the sense that adding more trees at some point will\nresult in decreasing accuracy on a test set (while on the training set \"accuracy\" keeps increasing).\n\nFor example, using xgboost with `n = 100K` `learn_rate = 0.01` `max_depth = 16` (and the\n`printEveryN = 100` and `eval_metric = \"auc\"` options), the AUC on the train and test sets\nafter `n_trees` iterations is:\n\n![plot-overfit](3-boosting/x-overfit.png)\n\nOne can see the AUC on the test set decreases after 1000 iterations (overfitting). \nxgboost has a handy early stopping option (`early_stop_round = k`: training\nwill stop if performance e.g. on a holdout set keeps getting worse consecutively \nfor `k` rounds). If one does not know where to stop, one might underfit (too few iterations)\nor overfit (too many iterations) and the resulting model will be suboptimal in accuracy\n(see Fig. above).\n\nDoing an extensive search for the best model is not the main goal of this project.\nNevertheless, a quick \n[exploratory search](https://github.com/szilard/benchm-ml/blob/master/3-boosting/0-xgboost-init-grid.R) \nin the hyperparameter space has been\nconducted using xgboost (with the early stopping option). For this a separate validation\nset of size 100K from 2007 data not used in the test set has been generated. The goal is\nto find parameter values that provide decent accuracy and then run all GBM implementations\n(R, Python scikit-learn, etc.) with those parameter values to compare speed/scalability (and \naccuracy).\n\nThe smaller the `learn_rate` the better the AUC, but for very small values training time increases dramatically, \ntherefore we use `learn_rate = 0.01` as a compromise. 
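The early stopping logic described above (stop after `k` consecutive rounds without improvement on a holdout set, keep the best round) can be sketched generically. This illustrates the idea only, not xgboost's actual implementation; all names and the toy validation curve below are made up:

```python
def train_with_early_stopping(update_fn, eval_fn, max_rounds, patience):
    # Patience-based early stopping: stop when the validation metric has
    # not improved for `patience` consecutive rounds. `update_fn(r)` adds
    # one more tree/iteration; `eval_fn()` returns the validation AUC.
    best, best_round, bad = float('-inf'), 0, 0
    for r in range(1, max_rounds + 1):
        update_fn(r)
        score = eval_fn()
        if score > best:
            best, best_round, bad = score, r, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best_round, best

# Toy validation curve: improves until round 5, then degrades (overfitting).
curve = [0.60, 0.65, 0.70, 0.72, 0.73, 0.725, 0.71, 0.70, 0.69, 0.68]
state = {'r': 0}
best_round, best = train_with_early_stopping(
    lambda r: state.update(r=r), lambda: curve[state['r'] - 1],
    max_rounds=10, patience=3)
# training stops after round 8; the best model was at round 5 (AUC 0.73)
```

With `patience = 3` the loop wastes at most 3 extra rounds past the optimum instead of running all the way into the overfitting regime shown in the figure above.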
\nContrary to what much of the literature recommends, shallow trees do not produce the best (or close to best) results here; \nthe grid search showed better accuracy e.g. with `max_depth = 16`.\nThe number of trees that produces optimal results for the above hyperparameter values depends, though, on the training set size. \nFor `n_trees = 1000` we don't reach the overfitting regime\nfor either size and we use this value for studying the speed/scalability of the different implementations. \n(Values for the other hyper-parameters that seem to work well are: \n`sample_rate = 0.5` `min_obs_node = 1`.) We call this experiment A (in the table below).\n\nUnfortunately some implementations take too much time to run for the above parameter values\n(and Spark runs out of memory). Therefore, another set of parameter values (that provide lower accuracy but faster training times)\nhas also been used to study speed/scalability: `learn_rate = 0.1` `max_depth = 6` `n_trees = 300`. \nWe call this experiment B.\n\nI have to emphasize that while I make the effort to match parameter values for all algos/implementations,\nevery implementation is different, some don't have all the above parameters, while some might\nuse the existing ones in a slightly different way (you can also see the resulting model/AUC is somewhat different).\nNevertheless, the results below give us a pretty good idea of how the implementations compare to each other.\n\n\nTool    | *n*  | Time (s) A    | Time (s) B | AUC A  | AUC B  | RAM(GB) A | RAM(GB) B\n--------|------|---------------|------------|--------|--------|-----------|-----------\nR       | 10K  |   20          |   3        |   64.9 |  63.1  |    1      |     1\n.       | 100K |   200         |   30       |   72.3 |  71.6  |    1      |     1\n.       | 1M   |   3000        |   400      |   74.1 |  73.9  |    1      |     1\n.       
| 10M  |               |   5000     |        |  74.3  |           |     4\nPython  | 10K  |    1100       |    120     |   69.9 |  69.1  |    2      |     2\n.       | 100K |               |   1500     |        |  72.9  |           |     3\n.       | 1M   |               |            |        |        |           |\n.       | 10M  |               |            |        |        |           |\nH2O     | 10K  |    90         |    7       |  68.2  |  67.7  |    3      |   2\n.       | 100K |   500         |    40      |  71.8  |  72.3  |    3      |   2\n.       | 1M   |   900         |    60      |  75.9  |  74.3  |    9      |   2\n.       | 10M  |   3500        |    300     |  78.3  |  74.6  |    11     |   20\nSpark   | 10K  |  180000       |   700      |  66.4  |  67.8  |    30     |   10\n.       | 100K |               |   1200     |        |  72.3  |           |   30\n.       | 1M   |               |   6000     |        |  73.8  |           |   30 \n.       | 10M  |               |   (60000)  |        | (74.1) |           | crash (110) \nxgboost | 10K  |   6           |     1      |  70.3  |  69.8  |   1       |  1\n.       | 100K |   40          |     4      |  74.1  |  73.5  |   1       |  1\n.       | 1M   |   400         |     45     |  76.9  |  74.5  |   1       |  1\n.       | 10M  |   9000        |    1000    |  78.7  |  74.7  |   6       |  5\n\n![plot-time](3-boosting/x-plot-time.png)\n![plot-auc](3-boosting/x-plot-auc.png)\n\nThe memory footprint of GBMs is in general smaller than for random forests, therefore the\nbottleneck is mainly training time (although besides being slow Spark is inefficient in memory use as well\nespecially for deeper trees, therefore it crashes).\n\nSimilar to random forests, H2O and xgboost are the fastest (both use\nmultithreading). 
R does relatively well considering that it's a single-threaded implementation.\nPython is very slow with one-hot encoding of categoricals, but almost as fast as R (just 1.5x slower) with\nsimple/integer encoding. Spark is slow and memory inefficient,\nbut at least for shallow trees it achieves similar accuracy to the other methods (unlike in\nthe case of random forests, where Spark provides lower accuracy than\nits peers).\n\nCompared to random forests, boosting requires more tuning to get a good choice of hyperparameters.\nQuick results for H2O and xgboost with `n = 10M` (largest data)\n`learn_rate = 0.01` (the smaller the better the\nAUC, but also longer and longer training times) `max_depth = 20` (after a rough search with \n`max_depth = 2,5,10,20,50`) `n_trees = 5000` (close to xgboost early stop)\n`min_obs_node = 1` (and `sample_rate = 0.5` for xgboost, `n_bins = 1000` for H2O):\n\nTool    |  Time (hr) |   AUC\n--------|------------|---------\nH2O     |   7.5      |   79.8\nH2O-3   |   9.5      |   81.2\nxgboost |   14       |   81.1\n\nCompare with the H2O random forest from the previous section (Time 8.3\thr, AUC 80.1).\nH2O-3 is the new generation/version of H2O. \n\n**Update (May 2017):** A new tool for GBMs, LightGBM, came out recently. While it's not (yet) as widely used as the tools above,\nit is now the fastest one. There is also recent work on running xgboost and LightGBM on GPUs. Therefore I started a new \n(leaner) github repo to keep track of the best GBM tools \n[here](https://github.com/szilard/GBM-perf) (and ignore mediocre tools such as Spark).\n\n**Update (January 2018)**: I dockerized the GBM measurements for h2o, xgboost and lightgbm (both CPU and GPU versions). The repo linked in \nthe paragraph above will contain all further development w.r.t. GBM implementations. GBMs are typically the most accurate algos\nfor supervised learning on structured/tabular data and are therefore of main interest to me \n(e.g. 
compared with the other 3 algos discussed in this current benchmark - linear models, random forests and neural networks), \nand the dockerization makes it easier to keep that other repo up to date with tests on the newest versions of the tools and\npotentially adding new ML tools. **Therefore this new [GBM-perf](https://github.com/szilard/GBM-perf) repo can be considered as\na \"successor\" of the current one.**\n\n### Deep neural networks\n\nDeep learning has been extremely successful on a few classes of data/machine learning problems such as those involving images, \nspeech and text (supervised learning) and games (reinforcement learning).\nHowever, it seems that in \"traditional\" machine learning problems such as fraud detection, credit scoring or churn,\ndeep learning is not as successful and provides lower accuracy than random forests or gradient boosting machines. \nMy experiments (November 2015) on the airline dataset used in this repo and also on another \ncommercial dataset support this [conjecture](https://github.com/szilard/benchm-ml/issues/28), \nbut unfortunately most of the hype surrounding deep learning and \"artificial intelligence\" overwhelms this reality,\nand there are only a few references in this direction e.g. \n[here](https://www.quora.com/Why-is-xgboost-given-so-much-less-attention-than-deep-learning-despite-its-ubiquity-in-winning-Kaggle-solutions/answer/Tianqi-Chen-1),\n[here](https://speakerdeck.com/datasciencela/tianqi-chen-xgboost-implementation-details-la-workshop-talk?slide=28)\nor [here](https://www.youtube.com/watch?v=8KzjARKIgTo#t=28m15s).\n\nHere are the results of a few fully connected network architectures \n[trained](4-DL/1-h2o.R)\nwith various optimization schemes (adaptive, rate annealing, momentum etc.) 
\nand various regularizers (dropout, L1, L2) \nusing H2O with early stopping on the 10M dataset:\n\nParams                                                               |  AUC  |  Time (s) | Epochs \n---------------------------------------------------------------------|-------|-----------|----------\ndefault: `activation = \"Rectifier\", hidden = c(200,200)`             | 73.1  |    270    |  1.8\n`hidden = c(50,50,50,50), input_dropout_ratio = 0.2`                 | 73.2  |    140    |  2.7\n`hidden = c(50,50,50,50)`                                            | 73.2  |    110    |  1.9\n`hidden = c(20,20)`                                                  | 73.1  |    100    |  4.6\n`hidden = c(20)`                                                     | 73.1  |    120    |  6.7\n`hidden = c(10)`                                                     | 73.2  |    150    |  12\n`hidden = c(5)`                                                      | 72.9  |    110    |  9.3\n`hidden = c(1)` (~logistic regression)                               | 71.2  |    120    |  13\n`hidden = c(200,200), l1 = 1e-5, l2 = 1e-5`                          | 73.1  |    260    |  1.8\n`RectifierWithDropout, c(200,200,200,200), dropout=c(0.2,0.1,0.1,0)` | 73.3  |    440    |  2.0\n`ADADELTA rho = 0.95, epsilon = 1e-06`                               | 71.1  |    240    |  1.7\n` rho = 0.999, epsilon = 1e-08`                                      | 73.3  |    270    |  1.9\n`adaptive = FALSE` default: `rate = 0.005, decay = 1, momentum = 0`  | 73.0  |    340    |  1.1\n`rate = 0.001, momentum = 0.5 / 1e5 / 0.99`                          | 73.2  |    410    |  0.7\n`rate = 0.01, momentum = 0.5 / 1e5 / 0.99`                           | 73.3  |    280    |  0.9\n`rate = 0.01, rate_annealing = 1e-05, momentum = 0.5 / 1e5 / 0.99`   | 73.5  |    360    |  1\n`rate = 0.01, rate_annealing = 1e-04, momentum = 0.5 / 1e5 / 0.99`   | 72.7  |    3700   |  8.7\n`rate = 0.01, rate_annealing = 1e-05, momentum = 
0.5 / 1e5 / 0.9`    | 73.4  |    350    |  0.9\n\n\nIt looks like the neural nets are underfitting and are not able to capture the same structure in the\ndata as the random forests/GBMs can (AUC 80-81). Therefore adding various forms of regularization\ndoes not improve accuracy (see above). Note also that by using early stopping (based on the decrease of\naccuracy on a validation dataset during training iterations) the training takes relatively short time\n(compared to RF/GBM), also a sign of effectively low model complexity.\nRemarkably, the nets with more layers (deep) are not performing better than a simple net with\n1 hidden layer and a small number of neurons in that layer (10). \n\nTiming on the 1M dataset of various tools (fully connected networks, 2 hidden layers, 200 neurons each, ReLU,  \nSGD, learning rate 0.01, momentum 0.9, 1 epoch), code \n[here](https://github.com/szilard/benchm-ml/tree/master/4-DL):\n\nTool         | Time GPU  | Time CPU\n-------------|-----------|-----------\nh2o          |    -      |   50\nmxnet        |    35     |   65\nkeras+TF     |    35     |   60\nkeras+theano |    25     |   70\n\n(GPU = p2.xlarge, CPU = r3.8xlarge 32c for h2o/mxnet, p2.xlarge 4c for TF/theano, theano uses 1 core only)\n\nDespite not being great (in accuracy) on tabular data of the type above, \ndeep learning has been a blast in domains such as image, speech and somewhat text,\nand I'm planing to do a [benchmark of tools](https://github.com/szilard/benchm-dl) \nin that area as well (mostly conv-nets and RNNs/LSTMs).\n\n\n\n### Big(ger) Data\n\nWhile my primary interest is in machine learning on datasets of 10M records, you might be interested in \nlarger datasets. Some problems might need a cluster, though there has been a tendency recently \nto solve every problem with distributed computing, needed or not. As a reminder, sending data\nover a network vs using shared memory is a big speed difference. 
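A quick back-of-envelope calculation illustrates the gap; the bandwidth figures below are illustrative nominal values (assumptions, not measurements from this benchmark):

```python
# Back-of-envelope: time for one pass over a data matrix, moved over the
# network vs read from local RAM. All bandwidth figures are assumed
# nominal values, not measurements.

data_gb = 10.0               # size of the data matrix in GB (assumption)
network_gbit_per_s = 10.0    # 10 Gbit/s Ethernet link (assumption)
ram_gb_per_s = 20.0          # typical server memory bandwidth, GB/s (assumption)

t_network = data_gb / (network_gbit_per_s / 8)  # convert Gbit/s -> GB/s
t_ram = data_gb / ram_gb_per_s

print(f"network: {t_network:.1f} s per pass")   # 8.0 s
print(f"RAM:     {t_ram:.2f} s per pass")       # 0.50 s
print(f"ratio:   {t_network / t_ram:.0f}x")     # 16x
```

And this ignores serialization, protocol and coordination overhead, which in practice widens the gap further, especially for iterative algorithms that sweep the data many times.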
Also, several popular distributed systems
have significant computation and memory overhead, and, more fundamentally, their communication patterns
(e.g. map-reduce style) are not the best fit for many of the machine learning algos.

#### Larger Data Sizes (on a Single Server)

For linear models, most tools, including single-core R, still work well on 100M records
on a single server (an r3.8xlarge instance with 32 cores and 250GB RAM was used here).
(A 10x replication of the 10M dataset was used, therefore AUC-vs-size comparisons would be invalid
and are not reported here.)

Linear models, 100M rows:

Tool    |   Time[s]   |   RAM[GB]
--------|-------------|-----------
R       |   1000      |    60
Spark   |    160      |    120
H2O     |    40       |    20
VW      |    150      |

Some tools can handle 1B records on a single machine
(in fact VW never runs out of memory, so if longer runtimes are acceptable,
you can go further still on one machine).

Linear models, 1B rows:

Tool    |   Time[s]   |   RAM[GB]
--------|-------------|-----------
H2O     |    500      |    100
VW      |    1400     |

For tree-based ensembles (RF, GBM), H2O and xgboost can train on 100M records
on a single server, though the training times become several hours:

RF/GBM, 100M rows:

Algo    |Tool     |   Time[s]   |   Time[hr]  | RAM[GB]
--------|---------|-------------|-------------|----------
RF      | H2O     |   40000     |     11      | 80
.       | xgboost |   36000     |     10      | 60
GBM     | H2O     |   35000     |     10      | 100
.       | xgboost |   110000    |     30      | 50

One usually hopes for (and most often gets) much better accuracy in exchange for the ~1000x longer training time vs linear models.


#### Distributed Systems

Some quick results:

H2O logistic regression runtime (sec):

size    |  1 node |  5 nodes
--------|---------|----------
100M    |   42    |   9.9
1B      |  480    |   101

H2O RF runtime (sec) (5 trees):

size    |  1 node |  5 nodes
--------|---------|----------
10M     |   42    |   41
100M    |  405    |   122



## Summary

As of January 2018:

When I started this benchmark in March 2015, the "big data" hype was all the rage, and the fanboys wanted to do
machine learning on "big data" with distributed computing (Hadoop, Spark etc.), while for the datasets most people had,
single-machine tools were not only good enough, but also faster, with more features and fewer bugs. I gave quite a few
talks at conferences and meetups about these benchmarks starting in 2015,
and while at the beginning I had several people asking angrily about my results on Spark, by 2017 most people had realized
that single-machine tools are much better for solving most of their ML problems. While Spark is a decent tool for ETL on raw data (which
often is indeed "big"), its ML libraries are garbage, outperformed (in training time, memory footprint and
even accuracy) by much better tools by orders of magnitude.
Furthermore, the increase in available RAM over the last years in servers and also in the cloud,
and the fact that for machine learning one typically refines the raw data
into a much smaller data matrix, make the mostly single-machine, highly performing tools
(such as xgboost, lightgbm, VW, but also h2o) the best choice for most
practical applications now. The big data hype is finally over.

What's happening now is a new wave of hype, namely deep learning.
The fanboys now think deep learning (or as they miscall it:
AI) is the best solution to all machine learning problems. While deep learning has indeed been extremely
successful on a few classes of data/machine learning problems such as those involving images,
speech and somewhat text (supervised learning) and games/virtual environments (reinforcement learning),
on more "traditional" machine learning problems encountered in business such as fraud detection, credit scoring or churn
(with structured/tabular data) deep learning is less successful and provides lower accuracy
than random forests or gradient boosting machines (GBM). Therefore, lately I have been concentrating my benchmarking efforts
mostly on GBM implementations and
I have started a new github repo [GBM-perf](https://github.com/szilard/GBM-perf) that's more "focused" and lean
and also uses more modern tools (such as docker) to make the benchmarks more maintainable and reproducible. Also, it has become
apparent recently that GPUs can be a powerful computing platform for GBMs too, and the new repo includes benchmarks
of the available GPU implementations as well.

I started these benchmarks mostly out of curiosity and the desire to learn (and also in order to be able to choose
good tools for my projects). It's been quite an experience, and I'd like to thank all the folks (especially the developers of
the tools) for helping me in tuning and getting the most out of their ML tools.
As a side effect of this work I had the pleasure to be invited to talk at several conferences
(KDD, R-finance, useR!, eRum, H2O World, Crunch, Predictive Analytics World, EARL, Domino Data Science Popup, Big Data Day LA,
Budapest Data Forum) and at over 10 meetups, e.g.:

- KDD **Invited Talk** - Machine Learning Software in Practice: Quo Vadis? - Halifax, Canada, August 2017
- R in Finance **Keynote** - No-Bullshit Data Science - Chicago, May 2017
- LA Data Science Meetup - Machine Learning in Production - Los Angeles, May 2017
- useR! 2016 - Size of Datasets for Analytics and Implications for R - Stanford, June 2016
- H2O World - Benchmarking open source ML platforms - Mountain View, November 2015
- LA Machine Learning Meetup - Benchmarking ML Tools for Scalability, Speed and Accuracy - LA, June 2015

(see code/slides and some video recordings [here](https://github.com/szilard/benchm-ml-talks)). These talks/materials are also
probably the best place to get a grasp of the findings of this benchmark (and if you want to pick the one that is most
up to date and summarizes the most, watch the
[video of my KDD talk](https://www.youtube.com/watch?v=8wyOwUNw7D8&list=PLliTSxmRFGVO6Vag6FX5Jfq-RG-kUtKFZ&index=11)).
The work goes on, expect more results...

## Citation

If `benchm-ml` was useful for your research, please consider citing it, for instance using the latest commit:

```
@misc{benchm-ml,
	author = {Pafka, Szilard},
	title = {benchm-ml},
	publisher = {GitHub},
	year = {2019},
	journal = {GitHub repository},
	url = {https://github.com/szilard/benchm-ml},
	howpublished = {\url{https://github.com/szilard/benchm-ml}},
	commit = {13325ce3edd7c902390197f43bcc7938c306bbe3}
}
```