{"id":16597834,"url":"https://github.com/szilard/gbm-perf","last_synced_at":"2025-03-06T19:42:17.793Z","repository":{"id":48168325,"uuid":"92568788","full_name":"szilard/GBM-perf","owner":"szilard","description":"Performance of various open source GBM implementations","archived":false,"fork":false,"pushed_at":"2024-06-20T11:42:20.000Z","size":15175,"stargazers_count":215,"open_issues_count":34,"forks_count":28,"subscribers_count":22,"default_branch":"master","last_synced_at":"2024-10-13T00:06:43.708Z","etag":null,"topics":["benchmark","gbm","gradient-boosting-machine","h2oai","lightgbm","machine-learning","xgboost"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/szilard.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-05-27T03:45:57.000Z","updated_at":"2024-07-10T22:39:25.000Z","dependencies_parsed_at":"2024-06-21T01:46:08.628Z","dependency_job_id":"40870a1d-e264-405c-92b5-fb44447dad59","html_url":"https://github.com/szilard/GBM-perf","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/szilard%2FGBM-perf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/szilard%2FGBM-perf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/szilard%2FGBM-perf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/szilard%2FGBM-perf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/owners/szilard","download_url":"https://codeload.github.com/szilard/GBM-perf/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242277207,"owners_count":20101530,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","gbm","gradient-boosting-machine","h2oai","lightgbm","machine-learning","xgboost"],"created_at":"2024-10-12T00:06:45.419Z","updated_at":"2025-03-06T19:42:17.767Z","avatar_url":"https://github.com/szilard.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# GBM Performance\n\nPerformance of the top/most widely used open source gradient boosting machines (GBM)/ boosted trees (GBDT)\nimplementations (h2o, xgboost, lightgbm, catboost) \non the airline dataset (100K, 1M and 10M records) and with `100` trees, depth `10`, learning rate `0.1`.\n\n\n\n## Popularity of GBM implementations\n\nPoll conducted via twitter (April, 2019):\n\n![](poll.png)\n\nMore recent twitter poll (September, 2020):\n\n![](poll2.png)\n\nJune 2024:\n\n![](poll3.png)\n\n\n\n## How to run/reproduce the benchmark\n\nInstalling to latest software versions and running/timing is easy and fully automated with docker: \n\n### CPU\n\n(requires docker)\n\n```\ngit clone https://github.com/szilard/GBM-perf.git\ncd GBM-perf/cpu\nsudo docker build --build-arg CACHE_DATE=$(date +%Y-%m-%d) -t gbmperf_cpu .\nsudo docker run --rm gbmperf_cpu\n```\n\n### GPU\n\n(requires docker, NVIDIA drivers and the `nvidia-docker` utility)\n\n```\ngit clone https://github.com/szilard/GBM-perf.git\ncd 
GBM-perf/gpu\nsudo docker build --build-arg CACHE_DATE=$(date +%Y-%m-%d) -t gbmperf_gpu .\nsudo nvidia-docker run --rm gbmperf_gpu\n```\n\n\n\n## Results\n\n### CPU \n\nr4.8xlarge (32 cores, but run on physical cores only/no hyperthreading) with software as of 2024-06-04:\n\nTool              | Time[s] 100K | Time[s] 1M  |  Time[s] 10M  |   AUC 1M  |   AUC 10M\n------------------|--------------|-------------|---------------|-----------|------------\nh2o               |   11         |   12        |     60        |   0.762   |   0.776\n**xgboost**       |   **0.4**    |   **2.7**   |     40        |   0.749   |   0.757\n**lightgbm**      |   2.3        |   4.0       |     **20**    |   0.765   |   0.792\ncatboost          |   1.9        |   7.0       |     70        |   0.734?! |   0.735?! \n\nResults on newer hardware (m7i/c7i/r7i) [here](https://github.com/szilard/GBM-perf/issues/56) \n(TLDR: ~2x speedup on the newer hardware).\n\n\n### GPU\n\np3.2xlarge (1 GPU, NVIDIA V100) with software as of 2024-06-06 (and CUDA 12.5):\n\nTool            | Time[s] 100K | Time[s] 1M  |  Time[s] 10M  |   AUC 1M  |   AUC 10M\n----------------|--------------|-------------|---------------|-----------|------------\nh2o xgboost     |   6.4        |    14       |     42        |   0.749   |   0.756  \n**xgboost**     |   **0.7**    |   **1.3**   |   **5**       |   0.748   |   0.756\nlightgbm        |   7          |    9        |     40        |   0.766   |   0.791\ncatboost        |   1.6        |    3.4      |     23        | 0.735 ?!  |   0.737 ?!\n\nResults on newer hardware (A10/A100/H100) [here](https://github.com/szilard/GBM-perf/issues/57)\n(TLDR: more modest speedups compared to neural nets, ~1.3x for XGBoost on the largest data). \n\n\n## Additional results \n\nSome additional studies obtained \"manually\" (not fully automated with docker as the main benchmark above).\nThanks [@Laurae2](https://github.com/Laurae2) for lots of help with some of these. 
\n\n### Faster CPUs\n\nAWS now has better CPUs than r4.8xlarge (Xeon E5-2686 v4 2.30GHz, 32 cores), for example with higher CPU frequency \nc5.9xlarge (Xeon Platinum 8124M 3.00GHz, 36 cores) or more cores \nm5.12xlarge (Xeon Platinum 8175M 2.50GHz, 48 cores).\n\nc5.9xlarge and m5.12xlarge are typically 20-50% faster than r4.8xlarge, for larger data more cores (m5.12xlarge) is the best, \nfor smaller data high-frequency CPU (c5.9xlarge) is the best. Nevertheless, the ranking of libs by\ntraining time stays the same for a given data size when changing CPU. More details\n[here](https://github.com/szilard/GBM-perf/issues/13).\n\nEven more recently a CPU with both higher frequency and more cores became available on AWS: c5.12xlarge (Xeon Platinum 8275CL 3.00GHz, 48 cores)\nand also instances with 2 of these CPUs (but see results for multi-socket systems below): c5.24xlarge and c5.metal. \nResults for c5.metal are [here](https://github.com/szilard/GBM-perf/issues/41).\n\n**2024 update:** latest results for the newest CPUs (c7i.metal-48xl and c7a.metal-48xl) are [here](https://github.com/szilard/GBM-perf/issues/59).\n\n\n### Multi-core scaling (CPU)\n\nWhile GBM trees must be grown sequentially (as building each tree depends on the results of the previous ones), GBM training can be parallelized e.g. \nby parallelizing the computation in each split (more exactly the histogram calculations). Modern CPUs have many cores, but the scaling of these GBM implementations is far \nfrom being proportional to the number of cores. Furthermore, it has long been known (since 2016) that xgboost (and later lightgbm) slow down (!) on systems\nwith 2 or more CPU sockets or when hyperthreaded cores are used. These problems have been very recently mitigated (2020), but it is\nstill usually best to restrict your training process to the physical cores (avoid hyperthreading) and only 1 CPU socket (if the server has\n2 or more sockets). 
\n\nEven if only physical (no hyperthreading) CPU cores are used on 1 socket only, the speedup for example from 1 core to 16 cores \nis not 16x, but (on r4.8xlarge):\n\ndata size   |  h2o |  xgboost | lightgbm | catboost\n------------|------|----------|----------|----------\n0.1M        |  3x  |    6.5x  |   1.5x   |      3.5x\n 1M         |  8x  |    6.5x  |     4x   |      6x\n 10M        | 24x  |    5x    |  7.5x    |      8x\n\nwith more details [here](https://github.com/szilard/GBM-perf/issues/29#issuecomment-691646736). In fact the scaling was worse until\nvery recently, for example xgboost was at 2.5x at 1M rows (vs 6.5x now) before several optimizations have been implemented in 2020. \n\n**2024 update:** latest results for the newest CPUs (c7i.metal-48xl and c7a.metal-48xl) with \nmulticore scaling up to 48 physical cores (no hyperthreading) and beyond (with hyperthreading) \nare [here](https://github.com/szilard/GBM-perf/issues/59).\n\n\n\n### Multi-socket CPUs\n\nMost high-end servers have nowadays more than 1 CPU on the motherboard. For example c5.18xlarge has 2 CPUs\n(2x of the c5.9xlarge CPUs mentioned above), same for r4.16xlarge or m5.24xlarge. There are even EC2 instances with \n4 CPUs e.g. x1.32xlarge (128 cores) or more.\n\nOne would think more CPU cores means higher training speed, though because of RAM topology and NUMA, most of the above tools\nused to run slower on 2 CPUs than 1 CPU (!) until very recently (2020). The slowdown was sometimes pretty \ndramatic, e.g. 2x for lightgbm or 3-5x for xgboost even for the largest data in this benchmark. \nVery recently these effects have been mitigated by several optimizations in lightgbm and even more notably in xgboost. \nMore details on the NUMA issue \n[here](https://github.com/szilard/GBM-perf/issues/13),\n[here](https://github.com/szilard/GBM-multicore) and\n[here](https://github.com/szilard/GBM-perf/issues/29).\n\nCurrently, the difference in training speed e.g. 
on r4.16xlarge (2 sockets, 16 cores + 16 HT each, so total of 64 cores) between \n16 physical cores and 64 total cores is:\n\ndata size   |  h2o |  xgboost | lightgbm | catboost\n------|-----|----------|----------|----------\n0.1M  |  -40%     |    -50%        |        -70%    |      15%\n   1M  |  -15 %    |    -2%        |     -60%         |      -20%\n 10M  | 25%   |    35%          |  -20%          |      10%\n\nwhere negative numbers mean on 64 cores it is slower than on 16 cores (by that much %) (e.g. -50% means a decrease in speed by 50% that is\na doubling of training time). These numbers were much much worse until very recently (2020), for example training time (sec) for xgboost 1M rows:\n\ncores       |  May 2019  | Sept 2020\n------------|------------|------------\n1           |    30      |   34\n16 (1so)    |    12      |   5.1\n64 (2so+HT) |   120      |   5.2\n\nthat is xgboost was 10x slower on 64 cores vs 16 cores and it was slower on 64 cores vs even 1 core (!). One can see that the recent\noptimizations have improved both the multicore scaling and the NUMA (multi-socket) issue.\n\n**2024 update:** latest results for the newest CPUs (c7i.metal-48xl and c7a.metal-48xl) \nwith 192 cores (with 2 CPU sockets) are [here](https://github.com/szilard/GBM-perf/issues/59).\n\n\n\n### 100M records and RAM usage\n\nResults on the fastest CPU (most cores, 1 socket, see above why this is the fastest) and the fastest GPU on EC2.\nThe data is obtained by replicating the 10M dataset 10x, so the AUC is not indicative of a learning curve, just used to\nsee if it is equal approximately the 10M AUC (it should be).\n\nFor the CPU runs, \"RAM train\" is measured as the increase in memory usage during training (on top of the RAM used by the data). 
\nFor the GPU runs, the \"GPU memory\" usage is the total GPU memory used (cannot separate training from copies of the data),\nwhile the \"extra RAM\" is the additional RAM used by some of the tools (on the CPU) if any.\n\nCPU (m5.12xlarge):\n\nTool              | time [s]   | AUC       | RAM train [GB]\n------------------|------------|-----------|-------------------------\nh2o               | 520        |  0.775    |   8\nxgboost           | 510        |  0.751    |  15\n**lightgbm ohe**  | **310**    |  0.774    |   **5**\ncatboost          | 930        |   0.736   |  50\n\n\nGPU (Tesla V100):\n\nTool              | time [s]    |  AUC      | GPU mem [GB]   | extra RAM [GB]\n------------------|-------------|-----------|----------------|----------------\nh2o xgboost       | 270         | 0.755     | 4              | 30\n**xgboost**       | **80**      | 0.756     | 6              | **0**\nlightgbm ohe      | 400         | 0.774     | 3              | 6\ncatboost          | crash (OOM) |           | \u003e16            | 14\n\ncatboost GPU crashes out-of-memory on the 16GB GPU.\n\nh2o xgboost on GPU is slower than native xgboost on GPU and also adds\na lot of overhead in RAM usage (\"extra RAM\") (this must be due to some pre- and post-processing of data in h2o as one can\nsee by looking at the GPU utilization patterns as discussed next).\n\nMore details [here](https://github.com/szilard/GBM-perf/issues/14).\n\n**2024 update:** latest CPU results for the newest c7i.metal-48xl (using 48 physical cores on 1 socket, no hyperthreading,\nno NUMA/2 sockets):\n\nTool              | time [s]   \n------------------|-----------\nxgboost           | 190       \n**lightgbm**      | **55**   \n\nThe 2.7x speedup in CPU XGBoost (vs results above) are due to the significant improvement in multicore scaling (2020),\nfurther improvements in speed and also the increase in number of cores of the top CPUs. 
\nThe 5.6x speedup in CPU LightGBM are due to implementation of direct handling of categorical variables and increase in\nnumber of cores of the top CPUs (LightGBM benefits more than XGBoost with many CPU cores for large datasets).\n\n\n### GPU utilization patterns\n\nFor the GPU runs, it is interesting to observe the GPU utilization patterns and also the CPU utilization meanwhile\n(usually 1 CPU thread).\n\nxgboost uses GPU at ~80% and 1 CPU core at 100%.\n\nh2o xgboost shows 3 phases: first only using CPU at ~30% (all cores) and no GPU, then GPU at ~70% and CPU at 100%, then\nno GPU and CPU at 100%. This means 3-4x longer training time vs native xgboost. \n\nlightgbm uses GPU at 5-10% and meanwhile CPU at 100% (all cores). It can be made to use 1 CPU core only (`nthread = 1`), but\nthen it may be slower.\n\ncatboost uses GPU at ~80% and 1 CPU core at 100%. Unlike the other tools catboost takes all the GPU memory available when it\nstarts training no matter of the data size (so we don't know how much memory it needs by using the standard monitoring tools).\n\nMore details [here](https://github.com/szilard/GBM-perf/issues/11).\n\n\n### Spark MLlib \n\nIn my previous broader benchmark of ML libraries, Spark MLlib GBT (and random forest as well) performed very poorly \n(10-100x running time vs top libs, 10-100x memory usage and an accuracy issue for larger data) and therefore it\nwas not included in the current GBM/GBT benchmark. 
However, people might still be interested if there have been any\nimprovements since 2016 and Spark 2.0.\n\nWith Spark 2.4.2 as of 2019-05-05  the accuracy issue for larger data has been fixed, but the\nspeed and the memory footprint did not improve:\n\nsize  | time lgbm [s] | time spark [s] | ratio | AUC lgbm | AUC spark\n------|---------------|----------------|-------|----------|-------------\n100K  |           2.4 |           1020 | 425   |    0.730 | 0.721\n1M    |           5.2 |           1380 | 265   |    0.764 | 0.748\n10M   |            42 |           8390 | 200   |    0.774 | 0.755\n\n(compared to lightgbm CPU) (Spark code [here](https://github.com/szilard/GBM-perf/tree/master/analysis/spark))\n\nSo Spark MLlib GBT is still 100x slower than the top tools. In case you are wondering if more nodes or\nbigger data would help, the answer is nope (see below).\n\n#### Spark MLlib on 100M records and RAM usage\n\nBesides being slow, Spark also uses 100x RAM compared to the top tools. In fact, on 100M records \n(20GB after being loaded from disk and cached in RAM) it crashes out-of-memory even on servers with almost 1 TB RAM.\n\n      |       | 100M      |       |            | 10M      |       |  \n----- | ----- | --------- | ----- | ---------- | -------- | ----- | --\ntrees | depth | time [s]  | AUC   | RAM [GB]   | time [s] | AUC   | RAM [GB]\n1     | 1     | 1150      | 0.634 | 620        | 70       | 0.635 | 110\n1     | 10    | 1350      | 0.712 | 620        | 90       | 0.712 | 112\n10    | 10    | 7850      | 0.731 | 780        | 830      | 0.731 | 125\n100   | 10    | crash OOM |       | \u003e960 (OOM) | 8390     | 0.755 | 230\n\n(100M ran on x1e.8xlarge [32 cores, 960GB RAM], 10M ran on r4.8xlarge [32 cores, 240GB RAM])\n\n(compare this with 100M records 100 trees depth 10, lightgbm 5GB RAM usage)\n\nMore details [here](https://github.com/szilard/GBM-perf/issues/18). 
\n\nNote the situation is much better for linear models in Spark MLlib, only 3-4x slower and 10x more memory\nfootprint vs h2o for example, see results [here](https://github.com/szilard/GBM-perf/issues/20) (and training\nlinear models is much much faster than trees, so training times are reasonable even for large data).\n\n#### Spark on a cluster\n\nResults on an EMR cluster with master+10 slave nodes and comparison with local mode on 1 server (and \n\"cluster\" with 1 master+1 slave). To run in reasonable time only 10 trees (depth 10) have been used.\n\nsize | hw | nodes | cores | partitions | time [s] | RAM [GB] | avail RAM [GB]\n-- | -- | -- | -- | -- | -- | -- | --\n10M | local | r4.8xl | 32 | 32 | 830 | 125 | 240\n10M | Cluster_1 | r4.8xl | 32 | 64 | 1180 | 73 | 240\n10M | Cluster_10 | r4.8xl | 320 | 320 (m) | 330 |   | 2400\n100M | local | x1e.8xl | 32 |   | 7850 | 780 | 960\n100M | Cluster_10 | r4.8xl | 320 | 585 | 1825 | 10*72 | 2400\n\n100M records data is \"big\" enough for Spark to be in the \"at scale\" modus operandi. However, the \ncomputation speed and memory footprint inefficiencies of the algorithm/implementation are so\nhuge that no cluster of any size can really help. Furthermore larger data (billions) would mean even more \nprohibitively slow training (many hours/days) for any reasonable cluster size (remember, the timings\nabove are for 10 trees, any decent GBM would need at least 100 trees).\n\nAlso, the fact that Spark has such a huge memory footprint means that one can run e.g. 
lightgbm\ninstead on much less RAM, so that even larger datasets would fit in the RAM of a single server.\nResults for lightgbm for comparison with the above Spark cluster results (10 trees):\n\nsize | hw | cores | time [s] | AUC | RAM [GB] | avail RAM [GB]\n-- | -- | -- | -- | -- | -- | --\n10M | r4.8xl | 16 (m) | 7 | 0.743 | 4 | 240\n100M | r4.8xl | 16 (m) | 60 | 0.743 | 13(d)+5 | 240\n\nMore details [here](https://github.com/szilard/GBM-perf/issues/21).\n\n\n\n## Recommendations\n\nIf you **don't have a GPU, lightgbm and xgboost** (CPU) train the fastest.\n\nIf you **have a GPU, xgboost** (GPU) is very fast (and depending on the data, your hardware etc.\noften faster than the above mentioned lightgbm/xgboost on CPU).\n\nIf you consider deployment, **h2o has the best ways to deploy** as a real-time\n(fast scoring) application.\n\nNote, however, there are a lot more other criteria to consider when you choose which tool\nto use, e.g.:\n\n![](comparison_table.png)\n\nYou can find more info in my talks at several conferences and meetups with many of them having video\nrecordings available, for example my talk\nat Berlin Buzzwords in 2019, video recording [here](https://www.youtube.com/watch?v=qjuizRba3ZQ), slides\n[here](https://bit.ly/szilard-talk-berlbuzz19), or a more updated talk from November 2020 at the LA Data Science Meetup,\nvideo recording [here](https://www.youtube.com/watch?v=ecUUUdisKAc), \nslides [here](https://docs.google.com/presentation/d/1hRJveGyFArYzfpPSD9XeOi6oCHRjrj12yx4MDIrtPZg/edit).\n\n**2024 updates** in a university seminar talk, \nslides [here](https://docs.google.com/presentation/d/1WApb_qrBX8kXaW4JFsiR4lLSbaKWESeREERt0AM1vZM/edit#slide=id.g53965d405_00).\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fszilard%2Fgbm-perf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fszilard%2Fgbm-perf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fszilard%2Fgbm-perf/lists"}