{"id":13742234,"url":"https://github.com/ray-project/xgboost_ray","last_synced_at":"2025-04-12T23:29:19.040Z","repository":{"id":37925197,"uuid":"303662480","full_name":"ray-project/xgboost_ray","owner":"ray-project","description":"Distributed XGBoost on Ray","archived":false,"fork":false,"pushed_at":"2024-03-02T07:26:52.000Z","size":483,"stargazers_count":133,"open_issues_count":44,"forks_count":33,"subscribers_count":12,"default_branch":"master","last_synced_at":"2024-05-22T13:31:20.208Z","etag":null,"topics":["dask","data-science","kaggle","machine-learning","modin","xgboost"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ray-project.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-10-13T10:09:17.000Z","updated_at":"2024-06-18T15:35:28.171Z","dependencies_parsed_at":"2024-01-05T00:57:56.503Z","dependency_job_id":"80e11dc9-1aeb-4415-83ef-3dbab61ae793","html_url":"https://github.com/ray-project/xgboost_ray","commit_stats":{"total_commits":263,"total_committers":18,"mean_commits":14.61111111111111,"dds":0.714828897338403,"last_synced_commit":"9fc9a9e27eb2d9ae12a29ad57d5b65476b64ecfa"},"previous_names":[],"tags_count":25,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ray-project%2Fxgboost_ray","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ray-project%2Fxgboost_ray/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ray-project%2Fxgboost_
ray/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ray-project%2Fxgboost_ray/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ray-project","download_url":"https://codeload.github.com/ray-project/xgboost_ray/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248646795,"owners_count":21139073,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dask","data-science","kaggle","machine-learning","modin","xgboost"],"created_at":"2024-08-03T05:00:24.627Z","updated_at":"2025-04-12T23:29:19.009Z","avatar_url":"https://github.com/ray-project.png","language":"Python","funding_links":[],"categories":["Models and Projects"],"sub_categories":["Ray + X (integration)"],"readme":"\u003c!--$UNCOMMENT(xgboost-ray)=--\u003e\n\n# Distributed XGBoost on Ray\n\u003c!--$REMOVE--\u003e\n![Build Status](https://github.com/ray-project/xgboost_ray/workflows/pytest%20on%20push/badge.svg)\n[![docs.ray.io](https://img.shields.io/badge/docs-ray.io-blue)](https://docs.ray.io/en/master/xgboost-ray.html)\n\u003c!--$END_REMOVE--\u003e\nXGBoost-Ray is a distributed backend for\n[XGBoost](https://xgboost.readthedocs.io/en/latest/), built\non top of\n[distributed computing framework Ray](https://ray.io).\n\nXGBoost-Ray\n\n- enables [multi-node](#usage) and [multi-GPU](#multi-gpu-training) training\n- integrates seamlessly with distributed [hyperparameter optimization](#hyperparameter-tuning) library [Ray Tune](http://tune.io)\n- comes with advanced [fault tolerance 
handling](#fault-tolerance) mechanisms, and\n- supports [distributed dataframes and distributed data loading](#distributed-data-loading)\n\nAll releases are tested on large clusters and workloads.\n\n## Installation\n\nYou can install the latest XGBoost-Ray release from PyPI with pip:\n\n```bash\npip install \"xgboost_ray\"\n```\n\nIf you'd like to install the latest master, use this command instead:\n\n```bash\npip install \"git+https://github.com/ray-project/xgboost_ray.git#egg=xgboost_ray\"\n```\n\n## Usage\n\nXGBoost-Ray provides a drop-in replacement for XGBoost's `train`\nfunction. To pass data, instead of using `xgb.DMatrix` you will\nhave to use `xgboost_ray.RayDMatrix`. You can also use a scikit-learn\ninterface, described in the scikit-learn API section below.\n\n\nJust as in the original `xgb.train()` function, the\n[training parameters](https://xgboost.readthedocs.io/en/stable/parameter.html)\nare passed as the `params` dictionary.\n\nRay-specific distributed training parameters are configured with a\n`xgboost_ray.RayParams` object. 
For instance, you can set\nthe `num_actors` property to specify how many distributed actors\nyou would like to use.\n\nHere is a simplified example (which requires `sklearn`):\n\n**Training:**\n\n```python\nfrom xgboost_ray import RayDMatrix, RayParams, train\nfrom sklearn.datasets import load_breast_cancer\n\ntrain_x, train_y = load_breast_cancer(return_X_y=True)\ntrain_set = RayDMatrix(train_x, train_y)\n\nevals_result = {}\nbst = train(\n    {\n        \"objective\": \"binary:logistic\",\n        \"eval_metric\": [\"logloss\", \"error\"],\n    },\n    train_set,\n    evals_result=evals_result,\n    evals=[(train_set, \"train\")],\n    verbose_eval=False,\n    ray_params=RayParams(\n        num_actors=2,  # Number of remote actors\n        cpus_per_actor=1))\n\nbst.save_model(\"model.xgb\")\nprint(\"Final training error: {:.4f}\".format(\n    evals_result[\"train\"][\"error\"][-1]))\n```\n\n**Prediction:**\n\n```python\nfrom xgboost_ray import RayDMatrix, RayParams, predict\nfrom sklearn.datasets import load_breast_cancer\nimport xgboost as xgb\n\ndata, labels = load_breast_cancer(return_X_y=True)\n\ndpred = RayDMatrix(data, labels)\n\nbst = xgb.Booster(model_file=\"model.xgb\")\npred_ray = predict(bst, dpred, ray_params=RayParams(num_actors=2))\n\nprint(pred_ray)\n```\n\n### scikit-learn API\n\nXGBoost-Ray also features a scikit-learn API that mirrors the native\nXGBoost scikit-learn API, providing a drop-in\nreplacement. 
The following estimators are available:\n\n- `RayXGBClassifier`\n- `RayXGBRegressor`\n- `RayXGBRFClassifier`\n- `RayXGBRFRegressor`\n- `RayXGBRanker`\n\nExample usage of `RayXGBClassifier`:\n\n```python\nfrom xgboost_ray import RayXGBClassifier, RayParams\nfrom sklearn.datasets import load_breast_cancer\nfrom sklearn.model_selection import train_test_split\n\nseed = 42\n\nX, y = load_breast_cancer(return_X_y=True)\nX_train, X_test, y_train, y_test = train_test_split(\n    X, y, train_size=0.25, random_state=seed\n)\n\nclf = RayXGBClassifier(\n    n_jobs=4,  # In XGBoost-Ray, n_jobs sets the number of actors\n    random_state=seed\n)\n\n# scikit-learn API will automatically convert the data\n# to RayDMatrix format as needed.\n# You can also pass X as a RayDMatrix, in which case\n# y will be ignored.\n\nclf.fit(X_train, y_train)\n\npred_ray = clf.predict(X_test)\nprint(pred_ray)\n\npred_proba_ray = clf.predict_proba(X_test)\nprint(pred_proba_ray)\n\n# It is also possible to pass a RayParams object\n# to the fit/predict/predict_proba methods; this overrides\n# the n_jobs value set during initialization\n\nclf.fit(X_train, y_train, ray_params=RayParams(num_actors=2))\n\npred_ray = clf.predict(X_test, ray_params=RayParams(num_actors=2))\nprint(pred_ray)\n```\n\nThings to keep in mind:\n\n- The `n_jobs` parameter controls the number of actors spawned.\n  You can pass a `RayParams` object to the\n  `fit`/`predict`/`predict_proba` methods as the `ray_params` argument\n  for greater control over resource allocation. Doing\n  so will override the value of `n_jobs` with the value of the\n  `ray_params.num_actors` attribute. For more information, refer\n  to the [Resources](#resources) section below.\n- By default `n_jobs` is set to `1`, which means the training\n  will **not** be distributed. 
Make sure to either set `n_jobs`\n  to a higher value or pass a `RayParams` object as outlined above\n  in order to take advantage of XGBoost-Ray's functionality.\n- After calling `fit`, additional evaluation results (e.g. training time,\n  number of rows, callback results) will be available under the\n  `additional_results_` attribute.\n- XGBoost-Ray's scikit-learn API is based on XGBoost 1.4.\n  While we try to support older XGBoost versions, please note that\n  this library is only fully tested and supported for XGBoost \u003e= 1.4.\n\nFor more information on the scikit-learn API, refer to the [XGBoost documentation](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn).\n\n## Data loading\n\nData is passed to XGBoost-Ray via a `RayDMatrix` object.\n\nThe `RayDMatrix` lazily loads data and stores it sharded in the\nRay object store. The Ray XGBoost actors then access these\nshards to run their training on.\n\nA `RayDMatrix` supports various data and file types, like\nPandas DataFrames, Numpy Arrays, CSV files and Parquet files.\n\nExample loading multiple parquet files:\n\n```python\nimport glob\nfrom xgboost_ray import RayDMatrix, RayFileType\n\n# We can also pass a list of files\npath = list(sorted(glob.glob(\"/data/nyc-taxi/*/*/*.parquet\")))\n\n# This argument will be passed to `pd.read_parquet()`\ncolumns = [\n    \"passenger_count\",\n    \"trip_distance\", \"pickup_longitude\", \"pickup_latitude\",\n    \"dropoff_longitude\", \"dropoff_latitude\",\n    \"fare_amount\", \"extra\", \"mta_tax\", \"tip_amount\",\n    \"tolls_amount\", \"total_amount\"\n]\n\ndtrain = RayDMatrix(\n    path,\n    label=\"passenger_count\",  # Will select this column as the label\n    columns=columns,\n    # ignore=[\"total_amount\"],  # Optional list of columns to ignore\n    filetype=RayFileType.PARQUET)\n```\n\n\u003c!--$UNCOMMENT(xgboost-ray-tuning)=--\u003e\n\n## Hyperparameter Tuning\n\nXGBoost-Ray integrates with \u003c!--$UNCOMMENT{ref}`Ray Tune 
\u003ctune-main\u003e`--\u003e\u003c!--$REMOVE--\u003e[Ray Tune](https://tune.io)\u003c!--$END_REMOVE--\u003e to provide distributed hyperparameter tuning for your\ndistributed XGBoost models. You can run multiple XGBoost-Ray training runs in parallel, each with a different\nhyperparameter configuration, and each training run parallelized by itself. All you have to do is move your training\ncode to a function, and pass the function to `tune.run`. Internally, `train` will detect if `tune` is being used and will\nautomatically report results to tune.\n\nExample using XGBoost-Ray with Ray Tune:\n\n```python\nfrom xgboost_ray import RayDMatrix, RayParams, train\nfrom sklearn.datasets import load_breast_cancer\n\nnum_actors = 4\nnum_cpus_per_actor = 1\n\nray_params = RayParams(\n    num_actors=num_actors,\n    cpus_per_actor=num_cpus_per_actor)\n\ndef train_model(config):\n    train_x, train_y = load_breast_cancer(return_X_y=True)\n    train_set = RayDMatrix(train_x, train_y)\n\n    evals_result = {}\n    bst = train(\n        params=config,\n        dtrain=train_set,\n        evals_result=evals_result,\n        evals=[(train_set, \"train\")],\n        verbose_eval=False,\n        ray_params=ray_params)\n    bst.save_model(\"model.xgb\")\n\nfrom ray import tune\n\n# Specify the hyperparameter search space.\nconfig = {\n    \"tree_method\": \"approx\",\n    \"objective\": \"binary:logistic\",\n    \"eval_metric\": [\"logloss\", \"error\"],\n    \"eta\": tune.loguniform(1e-4, 1e-1),\n    \"subsample\": tune.uniform(0.5, 1.0),\n    \"max_depth\": tune.randint(1, 9)\n}\n\n# Make sure to use the `get_tune_resources` method to set the `resources_per_trial`\nanalysis = tune.run(\n    train_model,\n    config=config,\n    metric=\"train-error\",\n    mode=\"min\",\n    num_samples=4,\n    resources_per_trial=ray_params.get_tune_resources())\nprint(\"Best hyperparameters\", analysis.best_config)\n```\n\nAlso see examples/simple_tune.py for another example.\n\n## Fault 
tolerance\n\nXGBoost-Ray leverages the stateful Ray actor model to\nenable fault-tolerant training. There are currently\ntwo modes implemented.\n\n### Non-elastic training (warm restart)\n\nWhen an actor or node dies, XGBoost-Ray will retain the\nstate of the remaining actors. In non-elastic training,\nthe failed actors will be replaced as soon as resources\nare available again. Only these actors will reload their\nparts of the data. Training will resume once all actors\nare ready for training again.\n\nYou can set this mode in the `RayParams`:\n\n```python\nfrom xgboost_ray import RayParams\n\nray_params = RayParams(\n    elastic_training=False,  # Use non-elastic training\n    max_actor_restarts=2,    # How often are actors allowed to fail\n)\n```\n\n### Elastic training\n\nIn elastic training, XGBoost-Ray will continue training\nwith fewer actors (and on less data) when a node or actor\ndies. The missing actors are staged in the background,\nand are reintegrated into training once they are back and\nhave loaded their data.\n\nThis mode will train on less data for a period of time,\nwhich can impact accuracy. 
In practice, we found these\neffects to be minor, especially for large shuffled datasets.\nThe immediate benefit is that training time is reduced\nsignificantly to almost the same level as if no actors died.\nThus, especially when data loading takes a large part of\nthe total training time, this setting can dramatically speed\nup training times for large distributed jobs.\n\nYou can configure this mode in the `RayParams`:\n\n```python\nfrom xgboost_ray import RayParams\n\nray_params = RayParams(\n    elastic_training=True,  # Use elastic training\n    max_failed_actors=3,    # Only allow at most 3 actors to die at the same time\n    max_actor_restarts=2,   # How often are actors allowed to fail\n)\n```\n\n## Resources\n\nBy default, XGBoost-Ray tries to determine the number of CPUs\navailable and distributes them evenly across actors.\n\nIn the case of very large clusters or clusters with many different\nmachine sizes, it makes sense to limit the number of CPUs per actor\nby setting the `cpus_per_actor` argument. Consider always\nsetting this explicitly.\n\nThe number of XGBoost actors always has to be set manually with\nthe `num_actors` argument.\n\n### Multi GPU training\n\nXGBoost-Ray enables multi GPU training. The XGBoost core backend\nwill automatically leverage NCCL2 for cross-device communication.\nAll you have to do is to start one actor per GPU and set XGBoost's\n`tree_method` to a GPU-compatible option, e.g. `gpu_hist` (see the XGBoost\ndocumentation for more details).\n\nFor instance, if you have 2 machines with 4 GPUs each, you will want\nto start 8 remote actors, and set `gpus_per_actor=1`. There is usually\nno benefit in allocating less (e.g. 
0.5) or more than one GPU per actor.\n\nYou should divide the CPUs evenly across actors per machine, so if your\nmachines have 16 CPUs in addition to the 4 GPUs, each actor should have\n4 CPUs to use.\n\n```python\nfrom xgboost_ray import RayParams\n\nray_params = RayParams(\n    num_actors=8,\n    gpus_per_actor=1,\n    cpus_per_actor=4,   # Divide evenly across actors per machine\n)\n```\n\n### How many remote actors should I use?\n\nThis depends on your workload and your cluster setup.\nGenerally there is no inherent benefit of running more than\none remote actor per node for CPU-only training. This is because\nXGBoost core can already leverage multiple CPUs via threading.\n\nHowever, there are some cases when you should consider starting\nmore than one actor per node:\n\n- For [multi GPU training](#multi-gpu-training), each GPU should have a separate\n  remote actor. Thus, if your machine has 24 CPUs and 4 GPUs,\n  you will want to start 4 remote actors with 6 CPUs and 1 GPU\n  each\n- In a **heterogeneous cluster**, you might want to find the\n  [greatest common divisor](https://en.wikipedia.org/wiki/Greatest_common_divisor)\n  for the number of CPUs.\n  E.g. for a cluster with three nodes of 4, 8, and 12 CPUs, respectively,\n  you should set the number of actors to 6 and the CPUs per\n  actor to 4.\n\n## Distributed data loading\n\nXGBoost-Ray can leverage both centralized and distributed data loading.\n\nIn **centralized data loading**, the data is partitioned by the head node\nand stored in the object store. Each remote actor then retrieves their\npartitions by querying the Ray object store. 
Centralized loading is used\nwhen you pass centralized in-memory dataframes, such as Pandas dataframes\nor Numpy arrays, or when you pass a single source file, such as a single CSV\nor Parquet file.\n\n```python\nfrom xgboost_ray import RayDMatrix\n\n# This will use centralized data loading, as only one source file is specified\n# `label_col` is a column in the CSV, used as the target label\ndtrain = RayDMatrix(\"./source_file.csv\", label=\"label_col\")\n```\n\nIn **distributed data loading**, each remote actor loads their data directly from\nthe source (e.g. local hard disk, NFS, HDFS, S3),\nwithout a central bottleneck. The data is still stored in the\nobject store, but locally to each actor. This mode is used automatically\nwhen loading data from multiple CSV or Parquet files. Please note that\nwe do not check or enforce partition sizes in this case - it is your job\nto make sure the data is evenly distributed across the source files.\n\n```python\nfrom xgboost_ray import RayDMatrix\n\n# This will use distributed data loading, as four source files are specified\n# Please note that you cannot schedule more than four actors in this case.\n# `label_col` is a column in the Parquet files, used as the target label\ndtrain = RayDMatrix([\n    \"hdfs:///tmp/part1.parquet\",\n    \"hdfs:///tmp/part2.parquet\",\n    \"hdfs:///tmp/part3.parquet\",\n    \"hdfs:///tmp/part4.parquet\",\n], label=\"label_col\")\n```\n\nLastly, XGBoost-Ray supports **distributed dataframe** representations, such\nas \u003c!--$UNCOMMENT{ref}`Ray Datasets \u003cdatasets\u003e`--\u003e\u003c!--$REMOVE--\u003e[Ray Datasets](https://docs.ray.io/en/latest/data/dataset.html)\u003c!--$END_REMOVE--\u003e,\n[Modin](https://modin.readthedocs.io/en/latest/) and\n[Dask dataframes](https://docs.dask.org/en/latest/dataframe.html)\n(used with \u003c!--$UNCOMMENT{ref}`Dask on Ray \u003cdask-on-ray\u003e`--\u003e\u003c!--$REMOVE--\u003e[Dask on 
Ray](https://docs.ray.io/en/master/dask-on-ray.html)\u003c!--$END_REMOVE--\u003e).\nHere, XGBoost-Ray will check on which nodes the distributed partitions\nare currently located, and will assign partitions to actors in order to\nminimize cross-node data transfer. Please note that we also assume here\nthat partition sizes are uniform.\n\n```python\nfrom xgboost_ray import RayDMatrix\n\n# This will try to allocate the existing Modin partitions\n# to co-located Ray actors. If this is not possible, data will\n# be transferred across nodes\ndtrain = RayDMatrix(existing_modin_df)\n```\n\n### Data sources\n\nThe following data sources can be used with a `RayDMatrix` object.\n\n| Type                                                             | Centralized loading | Distributed loading |\n|------------------------------------------------------------------|---------------------|---------------------|\n| Numpy array                                                      | Yes                 | No                  |\n| Pandas dataframe                                                 | Yes                 | No                  |\n| Single CSV                                                       | Yes                 | No                  |\n| Multi CSV                                                        | Yes                 | Yes                 |\n| Single Parquet                                                   | Yes                 | No                  |\n| Multi Parquet                                                    | Yes                 | Yes                 |\n| [Ray Dataset](https://docs.ray.io/en/latest/data/dataset.html)   | Yes                 | Yes                 |\n| [Petastorm](https://github.com/uber/petastorm)                   | Yes                 | Yes                 |\n| [Dask dataframe](https://docs.dask.org/en/latest/dataframe.html) | Yes                 | Yes                 |\n| [Modin dataframe](https://modin.readthedocs.io/en/latest/)   
    | Yes                 | Yes                 |\n\n## Memory usage\n\nXGBoost uses a compute-optimized data structure, the `DMatrix`,\nto hold training data. When converting a dataset to a `DMatrix`,\nXGBoost creates intermediate copies and ends up\nholding a complete copy of the full data. The data will be converted\ninto the local data format (on a 64-bit system these are 64-bit floats).\nDepending on the system and original dataset dtype, this matrix can\nthus occupy more memory than the original dataset.\n\nThe **peak memory usage** for CPU-based training is at least\n**3x** the dataset size (assuming dtype `float32` on a 64-bit system)\nplus about **400,000 KiB** for other resources,\nlike operating system requirements and storing of intermediate\nresults.\n\n**Example**\n\n- Machine type: AWS m5.xlarge (4 vCPUs, 16 GiB RAM)\n- Usable RAM: ~15,350,000 KiB\n- Dataset: 1,250,000 rows with 1024 features, dtype float32.\n  Total size: 5,000,000 KiB\n- XGBoost DMatrix size: ~10,000,000 KiB\n\nThis dataset will only just fit on this node for training.\n\nNote that the DMatrix size might be lower on a 32-bit system.\n\n**GPUs**\n\nGenerally, the same memory requirements exist for GPU-based\ntraining. Additionally, the GPU must have enough memory\nto hold the dataset.\n\nIn the example above, the GPU must have at least\n10,000,000 KiB (about 9.6 GiB) memory. However,\nempirically we found that using a `DeviceQuantileDMatrix`\nseems to show more peak GPU memory usage, possibly\nfor intermediate storage when loading data (about 10%).\n\n**Best practices**\n\nIn order to reduce peak memory usage, consider the following\nsuggestions:\n\n- Store data as `float32` or less. More precision is often\n  not needed, and keeping data in a smaller format will\n  help reduce peak memory usage for initial data loading.\n- Pass the `dtype` when loading data from CSV. 
Otherwise,\n  floating point values will be loaded as `np.float64`\n  by default, increasing peak memory usage by 33%.\n\n## Placement Strategies\n\nXGBoost-Ray leverages Ray's Placement Group API (\u003chttps://docs.ray.io/en/latest/ray-core/scheduling/placement-group.html\u003e)\nto implement placement strategies for better fault tolerance.\n\nBy default, a SPREAD strategy is used for training, which attempts to spread all of the training workers\nacross the nodes in a cluster on a best-effort basis. This improves fault tolerance since it minimizes the\nnumber of worker failures when a node goes down, but comes at a cost of increased inter-node communication.\nTo disable this strategy, set the `RXGB_USE_SPREAD_STRATEGY` environment variable to 0. If disabled, no\nparticular placement strategy will be used.\n\nNote that this strategy is used only when `elastic_training` is not used. If `elastic_training` is set to `True`,\nno placement strategy is used.\n\nWhen XGBoost-Ray is used with Ray Tune for hyperparameter tuning, a PACK strategy is used. This strategy\nattempts to place all workers for each trial on the same node on a best-effort basis. This means that if a node\ngoes down, it will be less likely to impact multiple trials.\n\nWhen placement strategies are used, XGBoost-Ray will wait for 100 seconds for the required resources\nto become available, and will fail if the required resources cannot be reserved and the cluster cannot autoscale\nto increase the number of resources. 
You can change the `RXGB_PLACEMENT_GROUP_TIMEOUT_S` environment variable to modify\nhow long this timeout should be.\n\n## More examples\n\nFor complete end to end examples, please have a look at\nthe [examples folder](https://github.com/ray-project/xgboost_ray/tree/master/xgboost_ray/examples/):\n\n- [Simple sklearn breastcancer dataset example](https://github.com/ray-project/xgboost_ray/blob/master/xgboost_ray/examples/simple.py) (requires `sklearn`)\n- [HIGGS classification example](https://github.com/ray-project/xgboost_ray/blob/master/xgboost_ray/examples/higgs.py)\n  ([download dataset (2.6 GB)](https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz))\n- [HIGGS classification example with Parquet](https://github.com/ray-project/xgboost_ray/blob/master/xgboost_ray/examples/higgs_parquet.py) (uses the same dataset)\n- [Test data classification](https://github.com/ray-project/xgboost_ray/blob/master/xgboost_ray/examples/train_on_test_data.py) (uses a self-generated dataset)\n\u003c!--$REMOVE--\u003e\n## Resources\n\n* [XGBoost-Ray documentation](https://xgboost.readthedocs.io/en/stable/tutorials/ray.html)\n* [Ray community slack](https://forms.gle/9TSdDYUgxYs8SA9e8)\n\u003c!--$END_REMOVE--\u003e\n\u003c!--$UNCOMMENT## API reference\n\n```{eval-rst}\n.. autoclass:: xgboost_ray.RayParams\n    :members:\n```\n\n```{eval-rst}\n.. autoclass:: xgboost_ray.RayDMatrix\n    :members:\n```\n\n```{eval-rst}\n.. autofunction:: xgboost_ray.train\n```\n\n```{eval-rst}\n.. autofunction:: xgboost_ray.predict\n```\n\n### scikit-learn API\n\n```{eval-rst}\n.. autoclass:: xgboost_ray.RayXGBClassifier\n    :members:\n```\n\n```{eval-rst}\n.. autoclass:: xgboost_ray.RayXGBRegressor\n    :members:\n```\n\n```{eval-rst}\n.. autoclass:: xgboost_ray.RayXGBRFClassifier\n    :members:\n```\n\n```{eval-rst}\n.. 
autoclass:: xgboost_ray.RayXGBRFRegressor\n    :members:\n```--\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fray-project%2Fxgboost_ray","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fray-project%2Fxgboost_ray","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fray-project%2Fxgboost_ray/lists"}