{"id":18800551,"url":"https://github.com/xtra-computing/deltaboost","last_synced_at":"2025-04-13T17:31:19.712Z","repository":{"id":175474663,"uuid":"622462579","full_name":"Xtra-Computing/DeltaBoost","owner":"Xtra-Computing","description":"GBDT-based model with efficient unlearning (SIGMOD 2023)","archived":false,"fork":false,"pushed_at":"2024-05-31T05:58:05.000Z","size":3239,"stargazers_count":7,"open_issues_count":0,"forks_count":2,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-03-27T08:22:31.447Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Xtra-Computing.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-02T07:21:33.000Z","updated_at":"2025-01-07T07:22:39.000Z","dependencies_parsed_at":null,"dependency_job_id":"98ab6d2e-1414-46f9-8581-2951ca4667d0","html_url":"https://github.com/Xtra-Computing/DeltaBoost","commit_stats":null,"previous_names":["xtra-computing/deltaboost"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Xtra-Computing%2FDeltaBoost","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Xtra-Computing%2FDeltaBoost/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Xtra-Computing%2FDeltaBoost/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Xtra-Computing%2FDeltaBoost/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/
owners/Xtra-Computing","download_url":"https://codeload.github.com/Xtra-Computing/DeltaBoost/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248752372,"owners_count":21156079,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-07T22:19:01.347Z","updated_at":"2025-04-13T17:31:18.003Z","avatar_url":"https://github.com/Xtra-Computing.png","language":"C++","readme":"# DeltaBoost Documentation\n**News**: DeltaBoost has won the [Honorable Mention for Best Artifact Award](https://sigmod.org/sigmod-awards/sigmod-best-artifact-award/) in SIGMOD23!\n\nDeltaBoost is a machine learning model based on gradient boosting decision trees (GBDT) that supports efficient machine unlearning, published at [SIGMOD 23](https://dl.acm.org/doi/abs/10.1145/3589313). We provide two methods to reproduce the results in the paper: a master script and a step-by-step guide. The master script will automatically download the datasets, build DeltaBoost, run the experiments, and summarize the results. The estimated execution time of the master script is about a week. 
The step-by-step guide will show how to run each experiment in the paper.\n\n**Contents**\n\u003c!-- TOC --\u003e\n* [DeltaBoost Documentation](#deltaboost-documentation)\n* [Getting Started](#getting-started)\n  * [Environment (Docker)](#environment-docker)\n  * [Environment (Step by Step)](#environment-step-by-step)\n    * [Install G++, GCC, OpenSSL, OpenCL, cmake and GMP](#install-g-gcc-openssl-opencl-cmake-and-gmp)\n    * [Install NTL](#install-ntl)\n    * [Install Boost](#install-boost)\n  * [Reproduce Main Results (Master Script)](#reproduce-main-results-master-script)\n  * [Prepare Data](#prepare-data)\n    * [Install Python Environment](#install-python-environment)\n    * [Download and Preprocess Datasets](#download-and-preprocess-datasets)\n  * [Build DeltaBoost](#build-deltaboost)\n* [Usage of DeltaBoost](#usage-of-deltaboost)\n  * [Basic Usage](#basic-usage)\n  * [Parameter Guide](#parameter-guide)\n  * [Reproduce Main Results (Step by Step)](#reproduce-main-results-step-by-step)\n    * [Removing in one tree (Table 4,5)](#removing-in-one-tree-table-45)\n    * [Removing in Multiple trees (Table 7)](#removing-in-multiple-trees-table-7)\n    * [Efficiency (Table 6)](#efficiency-table-6)\n    * [Memory Usage (Table 8)](#memory-usage-table-8)\n    * [Accuracy (Figure 9)](#accuracy-figure-9)\n    * [Ablation Study (Figure 10, 11)](#ablation-study-figure-10-11)\n* [Citation](#citation)\n\u003c!-- TOC --\u003e\n\n[//]: # (Contents)\n\n# Getting Started\n\n## Environment (Docker)\nThe **recommended** approach for environment configuration is through a docker image. 
Download the image by\n```shell\ndocker pull jerrylife/deltaboost\n```\nCreate a container named `deltaboost` based on the image.\n```shell\ndocker run -d -t --name deltaboost jerrylife/deltaboost\n```\nFind the container ID (in the first column) by\n```shell\ndocker ps\n```\nExecute the master script in the container in the background by\n```shell\ndocker exec -t \u003ccontainer-ID\u003e bash run.sh\n```\nYou may also enter the container to observe the results by\n```shell\ndocker exec -it \u003ccontainer-ID\u003e bash\n```\n**Important:** `download_datasets.sh` is only tested for fresh execution. If a download is terminated and needs to be restarted, please remove the data folder by `rm -rf data/` before the next execution.\n\nFor convenience of manual configuration, we also provide the Dockerfile for image building.\n\n## Environment (Step by Step)\n\nThe required packages for DeltaBoost include:\n* g++-10 or above\n* OpenSSL\n* OpenCL\n* CMake 3.15 or above\n* GMP\n* NTL\n* Boost\n* Python 3.9+\n\n### Install G++, GCC, OpenSSL, OpenCL, cmake and GMP\n\n```shell\nsudo apt install gcc-10 g++-10 libssl-dev opencl-headers cmake libgmp3-dev\n```\n\n### Install NTL\nNTL can be installed from source by\n```shell\nwget https://libntl.org/ntl-11.5.1.tar.gz\ntar -xvf ntl-11.5.1.tar.gz\ncd ntl-11.5.1/src\n./configure SHARED=on\nmake -j\nsudo make install\n```\nIf `NTL` is not installed under the default path, you need to specify the path to NTL during compilation by\n```shell\ncmake .. -DNTL_PATH=\"PATH_TO_NTL\"\n```\n\n### Install Boost\nDeltaBoost requires `boost \u003e= 1.75.0`. 
Since it may not be available in official `apt` repositories, you may need to install it manually.\n\nDownload and unzip `boost 1.75.0`.\n```shell\nwget https://boostorg.jfrog.io/artifactory/main/release/1.75.0/source/boost_1_75_0.tar.bz2\ntar -xvf boost_1_75_0.tar.bz2\n```\nInstall dependencies for building boost.\n```shell\nsudo apt-get install build-essential autotools-dev libicu-dev libbz2-dev libboost-all-dev\n```\nStart building.\n```shell\n./bootstrap.sh --prefix=/usr/\n./b2\nsudo ./b2 install\n```\n\n## Reproduce Main Results (Master Script)\nWe provide a master script to reproduce the main results in the paper. The script will automatically download the datasets, build DeltaBoost, run the experiments, and summarize the results. The results will be saved in the `fig/` and `out/` directories. Simply run\n```shell\nbash run.sh\n```\n\n## Prepare Data\n\n### Install Python Environment\nDeltaBoost requires `Python \u003e= 3.9`. The required packages are listed in `python-utils/requirements.txt`. Install the necessary modules by\n```shell\npip install -r requirements.txt\n```\n### Download and Preprocess Datasets\n\nDownload the datasets and remove instances from them.\n```shell\nbash download_datasets.sh\n```\nThis script will download 5 datasets from the LIBSVM website. After downloading and unzipping, some instances will be removed from these datasets. The removal ratios are `0.1%` and `1%` by default; the removal may take several minutes. If more ratios are needed, you can change the `-r` option of `remove_sample.py`. After the preparation, a `data/` directory with the following structure should exist.\n\n**Important:** `download_datasets.sh` is only tested for fresh execution. 
If a download is terminated and needs to be restarted, please remove the data folder by `rm -rf data/` before the next execution.\n\n\n```text\ndata\n├── cadata\n├── cadata.test\n├── cadata.train\n├── cadata.train.delete_1e-02\n├── cadata.train.delete_1e-03\n├── cadata.train.remain_1e-02\n├── cadata.train.remain_1e-03\n├── codrna.test\n├── codrna.train\n├── codrna.train.delete_1e-02\n├── codrna.train.delete_1e-03\n├── codrna.train.remain_1e-02\n├── codrna.train.remain_1e-03\n├── covtype\n├── covtype.test\n├── covtype.train\n├── covtype.train.delete_1e-02\n├── covtype.train.delete_1e-03\n├── covtype.train.remain_1e-02\n├── covtype.train.remain_1e-03\n├── gisette.test\n├── gisette.train\n├── gisette.train.delete_1e-02\n├── gisette.train.delete_1e-03\n├── gisette.train.remain_1e-02\n├── gisette.train.remain_1e-03\n├── msd.test\n├── msd.train\n├── msd.train.delete_1e-02\n├── msd.train.delete_1e-03\n├── msd.train.remain_1e-02\n└── msd.train.remain_1e-03\n```\n\n\n## Build DeltaBoost\nBuild DeltaBoost by\n```shell\nmkdir build \u0026\u0026 cd build\ncmake ..\nmake -j\n```\nAn executable named `build/bin/FedTree-train` should be created. For convenience, you may create a symlink for this binary.\n```shell\ncd ..   # under root dir of DeltaBoost\nln -s build/bin/FedTree-train main\n```\n# Usage of DeltaBoost\nFor simplicity, the usage guide assumes that the binary `main` has been created.\n\n## Basic Usage\nDeltaBoost can be configured by a `.conf` file and/or command-line parameters. 
For example,\n```shell\n./main conf=conf/cadata.conf    # By .conf file\n./main enable_delta=true nbr_size=10       # By parameters\n./main conf=conf/cadata.conf enable_delta=true nbr_size=10  # By both methods\n```\nWhen both methods are applied, the parameters in the command line will overwrite the values in the `.conf` file.\n\n## Parameter Guide\n\n- **dataset_name** (std::string)\n    - Usage: The name of the dataset.\n    - Default value: \"\"\n\n- **save_model_name** (std::string)\n    - Usage: The name to save the model as.\n    - Default value: \"\"\n\n- **data** (std::string)\n    - Usage: Path to the training data.\n    - Default value: \"../dataset/test_dataset.txt\"\n\n- **test_data** (std::string)\n    - Usage: Path to the test data.\n    - Default value: \"\"\n\n- **remain_data** (std::string)\n    - Usage: Path to the remaining training data after deletion.\n    - Default value: \"\"\n\n- **delete_data** (std::string)\n    - Usage: Path to the deleted training data.\n    - Default value: \"\"\n\n- **n_parties** (int)\n    - Usage: The number of parties in the federated learning setting.\n    - Default value: 2\n\n- **mode** (std::string)\n    - Usage: The mode of federated learning (e.g., \"horizontal\" or \"centralized\").\n    - Default value: \"horizontal\"\n\n- **privacy_tech** (std::string)\n    - Usage: The privacy technique to use (e.g., \"he\" or \"none\").\n    - Default value: \"he\"\n\n- **learning_rate** (float)\n    - Usage: The learning rate for the gradient boosting decision tree.\n    - Default value: 1\n\n- **max_depth** (int)\n    - Usage: The maximum depth of the trees in the gradient boosting decision tree.\n    - Default value: 6\n\n- **n_trees** (int)\n    - Usage: The number of trees in the gradient boosting decision tree.\n    - Default value: 40\n\n- **objective** (std::string)\n    - Usage: The objective function for the gradient boosting decision tree (e.g., 
\"reg:linear\").\n    - Default value: \"reg:linear\"\n\n- **num_class** (int)\n    - Usage: The number of classes in the data.\n    - Default value: 1\n\n- **tree_method** (std::string)\n    - Usage: The method to use for tree construction (e.g., \"hist\").\n    - Default value: \"hist\"\n\n- **lambda** (float)\n    - Usage: The lambda parameter for the gradient boosting decision tree.\n    - Default value: 1\n\n- **verbose** (int)\n    - Usage: Controls the verbosity of the output.\n    - Default value: 1\n\n- **enable_delta** (std::string)\n    - Usage: Enable or disable the delta boosting parameter (\"true\" or \"false\").\n    - Default value: \"false\"\n\n- **remove_ratio** (float)\n    - Usage: The ratio of data to be removed in delta boosting.\n    - Default value: 0.0\n\n- **min_diff_gain** (int)\n    - Usage: (Undocumented.)\n    - Default value: \"\"\n\n- **max_range_gain** (int)\n    - Usage: (Undocumented.)\n    - Default value: \"\"\n\n- **n_used_trees** (int)\n    - Usage: The number of trees to be used in delta boosting.\n    - Default value: 0\n\n- **max_bin_size** (int)\n    - Usage: The maximum bin size in delta boosting.\n    - Default value: 100\n\n- **nbr_size** (int)\n    - Usage: The neighbor size in delta boosting.\n    - Default value: 1\n\n- **gain_alpha** (float)\n    - Usage: The alpha parameter for the gain calculation in delta boosting.\n    - Default value: 0.0\n\n- **delta_gain_eps_feature** (float)\n    - Usage: The epsilon parameter for the gain calculation with respect to features in delta boosting.\n    - Default value: 0.0\n\n- **delta_gain_eps_sn** (float)\n    - Usage: The epsilon parameter for the gain calculation with respect to sample numbers in delta boosting.\n    - Default value: 0.0\n\n- **hash_sampling_round** (int)\n    - Usage: The number of rounds for hash sampling in delta boosting.\n    - Default value: 1\n\n- **n_quantized_bins** (int)\n    - Usage: The number of quantized bins in delta 
boosting.\n    - Default value: \"\"\n\n- **seed** (int)\n    - Usage: The seed for random number generation.\n    - Default value: \"\"\n\n## Reproduce Main Results (Step by Step)\nBefore reproducing the main results, please make sure that the binary `main` has been created. All reported times were measured on two AMD EPYC 7543 32-core processors using 96 threads. If your machine does not have that many threads, you may\n- reduce the number of seeds, for example, to `5`. However, this increases the variance of the calculated Hellinger distance.\n- reduce the number of threads used, for example, with `taskset -c 0-11`. However, this increases the running time. If you want to use all the threads, simply remove `taskset -c 0-x` before the command.\n\nFirst, create the necessary folders to store results.\n```shell\nmkdir -p cache out fig\n```\n\n### Removing in one tree (Table 4,5)\nTo test removing in a single tree with DeltaBoost, simply run\n\n```shell\nbash test_remove_deltaboost_tree_1.sh 100  # try 100 seeds\n```\nThis script finishes in **6 hours**. After the execution, two folders will appear under the project root:\n\n- `out/remove_test/tree1` contains the accuracy of each model on five datasets.\n- `cache/` contains two kinds of information:\n  - the original model, deleted model, and retrained model in `json` format.\n  - detailed per-instance predictions in `csv` format. This information is used to calculate the Hellinger distance.\n\nTo extract the information into a LaTeX table, run\n\n```shell\n# in project root\ncd python-utils\npython plot_results.py -t 1\n```\nThe script extracts the **accuracy** and **Hellinger distance** of DeltaBoost into a LaTeX table. The cells of baselines to be manually filled in are left empty in this table.\n\nTwo files of summarized outputs are generated in `out/`:\n- `out/accuracy_table_tree1.csv`: Results of accuracy in Table 4. 
An example is shown below.\n\n```csv\n,,0.0874\\textpm 0.0002,,,0.0873\\textpm 0.0005\n,,0.0874\\textpm 0.0002,,,0.0873\\textpm 0.0005\n,,0.0873\\textpm 0.0002,,,0.0872\\textpm 0.0007\n,,0.2611\\textpm 0.0001,,,0.2610\\textpm 0.0001\n,,0.2611\\textpm 0.0001,,,0.2611\\textpm 0.0001\n,,0.2611\\textpm 0.0001,,,0.2610\\textpm 0.0000\n,,0.0731\\textpm 0.0020,,,0.0787\\textpm 0.0042\n,,0.0731\\textpm 0.0020,,,0.0786\\textpm 0.0043\n,,0.0731\\textpm 0.0020,,,0.0790\\textpm 0.0043\n-,-,0.1557\\textpm 0.0034,-,-,0.1643\\textpm 0.0066\n-,-,0.1557\\textpm 0.0034,-,-,0.1643\\textpm 0.0065\n-,-,0.1558\\textpm 0.0034,-,-,0.1644\\textpm 0.0066\n-,-,0.1009\\textpm 0.0003,-,-,0.1009\\textpm 0.0003\n-,-,0.1009\\textpm 0.0003,-,-,0.1009\\textpm 0.0003\n-,-,0.1009\\textpm 0.0003,-,-,0.1009\\textpm 0.0003\n```\n\n- `out/forget_table_tree1.csv`: Results of Hellinger distance in Table 5. An example is shown below.\n\n```csv\n,,0.0002\\textpm 0.0051,,,0.1046\\textpm 0.2984\n,,0.0000\\textpm 0.0014,,,0.0070\\textpm 0.0515\n,,0.0162\\textpm 0.1260,,,0.0300\\textpm 0.1521\n,,0.0000\\textpm 0.0005,,,0.0069\\textpm 0.0467\n,,0.0007\\textpm 0.0022,,,0.0070\\textpm 0.0081\n,,0.0000\\textpm 0.0004,,,0.0051\\textpm 0.0065\n-,-,0.0058\\textpm 0.0157,-,-,0.0087\\textpm 0.0113\n-,-,0.0034\\textpm 0.0121,-,-,0.0033\\textpm 0.0048\n-,-,0.0041\\textpm 0.0044,-,-,0.0126\\textpm 0.0101\n-,-,0.0028\\textpm 0.0036,-,-,0.0093\\textpm 0.0079\n```\n\nThese two results might differ slightly from the results in the paper due to the randomness of the training process. However, the distance between $M_d$ and $M_r$ is very small, which is consistent with the results in the paper.\n\n### Removing in Multiple trees (Table 7)\nTo test removing in 10 trees with DeltaBoost, simply run\n\n```shell\nbash test_remove_deltaboost_tree_10.sh 100  # try 100 seeds\n```\nThe script finishes in **2-3 days**. 
After the execution, two folders will appear under the project root:\n- `out/remove_test/tree10` contains the accuracy of each model on five datasets.\n- `cache/` contains two kinds of information:\n  - the original model, deleted model, and retrained model in `json` format.\n  - detailed per-instance predictions in `csv` format. This information is used to calculate the Hellinger distance.\n  \nTo extract the information into a LaTeX table, run\n```shell\n# in project root\ncd python-utils\npython plot_results.py -t 10\n```\nThe script extracts the **accuracy** and **Hellinger distance** of DeltaBoost into a LaTeX table. The cells of baselines to be manually filled in are left empty in this table.\n\nTwo files of summarized outputs are generated in `out/`:\n- `out/accuracy_table_tree10.csv`: Results of accuracy in Table 7(a). An example is shown below.\n\n```csv\n,,0.0616\\textpm 0.0011,,,0.0617\\textpm 0.0010\n,,0.0617\\textpm 0.0011,,,0.0618\\textpm 0.0010\n,,0.0617\\textpm 0.0011,,,0.0617\\textpm 0.0010\n,,0.2265\\textpm 0.0069,,,0.2265\\textpm 0.0069\n,,0.2264\\textpm 0.0069,,,0.2265\\textpm 0.0068\n,,0.2264\\textpm 0.0067,,,0.2255\\textpm 0.0066\n,,0.0509\\textpm 0.0043,,,0.0490\\textpm 0.0038\n,,0.0509\\textpm 0.0043,,,0.0490\\textpm 0.0038\n,,0.0508\\textpm 0.0041,,,0.0497\\textpm 0.0046\n-,-,0.1272\\textpm 0.0055,-,-,0.1396\\textpm 0.0068\n-,-,0.1274\\textpm 0.0055,-,-,0.1400\\textpm 0.0068\n-,-,0.1273\\textpm 0.0055,-,-,0.1399\\textpm 0.0072\n-,-,0.1040\\textpm 0.0006,-,-,0.1040\\textpm 0.0006\n-,-,0.1040\\textpm 0.0006,-,-,0.1040\\textpm 0.0006\n-,-,0.1041\\textpm 0.0006,-,-,0.1040\\textpm 0.0005\n```\n\n- `out/forget_table_tree10.csv`: Results of Hellinger distance in Table 7(b). 
An example is shown below.\n\n```csv\n,,0.0130\\textpm 0.0100,,,0.0088\\textpm 0.0079\n,,0.0129\\textpm 0.0100,,,0.0089\\textpm 0.0078\n,,0.0112\\textpm 0.0089,,,0.0118\\textpm 0.0096\n,,0.0112\\textpm 0.0090,,,0.0118\\textpm 0.0096\n,,0.0106\\textpm 0.0073,,,0.0312\\textpm 0.0169\n,,0.0106\\textpm 0.0073,,,0.0312\\textpm 0.0167\n-,-,0.0240\\textpm 0.0169,-,-,0.0247\\textpm 0.0159\n-,-,0.0239\\textpm 0.0160,-,-,0.0249\\textpm 0.0149\n-,-,0.0194\\textpm 0.0106,-,-,0.0249\\textpm 0.0127\n-,-,0.0194\\textpm 0.0106,-,-,0.0248\\textpm 0.0126\n```\n\nThese two results might differ slightly from the results in the paper due to the randomness of the training process. However, the distance between $M_d$ and $M_r$ is very small, which is consistent with the results in the paper.\n\n### Efficiency (Table 6)\n\nTo test the efficiency, we need to perform a clean retrain of GBDT. To train a 10-tree GBDT, run\n\n```shell\nbash test_remove_gbdt_efficiency.sh 10\n```\n\nThe script retrains GBDT on five datasets with two removal ratios, once each, since GBDT training is deterministic. The script finishes in **10 minutes**. 
After the execution, the efficiency and speedup can be summarized by\n```shell\npython plot_time.py -t 10\n```\nThe expected output should look like\n```text\nThunder\t\u0026 DB-Train\t\u0026 DB-Remove\t\u0026 Speedup (Thunder) \\\\\n 12.410\t\u0026  8.053 \\textpm 3.976\t \u0026  0.156 \\textpm 0.047\t \u0026 79.34x \\\\\n 12.143\t\u0026  7.717 \\textpm 4.134\t \u0026  0.160 \\textpm 0.035\t \u0026 75.82x \\\\\n 15.668\t\u0026  52.253 \\textpm 4.796\t \u0026  1.482 \\textpm 2.260\t \u0026 10.57x \\\\\n 16.015\t\u0026  52.333 \\textpm 4.107\t \u0026  1.874 \\textpm 3.364\t \u0026 8.55x \\\\\n 50.213\t\u0026  66.658 \\textpm 7.747\t \u0026  0.956 \\textpm 0.265\t \u0026 52.51x \\\\\n 47.089\t\u0026  65.322 \\textpm 7.235\t \u0026  1.123 \\textpm 0.259\t \u0026 41.95x \\\\\n 12.434\t\u0026  6.038 \\textpm 5.198\t \u0026  0.068 \\textpm 0.042\t \u0026 183.03x \\\\\n 12.524\t\u0026  4.704 \\textpm 3.282\t \u0026  0.053 \\textpm 0.037\t \u0026 237.99x \\\\\n 22.209\t\u0026  53.451 \\textpm 3.659\t \u0026  3.523 \\textpm 0.812\t \u0026 6.30x \\\\\n 24.067\t\u0026  54.221 \\textpm 2.952\t \u0026  3.422 \\textpm 0.700\t \u0026 7.03x \\\\\n```\nThe time may vary with the environment and hardware, but the speedup remains as significant as that in Table 6 of the paper.\n\nWe also provide a script to run the baselines `sklearn` and `xgboost` for efficiency comparison. Note that the performance of `xgboost` varies significantly by version. For example, some versions favor high-dimensional datasets but perform more slowly on large low-dimensional datasets. We adopt conda's default version, `xgboost==1.5.0`, in our experiments. To run the baselines, run\n```shell\ntaskset -c 0-95 python baseline.py  # Also limit the number of threads to 96\n```\nThis script is expected to finish in **10 minutes**. The output contains the accuracy and training time (excluding data loading) of the baselines. 
The expected output should be like\n```text\nGot X with shape (58940, 8), y with shape (58940,)\nScaling y to [0,1]\nGot X with shape (271617, 8), y with shape (271617,)\nScaling y to [0,1]\nsklearn GBDT training time: 1.209s\nsklearn GBDT error: 0.0577\n=====================================\nGot X with shape (460161, 54), y with shape (460161,)\nScaling y to [0,1]\nGot X with shape (116203, 54), y with shape (116203,)\nScaling y to [0,1]\nsklearn GBDT training time: 21.309s\nsklearn GBDT error: 0.1974\n=====================================\nGot X with shape (5940, 5000), y with shape (5940,)\nScaling y to [0,1]\nGot X with shape (1000, 5000), y with shape (1000,)\nScaling y to [0,1]\nsklearn GBDT training time: 21.941s\nsklearn GBDT error: 0.0600\n=====================================\nGot X with shape (16347, 8), y with shape (16347,)\nScaling y to [0,1]\nGot X with shape (4128, 8), y with shape (4128,)\nScaling y to [0,1]\nsklearn GBDT training time: 0.601s\nsklearn GBDT error: 0.8558\n=====================================\nGot X with shape (459078, 90), y with shape (459078,)\nScaling y to [0,1]\nGot X with shape (51630, 90), y with shape (51630,)\nScaling y to [0,1]\nsklearn GBDT training time: 372.924s\nsklearn GBDT error: 0.8819\n=====================================\nGot X with shape (59476, 8), y with shape (59476,)\nScaling y to [0,1]\nGot X with shape (271617, 8), y with shape (271617,)\nScaling y to [0,1]\n[10:06:19] WARNING: ../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. 
Explicitly set eval_metric if you'd like to restore the old behavior.\nXGBoost training time: 9.131s\nXGBoost error: 0.0405\n=====================================\nGot X with shape (464345, 54), y with shape (464345,)\nScaling y to [0,1]\nGot X with shape (116203, 54), y with shape (116203,)\nScaling y to [0,1]\n[10:06:29] WARNING: ../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\nXGBoost training time: 13.075s\nXGBoost error: 0.1558\n=====================================\nGot X with shape (5994, 5000), y with shape (5994,)\nScaling y to [0,1]\nGot X with shape (1000, 5000), y with shape (1000,)\nScaling y to [0,1]\n[10:06:47] WARNING: ../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\nXGBoost training time: 13.260s\nXGBoost error: 0.0320\n=====================================\nGot X with shape (16496, 8), y with shape (16496,)\nScaling y to [0,1]\nGot X with shape (4128, 8), y with shape (4128,)\nScaling y to [0,1]\nXGBoost training time: 8.966s\nXGBoost RMSE: 0.1182\n=====================================\nGot X with shape (463252, 90), y with shape (463252,)\nScaling y to [0,1]\nGot X with shape (51630, 90), y with shape (51630,)\nScaling y to [0,1]\nXGBoost training time: 20.309s\nXGBoost RMSE: 0.1145\n=====================================\nGot X with shape (59476, 8), y with shape (59476,)\nScaling y to [0,1]\nGot X with shape (271617, 8), y with shape (271617,)\nScaling y to [0,1]\nRandom Forest training time: 0.278s\nRandom Forest error: 0.1073\n=====================================\nGot X with shape (464345, 54), y with shape (464345,)\nScaling y to [0,1]\nGot X with shape (116203, 54), y with shape 
(116203,)\nScaling y to [0,1]\nRandom Forest training time: 2.656s\nRandom Forest error: 0.2360\n=====================================\nGot X with shape (5994, 5000), y with shape (5994,)\nScaling y to [0,1]\nGot X with shape (1000, 5000), y with shape (1000,)\nScaling y to [0,1]\nRandom Forest training time: 0.280s\nRandom Forest error: 0.0650\n=====================================\nGot X with shape (16496, 8), y with shape (16496,)\nScaling y to [0,1]\nGot X with shape (4128, 8), y with shape (4128,)\nScaling y to [0,1]\nRandom Forest training time: 0.387s\nRandom Forest accuracy: 0.1312\n=====================================\nGot X with shape (463252, 90), y with shape (463252,)\nScaling y to [0,1]\nGot X with shape (51630, 90), y with shape (51630,)\nScaling y to [0,1]\nRandom Forest training time: 229.927s\nRandom Forest accuracy: 0.1170\nGot X with shape (59476, 8), y with shape (59476,)\nScaling y to [0,1]\nGot X with shape (271617, 8), y with shape (271617,)\nScaling y to [0,1]\nDecision Tree training time: 0.122s\nDecision Tree error: 0.0669\n=====================================\nGot X with shape (464345, 54), y with shape (464345,)\nScaling y to [0,1]\nGot X with shape (116203, 54), y with shape (116203,)\nScaling y to [0,1]\nDecision Tree training time: 2.289s\nDecision Tree error: 0.2225\n=====================================\nGot X with shape (5994, 5000), y with shape (5994,)\nScaling y to [0,1]\nGot X with shape (1000, 5000), y with shape (1000,)\nScaling y to [0,1]\nDecision Tree training time: 2.464s\nDecision Tree error: 0.0680\n=====================================\nGot X with shape (16496, 8), y with shape (16496,)\nScaling y to [0,1]\nGot X with shape (4128, 8), y with shape (4128,)\nScaling y to [0,1]\nDecision Tree training time: 0.058s\nDecision Tree accuracy: 0.1382\n=====================================\nGot X with shape (463252, 90), y with shape (463252,)\nScaling y to [0,1]\nGot X with shape (51630, 90), y with shape (51630,)\nScaling 
y to [0,1]\nDecision Tree training time: 35.572s\nDecision Tree accuracy: 0.1185\n```\n\nNote that the training time of the baselines in this example is longer than that in Table 6 due to a different CPU. Nonetheless, the speedup of DeltaBoost is still similarly significant, so the conclusion is not affected.\n\n### Memory Usage (Table 8)\nThe peak memory usage can be easily observed during training but is hard to record with a script. Since the memory consumption is almost constant during training, the recommended approach is to manually monitor the peak memory usage of the process in a system monitor, e.g., `htop`.\n\n\n### Accuracy (Figure 9)\nThe accuracy of the baselines is output by the same command as the efficiency test.\n```shell\npython baseline.py\n```\nThe accuracy of DeltaBoost has also been recorded in the previous logs.\n\nThe default maximum number of trees is `10`, which is sufficient to obtain a promising accuracy. To test the accuracy of the baselines with 100 trees, run\n```shell\npython baseline.py -t 100\n```\nSince each baseline algorithm is run only once, this script is expected to finish in **10 minutes**.\n\nNext, we also need to obtain the results of DeltaBoost with 100 trees. To do so, run\n```shell\nbash test_accuracy.sh 10  # run 10 times\n```\nThis procedure takes around **1-2 days**. For faster testing, you can reduce the number of repeats by changing the parameter from `10` to a smaller number; this will result in larger variance in the results.\n\nAfter obtaining all the results, run\n```shell\npython plot_results.py -acc -t 10   # (10 trees)\npython plot_results.py -acc -t 100  # (100 trees)\n```\nTwo images will be generated in `fig/`, named\n```text\nacc-tree10.png\nacc-tree100.png\n```\nBoth images are similar to Fig. 
9 in the paper.\n\n\n### Ablation Study (Figure 10, 11)\nThe ablation study includes six bash scripts.\n```text\nablation_bagging.sh\nablation_iteration.sh\nablation_nbins.sh\nablation_quantization.sh\nablation_ratio.sh\nablation_regularization.sh\n```\nThese scripts can all be run by a single script, `test_all_ablation.sh`:\n```shell\nbash test_all_ablation.sh 50  # run 50 times\n```\nThis combined script takes around **1-2 days**. If you want to run the ablation study in a shorter time, you can reduce the number of repeats by changing the parameter from `50` to a smaller number; this will result in larger variance in the results.\n\nTo plot all the figures of the ablation study into `fig/ablation`, run\n```shell\npython plot_ablation.py\n```\nThis plotting process takes around **10 minutes**. The major time cost is calculating the Hellinger distance.\n\n# Citation\nIf you find this repository useful in your research, please cite our paper:\n\n```text\n@article{wu2023deltaboost,\n  author = {Wu, Zhaomin and Zhu, Junhui and Li, Qinbin and He, Bingsheng},\n  title = {DeltaBoost: Gradient Boosting Decision Trees with Efficient Machine Unlearning},\n  year = {2023},\n  issue_date = {June 2023},\n  publisher = {Association for Computing Machinery},\n  address = {New York, NY, USA},\n  volume = {1},\n  number = {2},\n  url = {https://doi.org/10.1145/3589313},\n  doi = {10.1145/3589313},\n  journal = {Proc. ACM Manag. Data},\n  month = {jun},\n  articleno = {168},\n  numpages = {26},\n  keywords = {data deletion, gradient boosting decision trees, machine unlearning}\n}\n```\n\n\n\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxtra-computing%2Fdeltaboost","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxtra-computing%2Fdeltaboost","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxtra-computing%2Fdeltaboost/lists"}