{"id":13482486,"url":"https://github.com/xiyanghu/OSDT","last_synced_at":"2025-03-27T13:32:01.570Z","repository":{"id":58614326,"uuid":"190356243","full_name":"xiyanghu/OSDT","owner":"xiyanghu","description":"Optimal Sparse Decision Trees","archived":false,"fork":false,"pushed_at":"2023-04-27T15:02:09.000Z","size":6740,"stargazers_count":99,"open_issues_count":6,"forks_count":11,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-10-30T16:40:52.392Z","etag":null,"topics":["accelerate","acceleration-model","algorithm","algorithm-optimization","data-mining","data-science","interpretable-ml","machine-learning","ml-system","mlsys","neurips","python","python3"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/xiyanghu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-06-05T08:27:57.000Z","updated_at":"2024-09-30T02:20:13.000Z","dependencies_parsed_at":"2024-01-15T20:47:05.622Z","dependency_job_id":"52b5a164-2226-46c9-894d-bdc9d0666890","html_url":"https://github.com/xiyanghu/OSDT","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xiyanghu%2FOSDT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xiyanghu%2FOSDT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xiyanghu%2FOSDT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xiyanghu%2FOSDT/manifests","owner_url":"https://repos.ecosyste.ms/api
/v1/hosts/GitHub/owners/xiyanghu","download_url":"https://codeload.github.com/xiyanghu/OSDT/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245854473,"owners_count":20683359,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["accelerate","acceleration-model","algorithm","algorithm-optimization","data-mining","data-science","interpretable-ml","machine-learning","ml-system","mlsys","neurips","python","python3"],"created_at":"2024-07-31T17:01:02.474Z","updated_at":"2025-03-27T13:32:00.082Z","avatar_url":"https://github.com/xiyanghu.png","language":"Python","funding_links":[],"categories":["2019","Technical Resources"],"sub_categories":["Open Source/Access Responsible AI Software Packages"],"readme":"# Optimal Sparse Decision Trees (OSDT)\r\n\r\n![GitHub Repo stars](https://img.shields.io/github/stars/xiyanghu/OSDT?style=social)\r\n![Twitter Follow](https://img.shields.io/twitter/follow/hu_xiyang?style=social)\r\n![License](https://img.shields.io/github/license/xiyanghu/OSDT?color=critical)\r\n[![arXiv](https://img.shields.io/badge/arXiv-1904.12847-b31b1b.svg?style=flat)](https://arxiv.org/abs/1904.12847)\r\n\r\nThis accompanies the paper, [\"Optimal Sparse Decision Trees\"](https://arxiv.org/abs/1904.12847) by Xiyang Hu,\r\nCynthia Rudin, and Margo Seltzer.\r\n\r\nIt appeared in the [2019 NeurIPS conference](https://nips.cc/Conferences/2019)\r\n\r\n* [:movie_camera: Overview video](https://youtu.be/UMjMQaH508M)\r\n* [:newspaper: NeurIPS poster](doc/OSDT_NIPS_Poster.pdf)\r\n* [:notebook_with_decorative_cover: 
NeurIPS slides](doc/NeurIPSSlides.pdf)\r\n\r\n### Use OSDT\r\n\r\n```python\r\nfrom osdt import OSDT\r\n\r\n# initialize an OSDT object\r\nmodel = OSDT()\r\n# fit the model\r\nmodel.fit(x_train, y_train)\r\n# make predictions and get the prediction accuracy\r\nprediction, accuracy = model.predict(x_test, y_test)\r\n# make predictions only\r\nprediction = model.predict(x_test)\r\n```\r\n\r\n### Documentation\r\n\r\nAll code is in the `./src` folder. The `OSDT` class is in the `osdt.py` file.\r\n\r\n---\r\n\r\n```python\r\nCLASS osdt.OSDT(lamb=0.1, prior_metric=\"curiosity\", MAXDEPTH=float('Inf'), MAX_NLEAVES=float('Inf'), niter=float('Inf'),\r\n                logon=False, support=True, incre_support=True, accu_support=True, equiv_points=True,\r\n                lookahead=True, lenbound=True, R_c0=1, timelimit=float('Inf'), init_cart=True,\r\n                saveTree=False, readTree=False)\r\n```\r\n\r\n\u003cdetails\u003e\u003csummary\u003e \u003cb\u003ePARAMETERS\u003c/b\u003e: \u003c/summary\u003e\r\n\u003cp\u003e\r\n \r\n * **lamb** : float, optional (default=0.1)\\\r\n     The regularization parameter lambda of the objective function.\r\n * **prior_metric** : {'objective', 'bound', 'curiosity', 'entropy', 'gini', 'FIFO'}, optional (default='curiosity')\\\r\n     The scheduling policy used to determine the priority of leaves:\r\n     - 'objective' will use the objective function\r\n     - 'bound' will use the lower bound\r\n     - 'curiosity' will use the curiosity\r\n     - 'entropy' will use the entropy\r\n     - 'gini' will use the Gini value\r\n     - 'FIFO' will use first in, first out\r\n * **MAXDEPTH** : int, optional (default=float('Inf'))\\\r\n     Maximum depth of the tree.\r\n * **MAX_NLEAVES** : int, optional (default=float('Inf'))\\\r\n     Maximum number of leaves of the tree.\r\n * **niter** : int, optional (default=float('Inf'))\\\r\n     Maximum number of tree evaluations.\r\n * **logon** : bool, optional (default=False)\\\r\n     Record relevant 
trees and values during the execution.\r\n * **support** : bool, optional (default=True)\\\r\n     Turn on the lower bound on leaf support.\r\n * **incre_support** : bool, optional (default=True)\\\r\n     Turn on the lower bound on incremental classification accuracy.\r\n * **accu_support** : bool, optional (default=True)\\\r\n     Turn on the lower bound on classification accuracy.\r\n * **equiv_points** : bool, optional (default=True)\\\r\n     Turn on the equivalent points bound.\r\n * **lookahead** : bool, optional (default=True)\\\r\n     Turn on the lookahead bound.\r\n * **lenbound** : bool, optional (default=True)\\\r\n     Turn on the prefix-specific upper bound on the number of leaves.\r\n * **R_c0** : float, optional (default=1)\\\r\n     The initial risk.\r\n * **timelimit** : int, optional (default=float('Inf'))\\\r\n     Time limit on the running time. The default is no limit.\r\n * **init_cart** : bool, optional (default=True)\\\r\n     Initialize with CART.\r\n * **saveTree** : bool, optional (default=False)\\\r\n     Save the tree.\r\n * **readTree** : bool, optional (default=False)\\\r\n     Read the tree from the saved one and explore only the children of the saved tree.\r\n\r\n\u003c/p\u003e\r\n\u003c/details\u003e\r\n\r\n---\r\n\r\n\u003e **fit**(x, y)\r\n\r\n\u0026nbsp;\u0026nbsp;\u0026nbsp; Fit the model with input data.\r\n\r\n\u0026nbsp;\u0026nbsp;\u0026nbsp; **PARAMETERS**:\r\n* **x** : ndarray of shape (ndata, nfeature)\\\r\n    The features of the data.\r\n* **y** : ndarray of shape (ndata,)\\\r\n    The true labels of the data.\r\n\r\n\u0026nbsp;\u0026nbsp;\u0026nbsp; **RETURNS**:\r\n* **self** : fitted OSDT model.\r\n\r\n\u003e **predict**(x, y=None)\r\n\r\n\u0026nbsp;\u0026nbsp;\u0026nbsp; Predict the labels of the input data.\r\n\r\n\u0026nbsp;\u0026nbsp;\u0026nbsp; **PARAMETERS**:\r\n* **x** : ndarray of shape (ndata, nfeature)\\\r\n    The features of the data.\r\n* **y** : ndarray of shape (ndata,), optional (default=None)\\\r\n    The 
true labels of the data.\r\n\r\n\u0026nbsp;\u0026nbsp;\u0026nbsp; **RETURNS**:\r\n* **prediction** : ndarray of shape (ndata,)\\\r\n    The predicted labels of the data.\r\n* **accuracy** : float, optional\\\r\n    If the true labels y are provided, the accuracy of the prediction is also returned.\r\n\r\n---\r\n\r\n### Installation\r\n\r\n```shell\r\ngit clone https://github.com/xiyanghu/OSDT.git\r\ncd OSDT\r\nconda env create -f environment.yml\r\nconda activate osdt\r\n```\r\n\r\n#### Dependencies\r\n\r\n* [gmp](https://gmplib.org/) (GNU Multiple Precision Arithmetic Library)\r\n* [mpfr](http://www.mpfr.org/) (GNU MPFR Library for multiple-precision floating-point computations; depends on gmp)\r\n* [libmpc](http://www.multiprecision.org/) (GNU MPC for arbitrarily high precision and correct rounding; depends on gmp and mpfr)\r\n* [gmpy2](https://pypi.org/project/gmpy2/#files) (GMP/MPIR, MPFR, and MPC interface to Python 2.6+ and 3.x)\r\n* See [environment.yml](environment.yml)\r\n\r\n\u003c!---\r\n1. Install GMP\r\n   * Run command `sudo apt install libgmp3-dev` (Ubuntu) OR `brew install gmp` (macOS) \r\n   * If the command above does not work, try manual installation:\r\n      * Download `gmp-6.2.1.tar.xz` from [gmplib.org](https://gmplib.org/)\r\n      * Run command `tar -Jxvf gmp-6.2.1.tar.xz`\r\n      * Run command `cd gmp-6.2.1`\r\n      * Run command `./configure`\r\n      * Run command `make`\r\n      * Run command `make check`\r\n      * Run command `sudo make install`\r\n2. Install MPFR\r\n   * Run command `sudo apt install libmpfr-dev` (Ubuntu) OR `brew install mpfr` (macOS)  \r\n3. Install libmpc\r\n   * Run command `sudo apt install libmpc-dev` (Ubuntu) OR `brew install libmpc` (macOS)  \r\n4. Install gmpy2\r\n   * Run command `pip install gmpy2`\r\n--\u003e\r\n\r\n### Datasets\r\n\r\nSee `data/preprocessed/`.\r\n\r\nWe used seven datasets. Five of them are from the UCI Machine Learning Repository (tic-tac-toe, car evaluation, monk1, monk2, monk3). 
\r\nThe other two are the ProPublica recidivism dataset and the Fair Isaac (FICO) credit risk dataset. \r\nOn the recidivism dataset, we predict which individuals are arrested within two years of release (`{N = 7,215}`); on the FICO dataset, we predict whether an individual will default on a loan. \r\n* [Tic-Tac-Toe](https://archive.ics.uci.edu/ml/datasets/tic-tac-toe+Endgame)\r\n* [Car Evaluation](https://archive.ics.uci.edu/ml/datasets/car+evaluation)\r\n* [MONK's](https://archive.ics.uci.edu/ml/datasets/MONK's+Problems)\r\n* [ProPublica](https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm)\r\n* [FICO](https://community.fico.com/s/explainable-machine-learning-challenge)\r\n\r\n\r\n### Example test code\r\n\r\nWe provide our test code in `test_accuracy.py`.\r\n\r\n### Citing OSDT\r\n\r\nThe [OSDT paper](\u003chttps://arxiv.org/abs/1904.12847\u003e) was published in\r\n*Neural Information Processing Systems (NeurIPS) 2019*.\r\nIf you use OSDT in a scientific publication, we would appreciate\r\ncitations to the following paper:\r\n\r\n    @inproceedings{NEURIPS2019_ac52c626,\r\n     author = {Hu, Xiyang and Rudin, Cynthia and Seltzer, Margo},\r\n     booktitle = {Advances in Neural Information Processing Systems},\r\n     editor = {H. Wallach and H. Larochelle and A. Beygelzimer and F. d\\textquotesingle Alch\\'{e}-Buc and E. Fox and R. Garnett},\r\n     pages = {7265-7273},\r\n     publisher = {Curran Associates, Inc.},\r\n     title = {Optimal Sparse Decision Trees},\r\n     url = {https://proceedings.neurips.cc/paper_files/paper/2019/file/ac52c626afc10d4075708ac4c778ddfc-Paper.pdf},\r\n     volume = {32},\r\n     year = {2019}\r\n    }\r\n\r\n\r\nor:\r\n\r\n    Hu, X., Rudin, C., and Seltzer, M. (2019). Optimal sparse decision trees. In Advances in Neural Information Processing Systems, pp. 
7265–7273.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxiyanghu%2FOSDT","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxiyanghu%2FOSDT","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxiyanghu%2FOSDT/lists"}