{"id":13751817,"url":"https://github.com/hlamotte/decision-tree","last_synced_at":"2025-05-09T18:32:40.264Z","repository":{"id":51012311,"uuid":"353851166","full_name":"hlamotte/decision-tree","owner":"hlamotte","description":"Implementation of categorical decision tree in C++","archived":false,"fork":false,"pushed_at":"2021-08-01T04:28:52.000Z","size":9343,"stargazers_count":22,"open_issues_count":0,"forks_count":6,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-11-16T04:32:33.831Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hlamotte.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-04-01T23:25:23.000Z","updated_at":"2024-07-05T18:56:56.000Z","dependencies_parsed_at":"2022-09-03T05:01:25.680Z","dependency_job_id":null,"html_url":"https://github.com/hlamotte/decision-tree","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hlamotte%2Fdecision-tree","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hlamotte%2Fdecision-tree/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hlamotte%2Fdecision-tree/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hlamotte%2Fdecision-tree/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hlamotte","download_url":"https://codeload.github.com/hlamotte/decision-tree/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":
"github","repositories_count":253303109,"owners_count":21886890,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:00:55.242Z","updated_at":"2025-05-09T18:32:35.231Z","avatar_url":"https://github.com/hlamotte.png","language":"Jupyter Notebook","funding_links":[],"categories":["梯度提升和树模型"],"sub_categories":[],"readme":"# Decision tree in C++\n## How to compile and run tests locally\nEnsure you have CMake installed.\n\nInstall CMake on MacOS:\n```\n$ brew install cmake\n$ cmake --version\n```\n\nThis project has been successfully compiled using clang++ compiler on MacOS (AppleClang 11.0.3). Ensure you have Clang installed.\n\nCompiling project:\n```\n$ mkdir build\n$ /usr/local/bin/cmake -S . 
-B ./build/ -D CMAKE_CXX_COMPILER=/usr/bin/clang++ -DCMAKE_VERBOSE_MAKEFILE=ON\n$ cd build \u0026\u0026 make\n```\nTo run the tests after a successful compilation:\n```bash\n$ cd test \u0026\u0026 ./DecisionTests \u0026\u0026 cd ../..\n```\n## Using this decision tree classifier\nAn example Jupyter notebook that trains this classifier on the Titanic dataset can be found [here](notebooks/titanic_predictions.ipynb).\n\nAs can be seen in the notebook, the classifier implemented here does not produce identical results to the scikit-learn Decision Tree Classifier, but we see 85% accurate results on a test data set compared with the scikit-learn implementation.\n\n\n## Overview of this implementation of a decision tree classifier\nA decision tree is a simple machine learning algorithm that uses a series of features of an observation to predict a target outcome class.\n\nFor example, the target outcome class could be whether a company should interview a candidate for a job, and the series of features could be:\n\n1. The university they attended\n2. The subject they studied at university\n3. Their highest degree qualification\n4. Their previous employer\n5. Their previous job title\n\nThe decision tree is built (\"trained\") using a series of observations (training data) with their corresponding outcome class to create a tree that best fits the training data. 
This tree can then be used to make predictions for observations whose target outcome class is unknown.\n\nFor the interview screening tool outlined above, the training data might look like:\n\n| PersonID | University | UniSubject | Degree | PrevEmployer   | PrevTitle      | Interviewed |\n|----------|------------|------------|--------|----------------|----------------|-------------|\n| 1        | Harvard    | Math       | MSc    | Google         | Data Scientist | Yes         |\n| 2        | Stanford   | Math       | MSc    | Microsoft      | Data Analyst   | Yes         |\n| 3        | Cornell    | English    | MA     | New York Times | Reporter       | No          |\n\nHere the outcome class used for training is whether the candidate was interviewed.\n\nIn this repo, a basic categorical-only decision tree classifier is implemented in C++ as an exercise in learning a low-level language.\n\n## Training a decision tree classifier\n- Start with a number of observations, each a vector of categorical features with a corresponding outcome class.\n- For each feature, calculate the [Gini gain](https://victorzhou.com/blog/gini-impurity/).\n- Select the feature to split on that maximizes the Gini gain.\n- Split the data based on that feature to create two new sub-trees.\n- Repeat the splitting of nodes in each sub-tree until only one outcome class is predicted after splitting (Gini gain = 0).\n\n## Inputs and outputs\n### Inputs\nTo simplify the problem to the core business logic of a categorical decision tree classifier, we will assume that __all categories are encoded as integers__ and then input to the decision tree as a training CSV and a test CSV. The __header (feature names) will be omitted from the input__, and each *row* will contain all the observations for a feature; each column will represent one individual observation. 
Note that this is rotated 90 degrees from a traditional CSV representation.\n\nAssume that the training data features contain at least one instance of every class that can exist and that classes are indexed from zero.\n\nThe last row is assumed to be the outcome (target variable) in the training dataset. The test dataset has one fewer row, as the outcome is unknown.\n\n\n## Required data structures\n### Training and test data structure\n\nAn array of pointers, one per feature; each pointer points to an array holding the value of that feature for each training or test observation.\n\nAn array of outcome classes for each training and test observation.\n\n### Decision tree architecture structure\nTree architecture.\n\nEach node has the following attributes:\n1. ChildLeft - pointer to left child node\n2. ChildRight - pointer to right child node\n3. SplitFeature - feature used at split\n4. SplitCategory - category of that feature used at split. Observations containing that category are passed to the right child tree. If the node is a leaf, the predicted class.\n5. GiniGain - gain as a result of split.\n\n## UML\n*Tree class*\n- Tree()\n- ~Tree()\n- Node* head\n- Traverse()\n- Fit()\n- Transform()\n- CSVReader\n\nStretch nice-to-haves:\n- Load()\n- Save()\n- Look into HDF5 format for saving\n\n*Node class*\n- Node()\n- ~Node()\n- Node* children\n- DataFrame data\n- int splitFeature\n- int splitCategory\n- float giniImpurity\n\n\n## Gini gain\nGini impurity is the probability of incorrectly classifying a randomly chosen element in the dataset if it were randomly labeled *according to the class distribution* in the dataset.\n\nGini gain is calculated as the original Gini impurity of the dataset minus the weighted sum of the Gini impurities of the subsets produced by the split.\n\nWe choose the split that maximises the Gini gain.\n\n## Development using VSCode\nDevelopment uses CMake and the GoogleTest framework in VSCode, based on this [video](https://www.youtube.com/watch?v=Lp1ifh9TuFI).\n\nBuild the binaries with F7. 
Run tests from the ribbon at the bottom of the VSCode UI.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhlamotte%2Fdecision-tree","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhlamotte%2Fdecision-tree","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhlamotte%2Fdecision-tree/lists"}