{"id":14989264,"url":"https://github.com/rambatino/chaid","last_synced_at":"2025-04-04T20:15:52.550Z","repository":{"id":51241887,"uuid":"56257335","full_name":"Rambatino/CHAID","owner":"Rambatino","description":"A python implementation of the common CHAID algorithm","archived":false,"fork":false,"pushed_at":"2024-07-22T08:32:54.000Z","size":5505,"stargazers_count":156,"open_issues_count":0,"forks_count":53,"subscribers_count":12,"default_branch":"master","last_synced_at":"2025-04-04T20:15:46.733Z","etag":null,"topics":["chaid","marketing-statistics","spss","stats","tree"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Rambatino.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-04-14T17:34:11.000Z","updated_at":"2025-03-04T10:14:40.000Z","dependencies_parsed_at":"2022-08-17T10:30:28.539Z","dependency_job_id":"4b626dfc-e787-4ebd-933d-36cbaab6b47e","html_url":"https://github.com/Rambatino/CHAID","commit_stats":{"total_commits":224,"total_committers":6,"mean_commits":"37.333333333333336","dds":0.2991071428571429,"last_synced_commit":"bf3d1ef0cf5a0fbe463ab3b1fcbe9fb0abd6d58f"},"previous_names":[],"tags_count":35,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rambatino%2FCHAID","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rambatino%2FCHAID/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rambatino%2FCHAID/releases","manifes
ts_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rambatino%2FCHAID/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Rambatino","download_url":"https://codeload.github.com/Rambatino/CHAID/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247242681,"owners_count":20907134,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chaid","marketing-statistics","spss","stats","tree"],"created_at":"2024-09-24T14:17:57.653Z","updated_at":"2025-04-04T20:15:52.527Z","avatar_url":"https://github.com/Rambatino.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cimg src=\"https://img.shields.io/pypi/v/CHAID.svg\"\u003e \u003cimg src=\"https://img.shields.io/pypi/dm/chaid.svg?maxAge=2592000\u0026label=installs\u0026color=%2327B1FF\"\u003e \u003cimg src=\"https://img.shields.io/pypi/pyversions/pytest.svg\"\u003e \u003cimg src=\"https://circleci.com/gh/Rambatino/CHAID.png?style=shield\u0026circle-token=031aab51ad1dea4a698d02f02288887f06c1a9ef\"\u003e \u003ca href=\"https://codecov.io/gh/Rambatino/CHAID\"\u003e\u003cimg src=\"https://codecov.io/gh/Rambatino/CHAID/branch/master/graph/badge.svg\" alt=\"Codecov\" /\u003e\u003c/a\u003e\n\nChi-Squared Automatic Inference Detection\n=========================================\n\nThis package provides a python implementation of the [Chi-Squared Automatic Inference Detection (CHAID) decision tree](https://en.wikipedia.org/wiki/CHAID) as well as [exhaustive 
CHAID](https://github.com/Rambatino/CHAID/issues/112)\n\n\nInstallation\n------------\n\nCHAID is distributed via [pypi](https://pypi.python.org/pypi/CHAID) and can be installed like:\n\n``` bash\npip3 install CHAID\n```\n\nIf you need support for graphs, the optional graph packages must be installed together with CHAID like:\n``` bash\npip install CHAID[graph]\n```\n\nIf you need support to read in a `.sav` file (SPSS), you will also need to install optional packages like:\n``` bash\npip install CHAID[spss]\n```\n\nTo install multiple optional packages, you can use a comma-separated list like:\n``` bash\npip install CHAID[graph,spss]\n```\n\nAlternatively, you can clone the repository and install via\n``` bash\npip install -e path/to/your/checkout\n```\n\nN.B. although we've made some attempt at supporting Python 2.7 (see [here](https://github.com/Rambatino/CHAID/pull/103)), we don't encourage its use as it has reached its [End Of Life (EOL)](https://www.python.org/doc/sunset-python-2).\n\nCreating a CHAID Tree\n---------------\n\n``` python\nfrom CHAID import Tree, NominalColumn\nimport pandas as pd\nimport numpy as np\n\n\n## create the data\nndarr = np.array(([1, 2, 3] * 5) + ([2, 2, 3] * 5)).reshape(10, 3)\ndf = pd.DataFrame(ndarr)\ndf.columns = ['a', 'b', 'c']\narr = np.array(([1] * 5) + ([2] * 5))\ndf['d'] = arr\n\n\u003e\u003e\u003e df\n   a  b  c  d\n0  1  2  3  1\n1  1  2  3  1\n2  1  2  3  1\n3  1  2  3  1\n4  1  2  3  1\n5  2  2  3  2\n6  2  2  3  2\n7  2  2  3  2\n8  2  2  3  2\n9  2  2  3  2\n\n## set the CHAID input parameters\nindependent_variable_columns = ['a', 'b', 'c']\ndep_variable = 'd'\n\n## create the Tree via pandas\ntree = Tree.from_pandas_df(df, dict(zip(independent_variable_columns, ['nominal'] * 3)), dep_variable)\n## create the same tree, but without pandas helper\ntree = Tree.from_numpy(ndarr, arr, split_titles=['a', 'b', 'c'], min_child_node_size=5)\n## create the same tree using the tree constructor\ncols = [\n  NominalColumn(ndarr[:,0], name='a'),\n 
 NominalColumn(ndarr[:,1], name='b'),\n  NominalColumn(ndarr[:,2], name='c')\n]\ntree = Tree(cols, NominalColumn(arr, name='d'), {'min_child_node_size': 5})\n\n\u003e\u003e\u003e tree.print_tree()\n([], {1: 5, 2: 5}, ('a', p=0.001565402258, score=10.0, groups=[[1], [2]]), dof=1))\n├── ([1], {1: 5, 2: 0}, \u003cInvalid Chaid Split\u003e)\n└── ([2], {1: 0, 2: 5}, \u003cInvalid Chaid Split\u003e)\n\n## to get a treelib Tree object,\n\u003e\u003e\u003e tree.to_tree()\n\u003ctreelib.tree.Tree object at 0x114e2e350\u003e\n\n## the different nodes of the tree can be accessed like\nfirst_node = tree.tree_store[0]\n\n\u003e\u003e\u003e first_node\n([], {1: 5, 2: 5}, ('a', p=0.001565402258, score=10.0, groups=[[1], [2]]), dof=1))\n\n## the properties of the node can be accessed like\n\u003e\u003e\u003e first_node.members\n{1: 5, 2: 5}\n\n## the properties of the split can be accessed like\n\u003e\u003e\u003e first_node.split.p\n0.001565402258002549\n\u003e\u003e\u003e first_node.split.score\n10.0\n```\n\nCreating a Tree using Bartlett's or Levene's Significance Test for Continuous Variables\n----------\n\nWhen the dependent variable is continuous, the chi-squared test does not work due to very low frequencies of values across subgroups. 
As a consequence, and because the F-test is very sensitive to deviations from normality, the normality of the dependent set is tested first: [Bartlett's test](https://en.wikipedia.org/wiki/Bartlett%27s_test) for significance is used when the data is normally distributed (although the subgroups may not necessarily be so), and [Levene's test](https://en.wikipedia.org/wiki/Levene%27s_test) is used when the data is non-normal.\n\n``` python\nfrom CHAID import Tree\nimport numpy as np\nimport pandas as pd\n\n## create the data\nndarr = np.array(([1, 2, 3] * 5) + ([2, 2, 3] * 5)).reshape(10, 3)\ndf = pd.DataFrame(ndarr)\ndf.columns = ['a', 'b', 'c']\ndf['d'] = np.random.normal(300, 100, 10)\nindependent_variable_columns = ['a', 'b', 'c']\ndep_variable = 'd'\n\n\u003e\u003e\u003e df\n   a  b  c           d\n0  1  2  3  262.816747\n1  1  2  3  240.139085\n2  1  2  3  204.224083\n3  1  2  3  231.024752\n4  1  2  3  263.176338\n5  2  2  3  440.371621\n6  2  2  3  221.762452\n7  2  2  3  197.290268\n8  2  2  3  275.925549\n9  2  2  3  238.471850\n\n## create the Tree via pandas\ntree = Tree.from_pandas_df(df, dict(zip(independent_variable_columns, ['nominal'] * 3)), dep_variable, dep_variable_type='continuous')\n\n## print the tree (though not enough power to split)\n\u003e\u003e\u003e tree.print_tree()\n([], {'s.t.d': 86.562258585515579, 'mean': 297.52027436303212}, \u003cInvalid Chaid Split\u003e)\n```\n\nParameters\n----------\n* `df`: Pandas DataFrame\n* `i_variables: Dict\u003cstring, string\u003e`: Independent variable column names as keys and the type as the values (nominal or ordinal)\n* `d_variable: String`: Dependent variable column name\n* `opts: {}`:\n  * `alpha_merge: Float (default = 0.05)`: If the respective test for a given pair of predictor categories is not statistically significant as defined by an `alpha_merge` value, the least significant predictor categories are merged and the splitting of the node is attempted with the newly formed categories\n  * `max_depth: Integer (default = 2)`: The 
maximum depth of the tree\n  * `min_parent_node_size: Float (default = 30)`: The minimum number of respondents required for a split to occur on a particular node\n  * `min_child_node_size: Float (default = 0)`: If the split of a node results in a child node whose size is less than `min_child_node_size`, the child nodes that have too few cases will merge with the most similar child node, as measured by the largest of the p-values. However, if the resulting number of child nodes is 1, the node will not be split.\n  * `max_splits: Integer or None (default = None)`: If specified, child nodes will continue to be merged until the number of splits at a single node is at most `max_splits`. If not specified, this will be ignored.\n  * `split_threshold: Float (default = 0)`: The split threshold when bucketing root node surrogate splits\n  * `weight: String (default = None)`: The name of the weight column\n  * `dep_variable_type (default = categorical, other_options = continuous)`: Whether the dependent variable is 'categorical' or 'continuous'\n\nRunning from the Command Line\n-----------------------------\n\nYou can play around with the repo by cloning and running this from the command line:\n\n```\npython -m CHAID tests/data/titanic.csv survived sex embarked --max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05\n```\n\nIt calls the `print_tree()` method, which prints the tree to the terminal:\n\n``` python\n([], {0: 809, 1: 500}, (sex, p=1.47145310169e-81, chi=365.886947811, groups=[['female'], ['male']]))\n├── (['female'], {0: 127, 1: 339}, (embarked, p=9.17624191599e-07, chi=24.0936494474, groups=[['C', '\u003cmissing\u003e'], ['Q', 'S']]))\n│   ├── (['C', '\u003cmissing\u003e'], {0: 11, 1: 104}, \u003cInvalid Chaid Split\u003e)\n│   └── (['Q', 'S'], {0: 116, 1: 235}, \u003cInvalid Chaid Split\u003e)\n└── (['male'], {0: 682, 1: 161}, (embarked, p=5.017855245e-05, chi=16.4413525404, groups=[['C'], ['Q', 'S']]))\n    ├── (['C'], {0: 109, 1: 
48}, \u003cInvalid Chaid Split\u003e)\n    └── (['Q', 'S'], {0: 573, 1: 113}, \u003cInvalid Chaid Split\u003e)\n```\n\nor to test the continuous dependent variable case:\n\n```\npython -m CHAID tests/data/titanic.csv fare sex embarked --max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05 --dependent-variable-type continuous\n```\n\n``` python\n([], {'s.t.d': 51.727293077231302, 'mean': 33.270043468296414}, (embarked, p=8.46027456424e-24, score=55.3476155546, groups=[['C'], ['Q', '\u003cmissing\u003e'], ['S']]), dof=1308))\n├── (['C'], {'s.t.d': 84.029951444532529, 'mean': 62.336267407407405}, (sex, p=0.0293299541476, score=4.7994643184, groups=[['female'], ['male']]), dof=269))\n│   ├── (['female'], {'s.t.d': 90.687664523113241, 'mean': 81.12853982300885}, \u003cInvalid Chaid Split\u003e)\n│   └── (['male'], {'s.t.d': 76.07029674707077, 'mean': 48.810619108280257}, \u003cInvalid Chaid Split\u003e)\n├── (['Q', '\u003cmissing\u003e'], {'s.t.d': 15.902095006812658, 'mean': 13.490467999999998}, \u003cInvalid Chaid Split\u003e)\n└── (['S'], {'s.t.d': 37.066877311088625, 'mean': 27.388825164113786}, (sex, p=3.43875930713e-07, score=26.3745361415, groups=[['female'], ['male']]), dof=913))\n    ├── (['female'], {'s.t.d': 48.971933059814894, 'mean': 39.339305154639177}, \u003cInvalid Chaid Split\u003e)\n    └── (['male'], {'s.t.d': 28.242580058030033, 'mean': 21.806819261637241}, \u003cInvalid Chaid Split\u003e)\n```\n\nNote that the frequency of the dependent variable is replaced with the standard deviation and mean of the continuous set at each node and that any NaNs in the dependent set are automatically converted to 0.0.\n\nGenerating Splitting Rules\n----------\nAppend `--rules` to the cli or call `tree.classification_rules(node)` (either pass in the node or if node is None then it will return all splitting rules)\n\n```\npython -m CHAID tests/data/titanic.csv fare sex embarked --max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05 --dependent-variable-type 
continuous --rules\n```\n\n``` python\n{'node': 2, 'rules': [{'variable': 'sex', 'data': ['female']}, {'variable': 'embarked', 'data': ['C']}]}\n{'node': 3, 'rules': [{'variable': 'sex', 'data': ['male']}, {'variable': 'embarked', 'data': ['C']}]}\n{'node': 4, 'rules': [{'variable': 'embarked', 'data': ['Q', '\u003cmissing\u003e']}]}\n{'node': 6, 'rules': [{'variable': 'sex', 'data': ['female']}, {'variable': 'embarked', 'data': ['S']}]}\n{'node': 7, 'rules': [{'variable': 'sex', 'data': ['male']}, {'variable': 'embarked', 'data': ['S']}]}\n```\n\nParameters\n-------\nRun `python -m CHAID -h` to see a description of the command line arguments\n\nHow to Read the Tree\n-------\n\nWe'll start with a real-world example using the Titanic dataset.\n\nFirst make sure to install all required packages:\n\n``` bash\npython setup.py install \u0026\u0026 pip install ipdb\n```\n\nRun:\n```bash\npython -m CHAID tests/data/titanic.csv survived sex embarked --max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05\n```\n\nafter placing an ipdb statement on line 85 of \\_\\_main\\_\\_.py as in the example below. 
The parameters set a maximum depth of 4 levels and a minimum parent node size of 2, and merge groups when the p-value from comparing them is greater than 0.05.\n\n```python\n82        tree = Tree.from_pandas_df(data, independent_variables,\n83                                   nspace.dependent_variable[0],\n84                                   variable_types=types, **config)\n---\u003e 85   import ipdb; ipdb.set_trace()\n86    \n87        if nspace.classify:\n88            predictions = pd.Series(tree.node_predictions())\n89            predictions.name = 'node_id'\n90            data = pd.concat([data, predictions], axis=1)\n91            print(data.to_csv())\n92        elif nspace.predict:\n```\n\nRunning `tree.print_tree()` gives:\n\n``` python\n([], {0: 809, 1: 500}, (sex, p=1.47145310169e-81, score=365.886947811, groups=[['female'], ['male']]), dof=1))\n├── (['female'], {0: 127, 1: 339}, (embarked, p=9.17624191599e-07, score=24.0936494474, groups=[['C', '\u003cmissing\u003e'], ['Q', 'S']]), dof=1))\n│   ├── (['C', '\u003cmissing\u003e'], {0: 11, 1: 104}, \u003cInvalid Chaid Split\u003e)\n│   └── (['Q', 'S'], {0: 116, 1: 235}, \u003cInvalid Chaid Split\u003e)\n└── (['male'], {0: 682, 1: 161}, (embarked, p=5.017855245e-05, score=16.4413525404, groups=[['C'], ['Q', 'S']]), dof=1))\n    ├── (['C'], {0: 109, 1: 48}, \u003cInvalid Chaid Split\u003e)\n    └── (['Q', 'S'], {0: 573, 1: 113}, \u003cInvalid Chaid Split\u003e)\n```\n\nas shown above. The first line is the root node; all the data is present in this node. 
The vertical bars originating from a node represent paths to that node's children.\n\nRunning `tree.tree_store` will give you a list of all the nodes in the tree:\n\n``` python\n[\n  ([], {0: 809, 1: 500}, (sex, p=1.47145310169e-81, score=365.886947811, groups=[['female'], ['male']]), dof=1)),\n  (['female'], {0: 127, 1: 339}, (embarked, p=9.17624191599e-07, score=24.0936494474, groups=[['C', '\u003cmissing\u003e'], ['Q', 'S']]), dof=1)),\n  (['C', '\u003cmissing\u003e'], {0: 11, 1: 104}, \u003cInvalid Chaid Split\u003e), (['Q', 'S'], {0: 116, 1: 235}, \u003cInvalid Chaid Split\u003e),\n  (['male'], {0: 682, 1: 161}, (embarked, p=5.017855245e-05, score=16.4413525404, groups=[['C'], ['Q', 'S']]), dof=1)),\n  (['C'], {0: 109, 1: 48}, \u003cInvalid Chaid Split\u003e), (['Q', 'S'], {0: 573, 1: 113}, \u003cInvalid Chaid Split\u003e)\n]\n```\n\nSo let's inspect the root node `tree.tree_store[0]`:\n\n``` python\n([], {0: 809, 1: 500}, (sex, p=1.47145310169e-81, score=365.886947811, groups=[['female'], ['male']]), dof=1))\n```\n\nNodes have certain properties. Firstly, they show the values of the parent's split column that lead to this node (for a root node this is empty, `[]`). The second property, `{0: 809, 1: 500}`, shows the members of that node and represents the current frequency of the dependent variable. In this case, it is all the answers in the 'survived' column, as that was the first column passed to the program on the command line (`python -m CHAID tests/data/titanic.csv survived`). The next property represents the splitting of the node. 
It shows which column was chosen to make that split (in this case, `sex`), the p-value of the split, the chi-squared score, the degrees of freedom associated with that split (1, in this case) and, most importantly, which values of `sex` create the new nodes.\n\nThese properties can be accessed:\n\n``` python\nipdb\u003e root_node = tree.tree_store[0]\nipdb\u003e root_node.choices\n[]\nipdb\u003e root_node.members\n{0: 809, 1: 500}\nipdb\u003e root_node.split\n(sex, p=1.47145310169e-81, score=365.886947811, groups=[['female'], ['male']]), dof=1)\n```\n\nThe split variable can be further inspected:\n\n``` python\nipdb\u003e split = root_node.split\nipdb\u003e split.column\n'sex'\nipdb\u003e split.p\n1.4714531016922664e-81\nipdb\u003e split.score\n365.88694781112048\nipdb\u003e split.dof\n1\nipdb\u003e split.groupings\n\"[['female'], ['male']]\"\n```\n\nTherefore, in this example, the root node is split on the column 'sex' in the data, splitting up the females and males. These females and males each form a new node and, further down, the all-male and all-female nodes are split on the column 'embarked' (although they needn't split on the same column). An `\u003cInvalid Chaid Split\u003e` is reached when either the node is pure (only one value of the dependent variable remains) or when a terminating parameter is met (e.g. min node size, or max depth [see tree parameters above])\n\nThe conclusion drawn from this tree is that: \"Gender was the most important factor driving the survival of people on the Titanic: females had a much higher likelihood of surviving (1 in the 'survived' column means they survived and 0 means they died). 
Of those females, those who embarked at Cherbourg (`embarked` value 'C', node 2) had a much higher likelihood of surviving.\"\n\nExporting the tree\n-------\n\nIf you want to export the tree to a dot file, then use:\n\n```python\ntree.to_tree()\n```\n\nThis creates a [treelib](https://github.com/caesar0301/treelib/blob/master/treelib) tree, which has a `.to_graphviz()` method (see [here](https://github.com/caesar0301/treelib/blob/master/treelib/tree.py#L894)).\n\n\nIn order to visually graph the CHAID tree, you'll need to install two more libraries that aren't distributed via pypi:\n\n- graphviz - see [here](https://stackoverflow.com/questions/35064304/runtimeerror-make-sure-the-graphviz-executables-are-on-your-systems-path-aft) for platform specific installations\n- orca - see [the README.md](https://github.com/plotly/orca) for platform specific installations\n\nYou can export the tree to .gv and png using:\n\n```python\ntree.render(path=None, view=False)\n```\n\nThis saves it to the file specified at `path`, and the result can be viewed immediately when `view=True`.\n\nThis can also be triggered from the command line using `--export` or `--export-path`. The former causes it to be stored in a newly created `trees` folder and the latter specifies the location of the file. Both will trigger an auto-viewing of the tree. E.g.:\n\n```bash\npython -m CHAID tests/data/titanic.csv survived sex embarked --max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05 --export\n```\n\n```bash\npython -m CHAID tests/data/titanic.csv survived sex embarked --max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05 --export-path YOUR_PATH.gv\n```\n\nThe output will look like:\n\n![](https://github.com/Rambatino/CHAID/blob/master/docs/2019-04-01%2011:45:43.gv.png?raw=true \"CHAID Tree\")\n\nTesting\n-------\n\nCHAID uses [`pytest`](https://pypi.python.org/pypi/pytest) for its unit testing. 
The tests can be run from the root of a checkout with:\n``` bash\npy.test\n```\n\nIf you wish to run the unit tests across multiple Python versions to make sure your changes are compatible, run [`tox`](https://github.com/tox-dev/tox) (or [`detox`](https://github.com/tox-dev/detox/releases) to run in parallel). You may need to run `pip install tox tox-pyenv detox` \u0026 `brew install pyenv` beforehand.\n\nCaveats\n-------\n\n* Unlike SPSS, this library doesn't modify the data internally. This means that weight variables aren't rounded as they are in SPSS.\n* Every row is valid, even if all values are NaN or undefined. This differs from SPSS, which in the weighted case strips out any row in which all the independent variables are NaN\n\nUpcoming Features\n-------\n\n* Accuracy Estimation using Machine Learning techniques on the data\n* Binning of continuous independent variables\n\nGenerating the CHANGELOG.md\n--------\n\n`gem install github_changelog_generator \u0026\u0026 github_changelog_generator --exclude-labels maintenance,refactor,testing`\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frambatino%2Fchaid","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frambatino%2Fchaid","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frambatino%2Fchaid/lists"}