{"id":15697009,"url":"https://github.com/tlapusan/woodpecker","last_synced_at":"2025-05-08T23:29:24.471Z","repository":{"id":96741735,"uuid":"171097395","full_name":"tlapusan/woodpecker","owner":"tlapusan","description":"A python library used for tree structure interpretation. ","archived":false,"fork":false,"pushed_at":"2024-01-25T15:19:45.000Z","size":4325,"stargazers_count":8,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-31T19:21:17.783Z","etag":null,"topics":["decision-trees","machine-learning","random-forest","scikit-learn","sklearn","visualization"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tlapusan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":["tlapusan"],"patreon":null,"open_collective":null,"ko_fi":null,"tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"otechie":null,"lfx_crowdfunding":null,"custom":null}},"created_at":"2019-02-17T08:11:23.000Z","updated_at":"2024-05-29T20:01:39.000Z","dependencies_parsed_at":"2024-10-09T13:45:42.314Z","dependency_job_id":null,"html_url":"https://github.com/tlapusan/woodpecker","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tlapusan%2Fwoodpecker","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tlapusan%2Fwoodpecker/tags","releases_url":"https://repos.
ecosyste.ms/api/v1/hosts/GitHub/repositories/tlapusan%2Fwoodpecker/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tlapusan%2Fwoodpecker/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tlapusan","download_url":"https://codeload.github.com/tlapusan/woodpecker/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253162897,"owners_count":21864000,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["decision-trees","machine-learning","random-forest","scikit-learn","sklearn","visualization"],"created_at":"2024-10-03T19:10:47.548Z","updated_at":"2025-05-08T23:29:24.447Z","avatar_url":"https://github.com/tlapusan.png","language":"Jupyter Notebook","funding_links":["https://github.com/sponsors/tlapusan"],"categories":[],"sub_categories":[],"readme":"\n# Purpose \nA Python library used for model structure interpretation. \u003cbr\u003e\nThe library currently supports [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), [DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor) and [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from [scikit-learn](https://scikit-learn.org/stable/). 
\nFuture versions of the library will support other algorithms, such as \n[RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) and [XGBoost](https://xgboost.readthedocs.io/en/latest/).\n\n\nTo become a better machine learning engineer, it is important to understand the model structure more deeply and to build an intuition for how changes to the model inputs are reflected in model performance. \nBy changing model inputs we mean adding more data, adding new features and changing model hyperparameters.\n\n\nThis library was developed with two main ideas in mind:\n- to help us better understand the model structure and the model results and, based on this, to properly choose other hyperparameter values or another set of features for the next iteration\n- to justify/explain the predictions of ML models to both technical and non-technical people\n\n# How to install?\npip install git+https://github.com/tlapusan/woodpecker.git\n\n# Usage example\n\n### Training example\nThe well-known [Titanic dataset](https://www.kaggle.com/c/titanic/data) was chosen to show the library's capabilities.\n\n\u003e features = [\"Pclass\", \"Age\", \"Fare\", \"Sex_label\", \"Cabin_label\", \"Embarked_label\"] \u003cbr\u003e\n\u003e target = \"Survived\" \n\nLet's see some descriptive statistics about the training set. 
\u003cbr\u003e \n\u003e train[features].describe() \n\n![](https://github.com/tlapusan/woodpecker/blob/version_0.1/resources/docs/images/classification/titanic_train_describe.png)\n   \n### Train the model \n\u003e model = DecisionTreeClassifier(criterion=\"entropy\", random_state=random_state, min_samples_split=20) \u003cbr\u003e\n\u003e model.fit(train[features], train[target])\n\n### Start using the library\n\n\u003e dts = DecisionTreeStructure(model, train, features, target)\n\n#### Visualize feature importance\n\nYou don't have to write all the code needed to extract the feature importances,\nmap them to feature names and sort them.\nInstead, you just call this simple utility function. \n\n\u003e dts.show_features_importance() \n\n![](https://github.com/tlapusan/woodpecker/blob/version_0.1/resources/docs/images/classification/feature_importance.png)\n\n#### Visualize decision tree structure \n\nAs in the previous case, this is a utility function that \nwraps all the code needed to visualize the decision tree structure using graphviz.\n\n\u003e dts.show_decision_tree_structure() \n\n![](https://github.com/tlapusan/woodpecker/blob/version_0.1/resources/docs/images/classification/decision_tree_structure.png)\n\n#### Leaves impurity distribution\n\nImpurity is a metric that shows how confident a leaf's prediction is. \u003cbr\u003e\nWith entropy as the criterion and two classes, as here, impurity ranges between 0 and 1. \n0 means that the leaf node is very confident about its predictions, 1 means the opposite.\n\nThe tree's performance is directly influenced by the performance of each leaf, so it's very important to have a general \noverview of how the leaf impurities look.\n\n\u003e dts.show_leaf_impurity_distribution(bins=40, figsize=(20, 7))\n\n![](https://github.com/tlapusan/woodpecker/blob/version_0.1/resources/docs/images/classification/leaves_impurity_distribution.png)\n\n#### Leaves sample distribution\n\nSamples is a metric that shows how many examples from the training set reached a node. 
\u003cbr\u003e\nIdeally a leaf has an impurity very close to 0, but it's equally important \nfor a significant number of samples to reach that leaf. If the number of samples is very small, it could be a sign \nof overfitting for the leaf.\n\nThat's why it is important to look at both leaf impurity (previous plot) and leaf samples to get a better understanding of tree performance.\n\n\u003e dts.show_leaf_samples_distribution(bins=40, figsize=(20, 7))\n\n![](https://github.com/tlapusan/woodpecker/blob/version_0.1/resources/docs/images/classification/leaves_sample_distribution.png)\n\n#### Individual leaves metrics\n\nThere may be cases when we want to investigate the behavior of individual leaves. \u003cbr\u003e\nWe could analyze leaves with very good, medium or very low performance.  \n\n\n\u003e plt.figure(figsize=(40,30)) \u003cbr\u003e\n\u003e plt.subplot(3,1,1) \u003cbr\u003e\n\u003e dts.show_leaf_impurity() \u003cbr\u003e\n\n\u003e plt.subplot(3,1,2) \u003cbr\u003e\n\u003e dts.show_leaf_samples()\n\n\u003e plt.subplot(3,1,3) \u003cbr\u003e\n\u003e dts.show_leaf_samples_by_class()\n\n![](https://github.com/tlapusan/woodpecker/blob/version_0.1/resources/docs/images/classification/leaves_metrics.png)\n\n#### Get node samples\nThis function returns a dataframe with all the training samples reaching a node.\nAfter looking at the individual leaves metrics, we can see that there are some interesting leaves. 
\nFor example, leaf 19 has impurity 0, a lot of samples, and all of its passengers survived (Survived = 1).\nGetting the samples from such a leaf can help us discover patterns in the data or find out why a leaf \nhas good/bad performance.\n\n\u003e dts.get_node_samples(node_id=19)[features + [target]].describe()\n\n![](https://github.com/tlapusan/woodpecker/blob/version_0.1/resources/docs/images/classification/get_node_samples.png)\n\nWe can see that the majority of these people had a high socioeconomic status (Pclass = 1), most of them were young to middle-aged,\nthey bought an expensive ticket (the mean Fare in the training set is 32) and they are all women.\n\n#### Visualize decision tree path prediction\nThere will be moments when we need to justify why our model predicted a specific value.\nLooking at the whole tree and tracking the prediction path is not time-effective if the tree is deep.\n\nLet's look at the prediction path for the following sample: \n\u003ePclass             3.0 \u003cbr\u003e\nAge               28.0 \u003cbr\u003e\nFare              15.5 \u003cbr\u003e\nSex_label          0.0 \u003cbr\u003e\nCabin_label       -1.0 \u003cbr\u003e\nEmbarked_label     1.0 \u003cbr\u003e\n\n![](https://github.com/tlapusan/woodpecker/blob/version_0.1/resources/docs/images/classification/decision_tree_prediction_path.png)\n\n#### Visualize decision tree splits path prediction\nThis visualization shows the splits of the training data that the model was built from. \nIt can also be used as a way to learn how a decision tree is built.\n\nThe sample is the same as above. 
\n\u003e dts.show_decision_tree_splits_prediction(sample, bins=20)\n\n![](https://github.com/tlapusan/woodpecker/blob/version_0.1/resources/docs/images/classification/decision_tree_splits_prediction_part_1.png)\n![](https://github.com/tlapusan/woodpecker/blob/version_0.1/resources/docs/images/classification/decision_tree_splits_prediction_part_2.png)\n\n\nFor visualizations of the other algorithms, take a look inside the [notebooks folder](https://github.com/tlapusan/woodpecker/tree/master/notebooks).\n\n# Release History\n- 0.1\n    - model structure investigation for DecisionTreeClassifier \n- 0.2\n    - add visualisation for correct/wrong leaves predictions\n    - add setup.py file\n\n# Meta\nTudor Lapusan \u003cbr\u003e\ntwitter : @tlapusan \u003cbr\u003e \nemail : tudor.lapusan@gmail.com\n\n# Library dependencies\n\n- jupyter\n- matplotlib \n- scikit-learn \n- pandas \n\n# License\nThis project is licensed under the terms of the MIT license, see LICENSE.txt.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftlapusan%2Fwoodpecker","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftlapusan%2Fwoodpecker","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftlapusan%2Fwoodpecker/lists"}