{"id":13856863,"url":"https://github.com/WinVector/pyvtreat","last_synced_at":"2025-07-13T19:33:02.795Z","repository":{"id":62587936,"uuid":"197965982","full_name":"WinVector/pyvtreat","owner":"WinVector","description":"vtreat is a data frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. Distributed under a BSD-3-Clause license.","archived":false,"fork":false,"pushed_at":"2024-06-13T16:49:58.000Z","size":47379,"stargazers_count":120,"open_issues_count":2,"forks_count":8,"subscribers_count":9,"default_branch":"main","last_synced_at":"2024-11-22T05:05:09.702Z","etag":null,"topics":["data-science","machine-learning","pydata","python"],"latest_commit_sha":null,"homepage":"https://winvector.github.io/pyvtreat/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/WinVector.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-07-20T18:09:34.000Z","updated_at":"2024-11-19T08:34:27.000Z","dependencies_parsed_at":"2023-01-30T21:00:50.147Z","dependency_job_id":"d29ae277-b266-4890-8603-72f25a3ab75f","html_url":"https://github.com/WinVector/pyvtreat","commit_stats":{"total_commits":546,"total_committers":2,"mean_commits":273.0,"dds":0.03663003663003661,"last_synced_commit":"7ad7fa0384b0807db1aa4277b76176fa5bccb353"},"previous_names":[],"tags_count":41,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WinVector%2Fpyvtreat","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WinV
ector%2Fpyvtreat/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WinVector%2Fpyvtreat/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WinVector%2Fpyvtreat/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/WinVector","download_url":"https://codeload.github.com/WinVector/pyvtreat/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225844821,"owners_count":17533161,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","machine-learning","pydata","python"],"created_at":"2024-08-05T03:01:16.427Z","updated_at":"2024-11-22T14:31:07.589Z","avatar_url":"https://github.com/WinVector.png","language":"Python","readme":"\n\n[This](https://github.com/WinVector/pyvtreat) is the Python version of the `vtreat` data preparation system\n(also available as an [`R` package](http://winvector.github.io/vtreat/)).\n\n`vtreat` is a `DataFrame` processor/conditioner that prepares\nreal-world data for supervised machine learning or predictive modeling\nin a statistically sound manner.\n\n# Installing\n\nInstall `vtreat` with either of:\n\n  * `pip install vtreat`\n  * `pip install https://github.com/WinVector/pyvtreat/raw/master/pkg/dist/vtreat-0.4.6.tar.gz`\n\n\n# Video Introduction\n\n\n[Our PyData LA 2019 talk](https://youtu.be/qMCQFjEV90k) on `vtreat` is a good video introduction\nto what problems `vtreat` can be used to solve.  
The slides can be found [here](https://github.com/WinVector/Examples/blob/master/PyDataLA2019/vtreat_pydata2019.pdf).\n\n# Details\n\n`vtreat` takes an input `DataFrame`\nthat has a specified column called \"the outcome variable\" (or \"y\")\nthat is the quantity to be predicted (and must not have missing\nvalues).  Other input columns are possible explanatory variables\n(typically numeric or categorical/string-valued; these columns may\nhave missing values) that the user later wants to use to predict \"y\".\nIn practice such an input `DataFrame` may not be immediately suitable\nfor machine learning procedures that often expect only numeric\nexplanatory variables, and may not tolerate missing values.\n\nTo solve this, `vtreat` builds a transformed `DataFrame` where all\nexplanatory variable columns have been transformed into a number of\nnumeric explanatory variable columns, without missing values.  The\n`vtreat` implementation produces derived numeric columns that capture\nmost of the information relating the explanatory columns to the\nspecified \"y\" or dependent/outcome column through a number of numeric\ntransforms (indicator variables, impact codes, prevalence codes, and\nmore).  This transformed `DataFrame` is suitable for a wide range of\nsupervised learning methods, from linear regression through gradient\nboosted machines.\n\nThe idea is: you can take a `DataFrame` of messy real world data and\neasily, faithfully, reliably, and repeatably prepare it for machine\nlearning using `vtreat`'s documented methods.  
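As a flavor of the kind of transform `vtreat` automates, here is a minimal plain-`pandas` sketch of just the missing-value step (safe replacement plus an indicator column). This is an illustration of the idea only, not the `vtreat` implementation:

```python
import numpy as np
import pandas as pd

# Toy input: a numeric explanatory column with missing values and an
# outcome column "y" (no missing values), mirroring the setup above.
d = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": [10.0, 12.0, 14.0, 16.0],
})

# Safe replacement plus an indicator column: fill missing "x" with the
# observed mean, and record which rows were originally missing.
treated = pd.DataFrame({
    "x": d["x"].fillna(d["x"].mean()),
    "x_is_bad": d["x"].isna().astype(float),
    "y": d["y"],
})

print(treated["x"].tolist())         # [1.0, 2.0, 3.0, 2.0]
print(treated["x_is_bad"].tolist())  # [0.0, 1.0, 0.0, 1.0]
```

The real treatments go well beyond this, including y-aware encodings such as impact and prevalence codes.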
Incorporating\n`vtreat` into your machine learning workflow lets you quickly work\nwith very diverse structured data.\n\nTo get started with `vtreat`, please check out our documentation:\n\n  * [Getting started using `vtreat` for classification](https://github.com/WinVector/pyvtreat/blob/master/Examples/Classification/Classification.md).\n  * [Getting started using `vtreat` for regression](https://github.com/WinVector/pyvtreat/blob/master/Examples/Regression/Regression.md).\n  * [Getting started using `vtreat` for multi-category classification](https://github.com/WinVector/pyvtreat/blob/master/Examples/Multinomial/MultinomialExample.md).\n  * [Getting started using `vtreat` for unsupervised tasks](https://github.com/WinVector/pyvtreat/blob/master/Examples/Unsupervised/Unsupervised.md).\n  * [The `vtreat` Score Frame](https://github.com/WinVector/pyvtreat/blob/master/Examples/ScoreFrame/ScoreFrame.md) (a table mapping new derived variables to original columns).\n  * [The original `vtreat` paper](https://arxiv.org/abs/1611.09477), which describes the methodology and theory. 
(The article describes the `R` version; however, all of the examples are worked in `Python` [here](https://github.com/WinVector/pyvtreat/tree/master/Examples/vtreat_paper1)).\n\nSome common `vtreat` capabilities are documented here:\n\n  * **Score Frame** [score_frame_](https://github.com/WinVector/pyvtreat/blob/master/Examples/ScoreFrame/ScoreFrame.md), using the `score_frame_` information.\n  * **Cross Validation** [Customized Cross Plans](https://github.com/WinVector/pyvtreat/blob/master/Examples/CustomizedCrossPlan/CustomizedCrossPlan.md), controlling the cross validation plan.\n\n`vtreat` is available as a [`Python`/`Pandas` package](https://github.com/WinVector/pyvtreat), and also as an [`R` package](https://github.com/WinVector/vtreat).\n\n\n![](https://github.com/WinVector/vtreat/raw/master/tools/vtreat.png)\n\n(logo: Julie Mount, source: “The Harvest” by Boris Kustodiev 1914)\n\n`vtreat` is used by instantiating one of the classes\n`vtreat.NumericOutcomeTreatment`, `vtreat.BinomialOutcomeTreatment`, `vtreat.MultinomialOutcomeTreatment`, or `vtreat.UnsupervisedTreatment`.\nEach of these implements the [`sklearn.pipeline.Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) interface,\nexpecting a [Pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) as input. 
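The interface shape (a `fit()`/`transform()`/`fit_transform()` pattern over `DataFrame`s) can be sketched with a toy transformer. The class below is a hypothetical stand-in for illustration, not one of the real `vtreat` treatment classes:

```python
import numpy as np
import pandas as pd

class ToyTreatment:
    """Sketch of the sklearn-style interface shape the vtreat treatment
    classes expose; this toy version only does mean-fill plus missing
    indicators, not the real vtreat logic."""

    def fit(self, X, y=None):
        # learn per-column fill values from the training data
        self.fill_values_ = X.mean()
        return self

    def transform(self, X):
        out = {}
        for c in self.fill_values_.index:
            out[c] = X[c].fillna(self.fill_values_[c])
            out[c + "_is_bad"] = X[c].isna().astype(float)
        return pd.DataFrame(out)

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

d = pd.DataFrame({"Var1": [1.0, np.nan, 5.0]})
prepared = ToyTreatment().fit_transform(d)
print(list(prepared.columns))  # ['Var1', 'Var1_is_bad']
```

Unlike this toy, the real classes deliberately make `fit_transform()` return a cross-validated "cross frame" rather than plain `fit().transform()` output.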
The `vtreat` steps are intended to\nbe a \"one step fix\" that works well with [`sklearn.preprocessing`](https://scikit-learn.org/stable/modules/preprocessing.html) stages.\n\nThe `vtreat` `Pipeline.fit_transform()`\nmethod implements the powerful [cross-frame](https://cran.r-project.org/web/packages/vtreat/vignettes/vtreatCrossFrames.html) ideas (allowing the same data to be used for `vtreat` fitting and for later model construction, while\nmitigating nested model bias issues).\n\n## Background\n\nEven with modern machine learning techniques (random forests, support\nvector machines, neural nets, gradient boosted trees, and so on) or\nstandard statistical methods (regression, generalized regression,\ngeneralized additive models) there are *common* data issues that can\ncause modeling to fail. vtreat deals with a number of these in a\nprincipled and automated fashion.\n\nIn particular `vtreat` emphasizes a concept called “y-aware\npre-processing” and implements:\n\n  - Treatment of missing values through safe replacement plus an indicator\n    column (a simple but very powerful method when combined with\n    downstream machine learning algorithms).\n  - Treatment of novel levels (new values of categorical variable seen\n    during test or application, but not seen during training) through\n    sub-models (or impact/effects coding of pooled rare events).\n  - Explicit coding of categorical variable levels as new indicator\n    variables (with optional suppression of non-significant indicators).\n  - Treatment of categorical variables with very large numbers of levels\n    through sub-models (again [impact/effects\n    coding](http://www.win-vector.com/blog/2012/07/modeling-trick-impact-coding-of-categorical-variables-with-many-levels/)).\n  - Correct treatment of nested models or sub-models through data split / cross-frame methods\n    (please see\n    [here](https://winvector.github.io/vtreat/articles/vtreatOverfit.html))\n    or through the generation of “cross 
validated” data frames (see\n    [here](https://winvector.github.io/vtreat/articles/vtreatCrossFrames.html));\n    these are issues similar to what is required to build statistically\n    efficient stacked models or super-learners.\n\nThe idea is: even with a sophisticated machine learning algorithm there\nare *many* ways messy real world data can defeat the modeling process,\nand vtreat helps with at least ten of them. We emphasize: these problems\nare already in your data; you simply build better and more reliable\nmodels if you attempt to mitigate them. Automated processing is no\nsubstitute for actually looking at the data, but vtreat supplies\nefficient, reliable, documented, and tested implementations of many of\nthe commonly needed transforms.\n\nTo help explain the methods, we have prepared some documentation:\n\n  - The [vtreat package\n    overall](https://winvector.github.io/vtreat/index.html).\n  - [Preparing data for analysis using R\n    white-paper](http://winvector.github.io/DataPrep/EN-CNTNT-Whitepaper-Data-Prep-Using-R.pdf)\n  - The [types of new\n    variables](https://winvector.github.io/vtreat/articles/vtreatVariableTypes.html)\n    introduced by vtreat processing (including how to limit down to\n    domain appropriate variable types).\n  - Statistically sound treatment of the nested modeling issue\n    introduced by any sort of pre-processing (such as vtreat itself):\n    [nested over-fit\n    issues](https://winvector.github.io/vtreat/articles/vtreatOverfit.html)\n    and a general [cross-frame\n    solution](https://winvector.github.io/vtreat/articles/vtreatCrossFrames.html).\n  - [Principled ways to pick significance based pruning\n    levels](https://winvector.github.io/vtreat/articles/vtreatSignificance.html).\n\n## Example\n\n\nThis is a supervised classification example taken from the KDD 2009 cup.  
A copy of the data and details can be found here: [https://github.com/WinVector/PDSwR2/tree/master/KDD2009](https://github.com/WinVector/PDSwR2/tree/master/KDD2009).  The problem was to predict account cancellation (\"churn\") from very messy data (column names not given, numeric and categorical variables, many missing values, some categorical variables with a large number of possible levels).  In this example we show how to quickly use `vtreat` to prepare the data for modeling.  `vtreat` takes in `Pandas` `DataFrame`s and returns both a treatment plan and a clean `Pandas` `DataFrame` ready for modeling.\n\n\n```python\n# to install\n!pip install vtreat\n!pip install wvpy\n```\n\nLoad our packages/modules.\n\n\n```python\nimport pandas\nimport xgboost\nimport vtreat\nimport vtreat.cross_plan\nimport numpy.random\nimport wvpy.util\nimport scipy.sparse\n```\n\nRead in the explanatory variables.\n\n\n```python\n# data from https://github.com/WinVector/PDSwR2/tree/master/KDD2009\ndir = \"../../../PracticalDataScienceWithR2nd/PDSwR2/KDD2009/\"\nd = pandas.read_csv(dir + 'orange_small_train.data.gz', sep='\\t', header=0)\nvars = [c for c in d.columns]\nd.shape\n```\n\n\n\n\n    (50000, 230)\n\n\n\nRead in the dependent variable we are trying to predict.\n\n\n```python\nchurn = pandas.read_csv(dir + 'orange_small_train_churn.labels.txt', header=None)\nchurn.columns = [\"churn\"]\nchurn.shape\n```\n\n\n\n\n    (50000, 1)\n\n\n\n\n```python\nchurn[\"churn\"].value_counts()\n```\n\n\n\n\n    -1    46328\n     1     3672\n    Name: churn, dtype: int64\n\n\n\nArrange the test/train split.\n\n\n```python\nnumpy.random.seed(855885)\nn = d.shape[0]\n# https://github.com/WinVector/pyvtreat/blob/master/Examples/CustomizedCrossPlan/CustomizedCrossPlan.md\nsplit1 = vtreat.cross_plan.KWayCrossPlanYStratified().split_plan(n_rows=n, k_folds=10, y=churn.iloc[:, 0])\ntrain_idx = set(split1[0]['train'])\nis_train = [i in train_idx for i in range(n)]\nis_test = numpy.logical_not(is_train)\n```\n\n(The reported performance 
runs of this example were sensitive to the prevalence of the churn variable in the test set; we are cutting down on this source of evaluation variance by using the stratified split.)\n\n\n```python\nd_train = d.loc[is_train, :].copy()\nchurn_train = numpy.asarray(churn.loc[is_train, :][\"churn\"]==1)\nd_test = d.loc[is_test, :].copy()\nchurn_test = numpy.asarray(churn.loc[is_test, :][\"churn\"]==1)\n```\n\nTake a look at the explanatory variables.  They are a mess: many missing values, and categorical variables that cannot be used directly without some re-encoding.\n\n\n```python\nd_train.head()\n```\n\n\n\n\n\u003cdiv\u003e\n\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003eVar1\u003c/th\u003e\n      \u003cth\u003eVar2\u003c/th\u003e\n      \u003cth\u003eVar3\u003c/th\u003e\n      \u003cth\u003eVar4\u003c/th\u003e\n      \u003cth\u003eVar5\u003c/th\u003e\n      \u003cth\u003eVar6\u003c/th\u003e\n      \u003cth\u003eVar7\u003c/th\u003e\n      \u003cth\u003eVar8\u003c/th\u003e\n      \u003cth\u003eVar9\u003c/th\u003e\n      \u003cth\u003eVar10\u003c/th\u003e\n      \u003cth\u003e...\u003c/th\u003e\n      \u003cth\u003eVar221\u003c/th\u003e\n      \u003cth\u003eVar222\u003c/th\u003e\n      \u003cth\u003eVar223\u003c/th\u003e\n      \u003cth\u003eVar224\u003c/th\u003e\n      \u003cth\u003eVar225\u003c/th\u003e\n      \u003cth\u003eVar226\u003c/th\u003e\n      \u003cth\u003eVar227\u003c/th\u003e\n      \u003cth\u003eVar228\u003c/th\u003e\n      \u003cth\u003eVar229\u003c/th\u003e\n      \u003cth\u003eVar230\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e0\u003c/th\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n    
  \u003ctd\u003e1526.0\u003c/td\u003e\n      \u003ctd\u003e7.0\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003e...\u003c/td\u003e\n      \u003ctd\u003eoslk\u003c/td\u003e\n      \u003ctd\u003efXVEsaq\u003c/td\u003e\n      \u003ctd\u003ejySVZNlOJy\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003exb3V\u003c/td\u003e\n      \u003ctd\u003eRAYp\u003c/td\u003e\n      \u003ctd\u003eF2FyR07IdsN7I\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e1\u003c/th\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003e525.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003e...\u003c/td\u003e\n      \u003ctd\u003eoslk\u003c/td\u003e\n      \u003ctd\u003e2Kb5FSF\u003c/td\u003e\n      \u003ctd\u003eLM8l689qOp\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003efKCe\u003c/td\u003e\n      \u003ctd\u003eRAYp\u003c/td\u003e\n      \u003ctd\u003eF2FyR07IdsN7I\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2\u003c/th\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003e5236.0\u003c/td\u003e\n      \u003ctd\u003e7.0\u003c/td\u003e\n      
\u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003e...\u003c/td\u003e\n      \u003ctd\u003eAl6ZaUT\u003c/td\u003e\n      \u003ctd\u003eNKv4yOc\u003c/td\u003e\n      \u003ctd\u003ejySVZNlOJy\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003ekG3k\u003c/td\u003e\n      \u003ctd\u003eQu4f\u003c/td\u003e\n      \u003ctd\u003e02N6s8f\u003c/td\u003e\n      \u003ctd\u003eib5G6X1eUxUn6\u003c/td\u003e\n      \u003ctd\u003eam7c\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e3\u003c/th\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003e...\u003c/td\u003e\n      \u003ctd\u003eoslk\u003c/td\u003e\n      \u003ctd\u003eCE7uk3u\u003c/td\u003e\n      \u003ctd\u003eLM8l689qOp\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eFSa2\u003c/td\u003e\n      \u003ctd\u003eRAYp\u003c/td\u003e\n      \u003ctd\u003eF2FyR07IdsN7I\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e4\u003c/th\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003e1029.0\u003c/td\u003e\n      \u003ctd\u003e7.0\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      
\u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003e...\u003c/td\u003e\n      \u003ctd\u003eoslk\u003c/td\u003e\n      \u003ctd\u003e1J2cvxe\u003c/td\u003e\n      \u003ctd\u003eLM8l689qOp\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n      \u003ctd\u003ekG3k\u003c/td\u003e\n      \u003ctd\u003eFSa2\u003c/td\u003e\n      \u003ctd\u003eRAYp\u003c/td\u003e\n      \u003ctd\u003eF2FyR07IdsN7I\u003c/td\u003e\n      \u003ctd\u003emj86\u003c/td\u003e\n      \u003ctd\u003eNaN\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e5 rows × 230 columns\u003c/p\u003e\n\u003c/div\u003e\n\n\n\n\n```python\nd_train.shape\n```\n\n\n\n\n    (45000, 230)\n\n\n\nTry building a model directly off this data (this will fail).\n\n\n```python\nfitter = xgboost.XGBClassifier(n_estimators=10, max_depth=3, objective='binary:logistic')\ntry:\n    fitter.fit(d_train, churn_train)\nexcept Exception as ex:\n    print(ex)\n```\n\n    DataFrame.dtypes for data must be int, float or bool.\n                    Did not expect the data types in fields Var191, Var192, Var193, Var194, Var195, Var196, Var197, Var198, Var199, Var200, Var201, Var202, Var203, Var204, Var205, Var206, Var207, Var208, Var210, Var211, Var212, Var213, Var214, Var215, Var216, Var217, Var218, Var219, Var220, Var221, Var222, Var223, Var224, Var225, Var226, Var227, Var228, Var229\n\n\nLet's quickly prepare a data frame with none of these issues.\n\nWe start by building our treatment plan; it implements the `sklearn.pipeline.Pipeline` interface.\n\n\n```python\nplan = vtreat.BinomialOutcomeTreatment(outcome_target=True)\n```\n\nUse `.fit_transform()` to get a special copy of the treated training data that has cross-validated mitigations against nested model bias. 
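The principle behind these cross-validated mitigations can be sketched as out-of-fold impact coding in plain `numpy`/`pandas` (a simplified illustration; the actual `vtreat` cross plans and coders are more involved): each row's y-aware code is estimated only from folds that exclude that row.

```python
import numpy as np
import pandas as pd

# Toy data: a categorical column and a binary outcome.
d = pd.DataFrame({
    "x": ["a", "a", "b", "b", "a", "b", "a", "b", "a", "b", "a", "b"],
    "y": [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0],
})

# Out-of-fold impact coding: each row's code comes from the other folds,
# so no row's own outcome leaks into its encoding.
k = 3
folds = np.arange(len(d)) % k
grand_mean = d["y"].mean()
code = np.zeros(len(d))
for f in range(k):
    held_out = folds == f
    # per-level outcome means computed on the complementary folds
    means = d.loc[~held_out].groupby("x")["y"].mean()
    code[held_out] = (
        d.loc[held_out, "x"].map(means).fillna(grand_mean) - grand_mean
    )
d["x_impact_code"] = code
print(int(d["x_impact_code"].isna().sum()))  # 0
```

Every row gets a finite code, and unseen levels in a fold fall back to the grand mean (a code of zero).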
We call this a \"cross frame.\" `.fit_transform()` deliberately returns a different `DataFrame` than `.fit().transform()` would (the `.fit().transform()` result would damage the modeling effort due to nested model bias; the `.fit_transform()` \"cross frame\" uses cross-validation techniques similar to \"stacking\" to mitigate these issues).\n\n\n```python\ncross_frame = plan.fit_transform(d_train, churn_train)\n```\n\nTake a look at the new data.  This frame is guaranteed to be all numeric with no missing values, with the rows in the same order as the training data.\n\n\n```python\ncross_frame.head()\n```\n\n\n\n\n\u003cdiv\u003e\n\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003eVar2_is_bad\u003c/th\u003e\n      \u003cth\u003eVar3_is_bad\u003c/th\u003e\n      \u003cth\u003eVar4_is_bad\u003c/th\u003e\n      \u003cth\u003eVar5_is_bad\u003c/th\u003e\n      \u003cth\u003eVar6_is_bad\u003c/th\u003e\n      \u003cth\u003eVar7_is_bad\u003c/th\u003e\n      \u003cth\u003eVar10_is_bad\u003c/th\u003e\n      \u003cth\u003eVar11_is_bad\u003c/th\u003e\n      \u003cth\u003eVar13_is_bad\u003c/th\u003e\n      \u003cth\u003eVar14_is_bad\u003c/th\u003e\n      \u003cth\u003e...\u003c/th\u003e\n      \u003cth\u003eVar227_lev_RAYp\u003c/th\u003e\n      \u003cth\u003eVar227_lev_ZI9m\u003c/th\u003e\n      \u003cth\u003eVar228_logit_code\u003c/th\u003e\n      \u003cth\u003eVar228_prevalence_code\u003c/th\u003e\n      \u003cth\u003eVar228_lev_F2FyR07IdsN7I\u003c/th\u003e\n      \u003cth\u003eVar229_logit_code\u003c/th\u003e\n      \u003cth\u003eVar229_prevalence_code\u003c/th\u003e\n      \u003cth\u003eVar229_lev__NA_\u003c/th\u003e\n      \u003cth\u003eVar229_lev_am7c\u003c/th\u003e\n      \u003cth\u003eVar229_lev_mj86\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e0\u003c/th\u003e\n  
    \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e...\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003e0.151682\u003c/td\u003e\n      \u003ctd\u003e0.653733\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e0.172744\u003c/td\u003e\n      \u003ctd\u003e0.567422\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e1\u003c/th\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e...\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003e0.146119\u003c/td\u003e\n      \u003ctd\u003e0.653733\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e0.175707\u003c/td\u003e\n      \u003ctd\u003e0.567422\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2\u003c/th\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n    
  \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e...\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003e-0.629820\u003c/td\u003e\n      \u003ctd\u003e0.053956\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003e-0.263504\u003c/td\u003e\n      \u003ctd\u003e0.234400\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e3\u003c/th\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e...\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003e0.145871\u003c/td\u003e\n      \u003ctd\u003e0.653733\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e0.159486\u003c/td\u003e\n      \u003ctd\u003e0.567422\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e4\u003c/th\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n    
  \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e...\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003e0.147432\u003c/td\u003e\n      \u003ctd\u003e0.653733\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n      \u003ctd\u003e-0.286852\u003c/td\u003e\n      \u003ctd\u003e0.196600\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003e0.0\u003c/td\u003e\n      \u003ctd\u003e1.0\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e5 rows × 216 columns\u003c/p\u003e\n\u003c/div\u003e\n\n\n\n\n```python\ncross_frame.shape\n```\n\n\n\n\n    (45000, 216)\n\n\n\nPick a recommended subset of the new derived variables.\n\n\n```python\nplan.score_frame_.head()\n```\n\n\n\n\n\u003cdiv\u003e\n\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003evariable\u003c/th\u003e\n      \u003cth\u003eorig_variable\u003c/th\u003e\n      \u003cth\u003etreatment\u003c/th\u003e\n      \u003cth\u003ey_aware\u003c/th\u003e\n      \u003cth\u003ehas_range\u003c/th\u003e\n      \u003cth\u003ePearsonR\u003c/th\u003e\n      \u003cth\u003esignificance\u003c/th\u003e\n      \u003cth\u003evcount\u003c/th\u003e\n      \u003cth\u003edefault_threshold\u003c/th\u003e\n      \u003cth\u003erecommended\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e0\u003c/th\u003e\n      \u003ctd\u003eVar1_is_bad\u003c/td\u003e\n      \u003ctd\u003eVar1\u003c/td\u003e\n      \u003ctd\u003emissing_indicator\u003c/td\u003e\n      \u003ctd\u003eFalse\u003c/td\u003e\n      \u003ctd\u003eTrue\u003c/td\u003e\n      \u003ctd\u003e0.003283\u003c/td\u003e\n      
\u003ctd\u003e0.486212\u003c/td\u003e\n      \u003ctd\u003e193.0\u003c/td\u003e\n      \u003ctd\u003e0.001036\u003c/td\u003e\n      \u003ctd\u003eFalse\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e1\u003c/th\u003e\n      \u003ctd\u003eVar2_is_bad\u003c/td\u003e\n      \u003ctd\u003eVar2\u003c/td\u003e\n      \u003ctd\u003emissing_indicator\u003c/td\u003e\n      \u003ctd\u003eFalse\u003c/td\u003e\n      \u003ctd\u003eTrue\u003c/td\u003e\n      \u003ctd\u003e0.019270\u003c/td\u003e\n      \u003ctd\u003e0.000044\u003c/td\u003e\n      \u003ctd\u003e193.0\u003c/td\u003e\n      \u003ctd\u003e0.001036\u003c/td\u003e\n      \u003ctd\u003eTrue\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2\u003c/th\u003e\n      \u003ctd\u003eVar3_is_bad\u003c/td\u003e\n      \u003ctd\u003eVar3\u003c/td\u003e\n      \u003ctd\u003emissing_indicator\u003c/td\u003e\n      \u003ctd\u003eFalse\u003c/td\u003e\n      \u003ctd\u003eTrue\u003c/td\u003e\n      \u003ctd\u003e0.019238\u003c/td\u003e\n      \u003ctd\u003e0.000045\u003c/td\u003e\n      \u003ctd\u003e193.0\u003c/td\u003e\n      \u003ctd\u003e0.001036\u003c/td\u003e\n      \u003ctd\u003eTrue\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e3\u003c/th\u003e\n      \u003ctd\u003eVar4_is_bad\u003c/td\u003e\n      \u003ctd\u003eVar4\u003c/td\u003e\n      \u003ctd\u003emissing_indicator\u003c/td\u003e\n      \u003ctd\u003eFalse\u003c/td\u003e\n      \u003ctd\u003eTrue\u003c/td\u003e\n      \u003ctd\u003e0.018744\u003c/td\u003e\n      \u003ctd\u003e0.000070\u003c/td\u003e\n      \u003ctd\u003e193.0\u003c/td\u003e\n      \u003ctd\u003e0.001036\u003c/td\u003e\n      \u003ctd\u003eTrue\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e4\u003c/th\u003e\n      \u003ctd\u003eVar5_is_bad\u003c/td\u003e\n      \u003ctd\u003eVar5\u003c/td\u003e\n      \u003ctd\u003emissing_indicator\u003c/td\u003e\n      
\u003ctd\u003eFalse\u003c/td\u003e\n      \u003ctd\u003eTrue\u003c/td\u003e\n      \u003ctd\u003e0.017575\u003c/td\u003e\n      \u003ctd\u003e0.000193\u003c/td\u003e\n      \u003ctd\u003e193.0\u003c/td\u003e\n      \u003ctd\u003e0.001036\u003c/td\u003e\n      \u003ctd\u003eTrue\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\n\n\n\n```python\nmodel_vars = numpy.asarray(plan.score_frame_[\"variable\"][plan.score_frame_[\"recommended\"]])\nlen(model_vars)\n```\n\n\n\n\n    216\n\n\n\nFit the model\n\n\n```python\ncross_frame.dtypes\n```\n\n\n\n\n    Var2_is_bad                            float64\n    Var3_is_bad                            float64\n    Var4_is_bad                            float64\n    Var5_is_bad                            float64\n    Var6_is_bad                            float64\n                                      ...         \n    Var229_logit_code                      float64\n    Var229_prevalence_code                 float64\n    Var229_lev__NA_           Sparse[float64, 0.0]\n    Var229_lev_am7c           Sparse[float64, 0.0]\n    Var229_lev_mj86           Sparse[float64, 0.0]\n    Length: 216, dtype: object\n\n\n\n\n```python\n# fails due to sparse columns\n# can also work around this by setting the vtreat parameter 'sparse_indicators' to False\ntry:\n    cross_sparse = xgboost.DMatrix(data=cross_frame.loc[:, model_vars], label=churn_train)\nexcept Exception as ex:\n    print(ex)\n```\n\n    DataFrame.dtypes for data must be int, float or bool.\n                    Did not expect the data types in fields Var193_lev_RO12, Var193_lev_2Knk1KF, Var194_lev__NA_, Var194_lev_SEuy, Var195_lev_taul, Var200_lev__NA_, Var201_lev__NA_, Var201_lev_smXZ, Var205_lev_VpdQ, Var206_lev_IYzP, Var206_lev_zm5i, Var206_lev__NA_, Var207_lev_me75fM6ugJ, Var207_lev_7M47J5GA0pTYIFxg5uy, Var210_lev_uKAI, Var211_lev_L84s, Var211_lev_Mtgm, Var212_lev_NhsEn4L, Var212_lev_XfqtO3UdzaXh_, Var213_lev__NA_, 
    Var214_lev__NA_, Var218_lev_cJvF, Var218_lev_UYBR, Var221_lev_oslk, Var221_lev_zCkv, Var225_lev__NA_, Var225_lev_ELof, Var225_lev_kG3k, Var226_lev_FSa2, Var227_lev_RAYp, Var227_lev_ZI9m, Var228_lev_F2FyR07IdsN7I, Var229_lev__NA_, Var229_lev_am7c, Var229_lev_mj86


```python
# also fails
try:
    cross_sparse = scipy.sparse.csc_matrix(cross_frame[model_vars])
except Exception as ex:
    print(ex)
```

    no supported conversion for types: (dtype('O'),)


```python
# works
cross_sparse = scipy.sparse.hstack([scipy.sparse.csc_matrix(cross_frame[[vi]]) for vi in model_vars])
```


```python
# https://xgboost.readthedocs.io/en/latest/python/python_intro.html
fd = xgboost.DMatrix(
    data=cross_sparse,
    label=churn_train)
```


```python
x_parameters = {"max_depth": 3, "objective": 'binary:logistic'}
cv = xgboost.cv(x_parameters, fd, num_boost_round=100, verbose_eval=False)
```


```python
cv.head()
```

<div>

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>train-error-mean</th>
      <th>train-error-std</th>
      <th>test-error-mean</th>
      <th>test-error-std</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>0.073378</td>
      <td>0.000322</td>
      <td>0.073733</td>
      <td>0.000669</td>
    </tr>
    <tr>
      <th>1</th>
      <td>0.073411</td>
      <td>0.000257</td>
      <td>0.073511</td>
      <td>0.000529</td>
    </tr>
    <tr>
      <th>2</th>
      <td>0.073433</td>
      <td>0.000268</td>
      <td>0.073578</td>
      <td>0.000514</td>
    </tr>
    <tr>
      <th>3</th>
      <td>0.073444</td>
      <td>0.000283</td>
      <td>0.073533</td>
      <td>0.000525</td>
    </tr>
    <tr>
      <th>4</th>
      <td>0.073444</td>
      <td>0.000283</td>
      <td>0.073533</td>
      <td>0.000525</td>
    </tr>
  </tbody>
</table>
</div>


```python
best = cv.loc[cv["test-error-mean"] <= min(cv["test-error-mean"] + 1.0e-9), :]
best
```

<div>

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>train-error-mean</th>
      <th>train-error-std</th>
      <th>test-error-mean</th>
      <th>test-error-std</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>21</th>
      <td>0.072756</td>
      <td>0.000177</td>
      <td>0.073267</td>
      <td>0.000327</td>
    </tr>
  </tbody>
</table>
</div>


```python
ntree = best.index.values[0]
ntree
```

    21


```python
fitter = xgboost.XGBClassifier(n_estimators=ntree, max_depth=3, objective='binary:logistic')
fitter
```

    XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                  colsample_bynode=1, colsample_bytree=1, gamma=0,
                  learning_rate=0.1, max_delta_step=0, max_depth=3,
                  min_child_weight=1, missing=None, n_estimators=21, n_jobs=1,
                  nthread=None, objective='binary:logistic', random_state=0,
                  reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
                  silent=None, subsample=1, verbosity=1)


```python
model = fitter.fit(cross_sparse, churn_train)
```

Apply the data transform to our held-out data.


```python
test_processed = plan.transform(d_test)
```

Plot the quality of the model on training data (a biased measure of performance).


```python
pf_train = pandas.DataFrame({"churn": churn_train})
pf_train["pred"] = model.predict_proba(cross_sparse)[:, 1]
wvpy.util.plot_roc(pf_train["pred"], pf_train["churn"], title="Model on Train")
```

![png](https://github.com/WinVector/pyvtreat/raw/master/Examples/KDD2009Example/output_44_0.png)

    0.7424056263753072

Plot the quality of the model score on the held-out data.  This AUC is not great, but in the ballpark of the original contest winners.


```python
test_sparse = scipy.sparse.hstack([scipy.sparse.csc_matrix(test_processed[[vi]]) for vi in model_vars])
pf = pandas.DataFrame({"churn": churn_test})
pf["pred"] = model.predict_proba(test_sparse)[:, 1]
wvpy.util.plot_roc(pf["pred"], pf["churn"], title="Model on Test")
```

![png](https://github.com/WinVector/pyvtreat/raw/master/Examples/KDD2009Example/output_46_0.png)

    0.7328696191869485

Notice we dealt with many problem columns at once, and in a statistically sound manner. More on the `vtreat` package for Python can be found here: [https://github.com/WinVector/pyvtreat](https://github.com/WinVector/pyvtreat).
Details on the `R` version can be found here: [https://github.com/WinVector/vtreat](https://github.com/WinVector/vtreat).

We can compare this to the [R solution (link)](https://github.com/WinVector/PDSwR2/blob/master/KDD2009/KDD2009vtreat.md).

We can compare the above cross-frame solution to a naive "design transform and model on the same data set" solution as we show below.  Note we turn off `filter_to_recommended`, as that is computed using cross-frame techniques (and hence is a non-naive estimate).


```python
plan_naive = vtreat.BinomialOutcomeTreatment(
    outcome_target=True,
    params=vtreat.vtreat_parameters({'filter_to_recommended': False}))
plan_naive.fit(d_train, churn_train)
naive_frame = plan_naive.transform(d_train)
```


```python
naive_sparse = scipy.sparse.hstack([scipy.sparse.csc_matrix(naive_frame[[vi]]) for vi in model_vars])
```


```python
fd_naive = xgboost.DMatrix(data=naive_sparse, label=churn_train)
x_parameters = {"max_depth": 3, "objective": 'binary:logistic'}
cvn = xgboost.cv(x_parameters, fd_naive, num_boost_round=100, verbose_eval=False)
```


```python
bestn = cvn.loc[cvn["test-error-mean"] <= min(cvn["test-error-mean"] + 1.0e-9), :]
bestn
```

<div>

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>train-error-mean</th>
      <th>train-error-std</th>
      <th>test-error-mean</th>
      <th>test-error-std</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>94</th>
      <td>0.0485</td>
      <td>0.000438</td>
      <td>0.058622</td>
      <td>0.000545</td>
    </tr>
  </tbody>
</table>
</div>


```python
ntreen = bestn.index.values[0]
ntreen
```

    94


```python
fittern = xgboost.XGBClassifier(n_estimators=ntreen, max_depth=3, objective='binary:logistic')
fittern
```

    XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                  colsample_bynode=1, colsample_bytree=1, gamma=0,
                  learning_rate=0.1, max_delta_step=0, max_depth=3,
                  min_child_weight=1, missing=None, n_estimators=94, n_jobs=1,
                  nthread=None, objective='binary:logistic', random_state=0,
                  reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
                  silent=None, subsample=1, verbosity=1)


```python
modeln = fittern.fit(naive_sparse, churn_train)
```


```python
test_processedn = plan_naive.transform(d_test)
test_processedn = scipy.sparse.hstack([scipy.sparse.csc_matrix(test_processedn[[vi]]) for vi in model_vars])
```


```python
pfn_train = pandas.DataFrame({"churn": churn_train})
pfn_train["pred_naive"] = modeln.predict_proba(naive_sparse)[:, 1]
wvpy.util.plot_roc(pfn_train["pred_naive"], pfn_train["churn"], title="Overfit Model on Train")
```

![png](https://github.com/WinVector/pyvtreat/raw/master/Examples/KDD2009Example/output_58_0.png)

    0.9492686875296688


```python
pfn = pandas.DataFrame({"churn": churn_test})
pfn["pred_naive"] = modeln.predict_proba(test_processedn)[:, 1]
wvpy.util.plot_roc(pfn["pred_naive"], pfn["churn"], title="Overfit Model on Test")
```

![png](https://github.com/WinVector/pyvtreat/raw/master/Examples/KDD2009Example/output_59_0.png)

    0.5960012412998182

Note the naive test performance is worse, despite its far better training performance.
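The gap above (strong apparent training fit, weak test fit) can be reproduced in miniature. The following is a hypothetical, self-contained sketch (the data, seed, and sizes are invented for illustration, and the encoder is a deliberately naive stand-in, not `vtreat`'s estimator): a pure-noise categorical, impact-coded using the same rows it is then evaluated on, looks strongly related to the outcome.

```python
import numpy
import pandas

# hypothetical data: a 50-level categorical that is pure noise w.r.t. y
rng = numpy.random.default_rng(2009)
n = 200
d = pandas.DataFrame({
    "x": rng.choice(["lev_" + str(i) for i in range(50)], size=n),
    "y": rng.choice([0, 1], size=n),
})

# naive impact code: per-level mean of y, computed on the SAME rows
naive_code = d.groupby("x")["y"].transform("mean")

# the encoded noise variable now looks predictive on the training data
train_corr = numpy.corrcoef(naive_code, d["y"])[0, 1]
print(train_corr)  # substantially above zero, despite x being noise
```

On genuinely held-out rows the same encoding would show no such relation; computing the codes on out-of-fold data, as the cross-frame procedure does, is exactly the mitigation for this bias.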
This is over-fit due to the nested model bias of using the same data to build the treatment plan and the model, without any cross-frame mitigation.


## Solution Details

Some `vtreat` data treatments are “y-aware” (they use distributional relations between
the independent variables and the dependent variable).

The purpose of the `vtreat` library is to reliably prepare data for
supervised machine learning. We try to leave as much as possible to the
machine learning algorithms themselves, while covering the truly
necessary, and typically ignored, precautions. The library is designed to
produce a `DataFrame` that is entirely numeric, and takes common
precautions to guard against the following real-world data issues:

  - Categorical variables with very many levels.
    
    We re-encode such variables as a family of indicator or dummy
    variables for common levels, plus an additional [impact
    code](http://www.win-vector.com/blog/2012/07/modeling-trick-impact-coding-of-categorical-variables-with-many-levels/)
    (also called an “effects code”). This allows principled use (including
    smoothing) of huge categorical variables (like zip codes) when
    building models. This is critical for some libraries (such as
    `randomForest`, which has hard limits on the number of allowed
    levels).

  - Rare categorical levels.
    
    Levels that do not occur often during training tend not to have
    reliable effect estimates, and contribute to over-fit.

  - Novel categorical levels.
    
    A common problem in deploying a classifier to production is new
    levels (levels not seen during training) encountered during model
    application. We deal with this by encoding categorical variables in
    a possibly redundant manner: reserving a dummy variable for all
    levels (not the more common all-but-one-reference-level scheme). 
This
    is in fact the correct representation for regularized modeling
    techniques, and lets us code novel levels as all dummies
    simultaneously zero (which is a reasonable thing to try). This
    encoding, while limited, is cheaper than the fully Bayesian solution
    of computing a weighted sum over previously seen levels during model
    application.

  - Missing/invalid values (NA, NaN, +/-Inf).
    
    Variables with these issues are re-coded as two columns. The first
    column is a clean copy of the variable (with missing/invalid values
    replaced with either zero or the grand mean, depending on the user's
    choice of the `scale` parameter). The second column is a dummy or
    indicator that marks whether the replacement was performed. This is
    simpler than imputation of missing values, and allows the downstream
    model to attempt to use missingness as a useful signal (which it
    often is in industrial data).

The above are all awful things that often lurk in real-world data.
Automating the mitigation steps makes them easy enough that you actually
perform them, and leaves the analyst time to look for additional data
issues. For example, this allowed us to essentially automate a number of
the steps taught in chapters 4 and 6 of [*Practical Data Science with R*
(Zumel, Mount; Manning 2014)](http://practicaldatascience.com/) into a
[very short
worksheet](https://github.com/WinVector/pyvtreat/blob/master/Examples/KDD2009Example/KDD2009Example.md) (though we
think that, for understanding, it is *essential* to work all the steps by hand,
as we did in the book).  The 2nd edition of *Practical Data Science with R* covers
using `vtreat` in `R` in chapter 8, "Advanced Data Preparation."

The idea is: `DataFrame`s prepared with the
`vtreat` library are somewhat safe to train on, as some precaution has
been taken against all of the above issues. 
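Returning to the impact coding mentioned above, here is a minimal sketch of the idea (this is *not* `vtreat`'s actual estimator; the shrinkage scheme, function name, and toy data are invented for illustration): each level is coded by a shrunken deviation of its observed outcome mean from the grand mean, and unseen levels map to zero.

```python
import pandas

def impact_code(levels, y, smoothing=10.0):
    # toy shrinkage estimator: deviation of the level mean from the grand
    # mean, shrunk toward zero for rare levels (not vtreat's estimator)
    grand_mean = y.mean()
    stats = y.groupby(levels).agg(["mean", "count"])
    shrink = stats["count"] / (stats["count"] + smoothing)
    return (shrink * (stats["mean"] - grand_mean)).to_dict()

levels = pandas.Series(["a", "a", "b", "b", "b", "c"])
y = pandas.Series([1.0, 0.0, 1.0, 1.0, 1.0, 0.0])
codes = impact_code(levels, y, smoothing=1.0)

# novel levels (not seen during training) code to 0
encoded = pandas.Series(["b", "a", "never_seen"]).map(codes).fillna(0.0)
```

The single numeric `encoded` column can then stand in for an arbitrarily large categorical when fitting a model.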
Also of interest are the
`vtreat` variable significances (which help with initial variable pruning, a
necessity when there are a large number of columns) and
`vtreat::prepare(scale=TRUE)`, which re-encodes all variables into
effect units, making them suitable for y-aware dimension reduction
(variable clustering, or principal component analysis) and for geometry-sensitive
machine learning techniques (k-means, knn, linear SVM, and
more). You may want to do more than the `vtreat` library does (such as
Bayesian imputation, variable clustering, and more), but you certainly do
not want to do less.

## References

Some of our related articles (which should make clear some of our
motivations and design decisions):

  - [The `vtreat` technical paper](https://arxiv.org/abs/1611.09477)
  - [Modeling trick: impact coding of categorical variables with many
    levels](http://www.win-vector.com/blog/2012/07/modeling-trick-impact-coding-of-categorical-variables-with-many-levels/)
  - [A bit more on impact
    coding](http://www.win-vector.com/blog/2012/08/a-bit-more-on-impact-coding/)
  - [vtreat: designing a package for variable
    treatment](http://www.win-vector.com/blog/2014/08/vtreat-designing-a-package-for-variable-treatment/)
  - [A comment on preparing data for
    classifiers](http://www.win-vector.com/blog/2014/12/a-comment-on-preparing-data-for-classifiers/)
  - [Nina Zumel presenting on
    vtreat](http://www.slideshare.net/ChesterChen/vtreat)

A directory of worked examples can be found [here](https://github.com/WinVector/pyvtreat/tree/master/Examples).

We intend to add better Python documentation and a certification suite going forward.

## Installation

To install, please run:

```bash
pip install vtreat
```

Some notes on controlling `vtreat` cross-validation can be found [here](https://github.com/WinVector/pyvtreat/blob/master/Examples/CustomizedCrossPlan/CustomizedCrossPlan.md).

## Note on data types

`.fit_transform()` expects its first argument to be a `pandas.DataFrame` with trivial row-indexing and scalar column names (i.e. as after `.reset_index(inplace=True, drop=True)`), and its second argument to be a vector-like object with a `len()` equal to the number of rows of the first argument. We are working on supporting column types other than string and numeric.
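The expected input shape can be sketched as follows (the small frame here is invented for illustration):

```python
import pandas

# hypothetical frame with a non-trivial row index
d = pandas.DataFrame({"x": [1.0, None, 3.0]}, index=[10, 20, 30])
d.reset_index(inplace=True, drop=True)   # trivial 0..n-1 row indexing
d.columns = [str(c) for c in d.columns]  # scalar (string) column names

y = [True, False, True]                  # len(y) == number of rows of d
assert len(y) == d.shape[0]
```

A frame prepared this way matches what `.fit_transform()` expects for its first argument, with `y` as the second.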