{"id":44336658,"url":"https://github.com/adimajo/glmdisc_python","last_synced_at":"2026-02-11T11:36:31.415Z","repository":{"id":57434963,"uuid":"122314964","full_name":"adimajo/glmdisc_python","owner":"adimajo","description":"glmdisc Python package: discretization, factor level grouping, interaction discovery for logistic regression","archived":false,"fork":false,"pushed_at":"2023-11-28T20:53:09.000Z","size":6209,"stargazers_count":6,"open_issues_count":4,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-10-27T09:41:44.471Z","etag":null,"topics":["categorical-features","discretization","gibbs-sampler","interactions","logistic-regression"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/adimajo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-02-21T09:19:04.000Z","updated_at":"2022-05-26T09:58:28.000Z","dependencies_parsed_at":"2022-09-04T15:32:53.713Z","dependency_job_id":null,"html_url":"https://github.com/adimajo/glmdisc_python","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/adimajo/glmdisc_python","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adimajo%2Fglmdisc_python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adimajo%2Fglmdisc_python/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adimajo%2Fglmdisc_python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adimajo%2Fglmdisc_python/manifests","owner_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/owners/adimajo","download_url":"https://codeload.github.com/adimajo/glmdisc_python/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adimajo%2Fglmdisc_python/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29332641,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-11T06:13:03.264Z","status":"ssl_error","status_checked_at":"2026-02-11T06:12:55.843Z","response_time":97,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["categorical-features","discretization","gibbs-sampler","interactions","logistic-regression"],"created_at":"2026-02-11T11:36:30.657Z","updated_at":"2026-02-11T11:36:31.405Z","avatar_url":"https://github.com/adimajo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![PyPI version](https://badge.fury.io/py/glmdisc.svg)](https://badge.fury.io/py/glmdisc)\n[![PyPI pyversions](https://img.shields.io/pypi/pyversions/glmdisc.svg)](https://pypi.python.org/pypi/glmdisc/)\n[![PyPi Downloads](https://img.shields.io/pypi/dm/glmdisc)](https://img.shields.io/pypi/dm/glmdisc)\n[![Build Status](https://travis-ci.org/adimajo/glmdisc_python.svg?branch=master)](https://travis-ci.org/adimajo/glmdisc_python)\n![Python 
package](https://github.com/adimajo/glmdisc_python/workflows/Python%20package/badge.svg)\n[![codecov](https://codecov.io/gh/adimajo/glmdisc_python/branch/master/graph/badge.svg)](https://codecov.io/gh/adimajo/glmdisc_python)\n\n# Feature quantization for parsimonious and interpretable models\n\nTable of Contents\n-----------------\n\n* [Documentation](https://adimajo.github.io/glmdisc_python)\n* [Installation instructions](#-installing-the-package)\n* [Theory](#-use-case-example)\n* [Some examples](#-the-glmdisc-package)\n* [Open an issue](https://github.com/adimajo/glmdisc_python/issues/new/choose)\n* [References](#-references)\n* [Contribute](#-contribute)\n\n## Motivation\n\nCredit institutions are interested in the repayment probability of a loan given the applicant’s characteristics in order to assess the worthiness of the credit. For regulatory and interpretability reasons, logistic regression is still widely used to learn this probability from the data. Although logistic regression naturally handles both quantitative and qualitative data, three pre-processing steps are usually performed: firstly, continuous features are discretized by assigning factor levels to pre-determined intervals; secondly, qualitative features, if they take numerous values, are grouped; thirdly, interactions (products between two different predictors) are sparsely introduced. By reinterpreting discretized (resp. grouped) features as latent variables, we are able, through the use of a Stochastic Expectation-Maximization (SEM) algorithm and a Gibbs sampler, to find the best discretization (resp. grouping) scheme w.r.t. the logistic regression loss. For detecting interacting features, the same scheme is used, replacing the Gibbs sampler with a Metropolis-Hastings algorithm. The good performance of this approach is illustrated on simulated and real data from Credit Agricole Consumer Finance.\n\nThis repository is the implementation of Ehrhardt Adrien, et al. 
[Feature quantization for parsimonious and interpretable predictive models](https://arxiv.org/abs/1903.08920), preprint arXiv:1903.08920 (2019).\n\nNOTE: for now, only \"glmdisc-SEM\" is available.\n\n## Getting started\n\nThese instructions will get you a copy of the project up and running on your local machine for development and testing purposes.\n\n### Prerequisites\n\nThis code is supported on Python 3.7, 3.8, 3.9 and 3.10 (see [tox file](tox.ini)).\n\n### Installing the package\n\n#### Installing the development version\n\nIf `git` is installed on your machine, you can use:\n\n```PowerShell\npip install git+https://github.com/adimajo/glmdisc_python.git\n```\n\nIf `git` is not installed, you can also use:\n\n```PowerShell\npip install --upgrade https://github.com/adimajo/glmdisc_python/archive/master.tar.gz\n```\n\n#### Installing through the `pip` command\n\nYou can install a stable version from [PyPi](https://pypi.org/project/glmdisc/) by using:\n\n```PowerShell\npip install glmdisc\n```\n\n#### Installation guide for Anaconda\n\nThe installation with the `pip` command **should** work. If not, please raise an issue.\n\n#### For people behind proxy(ies)...\n\nA lot of people, including myself, work behind a proxy at work...\n\nA simple solution to get the package is to use the `--proxy` option of `pip`:\n\n```PowerShell\npip --proxy=http://username:password@server:port install glmdisc\n```\n\nwhere *username*, *password*, *server* and *port* should be replaced by your own values.\n\nIf environment variables `http_proxy` and / or `https_proxy` and / or (unfortunately depending on applications...) 
\n`HTTP_PROXY` and `HTTPS_PROXY` are set, the proxy settings should be picked up by `pip`.\n\nOver the years, I've found [CNTLM](http://cntlm.sourceforge.net/) to be a great tool in this regard.\n\n**What follows is a quick introduction to the problem of discretization and how this package answers the question.**\n\n\u003c!--**If you wish to see the package in action, please refer to the accompanying Jupyter Notebook.**--\u003e\n\n\u003c!--**If you seek specific assistance regarding the package or one of its function, please refer to the ReadTheDocs.**--\u003e\n\n## Use case example\n\nFor a thorough explanation of the approach, see [this blog post](https://adimajo.github.io/discretization) or [this article](https://arxiv.org/abs/1903.08920).\n\nIf you're interested in directly using the package, you can skip this part and go to [this part below](#-the-glmdisc-package).\n\nIn practice, the statistical modeler has historical data about each customer's characteristics. For obvious reasons, only data available at the time of inquiry must be used to build a future application scorecard. Those data often take the form of a well-structured table with one line per client alongside their performance (did they pay back their loan or not?) as can be seen in the following table:\n\n| Job | Habitation | Time in job | Children | Family status | Default |\n| --- | --- | --- | --- | --- | --- |\n| Craftsman | Owner | 10 | 0 | Divorced |  No |\n| Technician | Renter | **Missing** | 1 | Widower | No |\n| **Missing** | Starter | 5 | 2 | Single |  Yes |\n| Office employee | By family | 2 | 3 | Married | No |\n\n## Notations\n\nIn the rest of the vignette, the random vector \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;X=(X_j)_1^d\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;X=(X_j)_1^d\" title=\"X=(X_j)_1^d\" /\u003e\u003c/a\u003e  will designate the predictive features, i.e. 
the characteristics of a client. The random variable \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;Y\u0026space;\\in\u0026space;\\{0,1\\}\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;Y\u0026space;\\in\u0026space;\\{0,1\\}\" title=\"Y \\in \\{0,1\\}\" /\u003e\u003c/a\u003e  will designate the label, i.e. if the client has defaulted (\u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;Y=1\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;Y=1\" title=\"Y=1\" /\u003e\u003c/a\u003e) or not (\u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;Y=0\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;Y=0\" title=\"Y=0\" /\u003e\u003c/a\u003e).\n\nWe are provided with an i.i.d. sample \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;(\\mathbf{x},\\mathbf{y})\u0026space;=\u0026space;(x_i,y_i)_1^n\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;(\\mathbf{x},\\mathbf{y})\u0026space;=\u0026space;(x_i,y_i)_1^n\" title=\"(\\mathbf{x},\\mathbf{y}) = (x_i,y_i)_1^n\" /\u003e\u003c/a\u003e consisting in \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;n\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;n\" title=\"n\" /\u003e\u003c/a\u003e observations of \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;X\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;X\" title=\"X\" /\u003e\u003c/a\u003e and \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;Y\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;Y\" title=\"Y\" /\u003e\u003c/a\u003e.\n\n## Logistic 
regression\n\nThe logistic regression model assumes the following relation between \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;X\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;X\" title=\"X\" /\u003e\u003c/a\u003e and \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;Y\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;Y\" title=\"Y\" /\u003e\u003c/a\u003e :\n\n\u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\ln\u0026space;\\left(\u0026space;\\frac{p_\\theta(Y=1|x)}{p_\\theta(Y=0|x)}\u0026space;\\right)\u0026space;=\u0026space;\\theta_0\u0026space;\u0026plus;\u0026space;\\sum_{j\u0026space;\\text{\u0026space;if\u0026space;}\u0026space;X_j\u0026space;\\text{\u0026space;continuous}}\u0026space;\\theta_j\u0026space;x_j\u0026space;\u0026plus;\u0026space;\\sum_{j\u0026space;\\text{\u0026space;if\u0026space;}\u0026space;X_j\u0026space;\\text{\u0026space;categorical}}\u0026space;\\theta_j^{x_j}\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\ln\u0026space;\\left(\u0026space;\\frac{p_\\theta(Y=1|x)}{p_\\theta(Y=0|x)}\u0026space;\\right)\u0026space;=\u0026space;\\theta_0\u0026space;\u0026plus;\u0026space;\\sum_{j\u0026space;\\text{\u0026space;if\u0026space;}\u0026space;X_j\u0026space;\\text{\u0026space;continuous}}\u0026space;\\theta_j\u0026space;x_j\u0026space;\u0026plus;\u0026space;\\sum_{j\u0026space;\\text{\u0026space;if\u0026space;}\u0026space;X_j\u0026space;\\text{\u0026space;categorical}}\u0026space;\\theta_j^{x_j}\" title=\"\\ln \\left( \\frac{p_\\theta(Y=1|x)}{p_\\theta(Y=0|x)} \\right) = \\theta_0 + \\sum_{j \\text{ if } X_j \\text{ continuous}} \\theta_j x_j + \\sum_{j \\text{ if } X_j \\text{ categorical}} \\theta_j^{x_j}\" /\u003e\u003c/a\u003e\n\nwhere \u003ca 
href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;\\theta\u0026space;=\u0026space;(\\theta_j)_0^d\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;\\theta\u0026space;=\u0026space;(\\theta_j)_0^d\" title=\"\\theta = (\\theta_j)_0^d\" /\u003e\u003c/a\u003e are estimated using \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;(\\mathbf{x},\\mathbf{y})\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;(\\mathbf{x},\\mathbf{y})\" title=\"(\\mathbf{x},\\mathbf{y})\" /\u003e\u003c/a\u003e (and \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\theta_j^h,\u0026space;1\u0026space;\\leq\u0026space;h\u0026space;\\leq\u0026space;l_j\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\theta_j^h,\u0026space;1\u0026space;\\leq\u0026space;h\u0026space;\\leq\u0026space;l_j\" title=\"\\theta_j^h, 1 \\leq h \\leq l_j\" /\u003e\u003c/a\u003e denotes the coefficients associated with a categorical feature \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=x_j\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?x_j\" title=\"x_j\" /\u003e\u003c/a\u003e being equal to \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=h\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?h\" title=\"h\" /\u003e\u003c/a\u003e).\n\nClearly, for continuous features, the model assumes linearity of the logit transform of the response \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;Y\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;Y\" title=\"Y\" /\u003e\u003c/a\u003e with respect to \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;X\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;X\" title=\"X\" 
/\u003e\u003c/a\u003e.\nOn the contrary, for categorical features, it might overfit if there are lots of levels (\u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=l_j\u0026space;\u003e\u003e\u0026space;1\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?l_j\u0026space;\u003e\u003e\u0026space;1\" title=\"l_j \u003e\u003e 1\" /\u003e\u003c/a\u003e). It does not handle missing values. \n\n## Common problems with logistic regression on \"raw\" data\n\nFitting a logistic regression model on \"raw\" data presents several problems, among which some are tackled here.\n\n### Feature selection\n\nFirst, among all collected information on individuals, some are irrelevant for predicting \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;Y\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;Y\" title=\"Y\" /\u003e\u003c/a\u003e. Their coefficient \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;\\theta_j\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;\\theta_j\" title=\"\\theta_j\" /\u003e\u003c/a\u003e should be 0  which might (eventually) be the case asymptotically (i.e. 
\u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;n\u0026space;\\rightarrow\u0026space;\\infty\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;n\u0026space;\\rightarrow\u0026space;\\infty\" title=\"n \\rightarrow \\infty\" /\u003e\u003c/a\u003e).\n\nSecond, some of the collected features are highly correlated and affect each other's coefficient estimates.\n\nAs a consequence, data scientists often perform feature selection before training a machine learning algorithm such as logistic regression.\n\nMethods and packages for feature selection already exist; see for example the `feature_selection` submodule of the `sklearn` package.\n\n`glmdisc` is not a feature selection tool, but it acts as one as a side effect: when a continuous feature is discretized into a single interval, or when a categorical feature is grouped into a single value, that feature drops out of the model.\n\nFor a thorough reference on feature selection, see e.g. Guyon, I., \u0026 Elisseeff, A. (2003). [An introduction to variable and feature selection](http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf). *Journal of machine learning research, 3*(Mar), 1157-1182.\n\n### Linearity\n\nWhen provided with continuous features, the logistic regression model assumes linearity of the logit transform of the response \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;Y\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;Y\" title=\"Y\" /\u003e\u003c/a\u003e with respect to \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;X\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;X\" title=\"X\" /\u003e\u003c/a\u003e. 
This might not be the case at all.\n\nFor example, we can simulate a logistic model with an arbitrary power of \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;X\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;X\" title=\"X\" /\u003e\u003c/a\u003e and then try to fit a linear logistic model:\n\n```python\n\n\n```\n- [ ] Show the Python code\n\n- [ ] Get this graph online\n\nOf course, providing the `sklearn.linear_model.LogisticRegression` function with a dataset containing \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;X^5\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;X^5\" title=\"X^5\" /\u003e\u003c/a\u003e would solve the problem. This can't be done in practice for two reasons: first, it is too time-consuming to examine all features and candidate polynomials; second, we lose the interpretability of the logistic decision function which was of primary interest.\n\nConsequently, we wish to discretize the input variable \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;X\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;X\" title=\"X\" /\u003e\u003c/a\u003e into a categorical feature which will \"minimize\" the error with respect to the \"true\" underlying relation:\n\n- [ ] Show the Python code\n\n- [ ] Get this graph online\n\n\n### Too many values per categorical feature\n\nWhen provided with categorical features, the logistic regression model fits a coefficient for all its values (except one which is taken as a reference). 
A common problem arises when there are too many values as each value will be taken by a small number of observations \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;x_i^j\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;x_i^j\" title=\"x_i^j\" /\u003e\u003c/a\u003e which makes the estimation of a logistic regression coefficient unstable:\n\n\n- [ ] Show the Python code\n\n- [ ] Get this graph online\n\n\nIf we divide the training set in 10 and estimate the variance of each coefficient, we get:\n\n- [ ] Show the Python code\n\n- [ ] Get this graph online\n\n\n\nAll intervals crossing 0 are non-significant! We should group factor values to get a stable estimation and (hopefully) significant coefficient values.\n\n\n# Discretization and grouping: theoretical background\n\n## Notations\n\nLet \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;\\mathfrak{q}=(\\mathfrak{q}_j)_1^d\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;\\mathfrak{q}=(\\mathfrak{q}_j)_1^d\" title=\"\\mathfrak{q}=(\\mathfrak{q}_j)_1^d\" /\u003e\u003c/a\u003e be the latent discretized transform of \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;X\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;X\" title=\"X\" /\u003e\u003c/a\u003e, i.e. 
taking values in \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;\\{0,\\ldots,m_j\\}\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;\\{0,\\ldots,m_j\\}\" title=\"\\{0,\\ldots,m_j\\}\" /\u003e\u003c/a\u003e where the number of values of each covariate \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;m_j\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;m_j\" title=\"m_j\" /\u003e\u003c/a\u003e is also latent.\n\nThe fitted logistic regression model is now:\n\u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\ln\u0026space;\\left(\u0026space;\\frac{p_\\theta(Y=1|e)}{p_\\theta(Y=0|e)}\u0026space;\\right)\u0026space;=\u0026space;\\theta_0\u0026space;\u0026plus;\u0026space;\\sum_{j=1}^d\u0026space;\\sum_{k=1}^{m_j}\u0026space;\\theta^j_k*{1}_{e^j=k}\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\ln\u0026space;\\left(\u0026space;\\frac{p_\\theta(Y=1|\\mathfrak{q})}{p_\\theta(Y=0|\\mathfrak{q})}\u0026space;\\right)\u0026space;=\u0026space;\\theta_0\u0026space;\u0026plus;\u0026space;\\sum_{j=1}^d\u0026space;\\sum_{k=1}^{m_j}\u0026space;\\theta^j_k*{1}_{\\mathfrak{q}^j=k}\" title=\"\\ln \\left( \\frac{p_\\theta(Y=1|\\mathfrak{q})}{p_\\theta(Y=0|\\mathfrak{q})} \\right) = \\theta_0 + \\sum_{j=1}^d \\sum_{k=1}^{m_j} \\theta^j_k*{1}_{\\mathfrak{q}^j=k}\" /\u003e\u003c/a\u003e\n\nClearly, the number of parameters has grown which allows for flexible approximation of the true underlying model \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=p(Y|\\mathfrak{q})\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?p(Y|\\mathfrak{q})\" title=\"p(Y|\\mathfrak{q})\" /\u003e\u003c/a\u003e.\n\n## Best discretization?\n\nOur goal is to obtain the model \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=p_\\theta(Y|\\mathfrak{q})\" 
target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?p_\\theta(Y|\\mathfrak{q})\" title=\"p_\\theta(Y|\\mathfrak{q})\" /\u003e\u003c/a\u003e with best predictive power. As \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\mathfrak{q}\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\mathfrak{q}\" title=\"\\mathfrak{q}\" /\u003e\u003c/a\u003e and \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\theta\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\theta\" title=\"\\theta\" /\u003e\u003c/a\u003e are both optimized, a formal goodness-of-fit criterion could be:\n\u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=(\\hat{\\theta},\\hat{\\mathfrak{q}})\u0026space;=\u0026space;\\arg\u0026space;\\max_{\\theta,\\mathfrak{q}}\u0026space;\\text{AIC}(p_\\theta(\\mathbf{y}|\\mathfrak{q}))\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?(\\hat{\\theta},\\hat{\\mathfrak{q}})\u0026space;=\u0026space;\\arg\u0026space;\\max_{\\theta,\\mathfrak{q}}\u0026space;\\text{AIC}(p_\\theta(\\mathbf{y}|\\mathfrak{q}))\" title=\"(\\hat{\\theta},\\hat{\\mathfrak{q}}) = \\arg \\max_{\\theta,\\mathfrak{q}} \\text{AIC}(p_\\theta(\\mathbf{y}|\\mathfrak{q}))\" /\u003e\u003c/a\u003e\nwhere AIC stands for Akaike Information Criterion.\n\n## Combinatorics\n\nThe problem seems well-posed: if we were able to generate all discretization schemes transforming \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=X\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?X\" title=\"X\" /\u003e\u003c/a\u003e to \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\mathfrak{q}\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\mathfrak{q}\" title=\"\\mathfrak{q}\" /\u003e\u003c/a\u003e, learn \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=p_\\theta(y|\\mathfrak{q})\" target=\"_blank\"\u003e\u003cimg 
src=\"https://latex.codecogs.com/gif.latex?p_\\theta(y|\\mathfrak{q})\" title=\"p_\\theta(y|\\mathfrak{q})\" /\u003e\u003c/a\u003e for each of them and compare their AIC values, the problem would be solved.\n\nUnfortunately, there are far too many candidates to follow this procedure. Suppose we want to construct k intervals of \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\mathfrak{q}_j\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\mathfrak{q}_j\" title=\"\\mathfrak{q}_j\" /\u003e\u003c/a\u003e given n distinct \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=(x_{i,j})_1^n\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?(x_{i,j})_1^n\" title=\"(x_{i,j})_1^n\" /\u003e\u003c/a\u003e. There are \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=n\u0026space;\\choose\u0026space;k\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?n\u0026space;\\choose\u0026space;k\" title=\"n \\choose k\" /\u003e\u003c/a\u003e candidate models. The true value of k is unknown, so it must be looped over. 
Finally, as logistic regression is a multivariate model, the discretization of \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\mathfrak{q}_j\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\mathfrak{q}_j\" title=\"\\mathfrak{q}_j\" /\u003e\u003c/a\u003e can influence the discretization of \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\mathfrak{q}_k\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\mathfrak{q}_k\" title=\"\\mathfrak{q}_k\" /\u003e\u003c/a\u003e, \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=k\u0026space;\\neq\u0026space;j\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?k\u0026space;\\neq\u0026space;j\" title=\"k \\neq j\" /\u003e\u003c/a\u003e.\n\nAs a consequence, existing approaches to discretization (in particular discretization of continuous attributes) rely on strong assumptions to simplify the search for good candidates, as can be seen in the review by Ramírez‐Gallego, S. et al. 
(2016) - see [References section](#-references).\n\n\n\n# Discretization and grouping: estimation\n\n## Likelihood estimation\n\n\u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\mathfrak{q}\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\mathfrak{q}\" title=\"\\mathfrak{q}\" /\u003e\u003c/a\u003e can be introduced in \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=p(Y|X)\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?p(Y|X)\" title=\"p(Y|X)\" /\u003e\u003c/a\u003e:\n\u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\forall\u0026space;\\:\u0026space;x,y,\u0026space;\\;\u0026space;p(y|x)\u0026space;=\u0026space;\\sum_\\mathfrak{q}\u0026space;p(y|x,\\mathfrak{q})p(\\mathfrak{q}|x)\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\forall\u0026space;\\:\u0026space;x,y,\u0026space;\\;\u0026space;p(y|x)\u0026space;=\u0026space;\\sum_\\mathfrak{q}\u0026space;p(y|x,\\mathfrak{q})p(\\mathfrak{q}|x)\" title=\"\\forall \\: x,y, \\; p(y|x) = \\sum_\\mathfrak{q} p(y|x,\\mathfrak{q})p(\\mathfrak{q}|x)\" /\u003e\u003c/a\u003e\n\nFirst, we assume that all information about \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;Y\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;Y\" title=\"Y\" /\u003e\u003c/a\u003e in \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;X\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;X\" title=\"X\" /\u003e\u003c/a\u003e is already contained in \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;\\mathfrak{q}\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;\\mathfrak{q}\" title=\"\\mathfrak{q}\" /\u003e\u003c/a\u003e so that:\n\u003ca 
href=\"https://www.codecogs.com/eqnedit.php?latex=\\forall\u0026space;\\:\u0026space;x,y,\\mathfrak{q},\u0026space;\\;\u0026space;p(y|x,\\mathfrak{q})=p(y|\\mathfrak{q})\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\forall\u0026space;\\:\u0026space;x,y,\\mathfrak{q},\u0026space;\\;\u0026space;p(y|x,\\mathfrak{q})=p(y|\\mathfrak{q})\" title=\"\\forall \\: x,y,\\mathfrak{q}, \\; p(y|x,\\mathfrak{q})=p(y|\\mathfrak{q})\" /\u003e\u003c/a\u003e\nSecond, we assume the conditional independence of \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\mathfrak{q}_j\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\mathfrak{q}_j\" title=\"\\mathfrak{q}_j\" /\u003e\u003c/a\u003e given \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=X_j\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?X_j\" title=\"X_j\" /\u003e\u003c/a\u003e, i.e. knowing \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=X_j\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?X_j\" title=\"X_j\" /\u003e\u003c/a\u003e, the discretization \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;\\mathfrak{q}_j\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;\\mathfrak{q}_j\" title=\"\\mathfrak{q}_j\" /\u003e\u003c/a\u003e is independent of the other features \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;X_k\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;X_k\" title=\"X_k\" /\u003e\u003c/a\u003e and \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;\\mathfrak{q}_k\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;\\mathfrak{q}_k\" title=\"\\mathfrak{q}_k\" /\u003e\u003c/a\u003e for all \u003ca 
href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;k\u0026space;\\neq\u0026space;j\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;k\u0026space;\\neq\u0026space;j\" title=\"k \\neq j\" /\u003e\u003c/a\u003e:\n\u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\forall\u0026space;\\:x,\u0026space;k\\neq\u0026space;j,\u0026space;\\;\u0026space;\\mathfrak{q}_j\u0026space;|\u0026space;x_j\u0026space;\\perp\u0026space;\\mathfrak{q}_k\u0026space;|\u0026space;x_k\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\forall\u0026space;\\:x,\u0026space;k\\neq\u0026space;j,\u0026space;\\;\u0026space;\\mathfrak{q}_j\u0026space;|\u0026space;x_j\u0026space;\\perp\u0026space;\\mathfrak{q}_k\u0026space;|\u0026space;x_k\" title=\"\\forall \\:x, k\\neq j, \\; \\mathfrak{q}_j | x_j \\perp \\mathfrak{q}_k | x_k\" /\u003e\u003c/a\u003e\nThe first equation becomes:\n\u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\forall\u0026space;\\:\u0026space;x,y,\u0026space;\\;\u0026space;p(y|x)\u0026space;=\u0026space;\\sum_\\mathfrak{q}\u0026space;p(y|\\mathfrak{q})\u0026space;\\prod_{j=1}^d\u0026space;p(\\mathfrak{q}_j|x_j)\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\forall\u0026space;\\:\u0026space;x,y,\u0026space;\\;\u0026space;p(y|x)\u0026space;=\u0026space;\\sum_\\mathfrak{q}\u0026space;p(y|\\mathfrak{q})\u0026space;\\prod_{j=1}^d\u0026space;p(\\mathfrak{q}_j|x_j)\" title=\"\\forall \\: x,y, \\; p(y|x) = \\sum_\\mathfrak{q} p(y|\\mathfrak{q}) \\prod_{j=1}^d p(\\mathfrak{q}_j|x_j)\" /\u003e\u003c/a\u003e\nAs said earlier, we consider only logistic regression models on discretized data \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;p_\\theta(y|\\mathfrak{q})\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;p_\\theta(y|\\mathfrak{q})\" 
title=\"p_\\theta(y|\\mathfrak{q})\" /\u003e\u003c/a\u003e. Additionally, we need further assumptions on the nature of the relationship of \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;\\mathfrak{q}_j\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;\\mathfrak{q}_j\" title=\"\\mathfrak{q}_j\" /\u003e\u003c/a\u003e to \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;x_j\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;x_j\" title=\"x_j\" /\u003e\u003c/a\u003e. We chose to use polytomous logistic regressions for continuous \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;X_j\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;X_j\" title=\"X_j\" /\u003e\u003c/a\u003e and contingency tables for qualitative \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;X_j\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;X_j\" title=\"X_j\" /\u003e\u003c/a\u003e. 
This is an arbitrary choice and future versions will include the possibility of plugging in your own model.\n\nWith these parametric models, the first equation becomes:\n\u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\forall\u0026space;\\:\u0026space;x,y,\u0026space;\\;\u0026space;p(y|x)\u0026space;=\u0026space;\\sum_\\mathfrak{q}\u0026space;p_\\theta(y|\\mathfrak{q})\u0026space;\\prod_{j=1}^d\u0026space;p_{\\alpha_j}(\\mathfrak{q}_j|x_j)\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\forall\u0026space;\\:\u0026space;x,y,\u0026space;\\;\u0026space;p(y|x)\u0026space;=\u0026space;\\sum_\\mathfrak{q}\u0026space;p_\\theta(y|\\mathfrak{q})\u0026space;\\prod_{j=1}^d\u0026space;p_{\\alpha_j}(\\mathfrak{q}_j|x_j)\" title=\"\\forall \\: x,y, \\; p(y|x) = \\sum_\\mathfrak{q} p_\\theta(y|\\mathfrak{q}) \\prod_{j=1}^d p_{\\alpha_j}(\\mathfrak{q}_j|x_j)\" /\u003e\u003c/a\u003e\n\n## The SEM algorithm\n\nIt is still hard to optimize over \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;p(y|x;\\theta,\\alpha)\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;p(y|x;\\theta,\\alpha)\" title=\"p(y|x;\\theta,\\alpha)\" /\u003e\u003c/a\u003e because, as said earlier, the number of candidate discretizations is gigantic.\n\nHowever, calculating \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;p(y,\\mathfrak{q}|x)\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;p(y,\\mathfrak{q}|x)\" title=\"p(y,\\mathfrak{q}|x)\" /\u003e\u003c/a\u003e is easy:\n\u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\forall\u0026space;\\:\u0026space;x,y,\u0026space;\\;\u0026space;p(y,\\mathfrak{q}|x)\u0026space;=\u0026space;p_\\theta(y|\\mathfrak{q})\u0026space;\\prod_{j=1}^d\u0026space;p_{\\alpha_j}(\\mathfrak{q}_j|x_j)\" target=\"_blank\"\u003e\u003cimg 
src=\"https://latex.codecogs.com/gif.latex?\\forall\u0026space;\\:\u0026space;x,y,\u0026space;\\;\u0026space;p(y,\\mathfrak{q}|x)\u0026space;=\u0026space;p_\\theta(y|\\mathfrak{q})\u0026space;\\prod_{j=1}^d\u0026space;p_{\\alpha_j}(\\mathfrak{q}_j|x_j)\" title=\"\\forall \\: x,y, \\; p(y,\\mathfrak{q}|x) = p_\\theta(y|\\mathfrak{q}) \\prod_{j=1}^d p_{\\alpha_j}(\\mathfrak{q}_j|x_j)\" /\u003e\u003c/a\u003e\n\nAs a consequence, we draw random candidates \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;\\mathfrak{q}\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;\\mathfrak{q}\" title=\"\\mathfrak{q}\" /\u003e\u003c/a\u003e approximately at the mode of the distribution \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;p(y,\\cdot|x)\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;p(y,\\cdot|x)\" title=\"p(y,\\cdot|x)\" /\u003e\u003c/a\u003e using an SEM algorithm (see the [References section](#-references)).\n\n## Gibbs sampling\n\nTo update the parameters \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;\\theta\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;\\theta\" title=\"\\theta\" /\u003e\u003c/a\u003e and \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;\\alpha\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;\\alpha\" title=\"\\alpha\" /\u003e\u003c/a\u003e at each random draw and propose a new discretization \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;\\mathfrak{q}\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;\\mathfrak{q}\" title=\"\\mathfrak{q}\" /\u003e\u003c/a\u003e, we use the following equation:\n\u003ca 
href=\"https://www.codecogs.com/eqnedit.php?latex=p(\\mathfrak{q}_j|x_j,y,\\mathfrak{q}_{\\{-j\\}})\u0026space;\\propto\u0026space;p_\\theta(y|\\mathfrak{q})\u0026space;p_{\\alpha_j}(\\mathfrak{q}_j|x_j)\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?p(\\mathfrak{q}_j|x_j,y,\\mathfrak{q}_{\\{-j\\}})\u0026space;\\propto\u0026space;p_\\theta(y|\\mathfrak{q})\u0026space;p_{\\alpha_j}(\\mathfrak{q}_j|x_j)\" title=\"p(\\mathfrak{q}_j|x_j,y,\\mathfrak{q}_{\\{-j\\}}) \\propto p_\\theta(y|\\mathfrak{q}) p_{\\alpha_j}(\\mathfrak{q}_j|x_j)\" /\u003e\u003c/a\u003e\nNote that we draw \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;\\mathfrak{q}_j\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;\\mathfrak{q}_j\" title=\"\\mathfrak{q}_j\" /\u003e\u003c/a\u003e conditionally on all other variables, in particular \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;\\mathfrak{q}_{-j}\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;\\mathfrak{q}_{-j}\" title=\"\\mathfrak{q}_{-j}\" /\u003e\u003c/a\u003e, which makes this a Gibbs sampler (see References section).\n\n# The `glmdisc` package\n\n## The `glmdisc` class\n\nThe documentation is available as a [GitHub Page](https://adimajo.github.io/glmdisc_python/index.html).\n\nThe `glmdisc` class implements the algorithm described in the previous section. Its parameters are described first, then its internals are briefly discussed. We finally focus on its outputs.\n\n### Parameters\n\nThe number of iterations in the SEM algorithm is controlled through the `iter` parameter. 
It can be useful to first run the algorithm with a low (10-50) `iter` value to get a better idea of how long your code will take to run.\n\nThe `validation` and `test` boolean parameters control whether the provided dataset should be divided into training, validation and/or test sets. The validation set is used to evaluate the quality of the model fit at each iteration, while the test set provides the quality measure of the final chosen model.\n\nThe `criterion` parameter lets the user choose between standard model selection statistics like `aic` and `bic` and the `gini` index performance measure (an affine transformation of the more traditional AUC measure: Gini = 2 AUC - 1). Note that if `validation=True`, there is no need to penalize the log-likelihood and `aic` and `bic` become equivalent. Conversely, if `criterion=\"gini\"` and `validation=False`, the algorithm may overfit the training data.\n\nThe `m_start` parameter controls the maximum number of categories of \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;\\mathfrak{q}_j\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;\\mathfrak{q}_j\" title=\"\\mathfrak{q}_j\" /\u003e\u003c/a\u003e for \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;X_j\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;X_j\" title=\"X_j\" /\u003e\u003c/a\u003e continuous. 
The SEM algorithm will start with random \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;\\mathfrak{q}_j\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;\\mathfrak{q}_j\" title=\"\\mathfrak{q}_j\" /\u003e\u003c/a\u003e taking values in \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;\\{1,\\ldots,m_{\\text{start}}\\}\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;\\{1,\\ldots,m_{\\text{start}}\\}\" title=\"\\{1,\\ldots,m_{\\text{start}}\\}\" /\u003e\u003c/a\u003e. For qualitative features \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;X_j\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;X_j\" title=\"X_j\" /\u003e\u003c/a\u003e, \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;\\mathfrak{q}_j\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;\\mathfrak{q}_j\" title=\"\\mathfrak{q}_j\" /\u003e\u003c/a\u003e is initialized with as many values as \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;X_j\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;X_j\" title=\"X_j\" /\u003e\u003c/a\u003e so that `m_start` has no effect.\n\nEmpirical studies show that with a reasonably small training dataset (\u003c 10,000 rows) and a small `m_start` parameter (\u003c 20), approximately 500 to 1500 iterations are largely sufficient to obtain a satisfactory model \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;p_\\theta(y|\\mathfrak{q})\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;p_\\theta(y|\\mathfrak{q})\" title=\"p_\\theta(y|\\mathfrak{q})\" /\u003e\u003c/a\u003e.\n\n```python\n\u003e\u003e\u003e import glmdisc\n\u003e\u003e\u003e logreg_disc = 
glmdisc.Glmdisc(iter=100, validation=True, test=True, criterion=\"bic\", m_start=10)\n```\n```PowerShell\n2020-07-16 18:11:03.087 | WARNING  | glmdisc:__init__:216 - No need to penalize the log-likelihood when a validation set is used. Using log-likelihood instead.\n```\n\n### The `fit` function\n\nThe `fit` function of the `glmdisc` class is used to run the algorithm over the data provided to it. Its parameters are `predictors_cont` and `predictors_qual`, which represent respectively the continuous features to be discretized and the categorical features whose values are to be grouped. They must be numpy arrays, filled with numeric values and strings respectively. The last parameter is the class `labels`, a numpy array as well, in binary form (0/1).\n\n```python\n\u003e\u003e\u003e n = 100\n\u003e\u003e\u003e d = 2\n\u003e\u003e\u003e x, y, _ = glmdisc.Glmdisc.generate_data(n, d)\n\u003e\u003e\u003e logreg_disc.fit(predictors_cont=x, predictors_qual=None, labels=y)\n```\n\n### The `best_formula` function\n\nThe `best_formula` function prints to the console the cut-points found for continuous features and the groupings made for categorical features' values. 
It also returns it in a list.\n\n```python\n\u003e\u003e\u003e logreg_disc.best_formula()\n```\n```PowerShell\n2020-07-16 18:13:29.921 | INFO     | glmdisc._bestFormula:best_formula:29 - Cut-points found for continuous variable 0\n[0.9568289154869697, 0.6661178585993954, 0.49039089060451335, 0.33038638461067193, 0.7152644679549544]\n2020-07-16 18:13:29.922 | INFO     | glmdisc._bestFormula:best_formula:29 - Cut-points found for continuous variable 1\n[0.48684331022166916, 0.17904111281801316, 0.6603144758481163, 0.03838803248009037]\n```\n\n### The `discrete_data` function\n\nThe `discrete_data` function returns the discretized / regrouped version of the `predictors_cont` and `predictors_qual` arguments using the best discretization scheme found so far.\n\n```python\n\u003e\u003e\u003e logreg_disc.discrete_data()\n```\n```PowerShell\n2020-07-16 18:14:57.261 | INFO     | glmdisc._discreteData:discrete_data:44 - Returning discretized test set.\n\u003c20x11 sparse matrix of type '\u003cclass 'numpy.float64'\u003e'\n\twith 40 stored elements in Compressed Sparse Row format\u003e\n```\n\n```python\n\u003e\u003e\u003e logreg_disc.discrete_data().toarray()\n```\n```PowerShell\n2020-07-16 18:15:31.041 | INFO     | glmdisc._discreteData:discrete_data:44 - Returning discretized test set.\narray([[1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],\n       [0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0.],\n       [0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0.],\n       [0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0.],\n[...]\n```\n\n### The `discretize` function\n\nThe `discretize` function discretizes a new input dataset in the `predictors_cont`, `predictors_qual` format using the best discretization scheme found so far. 
The result is a numpy array with the same shape as the original data.\n\n```python\n\u003e\u003e\u003e n_new = 100\n\u003e\u003e\u003e x_new, _, _ = glmdisc.Glmdisc.generate_data(n_new, d)\n\u003e\u003e\u003e logreg_disc.discretize(predictors_cont=x_new, predictors_qual=None)\n```\n```PowerShell\narray([[4., 1.],\n       [5., 2.],\n       [4., 3.],\n       [4., 4.],\n       [3., 4.],\n       [0., 2.],\n[...]\n```\n\n### The `discretize_dummy` function\n\nThe `discretize_dummy` function discretizes a new input dataset in the `predictors_cont`, `predictors_qual` format using the best discretization scheme found so far. The result is a sparse dummy (0/1) matrix corresponding to the one-hot encoding of the result provided by the `discretize` function.\n\n```python\n\u003e\u003e\u003e logreg_disc.discretize_dummy(predictors_cont=x_new, predictors_qual=None)\n```\n```PowerShell\n\u003c100x11 sparse matrix of type '\u003cclass 'numpy.float64'\u003e'\n\twith 200 stored elements in Compressed Sparse Row format\u003e\n```\n```python\n\u003e\u003e\u003e logreg_disc.discretize_dummy(predictors_cont=x_new, predictors_qual=None).toarray()\n```\n```PowerShell\narray([[0., 0., 0., ..., 0., 0., 0.],\n       [0., 0., 0., ..., 1., 0., 0.],\n       [0., 0., 0., ..., 0., 1., 0.],\n       ...,\n       [1., 0., 0., ..., 1., 0., 0.],\n       [1., 0., 0., ..., 0., 0., 0.],\n       [1., 0., 0., ..., 0., 0., 0.]])\n```\n\n### The `predict` function\n\nThe `predict` function discretizes a new input dataset in the `predictors_cont`, `predictors_qual` format using the best discretization scheme found so far through the `discretize_dummy` function and then applies the corresponding best logistic regression model \u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\inline\u0026space;p_\\theta(y|e)\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?\\inline\u0026space;p_\\theta(y|e)\" title=\"p_\\theta(y|e)\" /\u003e\u003c/a\u003e found so 
far.\n\n```python\n\u003e\u003e\u003e logreg_disc.predict(predictors_cont=x_new, predictors_qual=None)\n```\n```PowerShell\narray([[9.99394254e-01, 6.05745839e-04],\n       [9.99694576e-01, 3.05424466e-04],\n       [9.99817560e-01, 1.82439609e-04],\n       [9.99967791e-01, 3.22085041e-05],\n       [9.92296119e-01, 7.70388116e-03],\n[...]\n```\n\n### The attributes\n\nAll parameters are stored as attributes: `test`, \n`validation`, `criterion`, `iter`, `m_start` as well as:\n\n* `criterion_iter`: list of values of the criterion chosen;\n```python\n\u003e\u003e\u003e logreg_disc.criterion_iter\n```\n```PowerShell\n[-30.174443117243992, -26.182075441528603, -31.61227858514535, -19.70369464830396, -31.61997286396158, -25.99964499964587, ...]\n```\n* `best_link`: link function of the best quantization;\n```python\n\u003e\u003e\u003e logreg_disc.best_link\n```\n```PowerShell\n[LogisticRegression(C=1e+40, max_iter=25, multi_class='multinomial',\n                   solver='newton-cg', tol=0.001), \nLogisticRegression(C=1e+40, max_iter=25, multi_class='multinomial',\n                   solver='newton-cg', tol=0.001)]\n```\n* `best_reglog`: logistic regression function of the best quantization;\n```python\n\u003e\u003e\u003e logreg_disc.best_reglog\n```\n```PowerShell\nLogisticRegression(C=1e+40, max_iter=25, solver='liblinear', tol=0.001)\n```\n* `affectations`: list of label encoders for categorical features;\n```python\n\u003e\u003e\u003e logreg_disc.affectations\n```\n```PowerShell\n[None, None]\n```\n* `best_encoder_emap`: one hot encoder of the best quantization;\n```python\n\u003e\u003e\u003e logreg_disc.best_encoder_emap\n```\n```PowerShell\nOneHotEncoder(handle_unknown='ignore')\n```\n* `performance`: value of the chosen criterion for the best quantization;\n```python\n\u003e\u003e\u003e logreg_disc.performance\n```\n```PowerShell\n-14.924603930263428\n```\n* `train`: array of row indices for training samples;\n```python\n\u003e\u003e\u003e 
logreg_disc.train\n```\n```PowerShell\narray([97, 39, 94,  5, 16, 77, 88, 54, 80, 99, 46, 43, 52, 37, 28,  0, 18, ...\n```\n* `validate`: array of row indices for validation samples;\n```python\n\u003e\u003e\u003e logreg_disc.validate\n```\n```PowerShell\narray([36, 45, 29, 62,  8, 82, 76, 96, 41, 83, 17, 49, 57, 31, 60, 64, 65, ...\n```\n* `test_rows`: array of row indices for test samples;\n```python\n\u003e\u003e\u003e logreg_disc.test_rows\n```\n```PowerShell\narray([ 3, 75, 51, 27, 21, 48,  4, 44, 72, 68, 34, 22, 23, 50, 47,  6, 42, ...\n```\n\nTo see the package in action, please refer to [the accompanying Jupyter Notebook](examples/).\n\n- [ ] Do a notebook\n\n## Authors\n\n* [Adrien Ehrhardt](https://adimajo.github.io)\n* [Vincent Vandewalle](https://sites.google.com/site/vvandewa/)\n* [Philippe Heinrich](http://math.univ-lille1.fr/~heinrich/)\n* [Christophe Biernacki](http://math.univ-lille1.fr/~biernack/)\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## Acknowledgments\n\nThis research has been financed by [Crédit Agricole Consumer Finance](https://www.ca-consumerfinance.com/en.html) through a CIFRE PhD.\n\nThis research was supported by [Inria Lille - Nord-Europe](https://www.inria.fr/centre/lille) and [Lille University](https://www.univ-lille.fr/en/home/) as part of a PhD.\n\n## References\n\nEhrhardt, A. (2019), [Formalization and study of statistical problems in Credit Scoring: Reject inference, discretization and pairwise interactions, logistic regression trees](https://hal.archives-ouvertes.fr/tel-02302691) ([PhD thesis](https://github.com/adimajo/manuscrit_these)).\n\nEhrhardt, A., et al. (2019), [Feature quantization for parsimonious and interpretable predictive models](https://arxiv.org/abs/1903.08920). arXiv preprint arXiv:1903.08920.\n\nCeleux, G., Chauveau, D., Diebolt, J. (1995), [On Stochastic Versions of the EM Algorithm](https://hal.inria.fr/inria-00074164/document). 
[Research Report] RR-2514, INRIA. 1995. \u003cinria-00074164\u003e\n\nAgresti, A. (2002), [**Categorical Data Analysis**](https://onlinelibrary.wiley.com/doi/book/10.1002/0471249688). Second edition. Wiley.\n\nRamírez‐Gallego, S., García, S., Mouriño‐Talín, H., Martínez‐Rego, D., Bolón‐Canedo, V., Alonso‐Betanzos, A. and Herrera, F. (2016). [Data discretization: taxonomy and big data challenge. *Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery*](https://onlinelibrary.wiley.com/doi/abs/10.1002/widm.1173), 6(1), 5-21.\n\n## Future development: integration of interaction discovery\n\nVery often, predictive features $X$ \"interact\" with each other with respect to the response feature. This is classical in the context of Credit Scoring or biostatistics (only the simultaneous presence of several features, e.g. genes or SNPs, is predictive of a disease).\n\nWith the growing number of potential predictors and the time required to manually analyze whether an interaction should be added, there is a strong need for automatic procedures that screen potential interaction variables. 
This will be the subject of future work.\n\n## Future development: possibility of changing model assumptions\n\nIn the third section, we described two fundamental modelling hypotheses that were made:\n\u003e- The real probability density function $p(Y|X)$ can be approximated by a logistic regression $p_\\theta(Y|E)$ on the discretized data $E$.\n\u003e- The nature of the relationship of $\\mathfrak{q}_j$ to $X_j$ is:\n\u003e  - A polytomous logistic regression if $X_j$ is continuous;\n\u003e  - A contingency table if $X_j$ is qualitative.\n\nThese hypotheses are \"building blocks\" that could be changed at the modeller's will: discretization could optimize other models.\n\n- [ ] To delete when done with\n\n\n### Results\n\nFirst we simulate a \"true\" underlying discrete model:\n```{r, echo=TRUE, results='asis'}\nx = matrix(runif(300), nrow = 100, ncol = 3)\ncuts = seq(0,1,length.out= 4)\nxd = apply(x,2, function(col) as.numeric(cut(col,cuts)))\ntheta = t(matrix(c(0,0,0,2,2,2,-2,-2,-2),ncol=3,nrow=3))\nlog_odd = rowSums(t(sapply(seq_along(xd[,1]), function(row_id) sapply(seq_along(xd[row_id,]),\nfunction(element) theta[xd[row_id,element],element]))))\ny = rbinom(100,1,1/(1+exp(-log_odd)))\n```\n\nThe `glmdisc` function will try to \"recover\" the hidden true discretization `xd` when provided only with `x` and `y`:\n```{r, echo=TRUE,warning=FALSE, message=FALSE, results='hide',eval=FALSE}\nlibrary(glmdisc)\ndiscretization \u003c- glmdisc(x,y,iter=50,m_start=5,test=FALSE,validation=FALSE,criterion=\"aic\",interact=FALSE)\n```\n\n```{r, echo=FALSE,warning=FALSE, message=FALSE, results='hide',eval=TRUE}\nlibrary(glmdisc)\ndiscretization \u003c- glmdisc(x,y,iter=50,m_start=5,test=FALSE,validation=FALSE,criterion=\"aic\",interact=FALSE)\n```\n\n### How well did we do?\n\nTo compare the estimated and the true discretization schemes, we can represent them with respect to the input \"raw\" data `x`:\n\u003c!--```{r, echo=TRUE, out.width='.49\\\\linewidth', fig.width=3, 
fig.height=3,fig.show='hold'}--\u003e\n```{r, echo=FALSE}\nplot(x[,1],xd[,1])\nplot(discretization@cont.data[,1],discretization@disc.data[,1])\n```\n\n## Contribute\n\nYou can clone this project using:\n\n```PowerShell\ngit clone https://github.com/adimajo/glmdisc_python.git\n```\n\nYou can install all dependencies, including development dependencies, using (note that \nthis command requires `pipenv` which can be installed by typing `pip install pipenv`):\n\n```PowerShell\npipenv install -d\n```\n\nYou can build the documentation by going into the `docs` directory and typing `make html`.\n\nNOTE: you need to have a separate folder named `glmdisc_python_docs` in the same directory as this repository,\nas it will build the docs there so as to allow me to push this other directory as a separate `gh-pages` branch.\n\nYou can run the tests by typing `coverage run -m pytest`, which relies on packages \n[coverage](https://coverage.readthedocs.io/en/coverage-5.2/) and [pytest](https://docs.pytest.org/en/latest/).\n\nTo run the tests in different environments (one for each version of Python), install `pyenv` (see [the instructions here](https://github.com/pyenv/pyenv)),\ninstall all versions you want to test (see [tox.ini](tox.ini)), e.g. with `pyenv install 3.7.0` and run \n`pipenv run pyenv local 3.7.0 [...]` (and all other versions) followed by `pipenv run tox`.\n ","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadimajo%2Fglmdisc_python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fadimajo%2Fglmdisc_python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadimajo%2Fglmdisc_python/lists"}