{"id":16276521,"url":"https://github.com/hendersontrent/correctipy","last_synced_at":"2025-04-04T10:30:52.338Z","repository":{"id":103578657,"uuid":"587594403","full_name":"hendersontrent/correctipy","owner":"hendersontrent","description":"Python package for computing corrected test statistics for comparing machine learning models on correlated samples","archived":false,"fork":false,"pushed_at":"2023-01-12T05:43:09.000Z","size":3071,"stargazers_count":8,"open_issues_count":6,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-20T08:45:00.998Z","etag":null,"topics":["hypothesis-testing","machine-learning","statistics"],"latest_commit_sha":null,"homepage":"http://correctipy.readthedocs.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hendersontrent.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-01-11T05:40:59.000Z","updated_at":"2024-12-28T12:35:16.000Z","dependencies_parsed_at":"2023-03-13T15:07:20.031Z","dependency_job_id":null,"html_url":"https://github.com/hendersontrent/correctipy","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hendersontrent%2Fcorrectipy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hendersontrent%2Fcorrectipy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hendersontrent%2Fcorrectipy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposi
tories/hendersontrent%2Fcorrectipy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hendersontrent","download_url":"https://codeload.github.com/hendersontrent/correctipy/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247160244,"owners_count":20893797,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["hypothesis-testing","machine-learning","statistics"],"created_at":"2024-10-10T18:48:39.590Z","updated_at":"2025-04-04T10:30:52.318Z","avatar_url":"https://github.com/hendersontrent.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# correctipy\n\nCorrected test statistics for comparing machine learning models on\ncorrelated samples\n\n## Installation\n\nYou can install the stable version of `correctipy` from GitHub using:\n\n``` python\n pip install git+https://github.com/hendersontrent/correctipy\n```\n\n## General purpose\n\nOften in machine learning, we want to compare the performance of\ndifferent models to determine if one statistically outperforms another.\nHowever, the methods used (e.g., data resampling, $k$-fold\ncross-validation) to obtain these performance metrics (e.g.,\nclassification accuracy) violate the assumptions of traditional\nstatistical tests such as a $t$-test. The purpose of these methods is to\neither aid generalisability of findings (i.e., through quantification of\nerror as they produce multiple values for each model instead of just\none) or to optimise model hyperparameters. 
This makes them invaluable,\nbut unusable with traditional tests, as [Dietterich\n(1998)](https://pubmed.ncbi.nlm.nih.gov/9744903/) found that the\nstandard $t$-test underestimates the variance, which inflates the\nType I error rate. `correctipy` is a lightweight package that implements a\nsmall number of corrected test statistics for cases when samples are not\nindependent (and therefore are correlated), such as in the case of\nresampling, $k$-fold cross-validation, and repeated $k$-fold\ncross-validation. These corrections were all originally proposed by\n[Nadeau and Bengio\n(2003)](https://link.springer.com/article/10.1023/A:1024068626366).\nCurrently, only comparisons between two models are supported.\n\nIf you are interested in the version for R, please see [`correctR`](https://github.com/hendersontrent/correctR).\n\n## Basic usage\n\n`correctipy` is a lightweight package that implements a small number of corrected test statistics for cases when samples of two machine learning model metrics (e.g., classification accuracy) are not independent (and therefore are correlated), such as in the case of resampling and $k$-fold cross-validation. We demonstrate the basic functionality here using some trivial examples for the following corrected tests that are currently implemented in `correctipy`:\n\n* Random subsampling\n* $k$-fold cross-validation\n* Repeated $k$-fold cross-validation\n\nThese corrections were all originally proposed by [Nadeau and Bengio (2003)](https://link.springer.com/article/10.1023/A:1024068626366) with additional representations in [Bouckaert and Frank (2004)](https://link.springer.com/chapter/10.1007/978-3-540-24775-3_3).\n\n### Random subsampling correction\n\nThe standard $t$-test inflates the Type I error rate when used with random subsampling because it underestimates the variance, as found by [Dietterich (1998)](https://pubmed.ncbi.nlm.nih.gov/9744903/). 
Nadeau and Bengio (2003) proposed a solution (which we implement as `resampled_ttest` in `correctipy`) in the form of:\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/images/rss.svg\" alt=\"Equation for random subsampling corrected test statistic\"/\u003e\n\u003c/p\u003e\n\nwhere $n$ is the number of resamples (NOTE: $n$ is *not* sample size), $n_{1}$ is the number of samples in the training data, and $n_{2}$ is the number of samples in the test data. $\\sigma^{2}$ is the variance estimate used in the standard paired $t$-test (which simply has $\\frac{\\sigma}{\\sqrt{n}}$ in the denominator where $n$ is the sample size in this case).\n\n### k-fold cross-validation correction\n\nThere is an alternate formulation of the random subsampling correction, devised in terms of the correlation $\\rho$, discussed in [Corani et al. (2016)](https://link.springer.com/article/10.1007/s10994-017-5641-9) which we implement as `kfold_ttest` in `correctipy`:\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/images/kcv.svg\" alt=\"Equation for k-fold cross-validation corrected test statistic\"/\u003e\n\u003c/p\u003e\n\nwhere $n$ is the number of resamples and $\\rho = \\frac{1}{k}$ where $k$ is the number of folds in the $k$-fold cross-validation procedure. This formulation stems from the fact that Nadeau and Bengio (2003) proved there is no unbiased estimator of this correlation, but it can be approximated with $\\rho = \\frac{1}{k}$.\n\n### Repeated k-fold cross-validation correction\n\nRepeated $k$-fold cross-validation is more complex than the previous cases as we now have $r$ repeats for every fold $k$. 
Bouckaert and Frank (2004) present a nice representation of the corrected test for this case which we implement as `repkfold_ttest` in `correctipy`:\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/images/rkcv.svg\" alt=\"Equation for repeated k-fold cross-validation corrected test statistic\"/\u003e\n\u003c/p\u003e\n\n## Setup\n\nIn the real world, we would have proper results obtained through fitting two models according to one or more of the procedures outlined above. For simplicity here, we are just going to simulate three datasets so we can get to the package functionality more quickly. We are going to assume we are in a classification context and generate classification accuracy values. These values are purposefully egregious---we are going to (in the case of the random subsampling) just fix the train set sample size (`n1`) to 80 and the test set sample size (`n2`) to 20, and assume (using the same data) for the $k$-fold cross-validation correction that the same numbers were obtained on such a method. Again, the values are not important here; it is the corrections we are going to apply next that are crucial.\n\nIn the case of repeated $k$-fold cross-validation, take note of the column names. While the `DataFrame` you pass in to `repkfold_ttest` can have more than the four columns specified here, it **must** contain at least these four with the exact corresponding names. The function explicitly searches for them. They are:\n\n1. `\"model\"` --- contains a label for each of the two models to compare\n2. `\"values\"` --- the numerical values of the performance metric (e.g., classification accuracy)\n3. `\"k\"` --- which fold the values correspond to\n4. 
`\"r\"` --- which repeat of the fold the values correspond to\n\n```python\nimport numpy as np\nimport pandas as pd\n\n# Simulated classification accuracy values for the two models\nx = np.random.normal(0.6, 0.1, 30)\ny = np.random.normal(0.4, 0.1, 30)\n\n# Long-format data with the four required columns for repkfold_ttest\ntmp = pd.DataFrame({'model':np.repeat([1, 2], 60), \n                   'values':np.concatenate((np.random.normal(0.6, 0.1, 60), np.random.normal(0.4, 0.1, 60))),\n                   'k':[1, 1, 2, 2]*30,\n                   'r':[1, 2]*60\n                  })\n```\n\n## Package functionality\n\nWe can apply each correction with a one-line function call:\n\n```python\nfrom correctipy import resampled_ttest\nfrom correctipy import kfold_ttest\nfrom correctipy import repkfold_ttest\n\nrss = resampled_ttest(x, y, 30, 80, 20) # Random subsampling\nkcv = kfold_ttest(x, y, 100, 30) # k-fold cross-validation\nrkcv = repkfold_ttest(tmp, 80, 20, 2, 2) # Repeated k-fold cross-validation\n```\n\nAll the functions return a pandas `DataFrame` with two named columns: `\"statistic\"` (the $t$-statistic) and `\"p_value\"` (the associated $p$-value), meaning they can be easily integrated into complex machine learning pipelines. Here is an example for the `resampled_ttest` case:\n\n```\n   statistic       p_value\n0    6.09829  6.083703e-07\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhendersontrent%2Fcorrectipy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhendersontrent%2Fcorrectipy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhendersontrent%2Fcorrectipy/lists"}