{"id":23951343,"url":"https://github.com/psteinb/c2st","last_synced_at":"2025-07-12T01:38:34.402Z","repository":{"id":40486260,"uuid":"451539876","full_name":"psteinb/c2st","owner":"psteinb","description":null,"archived":false,"fork":false,"pushed_at":"2023-09-12T23:56:28.000Z","size":827,"stargazers_count":4,"open_issues_count":5,"forks_count":2,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-12T23:41:50.634Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/psteinb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-01-24T16:21:33.000Z","updated_at":"2022-12-19T15:54:07.000Z","dependencies_parsed_at":"2022-08-09T22:00:18.515Z","dependency_job_id":null,"html_url":"https://github.com/psteinb/c2st","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/psteinb%2Fc2st","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/psteinb%2Fc2st/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/psteinb%2Fc2st/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/psteinb%2Fc2st/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/psteinb","download_url":"https://codeload.github.com/psteinb/c2st/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248647257,"owners_count":21139081,"icon_url":"https://github.com/github.png","version":nul
l,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-06T12:59:34.393Z","updated_at":"2025-04-12T23:41:57.928Z","avatar_url":"https://github.com/psteinb.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# c2st\n\n## Two-sample test using an ML classifier\n\nTest whether two sets of D-dimensional points are samples from the same\nmultivariate probability distribution.\n\nThe `c2st` function returns the accuracy with which a binary classifier\nseparates two sets of points `X` and `Y` when trained on the\nconcatenated dataset `(X,Y)`. All samples of `X` receive the label `0`\nand all samples of `Y` receive the label `1`.\n\nA value close to 0.5 means that the classifier is no better than random\nguessing, i.e. `X` and `Y` are likely from the same distribution. 
A value close\nto 1 means the classifier was able to separate `X` and `Y`, so they are\nprobably samples from different distributions.\n\n\n```py\n\u003e\u003e\u003e import numpy as np\n\u003e\u003e\u003e from c2st.check import c2st\n\n\u003e\u003e\u003e rng=np.random.default_rng(seed=123)\n\n# same distribution (Gaussian N(0,1)), D=20, 1000 points each\n\u003e\u003e\u003e X=rng.normal(loc=0, scale=1, size=(1000,20))\n\u003e\u003e\u003e Y=rng.normal(loc=0, scale=1, size=(1000,20))\n\u003e\u003e\u003e c2st(X, Y)\n0.4970122828225085\n\n# now shift the mean of Y by 0.3\n\u003e\u003e\u003e Y=rng.normal(loc=0.3, scale=1, size=(1000,20))\n\u003e\u003e\u003e c2st(X, Y)\n0.6964673015530594\n\n# now shift the mean of Y further away from X, to 1.5\n\u003e\u003e\u003e Y=rng.normal(loc=1.5, scale=1, size=(1000,20))\n\u003e\u003e\u003e c2st(X, Y)\n0.9994791666666666\n\n# or change the distribution's variance, so we compare a narrow to a wide distribution at the same loc\n\u003e\u003e\u003e Y=rng.normal(loc=0, scale=2, size=(1000,20))\n\u003e\u003e\u003e c2st(X, Y)\n0.950321845047345\n\n# we use balanced accuracy scoring by default to handle different sample sizes\n\u003e\u003e\u003e Y=rng.normal(loc=0, scale=1, size=(300,20))\n\u003e\u003e\u003e c2st(X, Y)\n0.5013659591194969\n```\n\n## In `rsc`\n\nA small validation study of which NN architecture is most useful for `c2st`\ntwo-sample testing. At this point, the analysis in\n[rsc/c2st_results.ipynb](rsc/c2st_results.ipynb) has not received anything close to the academic\nscrutiny you'd find in Lopez-Paz et al. However, it can serve as guidance on\nwhere each implementation of `c2st` works well.\nIf you'd like to redo the analysis, consult the notebook provided in [rsc/c2st_quality_study.ipynb](rsc/c2st_quality_study.ipynb).\n\n\n# References\n\n`c2st` is a sample-based method to evaluate goodness-of-fit using only two ensembles. For more details, see\n\n- Friedman, J. 
\"On Multivariate Goodness-of-Fit and Two-Sample Testing\", https://www.osti.gov/biblio/826696/\n\n- Lopez-Paz et al, \"Revisiting Classifier Two-Sample Tests\", http://arxiv.org/abs/1610.06545\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpsteinb%2Fc2st","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpsteinb%2Fc2st","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpsteinb%2Fc2st/lists"}