{"id":45730323,"url":"https://github.com/miaohancheng/pysmatch","last_synced_at":"2026-03-08T09:01:12.411Z","repository":{"id":62583564,"uuid":"359356587","full_name":"miaohancheng/pysmatch","owner":"miaohancheng","description":"Propensity Score Matching(PSM) on python","archived":false,"fork":false,"pushed_at":"2025-08-25T16:24:23.000Z","size":9528,"stargazers_count":108,"open_issues_count":0,"forks_count":17,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-09-05T18:45:05.746Z","etag":null,"topics":["propensity-score-matching","psm","python"],"latest_commit_sha":null,"homepage":"https://miaohancheng.com/pysmatch/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/miaohancheng.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2021-04-19T06:46:06.000Z","updated_at":"2025-08-31T10:39:47.000Z","dependencies_parsed_at":"2025-04-22T10:58:08.531Z","dependency_job_id":"ebafeb30-ed6c-409a-a43a-48b1292352b1","html_url":"https://github.com/miaohancheng/pysmatch","commit_stats":null,"previous_names":["mhcone/pysmatch"],"tags_count":22,"template":false,"template_full_name":null,"purl":"pkg:github/miaohancheng/pysmatch","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miaohancheng%2Fpysmatch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miaohancheng%2Fpysmatch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miaohancheng%2Fpysmatch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miaohancheng%2Fpysmatch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/miaohancheng","download_url":"https://codeload.github.com/miaohancheng/pysmatch/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miaohancheng%2Fpysmatch/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29815808,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-25T05:36:42.804Z","status":"ssl_error","status_checked_at":"2026-02-25T05:36:31.934Z","response_time":61,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["propensity-score-matching","psm","python"],"created_at":"2026-02-25T09:24:06.508Z","updated_at":"2026-02-25T09:24:07.553Z","avatar_url":"https://github.com/miaohancheng.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# `pysmatch`\n\n[![PyPI version](https://badge.fury.io/py/pysmatch.svg?icon=si%3Apython\u0026icon_color=%23ffffff)](https://badge.fury.io/py/pysmatch)\n[![Downloads](https://static.pepy.tech/badge/pysmatch)](https://pepy.tech/project/pysmatch)\n![GitHub License](https://img.shields.io/github/license/miaohancheng/pysmatch)\n[![codecov](https://codecov.io/github/miaohancheng/pysmatch/graph/badge.svg?token=TUYDEDRV45)](https://codecov.io/github/miaohancheng/pysmatch)\n\n**Propensity Score Matching (PSM)** helps reduce selection bias in observational studies by matching treatment and control units with similar propensity scores.\n\n`pysmatch` is an improved and extended version of [`pymatch`](https://github.com/benmiroglio/pymatch), with modernized modeling, modularized matching utilities, and better support for reproducible workflows.\n\n### Multilingual\n\n[English](https://github.com/miaohancheng/pysmatch/blob/main/README.md) | [中文](https://github.com/miaohancheng/pysmatch/blob/main/README_CHINESE.md)\n\n### Highlights\n\n- Multiple score models: Logistic Regression, KNN, CatBoost\n- Flexible balancing: oversampling and undersampling (`balance_strategy`)\n- Standard and exhaustive matching workflows\n- Balance diagnostics for categorical and continuous covariates\n- Optional Optuna tuning for automated model search\n\n## Installation\n\nInstall from PyPI:\n\n```bash\npip install pysmatch\n```\n\nInstall optional extras:\n\n```bash\npip install \"pysmatch[tree]\"   # CatBoost support\npip install \"pysmatch[tune]\"   # Optuna support\npip install \"pysmatch[all]\"    # all optional dependencies\n```\n\nInstall from source:\n\n```bash\ngit clone https://github.com/miaohancheng/pysmatch.git\ncd pysmatch\npip install -e \".[all]\"\n```\n\n## Quickstart\n\nThis minimal example runs the full core path with the built-in demo dataset (`misc/loan.csv`).\n\n```python\nimport warnings\nwarnings.filterwarnings(\"ignore\")\n\nimport numpy as np\nimport pandas as pd\nfrom pysmatch.Matcher import Matcher\n\nnp.random.seed(42)\ndata = pd.read_csv(\"misc/loan.csv\")\n\ntest = data[data.loan_status == \"Default\"].copy()\ncontrol = data[data.loan_status == \"Fully Paid\"].copy()\n\nmatcher = Matcher(\n    test=test,\n    control=control,\n    yvar=\"is_default\",\n    exclude=[\"loan_status\"],\n)\n\nmatcher.fit_scores(\n    balance=True,\n    balance_strategy=\"over\",\n    nmodels=10,\n    model_type=\"linear\",\n    n_jobs=2,\n)\nmatcher.predict_scores()\nmatcher.match(method=\"min\", nmatches=1, threshold=0.001, replacement=False)\n\nprint(matcher.matched_data.head())\n```\n\nIf this works, continue to the full workflow below.\n\n## End-to-End Workflow\n\n### Data Preparation\n\nUse domain-relevant covariates and avoid leaking post-treatment variables into matching features.\n\n```python\nimport pandas as pd\n\nfields = [\n    \"loan_amnt\",\n    \"funded_amnt\",\n    \"funded_amnt_inv\",\n    \"term\",\n    \"int_rate\",\n    \"installment\",\n    \"grade\",\n    \"sub_grade\",\n    \"loan_status\",\n]\n\nraw = pd.read_csv(\"misc/loan.csv\", usecols=fields)\ntest = raw[raw.loan_status == \"Default\"].copy()\ncontrol = raw[raw.loan_status == \"Fully Paid\"].copy()\n```\n\n### Initialize Matcher\n\n```python\nfrom pysmatch.Matcher import Matcher\n\nmatcher = Matcher(\n    test=test,\n    control=control,\n    yvar=\"is_default\",\n    exclude=[\"loan_status\"],\n)\n\nprint(\"xvars:\", matcher.xvars)\nprint(\"test/control:\", matcher.testn, matcher.controln)\n```\n\n### Fit Propensity Score Models\n\n`fit_scores` supports three model types:\n\n- `linear` (logistic regression)\n- `knn`\n- `tree` (CatBoost, requires `pysmatch[tree]`)\n\n```python\nmatcher.fit_scores(\n    balance=True,\n    balance_strategy=\"over\",   # \"over\" or \"under\"\n    nmodels=10,\n    model_type=\"linear\",\n    max_iter=200,\n    n_jobs=2,\n)\n\nprint(\"models:\", len(matcher.models))\nprint(\"avg validation accuracy:\", sum(matcher.model_accuracy) / len(matcher.model_accuracy))\n```\n\nOptuna path (single tuned model):\n\n```python\n# matcher.fit_scores(\n#     balance=True,\n#     model_type=\"tree\",\n#     use_optuna=True,\n#     n_trials=20,\n# )\n```\n\n### Predict and Plot Scores\n\n```python\nmatcher.predict_scores()\nmatcher.plot_scores()\n```\n\n`matcher.data` now contains a `scores` column.\n\n### Tune Threshold\n\n```python\nimport numpy as np\n\nmatcher.tune_threshold(\n    method=\"min\",\n    nmatches=1,\n    rng=np.arange(0.0001, 0.0051, 0.0005),\n)\n```\n\nChoose a threshold that balances quality and retained sample size.\n\n### Run Matching\n\nStandard matching:\n\n```python\nmatcher.match(\n    method=\"min\",\n    nmatches=1,\n    threshold=0.001,\n    replacement=False,\n    exhaustive_matching=False,\n)\nmatcher.plot_matched_scores()\n```\n\nExhaustive matching:\n\n```python\nmatcher.match(\n    threshold=0.001,\n    nmatches=1,\n    exhaustive_matching=True,\n)\n```\n\n### Review Matched Data and Weights\n\n```python\nprint(matcher.matched_data.head())\nprint(matcher.record_frequency().head())\nmatcher.assign_weight_vector()\nprint(matcher.matched_data[[\"record_id\", \"match_id\", \"weight\"]].head())\n```\n\n## Matching Strategies\n\n### Standard vs Exhaustive Matching\n\n- **Standard (`exhaustive_matching=False`)**: uses nearest-neighbor style control selection with configurable method/replacement behavior.\n- **Exhaustive (`exhaustive_matching=True`)**: prioritizes wider control utilization while still respecting threshold constraints.\n\n### Key Parameters\n\n- `threshold`: max allowed score distance\n- `nmatches`: controls per treated unit\n- `replacement`: whether a control can be reused\n- `method`: `\"min\"` (closest) or `\"random\"` (random within threshold)\n\n### Practical Guidance\n\n- Start with `nmatches=1`, `replacement=False`, and a moderate threshold.\n- If retention is too low, loosen `threshold` gradually.\n- If balance is weak after matching, tighten threshold or change model/balance strategy.\n- For severe class imbalance, test `balance_strategy=\"under\"` as sensitivity analysis.\n\n## Evaluation\n\nAfter matching, evaluate covariate balance before causal analysis.\n\n### Categorical Covariates\n\n```python\ncat_table = matcher.compare_categorical(return_table=True, plot_result=True)\nprint(cat_table)\n```\n\nInterpretation:\n\n- check before/after p-value shifts\n- look for reduced proportional differences after matching\n\n### Continuous Covariates\n\n```python\ncont_table = matcher.compare_continuous(return_table=True, plot_result=True)\nprint(cont_table)\n```\n\nInterpretation:\n\n- compare KS statistics and grouped permutation test p-values\n- monitor standardized mean/median differences pre vs post matching\n\n### Single Variable Proportion Test\n\n```python\nprint(matcher.prop_test(\"grade\"))\n```\n\n## Troubleshooting\n\n### `ValueError: numpy.dtype size changed`\n\nThis is usually a NumPy/Pandas binary compatibility issue.\n\n```bash\npip install --upgrade --force-reinstall \"numpy\u003e=1.26.4\" \"pandas\u003e=2.1.4\"\n```\n\nRestart your Python kernel/session after reinstalling.\n\n### `Scores column not found`\n\nRun `predict_scores()` before `match()`.\n\n```python\nmatcher.fit_scores(...)\nmatcher.predict_scores()\nmatcher.match(...)\n```\n\n### `FileNotFoundError` for dataset path\n\nUse a repo-relative path:\n\n```python\npd.read_csv(\"misc/loan.csv\")\n```\n\n### No matches found\n\nUsually threshold is too strict or groups are weakly overlapping.\n\n- increase `threshold`\n- try a different `model_type`\n- inspect score distributions with `plot_scores()`\n\n### Jupyter kernel issues in notebooks\n\nIf your notebook kernel name is unavailable, switch to an existing kernel (`python3`) and rerun cells.\n\n## FAQ\n\n### When should I use `linear` vs `tree` vs `knn`?\n\n- Start with `linear` for strong baseline interpretability.\n- Use `tree` for nonlinear relationships and mixed feature types.\n- Use `knn` as a local-structure baseline and compare sensitivity.\n\n### Is high model accuracy always better for matching?\n\nNot necessarily. Very high separability may indicate weak overlap, which can reduce matchability. Balance diagnostics matter more than raw classifier accuracy.\n\n### Should I use over- or under-sampling?\n\n- `over`: usually keeps more majority information; good default.\n- `under`: faster/smaller training sets; useful for sensitivity checks.\n\n### How do I make runs reproducible?\n\n- set `np.random.seed(...)`\n- keep fixed package versions\n- record model/matching parameters in experiment logs\n\n### Additional resources\n\n- Sekhon, J. S. (2011), *Multivariate and propensity score matching software with automated balance optimization: The Matching package for R*. Journal of Statistical Software, 42(7), 1-52. [Link](https://www.jstatsoft.org/article/view/v042i07)\n- Rosenbaum, P. R., \u0026 Rubin, D. B. (1983), *The central role of the propensity score in observational studies for causal effects*. Biometrika, 70(1), 41-55. [Link](https://stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf)\n\n### Contributing\n\nContributions are welcome. Please open an issue or pull request in this repository.\n\n### License\n\n`pysmatch` is released under the MIT License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmiaohancheng%2Fpysmatch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmiaohancheng%2Fpysmatch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmiaohancheng%2Fpysmatch/lists"}