https://github.com/miaohancheng/pysmatch

Propensity Score Matching(PSM) on python
https://github.com/miaohancheng/pysmatch
propensity-score-matching psm python
Last synced: 4 months ago
JSON representation
Propensity Score Matching(PSM) on python
Host: GitHub
URL: https://github.com/miaohancheng/pysmatch
Owner: miaohancheng
License: mit
Created: 2021-04-19T06:46:06.000Z (about 5 years ago)
Default Branch: main
Last Pushed: 2025-08-25T16:24:23.000Z (10 months ago)
Last Synced: 2025-09-05T18:45:05.746Z (10 months ago)
Topics: propensity-score-matching, psm, python
Language: Jupyter Notebook
Homepage: https://miaohancheng.com/pysmatch/
Size: 9.09 MB
Stars: 108
Watchers: 6
Forks: 17
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

          # `pysmatch`

[![PyPI version](https://badge.fury.io/py/pysmatch.svg?icon=si%3Apython&icon_color=%23ffffff)](https://badge.fury.io/py/pysmatch)

[![Downloads](https://static.pepy.tech/badge/pysmatch)](https://pepy.tech/project/pysmatch)

![GitHub License](https://img.shields.io/github/license/miaohancheng/pysmatch)

[![codecov](https://codecov.io/github/miaohancheng/pysmatch/graph/badge.svg?token=TUYDEDRV45)](https://codecov.io/github/miaohancheng/pysmatch)

**Propensity Score Matching (PSM)** helps reduce selection bias in observational studies by matching treatment and control units with similar propensity scores.

`pysmatch` is an improved and extended version of [`pymatch`](https://github.com/benmiroglio/pymatch), with modernized modeling, modularized matching utilities, and better support for reproducible workflows.

### Multilingual

[English](https://github.com/miaohancheng/pysmatch/blob/main/README.md) | [中文](https://github.com/miaohancheng/pysmatch/blob/main/README_CHINESE.md)

### Highlights

- Multiple score models: Logistic Regression, KNN, CatBoost

- Flexible balancing: oversampling and undersampling (`balance_strategy`)

- Standard and exhaustive matching workflows

- Balance diagnostics for categorical and continuous covariates

- Optional Optuna tuning for automated model search

## Installation

Install from PyPI:

```bash

pip install pysmatch

```

Install optional extras:

```bash

pip install "pysmatch[tree]"   # CatBoost support

pip install "pysmatch[tune]"   # Optuna support

pip install "pysmatch[all]"    # all optional dependencies

```

Install from source:

```bash

git clone https://github.com/miaohancheng/pysmatch.git

cd pysmatch

pip install -e ".[all]"

```

## Quickstart

This minimal example runs the full core path with the built-in demo dataset (`misc/loan.csv`).

```python

import warnings

warnings.filterwarnings("ignore")

import numpy as np

import pandas as pd

from pysmatch.Matcher import Matcher

np.random.seed(42)

data = pd.read_csv("misc/loan.csv")

test = data[data.loan_status == "Default"].copy()

control = data[data.loan_status == "Fully Paid"].copy()

matcher = Matcher(

    test=test,

    control=control,

    yvar="is_default",

    exclude=["loan_status"],

)

matcher.fit_scores(

    balance=True,

    balance_strategy="over",

    nmodels=10,

    model_type="linear",

    n_jobs=2,

)

matcher.predict_scores()

matcher.match(method="min", nmatches=1, threshold=0.001, replacement=False)

print(matcher.matched_data.head())

```

If this works, continue to the full workflow below.

## End-to-End Workflow

### Data Preparation

Use domain-relevant covariates and avoid leaking post-treatment variables into matching features.

```python

import pandas as pd

fields = [

    "loan_amnt",

    "funded_amnt",

    "funded_amnt_inv",

    "term",

    "int_rate",

    "installment",

    "grade",

    "sub_grade",

    "loan_status",

]

raw = pd.read_csv("misc/loan.csv", usecols=fields)

test = raw[raw.loan_status == "Default"].copy()

control = raw[raw.loan_status == "Fully Paid"].copy()

```

### Initialize Matcher

```python

from pysmatch.Matcher import Matcher

matcher = Matcher(

    test=test,

    control=control,

    yvar="is_default",

    exclude=["loan_status"],

)

print("xvars:", matcher.xvars)

print("test/control:", matcher.testn, matcher.controln)

```

### Fit Propensity Score Models

`fit_scores` supports three model types:

- `linear` (logistic regression)

- `knn`

- `tree` (CatBoost, requires `pysmatch[tree]`)

```python

matcher.fit_scores(

    balance=True,

    balance_strategy="over",   # "over" or "under"

    nmodels=10,

    model_type="linear",

    max_iter=200,

    n_jobs=2,

)

print("models:", len(matcher.models))

print("avg validation accuracy:", sum(matcher.model_accuracy) / len(matcher.model_accuracy))

```

Optuna path (single tuned model):

```python

# matcher.fit_scores(

#     balance=True,

#     model_type="tree",

#     use_optuna=True,

#     n_trials=20,

# )

```

### Predict and Plot Scores

```python

matcher.predict_scores()

matcher.plot_scores()

```

`matcher.data` now contains a `scores` column.

### Tune Threshold

```python

import numpy as np

matcher.tune_threshold(

    method="min",

    nmatches=1,

    rng=np.arange(0.0001, 0.0051, 0.0005),

)

```

Choose a threshold that balances quality and retained sample size.

### Run Matching

Standard matching:

```python

matcher.match(

    method="min",

    nmatches=1,

    threshold=0.001,

    replacement=False,

    exhaustive_matching=False,

)

matcher.plot_matched_scores()

```

Exhaustive matching:

```python

matcher.match(

    threshold=0.001,

    nmatches=1,

    exhaustive_matching=True,

)

```

### Review Matched Data and Weights

```python

print(matcher.matched_data.head())

print(matcher.record_frequency().head())

matcher.assign_weight_vector()

print(matcher.matched_data[["record_id", "match_id", "weight"]].head())

```

## Matching Strategies

### Standard vs Exhaustive Matching

- **Standard (`exhaustive_matching=False`)**: uses nearest-neighbor style control selection with configurable method/replacement behavior.

- **Exhaustive (`exhaustive_matching=True`)**: prioritizes wider control utilization while still respecting threshold constraints.

### Key Parameters

- `threshold`: max allowed score distance

- `nmatches`: controls per treated unit

- `replacement`: whether a control can be reused

- `method`: `"min"` (closest) or `"random"` (random within threshold)

### Practical Guidance

- Start with `nmatches=1`, `replacement=False`, and a moderate threshold.

- If retention is too low, loosen `threshold` gradually.

- If balance is weak after matching, tighten threshold or change model/balance strategy.

- For severe class imbalance, test `balance_strategy="under"` as sensitivity analysis.

## Evaluation

After matching, evaluate covariate balance before causal analysis.

### Categorical Covariates

```python

cat_table = matcher.compare_categorical(return_table=True, plot_result=True)

print(cat_table)

```

Interpretation:

- check before/after p-value shifts

- look for reduced proportional differences after matching

### Continuous Covariates

```python

cont_table = matcher.compare_continuous(return_table=True, plot_result=True)

print(cont_table)

```

Interpretation:

- compare KS statistics and grouped permutation test p-values

- monitor standardized mean/median differences pre vs post matching

### Single Variable Proportion Test

```python

print(matcher.prop_test("grade"))

```

## Troubleshooting

### `ValueError: numpy.dtype size changed`

This is usually a NumPy/Pandas binary compatibility issue.

```bash

pip install --upgrade --force-reinstall "numpy>=1.26.4" "pandas>=2.1.4"

```

Restart your Python kernel/session after reinstalling.

### `Scores column not found`

Run `predict_scores()` before `match()`.

```python

matcher.fit_scores(...)

matcher.predict_scores()

matcher.match(...)

```

### `FileNotFoundError` for dataset path

Use a repo-relative path:

```python

pd.read_csv("misc/loan.csv")

```

### No matches found

Usually threshold is too strict or groups are weakly overlapping.

- increase `threshold`

- try a different `model_type`

- inspect score distributions with `plot_scores()`

### Jupyter kernel issues in notebooks

If your notebook kernel name is unavailable, switch to an existing kernel (`python3`) and rerun cells.

## FAQ

### When should I use `linear` vs `tree` vs `knn`?

- Start with `linear` for strong baseline interpretability.

- Use `tree` for nonlinear relationships and mixed feature types.

- Use `knn` as a local-structure baseline and compare sensitivity.

### Is high model accuracy always better for matching?

Not necessarily. Very high separability may indicate weak overlap, which can reduce matchability. Balance diagnostics matter more than raw classifier accuracy.

### Should I use over- or under-sampling?

- `over`: usually keeps more majority information; good default.

- `under`: faster/smaller training sets; useful for sensitivity checks.

### How do I make runs reproducible?

- set `np.random.seed(...)`

- keep fixed package versions

- record model/matching parameters in experiment logs

### Additional resources

- Sekhon, J. S. (2011), *Multivariate and propensity score matching software with automated balance optimization: The Matching package for R*. Journal of Statistical Software, 42(7), 1-52. [Link](https://www.jstatsoft.org/article/view/v042i07)

- Rosenbaum, P. R., & Rubin, D. B. (1983), *The central role of the propensity score in observational studies for causal effects*. Biometrika, 70(1), 41-55. [Link](https://stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf)

### Contributing

Contributions are welcome. Please open an issue or pull request in this repository.

### License

`pysmatch` is released under the MIT License.
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/miaohancheng/pysmatch

Awesome Lists containing this project

README