https://github.com/policyengine/py-statmatch
Python implementation of R's StatMatch package for statistical matching and data fusion
https://github.com/policyengine/py-statmatch
Last synced: 5 months ago
JSON representation
Python implementation of R's StatMatch package for statistical matching and data fusion
- Host: GitHub
- URL: https://github.com/policyengine/py-statmatch
- Owner: PolicyEngine
- License: mit
- Created: 2025-07-31T02:10:49.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-07-31T04:25:48.000Z (8 months ago)
- Last Synced: 2025-07-31T05:32:16.987Z (8 months ago)
- Language: Python
- Homepage: https://policyengine.github.io/py-statmatch/
- Size: 46.9 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
README
# py-statmatch
Python implementation of R's StatMatch package for statistical matching and data fusion.
## Overview
`py-statmatch` provides tools for statistical matching (also known as data fusion or synthetic data matching) between different datasets. This package is a Python port of the popular R package [StatMatch](https://cran.r-project.org/web/packages/StatMatch/), implementing various methods to match records from different data sources that share some common variables.
## Features
- **NND.hotdeck**: Nearest Neighbor Distance Hot Deck matching
- Multiple distance metrics (Euclidean, Manhattan, Mahalanobis, etc.)
- Donation classes (match within groups/strata)
- Constrained matching using Hungarian algorithm
- Handle missing values appropriately
- **Coming soon**:
- RANDwNND.hotdeck: Random distance hot deck
- rankNND.hotdeck: Rank distance hot deck
- create.fused: Create fused datasets from matching results
## Installation
```bash
pip install py-statmatch
```
For development:
```bash
git clone https://github.com/PolicyEngine/py-statmatch.git
cd py-statmatch
pip install -e ".[dev]"
```
## Quick Start
```python
import pandas as pd
from statmatch import nnd_hotdeck
# Create donor dataset (has X and Y variables)
donor_data = pd.DataFrame({
'age': [25, 30, 35, 40, 45],
'income': [30000, 45000, 55000, 65000, 80000],
'education': ['HS', 'BA', 'BA', 'MA', 'PhD'],
'job_satisfaction': [7, 8, 6, 9, 8] # Variable to donate
})
# Create recipient dataset (has X variables but missing Y)
recipient_data = pd.DataFrame({
'age': [28, 33, 42],
'income': [35000, 50000, 70000],
'education': ['BA', 'BA', 'MA']
})
# Perform matching
result = nnd_hotdeck(
data_rec=recipient_data,
data_don=donor_data,
match_vars=['age', 'income'],
dist_fun='euclidean'
)
# Get matched donor indices
print(result['noad.index']) # [0, 1, 3] for example
# Create fused dataset
fused_data = recipient_data.copy()
fused_data['job_satisfaction'] = donor_data.iloc[result['noad.index']]['job_satisfaction'].values
```
## API Reference
### nnd_hotdeck
```python
nnd_hotdeck(
data_rec,
data_don,
match_vars,
don_class=None,
dist_fun="euclidean",
cut_don=None,
k=None,
w_don=None,
w_rec=None,
constr_alg=None
)
```
**Parameters:**
- `data_rec` (pd.DataFrame): The recipient dataset
- `data_don` (pd.DataFrame): The donor dataset
- `match_vars` (List[str]): List of variable names to use for matching
- `don_class` (str, optional): Variable name defining donation classes
- `dist_fun` (str): Distance function - "euclidean", "manhattan", "mahalanobis", etc.
- `k` (int, optional): Maximum number of times each donor can be used
- `constr_alg` (str, optional): Algorithm for constrained matching - "lpsolve" or "hungarian"
**Returns:**
- Dictionary containing:
- `mtc.ids`: DataFrame with recipient and donor IDs
- `noad.index`: Array of donor indices for each recipient (0-based)
- `dist.rd`: Array of distances between matched recipients and donors
## Development
### Running Tests
```bash
# Run all tests
pytest
# Run with coverage
pytest --cov=statmatch
# Run specific test
pytest tests/test_nnd_hotdeck.py::TestNNDHotdeck::test_euclidean_distance_matching
```
### Code Style
This project uses Black for code formatting:
```bash
black . -l 79
```
## License
MIT License - see LICENSE file for details.
## Citation
If you use this package in your research, please cite both this package and the original R package:
```bibtex
@software{pystatmatch2024,
title = {py-statmatch: Python implementation of R's StatMatch package},
author = {PolicyEngine},
year = {2024},
url = {https://github.com/PolicyEngine/py-statmatch}
}
@Manual{rstatmatch,
title = {StatMatch: Statistical Matching or Data Fusion},
author = {Marcello D'Orazio},
year = {2023},
note = {R package version 1.4.2},
url = {https://CRAN.R-project.org/package=StatMatch},
}
```
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
### PyPI Publishing Setup
This repository uses GitHub Actions for automated PyPI publishing. To enable publishing:
#### Option 1: PyPI API Token (Recommended for now)
1. Create an account on [PyPI](https://pypi.org/) if you don't have one
2. Generate an API token:
- Go to your PyPI account settings
- Scroll to "API tokens" section
- Click "Add API token"
- Give it a meaningful name (e.g., "py-statmatch GitHub Actions")
- Select scope: "Entire account" (for first publish) or "Project: py-statmatch" (after first publish)
3. Add the token as a GitHub repository secret:
- Go to Settings → Secrets and variables → Actions
- Click "New repository secret"
- Name: `PYPI_API_TOKEN`
- Value: Your PyPI API token (starts with `pypi-`)
#### Option 2: Trusted Publisher (OIDC) - Future Enhancement
1. First manually publish the package once using Option 1
2. Go to your PyPI project page
3. Navigate to "Publishing" settings
4. Add GitHub as a trusted publisher:
- Owner: `PolicyEngine`
- Repository: `py-statmatch`
- Workflow: `versioning.yaml`
- Environment: (leave blank)
5. Remove the PYPI_API_TOKEN secret (optional)
The workflows automatically detect which method to use based on available secrets.
## Acknowledgments
This is a Python port of the R StatMatch package by Marcello D'Orazio. We are grateful for the original implementation which has been invaluable to the statistical matching community.