{"id":29989787,"url":"https://github.com/policyengine/py-statmatch","last_synced_at":"2025-11-05T20:01:48.568Z","repository":{"id":307410384,"uuid":"1029402338","full_name":"PolicyEngine/py-statmatch","owner":"PolicyEngine","description":"Python implementation of R's StatMatch package for statistical matching and data fusion","archived":false,"fork":false,"pushed_at":"2025-07-31T04:25:48.000Z","size":48,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-07-31T05:32:16.987Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://policyengine.github.io/py-statmatch/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PolicyEngine.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-31T02:10:49.000Z","updated_at":"2025-07-31T03:01:18.000Z","dependencies_parsed_at":"2025-07-31T05:42:24.315Z","dependency_job_id":null,"html_url":"https://github.com/PolicyEngine/py-statmatch","commit_stats":null,"previous_names":["policyengine/py-statmatch"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/PolicyEngine/py-statmatch","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PolicyEngine%2Fpy-statmatch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PolicyEngine%2Fpy-statmatch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PolicyEngine%2Fpy-statmatch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PolicyEngine%2Fpy-statmatch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PolicyEngine","download_url":"https://codeload.github.com/PolicyEngine/py-statmatch/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PolicyEngine%2Fpy-statmatch/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":268802449,"owners_count":24309657,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-04T02:00:09.867Z","response_time":79,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-08-04T23:26:53.189Z","updated_at":"2025-11-05T20:01:48.501Z","avatar_url":"https://github.com/PolicyEngine.png","language":"Python","readme":"# py-statmatch\n\nPython implementation of R's StatMatch package for statistical matching and data fusion.\n\n## Overview\n\n`py-statmatch` provides tools for statistical matching (also known as data fusion or synthetic data matching) between different datasets. This package is a Python port of the popular R package [StatMatch](https://cran.r-project.org/web/packages/StatMatch/), implementing various methods to match records from different data sources that share some common variables.\n\n## Features\n\n- **NND.hotdeck**: Nearest Neighbor Distance Hot Deck matching\n  - Multiple distance metrics (Euclidean, Manhattan, Mahalanobis, etc.)\n  - Donation classes (match within groups/strata)\n  - Constrained matching using Hungarian algorithm\n  - Handle missing values appropriately\n- **Coming soon**:\n  - RANDwNND.hotdeck: Random distance hot deck\n  - rankNND.hotdeck: Rank distance hot deck\n  - create.fused: Create fused datasets from matching results\n\n## Installation\n\n```bash\npip install py-statmatch\n```\n\nFor development:\n```bash\ngit clone https://github.com/PolicyEngine/py-statmatch.git\ncd py-statmatch\npip install -e \".[dev]\"\n```\n\n## Quick Start\n\n```python\nimport pandas as pd\nfrom statmatch import nnd_hotdeck\n\n# Create donor dataset (has X and Y variables)\ndonor_data = pd.DataFrame({\n    'age': [25, 30, 35, 40, 45],\n    'income': [30000, 45000, 55000, 65000, 80000],\n    'education': ['HS', 'BA', 'BA', 'MA', 'PhD'],\n    'job_satisfaction': [7, 8, 6, 9, 8]  # Variable to donate\n})\n\n# Create recipient dataset (has X variables but missing Y)\nrecipient_data = pd.DataFrame({\n    'age': [28, 33, 42],\n    'income': [35000, 50000, 70000],\n    'education': ['BA', 'BA', 'MA']\n})\n\n# Perform matching\nresult = nnd_hotdeck(\n    data_rec=recipient_data,\n    data_don=donor_data,\n    match_vars=['age', 'income'],\n    dist_fun='euclidean'\n)\n\n# Get matched donor indices\nprint(result['noad.index'])  # [0, 1, 3] for example\n\n# Create fused dataset\nfused_data = recipient_data.copy()\nfused_data['job_satisfaction'] = donor_data.iloc[result['noad.index']]['job_satisfaction'].values\n```\n\n## API Reference\n\n### nnd_hotdeck\n\n```python\nnnd_hotdeck(\n    data_rec,\n    data_don,\n    match_vars,\n    don_class=None,\n    dist_fun=\"euclidean\",\n    cut_don=None,\n    k=None,\n    w_don=None,\n    w_rec=None,\n    constr_alg=None\n)\n```\n\n**Parameters:**\n- `data_rec` (pd.DataFrame): The recipient dataset\n- `data_don` (pd.DataFrame): The donor dataset\n- `match_vars` (List[str]): List of variable names to use for matching\n- `don_class` (str, optional): Variable name defining donation classes\n- `dist_fun` (str): Distance function - \"euclidean\", \"manhattan\", \"mahalanobis\", etc.\n- `k` (int, optional): Maximum number of times each donor can be used\n- `constr_alg` (str, optional): Algorithm for constrained matching - \"lpsolve\" or \"hungarian\"\n\n**Returns:**\n- Dictionary containing:\n  - `mtc.ids`: DataFrame with recipient and donor IDs\n  - `noad.index`: Array of donor indices for each recipient (0-based)\n  - `dist.rd`: Array of distances between matched recipients and donors\n\n## Development\n\n### Running Tests\n\n```bash\n# Run all tests\npytest\n\n# Run with coverage\npytest --cov=statmatch\n\n# Run specific test\npytest tests/test_nnd_hotdeck.py::TestNNDHotdeck::test_euclidean_distance_matching\n```\n\n### Code Style\n\nThis project uses Black for code formatting:\n\n```bash\nblack . -l 79\n```\n\n## License\n\nMIT License - see LICENSE file for details.\n\n## Citation\n\nIf you use this package in your research, please cite both this package and the original R package:\n\n```bibtex\n@software{pystatmatch2024,\n  title = {py-statmatch: Python implementation of R's StatMatch package},\n  author = {PolicyEngine},\n  year = {2024},\n  url = {https://github.com/PolicyEngine/py-statmatch}\n}\n\n@Manual{rstatmatch,\n  title = {StatMatch: Statistical Matching or Data Fusion},\n  author = {Marcello D'Orazio},\n  year = {2023},\n  note = {R package version 1.4.2},\n  url = {https://CRAN.R-project.org/package=StatMatch},\n}\n```\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.\n\n### PyPI Publishing Setup\n\nThis repository uses GitHub Actions for automated PyPI publishing. To enable publishing:\n\n#### Option 1: PyPI API Token (Recommended for now)\n1. Create an account on [PyPI](https://pypi.org/) if you don't have one\n2. Generate an API token:\n   - Go to your PyPI account settings\n   - Scroll to \"API tokens\" section\n   - Click \"Add API token\"\n   - Give it a meaningful name (e.g., \"py-statmatch GitHub Actions\")\n   - Select scope: \"Entire account\" (for first publish) or \"Project: py-statmatch\" (after first publish)\n3. Add the token as a GitHub repository secret:\n   - Go to Settings → Secrets and variables → Actions\n   - Click \"New repository secret\"\n   - Name: `PYPI_API_TOKEN`\n   - Value: Your PyPI API token (starts with `pypi-`)\n\n#### Option 2: Trusted Publisher (OIDC) - Future Enhancement\n1. First manually publish the package once using Option 1\n2. Go to your PyPI project page\n3. Navigate to \"Publishing\" settings\n4. Add GitHub as a trusted publisher:\n   - Owner: `PolicyEngine`\n   - Repository: `py-statmatch`\n   - Workflow: `versioning.yaml`\n   - Environment: (leave blank)\n5. Remove the PYPI_API_TOKEN secret (optional)\n\nThe workflows automatically detect which method to use based on available secrets.\n\n## Acknowledgments\n\nThis is a Python port of the R StatMatch package by Marcello D'Orazio. We are grateful for the original implementation which has been invaluable to the statistical matching community.","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpolicyengine%2Fpy-statmatch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpolicyengine%2Fpy-statmatch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpolicyengine%2Fpy-statmatch/lists"}