{"id":31108708,"url":"https://github.com/bytedance/CausalMatch","last_synced_at":"2025-09-17T06:45:33.146Z","repository":{"id":251703691,"uuid":"836720183","full_name":"bytedance/CausalMatch","owner":"bytedance","description":"CausalMatch is a Bytedance research project aimed at integrating cutting-edge machine learning and econometrics methods to bring about automation in decision-making process.","archived":false,"fork":false,"pushed_at":"2025-09-05T07:45:13.000Z","size":772,"stargazers_count":87,"open_issues_count":1,"forks_count":5,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-09-05T09:20:28.674Z","etag":null,"topics":["causal-inference","econometrics","machine-learning"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bytedance.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-08-01T12:21:07.000Z","updated_at":"2025-09-05T07:43:09.000Z","dependencies_parsed_at":"2024-10-28T11:49:47.082Z","dependency_job_id":"d40ca1b0-276e-40b6-9918-b454a678a32c","html_url":"https://github.com/bytedance/CausalMatch","commit_stats":null,"previous_names":["bytedance/causalmatch"],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/bytedance/CausalMatch","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bytedance%2FCausalMatch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositor
ies/bytedance%2FCausalMatch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bytedance%2FCausalMatch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bytedance%2FCausalMatch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bytedance","download_url":"https://codeload.github.com/bytedance/CausalMatch/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bytedance%2FCausalMatch/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":275548987,"owners_count":25484678,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-17T02:00:09.119Z","response_time":84,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["causal-inference","econometrics","machine-learning"],"created_at":"2025-09-17T06:45:23.246Z","updated_at":"2025-09-17T06:45:33.134Z","avatar_url":"https://github.com/bytedance.png","language":"Jupyter Notebook","funding_links":[],"categories":["🚀 GitHub Repositories","Causal Inference and Econometrics"],"sub_categories":["🌟 **Real-World Magic**","Frontier Tools"],"readme":"\u003ch1\u003e\n\u003ca href=\"\"\u003e\n\u003cimg src=\"doc/logo_nobc.png\" width=\"80px\" align=\"left\" style=\"margin-right: 10px;\", alt=\"causamatch-logo\"\u003e \n\u003c/a\u003e CausalMatch: A Python Package for Propensity Score Matching 
and Coarsened Exact Matching \n\u003c/h1\u003e\n\n[![PyPI version](https://badge.fury.io/py/causalmatch.svg)](https://badge.fury.io/py/causalmatch)\n[![Downloads](https://static.pepy.tech/badge/causalmatch)](https://pepy.tech/project/causalmatch)\n[![Downloads](https://static.pepy.tech/badge/causalmatch/month)](https://pepy.tech/project/causalmatch)\n[![Downloads](https://static.pepy.tech/badge/causalmatch/week)](https://pepy.tech/project/causalmatch)\n\n**CausalMatch** is a Python package that implements two classic matching methods, propensity score matching (PSM) and coarsened exact matching (CEM), to estimate average treatment effects from observational data.\nThis package was designed and built as part of the ByteDance data science research program, with the goal of combining state-of-the-art machine learning techniques with econometrics to bring automation to complex causal inference problems.\nOur toolkit has the following features:\n* Implements classic matching techniques from the literature at the intersection of econometrics and machine learning\n* Maintains flexibility in modeling the propensity score (via various machine learning classification models), while preserving the causal interpretation of the learned model and often offering valid confidence intervals\n* Uses a unified API\n* Builds on standard Python packages for machine learning and data analysis\n\n[//]: # ( \u0026#40;For information on use cases and background material on causal inference and heterogeneous treatment effects see our webpage at [webpage here]\u0026#41;)\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003e\u003cem\u003eTable of Contents\u003c/em\u003e\u003c/strong\u003e\u003c/summary\u003e\n\n- [News](#news)\n- [Getting Started](#getting-started)\n  - [Installation](#installation)\n  - [Usage Examples](#usage-examples)\n    - [Estimation Methods](#estimation-methods)\n- [References](#references)\n\u003c/details\u003e\n\n# News\n\nIf you'd like to contribute to 
this project, please contact xiaoyuzhou@bytedance.com.\nIf you have any questions, feel free to raise them in the issues section.\n\n**March 19, 2025:** Release v0.0.5, see release notes [here](https://github.com/bytedance/CausalMatch/releases/tag/v0.0.5)\n\n\n\n\u003cdetails\u003e\u003csummary\u003ePrevious releases\u003c/summary\u003e\n\n**December 10, 2024:** Release v0.0.4, see release notes [here](https://github.com/bytedance/CausalMatch/releases/tag/v0.0.4)\n\n**August 20, 2024:** Release v0.0.2, see release notes [here](https://github.com/bytedance/CausalMatch/releases/tag/v0.0.2)\n\n**August 2, 2024:** Release v0.0.1.\n\n\u003c/details\u003e\n\n# Getting Started\n\n## Installation\n\nInstall the latest release from PyPI:\n```\npip install causalmatch==0.0.5\n```\n\n\n## Usage Examples\n### Estimation Methods\n\n\u003cdetails\u003e\n  \u003csummary\u003ePropensity Score Matching (PSM) (click to expand)\u003c/summary\u003e\n\n  * Simple PSM\n\n  ```Python\nfrom causalmatch import matching, gen_test_data\nfrom sklearn.ensemble import GradientBoostingClassifier\n\ndf = gen_test_data(n = 10000, c_ratio=0.5)\ndf.head()\n\n\nX = ['c_1', 'c_2', 'c_3', 'd_1', 'gender']\ny = ['y', 'y2']\nT = 'treatment'\nid = 'user_id'\n# STEP 1: initialize the matching object\nmatch_obj = matching(data = df,\n                     T = T,\n                     X = X,\n                     y = y,\n                     id = id)\n\n# STEP 2: propensity score matching\n\nmatch_obj.psm(n_neighbors = 1,                      # number of neighbors\n              model = GradientBoostingClassifier(), # p-score model\n              trim_percentage = 0.1,                # trim x percent of data based on propensity score\n              caliper = 0.1)                        # caliper for p-score diff\n\n# STEP 3: balance check after propensity score matching\nmatch_obj.balance_check(include_discrete = True)\n\n# STEP 4: estimate the average treatment effect (ATE)\nprint(match_obj.ate())\n  ```\n\n  * PSM with 
multiple p-score models, selecting the best one by F1 score\n\n  ```Python\n# STEP 0: define the candidate classification models\nfrom causalmatch import matching\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.svm import SVC\nfrom sklearn.naive_bayes import GaussianNB\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.tree import DecisionTreeClassifier\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.ensemble import GradientBoostingClassifier\nfrom lightgbm import LGBMClassifier\nfrom xgboost import XGBClassifier\n\nps_model1 = LogisticRegression(C=1e6)\nps_model2 = SVC(probability=True)\nps_model3 = GaussianNB()\nps_model4 = KNeighborsClassifier()\nps_model5 = DecisionTreeClassifier()\nps_model6 = RandomForestClassifier()\nps_model7 = GradientBoostingClassifier()\nps_model8 = LGBMClassifier()\nps_model9 = XGBClassifier()\n\nmodel_list = [ps_model1, ps_model2, ps_model3,  ps_model4, ps_model5, ps_model6,  ps_model7, ps_model8, ps_model9]\n# df, X, y, T and id are defined in the Simple PSM example above\nmatch_obj = matching(data = df, T = T, X = X, id = id)\nmatch_obj.psm(n_neighbors = 1,\n              model_list = model_list, # list of candidate models to try\n              trim_percentage = 0,\n              caliper = 1,\n              test_size = 0.2) # fraction of the data held out as the test set\nprint(match_obj.balance_check(include_discrete = True))\n# merge outcomes and covariates back onto the matched output\ndf_out = match_obj.df_out_final_post_trim.merge(df[y + X + [id]], how='left', on = id)\n\n  ```\n\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\n  \u003csummary\u003eCoarsened Exact Matching (CEM) (click to expand)\u003c/summary\u003e\n\n  * Simple CEM\n\n\n  ```Python\n\nmatch_obj_cem = matching(data = df,  y = ['y'], T = 'treatment',  X = ['c_1','d_1','d_3'], id = 'user_id')\n# coarsened exact matching\nmatch_obj_cem.cem(n_bins = 10, # number of bins for continuous x variables, cut by percentile\n                  k2k = True)  # k2k: trim exp/base to have same observation 
numbers\nprint(match_obj_cem.balance_check(include_discrete=True))\nprint(match_obj_cem.ate())\n  ```\n\n  * CEM with custom bin cuts\n\n  ```Python\n\nmatch_obj_cem = matching(data = df,  y = ['y'], T = 'treatment',  X = ['c_1','d_1','d_3'], id = 'user_id')\nmatch_obj_cem.cem(n_bins = 10,\n                  break_points = {'c_1': [-1, 0.3, 0.6, 2]},  # custom cut points for the continuous variable\n                  cluster_criteria = {'d_1': [['apple','pear'],['cat','dog'],['bee']],\n                                      'd_3': [['0.0','1.0','2.0'], ['3.0','4.0','5.0'], ['6.0','7.0','8.0','9.0']]}, # custom value groups for the discrete variables\n                  k2k = True)\n  ```\n\u003c/details\u003e\n\n\n\nSee the \u003ca href=\"#references\"\u003eReferences\u003c/a\u003e section for more details.\n\n# References\n\nS. Athey, J. Tibshirani, S. Wager.\n**Generalized random forests.**\n[*Annals of Statistics, 47, no. 2, 1148-1178*](https://projecteuclid.org/euclid.aos/1547197251), 2019.\n\nV. Chernozhukov, D. Nekipelov, V. Semenova, V. Syrgkanis.\n**Plug-in Regularized Estimation of High-Dimensional Parameters in Nonlinear Semiparametric Models.**\n[*arXiv preprint arXiv:1806.04823*](https://arxiv.org/abs/1806.04823), 2018.\n\nS. Wager, S. Athey.\n**Estimation and Inference of Heterogeneous Treatment Effects using Random Forests.**\n[*Journal of the American Statistical Association, 113:523, 1228-1242*](https://www.tandfonline.com/doi/citedby/10.1080/01621459.2017.1319839), 2018.\n\nV. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey.\n**Double Machine Learning for Treatment and Causal Parameters.**\n[*arXiv preprint arXiv:1608.00060*](https://arxiv.org/abs/1608.00060), 2016.\n\nP. Bajari, B. Burdick, G. W. Imbens, L. Masoero, J. McQueen, T. Richardson, I. M. Rosen.
\n**Multiple randomization designs.**\n[*arXiv preprint arXiv:2112.13495*](https://arxiv.org/pdf/2112.13495), 2021.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbytedance%2FCausalMatch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbytedance%2FCausalMatch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbytedance%2FCausalMatch/lists"}