{"id":26933025,"url":"https://github.com/xiaohan2012/efficient-rashomon-rule-set","last_synced_at":"2025-10-25T12:21:45.867Z","repository":{"id":242579655,"uuid":"809942327","full_name":"xiaohan2012/efficient-rashomon-rule-set","owner":"xiaohan2012","description":"[KDD 2024] Efficient Exploration of the Rashomon Set of Rule Set Models","archived":false,"fork":false,"pushed_at":"2024-11-25T16:58:38.000Z","size":46841,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-11-25T17:32:10.794Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/xiaohan2012.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-03T18:40:45.000Z","updated_at":"2024-11-25T16:59:05.000Z","dependencies_parsed_at":null,"dependency_job_id":"033f0aa4-8ece-4820-b730-9fdaa8bc830a","html_url":"https://github.com/xiaohan2012/efficient-rashomon-rule-set","commit_stats":null,"previous_names":["xiaohan2012/efficient-rashomon-rule-set"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xiaohan2012%2Fefficient-rashomon-rule-set","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xiaohan2012%2Fefficient-rashomon-rule-set/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xiaohan2012%2Fefficient-rashomon-rule-set/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/xiaohan2012%2Fefficient-rashomon-rule-set/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/xiaohan2012","download_url":"https://codeload.github.com/xiaohan2012/efficient-rashomon-rule-set/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246785482,"owners_count":20833498,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-04-02T09:17:21.465Z","updated_at":"2025-10-25T12:21:45.792Z","avatar_url":"https://github.com/xiaohan2012.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"![About](https://img.shields.io/badge/ML-Interpretability-blue)\n![Unittests](https://github.com/xiaohan2012/efficient-rashomon-rule-set/actions/workflows/unittest.yml/badge.svg)\n![Lint](https://github.com/xiaohan2012/efficient-rashomon-rule-set/actions/workflows/lint.yml/badge.svg)\n![TestCoverage](https://img.shields.io/endpoint?url=https://gist.githubusercontent.com/xiaohan2012/2920be1e33237a8f2c58abcf09dfefcc/raw/covbadge.json)\n![PythonVersion](https://img.shields.io/badge/python-3.8-blue)\n\n# Efficient algorithms to explore the Rashomon set of rule set models\n\nThis repository contains the source code of the paper *\"[Efficient Exploration of the Rashomon Set of Rule Set Models](https://arxiv.org/pdf/2406.03059)\"* (KDD 2024).\n\n## What is a Rashomon set and why study it?\n\n*The Rashomon set* of an ML problem is the set of models with near-optimal predictive performance.\n\n**Why study it?** Because models with 
similar performance may exhibit *drastically different* properties (such as fairness), a single model does not adequately represent reality.\n\nThe example below shows the Rashomon set of rule set models for the [COMPAS](https://www.propublica.org/datastore/dataset/compas-recidivism-risk-score-data-and-analysis) dataset.\n\n- Each rule set is plotted as a point, whose position is determined by the statistical parity (`SP`) of the rule set on race and gender (on the X and Y axes, respectively).\n  - Statistical parity quantifies the fairness of classification models.\n- The two highlighted models have very different `SP[race]` scores, though their accuracy scores are close.\n\n![](./assets/rashomon-set-example.png)\n\n## Project overview\n\n- We designed *efficient algorithms* to explore the Rashomon set of rule-set models for binary classification problems.\n  - Our focus is on rule set models, due to their inherent *interpretability* 💡.\n- We investigated two exploration modes -- *counting* and *uniform sampling* from the Rashomon set.\n- Instead of tackling exact counting and uniform sampling, we study their approximate versions, which reduces the search space drastically.\n- For both problems, we developed theoretically sound algorithms sped up by *effective pruning bounds*, together with an efficient implementation powered by Numba and Ray.\n  - Compared to off-the-shelf tools (such as [Google OR-tools](https://github.com/google/or-tools)), our implementation is often **\u003e1000x faster** ⚡⚡⚡\n\n## Environment setup\n\nThe source code is tested against Python 3.8 on macOS 14.2.1.\n\n``` shell\npip install -r requirements.txt\n```\n\n\nVerify that the unit tests pass:\n\n``` shell\npytest tests\n```\n\n## Example usage\n\nWe illustrate the usage of the approximate counter and the almost-uniform sampler on synthetic data.\n\n### Preparation\n\nSet up a Ray cluster for parallel computing, e.g.,\n\n``` python\nimport 
ray\nray.init()\n```\n\n### Approximate counting\n\n``` python\nfrom bds.rule_utils import generate_random_rules_and_y\nfrom bds.meel import approx_mc2\n\nub = 0.9  # upper bound on the rule set objective function\nlmbd = 0.1  # complexity penalty term\n\neps = 0.8  # error parameter related to estimation accuracy\ndelta = 0.8  # the estimation confidence parameter\n\n\nnum_pts, num_rules = 100, 10\n# generate the input data\nrandom_rules, random_y = generate_random_rules_and_y(\n    num_pts, num_rules, rand_seed=42\n)\n\n# approximately count the rule set models in the Rashomon set\nestimated_count = approx_mc2(\n    random_rules,\n    random_y,\n    lmbd=lmbd,\n    ub=ub,\n    delta=delta,\n    eps=eps,\n    rand_seed=42,\n    parallel=True,  # run in parallel\n)\n```\n\n### Almost uniform sampling\n\n\n``` python\nfrom bds.rule_utils import generate_random_rules_and_y\nfrom bds.meel import UniGen\n\nnum_pts, num_rules = 100, 10\nrandom_rules, random_y = generate_random_rules_and_y(\n    num_pts, num_rules, rand_seed=42\n)\n\nub = 0.9\neps = 8  # controls how close the sampling distribution is to the uniform distribution\nlmbd = 0.1  # complexity penalty term\n\nsampler = UniGen(random_rules, random_y, lmbd, ub, eps, rand_seed=42)\n\nsampler.prepare()  # collect the statistics required for sampling\n\n# sample 10 rule sets almost uniformly from the Rashomon set\nsamples = sampler.sample(10, exclude_none=True)\n```\n\n### Candidate rules extraction on real-world datasets\n\nWhen working with real-world datasets, the first step is often to extract a list of candidate rules.\n\nFor this purpose, you may use `extract_rules_with_min_support` to extract a list of rules with support above a given threshold.\n\n``` python\nimport pandas as pd\nfrom bds.candidate_generation import extract_rules_with_min_support\n\ndataset = \"compas\"\ndata = pd.read_csv('data/compas_train-binary.csv')  # the features are binary\nX = 
data.to_numpy()[:,:-2]  # extract the feature matrix\n\nattribute_names = list(data.columns[:-2])\n\ncandidate_rules = extract_rules_with_min_support(X, attribute_names, min_support=70)\n\n# then you may apply the sampler or count estimator on the candidate rules\n```\n\n## Contact persons\n\n- Han Xiao: xiaohan2012@gmail.com\n- Martino Ciaperoni: martino.ciaperoni@aalto.fi\n\n## Citing this work\n\nIf you find this work useful, please consider citing it.\n\n\u003cdetails\u003e\n\u003csummary\u003eBibtex entry\u003c/summary\u003e\n\n``` bibtex\n@inproceedings{ciaperoni2024efficient,\n  title={Efficient Exploration of the Rashomon Set of Rule-Set Models},\n  author={Ciaperoni, Martino and Xiao, Han and Gionis, Aristides},\n  booktitle={Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},\n  pages={478--489},\n  year={2024}\n}\n```\n\n\u003c/details\u003e\n\n\n## TODO\n\n- [ ] rename package to `ers`\n- [ ] packaging\n- [ ] maybe add a logo?\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxiaohan2012%2Fefficient-rashomon-rule-set","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxiaohan2012%2Fefficient-rashomon-rule-set","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxiaohan2012%2Fefficient-rashomon-rule-set/lists"}