{"id":20841823,"url":"https://github.com/basisresearch/millipede","last_synced_at":"2025-10-15T11:59:35.975Z","repository":{"id":41142056,"uuid":"420716445","full_name":"BasisResearch/millipede","owner":"BasisResearch","description":"A library for bayesian variable selection","archived":false,"fork":false,"pushed_at":"2025-01-24T02:21:10.000Z","size":520,"stargazers_count":28,"open_issues_count":2,"forks_count":5,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-11T18:05:45.265Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BasisResearch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-10-24T15:05:33.000Z","updated_at":"2025-01-24T02:21:14.000Z","dependencies_parsed_at":"2023-02-14T13:15:59.701Z","dependency_job_id":"25a1606f-86dd-45cf-bd40-35fdd8b7dc15","html_url":"https://github.com/BasisResearch/millipede","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BasisResearch%2Fmillipede","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BasisResearch%2Fmillipede/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BasisResearch%2Fmillipede/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BasisResearch%2Fmillipede/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BasisRes
earch","download_url":"https://codeload.github.com/BasisResearch/millipede/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253160727,"owners_count":21863624,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-18T01:22:00.985Z","updated_at":"2025-10-15T11:59:30.907Z","avatar_url":"https://github.com/BasisResearch.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Build Status](https://github.com/BasisResearch/millipede/workflows/CI/badge.svg)](https://github.com/BasisResearch/millipede/actions)\n[![Documentation Status](https://readthedocs.org/projects/millipede/badge/?version=latest)](https://millipede.readthedocs.io/en/latest/?badge=latest)\n      \n\n# millipede: A library for Bayesian variable selection\n```\n                                        ..    
..\n                           millipede      )  (\n                      _ _ _ _ _ _ _ _ _ _(.--.)\n                     {_{_{_{_{_{_{_{_{_{_( '_')\n                     /\\/\\/\\/\\/\\/\\/\\/\\/\\/\\ `---\n```\n\nmillipede is a [PyTorch](https://pytorch.org/)-based library for Bayesian variable selection in generalized\nlinear models that can be run on both CPU and GPU and that\ncan handle datasets with numbers of data points and covariates in the tens of thousands or more.\n\n \n## What is Bayesian variable selection?\n\nBayesian variable selection is a model-based approach for identifying parsimonious explanations of observed data.\nIn the context of generalized linear models with `P` covariates `{X_1, ..., X_P}` and responses `Y`, \nBayesian variable selection can be used to identify *sparse* subsets of covariates (i.e. far fewer than `P`) \nthat are sufficient for explaining the observed responses in terms of a linear function of the covariates.\n\nIn more detail, Bayesian variable selection is formulated as a model selection problem in which we consider \nthe space of `2^P` models in which some covariates are included and the rest are excluded.\nFor example, for continuous-valued responses one particular model might take the form `Y = beta_3 X_3 + beta_9 X_9` \nwith (non-zero) coefficients `beta_3` and `beta_9`.\nA priori we assume that models with fewer included covariates are more likely than those with more included covariates.\nThe set of parsimonious models best supported by the data then emerges from the posterior distribution over the space of models.\n\nWhat's especially appealing about Bayesian variable selection is that it provides an interpretable score\ncalled the PIP (posterior inclusion probability) for each covariate `X_p`. 
\nThe PIP is a true probability and so it satisfies `0 \u003c= PIP \u003c= 1` by definition.\nCovariates with large PIPs are good candidates for being explanatory of the response `Y`.\n\nBeing able to compute PIPs is particularly useful for high-dimensional datasets with large `P`.\nFor example, we might want to select a small number of covariates to include in a predictive model (i.e. feature selection). \nAlternatively, in settings where it is impractical to subject all `P` covariates to \nsome expensive downstream analysis (e.g. a laboratory experiment),\nBayesian variable selection can be used to select a small number of covariates for further analysis. \n  \n\n## Requirements\n\nmillipede requires Python 3.8 or later and the following Python packages: [PyTorch](https://pytorch.org/), [pandas](https://pandas.pydata.org/), and [polyagamma](https://github.com/zoj613/polyagamma). \n\nNote that if you wish to run millipede on a GPU you need to install PyTorch with CUDA support. \nIn particular, if you run the following command from your terminal, it should report `True`:\n```\npython -c 'import torch; print(torch.cuda.is_available())'\n```\n\n\n## Installation instructions\n\nInstall directly from GitHub:\n```\npip install git+https://github.com/BasisResearch/millipede.git\n```\n\nInstall from source:\n```\ngit clone git@github.com:BasisResearch/millipede.git\ncd millipede\npip install .\n```\n\n## Basic usage\n\nUsing millipede is easy:\n```python\n# import millipede \nfrom millipede import NormalLikelihoodVariableSelector\n\n# create a VariableSelector object appropriate to your datatype\nselector = NormalLikelihoodVariableSelector(dataframe,  # pass in the data\n                                            'Response', # indicate the column of responses\n                                            S=1,        # specify the expected number of covariates to include a priori\n                                           )\n\n# run the MCMC algorithm to compute posterior 
inclusion probabilities\n# and other posterior quantities of interest\nselector.run(T=1000, T_burnin=500)\n\n# inspect the results\nprint(selector.summary)\n```\n\nSee the Jupyter notebooks in the [notebooks](https://github.com/BasisResearch/millipede/tree/master/notebooks) directory for detailed example usage.\n\n\n## Supported data types \n\nThe covariates `X` are essentially arbitrary and can be continuous-valued, binary-valued, a mixture of the two, etc.\nCurrently the response `Y` can be any of the following:\n\n| Response type     | Selector class                                                                                                                                           |\n| ------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|\n| continuous-valued | [NormalLikelihoodVariableSelector](https://millipede.readthedocs.io/en/latest/selection.html#millipede.selection.NormalLikelihoodVariableSelector)       |\n| binary-valued     | [BernoulliLikelihoodVariableSelector](https://millipede.readthedocs.io/en/latest/selection.html#millipede.selection.BernoulliLikelihoodVariableSelector) |\n| bounded counts    | [BinomialLikelihoodVariableSelector](https://millipede.readthedocs.io/en/latest/selection.html#binomiallikelihoodvariableselector)                       |\n| unbounded counts  | [NegativeBinomialLikelihoodVariableSelector](https://millipede.readthedocs.io/en/latest/selection.html#negativebinomiallikelihoodvariableselector)       |\n\n\n## Scalability\n\nRoughly speaking, the cost of the MCMC algorithms implemented in millipede is proportional\n to `N x P`, where `N` is the total number of data points and `P` is the total number of covariates. 
\nFor an **approximate** guide to hardware requirements please consult the following table:\n\n| Regime                 | Expectations                            |\n| -----------------------|-----------------------------------------|\n| `N x P \u003c 10^7`         | Use a CPU                               |\n| `10^7 \u003c N x P \u003c 10^8`  | Use a GPU                               |\n| `10^8 \u003c N x P \u003c 10^10` | Use a GPU with the subset_size argument |\n| `10^10 \u003c N x P`        | You may be out of luck                  |\n\n\n## Documentation\n\nRead the docs [here](https://millipede.readthedocs.io/en/latest/).\n\n\n## FAQ\n\n- How many MCMC iterations do I need for good results?\n\nIt's hard to say. Generally speaking, difficult regimes with highly correlated covariates or a large number of\ncovariates are expected to require more iterations. Similarly, datasets with count-based responses are expected to require\nmore iterations than those with continuous-valued responses (because the underlying inference problem is more difficult).\nThe best way to determine if you need more MCMC iterations is to run millipede twice with different random number seeds.\nIf the results of the two runs are not similar, you probably want to increase the number of iterations.\nAs a general rule of thumb, it's probably good to aim for at least `10^4-10^5` samples if doing so is feasible. \nAlso, you probably want at least 1000 burn-in iterations.\n\n\n## Contact information\n\nMartin Jankowiak: jankowiak@gmail.com \n\n\n## References\n\nJankowiak, M., 2023. [Bayesian Variable Selection in a Million Dimensions](https://proceedings.mlr.press/v206/jankowiak23a.html). AISTATS.\n\nZanella, G. and Roberts, G., 2019. [Scalable importance tempering and Bayesian variable selection](https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12316). 
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(3), pp.489-517.\n\n## Citing millipede\n\nIf you use millipede please consider citing:\n```\n\n@InProceedings{pmlr-v206-jankowiak23a,\n  title = \t {Bayesian Variable Selection in a Million Dimensions},\n  author =       {Jankowiak, Martin},\n  booktitle = \t {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},\n  pages = \t {253--282},\n  year = \t {2023},\n  volume = \t {206},\n  series = \t {Proceedings of Machine Learning Research},\n  publisher =    {PMLR},\n  pdf = \t {https://proceedings.mlr.press/v206/jankowiak23a/jankowiak23a.pdf},\n  url = \t {https://proceedings.mlr.press/v206/jankowiak23a.html},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbasisresearch%2Fmillipede","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbasisresearch%2Fmillipede","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbasisresearch%2Fmillipede/lists"}