{"id":20751303,"url":"https://github.com/cbg-ethz/bnpc","last_synced_at":"2025-04-28T13:13:18.201Z","repository":{"id":37620435,"uuid":"232813825","full_name":"cbg-ethz/BnpC","owner":"cbg-ethz","description":"Bayesian non-parametric clustering (BnpC) of binary data with missing values and uneven error rates","archived":false,"fork":false,"pushed_at":"2024-07-11T08:02:26.000Z","size":243,"stargazers_count":20,"open_issues_count":8,"forks_count":4,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-28T13:13:09.000Z","etag":null,"topics":["binary-data","clustering","genotyping","mcmc","split-merge"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cbg-ethz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-01-09T13:19:21.000Z","updated_at":"2025-04-22T05:38:24.000Z","dependencies_parsed_at":"2024-01-15T10:54:34.400Z","dependency_job_id":"c3e1545a-89df-4b77-97da-5e946ad0e9a4","html_url":"https://github.com/cbg-ethz/BnpC","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cbg-ethz%2FBnpC","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cbg-ethz%2FBnpC/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cbg-ethz%2FBnpC/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cbg-ethz%2FBnpC/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/owners/cbg-ethz","download_url":"https://codeload.github.com/cbg-ethz/BnpC/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251319593,"owners_count":21570428,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["binary-data","clustering","genotyping","mcmc","split-merge"],"created_at":"2024-11-17T08:32:33.942Z","updated_at":"2025-04-28T13:13:17.659Z","avatar_url":"https://github.com/cbg-ethz.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# BnpC\nBayesian non-parametric clustering (BnpC) of binary data with missing values and uneven error rates.\n\nBnpC is a novel non-parametric method to cluster individual cells into clones and infer their genotypes based on their noisy mutation profiles.\nBnpC employs a Chinese Restaurant Process prior to handle the unknown number of clonal populations. The model introduces a combination of Gibbs sampling, a modified non-conjugate split-merge move and Metropolis-Hastings updates to explore the joint posterior space of all parameters. 
Furthermore, it employs a novel estimator, which accounts for the shape of the posterior distribution, to predict the clones and genotypes.\n\nThe corresponding paper can be found in [Bioinformatics](https://doi.org/10.1093/bioinformatics/btaa599 \"Borgsmueller et al.\").\n\n# Contents\n- [Installation](#Installation)\n- [Usage](#Usage)\n- [Example data](#Example-data)\n\n# Requirements\n- Python 3.X\n\n# Installation\n## Clone repository\nFirst, download BnpC from GitHub and change to the directory:\n```bash\ngit clone https://github.com/cbg-ethz/BnpC\ncd BnpC\n```\n\n## Create conda environment (optional)\nFirst, create a new environment named \"BnpC\":\n```bash\nconda create --name BnpC python=3\n```\n\nSecond, activate it:\n```bash\nconda activate BnpC\n```\n\n## Install requirements\nUse pip to install the requirements:\n```bash\npython -m pip install -r requirements.txt\n```\n\nNow you are ready to run **BnpC**!\n\n# Usage\nThe BnpC wrapper script `run_BnpC.py` can be run with the following shell command:\n```bash\npython run_BnpC.py \u003cINPUT_DATA\u003e [-t] [-FN] [-FP] [-FN_m] [-FN_sd] [-FP_m] [-FP_sd] [-dpa] [-pp] [-n] [-s] [-r] [-ls] [-b] [-smp] [-cup] [-e] [-sc] [--seed] [-o] [-v] [-np] [-tr] [-tc] [-td]\n```\n\n## Input\nBnpC requires a binary matrix as input, where each row corresponds to a mutation and each column to a cell.\nAll matrix entries must be one of the following: 0|1|3/\" \", where 0 indicates the absence of a mutation, 1 the presence, and a 3 or empty element a missing value.\n\n\u003e ## Note\n\u003e If your data is arranged in the transposed way (cells = columns, rows = mutations), use the `-t` argument.\n\n## Arguments\n### Input Data Arguments\n- `\u003cstr\u003e`, Path to the input data.\n- `-t \u003cflag\u003e`, If set, the input matrix is transposed.\n\n### Optional input arguments (for simulated data)\n- `-tr \u003cstr\u003e`, Path to the mutation tree file (in .gv format) used for data generation.\n- `-tc \u003cstr\u003e`, Path 
to the true cluster assignments to compare clustering methods.\n- `-td \u003cstr\u003e`, Path to the true/raw data/genotypes.\n\n### Model Arguments\n- `-FN \u003cfloat\u003e`, Replace \u003cfloat\\\u003e with the fixed error rate for false negatives.\n- `-FP \u003cfloat\u003e`, Replace \u003cfloat\\\u003e with the fixed error rate for false positives.\n- `-FN_m \u003cfloat\u003e`, Replace \u003cfloat\\\u003e with the mean for the prior for the false negative rate.\n- `-FN_sd \u003cfloat\u003e`, Replace \u003cfloat\\\u003e with the standard deviation for the prior for the false negative rate.\n- `-FP_m \u003cfloat\u003e`, Replace \u003cfloat\\\u003e with the mean for the prior for the false positive rate.\n- `-FP_sd \u003cfloat\u003e`, Replace \u003cfloat\\\u003e with the standard deviation for the prior for the false positive rate.\n- `-ap \u003cfloat\u003e`, Alpha value of the Beta function used as prior for the concentration parameter of the CRP.\n- `-pp \u003cfloat\u003e \u003cfloat\u003e`, Beta function shape parameters used for the cluster parameter prior.\n\n\u003e ## Note\n\u003e If you run BnpC on panel data with only **a few mutations** or on **error-free** data, we recommend changing the `-pp` argument to a beta distribution closer to uniform, such as `-pp 0.75 0.75` or even `-pp 1 1`. Otherwise, BnpC will incorrectly report many singleton clusters.\n\n### MCMC Arguments\n- `-n \u003cint\u003e`, Number of MCMC chains to run in parallel (1 chain per thread).\n- `-s \u003cint\u003e`, Number of MCMC steps.\n- `-r \u003cint\u003e`, Runtime in minutes. 
If set, the steps argument is overridden.\n- `-ls \u003cfloat\u003e`, Lugsail batch means estimator as convergence diagnostics [Vats and Flegal, 2018].\n- `-b \u003cfloat\u003e`, Ratio of MCMC steps discarded as burn-in.\n- `-cup \u003cfloat\u003e`, Probability of updating the CRP concentration parameter.\n- `-eup \u003cfloat\u003e`, Probability of updating the error rates in an MCMC step.\n- `-smp \u003cfloat\u003e`, Probability of doing a split-merge step instead of Gibbs sampling.\n- `-sms \u003cint\u003e`, Number of intermediate, restricted Gibbs steps in the split-merge move.\n- `-smr \u003cfloat, float\u003e`, Ratio of splits/merges in the split-merge move.\n- `-e +\u003cstr\u003e`, Estimator(s) for inference. If more than one, separate them by spaces. Options = posterior|ML|MAP.\n- `-sc \u003cflag\u003e`, If set, infer a result for each chain individually (instead of from all chains together).\n- `--seed \u003cint\u003e`, Seed used for random number generation.\n\n### Output Arguments\n- `-o \u003cstr\u003e`, Path to an output directory.\n- `-np \u003cflag\u003e`, If set, no plots are generated.\n- `-v \u003cint\u003e`, Stdout verbosity level. Options = 0|1|2.\n\n# Example data\n\nLet's use the toy dataset found in the `data` folder (data.csv) to understand the functionality of the different arguments. First, go to the folder and activate the environment:\n\n        cd /path/to/crp_clustering\n        conda activate environment_name\n\nBnpC can run in three different settings:\n1. Number of steps. Runs for the given number of MCMC steps. Argument: -s\n2. Running time limit. The elapsed time is checked at every MCMC step, and the method stops once the given runtime is reached. Argument: -r\n3. Lugsail for convergence diagnosis. 
The chain is terminated if the estimator falls below a threshold defined by a significance level of 0.05 and a user-defined float in [0,1], comparable to the half-width of the confidence interval in the sample size calculation for a one-sample t-test. Reasonable values = 0.1, 0.2, 0.3. Argument: -ls\n\nThe simplest way to run BnpC is to leave every argument at its default, so that only the path to the data needs to be given. In this case, BnpC runs in setting 1.\n```bash\npython run_BnpC.py example_data/data.csv \n```\nIf the error rates are known for a particular sequencing dataset (e.g., FP = 0.0001 and FN = 0.3), one can run BnpC with fixed error rates by:\n```bash\npython run_BnpC.py example_data/data.csv -FP 0.0001 -FN 0.3\n```\nOn the other hand, if the error rates are not known, one can leave them unset as in the first case or, if there is some intuition, provide the mean and standard deviation of the priors for the method to learn them:\n```bash\npython run_BnpC.py example_data/data.csv -FP_m 0.0001 -FN_m 0.3 -FP_sd 0.000001 -FN_sd 0.05\n```\nAdditional MCMC arguments can be employed to allow faster convergence. Among other options:\n- Reduce the burn-in to include more posterior samples in the estimation. Example: -b 0.2 discards 20% of the total MCMC steps.\n- Adapt the split-merge probability to better explore the posterior landscape. Example: -smp 0.33, so that 1 out of every 3 steps is a split-merge move on average.\n- Adjust the Dirichlet Process alpha, which controls the probability of starting a new cluster. Example: -dpa 10. Increasing the value leads to a larger probability of starting a new cluster in the cell assignment step.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcbg-ethz%2Fbnpc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcbg-ethz%2Fbnpc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcbg-ethz%2Fbnpc/lists"}