{"id":14982420,"url":"https://github.com/cbg-ethz/pybda","last_synced_at":"2025-07-24T08:10:34.613Z","repository":{"id":57467036,"uuid":"140821515","full_name":"cbg-ethz/pybda","owner":"cbg-ethz","description":":computer::computer::computer: A commandline tool for analysis of big biological data sets for distributed HPC clusters.","archived":false,"fork":false,"pushed_at":"2022-11-11T07:27:14.000Z","size":379759,"stargazers_count":9,"open_issues_count":7,"forks_count":4,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-04-25T04:44:57.311Z","etag":null,"topics":["apache-spark","big-data","machine-learning","python","snakemake"],"latest_commit_sha":null,"homepage":"https://pybda.rtfd.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cbg-ethz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"code-of-conduct.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-07-13T08:43:10.000Z","updated_at":"2023-04-28T14:47:28.000Z","dependencies_parsed_at":"2022-09-10T03:43:39.870Z","dependency_job_id":null,"html_url":"https://github.com/cbg-ethz/pybda","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cbg-ethz%2Fpybda","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cbg-ethz%2Fpybda/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cbg-ethz%2Fpybda/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cbg-ethz%2Fpybda/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cbg-ethz","download_url":"https://codelo
ad.github.com/cbg-ethz/pybda/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238825748,"owners_count":19537118,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","big-data","machine-learning","python","snakemake"],"created_at":"2024-09-24T14:05:23.090Z","updated_at":"2025-02-14T10:31:46.955Z","avatar_url":"https://github.com/cbg-ethz.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PyBDA \u003cimg src=\"https://raw.githubusercontent.com/cbg-ethz/pybda/master/_fig/sticker_pybda.png\" align=\"right\" width=\"160px\"/\u003e\n\n[![Project 
Status](https://img.shields.io/badge/repo%20status-active-brightgreen.svg)](http://www.repostatus.org/#active)\n[![travis](https://img.shields.io/travis/cbg-ethz/pybda/master.svg?\u0026logo=travis)](https://travis-ci.org/cbg-ethz/pybda/)\n[![circleci](https://img.shields.io/circleci/project/github/cbg-ethz/pybda/master.svg?\u0026logo=circleci)](https://circleci.com/gh/cbg-ethz/pybda/)\n[![codecov](https://codecov.io/gh/cbg-ethz/pybda/branch/master/graph/badge.svg)](https://codecov.io/gh/cbg-ethz/pybda)\n[![codacy](https://api.codacy.com/project/badge/Grade/a4cca665933a4def9c2cfc88d7bbbeae)](https://www.codacy.com/app/simon-dirmeier/pybda?utm_source=github.com\u0026amp;utm_medium=referral\u0026amp;utm_content=cbg-ethz/pybda\u0026amp;utm_campaign=Badge_Grade)\n[![readthedocs](https://readthedocs.org/projects/pybda/badge/?version=latest)](http://pybda.readthedocs.io/en/latest)\n[![bioconda](https://img.shields.io/badge/install%20with-bioconda-black.svg)](http://bioconda.github.io/recipes/pybda/README.html)\n[![version](https://img.shields.io/pypi/v/pybda.svg?colorB=black)](https://pypi.org/project/pybda/)\n\nA command-line tool for analyzing big biological data sets on distributed HPC clusters.\n\n## About\n\nPyBDA is a Python library and command-line tool for big data analytics and machine learning that scales to big, high-dimensional data sets.\n\nTo make PyBDA scale to big data sets, we use [Apache Spark](https://spark.apache.org/)'s DataFrame API. Code written against this API automatically distributes\ndata across the nodes of a high-performance cluster and runs expensive machine learning tasks in parallel.\nFor scheduling, PyBDA uses [Snakemake](https://snakemake.readthedocs.io/en/stable/) to automatically execute pipelines of jobs. In particular, PyBDA will first build a DAG of methods/jobs\nyou want to execute in succession (e.g. 
dimensionality reduction followed by clustering) and then compute every method by traversing the DAG.\nWhen a job completes successfully, PyBDA writes results and plots and creates statistics. If one of the jobs fails, PyBDA reports which method failed and where\n(owing to Snakemake's scheduling), so that the same pipeline can be resumed from the point of failure.\n\nFor instance, if you want to first reduce your data set to a lower-dimensional space, cluster it with several different numbers of cluster centers, and fit a random forest, you would first specify a config file similar to this:\n\n```bash\n$ cat data/pybda-usecase.config\n\nspark: spark-submit\ninfile: data/single_cell_imaging_data.tsv\npredict: data/single_cell_imaging_data.tsv\noutfolder: data/results\nmeta: data/meta_columns.tsv\nfeatures: data/feature_columns.tsv\ndimension_reduction: pca\nn_components: 5\nclustering: kmeans\nn_centers: 50, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200\nregression: forest\nfamily: binomial\nresponse: is_infected\nsparkparams:\n  - \"--driver-memory=3G\"\n  - \"--executor-memory=6G\"\ndebug: true\n```\n\nRunning PyBDA with the methods configured above is then as simple as:\n\n```bash\n$ pybda run data/pybda-usecase.config local\n```\n\n## Installation\n\nI recommend installing PyBDA from [Bioconda](https://bioconda.github.io/recipes/pybda/README.html?highlight=pybda#recipe-Recipe%20\u0026#x27;pybda\u0026#x27;):\n\n```bash\n$ conda install -c bioconda pybda\n```\n\nYou can also install it directly from [PyPI](https://pypi.org/project/pybda/):\n\n```bash\n$ pip install pybda\n```\n\nOtherwise, you can download the latest [release](https://github.com/cbg-ethz/pybda/releases) and install it.\n\n## Documentation\n\nCheck out the documentation [here](https://pybda.readthedocs.io/en/latest/).\nThe documentation will walk you through\n\n* the installation process,\n* setting up Apache Spark,\n* using `pybda`.\n\n## Author\n\nSimon 
Dirmeier \u003ca href=\"mailto:simon.dirmeier@bsse.ethz.ch\"\u003esimon.dirmeier@bsse.ethz.ch\u003c/a\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcbg-ethz%2Fpybda","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcbg-ethz%2Fpybda","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcbg-ethz%2Fpybda/lists"}