{"id":29428724,"url":"https://github.com/urbslab/streamline","last_synced_at":"2025-07-12T15:19:31.616Z","repository":{"id":39031554,"uuid":"489535987","full_name":"UrbsLab/STREAMLINE","owner":"UrbsLab","description":"Simple Transparent End-To-End Automated Machine Learning Pipeline for Supervised Learning in Tabular Binary Classification Data","archived":false,"fork":false,"pushed_at":"2025-04-30T18:09:40.000Z","size":623543,"stargazers_count":76,"open_issues_count":2,"forks_count":11,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-04-30T18:29:11.569Z","etag":null,"topics":["automl-pipeline","binary-classification","data-science","data-visualization","feature-selection","imputation","machine-learning","model-application","statistical-analysis","supervised-learning"],"latest_commit_sha":null,"homepage":"https://urbslab.github.io/STREAMLINE/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/UrbsLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-05-07T01:22:20.000Z","updated_at":"2025-04-30T17:14:08.000Z","dependencies_parsed_at":"2023-09-22T20:49:04.528Z","dependency_job_id":"937f9cdc-3c64-4b62-913a-86f105d50f52","html_url":"https://github.com/UrbsLab/STREAMLINE","commit_stats":null,"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"purl":"pkg:github/UrbsLab/STREAMLINE","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UrbsLab%2FSTREAMLINE","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/UrbsLab%2FSTREAMLINE/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UrbsLab%2FSTREAMLINE/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UrbsLab%2FSTREAMLINE/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/UrbsLab","download_url":"https://codeload.github.com/UrbsLab/STREAMLINE/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UrbsLab%2FSTREAMLINE/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265009537,"owners_count":23697193,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automl-pipeline","binary-classification","data-science","data-visualization","feature-selection","imputation","machine-learning","model-application","statistical-analysis","supervised-learning"],"created_at":"2025-07-12T15:19:23.119Z","updated_at":"2025-07-12T15:19:31.606Z","avatar_url":"https://github.com/UrbsLab.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"![alttext](https://github.com/UrbsLab/STREAMLINE/blob/main/docs/source/pictures/STREAMLINE_Logo_Full.png?raw=true)\n# Overview\n\nSTREAMLINE is an end-to-end automated machine learning (AutoML) pipeline\nthat empowers anyone to easily train, interpret, and apply a variety of predictive models as\npart of a rigorous and optionally customizable data mining analysis. 
It is programmed in\nPython 3 using many common libraries including [Pandas](https://pandas.pydata.org/)\nand [scikit-learn](https://scikit-learn.org/stable/).\n\nThe schematic below summarizes the automated STREAMLINE analysis pipeline with individual elements organized into 9 phases.\n\n![alttext](https://github.com/UrbsLab/STREAMLINE/blob/main/docs/source/pictures/STREAMLINE_paper_new_lightcolor.png?raw=true)\n\n* Detailed documentation of STREAMLINE is available [here](https://urbslab.github.io/STREAMLINE/index.html).\n\n* A simple demonstration of STREAMLINE on example biomedical data is available in our ready-to-run Google Colab Notebook [here](https://colab.research.google.com/drive/14AEfQ5hUPihm9JB2g730Fu3LiQ15Hhj2?usp=sharing).\n\n* A video tutorial playlist covering all aspects of STREAMLINE is available [here](https://www.youtube.com/playlist?list=PLafPhSv1OSDcvu8dcbxb-LHyasQ1ZvxfJ).\n\n### YouTube Overview of STREAMLINE\n[![IMAGE ALT TEXT HERE](https://img.youtube.com/vi/xVc4JEbnIs8/0.jpg)](https://www.youtube.com/watch?v=xVc4JEbnIs8)\n\n### Pipeline Design\nThe goal of STREAMLINE is to provide an easy and transparent framework\nto reliably learn predictive associations from tabular data with a particular focus on the needs of biomedical data applications. \nThe design of this pipeline is meant not only to pick a best performing algorithm/model for a given dataset,\nbut also to leverage the different algorithm perspectives (i.e. biases, strengths,\nand weaknesses) to gain a broader understanding of the associations in that data.\n\nThe overall development of this pipeline focused on:\n   1. Automation and ease of use\n   2. Optimizing modeling performance\n   3. Capturing complex associations in data (e.g. feature interactions)\n   4. Enhancing interpretability of output throughout the analysis\n   5. Avoiding and detecting common sources of bias\n   6. Reproducibility (see STREAMLINE parameter settings)\n   7. 
Run mode flexibility (accommodates users with different levels of expertise)\n   8. Extensibility (more advanced users can easily add their own scikit-learn compatible modeling algorithms to STREAMLINE)\n\nSee the [About (FAQs)](https://urbslab.github.io/STREAMLINE/about.html) to gain a deeper understanding of STREAMLINE with respect to its overall design, what it includes, what it can be used for, and implementation highlights that differentiate it from other AutoML tools.\n\n### Current Limitations\n* At present, STREAMLINE is limited to supervised learning on tabular,\nbinary classification data. We are currently expanding STREAMLINE to multi-class\nand regression outcome data. \n\n* STREAMLINE also does not automate feature extraction from unstructured data (e.g. text, images, video, time-series data), or handle more advanced aspects of data cleaning or feature engineering that would likely require domain expertise for a given dataset. \n\n* As STREAMLINE is currently in its 'beta' release, we recommend users first check that they have downloaded the\nmost recent release of STREAMLINE before use. We are actively updating this software as feedback is received.\n\n### Publications and Citations\nThe most recent publication on STREAMLINE (release Beta 0.3.4), which includes benchmarking on simulated data and an application to obstructive sleep apnea risk prediction as a clinical outcome, is available as a preprint on arXiv [here](https://doi.org/10.48550/arXiv.2312.05461). 
\n\nThe first publication detailing the initial implementation of STREAMLINE (release Beta 0.2.4) and applying it to\nsimulated benchmark data can be found [here](https://link.springer.com/chapter/10.1007/978-981-19-8460-0_9), or as a preprint on arXiv, [here](https://arxiv.org/abs/2206.12002?fbclid=IwAR1toW5AtDJQcna0_9Sj73T9kJvuB-x-swnQETBGQ8lSwBB0z2N1TByEwlw).\n\nSee [citations](https://urbslab.github.io/STREAMLINE/citation.html) for more information on citing STREAMLINE, as well as publications applying STREAMLINE and publications on algorithms developed in our research group and incorporated into STREAMLINE.\n\n***\n# Installation and Use\nSTREAMLINE can be run using a variety of modes balancing ease of use and efficiency.\n* Google Colab Notebook: runs serially on Google Cloud (best for beginners)\n* Jupyter Notebook: runs serially/locally\n* Command Line: runs locally or on a cluster, serially or in parallel\n   * Locally, serially\n   * Locally, CPU cores in parallel\n   * CPU Computing Cluster (HPC), in parallel (best for efficiency)\n      * All phases can be run from a single command (with a job monitor/submitter running on the head node until completion)\n      * Each phase can be run separately in sequence\n\nSee the [documentation](https://urbslab.github.io/STREAMLINE/index.html) for requirements, installation, and use details for each.\n\nBasic installation instructions for Google Colab and local runs are given below.\n\n### Google Colab\nNo local installation or additional steps are required to run\nSTREAMLINE on Google Colab.\n\nJust have a Google Account and open this Colab link to run the demo (takes ~ 6-7 min):\n[https://colab.research.google.com/drive/14AEfQ5hUPihm9JB2g730Fu3LiQ15Hhj2?usp=sharing](https://colab.research.google.com/drive/14AEfQ5hUPihm9JB2g730Fu3LiQ15Hhj2?usp=sharing)\n\n\n### Local\nInstall STREAMLINE for local use with the following commands:\n\n```\ngit clone --single-branch https://github.com/UrbsLab/STREAMLINE\ncd STREAMLINE\npip 
install -r requirements.txt\n```\n\nNow your STREAMLINE package is ready to use from the `STREAMLINE` folder either\nfrom the included [Jupyter Notebook](https://github.com/UrbsLab/STREAMLINE/blob/main/STREAMLINE_Notebook.ipynb) file or the command line.\n\n***\n# Other Information\n## Demonstration Data\nIncluded with this pipeline is a folder named `DemoData` including [two small datasets](https://urbslab.github.io/STREAMLINE/data.html#demonstration-data) used as a demonstration of\npipeline efficacy. New users can easily test/run STREAMLINE in all run modes set up to run automatically on these datasets.\n\n## List of Run Parameters\nA complete list of STREAMLINE Parameters can be found [here](https://urbslab.github.io/STREAMLINE/parameters.html).\n\n***\n## Disclaimer\nWe make no claim that this is the best or only viable way to assemble an ML analysis pipeline for a given\nclassification problem, nor that the included ML modeling algorithms will yield the best performance possible.\nWe intend many expansions/improvements to this pipeline in the future. 
We welcome feedback, suggestions, and contributions for improvement.\n\n***\n# Contact\nWe welcome ideas, suggestions on improving the pipeline, [code-contributions](https://urbslab.github.io/STREAMLINE/contributing.html), and collaborations!\n\n* For general questions, or to discuss potential collaborations (applying or extending STREAMLINE), contact Ryan Urbanowicz at ryan.urbanowicz@cshs.org.\n\n* For questions about the code-base or installing/running STREAMLINE, to report bugs, or to discuss other troubleshooting issues, contact Harsh Bandhey at harsh.bandhey@cshs.org.\n\n# Other STREAMLINE Tutorial Videos on YouTube\n### A Brief Introduction to Automated Machine Learning\n[![IMAGE ALT TEXT HERE](https://img.youtube.com/vi/IjX0phz3LLE/0.jpg)](https://www.youtube.com/watch?v=IjX0phz3LLE)\n\n### A Detailed Walkthrough\n[![IMAGE ALT TEXT HERE](https://img.youtube.com/vi/sAB8d1KnMDw/0.jpg)](https://www.youtube.com/watch?v=sAB8d1KnMDw)\n\n### Input Data\n[![IMAGE ALT TEXT HERE](https://img.youtube.com/vi/5HnangrEF5E/0.jpg)](https://www.youtube.com/watch?v=5HnangrEF5E)\n\n### Run Parameters\n[![IMAGE ALT TEXT HERE](https://img.youtube.com/vi/qMi9vhVag-4/0.jpg)](https://www.youtube.com/watch?v=qMi9vhVag-4)\n\n### Running in Google Colab Notebook\n[![IMAGE ALT TEXT HERE](https://img.youtube.com/vi/nknyJWhm7pg/0.jpg)](https://www.youtube.com/watch?v=nknyJWhm7pg)\n\n### Running in Jupyter Notebook\n[![IMAGE ALT TEXT HERE](https://img.youtube.com/vi/blat3gAfUaI/0.jpg)](https://www.youtube.com/watch?v=blat3gAfUaI)\n\n### Running From Command Line\n[![IMAGE ALT TEXT HERE](https://img.youtube.com/vi/-5yjGxnJ7eI/0.jpg)](https://www.youtube.com/watch?v=-5yjGxnJ7eI)\n\n***\n# Acknowledgements\nThe development of STREAMLINE benefited from feedback across multiple biomedical research collaborators at the University of Pennsylvania, Fox Chase Cancer Center, Cedars Sinai Medical Center, and the University of Kansas Medical Center.\n\nThe bulk of the coding was completed by Ryan 
Urbanowicz, Robert Zhang, and Harsh Bandhey. Special thanks to\nYuhan Cui, Pranshu Suri, Patryk Orzechowski, Trang Le, Sy Hwang, Richard Zhang, Wilson Zhang,\nand Pedro Ribeiro for their code contributions and feedback.  \n\nWe also thank the following collaborators for their feedback on application\nof the pipeline during development: Shannon Lynch, Rachael Stolzenberg-Solomon,\nUlysses Magalang, Allan Pack, Brendan Keenan, Danielle Mowery, Jason Moore, and Diego Mazzotti.\n\nFunding supporting this work comes from NIH grants: R01 AI173095, U01 AG066833, and P01 HL160471.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Furbslab%2Fstreamline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Furbslab%2Fstreamline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Furbslab%2Fstreamline/lists"}