{"id":20644617,"url":"https://github.com/epistasislab/tpot2","last_synced_at":"2025-04-09T06:09:37.776Z","repository":{"id":156818044,"uuid":"618219085","full_name":"EpistasisLab/tpot2","owner":"EpistasisLab","description":"A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming. ","archived":false,"fork":false,"pushed_at":"2024-09-17T21:58:58.000Z","size":5558,"stargazers_count":186,"open_issues_count":45,"forks_count":26,"subscribers_count":9,"default_branch":"main","last_synced_at":"2024-09-18T02:51:23.200Z","etag":null,"topics":["adsp","ag066833","aiml","alzheimer","alzheimers","automated-machine-learning","automation","automl","data-science","feature-engineering","gradient-boosting","hyperparameter-optimization","lm010098","machine-learning","model-selection","nia","parameter-tuning","python","random-forest","scikit-learn"],"latest_commit_sha":null,"homepage":"https://epistasislab.github.io/tpot2/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"lgpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/EpistasisLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":"docs/support.md","governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-24T01:56:05.000Z","updated_at":"2024-09-17T21:58:06.000Z","dependencies_parsed_at":"2023-12-05T02:24:49.708Z","dependency_job_id":"d0a6be5f-d608-4a0d-83ce-32fa544a601d","html_url":"https://github.com/EpistasisLab/tpot2","commit_stats":{"total_commits":272,"total_committers":7,"mean_commits":"38.857142857142854","dds":"0.26102941176470584","last_synced_commit":"908eeca1af6c23b99ca81edb34f8be5d457d4a90"},"previous_names"
:[],"tags_count":10,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EpistasisLab%2Ftpot2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EpistasisLab%2Ftpot2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EpistasisLab%2Ftpot2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EpistasisLab%2Ftpot2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/EpistasisLab","download_url":"https://codeload.github.com/EpistasisLab/tpot2/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247987285,"owners_count":21028895,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["adsp","ag066833","aiml","alzheimer","alzheimers","automated-machine-learning","automation","automl","data-science","feature-engineering","gradient-boosting","hyperparameter-optimization","lm010098","machine-learning","model-selection","nia","parameter-tuning","python","random-forest","scikit-learn"],"created_at":"2024-11-16T16:17:01.287Z","updated_at":"2025-04-09T06:09:37.743Z","avatar_url":"https://github.com/EpistasisLab.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# TPOT\n\n\u003ccenter\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/EpistasisLab/tpot/master/images/tpot-logo.jpg\" width=300 
/\u003e\n\u003c/center\u003e\n\n\u003cbr\u003e\n\n![Tests](https://github.com/EpistasisLab/tpot/actions/workflows/tests.yml/badge.svg)\n[![PyPI Downloads](https://img.shields.io/pypi/dm/tpot?label=pypi%20downloads)](https://pypi.org/project/TPOT)\n[![Conda Downloads](https://img.shields.io/conda/dn/conda-forge/tpot?label=conda%20downloads)](https://anaconda.org/conda-forge/tpot)\n\nTPOT stands for Tree-based Pipeline Optimization Tool. TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming. Consider TPOT your Data Science Assistant.\n\n## Contributors\n\nTPOT recently went through a major refactoring. The package was rewritten from scratch to improve efficiency and performance, support new features, and fix numerous bugs. New features include genetic feature selection, a significantly expanded and more flexible method of defining search spaces, multi-objective optimization, a more modular framework allowing for easier customization of the evolutionary algorithm, and more. While in development, this new version was referred to as \"TPOT2\" but we have now merged what was once TPOT2 into the main TPOT package. You can learn more about this new version of TPOT in our GPTP paper titled \"TPOT2: A New Graph-Based Implementation of the Tree-Based Pipeline Optimization Tool for Automated Machine Learning.\"\n\n    Ribeiro, P. et al. (2024). TPOT2: A New Graph-Based Implementation of the Tree-Based Pipeline Optimization Tool for Automated Machine Learning. In: Winkler, S., Trujillo, L., Ofria, C., Hu, T. (eds) Genetic Programming Theory and Practice XX. Genetic and Evolutionary Computation. Springer, Singapore. 
https://doi.org/10.1007/978-981-99-8413-8_1\n\nThe current version of TPOT was developed at Cedars-Sinai by:  \n    - Pedro Henrique Ribeiro (Lead developer - https://github.com/perib, https://www.linkedin.com/in/pedro-ribeiro/)  \n    - Anil Saini (anil.saini@cshs.org)  \n    - Jose Hernandez (jgh9094@gmail.com)  \n    - Jay Moran (jay.moran@cshs.org)  \n    - Nicholas Matsumoto (nicholas.matsumoto@cshs.org)  \n    - Hyunjun Choi (hyunjun.choi@cshs.org)  \n    - Miguel E. Hernandez (miguel.e.hernandez@cshs.org)  \n    - Jason Moore (moorejh28@gmail.com)  \n\nThe original version of TPOT was primarily developed at the University of Pennsylvania by:  \n    - Randal S. Olson (rso@randalolson.com)  \n    - Weixuan Fu (weixuanf@upenn.edu)  \n    - Daniel Angell (dpa34@drexel.edu)  \n    - Jason Moore (moorejh28@gmail.com)  \n    - and many more generous open-source contributors  \n\n## License\n\nPlease see the [repository license](https://github.com/EpistasisLab/tpot/blob/main/LICENSE) for the licensing and usage information for TPOT.\nGenerally, we have licensed TPOT to make it as widely usable as possible.\n\nTPOT is free software: you can redistribute it and/or modify\nit under the terms of the GNU Lesser General Public License as\npublished by the Free Software Foundation, either version 3 of\nthe License, or (at your option) any later version.\n\nTPOT is distributed in the hope that it will be useful,\nbut WITHOUT ANY WARRANTY; without even the implied warranty of\nMERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the\nGNU Lesser General Public License for more details.\n\nYou should have received a copy of the GNU Lesser General Public\nLicense along with TPOT. 
If not, see \u003chttp://www.gnu.org/licenses/\u003e.\n\n## Documentation\n\n[The documentation webpage can be found here.](https://epistasislab.github.io/tpot/)\n\nWe also recommend looking at the Tutorials folder for jupyter notebooks with examples and guides.\n\n## Installation\n\nTPOT requires a working installation of Python.\n\n### Creating a conda environment (optional)\n\nWe recommend using conda environments for installing TPOT, though it would work equally well if manually installed without it.\n\n[More information on making anaconda environments found here.](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html)\n\n```\nconda create --name tpotenv python=3.10\nconda activate tpotenv\n```\n\n### Packages Used\n\npython version \u003c3.12\nnumpy\nscipy\nscikit-learn\nupdate_checker\ntqdm\nstopit\npandas\njoblib\nxgboost\nmatplotlib\ntraitlets\nlightgbm\noptuna\njupyter\nnetworkx\ndask\ndistributed\ndask-ml\ndask-jobqueue\nfunc_timeout\nconfigspace\n\nMany of the hyperparameter ranges used in our configspaces were adapted from either the original TPOT package or the AutoSklearn package. \n\n### Note for M1 Mac or other Arm-based CPU users\n\nYou need to install the lightgbm package directly from conda using the following command before installing TPOT. \n\nThis is to ensure that you get the version that is compatible with your system.\n\n```\nconda install --yes -c conda-forge 'lightgbm\u003e=3.3.3'\n```\n\n### Installing Extra Features with pip\n\nIf you want to utilize the additional features provided by TPOT along with `scikit-learn` extensions, you can install them using `pip`. The command to install TPOT with these extra features is as follows:\n\n```\npip install tpot[sklearnex]\n```\n\nPlease note that while these extensions can speed up scikit-learn packages, there are some important considerations:\n\nThese extensions may not be fully developed and tested on Arm-based CPUs, such as M1 Macs. 
You might encounter compatibility issues or reduced performance on such systems.\n\nWe recommend using Python 3.9 when installing these extra features, as it provides better compatibility and stability.\n\n\n### Developer/Latest Branch Installation\n\n\n```\npip install -e /path/to/tpotrepo\n```\n\nIf you downloaded with git clone, then the repository folder will be named TPOT. (Note: this folder is the one that includes setup.py inside of it and not the folder of the same name inside it).\nIf you downloaded as a zip, the folder may be called tpot-main. \n\n\n## Usage \n\nSee the Tutorials Folder for more instructions and examples.\n\n### Best Practices\n\n#### 1 \nTPOT uses dask for parallel processing. When Python is parallelized, each module is imported within each process. Therefore, it is important to protect all code within an `if __name__ == \"__main__\":` block when running TPOT from a script. This is not required when running TPOT from a notebook.\n\nFor example:\n\n```\n# my_analysis.py\n\nimport tpot\n\nif __name__ == \"__main__\":\n    X, y = load_my_data()\n    est = tpot.TPOTClassifier()\n    est.fit(X, y)\n    # rest of analysis\n```\n\n#### 2\n\nWhen designing custom objective functions, avoid the use of global variables.\n\nDon't do:\n```\nglobal_X = [[1,2],[4,5]]\nglobal_y = [0,1]\ndef foo(est):\n    return my_scorer(est, X=global_X, y=global_y)\n\n```\n\nInstead, use a partial:\n\n```\nfrom functools import partial\n\ndef foo_scorer(est, X, y):\n    return my_scorer(est, X, y)\n\nif __name__=='__main__':\n    X = [[1,2],[4,5]]\n    y = [0,1]\n    final_scorer = partial(foo_scorer, X=X, y=y)\n```\n\nThe same applies to lambda functions.\n\nDon't do:\n\n```\ndef new_objective(est, a, b):\n    pass  # definition\n\na = 100\nb = 20\nbad_function = lambda est: new_objective(est=est, a=a, b=b)\n```\n\nDo:\n```\ndef new_objective(est, a, b):\n    pass  # definition\n\na = 100\nb = 20\ngood_function = lambda est, a=a, b=b: new_objective(est=est, a=a, b=b)\n```\n\n### Tips\n\nTPOT 
will not check if your data is correctly formatted. It will assume that you have passed in operators that can handle the type of data that was passed in. For instance, if you pass in a pandas DataFrame with categorical features and missing data, then you should also include in your configuration operators that can handle those features of the data. Alternatively, if you pass in `preprocessing = True`, TPOT will impute missing values, one-hot encode categorical features, then standardize the data. (Note that this is currently fitted and transformed on the entire training set before splitting for CV. Later there will be an option to apply per fold, and have the parameters be learnable.)\n\n\nSetting `verbose` to 5 can be helpful during debugging as it will print out the errors generated by failing pipelines. \n\n\n## Contributing to TPOT\n\nWe welcome you to check the existing issues for bugs or enhancements to work on. If you have an idea for an extension to TPOT, please file a new issue so we can discuss it.\n\n## Citing TPOT\n\nIf you use TPOT in a scientific publication, please consider citing at least one of the following papers:\n\nTrang T. Le, Weixuan Fu and Jason H. Moore (2020). [Scaling tree-based automated machine learning to biomedical big data with a feature set selector](https://academic.oup.com/bioinformatics/article/36/1/250/5511404). *Bioinformatics*, 36(1): 250-256.\n\nBibTeX entry:\n\n```bibtex\n@article{le2020scaling,\n  title={Scaling tree-based automated machine learning to biomedical big data with a feature set selector},\n  author={Le, Trang T and Fu, Weixuan and Moore, Jason H},\n  journal={Bioinformatics},\n  volume={36},\n  number={1},\n  pages={250--256},\n  year={2020},\n  publisher={Oxford University Press}\n}\n```\n\n\nRandal S. Olson, Ryan J. Urbanowicz, Peter C. Andrews, Nicole A. Lavender, La Creis Kidd, and Jason H. Moore (2016). 
[Automating biomedical data science through tree-based pipeline optimization](http://link.springer.com/chapter/10.1007/978-3-319-31204-0_9). *Applications of Evolutionary Computation*, pages 123-137.\n\nBibTeX entry:\n\n```bibtex\n@inbook{Olson2016EvoBio,\n    author={Olson, Randal S. and Urbanowicz, Ryan J. and Andrews, Peter C. and Lavender, Nicole A. and Kidd, La Creis and Moore, Jason H.},\n    editor={Squillero, Giovanni and Burelli, Paolo},\n    chapter={Automating Biomedical Data Science Through Tree-Based Pipeline Optimization},\n    title={Applications of Evolutionary Computation: 19th European Conference, EvoApplications 2016, Porto, Portugal, March 30 -- April 1, 2016, Proceedings, Part I},\n    year={2016},\n    publisher={Springer International Publishing},\n    pages={123--137},\n    isbn={978-3-319-31204-0},\n    doi={10.1007/978-3-319-31204-0_9},\n    url={http://dx.doi.org/10.1007/978-3-319-31204-0_9}\n}\n```\n\nRandal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H. Moore (2016). [Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science](http://dl.acm.org/citation.cfm?id=2908918). *Proceedings of GECCO 2016*, pages 485-492.\n\nBibTeX entry:\n\n```bibtex\n@inproceedings{OlsonGECCO2016,\n    author = {Olson, Randal S. and Bartley, Nathan and Urbanowicz, Ryan J. 
and Moore, Jason H.},\n    title = {Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science},\n    booktitle = {Proceedings of the Genetic and Evolutionary Computation Conference 2016},\n    series = {GECCO '16},\n    year = {2016},\n    isbn = {978-1-4503-4206-3},\n    location = {Denver, Colorado, USA},\n    pages = {485--492},\n    numpages = {8},\n    url = {http://doi.acm.org/10.1145/2908812.2908918},\n    doi = {10.1145/2908812.2908918},\n    acmid = {2908918},\n    publisher = {ACM},\n    address = {New York, NY, USA},\n}\n```\n\n## Support for TPOT\n\nTPOT was developed in the [Artificial Intelligence Innovation (A2I) Lab](http://epistasis.org/) at Cedars-Sinai with funding from the [NIH](http://www.nih.gov/) under grants U01 AG066833 and R01 LM010098. We are incredibly grateful for the support of the NIH and the Cedars-Sinai during the development of this project.\n\nThe TPOT logo was designed by Todd Newmuis, who generously donated his time to the project.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepistasislab%2Ftpot2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fepistasislab%2Ftpot2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepistasislab%2Ftpot2/lists"}