{"id":30892238,"url":"https://github.com/babilonczyk/bio-ai-software-engineering-roadmap","last_synced_at":"2025-09-08T19:12:28.663Z","repository":{"id":306921325,"uuid":"1027684956","full_name":"babilonczyk/bio-ai-software-engineering-roadmap","owner":"babilonczyk","description":"Explore roadmap to becoming a Bio AI Software Engineer - combining machine learning, bioinformatics, and software engineering to build the future of biotechnology. Join the journey on GitHub! ✨","archived":false,"fork":false,"pushed_at":"2025-08-16T17:46:36.000Z","size":32,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-08-16T19:18:05.764Z","etag":null,"topics":["ai-in-biology","artificial-intelligence","bioai","bioinformatics","data-science","deep-learning","drug-discovery","fastapi","genomics","machine-learning","protein-design","python","roadmap"],"latest_commit_sha":null,"homepage":"https://www.bioaisoftware.engineer/roadmap","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/babilonczyk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-28T11:31:25.000Z","updated_at":"2025-08-16T17:46:39.000Z","dependencies_parsed_at":"2025-07-28T13:26:12.026Z","dependency_job_id":"28b645b7-b7c3-43ec-8a11-61d6428e1790","html_url":"https://github.com/babilonczyk/bio-ai-software-engineering-roadmap","commit_stats":null,"previous_names":["babilonczyk/bio-ai-software-engineering-roadmap"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ba
bilonczyk/bio-ai-software-engineering-roadmap","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/babilonczyk%2Fbio-ai-software-engineering-roadmap","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/babilonczyk%2Fbio-ai-software-engineering-roadmap/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/babilonczyk%2Fbio-ai-software-engineering-roadmap/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/babilonczyk%2Fbio-ai-software-engineering-roadmap/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/babilonczyk","download_url":"https://codeload.github.com/babilonczyk/bio-ai-software-engineering-roadmap/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/babilonczyk%2Fbio-ai-software-engineering-roadmap/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":274231389,"owners_count":25245585,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-08T02:00:09.813Z","response_time":121,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-in-biology","artificial-intelligence","bioai","bioinformatics","data-science","deep-learning","drug-discovery","fastapi","genomics","machine-learning","protein-design","python","roadmap"],"created_at":"2025-09-08T19:12:21.161Z","updated_at":"2025
-09-08T19:12:28.637Z","avatar_url":"https://github.com/babilonczyk.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🧬 Bio AI Software Engineer Roadmap\n\nWelcome to the **Bio AI Software Engineer Roadmap** - a practical and evolving learning path for developers at the intersection of software engineering, biology, and AI.\n\nWhether you're an AI engineer entering biotech, a bioinformatician diving deeper into ML, or a developer curious about life science tools - this roadmap gives you the real-world skills to build impactful software in biology.\n\nBuilt and maintained alongside the live roadmap at [**bioaisoftware.engineer**](https://bioaisoftware.engineer)\n\n---\n\n## 🧠 What is a Bio AI Software Engineer?\n\nA **Bio AI Software Engineer** builds intelligent tools, pipelines, and infrastructure for biological problems - like protein folding, variant prediction, drug discovery, or lab automation. This includes:\n\n- Writing production-ready Python and backend APIs\n- Applying ML/AI to biological sequences and images\n- Working with bioinformatics pipelines and HPC/cloud compute\n- Making research workflows reproducible, interpretable, and scalable\n\nThey speak both the language of code and biology - and often translate between worlds.\n\n---\n\n## 📚 Roadmap Structure\n\nThe roadmap is split into **7 progressive stages**, each with hands-on projects and verified learning resources:\n\n| Stage | Description |\n|-------|-------------|\n| **Stage 1** | Programming Foundations (Python, CLI, Git, clean code) |\n| **Stage 2** | Software Engineering for Data \u0026 APIs (FastAPI, SQL, testing) |\n| **Stage 3** | Data Literacy \u0026 ML (stats, sklearn, PyTorch, LLMs) |\n| **Stage 4** | Biology \u0026 Bioinformatics Foundations (DNA, proteins, pipelines) |\n| **Stage 5** | Bio-AI (Genomics, Proteomics, LLMs for Bio) |\n| **Stage 6** | Data Engineering \u0026 MLOps (Snakemake, DVC, CI/CD, cloud) |\n| **Stage 7** | Compliance, 
Reproducibility \u0026 Communication |\n\nEach skill has:\n\n- 🔗 Hand-picked courses or docs\n- 💻 A real-world project challenge\n\n---\n\n## 🚀 Getting Started\n\nBrowse the interactive roadmap:  \n👉 [bioaisoftware.engineer/roadmap](https://bioaisoftware.engineer/roadmap)\n\nStart with Stage 1 if you're new to backend development or Python.  \nJump into Stage 4+ if you already know biology but want to learn ML or engineering.\n\n---\n\n## 📌 Table of Contents\n\n- [Stage 1: Programming Foundations](#stage-1-programming-foundations)\n- [Stage 2: Software Engineering for Data \u0026 APIs](#stage-2-software-engineering-for-data--apis)\n- [Stage 3: Data Literacy \u0026 ML Foundations](#stage-3-data-literacy--ml-foundations)\n- [Stage 4: Biology \u0026 Bioinformatics Foundations](#stage-4-biology--bioinformatics-foundations)\n- [Stage 5: Bio-AI (Genomics, Proteomics, Cheminformatics)](#stage-5-bio-ai-genomics-proteomics-cheminformatics)\n- [Stage 6: Data Engineering, Pipelines \u0026 MLOps](#stage-6-data-engineering-pipelines--mlops)\n- [Stage 7: Compliance, Safety \u0026 Communication](#stage-7-compliance-safety--communication)\n\n\n## ✅ Stage 1: Programming Foundations\n**Goal:** Learn core programming, clean coding practices, Git, Linux, and basic scripting. 
\n\n| Skill                      | Topics                                                                 | Recommended Resources |\n|---------------------------|------------------------------------------------------------------------|------------------------|\n| Python Core               | Types, OOP, typing, venv, packaging                                    | [Python Docs](https://docs.python.org/3/) |\n| Dev Environment           | Black, Ruff, logging, .env, structured logging                         | [pre-commit](https://pre-commit.com/) |\n| Git \u0026 GitHub              | Branching, semantic commits, PRs, changelogs                           | [Git Handbook](https://guides.github.com/introduction/git-handbook/) |\n| Linux / CLI Basics        | Bash, grep/sed/awk, SSH, tmux                                          | [LinuxCommand.org](http://linuxcommand.org/) |\n\n\n## 📊 Stage 2: Software Engineering for Data \u0026 APIs\n**Goal:** Learn to process data efficiently and build robust APIs and services.\n\n| Skill                      | Topics                                                              | Recommended Resources |\n|---------------------------|---------------------------------------------------------------------|------------------------|\n| Data Analysis             | NumPy, Pandas, tidy data, visualization                              | [Kaggle Pandas](https://www.kaggle.com/learn/pandas) |\n| SQL \u0026 Data Access         | CTEs, indexing, joins, SQLAlchemy ORM                                | [SQLBolt](https://sqlbolt.com/) |\n| APIs \u0026 FastAPI            | FastAPI, Pydantic, OpenAPI, JWT auth                                 | [FastAPI Docs](https://fastapi.tiangolo.com/) |\n| Testing \u0026 Packaging       | pytest, tox, wheels, SemVer                                          | [pytest Docs](https://docs.pytest.org/) |\n\n\n## 🤖 Stage 3: Data Literacy \u0026 ML Foundations\n**Goal:** Learn statistical thinking, classic ML, deep 
learning, and LLM foundations.\n\n| Skill                      | Topics                                                        | Recommended Resources |\n|---------------------------|---------------------------------------------------------------|------------------------|\n| Statistics for ML         | Hypothesis testing, bootstrapping, effect size               | [Khan Stats](https://www.khanacademy.org/math/statistics-probability) |\n| ML (scikit-learn)         | Pipelines, metrics, CV, model selection                      | [Kaggle ML](https://www.kaggle.com/learn/intro-to-machine-learning) |\n| Deep Learning (PyTorch)   | CNNs, RNNs, transformers, autograd, schedulers               | [PyTorch](https://pytorch.org/tutorials/) |\n| LLMs \u0026 RAG                | Embeddings, retrieval, prompt engineering                    | [HuggingFace NLP](https://huggingface.co/learn/nlp-course) |\n\n\n## 🧬 Stage 4: Biology \u0026 Bioinformatics Foundations\n**Goal:** Understand the biological data and systems you're working with.\n\n| Skill                      | Topics                                                        | Recommended Resources |\n|---------------------------|---------------------------------------------------------------|------------------------|\n| Molecular Biology         | DNA/RNA, expression, mutations                                | [Khan Biology](https://www.khanacademy.org/science/biology) |\n| Bio Data Formats \u0026 Repos  | FASTA, FASTQ, BAM, PDB, Ensembl                               | [NCBI Developer](https://www.ncbi.nlm.nih.gov/home/develop/) |\n| Bioinformatics Tools      | BLAST, MAFFT, bcftools, VCF                                   | [Rosalind](https://rosalind.info/) |\n| Protein Structure         | PDB, motifs, UniProt, visualization                           | [PDB 101](https://pdb101.rcsb.org/) |\n\n\n## 🧪 Stage 5: Bio-AI (Genomics, Proteomics, Cheminformatics)\n**Goal:** Apply AI models to biological sequences, protein 
structures, and small molecules.\n\n| Skill                      | Topics                                                        | Recommended Resources |\n|---------------------------|---------------------------------------------------------------|------------------------|\n| AI for Genomics           | Variant effect prediction, embeddings                         | [Hugging Face Spaces](https://huggingface.co/spaces?search=genomics) |\n| Protein Language Models   | ProtTrans, ESM, similarity search                             | [Meta ESM](https://github.com/facebookresearch/esm) |\n| Structure Prediction      | AlphaFold, OpenFold, pLDDT                                   | [AlphaFold DB](https://alphafold.ebi.ac.uk/) |\n| Cheminformatics           | RDKit, SMILES, ADMET, QSAR                                   | [DeepChem](https://deepchem.io/) |\n| LLMs for Bio              | Tool use, agents, protocol copilot                            | [LangChain Docs](https://python.langchain.com/) |\n\n\n## ⚙️ Stage 6: Data Engineering, Pipelines \u0026 MLOps\n**Goal:** Build reproducible, scalable pipelines with versioning and deployment.\n\n| Skill                      | Topics                                                        | Recommended Resources |\n|---------------------------|---------------------------------------------------------------|------------------------|\n| Data Engineering          | Parquet, ETL, Airflow, Great Expectations                    | [DataTalks Zoomcamp](https://github.com/DataTalksClub/data-engineering-zoomcamp) |\n| Reproducible Pipelines    | Snakemake, Nextflow, containers                               | [Snakemake Docs](https://snakemake.readthedocs.io/) |\n| Experiment Tracking       | MLflow, W\u0026B, model registry                                  | [MLflow Docs](https://mlflow.org/docs/latest/index.html) |\n| Cloud \u0026 HPC               | AWS, GCP, SLURM, cost control                                | [SLURM Quick 
Start](https://slurm.schedmd.com/quickstart.html) |\n| Deployment \u0026 CI/CD        | Docker, GitHub Actions, autoscaling                          | [GitHub Actions Docs](https://docs.github.com/actions) |\n\n\n## 🧾 Stage 7: Compliance, Safety \u0026 Communication\n**Goal:** Make your work reproducible, ethical, and understandable.\n\n| Skill                      | Topics                                                        | Recommended Resources |\n|---------------------------|---------------------------------------------------------------|------------------------|\n| Data Governance           | PII, HIPAA, audit trails                                     | [NIST Privacy Framework](https://www.nist.gov/privacy-framework) |\n| Reproducible Science      | FAIR principles, DVC, DOIs                                   | [DVC Docs](https://dvc.org/doc) |\n| Communication             | Visuals, explainability, methods sections                    | [Nature Data Viz](https://www.nature.com/collections/gjcfhifjhc) |\n\n---\n\n\n## ✅ Technologies Covered\n\n- Python, Pandas, PyTorch, FastAPI\n- scikit-learn, transformers, LangChain\n- Docker, GitHub Actions, SQL, Airflow\n- Snakemake, Nextflow, Biopython, UniProt, AlphaFold\n- LLMs, RAG, FAISS, vector DBs\n- DVC, MLflow, RDKit, DeepChem\n\nAnd more - updated continuously.\n\n---\n\n## 🔬 Example Project Challenges\n\nEvery skill is paired with a small but powerful project:\n\n| Project | Description |\n|---------|-------------|\n| **DNA→Protein Translator** | Build a tool that converts DNA to amino acid chains using codon tables |\n| **Microscopy Image Classifier** | Train a CNN to triage cellular image quality |\n| **Sample Registry API** | Serve metadata with FastAPI, JWT auth, and OpenAPI docs |\n| **Variant Effect Scorer** | Use sequence models to rank genomic variants for lab validation |\n| **Reproducible RNA-seq Pipeline** | Build an RNA-seq workflow with Nextflow and containers |\n| **RAG Assistant for 
Protocols** | QA system over lab protocols with citation-backed answers |\n\nMore projects are coming soon, all designed for clarity, real-world value, and resume use.\n\n---\n\n## 🌐 Tools \u0026 Links\n\n- [bioaisoftware.engineer](https://bioaisoftware.engineer) – Main roadmap, articles, visualizer\n- [biotechsoftware.engineer](https://biotechsoftware.engineer) – Community-driven list of biotech software engineers available for hire or collaboration on bio, AI, and software roles\n- [Roadmap.sh](https://roadmap.sh/bio-ai-software-engineer) – Interactive roadmap format\n- [AlphaFold DB](https://alphafold.ebi.ac.uk/)\n- [UniProt](https://www.uniprot.org/)\n- [NCBI Developer Hub](https://www.ncbi.nlm.nih.gov/home/develop/)\n- [FAIR Principles](https://www.go-fair.org/fair-principles/)\n- [MLflow](https://mlflow.org/docs/latest/index.html)\n- [DeepLearning.ai Courses](https://www.deeplearning.ai/short-courses/)\n\n---\n\n## 🤝 Contributing\n\nThis roadmap is actively maintained.  \nPlease suggest improvements, add quality resources, or flag outdated links by opening an issue.\n\n---\n\n## 📄 License\n\nThis project is open and free to use, modify, and remix for non-commercial learning.  \nFor inquiries about collaboration or licensing for teaching/research, get in touch via the site.\n\n---\n\n_This roadmap was built by engineers, not marketers.  \nNo fluff. Just skills that matter in bio, ML, and software._\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbabilonczyk%2Fbio-ai-software-engineering-roadmap","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbabilonczyk%2Fbio-ai-software-engineering-roadmap","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbabilonczyk%2Fbio-ai-software-engineering-roadmap/lists"}