https://github.com/not-that-guy-again/bioaf
A self-hosted platform for orchestrating bioinformatics pipelines, managing experimental metadata, and running reproducible compute workloads.
- Host: GitHub
- URL: https://github.com/not-that-guy-again/bioaf
- Owner: not-that-guy-again
- License: apache-2.0
- Created: 2026-03-05T23:55:09.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-05-02T23:25:47.000Z (about 1 month ago)
- Last Synced: 2026-05-03T00:37:52.370Z (about 1 month ago)
- Topics: bioinformatics, computational-biology, data-provenance, gcp, genomics, kubernetes, lims, ngs, pipeline-management, platform-engineering, reproducible-research, research-data-management, scientific-workflows, terraform, workflow-orchestration
- Language: Python
- Homepage:
- Size: 6.81 MB
- Stars: 4
- Watchers: 2
- Forks: 1
- Open Issues: 4
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Codeowners: .github/CODEOWNERS
# bioAF
Computational Biology Automation Framework
A turnkey computational biology platform for small biotech companies (5-50 researchers), deployed on Google Cloud Platform. bioAF provides a web-based control plane for managing HPC clusters, notebook environments, pipeline engines, and data visualization tools -- all provisioned through UI-driven Terraform.
## Features
- **Experiment Tracking** - MINSEQE-compliant metadata, sample management, batch processing, project organization
- **Compute Orchestration** - Kubernetes (GKE) compute via the BioAF Adapter Layer, JupyterHub/RStudio notebooks, versioned compute environments, auto-scaling, Cloud Build image pipeline
- **Pipeline Engine** - Nextflow integration, custom pipelines, pipeline catalog, run monitoring, parameter management
- **Data Management** - File upload/download, dataset browser, GCS storage integration, GEO export, SuperSeries cross-experiment packaging
- **Results & Visualization** - QC dashboards, cellxgene single-cell viewer, plot archive, search
- **SSH Access** - One-click kubectl exec into running pipeline jobs and notebook sessions
- **Notifications** - Event-driven alerts via in-app, email (SMTP), and Slack (OAuth integration)
- **Cost Center** - GCP billing integration, budget alerts, component cost breakdown, projections
- **Backup & Recovery** - 4-tier GCS backups (pg_dump, GCS versioning, platform config, terraform state), restore with review period
- **Session Credentials** - Per-user RStudio credentials with PAM authentication, auto-generated usernames
- **Role-Based Access** - Permission-based RBAC with four built-in roles, custom role creation, and per-resource/action grants
- **Upgrade System** - GitHub-based version checking, managed upgrade flow with rollback
- **Audit Log** - Immutable audit trail with filtering, pagination, and human-readable descriptions
- **GitOps** - Version-controlled platform configuration with diff and rollback
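The per-resource/action grants mentioned under Role-Based Access can be pictured with a small model. This is only a sketch under stated assumptions: the README does not name the four built-in roles or describe the storage schema, so `Grant`, `Role`, `is_allowed`, and the role names `viewer` and `pipeline-runner` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Grant:
    """One permission: an action on a resource type (hypothetical model)."""
    resource: str  # e.g. "experiment"
    action: str    # e.g. "read"


@dataclass
class Role:
    """A named bundle of grants; custom roles are just new instances."""
    name: str
    grants: set[Grant] = field(default_factory=set)

    def allows(self, resource: str, action: str) -> bool:
        return Grant(resource, action) in self.grants


def is_allowed(roles: list[Role], resource: str, action: str) -> bool:
    """A user may act if any one of their roles carries the matching grant."""
    return any(role.allows(resource, action) for role in roles)


# Hypothetical roles, for illustration only.
viewer = Role("viewer", {Grant("experiment", "read")})
runner = Role("pipeline-runner", {Grant("pipeline", "launch")})
```

With these definitions, `is_allowed([viewer], "pipeline", "launch")` is false, while adding `runner` to the list makes the same check pass.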
## Architecture
### How it works
A computational biologist registers an experiment, links FASTQ files (uploaded or auto-ingested from a sequencer drop), selects a pipeline from the catalog (nf-core/scrnaseq, rnaseq, or custom), and launches a run.

The **BioAF Adapter Layer** handles everything below that: staging inputs from GCS, submitting Kubernetes Jobs to GKE Autopilot, monitoring execution via Nextflow trace parsing, collecting outputs back to GCS, and transitioning the experiment through its status lifecycle (`registered` -> `library_prep` -> `sequencing` -> `fastq_uploaded` -> `processing` -> `pipeline_complete` -> [`reviewed` ->] `analysis` -> `complete`). Pipeline completion triggers event-driven notifications (in-app, email, Slack), and results are browsable through the plot archive, cellxgene viewer, and GEO export tools.

Jupyter and RStudio sessions run as Kubernetes Pods with GCS-backed home directories and SSH access. RStudio sessions use per-user PAM authentication ([ADR-030](decisions/ADR-030-session-credentials-pam-auth.md)), and notebook container images are managed as versioned environments ([ADR-033](decisions/ADR-033-versioned-compute-environments.md)), built automatically via Cloud Build ([ADR-031](decisions/ADR-031-notebook-image-build-pipeline.md)).
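The experiment status lifecycle described above reads naturally as a small state machine. The sketch below is illustrative, not bioAF's actual code: the status strings come from the README, while `Status`, `TRANSITIONS`, and `advance` are hypothetical names. Failure and retry states are not described in the source and are left out.

```python
from enum import Enum


class Status(str, Enum):
    REGISTERED = "registered"
    LIBRARY_PREP = "library_prep"
    SEQUENCING = "sequencing"
    FASTQ_UPLOADED = "fastq_uploaded"
    PROCESSING = "processing"
    PIPELINE_COMPLETE = "pipeline_complete"
    REVIEWED = "reviewed"
    ANALYSIS = "analysis"
    COMPLETE = "complete"


# Allowed forward transitions. `reviewed` is optional, so
# `pipeline_complete` may move to either `reviewed` or `analysis`.
TRANSITIONS = {
    Status.REGISTERED: {Status.LIBRARY_PREP},
    Status.LIBRARY_PREP: {Status.SEQUENCING},
    Status.SEQUENCING: {Status.FASTQ_UPLOADED},
    Status.FASTQ_UPLOADED: {Status.PROCESSING},
    Status.PROCESSING: {Status.PIPELINE_COMPLETE},
    Status.PIPELINE_COMPLETE: {Status.REVIEWED, Status.ANALYSIS},
    Status.REVIEWED: {Status.ANALYSIS},
    Status.ANALYSIS: {Status.COMPLETE},
    Status.COMPLETE: set(),
}


def advance(current: Status, target: Status) -> Status:
    """Return `target` if the move is legal, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Keeping the transitions in one table makes the optional review step explicit and rejects jumps such as `registered` straight to `complete`.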
The adapter layer ([ADR-020](decisions/ADR-020-bioaf-adapter-layer.md)) abstracts compute, storage, and notebook providers behind clean interfaces, so all application logic is decoupled from infrastructure specifics. Today that means GKE + GCS ([ADR-021](decisions/ADR-021-kubernetes-compute-backend.md), [ADR-022](decisions/ADR-022-gcs-storage-backend.md)).
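To make the idea of "clean interfaces" concrete, here is a minimal sketch of what provider abstractions of this kind can look like. Every name in it (`StorageProvider`, `ComputeProvider`, `launch_run`, and the in-memory fakes) is an assumption made for illustration; the real interfaces are defined in the ADRs and the backend code, not here.

```python
from dataclasses import dataclass, field
from typing import Protocol


class StorageProvider(Protocol):
    """Hypothetical storage interface (a GCS backend would implement it)."""

    def stage_inputs(self, uris: list[str]) -> list[str]: ...

    def collect_outputs(self, run_id: str) -> list[str]: ...


class ComputeProvider(Protocol):
    """Hypothetical compute interface (a GKE backend would implement it)."""

    def submit_job(self, run_id: str, inputs: list[str]) -> str: ...


@dataclass
class FakeStorage:
    """In-memory stand-in, used only to exercise the interface."""
    outputs: dict[str, list[str]] = field(default_factory=dict)

    def stage_inputs(self, uris: list[str]) -> list[str]:
        # Pretend to copy each object to a local staging directory.
        return [f"/staged/{uri.rsplit('/', 1)[-1]}" for uri in uris]

    def collect_outputs(self, run_id: str) -> list[str]:
        return self.outputs.get(run_id, [])


@dataclass
class FakeCompute:
    """In-memory stand-in that records submitted run IDs."""
    submitted: list[str] = field(default_factory=list)

    def submit_job(self, run_id: str, inputs: list[str]) -> str:
        self.submitted.append(run_id)
        return f"job-{run_id}"


def launch_run(compute: ComputeProvider, storage: StorageProvider,
               run_id: str, input_uris: list[str]) -> str:
    """Application logic sees only the interfaces, never GKE or GCS."""
    staged = storage.stage_inputs(input_uris)
    return compute.submit_job(run_id, staged)
```

Because `launch_run` depends only on the two protocols, swapping the fakes for real backends (or a different cloud) would not change the calling code.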
Infrastructure is provisioned through UI-driven Terraform ([ADR-007](decisions/ADR-007-ui-driven-terraform.md)) -- researchers never touch HCL. All secrets live in Secret Manager ([ADR-008](decisions/ADR-008-secret-manager.md)), all actions are recorded in an immutable audit log ([ADR-009](decisions/ADR-009-immutable-audit-log.md)), and data portability is guaranteed ([ADR-012](decisions/ADR-012-data-portability.md)).
See all architecture decision records in [decisions/README.md](decisions/README.md).
## Quick Start
### Prerequisites
- Docker and Docker Compose
- Git
- openssl (for secret generation)
### Deploy on GCP (one command)
Run this on your local machine to provision a GCP VM and get started:
```bash
curl -fsSL https://raw.githubusercontent.com/not-that-guy-again/bioAF/main/install-gcp.sh | bash
```
The script sets up gcloud, creates a VM with Docker, and walks you through
the process. Once the VM is ready, SSH in and run:
```bash
git clone https://github.com/not-that-guy-again/bioAF.git
cd bioAF
./bioaf setup
```
### Deploy on an existing server
If you already have a Linux server with Docker installed:
```bash
git clone https://github.com/not-that-guy-again/bioAF.git
cd bioAF
./bioaf setup
```
The `setup` command handles everything: checks prerequisites, generates
secrets and TLS certs, pulls pre-built images, runs migrations, and prints
a one-time setup code. Open the URL it shows in your browser and enter the
code to create your admin account and configure the platform.
### Management Commands
| Command | Description |
| ------- | ----------- |
| `./bioaf setup` | First-run setup (pulls images, generates secrets, prints setup code) |
| `./bioaf start` | Start all services in dependency order |
| `./bioaf stop` | Stop all services |
| `./bioaf restart` | Restart all services |
| `./bioaf status` | Show service status |
| `./bioaf logs [service]` | Tail logs (all or one service) |
| `./bioaf build [service]` | Build container images locally (development only) |
| `./bioaf migrate` | Run database migrations |
| `./bioaf migrate-down <revision>` | Downgrade database to a specific revision |
| `./bioaf seed <script>` | Run a seed/data script in the backend container |
| `./bioaf backup` | Create a database backup |
| `./bioaf update [version]` | Update to latest (or specific) version |
| `./bioaf reset-db` | Destroy and recreate the database (with confirmation) |
| `./bioaf shell [service]` | Open a shell in a container (default: backend) |
| `./bioaf dbshell` | Open a psql session to the database |
| `./bioaf register-outputs` | Register pipeline output files from GCS |
| `./bioaf help` | Show all commands |
See the full [Deployment Guide](docs/deployment-guide.md) for detailed instructions.
## Documentation
- [Quickstart](docs/README.md) - Documentation hub
- [Deployment Guide](docs/deployment-guide.md) - Full deployment walkthrough
- [Bench Scientist Guide](docs/user-guide-bench.md) - Experiments, samples, results
- [Computational Biologist Guide](docs/user-guide-compbio.md) - Pipelines, notebooks, environments
- [Admin Guide](docs/user-guide-admin.md) - User management, costs, backups, notifications
- [Life After bioAF](docs/life-after-bioaf.md) - Data portability after teardown
- [ADR Index](decisions/README.md) - Architecture Decision Records
- [SSH Access Guide](docs/guides/ssh-access.md) - Connecting to running workloads
- [GEO Export Guide](docs/guides/geo-export.md) - Exporting to NCBI GEO
- [Reference Data Guide](docs/guides/reference-data.md) - Managing reference genomes and annotations
- [Compute Stack Setup](docs/guides/compute-stack-setup.md) - Kubernetes configuration
## Development Setup
### Using Docker Compose (recommended)
```bash
# Start backend, frontend, and PostgreSQL
docker compose -f docker/docker-compose.dev.yml up
# Backend: http://localhost:8000
# Frontend: http://localhost:3000
# Postgres: localhost:5432
```
### Manual Setup
```bash
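# Run each section from the repository root, in its own terminal.

# Database (requires PostgreSQL 16); apply migrations before starting the backend.
# Assumes the backend virtualenv below has been created and activated first.
cd backend
alembic upgrade head

# Backend
cd backend
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt -r requirements-dev.txt
uvicorn app.main:app --reload

# Frontend
cd frontend
npm install
npm run dev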
### Running Tests
```bash
# Backend tests (requires PostgreSQL)
docker compose -f docker/docker-compose.dev.yml up -d db
# Subshells keep the working directory at the repository root between steps
(cd backend && python -m pytest tests/ -v)
# Frontend tests
(cd frontend && npm test)
```
## Component Catalog
bioAF manages these infrastructure components through its UI:
| Component | Category | Compute Stack | Dependencies |
| --------- | -------- | ------------- | ----------- |
| GKE Cluster | Compute | Kubernetes | None |
| GCS Buckets | Storage | Kubernetes | GKE |
| JupyterHub | Notebooks | Kubernetes | Compute, Storage |
| RStudio Server | Notebooks | Kubernetes | Compute, Storage |
| Nextflow | Pipelines | Kubernetes | Compute |
| cellxgene | Visualization | Any | None |
| QC Dashboard | Visualization | Any | None |
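The Dependencies column implies an order in which components can be enabled. As an illustration (not a description of how bioAF itself resolves this), the table can be fed to a topological sort. The sketch assumes that "Compute" refers to the GKE Cluster and "Storage" to the GCS Buckets; `DEPENDENCIES` and `provisioning_order` are hypothetical names.

```python
from graphlib import TopologicalSorter

# Component -> components it depends on, transcribed from the table above.
# Assumption: "Compute" means the GKE Cluster, "Storage" means GCS Buckets.
DEPENDENCIES = {
    "GKE Cluster": set(),
    "GCS Buckets": {"GKE Cluster"},
    "JupyterHub": {"GKE Cluster", "GCS Buckets"},
    "RStudio Server": {"GKE Cluster", "GCS Buckets"},
    "Nextflow": {"GKE Cluster"},
    "cellxgene": set(),
    "QC Dashboard": set(),
}


def provisioning_order(deps: dict[str, set[str]]) -> list[str]:
    """Return components ordered so each one follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for component in provisioning_order(DEPENDENCIES):
        print(component)
```

`graphlib` is in the standard library from Python 3.9 and raises `graphlib.CycleError` if the dependency table ever becomes circular.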
## Project Structure
```text
bioAF/
  backend/          FastAPI application
  frontend/         Next.js 14 application
  docker/           Dockerfiles, compose, and nginx config
  terraform/        GCP infrastructure as code
  helm/             Kubernetes deployment chart
  decisions/        Architecture Decision Records
  documentation/    Product and architecture specs
  docs/             User-facing documentation
  scripts/          Utility scripts (seed data, update agent)
  tests/shell/      BATS tests for install.sh and bioaf scripts
  bioaf             Management script (entry point)
  install.sh        First-time installer (prereq checks + env generation)
  install-gcp.sh    One-command GCP provisioning script
```
## Contributing
See the ADRs in [decisions/](decisions/) for architectural context before making changes. All infrastructure changes must go through the UI-driven Terraform workflow (ADR-007). The audit log is immutable by design (ADR-009).