{"id":30054541,"url":"https://github.com/sacdallago/biotrainer","last_synced_at":"2026-01-16T06:53:49.421Z","repository":{"id":62599790,"uuid":"477575405","full_name":"sacdallago/biotrainer","owner":"sacdallago","description":"Biological prediction models made simple.","archived":false,"fork":false,"pushed_at":"2025-07-10T09:56:37.000Z","size":9389,"stargazers_count":44,"open_issues_count":14,"forks_count":8,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-08-04T06:25:15.346Z","etag":null,"topics":["deep-learning","language-model","machine-learning","protein","proteins"],"latest_commit_sha":null,"homepage":"https://biocentral.cloud/app","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"afl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sacdallago.png","metadata":{"files":{"readme":"README.md","changelog":"Changelog.md","contributing":"Contributing.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-04-04T06:26:22.000Z","updated_at":"2025-07-03T14:58:53.000Z","dependencies_parsed_at":"2024-02-28T10:32:03.711Z","dependency_job_id":"a59a4619-cbee-4922-8122-7499840bdfd0","html_url":"https://github.com/sacdallago/biotrainer","commit_stats":null,"previous_names":[],"tags_count":23,"template":false,"template_full_name":null,"purl":"pkg:github/sacdallago/biotrainer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sacdallago%2Fbiotrainer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sacdallago%2Fbiotrainer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sacdallago%2Fbiotrainer/releases","manifests_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/repositories/sacdallago%2Fbiotrainer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sacdallago","download_url":"https://codeload.github.com/sacdallago/biotrainer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sacdallago%2Fbiotrainer/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":269323557,"owners_count":24398029,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-07T02:00:09.698Z","response_time":73,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","language-model","machine-learning","protein","proteins"],"created_at":"2025-08-07T20:53:45.555Z","updated_at":"2026-01-16T06:53:49.405Z","avatar_url":"https://github.com/sacdallago.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Biotrainer\n\n[![License](https://img.shields.io/github/license/sacdallago/biotrainer)](https://github.com/sacdallago/biotrainer/blob/main/LICENSE)\n[![Documentation](https://img.shields.io/badge/docs-biocentral-blue)](https://biocentral.cloud/docs/biotrainer/config_file_options)\n[![GitHub release (latest by date)](https://img.shields.io/github/v/release/sacdallago/biotrainer)](https://github.com/sacdallago/biotrainer/releases)\n\n\u003cp align=\"center\"\u003e\n  \u003cimg width=\"25%\" height=\"20%\" alt=\"biotrainer logo\" 
src=\"biotrainer_logo.svg\" /\u003e\n\u003cbr /\u003e\nBiological prediction models made simple. \n\u003c/p\u003e\n\n## Overview\n*Biotrainer* is an open-source framework that simplifies machine learning model development for protein analysis. \nIt provides:\n- **Easy-to-use** training and inference pipelines for protein feature prediction\n- **Standardized data formats** for various prediction tasks\n- **Built-in support** for protein language models and embeddings\n- **Flexible configuration** through simple YAML files\n\n## Quick Start\n\n### 1. Installation\n\nInstall using pip:\n```shell\npip install biotrainer\n```\n\nManual installation using [uv](https://github.com/astral-sh/uv):\n```shell\n# First, install uv if you haven't already:\npip install uv\n\n# Create and activate a virtual environment\nuv venv\nsource .venv/bin/activate  # On Unix/macOS\n# OR\n.venv\\Scripts\\activate  # On Windows\n\n# Basic installation\nuv pip install -e .\n\n# Installing with jupyter notebook support:\nuv pip install -e \".[jupyter]\"\n\n# Installing with onnxruntime support (for onnx embedders and inference):\nuv pip install -e \".[onnx-cpu]\"    # CPU version\nuv pip install -e \".[onnx-gpu]\"    # CUDA version\nuv pip install -e \".[onnx-mac]\"    # CoreML version (for Apple Silicon)\n\n# You can also combine extras:\nuv pip install -e \".[jupyter,onnx-cpu]\"\n\n# For Windows users with CUDA support:\n# Visit https://pytorch.org/get-started/locally/ and follow GPU-specific installation, e.g.:\npip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118\n```\n\n### 2. Basic Usage\n```shell\n# Training\nbiotrainer train --config examples/sequence_to_class/config.yml\n\n# Inference\npython3\n\u003e\u003e\u003e from biotrainer.inference import Inferencer\n\u003e\u003e\u003e inferencer, _ = Inferencer.create_from_out_file('output/out.yml')\n\u003e\u003e\u003e predictions = inferencer.from_embeddings(your_embeddings)\n```\n\n### 3. 
Quick Start Datasets\n- **Subcellular Localization Prediction**\n  - *Protocol*: `sequence_to_class`/`residues_to_class`\n  - [Citations and Download](https://github.com/Rostlab/pbc/tree/main/supervised/scl)\n- **Secondary Structure Prediction** \n  - *Protocol*: `residue_to_class`\n  - [Citations and Download](https://github.com/Rostlab/pbc/tree/main/supervised/secondary_structure)\n\n\n## Features\n\n### Supported Prediction Tasks\n- **Residue-level classification** (`residue_to_class`)\n- **Residue-level regression** (`residue_to_value`) *[BETA]*\n- **Sequence-level classification** (`sequence_to_class`)\n- **Sequence-level regression** (`sequence_to_value`)\n- **Residues-level classification** (`residues_to_class`, like sequence_to_class with per-residue embeddings)\n- **Residues-level regression** (`residues_to_value`, like sequence_to_value with per-residue embeddings)\n\n### Built-in Capabilities\n- Multiple embedding methods (ProtT5, ESM-2, ONNX, etc.)\n- Various neural network architectures\n- Cross-validation and model evaluation\n- Performance metrics and visualization\n- Sanity checks and automatic calculation of baselines (such as random, mean...)\n- Docker support for reproducible environments\n\n## Documentation\n\n### Tutorials\n- [First Steps Guide](docs/first_steps.md)\n- [Interactive Tutorials](examples/tutorials)\n- [Config Options Overview](docs/config_file_options_overview.md)\n- [Biocentral Web Interface](https://biocentral.cloud/app)\n\n### Detailed Guides\n- [Data Standards](docs/data_standardization.md)\n- [Configuration Options](docs/config_file_options.md)\n- [Troubleshooting](docs/troubleshooting.md)\n\n## Example Configuration\n```yaml\nprotocol: residue_to_class\ninput_file: input.fasta\nmodel_choice: CNN\noptimizer_choice: adam\nlearning_rate: 1e-3\nloss_choice: cross_entropy_loss\nuse_class_weights: True\nnum_epochs: 200\nbatch_size: 128\nembedder_name: Rostlab/prot_t5_xl_uniref50\n```\n\n## Docker Support\n```shell\n# Run using 
pre-built image\ndocker run --gpus all --rm \\\n    -v \"$(pwd)/examples/docker\":/mnt \\\n    -u $(id -u ${USER}):$(id -g ${USER}) \\\n    ghcr.io/sacdallago/biotrainer:latest /mnt/config.yml\n```\n\nMore information on running Docker with GPUs:\n[NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)\n\n## Getting Help\n- Check our [Troubleshooting Guide](docs/troubleshooting.md)\n- [Create an issue](https://github.com/sacdallago/biotrainer/issues/new)\n- Visit [biocentral.cloud](https://biocentral.cloud/docs/biotrainer/config_file_options)\n\n## Citation\n```bibtex\n@inproceedings{\nsanchez2022standards,\ntitle={Standards, tooling and benchmarks to probe representation learning on proteins},\nauthor={Joaquin Gomez Sanchez and Sebastian Franz and Michael Heinzinger and Burkhard Rost and Christian Dallago},\nbooktitle={NeurIPS 2022 Workshop on Learning Meaningful Representations of Life},\nyear={2022},\nurl={https://openreview.net/forum?id=adODyN-eeJ8}\n}\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsacdallago%2Fbiotrainer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsacdallago%2Fbiotrainer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsacdallago%2Fbiotrainer/lists"}