https://github.com/robaina/protscout
Filter protein sequences by predicted protein properties
- Host: GitHub
- URL: https://github.com/robaina/protscout
- Owner: Robaina
- License: gpl-3.0
- Created: 2025-01-30T18:01:35.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-10-19T16:49:35.000Z (7 months ago)
- Last Synced: 2025-10-23T19:57:17.668Z (7 months ago)
- Language: Python
- Homepage:
- Size: 7.59 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
## About
ProtScout is a Python package that ranks protein sequences by multiple properties predicted by state-of-the-art AI models. It provides a unified interface for assessing and comparing proteins on characteristics such as stability, solubility, catalytic efficiency, and thermal properties.
## Features
- Comprehensive protein property analysis (structure, embeddings, catalytic activity, kinetic parameters, thermal stability, melting temperature, environmental tolerances, solubility, classical properties)
- Containerized execution of prediction tools with Docker
- Modular, parallel workflow with configurable steps
- Automatic retry and resume support for robust execution
- Validation and dry-run modes to preview the workflow
- Fully configurable via YAML files and environment variable overrides
- Detailed logging and resource monitoring
## Installation
### Prerequisites
- Python 3.8 or higher
- Docker (for running containerized prediction tools)
- NVIDIA GPU with CUDA support (recommended)
- Conda or Poetry for environment management
### Install with Poetry (Recommended)
```bash
# Clone repository
git clone https://github.com/Robaina/ProtScout.git
cd ProtScout
# Install Poetry if you haven't already
pip install poetry
# Install package and dependencies
poetry install
# Activate the virtual environment
# (on Poetry 2.x, `poetry shell` requires the poetry-plugin-shell plugin)
poetry shell
```
### Install with pip
```bash
# Clone repository
git clone https://github.com/Robaina/ProtScout.git
cd ProtScout
# Install in development mode
pip install -e .
```
## Quick Start
Generate a configuration file:
```bash
protscout init -o my_config.yaml
```
Edit the configuration file with your paths and settings:
```yaml
condition: ultra
workdir: /path/to/your/workdir
modeldir: /path/to/model/weights
python_executable: /path/to/conda/env/bin/python
memory: 100g
workers: 2
quiet: true
max_retries: 2
preserve_artifacts: false
```
Validate your setup:
```bash
protscout validate -c my_config.yaml
```
Run the workflow:
```bash
protscout run -c my_config.yaml
```
## Usage Examples
### Basic Workflow
```bash
# Run complete workflow
protscout run -c config.yaml
# Run specific steps only
protscout run -c config.yaml -s clean_sequences -s esmfold
# Override condition from command line
protscout run -c config.yaml --condition ultra
```
### Advanced Features
```bash
# Resume from last successful step after failure
protscout run -c config.yaml --resume
# Dry run to see what would be executed
protscout run -c config.yaml --dry-run
# Monitor logs in real-time
protscout logs logs/protscout_run_20240112_143022.log -f
```
### Parallel Execution
The workflow automatically runs compatible steps in parallel:
- ESMFold and ESM-2 run simultaneously
- All prediction tools (CatPred, CataPro, TemBERTure, TemStaPro, GeoPoc, GATSol) run in parallel
- Result processing steps are parallelized
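
As a rough illustration of the grouping described above, the sketch below runs steps stage by stage and submits the steps inside each stage to a thread pool. The stage layout, the `run_step` stub, and the `run_stages` helper are assumptions made for this example; they are not ProtScout's actual scheduler.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stage layout derived from the list above; the real
# scheduler in ProtScout may group steps differently.
STAGES = [
    ["clean_sequences"],
    ["esmfold", "esm2"],
    ["catpred", "catapro", "temberture", "temstapro", "geopoc", "gatsol"],
    ["consolidate_results"],
]


def run_step(name):
    """Placeholder for launching one step (for example, a container)."""
    return f"{name}: done"


def run_stages(stages, workers=2):
    """Run stages in order; steps within a stage run concurrently."""
    results = []
    for stage in stages:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # map() yields results in the input order of each stage
            results.extend(pool.map(run_step, stage))
    return results


if __name__ == "__main__":
    for line in run_stages(STAGES):
        print(line)
```

A stage only starts once every step in the previous stage has finished, which is one simple way to respect dependencies such as structure prediction preceding the prediction tools.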
## Configuration
ProtScout uses YAML configuration files for workflow management. Key configuration sections:
```yaml
# Analysis condition
condition: ultra
# Core directories
workdir: /path/to/your/workdir
modeldir: /path/to/model/weights
# Execution settings
python_executable: /path/to/conda/env/bin/python
memory: 100g
workers: 2
quiet: true
max_retries: 2
preserve_artifacts: false
# Container images (optional overrides)
containers:
  esmfold:
    image: ghcr.io/new-atlantis-labs/esmfold:latest
    max_containers: 1
  # ... other containers: esm2, catpred, catapro, temberture, temstapro, geopoc, gatsol
# GPU and shared memory settings
resources:
  gpus: all
  shm_size: 100g
# Workflow steps
steps:
- clean_sequences
- esmfold
- esm2
- remove_sequences_without_pdb
- prepare_catpred
- catpred
- catapro
- temberture
- temstapro
- geopoc
- gatsol
- classical_properties
- process_temberture
- process_temstapro
- process_geopoc
- process_gatsol
- process_catpred
- process_catapro
- consolidate_results
```
See `configs/example_workflow.yaml` for a complete example.
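
The feature list mentions environment variable overrides. The sketch below shows one way such overrides could be layered on top of values loaded from the YAML file. The `PROTSCOUT_<KEY>` naming scheme and the `apply_env_overrides` helper are hypothetical; check the package for the variable names it actually reads.

```python
import os

# Assumption: overrides use a PROTSCOUT_<KEY> naming scheme. This prefix
# is hypothetical and not taken from the ProtScout source.
ENV_PREFIX = "PROTSCOUT_"


def _coerce(raw, current):
    """Convert an environment string to the type of the current value."""
    if isinstance(current, bool):  # check bool first: bool is a subclass of int
        return raw.strip().lower() in {"1", "true", "yes", "on"}
    if isinstance(current, int):
        return int(raw)
    return raw


def apply_env_overrides(config, environ=None):
    """Return a copy of config with top-level keys overridden from environ."""
    environ = os.environ if environ is None else environ
    merged = dict(config)
    for key, current in config.items():
        raw = environ.get(ENV_PREFIX + key.upper())
        if raw is not None:
            merged[key] = _coerce(raw, current)
    return merged


if __name__ == "__main__":
    config = {"condition": "ultra", "workers": 2, "quiet": True, "memory": "100g"}
    env = {"PROTSCOUT_WORKERS": "4", "PROTSCOUT_QUIET": "false"}
    print(apply_env_overrides(config, env))
```

Coercing to the type of the existing value keeps `workers` an integer and `quiet` a boolean even though environment variables are always strings.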
## Workflow Steps
- `clean_sequences` - Clean and deduplicate input sequences
- `esmfold` - Predict protein structures using ESMFold
- `esm2` - Generate protein embeddings using ESM-2
- `remove_sequences_without_pdb` - Filter sequences without structures
- `prepare_catpred` - Prepare inputs for catalytic prediction
- `catpred` - Predict catalytic properties
- `catapro` - Predict kinetic parameters (Km, kcat, catalytic efficiency)
- `temberture` - Predict temperature stability
- `temstapro` - Predict melting temperature
- `geopoc` - Predict environmental conditions (temp, pH, salt)
- `gatsol` - Predict solubility
- `classical_properties` - Calculate classical protein properties
- `process_*` - Process results from each tool
- `consolidate_results` - Create final output tables
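
When only a subset of steps is requested (as with the `-s` option shown earlier), the names need to be checked against the known steps. The sketch below validates a selection and returns it in the order of the `steps` list from the configuration section. The `select_steps` helper is hypothetical, and whether ProtScout reorders a selection this way is an assumption.

```python
# Step order copied from the `steps` list in the configuration section.
ALL_STEPS = [
    "clean_sequences", "esmfold", "esm2", "remove_sequences_without_pdb",
    "prepare_catpred", "catpred", "catapro", "temberture", "temstapro",
    "geopoc", "gatsol", "classical_properties", "process_temberture",
    "process_temstapro", "process_geopoc", "process_gatsol",
    "process_catpred", "process_catapro", "consolidate_results",
]


def select_steps(requested, all_steps=ALL_STEPS):
    """Validate requested step names and return them in workflow order."""
    unknown = sorted(set(requested) - set(all_steps))
    if unknown:
        raise ValueError(f"Unknown step(s): {', '.join(unknown)}")
    wanted = set(requested)
    return [step for step in all_steps if step in wanted]


if __name__ == "__main__":
    print(select_steps(["esmfold", "clean_sequences"]))
```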
## Output Structure
```
/                                  # raw outputs (artifacts)
├── structures/                    # PDB files from ESMFold
├── embeddings/                    # ESM-2 embeddings
├── clean_sequences/               # cleaned FASTA files
├── catpred_data/                  # prepared inputs for CatPred
├── catpred/                       # CatPred raw output
├── catapro/                       # Catapro kinetic predictions (KM, Kcat, efficiency)
├── temberture/                    # temperature stability predictions
├── temstapro/                     # melting temperature predictions
├── geopoc/                        # environmental predictions (temp, pH, salt)
└── gatsol/                        # solubility predictions
/                                  # processed results
├── classical_properties_results/  # classical property outputs
├── temberture_results/            # processed temperature results
├── temstapro_results/             # processed melting temperature results
├── geopoc_results/                # processed environmental results
├── gatsol_results/                # processed solubility results
├── catpred_results/               # processed CatPred results
├── catapro_results/               # processed Catapro kinetic results
└── consolidated_results/          # final consolidated tables
```
## Resume Capability
ProtScout automatically saves workflow state and can resume after a failure:
```bash
# If workflow fails at step 'gatsol'
protscout run -c config.yaml --resume
# Workflow will skip completed steps and continue from 'gatsol'
```
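
The general idea behind this kind of resume can be sketched as a small state file that records each completed step. The JSON layout, file name, and helper names below are assumptions for illustration; ProtScout's own state format may differ.

```python
import json
import tempfile
from pathlib import Path


def load_completed(state_file):
    """Return the set of step names recorded as completed."""
    path = Path(state_file)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text())["completed"])


def run_with_resume(steps, run_step, state_file):
    """Run steps in order, skipping any already recorded in state_file."""
    completed = load_completed(state_file)
    executed = []
    for step in steps:
        if step in completed:
            continue
        run_step(step)  # an exception here leaves the state file untouched
        completed.add(step)
        executed.append(step)
        # Persist after every step so a later failure can resume from here
        Path(state_file).write_text(json.dumps({"completed": sorted(completed)}))
    return executed


if __name__ == "__main__":
    state = Path(tempfile.mkdtemp()) / "workflow_state.json"  # hypothetical name
    steps = ["geopoc", "gatsol", "classical_properties"]

    def flaky(step):
        if step == "gatsol":
            raise RuntimeError("simulated failure in gatsol")

    try:
        run_with_resume(steps, flaky, state)
    except RuntimeError as exc:
        print(f"first run stopped: {exc}")
    # Second run skips 'geopoc' and continues from 'gatsol'
    print(run_with_resume(steps, lambda step: None, state))
```

Writing the state after each step, rather than once at the end, is what allows a rerun to pick up at the step that failed.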
## Logging
Comprehensive logging with multiple levels:
- Console output: INFO level (progress and important messages)
- Log file: DEBUG level (detailed execution information)
Logs are saved to: `{workdir}/logs/protscout_run_YYYYMMDD_HHMMSS.log`
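
For readers who want the same split in their own scripts, the sketch below configures Python's standard `logging` module with an INFO console handler and a DEBUG file handler, using the file naming pattern shown above. The `setup_logging` helper and the message format are illustrative and not taken from the ProtScout source.

```python
import logging
import tempfile
import time
from pathlib import Path


def setup_logging(workdir, name="protscout"):
    """Log INFO and above to the console, DEBUG and above to a file."""
    log_dir = Path(workdir) / "logs"
    log_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d_%H%M%S")
    log_file = log_dir / f"protscout_run_{stamp}.log"

    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # let the handlers do the filtering
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    file_handler = logging.FileHandler(log_file)
    file_handler.setLevel(logging.DEBUG)

    fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (console, file_handler):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger, log_file


if __name__ == "__main__":
    logger, log_file = setup_logging(tempfile.mkdtemp())
    logger.info("shown on the console and written to the file")
    logger.debug("written to the file only")
    print(f"log file: {log_file}")
```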
## Troubleshooting
### Docker Issues
```bash
# Check if Docker is running
docker info
# Ensure user has Docker permissions (log out and back in for the change to take effect)
sudo usermod -aG docker $USER
```
### GPU Issues
```bash
# Check GPU availability
nvidia-smi
# Verify CUDA installation
nvcc --version
```
### Memory Issues
- Reduce `max_containers` in configuration
- Decrease `toks_per_batch` for ESM-2
- Lower batch sizes for prediction tools
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
This project is licensed under the GPL-3.0 License - see the LICENSE file for details.
## Citation
If you use ProtScout in your research, please cite:
```bibtex
@software{protscout2024,
author = {Robaina-Estévez, Semidán},
title = {ProtScout: AI-powered protein sequence ranking},
year = {2025},
url = {https://github.com/Robaina/ProtScout}
}
```
## Acknowledgments
ProtScout integrates several state-of-the-art protein prediction tools:
- ESMFold for structure prediction
- ESM-2 for sequence embeddings
- CatPred for catalytic activity prediction
- CataPro for kinetic parameters (Km, kcat, catalytic efficiency)
- TemBERTure for thermal stability prediction
- TemStaPro for melting temperature prediction
- GeoPoc for environmental condition prediction
- GATSol for solubility prediction