https://github.com/robaina/protscout

Filter protein sequences by predicted protein properties

## 🎯 About

ProtScout is a Python package that enables ranking of protein sequences based on multiple properties predicted by state-of-the-art AI models. It provides a unified interface to assess and compare proteins using various characteristics such as stability, solubility, catalytic efficiency, and thermal properties.
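To make the idea of ranking by several predicted properties concrete, here is a minimal sketch. It assumes a weighted sum of min-max normalized values; the README does not describe ProtScout's actual scoring, and the function name, property names, and weights are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_sequences(
    predictions: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank sequences by a weighted sum of min-max normalized properties.

    Illustrative only: this is an assumed scoring scheme, not ProtScout's.
    """
    scores = {seq_id: 0.0 for seq_id in predictions}
    for prop, weight in weights.items():
        values = [props[prop] for props in predictions.values()]
        low, high = min(values), max(values)
        span = high - low
        for seq_id, props in predictions.items():
            # A property that is constant across sequences adds nothing.
            normalized = (props[prop] - low) / span if span else 0.0
            scores[seq_id] += weight * normalized
    # Highest combined score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical predictions for three sequences.
example = {
    "seq_a": {"solubility": 0.9, "melting_temp": 55.0},
    "seq_b": {"solubility": 0.4, "melting_temp": 80.0},
    "seq_c": {"solubility": 0.7, "melting_temp": 70.0},
}
ranking = rank_sequences(example, {"solubility": 1.0, "melting_temp": 1.0})
print(ranking[0][0])  # the sequence with the best balance of both properties
```

Normalizing first keeps a property with a large numeric range, such as a temperature, from dominating one expressed as a fraction.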

## ✨ Features

- 🧬 Comprehensive protein property analysis (structure, embeddings, catalytic activity, kinetic parameters, thermal stability, melting temperature, environmental tolerances, solubility, classical properties)
- 🐳 Containerized execution of prediction tools with Docker
- 🚀 Modular, parallel workflow with configurable steps and automatic resume
- 🔄 Automatic retry and resume support for robust execution
- ✅ Validation and dry-run modes to preview workflow
- 🔧 Fully configurable via YAML files and environment variable overrides
- 📈 Detailed logging and resource monitoring
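The environment variable overrides mentioned in the list above could work along these lines. This is a sketch only: the `PROTSCOUT_` prefix and the `apply_env_overrides` helper are hypothetical, since the README does not document the variable naming scheme.

```python
import os
from typing import Any, Dict, Mapping


def apply_env_overrides(
    config: Dict[str, Any],
    environ: Mapping[str, str] = os.environ,
    prefix: str = "PROTSCOUT_",  # hypothetical prefix, not confirmed by the README
) -> Dict[str, Any]:
    """Return a copy of config with values replaced by matching env variables."""
    merged = dict(config)
    for key, current in config.items():
        raw = environ.get(prefix + key.upper())
        if raw is None:
            continue
        # Coerce the string to the type of the existing value.
        # bool is checked first because bool is a subclass of int.
        if isinstance(current, bool):
            merged[key] = raw.strip().lower() in {"1", "true", "yes"}
        elif isinstance(current, int):
            merged[key] = int(raw)
        else:
            merged[key] = raw
    return merged


config = {"condition": "ultra", "workers": 2, "quiet": True}
overridden = apply_env_overrides(
    config,
    {"PROTSCOUT_WORKERS": "4", "PROTSCOUT_QUIET": "false"},
)
print(overridden)
```

Passing the environment as a mapping keeps the function easy to test without touching the real process environment.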

## 🚀 Installation

### Prerequisites

- Python 3.8 or higher
- Docker (for running containerized prediction tools)
- NVIDIA GPU with CUDA support (recommended)
- Conda or Poetry for environment management

### Install with Poetry (Recommended)

```bash
# Clone repository
git clone https://github.com/Robaina/ProtScout.git
cd ProtScout

# Install Poetry if you haven't already
pip install poetry

# Install package and dependencies
poetry install

# Activate the virtual environment
poetry shell
```

### Install with pip

```bash
# Clone repository
git clone https://github.com/Robaina/ProtScout.git
cd ProtScout

# Install in development mode
pip install -e .
```

## 📋 Quick Start

Generate a configuration file:

```bash
protscout init -o my_config.yaml
```

Edit the configuration file with your paths and settings:

```yaml
condition: ultra
workdir: /path/to/your/workdir
modeldir: /path/to/model/weights
python_executable: /path/to/conda/env/bin/python
memory: 100g
workers: 2
quiet: true
max_retries: 2
preserve_artifacts: false
```

Validate your setup:

```bash
protscout validate -c my_config.yaml
```

Run the workflow:

```bash
protscout run -c my_config.yaml
```

## 📖 Usage Examples

### Basic Workflow

```bash
# Run complete workflow
protscout run -c config.yaml

# Run specific steps only
protscout run -c config.yaml -s clean_sequences -s esmfold

# Override condition from command line
protscout run -c config.yaml --condition ultra
```

### Advanced Features

```bash
# Resume from last successful step after failure
protscout run -c config.yaml --resume

# Dry run to see what would be executed
protscout run -c config.yaml --dry-run

# Monitor logs in real-time
protscout logs logs/protscout_run_20240112_143022.log -f
```

### Parallel Execution

The workflow automatically runs compatible steps in parallel:

- ESMFold and ESM-2 run simultaneously
- All prediction tools (CatPred, Catapro, Temberture, Temstapro, GeoPoc, GATSol) run in parallel
- Result processing steps are parallelized
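One way to derive the parallel groups described above is to layer steps by their dependencies. The sketch below assumes a small dependency map inferred from the step descriptions in this README; the real workflow's dependency graph may differ, and `parallel_stages` is a hypothetical helper.

```python
from typing import Dict, List, Set

# Hypothetical dependency map (step -> steps it waits for), inferred from this README.
DEPENDENCIES: Dict[str, Set[str]] = {
    "clean_sequences": set(),
    "esmfold": {"clean_sequences"},
    "esm2": {"clean_sequences"},
    "remove_sequences_without_pdb": {"esmfold"},
    "gatsol": {"remove_sequences_without_pdb", "esm2"},
    "temstapro": {"remove_sequences_without_pdb", "esm2"},
}


def parallel_stages(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group steps into stages; steps within one stage can run concurrently."""
    remaining = {step: set(deps) for step, deps in dependencies.items()}
    stages: List[List[str]] = []
    while remaining:
        # A step is ready once all of its dependencies have been scheduled.
        ready = sorted(step for step, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("Cyclic or missing dependency between steps")
        stages.append(ready)
        for step in ready:
            del remaining[step]
        for deps in remaining.values():
            deps.difference_update(ready)
    return stages


stages = parallel_stages(DEPENDENCIES)
for index, stage in enumerate(stages, start=1):
    print(index, stage)
```

Each stage could then be handed to a worker pool, for example `concurrent.futures.ThreadPoolExecutor`, with the pool size taken from the `workers` setting.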

## 🔧 Configuration

ProtScout uses YAML configuration files for workflow management. Key configuration sections:

```yaml
# Analysis condition
condition: ultra

# Core directories
workdir: /path/to/your/workdir
modeldir: /path/to/model/weights

# Execution settings
python_executable: /path/to/conda/env/bin/python
memory: 100g
workers: 2
quiet: true
max_retries: 2
preserve_artifacts: false

# Container images (optional overrides)
containers:
  esmfold:
    image: ghcr.io/new-atlantis-labs/esmfold:latest
    max_containers: 1
  # ... other containers: esm2, catpred, catapro, temberture, temstapro, geopoc, gatsol

# GPU and shared memory settings
resources:
  gpus: all
  shm_size: 100g

# Workflow steps
steps:
- clean_sequences
- esmfold
- esm2
- remove_sequences_without_pdb
- prepare_catpred
- catpred
- catapro
- temberture
- temstapro
- geopoc
- gatsol
- classical_properties
- process_temberture
- process_temstapro
- process_geopoc
- process_gatsol
- process_catpred
- process_catapro
- consolidate_results
```

See `configs/example_workflow.yaml` for a complete example.
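As an illustration of what a setting like `max_retries` typically controls, here is a minimal retry wrapper. It assumes `max_retries` counts retries after the first attempt, so `max_retries: 2` allows up to three attempts; the README does not confirm this interpretation, and `run_with_retries` is a hypothetical helper.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    step: Callable[[], T],
    max_retries: int = 2,
    delay_seconds: float = 0.0,
) -> T:
    """Call step, retrying up to max_retries times after a failure."""
    attempt = 0
    while True:
        try:
            return step()
        except Exception:
            # Re-raise the last error once the retry budget is exhausted.
            if attempt >= max_retries:
                raise
            attempt += 1
            time.sleep(delay_seconds)


calls = {"count": 0}


def flaky_step() -> str:
    """Fail twice, then succeed, to simulate a transient error."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "done"


result = run_with_retries(flaky_step, max_retries=2)
print(result, calls["count"])
```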

## 🛠️ Workflow Steps

- `clean_sequences` - Clean and deduplicate input sequences
- `esmfold` - Predict protein structures using ESMFold
- `esm2` - Generate protein embeddings using ESM-2
- `remove_sequences_without_pdb` - Filter sequences without structures
- `prepare_catpred` - Prepare inputs for catalytic prediction
- `catpred` - Predict catalytic properties
- `catapro` - Predict kinetic parameters (KM, Kcat, catalytic efficiency)
- `temberture` - Predict temperature stability
- `temstapro` - Predict melting temperature
- `geopoc` - Predict environmental conditions (temp, pH, salt)
- `gatsol` - Predict solubility
- `classical_properties` - Calculate classical protein properties
- `process_*` - Process results from each tool
- `consolidate_results` - Create final output tables

## 📊 Output Structure

```
/ # raw outputs (artifacts)
├── structures/         # PDB files from ESMFold
├── embeddings/         # ESM-2 embeddings
├── clean_sequences/    # cleaned FASTA files
├── catpred_data/       # prepared inputs for CatPred
├── catpred/            # CatPred raw output
├── catapro/            # Catapro kinetic predictions (KM, Kcat, efficiency)
├── temberture/         # temperature stability predictions
├── temstapro/          # melting temperature predictions
├── geopoc/             # environmental predictions (temp, pH, salt)
└── gatsol/             # solubility predictions

/ # processed results
├── classical_properties_results/   # classical property outputs
├── temberture_results/             # processed temperature results
├── temstapro_results/              # processed melting temperature results
├── geopoc_results/                 # processed environmental results
├── gatsol_results/                 # processed solubility results
├── catpred_results/                # processed CatPred results
├── catapro_results/                # processed Catapro kinetic results
└── consolidated_results/           # final consolidated tables
```

## 🔄 Resume Capability

ProtScout automatically saves workflow state and can resume from failures:

```bash
# If workflow fails at step 'gatsol'
protscout run -c config.yaml --resume
# Workflow will skip completed steps and continue from 'gatsol'
```
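The resume behaviour can be pictured as a small state file that records completed steps. The sketch below is an assumption about how such a mechanism might look, not ProtScout's implementation; the state file name, its JSON layout, and `run_steps` are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def run_steps(
    steps: List[str],
    actions: Dict[str, Callable[[], None]],
    state_file: Path,
    resume: bool = False,
) -> List[str]:
    """Run steps in order, persisting progress; return the steps executed now."""
    completed: List[str] = []
    if resume and state_file.exists():
        completed = json.loads(state_file.read_text())["completed"]
    executed: List[str] = []
    for step in steps:
        if step in completed:
            continue  # finished in an earlier run
        actions[step]()  # an exception here leaves the state at the last success
        completed.append(step)
        executed.append(step)
        state_file.write_text(json.dumps({"completed": completed}))
    return executed


def fail() -> None:
    raise RuntimeError("gatsol failed")


steps = ["clean_sequences", "esmfold", "gatsol"]
with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "workflow_state.json"  # hypothetical file name
    actions = {"clean_sequences": lambda: None, "esmfold": lambda: None, "gatsol": fail}
    try:
        run_steps(steps, actions, state_file)
    except RuntimeError:
        pass  # first run stops at 'gatsol'
    actions["gatsol"] = lambda: None  # pretend the cause of the failure was fixed
    resumed = run_steps(steps, actions, state_file, resume=True)

print(resumed)
```

Writing the state after every successful step means a crash loses at most the step that was in progress.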

## 📝 Logging

Comprehensive logging with multiple levels:

- Console output: INFO level (progress and important messages)
- Log file: DEBUG level (detailed execution information)

Logs are saved to: `{workdir}/logs/protscout_run_YYYYMMDD_HHMMSS.log`
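The two-level setup described above maps directly onto Python's standard `logging` module. The following sketch shows one way to configure it; the logger name, message format, and `setup_logging` helper are hypothetical, and only the levels and file name pattern come from this README.

```python
import logging
import tempfile
from datetime import datetime
from pathlib import Path


def setup_logging(workdir: Path) -> Path:
    """Log INFO and above to the console and DEBUG and above to a timestamped file."""
    log_dir = workdir / "logs"
    log_dir.mkdir(parents=True, exist_ok=True)
    log_file = log_dir / f"protscout_run_{datetime.now():%Y%m%d_%H%M%S}.log"

    logger = logging.getLogger("protscout_sketch")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)  # let handlers decide what to keep
    logger.handlers.clear()

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    file_handler = logging.FileHandler(log_file)
    file_handler.setLevel(logging.DEBUG)

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (console, file_handler):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return log_file


workdir = Path(tempfile.mkdtemp())  # stand-in for the configured workdir
log_file = setup_logging(workdir)
logger = logging.getLogger("protscout_sketch")
logger.debug("written to the file only")
logger.info("written to the console and the file")
log_text = log_file.read_text()
```

The logger itself is set to DEBUG so that each handler can apply its own threshold.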

## 🐛 Troubleshooting

### Docker Issues

```bash
# Check if Docker is running
docker info

# Ensure user has Docker permissions
# (log out and back in for the group change to take effect)
sudo usermod -aG docker $USER
```

### GPU Issues

```bash
# Check GPU availability
nvidia-smi

# Verify CUDA installation
nvcc --version
```

### Memory Issues

- Reduce `max_containers` in configuration
- Decrease `toks_per_batch` for ESM-2
- Lower batch sizes for prediction tools

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## 📄 License

This project is licensed under the GPL-3.0 License - see the LICENSE file for details.

## 📚 Citation

If you use ProtScout in your research, please cite:

```bibtex
@software{protscout2024,
  author = {Robaina-Estévez, Semidán},
  title = {ProtScout: AI-powered protein sequence ranking},
  year = {2025},
  url = {https://github.com/Robaina/ProtScout}
}
```

## 🙏 Acknowledgments

ProtScout integrates several state-of-the-art protein prediction tools:

- ESMFold for structure prediction
- ESM-2 for sequence embeddings
- CatPred for catalytic activity prediction
- Catapro for kinetic parameters (KM, Kcat, catalytic efficiency)
- Temberture for thermal stability prediction
- Temstapro for melting temperature prediction
- GeoPoc for environmental condition prediction
- GATSol for solubility prediction