https://github.com/robaina/seqpicker
- Host: GitHub
- URL: https://github.com/robaina/seqpicker
- Owner: Robaina
- License: other
- Created: 2024-10-31T15:31:20.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-11-01T13:34:12.000Z (over 1 year ago)
- Last Synced: 2025-07-19T14:33:51.054Z (11 months ago)
- Language: Python
- Size: 24.4 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# 🔍 seqpicker
A Python tool for selecting representative protein sequences from large datasets. It combines CD-HIT clustering with [RepSet](https://onlinelibrary.wiley.com/doi/10.1002/prot.25461), a representative-set selection algorithm based on submodular optimization, to reduce redundancy while preserving sequence diversity.
## ✨ Features
- Reduce protein sequence redundancy using CD-HIT
- Select representative sequences using submodular optimization
- Maintain sequence diversity while minimizing dataset size
- Easy-to-use command line interface
- Flexible Python API for integration into bioinformatics pipelines
## ⚙️ Installation
```bash
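# Create the environment from the bundled file (note: `conda env create`, not `conda create`)
conda env create -f environment.yml
conda activate seqpicker
# Run the following inside the activated seqpicker environment
poetry build
pip install dist/seqpicker-0.1.0-py3-none-any.whl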
```
## 🚀 Usage
### 💻 Command Line
```bash
# Basic usage
seqpick input.fasta -o output.fasta --maxsize 1000

# Use only CD-HIT (faster but less sophisticated)
seqpick input.fasta --cdhit-only --similarity 0.9

# Use only RepSet selection (slower but more accurate)
seqpick input.fasta --repset-only --maxsize 500

# Fine-tune the selection process
seqpick input.fasta \
    --maxsize 1000 \
    --mixture-weight 0.7 \
    --cdhit-args "-c 0.9 -n 5"
```
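The `-n` value passed through `--cdhit-args` is CD-HIT's word length, which the CD-HIT user guide pairs with the identity threshold `-c`. The helper below is a small sketch of that pairing for protein sequences. The function names `cdhit_word_size` and `cdhit_args_for` are hypothetical and are not part of seqpicker.

```python
def cdhit_word_size(identity: float) -> int:
    """Return the word length (-n) the CD-HIT guide recommends for a protein identity threshold (-c)."""
    if not 0.4 <= identity <= 1.0:
        raise ValueError("cd-hit accepts protein identity thresholds between 0.4 and 1.0")
    if identity >= 0.7:
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    return 2


def cdhit_args_for(identity: float) -> str:
    """Build an argument string in the same form as the example above (hypothetical helper)."""
    return f"-c {identity} -n {cdhit_word_size(identity)}"


print(cdhit_args_for(0.9))  # -c 0.9 -n 5
```

The resulting string can be passed to `--cdhit-args` on the command line or to `cdhit_args` in the Python API.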
### 🐍 Python API
```python
from seqpicker import reduce_database_redundancy

# Basic usage
reduce_database_redundancy(
    input_fasta="input.fasta",
    output_fasta="output.fasta",
    maxsize=1000,
)

# Advanced usage with more control
reduce_database_redundancy(
    input_fasta="input.fasta",
    output_fasta="output.fasta",
    cdhit=True,
    maxsize=1000,
    cdhit_args="-c 0.9 -n 5",
    mixture_weight=0.7,
)
```
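After a run, it can be useful to check how much the dataset was actually reduced. The sketch below counts FASTA records by their header lines using only the standard library. `count_fasta_records` and `reduction_summary` are hypothetical helpers, not part of the seqpicker API.

```python
def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines (those starting with '>')."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def reduction_summary(input_fasta, output_fasta):
    """Return (input count, output count, fraction of sequences kept)."""
    n_in = count_fasta_records(input_fasta)
    n_out = count_fasta_records(output_fasta)
    fraction_kept = n_out / n_in if n_in else 0.0
    return n_in, n_out, fraction_kept
```

For example, `reduction_summary("input.fasta", "output.fasta")` returns both counts and the fraction kept, which should not exceed `maxsize / n_in` if the size limit was honored.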
## 🧠 How It Works
seqpicker uses a two-step approach to select representative sequences:
1. **Initial Redundancy Reduction** (optional)
   - Uses CD-HIT to quickly remove highly similar sequences
   - Configurable similarity threshold and parameters
2. **Representative Selection**
   - Implements RepSet, a submodular optimization algorithm, to select representative sequences
   - Balances sequence diversity and coverage
   - Uses sequence similarity and redundancy metrics
   - Configurable mixture weight between objectives
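
The representative selection step can be illustrated with a toy greedy optimizer. This is a minimal sketch, not RepSet's or seqpicker's actual code. It assumes the objective is a weighted mix of a coverage term (facility location) and a redundancy penalty, with `weight` playing the role of the mixture weight. The similarity matrix and all function names are made up for illustration.

```python
def facility_location(sim, selected):
    """Coverage: every sequence is scored by its best match among the selected ones."""
    if not selected:
        return 0.0
    return sum(max(sim[i][j] for j in selected) for i in range(len(sim)))


def sum_redundancy(sim, selected):
    """Redundancy penalty: negative total pairwise similarity within the selected set."""
    return -sum(sim[i][j] for i in selected for j in selected if i < j)


def mixture(sim, selected, weight):
    """Assumed objective: weight * coverage + (1 - weight) * redundancy penalty."""
    return (weight * facility_location(sim, selected)
            + (1 - weight) * sum_redundancy(sim, selected))


def greedy_select(sim, maxsize, weight=0.5):
    """Repeatedly add the candidate with the largest gain until maxsize is reached."""
    selected = []
    remaining = set(range(len(sim)))
    while remaining and len(selected) < maxsize:
        current = mixture(sim, selected, weight)
        best = max(sorted(remaining),
                   key=lambda c: mixture(sim, selected + [c], weight) - current)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy pairwise similarities: sequences 0 and 1 are close, as are 2 and 3.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.2, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
print(greedy_select(sim, maxsize=2, weight=0.7))  # [0, 3]
```

With two picks allowed, the sketch takes one sequence from each similar pair: adding sequence 1 after sequence 0 would raise coverage only slightly while incurring a large redundancy penalty.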
## 📦 Dependencies
- [CD-HIT](http://weizhongli-lab.org/cd-hit/)
- [RepSet](https://onlinelibrary.wiley.com/doi/10.1002/prot.25461)
- [MAFFT](https://mafft.cbrc.jp/alignment/software/)
- [HMMER (esl-alipid)](http://hmmer.org/)
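Because several of these dependencies are external command-line programs, a quick check that they are on `PATH` can save a failed run. The sketch below uses the standard library's `shutil.which`. The executable names `cd-hit`, `mafft`, and `esl-alipid` are the names these tools are commonly installed under; how seqpicker itself locates them is an assumption not confirmed here.

```python
import shutil

# Executable names as commonly installed; seqpicker's own lookup may differ.
REQUIRED_TOOLS = ("cd-hit", "mafft", "esl-alipid")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All external tools found.")
```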
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## 📜 License
This project is licensed under the MIT License - see the LICENSE file for details.
## ✏️ Citation
If you use seqpicker in your research, please cite:
```bibtex
@software{seqpicker2024,
  author = {Semidán Robaina Estévez},
  title = {seqpicker: A tool for selecting representative protein sequences},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/Robaina/seqpicker}
}
```