https://github.com/ulelab/cluster_kmers
Get clusters of k-mers based solely on their sequence or in combination with enrichment in PEKA.
https://github.com/ulelab/cluster_kmers
Last synced: 5 months ago
JSON representation
Get clusters of k-mers based solely on their sequence or in combination with enrichment in PEKA.
- Host: GitHub
- URL: https://github.com/ulelab/cluster_kmers
- Owner: ulelab
- License: gpl-3.0
- Created: 2023-02-22T15:04:05.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2023-11-21T09:16:28.000Z (over 2 years ago)
- Last Synced: 2025-09-05T04:12:02.115Z (10 months ago)
- Language: Python
- Size: 1.29 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff
Awesome Lists containing this project
README
# cluster_kmers
[](https://zenodo.org/badge/latestdoi/605154452)
Get clusters of k-mers based solely on their sequence or in combination with enrichment in PEKA.
## Installation
```
conda create -n cluster_kmers
conda activate cluster_kmers
# Python must be >=3.8 or <=3.11 to support SciPy 1.9
conda install python=3.9 pip
pip install git+https://github.com/ulelab/cluster_kmers.git@master
```
## Usage
```
cluster_kmers [-h] -k KMERS [KMERS ...] -o RESULTS_FOLDER [-co {seq,seq_enrichment}] [-n N_CLUSTERS] [-peka PEKA_FILES [PEKA_FILES ...]] [-tl MIN_TOKEN_LENGTH] [-wl {True,False}] [-cons CONSENSUS_LENGTH]
```
Cluster k-mers based on sequence or on a combination of sequence and enrichment in CLIP data.
```
required arguments:
-k KMERS [KMERS ...], --kmers KMERS [KMERS ...]
A list of k-mers encoded in RNA alphabet separated by spaces: AAGG GGAG GCCU.
-o RESULTS_FOLDER, --output_folder RESULTS_FOLDER
A path to an existing output folder, for example "~/results"
optional arguments:
-co {seq,seq_enrichment}, --cluster_on {seq,seq_enrichment}
Inputs to clustering. Valid options are: seq - Cluster only based on sequence similarity. seq_enrichment - Cluster based on sequence similarity and based on enrichment of motifs in CLIP data (this option requires arguments passed to -peka)
-n N_CLUSTERS, --n_clusters N_CLUSTERS
Number of clusters to split k-mers into. Valid options are "auto" or integer. Default is "auto".
-peka PEKA_FILES [PEKA_FILES ...], --peka_files PEKA_FILES [PEKA_FILES ...]
A list of peka output files with extensions *mer_distribution_{region_name}.tsv, separated by spaces.
-tl MIN_TOKEN_LENGTH, --min_token_length MIN_TOKEN_LENGTH
Minimal length of a substrings used for clustering. For k-mers with lengths greater than 5, setting this value to be greater than 1 can improve the results of clustering.
-wl {True,False}, --weblogos {True,False}
Whether to plot weblogos for motif groups, True by default.
-cons CONSENSUS_LENGTH, --consensus_length CONSENSUS_LENGTH
Length of consensus sequence to name k-mer groups. Automatically this length is determined as k-mer length - 1. Valid choices are "auto" or integer.
```