https://github.com/maxibor/corecomb
Toolkit to deal with core-genome alignments recombination detection
- Host: GitHub
- URL: https://github.com/maxibor/corecomb
- Owner: maxibor
- License: gpl-3.0
- Created: 2024-01-31T13:44:43.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2025-01-07T09:56:12.000Z (over 1 year ago)
- Last Synced: 2026-02-28T20:29:12.761Z (3 months ago)
- Language: Jupyter Notebook
- Size: 2.26 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
**Corecomb**: create an XMFA file from Panaroo core gene alignments to detect recombination in the core genome with ClonalFrameML.
## Installation
```bash
pip install corecomb
```
## Quick start
If you are in the Panaroo output directory, just run:
```bash
corecomb
```
## Get help
```bash
$ corecomb --help
Usage: corecomb [OPTIONS]
Create XMFA file from ClonalFrameML input from Panaroo core-genome gene alignments
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --gene_al_dir TEXT Path to directory containing core-genome gene alignments [default: core_gene_alignments] │
│ --pan_fa TEXT Path to Panaroo pan_genome_reference.fa [default: pan_genome_reference.fa] │
│ --extension TEXT File extension of core-genome gene alignments [default: fas] │
│ --outfile TEXT Path to output XMFA file [default: corecomb.xmfa] │
│ --help Show this message and exit. │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```
## Why
In theory, using the individual core-gene multiple sequence alignments from the `core_gene_alignments` directory of Panaroo, one could just run a `sed` command to concatenate them into an [XMFA file](https://darlinglab.org/mauve/user-guide/files.html).
```bash
sed -e '$s/$/\n=/' -s ../tests/data/aligned_gene_sequences_raw/*.fas > core_gene_alignment.xmfa
```
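For readers less familiar with `sed`, the same naive concatenation can be sketched in Python. This is only an illustration of what the one-liner above does (each file is copied as-is and followed by a `=` line); the function name `naive_xmfa` is made up for this example and is not part of Corecomb.

```python
from pathlib import Path


def naive_xmfa(alignment_dir: str, extension: str = "fas") -> str:
    """Concatenate every alignment file, closing each block with a '=' line.

    Hypothetical helper mirroring the sed one-liner; no cleaning is done.
    """
    blocks = []
    # sorted() mimics the alphabetical order of a shell glob
    for path in sorted(Path(alignment_dir).glob(f"*.{extension}")):
        # Copy the file content untouched and append the XMFA block separator
        blocks.append(path.read_text().rstrip("\n") + "\n=\n")
    return "".join(blocks)
```

Like the `sed` command, this keeps the original sequence names, characters, and gene order, which is where the problems start.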
However, this approach suffers from three issues:
- Sequence names need to be cleaned
- Ambiguous IUPAC characters other than `N` need to be handled (ClonalFrameML, CFML for short, only accepts `A,T,G,C,N,-`)
- Genomes with missing genes will cause CFML to crash (this happens when the core genome is defined at less than 100%)
> CoRecomb addresses all three of these issues. Additionally, CoRecomb uses the gene order [defined in `pan_genome_reference.fa`](https://github.com/gtonkinhill/panaroo/issues/146) to re-order the genes in the XMFA file; this order is preserved in the CFML output `core_gene_test_cfml.filtered.fasta`.
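To make the three issues concrete, here is a minimal Python sketch of how one gene alignment could be harmonised. It is not Corecomb's actual code. It assumes that headers look like `genome;gene_id`, that ambiguous characters are replaced by `N`, and that genomes lacking the gene are padded with gaps; all three are assumptions made for illustration, and the function names are hypothetical.

```python
ALLOWED = set("ATGCN-")  # characters accepted by CFML, per the list above


def clean_name(header: str) -> str:
    """Keep only the genome name (assumes 'genome;gene_id' style headers)."""
    return header.lstrip(">").split(";")[0].strip()


def mask_ambiguous(seq: str) -> str:
    """Uppercase the sequence and replace anything outside A,T,G,C,N,- with N."""
    return "".join(c if c in ALLOWED else "N" for c in seq.upper())


def harmonise_block(block: dict[str, str], genomes: list[str]) -> dict[str, str]:
    """Return one sequence per genome for a single, non-empty gene alignment.

    Genomes absent from the alignment get an all-gap sequence of the same
    length (an assumed padding strategy, not necessarily Corecomb's).
    """
    cleaned = {clean_name(h): mask_ambiguous(s) for h, s in block.items()}
    length = len(next(iter(cleaned.values())))
    return {g: cleaned.get(g, "-" * length) for g in genomes}
```

For example, `harmonise_block({">g1;0_0_1": "ACRT"}, ["g1", "g2"])` yields a cleaned `g1` sequence with `R` masked and a gap-only placeholder for `g2`, so every block ends up with the same set of genomes.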
## Test it for yourself
```bash
poetry run pytest -vv
```
Test data can be found in [tests/data](tests/data). To run Corecomb on it:
```bash
corecomb \
--gene_al_dir tests/data/aligned_gene_sequences_raw \
--pan_fa tests/data/pan_genome_reference.fa \
--extension fas \
--outfile corecomb.xmfa
```
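To confirm that a resulting file meets the constraints listed under "Why", a small sanity check along these lines may help. It is a sketch under assumptions: it treats everything after `>` as the sequence name, ignores case, and skips `#` comment lines. `parse_xmfa` and `check_xmfa` are hypothetical names, not Corecomb functions.

```python
ALLOWED = set("ATGCN-")  # characters accepted by CFML


def parse_xmfa(text: str) -> list[dict[str, str]]:
    """Split XMFA text into blocks, each mapping sequence name -> sequence."""
    blocks, current, name = [], {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith("="):  # end of the current block
            if current:
                blocks.append(current)
            current, name = {}, None
        elif line.startswith(">"):  # new sequence header
            name = line[1:].strip()
            current[name] = ""
        elif name is not None:  # sequence line, possibly wrapped
            current[name] += line
    if current:
        blocks.append(current)
    return blocks


def check_xmfa(text: str) -> list[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    blocks = parse_xmfa(text)
    expected = set(blocks[0]) if blocks else set()
    for i, block in enumerate(blocks, start=1):
        if set(block) != expected:
            problems.append(f"block {i}: sequence names differ from block 1")
        if len({len(s) for s in block.values()}) > 1:
            problems.append(f"block {i}: sequences have different lengths")
        bad = set("".join(block.values()).upper()) - ALLOWED
        if bad:
            problems.append(f"block {i}: unsupported characters {sorted(bad)}")
    return problems
```

Calling `check_xmfa(open("corecomb.xmfa").read())` should return an empty list if every block contains the same genomes, equal-length sequences, and only the supported characters.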
## Use the XMFA with ClonalFrameML
```bash
ClonalFrameML \
input_tree.nwk \
corecomb.xmfa \
cfml_output_basename \
-xmfa_file true \
-show_progress true \
-output_filtered true
```