https://github.com/maxibor/corecomb

Toolkit for recombination detection in core-genome alignments

Host: GitHub
URL: https://github.com/maxibor/corecomb
Owner: maxibor
License: gpl-3.0
Created: 2024-01-31T13:44:43.000Z (over 2 years ago)
Default Branch: master
Last Pushed: 2025-01-07T09:56:12.000Z (over 1 year ago)
Last Synced: 2026-02-28T20:29:12.761Z (3 months ago)
Language: Jupyter Notebook
Size: 2.26 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

**Corecomb**: create an XMFA file from Panaroo core gene alignments to detect recombination in the core genome using ClonalFrameML.

## Installation

```bash
pip install corecomb
```

## Quick start

If you are in the Panaroo output directory, just run:

```bash
corecomb
```

## Get help

```bash
$ corecomb --help

Usage: corecomb [OPTIONS]

Create XMFA file from ClonalFrameML input from Panaroo core-genome gene alignments

╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --gene_al_dir TEXT Path to directory containing core-genome gene alignments [default: core_gene_alignments] │
│ --pan_fa TEXT Path to Panaroo pan_genome_reference.fa [default: pan_genome_reference.fa] │
│ --extension TEXT File extension of core-genome gene alignments [default: fas] │
│ --outfile TEXT Path to output XMFA file [default: corecomb.xmfa] │
│ --help Show this message and exit. │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```

## Why

In theory, one could take the individual core-gene multiple sequence alignments from Panaroo's `core_gene_alignments` directory and concatenate them into an [XMFA file](https://darlinglab.org/mauve/user-guide/files.html) with a single `sed` command:

```bash
sed -e '$s/$/\n=/' -s ../tests/data/aligned_gene_sequences_raw/*.fas > core_gene_alignment.xmfa
```
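
For readers without GNU `sed`, the same naive concatenation can be sketched in Python. This is only an illustration of the XMFA layout (each alignment block followed by a line containing `=`); the function name `naive_xmfa` is hypothetical and is not part of Corecomb.

```python
from pathlib import Path


def naive_xmfa(alignment_dir: str, extension: str = "fas") -> str:
    """Concatenate every alignment file, ending each block with a '=' line.

    Hypothetical helper mirroring the sed one-liner; it performs no cleaning.
    """
    blocks = []
    for path in sorted(Path(alignment_dir).glob(f"*.{extension}")):
        # Each gene alignment is copied verbatim and becomes one XMFA block.
        blocks.append(path.read_text().rstrip("\n") + "\n=\n")
    return "".join(blocks)
```

Like the `sed` version, this copies headers and sequences untouched, so it inherits every problem of the raw alignments.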

However, this approach suffers from three issues:

- Sequence names need to be cleaned
- Ambiguous IUPAC characters other than `N` need to be handled (CFML only accepts `A`, `T`, `G`, `C`, `N`, and `-`)
- Genomes with missing genes will cause CFML to crash (this happens when the core genome is defined at a threshold below 100%)
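
To make the three issues concrete, here is a minimal Python sketch of one way they could be handled. It is not Corecomb's actual code. The helper names are hypothetical, and three choices are assumptions: headers are taken to look like `genome;locus_tag`, unsupported characters are masked as `N`, and genomes missing from a gene are filled with gaps.

```python
import re

# Anything outside the alphabet CFML accepts (A, T, G, C, N, -) is masked.
NOT_CFML = re.compile(r"[^ACGTN-]")


def clean_name(header: str) -> str:
    """Keep only the genome part of a header (assumed 'genome;locus_tag')."""
    return header.lstrip(">").split(";")[0].strip()


def mask_ambiguous(seq: str) -> str:
    """Upper-case a sequence and replace unsupported characters with N."""
    return NOT_CFML.sub("N", seq.upper())


def pad_missing(block: dict[str, str], genomes: list[str]) -> dict[str, str]:
    """Return one entry per genome; absent genomes get an all-gap sequence."""
    length = len(next(iter(block.values())))
    return {g: block.get(g, "-" * length) for g in genomes}
```

With these helpers, every block of the XMFA file would contain the same set of genome names and only characters that CFML accepts.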

> CoRecomb addresses all three of these issues. Additionally, CoRecomb uses the gene order [defined in `pan_genome_reference.fa`](https://github.com/gtonkinhill/panaroo/issues/146) to re-order the genes in the XMFA file; this order is preserved in the CFML output `core_gene_test_cfml.filtered.fasta`.
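
The re-ordering step can be illustrated with a short Python sketch. It assumes that the first token of each header in `pan_genome_reference.fa` is the gene name and that each alignment file name starts with that gene name followed by a dot. Both are assumptions, and the function names are hypothetical rather than Corecomb's API.

```python
from pathlib import Path


def gene_order(pan_fa: str) -> list[str]:
    """Return gene names in the order their FASTA headers appear."""
    order = []
    with open(pan_fa) as handle:
        for line in handle:
            if line.startswith(">"):
                # Assumption: the first whitespace-delimited token is the gene name.
                fields = line[1:].split()
                if fields:
                    order.append(fields[0])
    return order


def order_alignments(paths: list[Path], order: list[str]) -> list[Path]:
    """Sort alignment files by reference rank; unknown genes go last."""
    rank = {gene: i for i, gene in enumerate(order)}
    # Assumption: the file name up to the first '.' is the gene name.
    return sorted(paths, key=lambda p: rank.get(p.name.split(".")[0], len(rank)))
```

Because Python's sort is stable, files whose gene is absent from the reference keep their original relative order at the end of the list.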

## Test it for yourself

```bash
poetry run pytest -vv
```

Test data can be found in [tests/data](tests/data). To run Corecomb on it:

```bash
corecomb \
--gene_al_dir tests/data/aligned_gene_sequences_raw \
--pan_fa tests/data/pan_genome_reference.fa \
--extension fas \
--outfile corecomb.xmfa
```

## Use the XMFA with ClonalFrameML

```bash
ClonalFrameML \
input_tree.nwk \
corecomb.xmfa \
cfml_output_basename \
-xmfa_file true \
-show_progress true \
-output_filtered true
```
