https://github.com/kjenike/panagram


Host: GitHub
URL: https://github.com/kjenike/panagram
Owner: kjenike
Created: 2022-09-01T16:32:40.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-05-21T19:29:20.000Z (7 months ago)
Last Synced: 2024-05-21T20:42:44.470Z (7 months ago)
Language: Python
Size: 1.14 MB
Stars: 53
Watchers: 8
Forks: 3
Open Issues: 9
Metadata Files:
- Readme: README.md


# Panagram: Interactive, alignment-free pan-genome browser

#### Katie Jenike, Sam Kovaka, Shujun Ou, Stephen Hwang, Srividya Ramakrishnan, Ben Langmead, Zach Lippman, Michael Schatz

[An alignment-free pan-genome viewer](https://www.dropbox.com/s/g7snjgr8bs6c2uj/2023.01.17.Panagram.pdf)

Please note: installation instructions and pre-processing scripts are a work in progress.

# Installation

```
git clone --recursive https://github.com/kjenike/panagram.git
cd panagram
pip install .
```

The `--recursive` option is required to install the KMC dependency. If you forget to include it, you can update the repository with the command `git submodule update --init`.

Installation may fail if pip or setuptools is out of date. To update both, run:
```
pip install --upgrade pip
pip install --upgrade setuptools
```

## Dependencies

Requires Python >=3.7, pip, samtools, and tabix. All other dependencies should be installed automatically by pip.
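
Since samtools and tabix are not installed by pip, it can help to confirm they are available before installing. The following is a minimal sketch of such a check, not part of Panagram itself; the function name `missing_prerequisites` is hypothetical, and it only tests that the executables are on `PATH`, not that their versions are suitable.

```python
import shutil
import sys

# Prerequisites named above that pip will not install for you.
REQUIRED_TOOLS = ("samtools", "tabix")
MIN_PYTHON = (3, 7)


def missing_prerequisites(tools=REQUIRED_TOOLS, min_python=MIN_PYTHON):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if sys.version_info < min_python:
        problems.append("Python %d.%d or newer is required" % min_python)
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append("%s was not found on PATH" % tool)
    return problems
```

Calling `missing_prerequisites()` with no arguments checks the defaults listed above.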

Panagram relies on [KMC](https://github.com/refresh-bio/KMC) to build its kmer index. KMC should be installed automatically, but it is possible for the KMC installation to fail while panagram itself installs successfully. In this case `panagram view` can be run, but `panagram index` will return an error. You may be able to debug the KMC installation by running `make -C KMC py_kmc_api` and fixing any errors it reports, then re-running `pip install -v .`.

# Running
Panagram runs in two steps: pre-processing (the `index` command) and viewing (the `view` command).

# Preprocessing
Usage:
Anchor KMC bitvectors to reference FASTA files to create pan-kmer bitmap
```
usage: panagram index [-h]
```
See the example config.toml file for more details on the layout. The config file must include paths to all of the FASTA files and, optionally, any annotations in GFF format.

Panagram may fail to index datasets with more than 32 genomes. This is **not** a fundamental limitation, and we are working on fixing it.

Currently genome IDs should only contain alphanumeric characters and underscores due to KMC requirements.
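
The two restrictions above can be checked before indexing. This is a rough sketch rather than anything Panagram provides: `check_genome_ids` is a hypothetical helper, and it assumes that "alphanumeric characters and underscores" means ASCII `[A-Za-z0-9_]`.

```python
import re

# Assumption: only ASCII letters, digits, and underscores are accepted.
VALID_ID = re.compile(r"[A-Za-z0-9_]+")
# Indexing may fail above this many genomes, per the note above.
MAX_RELIABLE_GENOMES = 32


def check_genome_ids(ids):
    """Return a list of problems found in the genome IDs (empty if none)."""
    ids = list(ids)
    problems = [
        f"invalid genome ID: {genome_id!r}"
        for genome_id in ids
        if not VALID_ID.fullmatch(genome_id)
    ]
    if len(ids) > MAX_RELIABLE_GENOMES:
        problems.append(
            f"{len(ids)} genomes listed; indexing may fail above {MAX_RELIABLE_GENOMES}"
        )
    return problems
```

For example, `check_genome_ids(["ecoli", "e.coli"])` flags only the second ID, because it contains a period.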

# View

Usage:
Display panagram viewer
```
usage: panagram view [-h] [genome] [chrom] [start] [end]
  index_dir        Panagram index directory
  genome           Initial anchor genome (optional)
  chrom            Initial chromosome (optional)
  start            Initial start coordinate (optional)
  end              Initial end coordinate (optional)
  --ndebug         Run server in production mode (important for a public-facing server)
  --port str       Server port (default: 8050)
  --host str       Server address (default: 127.0.0.1)
  --url_base str   A local URL prefix to use app-wide (passed to Dash.dash(url_base_pathname=...)) (default: /)
  --bookmarks str  Bed file with bookmarked regions (default: None)
```

This runs a local Dash server. By default, the browser can be viewed at http://127.0.0.1:8050/.

# Bitdump

Usage:
```
usage: panagram bitdump [-h] [-v bool] index_dir coords step
Query pan-kmer bitmap generated by "panagram index"

  index_dir    Panagram index directory
  coords       Coordinates to query in chr:start-end format
  step         Spacing between output kmers (optimized for multiples of 100) (default: 1)
  -v bool, --verbose bool
               Output the full bitmap (default: False)
```
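
The `coords` argument uses the `chr:start-end` format. As an illustration of that format only, here is a small sketch of how such a string could be split and sanity-checked before calling `panagram bitdump`. The `parse_coords` function is hypothetical and is not Panagram's own parser; splitting on the last colon is an assumption made so that sequence names containing colons still work.

```python
def parse_coords(coords):
    """Split a 'chr:start-end' string into (chrom, start, end)."""
    # Split on the last colon so the chromosome name may itself contain colons.
    chrom, colon, span = coords.rpartition(":")
    if not colon or not chrom:
        raise ValueError(f"expected chr:start-end, got {coords!r}")
    start_text, dash, end_text = span.partition("-")
    if not dash:
        raise ValueError(f"expected chr:start-end, got {coords!r}")
    # int() raises ValueError itself if either part is not a number.
    start, end = int(start_text), int(end_text)
    if start < 0 or start > end:
        raise ValueError(f"start must be non-negative and not exceed end: {coords!r}")
    return chrom, start, end
```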

# Example run

First download the example_data.zip bacterial data from:
[http://data.schatz-lab.org/panagram/](http://data.schatz-lab.org/panagram/)

[Direct link](https://bx.bio.jhu.edu/data/panagram/example_data.zip)

Unzip the archive to find 5 bacterial genomes plus their annotations:
```
unzip example_data.zip
```

To run, first index the genomes:

```
cd example_data
panagram index conf.toml
```
Annotations are supported in GFF format, and it is very important that any GFF files are correctly formatted. If you run into problems, we strongly suggest checking the annotation format first. This can be done with command-line tools such as gff3validator, or online at https://genometools.org/cgi-bin/gff3validator.cgi
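
For a quick first pass before reaching for a full validator, a few structural checks catch many common problems. The sketch below is not a replacement for gff3validator and is not part of Panagram. The hypothetical `gff_problems` function only checks that each feature line has nine tab-separated columns and that the start and end columns are integers with start not greater than end.

```python
def gff_problems(lines):
    """Return a list of structural problems found in GFF lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # An embedded sequence section ends the feature table.
        if line.startswith("##FASTA"):
            break
        # Skip blank lines, comments, and directives.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            problems.append(
                f"line {number}: expected 9 tab-separated columns, found {len(fields)}"
            )
            continue
        start, end = fields[3], fields[4]
        if not (start.isdigit() and end.isdigit()):
            problems.append(f"line {number}: start and end must be integers")
        elif int(start) > int(end):
            problems.append(f"line {number}: start {start} is greater than end {end}")
    return problems
```

To check a file, open it and pass the file object, for example `gff_problems(open("annotations.gff"))`, where the file name is a placeholder.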

Then run panagram to visualize the index (from the example_data directory):
```
panagram view .
```

From there, you can view the results in your web browser at [http://127.0.0.1:8050/](http://127.0.0.1:8050)

# Hosting and Proxies

Panagram uses [Dash](https://dash.plotly.com/introduction) to serve the plotly visualizations.
By default the dedicated webserver runs on localhost (127.0.0.1) on port 8050, but you can reverse proxy to a different port and path using a web server
such as [nginx](https://www.nginx.com/).

For nginx, first add the following to your nginx configuration file (be very careful
with the use of the slash ('/') character):

```
location /panagram {
    proxy_pass http://127.0.0.1:8050;
}
```

Then restart nginx with:

```
systemctl stop nginx
systemctl start nginx
```

For a secure public-facing server, be sure to run with the option `panagram view --ndebug` to disable debug mode.

You may also wish to change the base URL path with the `--url_base` option, for example to something like `--url_base /panagram/`. The port and host name can be specified by the `--port` and `--host` options.
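
When launching the viewer from a script, the options above can be assembled into an argument list. This is only a sketch: `view_command` and `normalize_url_base` are hypothetical helpers, the flags are those listed in the `panagram view` usage, and the slash normalization assumes that Dash expects the URL prefix to begin and end with `/`.

```python
def normalize_url_base(url_base):
    """Ensure the prefix starts and ends with a single slash."""
    stripped = url_base.strip("/")
    return "/" + stripped + "/" if stripped else "/"


def view_command(index_dir, port=8050, host="127.0.0.1", url_base="/", ndebug=True):
    """Build the argument list for a `panagram view` invocation."""
    command = ["panagram", "view"]
    if ndebug:
        # Disable debug mode, as recommended for a public-facing server.
        command.append("--ndebug")
    command += [
        "--port", str(port),
        "--host", host,
        "--url_base", normalize_url_base(url_base),
        index_dir,
    ]
    return command
```

The resulting list can be passed directly to `subprocess.run`.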

Finally, you will need to run panagram using `panagram view`. You will probably want to run this in a loop
in case it needs to be restarted, such as:

```
until panagram view --ndebug .; do echo "restarting"; sleep 1; done
```
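
The same restart loop can be written in Python if you prefer to manage the server from a script. This is a sketch that mirrors the shell loop above; `run_until_success` is a hypothetical helper, and the optional `max_restarts` limit is an addition that the shell version does not have.

```python
import subprocess
import time


def run_until_success(command, delay=1.0, max_restarts=None):
    """Re-run command until it exits with status 0; return the restart count."""
    restarts = 0
    while subprocess.run(command).returncode != 0:
        if max_restarts is not None and restarts >= max_restarts:
            raise RuntimeError(f"gave up after {restarts} restarts")
        print("restarting")
        time.sleep(delay)
        restarts += 1
    return restarts


# Example (not executed here):
# run_until_success(["panagram", "view", "--ndebug", "."])
```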

We will optimize this process in future releases.

# Example config.toml file

```
k = 12
prefix = "."
processes = 5

lowres_step = 100
chr_bin_kbp = 200

gff_anno_types = ["exon", "CDS"]

[kmc]
memory = 10
processes = 5
threads = 4

[fasta]
ecoli = "FASTAS/ecoli_GCF_001612495.1_ASM161249v1_genomic.fna"
ecoli_k12 = "FASTAS/ecoli_k12_GCF_000005845.2_ASM584v2_genomic.fna"
klebsiella = "FASTAS/klebsiella_GCF_000240185.1_ASM24018v2_genomic.fna"
salmonella = "FASTAS/salmonella_GCF_016117835.1_ASM1611783v1_genomic.fna"
shigella = "FASTAS/shigella_GCF_000006925.2_ASM692v2_genomic.fna"

[gff]
ecoli = "gffs/ecoli_GCF_001612495.1_ASM161249v1_genomic.gff"
ecoli_k12 = "gffs/ecoli_k12_GCF_000005845.2_ASM584v2_genomic.gff"
klebsiella = "gffs/klebsiella_GCF_000240185.1_ASM24018v2_genomic.gff"
salmonella = "gffs/salmonella_GCF_016117835.1_ASM1611783v1_genomic.gff"
shigella = "gffs/shigella_GCF_000006925.2_ASM692v2_genomic.gff"
```
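
A config file laid out like the example can be checked for consistency before indexing. The sketch below is not part of Panagram: `load_config` and `config_problems` are hypothetical helpers, `tomllib` requires Python 3.11 or newer, and the rule that every `[gff]` key must also appear under `[fasta]` is an assumption inferred from the example, where both tables use the same genome IDs.

```python
def load_config(path):
    """Parse a TOML config file into a dictionary."""
    import tomllib  # standard library in Python 3.11+

    # tomllib.load expects a file opened in binary mode.
    with open(path, "rb") as handle:
        return tomllib.load(handle)


def config_problems(config):
    """Return a list of problems in a parsed config (empty if none)."""
    fasta = config.get("fasta", {})
    gff = config.get("gff", {})
    problems = []
    if not fasta:
        problems.append("[fasta] table is missing or empty")
    for name in gff:
        # Assumption: annotations are matched to genomes by key.
        if name not in fasta:
            problems.append(f"[gff] entry {name!r} has no matching [fasta] entry")
    return problems
```

For the example data this would be called as `config_problems(load_config("conf.toml"))`.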

# Using Snakemake (dev branch)
The dev branch, while still under active development, currently uses Snakemake. It is straightforward to use: you just need a tsv file with a list of samples and their corresponding fasta files.

Example tsv file:
```
name fasta gff id anchor
ecoli FASTAS/ecoli_GCF_001612495.1_ASM161249v1_genomic.fna ANNO/ecoli_GCF_001612495.1_ASM161249v1_genomic.gff 0 True
ecoli_k12 FASTAS/ecoli_k12_GCF_000005845.2_ASM584v2_genomic.fna ANNO/ecoli_k12_GCF_000005845.2_ASM584v2_genomic.gff 1 True
klebsiella FASTAS/klebsiella_GCF_000240185.1_ASM24018v2_genomic.fna ANNO/klebsiella_GCF_000240185.1_ASM24018v2_genomic.gff 2 True
salmonella FASTAS/salmonella_GCF_016117835.1_ASM1611783v1_genomic.fna ANNO/salmonella_GCF_016117835.1_ASM1611783v1_genomic.gff 3 True
shigella FASTAS/shigella_GCF_000006925.2_ASM692v2_genomic.fna ANNO/shigella_GCF_000006925.2_ASM692v2_genomic.gff 4 True
```
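
A sample sheet like this can be generated rather than typed by hand. The following sketch uses the column names from the example header; `samples_tsv` is a hypothetical helper, and numbering the `id` column from 0 and setting `anchor` to `True` for every sample are assumptions copied from the example, not documented behaviour.

```python
import csv
import io

# Column names taken from the example header above.
COLUMNS = ["name", "fasta", "gff", "id", "anchor"]


def samples_tsv(samples):
    """Render (name, fasta_path, gff_path) tuples as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for index, (name, fasta, gff) in enumerate(samples):
        # Assumption: ids count up from 0 and every sample is an anchor.
        writer.writerow([name, fasta, gff, index, True])
    return buffer.getvalue()
```

Write the returned string to a file to use it as the sample sheet.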

# Known issues
- There is currently a bug (issue #7) when indexing very large genomes with very large chromosomes. We are actively working to fix this.
- Indexing sometimes fails when working with more than 32 genomes.
- Mash dendrogram leaf placement is not always perfect.
- Installing on macOS can be tricky. We will need to include a more detailed list of dependencies.

# Ideas for improvement
- Add a row for gene coverage (rather than just gene density) for the third tab.
- Update the step size in the control panel.
- Add the actual sequence.

## More information coming soon!