https://github.com/dubssieg/pancat

Pangenome graphs visualisation, distance computing, reconstruction of sequences and other utility functions

pangenome pangenome-graph pangenomics variation-graph variation-graphs


[![](https://img.shields.io/badge/python-3.10-blue.svg)]()
[![](https://img.shields.io/badge/python-3.11-blue.svg)]()
[![](https://img.shields.io/badge/python-3.12-blue.svg)]()
[![](https://img.shields.io/badge/documentation-unfinished-orange.svg)]()
[![](https://img.shields.io/badge/docs-unfinished-orange.svg)](https://tharos-ux.github.io/pangenome-notes/)

# PANCAT - PANgenome Comparison and Analysis Toolkit

> [!WARNING]\
> A paper about this work is in preparation. If you are considering using this tool, please contact the author for attribution.

PANCAT is a command-line tool that bundles functions for working with GFA-like graphs, such as extracting subgraphs from a pangenome graph or adding offset (position) information to it.
It can compare the topology of graphs that contain the same set of sequences, and it renders pangenome graphs as interactive HTML files.
It uses the [gfagraphs library](https://pypi.org/project/gfagraphs/) to load and manipulate pangenome graphs.
Details about the implementation can be [found here](https://hal.science/hal-04213245) (in French only, sorry).

![](https://media.discordapp.net/attachments/874430800802754623/1180182798968033280/graph_big.png)

> [!NOTE]\
> Want to contribute? Feel free to open a PR on an issue about a missing, buggy or incomplete feature!

## Installation

Requires **python $\geq$ 3.10**.

Install the tool with the following commands. Updates can be run using `just` (requires [just](https://github.com/casey/just)).

```bash
git clone https://github.com/Tharos-ux/pancat.git
cd pancat
pip install -r requirements.txt --upgrade
python -m pip install . --quiet
```

## Troubleshooting

> [!WARNING]\
> This tool is under heavy development, and so is its [associated library](https://github.com/Tharos-ux/gfagraphs). I advise running `pip install gfagraphs --upgrade` every now and then, in particular whenever you update the tool. Issues are more than welcome, as I could not test every use case! Feel free to [open one here](https://github.com/Tharos-ux/pancat/issues) if any problem occurs.

## Quick start: provided commands

This program is a collection of tools. Not every function or script is accessible through the `pancat` front-end, but this front-end showcases what the tools can do.
Other tools are in the `scripts` folder.

The following commands are available through `pancat`:

- **offset** adds relative position information as a tag in the GFA file
- **correct** (WIP, experimental) corrects the graph by adding missing information back into it
- **grapher** creates an interactive graph representation from a GFA file
- **multigrapher** creates an interactive graph representation of the differences between two pangenome graphs
- **stats** gathers basic stats from the input GFA
- **complete** assesses whether the graph is a complete pangenome graph (all genomes fully embedded in the graph)
- **reconstruct** recreates the linear sequences from the graph
- **edit** computes an edit distance between variation graphs
- **compress** (WIP, experimental) losslessly compresses the graph by collapsing substitution bubbles
- **unfold** (WIP, experimental) breaks cycles in the graph by adding nodes and edges to it

The following commands were available before (and will be back soon):
- **isolate** extracts a subgraph from positions in the paths
- **neighborhood** extracts a subgraph from a set of nodes around a node
- **cycles** detects and (optionally) linearizes all loops in the graph

## Render interactive html view

With this command, you can create an interactive HTML view of your graph, with sequences in the nodes (S-lines) and nodes connected by edges (L-lines). If additional information is given (such as W-lines or P-lines), supplementary edges are drawn to show the paths that the genomes follow in the graph.

```bash
pancat grapher [-h] [-b BOUNDARIES [BOUNDARIES ...]] file output

positional arguments:
  file                  Path to a gfa-like file
  output                Output path for the html graph file.

options:
  -h, --help            show this help message and exit
  -b BOUNDARIES [BOUNDARIES ...], --boundaries BOUNDARIES [BOUNDARIES ...]
                        One or a list of ints to use as boundaries for display (ex : -b 50 2000 will set 3 colors : one for nodes in range 0-50bp, one for nodes in range 51-2000 bp
                        and one for nodes in range 2001-inf bp).
```

When using this command, please only work with graphs of fewer than 10k nodes. To get there, you may flatten the graph or extract subgraphs (for instance with **pancat neighborhood** or **pancat isolate**, once those commands are available again).

The `-b`/`--boundaries` option lets you choose the size classes to distinguish. Each class is drawn in its own color, and the number of nodes in each class is computed separately.

The `output` argument may be either a path to a folder (existing or not) or a path to a file (with or without the `.html` extension).
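
To make the S-line, L-line and P-line vocabulary used above concrete, here is a minimal sketch of how nodes, edges and path steps can be read from GFA1 text. It is an illustration only: the toy graph, the function name `read_graph` and the returned structures are assumptions made for this example, and this is not how `pancat` or `gfagraphs` load graphs.

```python
# Toy GFA1 graph: three nodes, three edges, two paths (hypothetical data).
GFA_TEXT = "\n".join([
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\tG",
    "S\t3\tTTA",
    "L\t1\t+\t2\t+\t0M",
    "L\t1\t+\t3\t+\t0M",
    "L\t2\t+\t3\t+\t0M",
    "P\tsample_a\t1+,2+,3+\t*",
    "P\tsample_b\t1+,3+\t*",
])


def read_graph(gfa_text):
    """Collect nodes (S-lines), edges (L-lines) and path steps (P-lines)."""
    nodes = {}       # node name -> sequence
    edges = []       # (from, from_orientation, to, to_orientation)
    path_edges = {}  # path name -> consecutive pairs of oriented nodes
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            nodes[fields[1]] = fields[2]
        elif fields[0] == "L":
            edges.append((fields[1], fields[2], fields[3], fields[4]))
        elif fields[0] == "P":
            steps = fields[2].split(",")
            # Consecutive steps are the "supplementary edges" a viewer could draw.
            path_edges[fields[1]] = list(zip(steps, steps[1:]))
    return nodes, edges, path_edges


nodes, edges, path_edges = read_graph(GFA_TEXT)
print(len(nodes), "nodes,", len(edges), "edges")
print(path_edges["sample_a"])
```

W-lines (walks) carry the same kind of information as P-lines in a different column layout; they are left out of this sketch for brevity.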

## Compute stats on your graph

With this command, you can output basic stats on your graph.

```bash
pancat stats [-h] [-b BOUNDARIES [BOUNDARIES ...]] file

positional arguments:
  file                  Path to a gfa-like file

options:
  -h, --help            show this help message and exit
  -b BOUNDARIES [BOUNDARIES ...], --boundaries BOUNDARIES [BOUNDARIES ...]
                        One or a list of ints to use as boundaries for display (ex : -b 50 2000 will set 3 colors : one for nodes in range 0-50bp, one for nodes in range 51-2000 bp
                        and one for nodes in range 2001-inf bp).
```

This program prints stats to the command line (stdout). You may redirect the output to a file if you want to use it on a cluster, for example `pancat stats graph.gfa > out.txt`.

The `-b`/`--boundaries` option lets you choose the size classes to distinguish. The number of nodes in each class is computed separately.
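
As an illustration of how boundaries translate into size classes, the sketch below bins node lengths the way the help text describes (`-b 50 2000` gives 0-50 bp, 51-2000 bp and 2001+ bp). The function names and the sample lengths are assumptions for this example; the actual implementation in `pancat` may differ.

```python
from bisect import bisect_left
from collections import Counter


def size_class(length, boundaries):
    """Index of the size class a node length falls into.

    With boundaries [50, 2000]: 0 -> 0-50 bp, 1 -> 51-2000 bp, 2 -> 2001+ bp.
    Each boundary is the inclusive upper limit of its class.
    """
    return bisect_left(sorted(boundaries), length)


def count_by_class(node_lengths, boundaries):
    """Number of nodes in each size class, in class order."""
    counts = Counter(size_class(length, boundaries) for length in node_lengths)
    return [counts.get(i, 0) for i in range(len(boundaries) + 1)]


lengths = [1, 50, 51, 300, 2000, 2001, 15000]  # hypothetical node lengths in bp
print(count_by_class(lengths, [50, 2000]))
```

Note that `n` boundaries always give `n + 1` classes, since the last class is open-ended.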

## Extract sequences from the graph

With this command, you can reconstruct linear sequences from the graph.

```bash
pancat reconstruct [-h] -r REFERENCE [--start START] [--stop STOP] [-s] file out

positional arguments:
  file                  Path to a gfa-like file
  out                   Output path (without extension)

options:
  -h, --help            show this help message and exit
  -r REFERENCE, --reference REFERENCE
                        Tells the reference sequence we seek start and stop into
  --start START         To specifiy a starting node on reference to create a subgraph
  --stop STOP           To specifiy a ending node on reference to create a subgraph
  -s, --split           Tells to split in different files
```

For this function, the `-r`/`--reference` option is needed only if you specify starting and ending points.
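
Conceptually, reconstructing a linear sequence means walking a path and concatenating the sequences of its nodes, reverse-complementing the nodes traversed in reverse orientation. The sketch below shows that idea on GFA1 P-lines. It assumes segments do not overlap (`0M` or `*` overlaps), uses made-up data and function names, and is not the code `pancat reconstruct` runs.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Reverse complement of a DNA sequence (A/C/G/T only)."""
    return sequence.translate(COMPLEMENT)[::-1]


def reconstruct_paths(gfa_text):
    """Spell out the linear sequence of every P-line in a GFA1 text."""
    segments = {}  # node name -> sequence
    paths = {}     # path name -> list of oriented steps such as "3-"
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            segments[fields[1]] = fields[2]
        elif fields[0] == "P":
            paths[fields[1]] = fields[2].split(",")
    sequences = {}
    for name, steps in paths.items():
        parts = []
        for step in steps:
            node, orientation = step[:-1], step[-1]
            sequence = segments[node]
            parts.append(sequence if orientation == "+" else reverse_complement(sequence))
        sequences[name] = "".join(parts)
    return sequences


# Hypothetical toy graph: sample_b traverses node 3 in reverse orientation.
toy_gfa = "\n".join([
    "S\t1\tACGT",
    "S\t2\tG",
    "S\t3\tTTA",
    "P\tsample_a\t1+,2+,3+\t*",
    "P\tsample_b\t1+,3-\t*",
])
print(reconstruct_paths(toy_gfa))
```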

## Adding a coordinate system

With this command, you can add a GFA-compatible JSON string to each S-line of the graph (each node). This field contains the starting position, ending position and orientation of the node for each path in the graph.

```bash
pancat offset [-h] file out

positional arguments:
  file                  Path to a gfa-like file
  out                   Output path (with extension)

options:
  -h, --help            show this help message and exit
```
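
The coordinate system described above can be pictured as a running position along each path. The sketch below computes, for every node, its start, end and orientation in each path that visits it, then serialises one node's record as JSON. The 0-based, end-exclusive convention, the function name `path_offsets` and the record layout are assumptions for illustration; the exact tag and format written by `pancat offset` are not documented here.

```python
import json


def path_offsets(node_lengths, paths):
    """Per-node positions along each path.

    node_lengths: {node: length in bp}
    paths: {path_name: [(node, orientation), ...]}
    Returns {node: {path_name: [(start, end, orientation), ...]}}.
    A node may appear several times in a path, hence the list.
    """
    offsets = {node: {} for node in node_lengths}
    for name, steps in paths.items():
        position = 0
        for node, orientation in steps:
            end = position + node_lengths[node]
            offsets[node].setdefault(name, []).append((position, end, orientation))
            position = end
    return offsets


# Hypothetical toy data.
node_lengths = {"1": 4, "2": 1, "3": 3}
paths = {
    "sample_a": [("1", "+"), ("2", "+"), ("3", "+")],
    "sample_b": [("1", "+"), ("3", "+")],
}
offsets = path_offsets(node_lengths, paths)
print(json.dumps(offsets["3"]))
```

Node `3` starts at a different position in each path because `sample_b` skips node `2`, which is exactly the information a per-path coordinate system makes explicit.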

## Compute edit distance between graphs

To be comparable, two graphs must meet both of these criteria:
+ they share at least some paths
+ reconstructing those shared paths yields the same sequences in both graphs

If those criteria are met, you may compare your graphs.
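
The two criteria can be checked before launching a comparison. The sketch below assumes you already hold the reconstructed sequences of each graph as a `{path name: sequence}` mapping (for instance from the output of `pancat reconstruct`); the function name `comparable_paths` and the sample data are hypothetical.

```python
def comparable_paths(sequences_a, sequences_b):
    """Return the path names on which two graphs can be compared.

    Raises ValueError if the graphs share no path, or if a shared path
    spells a different sequence in each graph.
    """
    shared = sorted(set(sequences_a) & set(sequences_b))
    if not shared:
        raise ValueError("the two graphs share no path names")
    mismatched = [name for name in shared if sequences_a[name] != sequences_b[name]]
    if mismatched:
        raise ValueError(f"paths spell different sequences: {mismatched}")
    return shared


# Hypothetical reconstructed sequences for two graphs.
graph_a = {"sample_1": "ACGTGTTA", "sample_2": "ACGTTTA"}
graph_b = {"sample_1": "ACGTGTTA", "sample_3": "ACGGTTA"}
print(comparable_paths(graph_a, graph_b))
```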

```bash
pancat edit [-h] -o OUTPUT_PATH [-p PATTERN] [-g] [-c CORES] [-s [SELECTION ...]] [-t] graph_A graph_B

positional arguments:
  graph_A               Path to a GFA-like file.
  graph_B               Path to a GFA-like file.

options:
  -h, --help            show this help message and exit
  -o OUTPUT_PATH, --output_path OUTPUT_PATH
                        Path to a .json output for results.
  -p PATTERN, --pattern PATTERN
                        Regexp to filter if present in path/walks names.
  -g, --graph_level     Asks to perform edition computation at graph level.
  -c CORES, --cores CORES
                        Number of cores for computing edition
  -s [SELECTION ...], --selection [SELECTION ...]
                        Names of the paths you want to compute edition on.
  -t, --trace_memory    Print to log file memory usage of data structures.
```

The `-p`/`--pattern` option accepts a regular expression, which makes it easier to match paths whose names differ between the two graphs. For instance, with HPRC files, `pancat edit $CACTUS $PGGB --output_path $WD"hprc_21_edition.json" --graph_level --cores 16 --pattern "^(.+?)#" --trace_memory` can be used to compare individual chromosomes.
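
To see what the pattern `^(.+?)#` captures, the sketch below applies it to a few path names written in the `sample#haplotype#contig` style. The names are made up, and pairing paths by the captured prefix is an assumption about how such a pattern can be used, not a description of `pancat`'s internals.

```python
import re

PATTERN = re.compile(r"^(.+?)#")


def sample_key(path_name):
    """Text before the first '#' in a path name, or None if there is no '#'."""
    match = PATTERN.search(path_name)
    return match.group(1) if match else None


# Hypothetical path names from two graphs built by different tools.
names_a = ["HG002#1#chr21", "GRCh38#0#chr21"]
names_b = ["HG002#chr21_hap1", "CHM13#0#chr21"]

keys_a = {sample_key(name) for name in names_a}
keys_b = {sample_key(name) for name in names_b}
print(sorted(keys_a & keys_b))
```

Because `.+?` is non-greedy, the capture stops at the first `#`, so names that differ only after the sample identifier end up with the same key.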