https://github.com/nschan/syri_gggenomes

Plot SyRi output via gggenomes
https://github.com/nschan/syri_gggenomes
genomics-visualization gggenomes ggplot2 syri
Last synced: 9 days ago
JSON representation
Plot SyRi output via gggenomes
Host: GitHub
URL: https://github.com/nschan/syri_gggenomes
Owner: nschan
License: mit
Created: 2024-09-12T14:09:18.000Z (8 months ago)
Default Branch: main
Last Pushed: 2025-04-04T09:28:46.000Z (about 2 months ago)
Last Synced: 2025-04-04T10:29:27.117Z (about 2 months ago)
Topics: genomics-visualization, gggenomes, ggplot2, syri
Language: R
Homepage:
Size: 5.69 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

        Parse SyRi into R and plot with gggenomes

================

Niklas Schandry

# About

Here I provide a function to read in SyRi outputs from

[`nf-plotsv`](https://github.com/nschan/nf-plotsv) for plotting.

Running this requires ‘tidyverse’ (`dplyr`, `dtplyr`, `magrittr`, and

`vroom`) and the output is designed to be compatible with

[`gggenomes`](https://github.com/thackl/gggenomes) for plotting.

This repo also comes with a snapshot that can be used with

`renv::restore()`.

The calculation of polygons to draw curves between sequences is directly

lifted from the amazing

[`GENESPACE`](https://github.com/jtlovell/GENESPACE) package, but

`GENESPACE` is not a dependency.

The files included in `data/` for demonstration are the

[`plotsr`](https://github.com/schneebergerlab/plotsr/) example files.

# Input

`parse_syri()` was intended to work with the outputs from

[`nf-plotsv`](https://github.com/nschan/nf-plotsv). Therefore, the

script expects the SyRi output to be named

`genomeA_on_genomeB.syri.out`, and will split based on this. There is

*no* flexibility here.

# Function reference

`parse_syri()` has a number of arguments. Most of those are outlined

below with [examples](#Options):

    files:                    a list of files. These files are expected to: end with `.syri.out` 

                              and follow the naming scheme A_on_B.syri.out

    order:                    a dataframe with a column bin_id , containing the order of genomes

    chroms:                   (optional) list of chromosomes to retain.

    spacing:                  spacing between chromosomes from the same genome (bin_id). 

                              This spacing works the same way as the spacing parameter of

                              gggenomes: "between sequences in bases (>1) or relative to 

                              longest bin (<1)",which is actually relative to 

                              (longest bin)/sqrt(number of seq_ids).

                              Default: 0.05

    resize_polygons:          (logical) should polygons of short links be resized?

                              Default: TRUE

    resize_polygons_size:     if polygons are resized, to what fraction of the total length?

                              Default: 0.003

    min_polygon_feat_size:    minimum length of links to be resized. Default: 5000

    no_polygons:              (logical) do not compute polygons. 

                              Default: FALSE, will compute polygons.

    verbose:                  (logical), if TRUE returns some extra information for debugging.

                              Default: FALSE

`parse_syri()` returns a list of data-frames:

    $seqs:                    contains sequenece information, compatible with gggenomes

    $links:                   contains links between sequences, compatible with gggenomes

    $polygons:                contains polygons that can be plotted via `geom_polygon()`

# Running

In this example, `genomeA` is `col` and `genomeB` is `ler.`

The output from SyRi can be parsed using `parse_syri()` (in

`functions/parse_syri.R`)

If not installed, I recommend to install

[`gggenomes`](https://github.com/thackl/gggenomes).

``` r

renv::install("tidyverse","thackl/gggenomes")

```

`parse_syri()` builds on `tidyverse` and uses some special pipes from

`magrittr`.

``` r

library(tidyverse)

library(gggenomes)

library(magrittr)

source("functions/parse_syri.R") # Contains syri_plot_fills

```

## Data in

Data is read using `parse_syri()`.

``` r

dat <- parse_syri("data/col_on_ler.syri.out",

                  order = data.frame(bin_id = c("col","ler"))

                  )

```

## Plotting

After parsing the data, it is ready for plotting.

## gggenomes links

The parsed data can be used with `gggenomes` geoms, such as `geom_seq`,

`geom_bin`, `geom_link`, etc.

``` r

gggenomes::gggenomes(seqs = dat$seqs,

                     links = dat$links) + 

  geom_seq(linewidth = 1) + 

  geom_bin_label(size=7) +

  geom_link(aes(fill = type),color = NA) +

  syri_plot_fills  +

  ggtitle("Synteny between Col and Ler")

```

![](parse_files/figure-gfm/unnamed-chunk-4-1.png)

## With polygons

`gggenomes::geom_link()` currently draws simple rectangles. An

alternative is to draw sequence relationships using polygons. These

polygons are computed during parsing (unless `no_polygons` is set to

`TRUE`) and returned in a dataframe in the `$polys` slot of the list.

``` r

gggenomes::gggenomes(seqs = dat$seqs,

                     links = dat$links) + 

  geom_polygon(

    data = dat$polys %>% filter(direct) %>% filter(type == "SYN"),

    aes(

      x = x,

      y = y,

      fill = type,

      group = link_grp

    ),

    alpha = 0.6

  ) +

  geom_polygon(

    data = dat$polys %>% filter(direct) %>% filter(type != "SYN"),

    aes(

      x = x,

      y = y,

      fill = type,

      group = link_grp

    ),

    alpha = 0.8

  ) +

  geom_seq(linewidth = 1) + 

  geom_bin_label(size=7) +

  #geom_link() +

  syri_plot_fills  +

  ggtitle("Synteny between Col and Ler")

```

![](parse_files/figure-gfm/unnamed-chunk-5-1.png)

## Sequence labels

Names of individual sequences can be added using

e.g. `gggenomes::geom_seq_label()`

``` r

gggenomes::gggenomes(seqs = dat$seqs,

                     links = dat$links) + 

  geom_polygon(

    data = dat$polys %>% filter(direct) %>% filter(type == "SYN"),

    aes(

      x = x,

      y = y,

      fill = type,

      group = link_grp

    ),

    alpha = 0.6

  ) +

  geom_polygon(

    data = dat$polys %>% filter(direct) %>% filter(type != "SYN"),

    aes(

      x = x,

      y = y,

      fill = type,

      group = link_grp

    ),

    alpha = 0.8

  ) +

  geom_seq(linewidth = 1) + 

  geom_bin_label(size=7) +

  geom_seq_label(nudge_y = 0.1, hjust = 0, size = 4) +

  syri_plot_fills  +

  ggtitle("Synteny between Col and Ler")

```

![](parse_files/figure-gfm/unnamed-chunk-6-1.png)

In some cases, it might be preferred to change some labels, for example

to standardize them, or to only show some. This can be done by

manipulating the table in `$seqs`. The easiest is to add a new column

that contains new labels, modifying `seq_id` directly is probably a bad

idea as it can break the mapping between sequence names and links.

Below, a new `seqlab` column is created where only labels for `col` are

kept:

``` r

dat$seqs <- dat$seqs %>%

  mutate(seqlab = case_when(bin_id != "col" ~ "",

                            TRUE ~ seq_id)

         )

```

This can be used within `gggenomes::geom_seq_label()`:

``` r

gggenomes::gggenomes(seqs = dat$seqs,

                     links = dat$links) + 

  geom_polygon(

    data = dat$polys %>% filter(direct) %>% filter(type == "SYN"),

    aes(

      x = x,

      y = y,

      fill = type,

      group = link_grp

    ),

    alpha = 0.6

  ) +

  geom_polygon(

    data = dat$polys %>% filter(direct) %>% filter(type != "SYN"),

    aes(

      x = x,

      y = y,

      fill = type,

      group = link_grp

    ),

    alpha = 0.8

  ) +

  geom_seq(linewidth = 1) + 

  geom_bin_label(size=7) +

  geom_seq_label(aes(label = seqlab), nudge_y = 0.1, hjust = 0, size = 4) +

  syri_plot_fills  +

  ggtitle("Synteny between Col and Ler")

```

![](parse_files/figure-gfm/unnamed-chunk-8-1.png)

## Options

### Selecting chromosomes

Sometimes, only a subset of chromosomes is relevant. `parse_syri()`

expects chromosome names to be identical across genomes. If that is the

case, chromosomes can be selected with the `chroms` parameter

``` r

dat <- parse_syri("data/col_on_ler.syri.out",

                  order = data.frame(bin_id = c("col","ler")),

                  chroms = c("Chr1","Chr3")

                  )

```

``` r

gggenomes::gggenomes(seqs = dat$seqs,

                     links = dat$links) + 

  geom_polygon(

    data = dat$polys %>% filter(direct) %>% filter(type == "SYN"),

    aes(

      x = x,

      y = y,

      fill = type,

      group = link_grp

    ),

    alpha = 0.6

  ) +

  geom_polygon(

    data = dat$polys %>% filter(direct) %>% filter(type != "SYN"),

    aes(

      x = x,

      y = y,

      fill = type,

      group = link_grp

    ),

    alpha = 0.8

  ) +

  geom_seq(linewidth = 1) + 

  geom_bin_label(size=7) +

  #geom_link() +

  syri_plot_fills  +

  ggtitle("Synteny between Col and Ler Chromosomes 1 and 3")

```

![](parse_files/figure-gfm/unnamed-chunk-10-1.png)

### Spacing

Sometimes, the default spacing between chromosomes may not be optimal.

`parse_syri()` follows gggenomes in spacing rules. If spacing is \< 1,

it is relative to the longest bin / sqrt(number of sequences), if it is

\>= 1 it is base pairs. The default is 0.05 (as for gggenomes)

#### In basepairs

``` r

dat <- parse_syri("data/col_on_ler.syri.out",

                  order = data.frame(bin_id = c("col","ler")),

                  spacing = 5000000 # spacing in bp

                  )

```

Of course, if the spacing was changed, this also needs to be adjusted in

gggenomes:

``` r

gggenomes::gggenomes(seqs = dat$seqs,

                     links = dat$links,

                     spacing = 5000000) + 

  geom_polygon(

    data = dat$polys %>% filter(direct) %>% filter(type == "SYN"),

    aes(

      x = x,

      y = y,

      fill = type,

      group = link_grp

    ),

    alpha = 0.6

  ) +

  geom_polygon(

    data = dat$polys %>% filter(direct) %>% filter(type != "SYN"),

    aes(

      x = x,

      y = y,

      fill = type,

      group = link_grp

    ),

    alpha = 0.8

  ) +

  geom_seq(linewidth = 1) + 

  geom_bin_label(size=7) +

  syri_plot_fills  +

  ggtitle("Synteny between Col - Ler with 5MB spacing between chromsomes")

```

![](parse_files/figure-gfm/unnamed-chunk-12-1.png)

#### Relative

4 times the standard spacing:

``` r

dat <- parse_syri("data/col_on_ler.syri.out",

                  order = data.frame(bin_id = c("col","ler")),

                  spacing = 0.2 # relative spacing

                  )

```

``` r

gggenomes::gggenomes(seqs = dat$seqs,

                     links = dat$links,

                     spacing = 0.2) + 

  geom_polygon(

    data = dat$polys %>% filter(direct) %>% filter(type == "SYN"),

    aes(

      x = x,

      y = y,

      fill = type,

      group = link_grp

    ),

    alpha = 0.6

  ) +

  geom_polygon(

    data = dat$polys %>% filter(direct) %>% filter(type != "SYN"),

    aes(

      x = x,

      y = y,

      fill = type,

      group = link_grp

    ),

    alpha = 0.8

  ) +

  geom_seq(linewidth = 1) + 

  geom_bin_label(size=7) +

  syri_plot_fills  +

  ggtitle("Synteny between Col - Ler, spacing increased 4x")

```

![](parse_files/figure-gfm/unnamed-chunk-14-1.png)

### No resizing

By default, short syntenic regions larger than 5000 bp are resized to

make them visible. Since this does not reflect the original input, this

can be disabled:

``` r

dat <- parse_syri("data/col_on_ler.syri.out",

                  order = data.frame(bin_id = c("col","ler")),

                  resize_polygons = F)

```

``` r

gggenomes::gggenomes(seqs = dat$seqs,

                     links = dat$links) + 

  geom_polygon(

    data = dat$polys %>% filter(direct) %>% filter(type == "SYN"),

    aes(

      x = x,

      y = y,

      fill = type,

      group = link_grp

    ),

    alpha = 0.6

  ) +

  geom_polygon(

    data = dat$polys %>% filter(direct) %>% filter(type != "SYN"),

    aes(

      x = x,

      y = y,

      fill = type,

      group = link_grp

    ),

    alpha = 0.8

  ) +

  geom_seq(linewidth = 1) + 

  geom_bin_label(size=7) +

  syri_plot_fills  +

  ggtitle("Synteny between Col and Ler without resizing")

```

![](parse_files/figure-gfm/unnamed-chunk-16-1.png)

### Minimum resize size

Only regions larger than `min_polygon_feat_size` are resized (default

5000), this can be modified to also include smaller regions

``` r

dat <- parse_syri("data/col_on_ler.syri.out",

                  order = data.frame(bin_id = c("col","ler")),

                  resize_polygons = T,

                  min_polygon_feat_size = 1000)

```

Naturally, this will create a busier plot.

``` r

gggenomes::gggenomes(seqs = dat$seqs,

                     links = dat$links) + 

  geom_polygon(

    data = dat$polys %>% filter(direct) %>% filter(type == "SYN"),

    aes(

      x = x,

      y = y,

      fill = type,

      group = link_grp

    ),

    alpha = 0.6

  ) +

  geom_polygon(

    data = dat$polys %>% filter(direct) %>% filter(type != "SYN"),

    aes(

      x = x,

      y = y,

      fill = type,

      group = link_grp

    ),

    alpha = 0.8

  ) +

  geom_seq(linewidth = 1) + 

  geom_bin_label(size=7) +

  syri_plot_fills  +

  ggtitle("Synteny between Col and Ler, resizing regions larger than 999bp")

```

![](parse_files/figure-gfm/unnamed-chunk-18-1.png)

### Resize output size

Regions are resized to have a certain length relative to the chromosome,

controlled by `resize_polygons_size`, which defaults to `0.003` (0.3%)

of the chromosome length. Changing this parameter will make resized

regions larger or smaller.

``` r

dat <- parse_syri("data/col_on_ler.syri.out",

                  order = data.frame(bin_id = c("col","ler")),

                  resize_polygons = T,

                  resize_polygons_size = 0.01)

```

This will produce wider polygons for resized links.

``` r

gggenomes::gggenomes(seqs = dat$seqs,

                     links = dat$links) + 

  geom_polygon(

    data = dat$polys %>% filter(direct) %>% filter(type == "SYN"),

    aes(

      x = x,

      y = y,

      fill = type,

      group = link_grp

    ),

    alpha = 0.6

  ) +

  geom_polygon(

    data = dat$polys %>% filter(direct) %>% filter(type != "SYN"),

    aes(

      x = x,

      y = y,

      fill = type,

      group = link_grp

    ),

    alpha = 0.8

  ) +

  geom_seq(linewidth = 1) + 

  geom_bin_label(size=7) +

  syri_plot_fills  +

  ggtitle("Synteny between Col and Ler")

```

![](parse_files/figure-gfm/unnamed-chunk-20-1.png)

# Multiple genomes

Comparing two genomes is nice, but more might be better.

`parse_syri()` can handle multiple inputs in one go when those are

provided as a list:

``` r

file_list <- list.files("data", full.names = T)

syri_order <- data.frame(bin_id = c("col", "ler", "cvi", "eri"))

dat <- parse_syri(file_list, order = syri_order)

```

Making a plot from this works the same way of making a plot of only one

comparison. The order of sequences is set via the `order` argument to

`parse_syri()`

``` r

gggenomes::gggenomes(seqs = dat$seqs,

                     links = dat$links) + 

  geom_polygon(

    data = dat$polys %>% filter(direct) %>% filter(type == "SYN"),

    aes(

      x = x,

      y = y,

      fill = type,

      group = link_grp

    ),

    alpha = 0.6

  ) +

  geom_polygon(

    data = dat$polys %>% filter(direct) %>% filter(type != "SYN"),

    aes(

      x = x,

      y = y,

      fill = type,

      group = link_grp

    ),

    alpha = 0.8

  ) +

  geom_seq(linewidth = 1) + 

  geom_bin_label(size=7) +

  syri_plot_fills  +

  ggtitle("Synteny between Col - Ler - Cvi - Eri")

```

![](parse_files/figure-gfm/unnamed-chunk-22-1.png)

# Keeping chromosomes separate

By default, `parse_syri()` takes all chromosomes from the same genome

(`bin_id`) and puts them on one axis, adding space between them as

needed (See [spacing](#Spacing)).

Sometimes, it might be useful to have the chromosomes each on their own

coordinate system instead. This can be done by making use of the

[`chroms`](#Selecting-chromosomes) argument to read each chromosome

individually and then combining them. Below is an example for the

col-ler-cvi-eri data used above and included in `data/`.

``` r

file_list <- list.files("data", full.names = T)

syri_order <- data.frame(bin_id = c("col", "ler", "cvi", "eri"))

chromosomes <-  c("Chr1", "Chr2", "Chr3", "Chr4", "Chr5")

dat_tmp <- lapply(chromosomes,

                  \(chrom) parse_syri(file_list, order = syri_order, chroms = chrom))

# Bind sequences

dat$seqs <- lapply(1:length(chromosomes), \(l) pluck(dat_tmp, l, "seqs")) %>%

  bind_rows()

# Create y coordinates for sequences

seq_pos <- left_join(dat$seqs, syri_order %>%

                       mutate(y = rev(1:length(bin_id))),

                     by = join_by(bin_id))

# Bind links

dat$links <- lapply(1:length(chromosomes), \(l) pluck(dat_tmp, l, "links")) %>%

  bind_rows()

# Bind polygons

dat$polys <- lapply(1:length(chromosomes), \(l) pluck(dat_tmp, l, "polys")) %>%

  bind_rows()

# Add seq_id column to polygons, only keep polygons that connect the same chromosome

dat$polys <- dat$polys %>%

  mutate(

    Chr_grp1 = str_extract_all(link, "Chr[0-9]*", simplify = T)[, 1],

    Chr_grp2 = str_extract_all(link, "Chr[0-9]*", simplify = T)[, 2]

  ) %>%

  filter(Chr_grp1 == Chr_grp2) %>%

  mutate(seq_id = Chr_grp1)

```

Note that `geom_segment()` should be used to draw chromosomes, since

`geom_seq()` would again place the chromosomes onto a single axis.

``` r

gggenomes::gggenomes(seqs = dat$seqs,

                     links = dat$links) + 

  geom_polygon(

    data = dat$polys %>% filter(direct) %>% filter(type == "SYN"),

    aes(

      x = x,

      y = y,

      fill = type,

      group = link_grp

    ),

    alpha = 0.6

  ) +

  geom_polygon(

    data = dat$polys %>% filter(direct) %>% filter(type != "SYN"),

    aes(

      x = x,

      y = y,

      fill = type,

      group = link_grp

    ),

    alpha = 0.8

  ) +

  geom_segment(aes(x = 0, xend = length, y=y, yend=y), data = seq_pos) +

  geom_bin_label(size=7, 

                 # Avoid overly long extension of x to the left

                 expand_left = 1e-2,

                 nudge_left = 5e-3) +

  facet_wrap(~seq_id, ncol = 1,  scales = "free_x") +

  syri_plot_fills +

  theme(strip.background = element_rect(fill = "white")) +

  ggtitle("Synteny between Col - Ler - Cvi - Eri")

```

![](parse_files/figure-gfm/chrom_plot-1.png)

# Contributing

If you encounter any problems, please open an issue.

If you have suggestions for improvement, please open a pull request.
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/nschan/syri_gggenomes

Awesome Lists containing this project

README