https://github.com/nschan/syri_gggenomes
Plot SyRi output via gggenomes
Parse SyRi into R and plot with gggenomes
================
Niklas Schandry

# About

Here I provide a function to read in SyRi outputs from
[`nf-plotsv`](https://github.com/nschan/nf-plotsv) for plotting.

Running this requires ‘tidyverse’ (`dplyr`, `dtplyr`, `magrittr`, and
`vroom`) and the output is designed to be compatible with
[`gggenomes`](https://github.com/thackl/gggenomes) for plotting.

This repo also comes with a snapshot that can be used with
`renv::restore()`.

The calculation of polygons to draw curves between sequences is directly
lifted from the amazing
[`GENESPACE`](https://github.com/jtlovell/GENESPACE) package, but
`GENESPACE` is not a dependency.

The files included in `data/` for demonstration are the
[`plotsr`](https://github.com/schneebergerlab/plotsr/) example files.

# Input
`parse_syri()` is intended to work with the outputs from
[`nf-plotsv`](https://github.com/nschan/nf-plotsv). The script therefore
expects each SyRi output file to be named
`genomeA_on_genomeB.syri.out` and splits the file name based on this
scheme. There is *no* flexibility here.

# Function reference
`parse_syri()` has a number of arguments. Most of those are outlined
below with [examples](#Options):

- `files`: a list of files. These files are expected to end with `.syri.out` and follow the naming scheme `A_on_B.syri.out`.
- `order`: a dataframe with a column `bin_id`, containing the order of genomes.
- `chroms`: (optional) list of chromosomes to retain.
- `spacing`: spacing between chromosomes from the same genome (`bin_id`). This spacing works the same way as the `spacing` parameter of gggenomes: "between sequences in bases (>1) or relative to longest bin (<1)", which is actually relative to (longest bin) / sqrt(number of seq_ids). Default: 0.05
- `resize_polygons`: (logical) should polygons of short links be resized? Default: TRUE
- `resize_polygons_size`: if polygons are resized, to what fraction of the total length? Default: 0.003
- `min_polygon_feat_size`: minimum length of links to be resized. Default: 5000
- `no_polygons`: (logical) do not compute polygons. Default: FALSE, will compute polygons.
- `verbose`: (logical) if TRUE, returns some extra information for debugging. Default: FALSE

`parse_syri()` returns a list of data-frames:

- `$seqs`: contains sequence information, compatible with gggenomes
- `$links`: contains links between sequences, compatible with gggenomes
- `$polys`: contains polygons that can be plotted via `geom_polygon()`

# Running
In this example, `genomeA` is `col` and `genomeB` is `ler`.
The output from SyRi can be parsed using `parse_syri()` (in
`functions/parse_syri.R`).

If it is not installed yet, I recommend installing
[`gggenomes`](https://github.com/thackl/gggenomes).

``` r
renv::install(c("tidyverse", "thackl/gggenomes"))
```

`parse_syri()` builds on `tidyverse` and uses some special pipes from
`magrittr`.

``` r
library(tidyverse)
library(gggenomes)
library(magrittr)
source("functions/parse_syri.R") # Contains syri_plot_fills
```

## Data in
Data is read using `parse_syri()`.
``` r
dat <- parse_syri("data/col_on_ler.syri.out",
order = data.frame(bin_id = c("col","ler"))
)
```

## Plotting
After parsing the data, it is ready for plotting.
## gggenomes links
The parsed data can be used with `gggenomes` geoms, such as `geom_seq`,
`geom_bin`, `geom_link`, etc.

``` r
gggenomes::gggenomes(seqs = dat$seqs,
links = dat$links) +
geom_seq(linewidth = 1) +
geom_bin_label(size=7) +
geom_link(aes(fill = type),color = NA) +
syri_plot_fills +
ggtitle("Synteny between Col and Ler")
```
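Since `type` drives the fill colour in the plot above, it can help to check which SyRi annotation types are present and how much sequence each one covers before plotting. The sketch below uses base R on a small made-up table. The columns `seq_id`, `start`, `end`, `seq_id2`, `start2` and `end2` follow the gggenomes link layout and are an assumption about what `dat$links` contains besides `type`.

```r
# Toy stand-in for dat$links; all values are invented for illustration.
links <- data.frame(
  seq_id  = c("Chr1", "Chr1", "Chr1", "Chr3"),
  start   = c(1000, 60000, 90000, 500),
  end     = c(50000, 80000, 95000, 40500),
  seq_id2 = c("Chr1", "Chr1", "Chr1", "Chr3"),
  start2  = c(1200, 61000, 120000, 700),
  end2    = c(50200, 81000, 125000, 40700),
  type    = c("SYN", "INV", "TRANS", "SYN")
)

# Length of each link on the first genome, in bp
links$len <- links$end - links$start

# Number of links and total length (bp) per annotation type
n_per_type  <- table(links$type)
bp_per_type <- tapply(links$len, links$type, sum)

print(n_per_type)
print(bp_per_type)
```

With real data, replacing `links` by `dat$links` should give the same kind of overview, provided the assumed columns exist.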
## With polygons
`gggenomes::geom_link()` currently draws simple rectangles. An
alternative is to draw sequence relationships using polygons. These
polygons are computed during parsing (unless `no_polygons` is set to
`TRUE`) and returned in a dataframe in the `$polys` slot of the list.

``` r
gggenomes::gggenomes(seqs = dat$seqs,
links = dat$links) +
geom_polygon(
data = dat$polys %>% filter(direct) %>% filter(type == "SYN"),
aes(
x = x,
y = y,
fill = type,
group = link_grp
),
alpha = 0.6
) +
geom_polygon(
data = dat$polys %>% filter(direct) %>% filter(type != "SYN"),
aes(
x = x,
y = y,
fill = type,
group = link_grp
),
alpha = 0.8
) +
geom_seq(linewidth = 1) +
geom_bin_label(size=7) +
#geom_link() +
syri_plot_fills +
ggtitle("Synteny between Col and Ler")
```
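The same two `geom_polygon()` layers recur in most plots in this README, so it may be convenient to wrap them in a small function. The helper below is a sketch and not part of the repository: the name `plot_syri_polys()` and its arguments are made up here, and it assumes that `dat` is the list returned by `parse_syri()` and that `syri_plot_fills` has been loaded from `functions/parse_syri.R`.

```r
# Hypothetical helper (not part of the repository): wraps the polygon,
# sequence and bin-label layers used throughout this README.
# `dat`   : list returned by parse_syri() with $seqs, $links and $polys
# `fills` : fill scale, by default `syri_plot_fills` from parse_syri.R
# `...`   : passed on to gggenomes::gggenomes(), e.g. `spacing`
plot_syri_polys <- function(dat, title = NULL, fills = syri_plot_fills, ...) {
  # Keep only polygons flagged as `direct`; split syntenic from other types
  syn   <- subset(dat$polys, direct & type == "SYN")
  other <- subset(dat$polys, direct & type != "SYN")
  poly_aes <- ggplot2::aes(x = x, y = y, fill = type, group = link_grp)

  gggenomes::gggenomes(seqs = dat$seqs, links = dat$links, ...) +
    ggplot2::geom_polygon(data = syn, mapping = poly_aes, alpha = 0.6) +
    ggplot2::geom_polygon(data = other, mapping = poly_aes, alpha = 0.8) +
    gggenomes::geom_seq(linewidth = 1) +
    gggenomes::geom_bin_label(size = 7) +
    fills +
    ggplot2::ggtitle(title)
}
```

With this in place, the plot above reduces to `plot_syri_polys(dat, "Synteny between Col and Ler")`, and a non-default spacing can be forwarded with `plot_syri_polys(dat, "Synteny between Col and Ler", spacing = 5000000)`. Extra layers such as `geom_seq_label()` can still be added with `+`.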
## Sequence labels
Names of individual sequences can be added using, e.g.,
`gggenomes::geom_seq_label()`:

``` r
gggenomes::gggenomes(seqs = dat$seqs,
links = dat$links) +
geom_polygon(
data = dat$polys %>% filter(direct) %>% filter(type == "SYN"),
aes(
x = x,
y = y,
fill = type,
group = link_grp
),
alpha = 0.6
) +
geom_polygon(
data = dat$polys %>% filter(direct) %>% filter(type != "SYN"),
aes(
x = x,
y = y,
fill = type,
group = link_grp
),
alpha = 0.8
) +
geom_seq(linewidth = 1) +
geom_bin_label(size=7) +
geom_seq_label(nudge_y = 0.1, hjust = 0, size = 4) +
syri_plot_fills +
ggtitle("Synteny between Col and Ler")
```
In some cases, it might be preferable to change some labels, for example
to standardize them or to show only some of them. This can be done by
manipulating the table in `$seqs`. The easiest way is to add a new column
that contains the new labels; modifying `seq_id` directly is probably a
bad idea, as it can break the mapping between sequence names and links.
Below, a new `seqlab` column is created in which only the labels for
`col` are kept:

``` r
dat$seqs <- dat$seqs %>%
mutate(seqlab = case_when(bin_id != "col" ~ "",
TRUE ~ seq_id)
)
```

This can be used within `gggenomes::geom_seq_label()`:

``` r
gggenomes::gggenomes(seqs = dat$seqs,
links = dat$links) +
geom_polygon(
data = dat$polys %>% filter(direct) %>% filter(type == "SYN"),
aes(
x = x,
y = y,
fill = type,
group = link_grp
),
alpha = 0.6
) +
geom_polygon(
data = dat$polys %>% filter(direct) %>% filter(type != "SYN"),
aes(
x = x,
y = y,
fill = type,
group = link_grp
),
alpha = 0.8
) +
geom_seq(linewidth = 1) +
geom_bin_label(size=7) +
geom_seq_label(aes(label = seqlab), nudge_y = 0.1, hjust = 0, size = 4) +
syri_plot_fills +
ggtitle("Synteny between Col and Ler")
```
## Options
### Selecting chromosomes
Sometimes, only a subset of chromosomes is relevant. `parse_syri()`
expects chromosome names to be identical across genomes. If that is the
case, chromosomes can be selected with the `chroms` parameter:

``` r
dat <- parse_syri("data/col_on_ler.syri.out",
order = data.frame(bin_id = c("col","ler")),
chroms = c("Chr1","Chr3")
)
```

``` r
gggenomes::gggenomes(seqs = dat$seqs,
links = dat$links) +
geom_polygon(
data = dat$polys %>% filter(direct) %>% filter(type == "SYN"),
aes(
x = x,
y = y,
fill = type,
group = link_grp
),
alpha = 0.6
) +
geom_polygon(
data = dat$polys %>% filter(direct) %>% filter(type != "SYN"),
aes(
x = x,
y = y,
fill = type,
group = link_grp
),
alpha = 0.8
) +
geom_seq(linewidth = 1) +
geom_bin_label(size=7) +
#geom_link() +
syri_plot_fills +
ggtitle("Synteny between Col and Ler Chromosomes 1 and 3")
```
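Because `chroms` relies on chromosome names being identical in both genomes, it can be worth checking a SyRi file before parsing it. The sketch below is plain base R and independent of `parse_syri()`. It assumes the standard tab-separated `syri.out` layout with 12 columns, where column 1 holds the reference chromosome and column 6 the query chromosome, and it reads a few made-up lines instead of a real file. The column names are chosen here for readability only.

```r
# A few made-up lines in syri.out layout (12 tab-separated columns)
syri_txt <- paste(
  "Chr1\t100\t5000\t-\t-\tChr1\t120\t5020\tSYN1\t-\tSYN\t-",
  "Chr1\t6000\t9000\t-\t-\tChr1\t9500\t6500\tINV2\t-\tINV\t-",
  "Chr3\t200\t4200\t-\t-\tChr3\t180\t4180\tSYN3\t-\tSYN\t-",
  sep = "\n"
)

syri <- read.table(
  text = syri_txt, sep = "\t", header = FALSE, stringsAsFactors = FALSE,
  quote = "", comment.char = "",
  col.names = c("ref_chr", "ref_start", "ref_end", "ref_seq", "qry_seq",
                "qry_chr", "qry_start", "qry_end", "id", "parent_id",
                "type", "copy_status")
)

# Unaligned regions carry "-" instead of a chromosome name, so drop those
ref_chroms <- setdiff(unique(syri$ref_chr), "-")
qry_chroms <- setdiff(unique(syri$qry_chr), "-")

# Names that occur in only one of the two genomes
only_ref <- setdiff(ref_chroms, qry_chroms)
only_qry <- setdiff(qry_chroms, ref_chroms)

if (length(only_ref) > 0 || length(only_qry) > 0) {
  warning("Chromosome names differ between reference and query")
}
```

To check a real file, the `text = syri_txt` argument can be replaced by the file path, for example `"data/col_on_ler.syri.out"`.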
### Spacing
Sometimes, the default spacing between chromosomes may not be optimal.
`parse_syri()` follows the spacing rules of gggenomes: if `spacing` is
\< 1, it is interpreted relative to (longest bin) / sqrt(number of
sequences); if it is \>= 1, it is interpreted as base pairs. The default
is 0.05 (as for gggenomes).

#### In basepairs
``` r
dat <- parse_syri("data/col_on_ler.syri.out",
order = data.frame(bin_id = c("col","ler")),
spacing = 5000000 # spacing in bp
)
```

Of course, if the spacing was changed in `parse_syri()`, the same value
also needs to be passed to `gggenomes()`:

``` r
gggenomes::gggenomes(seqs = dat$seqs,
links = dat$links,
spacing = 5000000) +
geom_polygon(
data = dat$polys %>% filter(direct) %>% filter(type == "SYN"),
aes(
x = x,
y = y,
fill = type,
group = link_grp
),
alpha = 0.6
) +
geom_polygon(
data = dat$polys %>% filter(direct) %>% filter(type != "SYN"),
aes(
x = x,
y = y,
fill = type,
group = link_grp
),
alpha = 0.8
) +
geom_seq(linewidth = 1) +
geom_bin_label(size=7) +
syri_plot_fills +
ggtitle("Synteny between Col - Ler with 5 Mb spacing between chromosomes")
```
#### Relative
4 times the standard spacing:
``` r
dat <- parse_syri("data/col_on_ler.syri.out",
order = data.frame(bin_id = c("col","ler")),
spacing = 0.2 # relative spacing
)
```

``` r
gggenomes::gggenomes(seqs = dat$seqs,
links = dat$links,
spacing = 0.2) +
geom_polygon(
data = dat$polys %>% filter(direct) %>% filter(type == "SYN"),
aes(
x = x,
y = y,
fill = type,
group = link_grp
),
alpha = 0.6
) +
geom_polygon(
data = dat$polys %>% filter(direct) %>% filter(type != "SYN"),
aes(
x = x,
y = y,
fill = type,
group = link_grp
),
alpha = 0.8
) +
geom_seq(linewidth = 1) +
geom_bin_label(size=7) +
syri_plot_fills +
ggtitle("Synteny between Col - Ler, spacing increased 4x")
```
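To get a feeling for what a relative spacing means in base pairs, the rule described at the start of this section can be written down as a small function. This is only a sketch of the rule as stated in this README, not code taken from `parse_syri()` or gggenomes. The helper name `spacing_in_bp()` is made up, and the length of the longest bin and the number of sequences have to be supplied by the caller.

```r
# Hypothetical helper: converts a `spacing` value into base pairs.
# `longest_bin` is the length of the longest bin in bp,
# `n_seqs` the number of sequences; both are supplied by the caller.
spacing_in_bp <- function(spacing, longest_bin, n_seqs) {
  if (spacing >= 1) {
    return(spacing)                     # already given in base pairs
  }
  spacing * longest_bin / sqrt(n_seqs)  # relative spacing
}

# Made-up example: longest bin of 100 Mb, 4 sequences
abs_bp     <- spacing_in_bp(5000000, longest_bin = 1e8, n_seqs = 4)
default_bp <- spacing_in_bp(0.05, longest_bin = 1e8, n_seqs = 4)
wide_bp    <- spacing_in_bp(0.2, longest_bin = 1e8, n_seqs = 4)

print(c(absolute = abs_bp, default = default_bp, wide = wide_bp))
```

In this toy case the default of 0.05 corresponds to 0.05 * 1e8 / 2 = 2.5 Mb, and 0.2 gives four times that gap.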
### No resizing
By default, short syntenic regions larger than 5000 bp are resized to
make them visible. Since this does not reflect the original input,
resizing can be disabled:

``` r
dat <- parse_syri("data/col_on_ler.syri.out",
                  order = data.frame(bin_id = c("col","ler")),
                  resize_polygons = FALSE)
```

``` r
gggenomes::gggenomes(seqs = dat$seqs,
links = dat$links) +
geom_polygon(
data = dat$polys %>% filter(direct) %>% filter(type == "SYN"),
aes(
x = x,
y = y,
fill = type,
group = link_grp
),
alpha = 0.6
) +
geom_polygon(
data = dat$polys %>% filter(direct) %>% filter(type != "SYN"),
aes(
x = x,
y = y,
fill = type,
group = link_grp
),
alpha = 0.8
) +
geom_seq(linewidth = 1) +
geom_bin_label(size=7) +
syri_plot_fills +
ggtitle("Synteny between Col and Ler without resizing")
```
### Minimum resize size
Only regions larger than `min_polygon_feat_size` (default: 5000) are
resized. This threshold can be lowered to also include smaller regions:

``` r
dat <- parse_syri("data/col_on_ler.syri.out",
                  order = data.frame(bin_id = c("col","ler")),
                  resize_polygons = TRUE,
                  min_polygon_feat_size = 1000)
```

Naturally, this will create a busier plot.

``` r
gggenomes::gggenomes(seqs = dat$seqs,
links = dat$links) +
geom_polygon(
data = dat$polys %>% filter(direct) %>% filter(type == "SYN"),
aes(
x = x,
y = y,
fill = type,
group = link_grp
),
alpha = 0.6
) +
geom_polygon(
data = dat$polys %>% filter(direct) %>% filter(type != "SYN"),
aes(
x = x,
y = y,
fill = type,
group = link_grp
),
alpha = 0.8
) +
geom_seq(linewidth = 1) +
geom_bin_label(size=7) +
syri_plot_fills +
ggtitle("Synteny between Col and Ler, resizing regions larger than 999bp")
```
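As a rough illustration of the threshold, the sketch below marks which links would be candidates for resizing under two values of `min_polygon_feat_size`. It is not the code used inside `parse_syri()`: the link lengths are made up, the helper `is_resize_candidate()` is hypothetical, and treating the threshold as inclusive (`>=`) is an assumption.

```r
# Made-up lengths of three short links, in bp
link_len <- c(a = 800, b = 1500, c = 6000)

# Hypothetical helper: TRUE for links that reach the minimum size
is_resize_candidate <- function(len, min_polygon_feat_size = 5000) {
  len >= min_polygon_feat_size
}

# Candidates with the default threshold and with a lowered threshold
default_candidates <- names(link_len)[is_resize_candidate(link_len)]
lowered_candidates <- names(link_len)[is_resize_candidate(link_len, 1000)]

print(default_candidates)
print(lowered_candidates)
```

Lowering the threshold from 5000 to 1000 adds link `b` to the candidates, which is why the plot gets busier.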
### Resize output size
Regions are resized to have a certain length relative to the chromosome,
controlled by `resize_polygons_size`, which defaults to `0.003` (0.3%)
of the chromosome length. Changing this parameter will make resized
regions larger or smaller.

``` r
dat <- parse_syri("data/col_on_ler.syri.out",
                  order = data.frame(bin_id = c("col","ler")),
                  resize_polygons = TRUE,
                  resize_polygons_size = 0.01)
```

This will produce wider polygons for resized links.

``` r
gggenomes::gggenomes(seqs = dat$seqs,
links = dat$links) +
geom_polygon(
data = dat$polys %>% filter(direct) %>% filter(type == "SYN"),
aes(
x = x,
y = y,
fill = type,
group = link_grp
),
alpha = 0.6
) +
geom_polygon(
data = dat$polys %>% filter(direct) %>% filter(type != "SYN"),
aes(
x = x,
y = y,
fill = type,
group = link_grp
),
alpha = 0.8
) +
geom_seq(linewidth = 1) +
geom_bin_label(size=7) +
syri_plot_fills +
ggtitle("Synteny between Col and Ler")
```
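The effect of `resize_polygons_size` is easiest to see as plain arithmetic. The sketch below takes "relative to the chromosome" literally and multiplies a chromosome length by the fraction. The chromosome length is made up and `resized_width()` is a hypothetical helper, so the numbers only illustrate the order of magnitude.

```r
# Hypothetical helper: width (in bp) that a resized link would get
resized_width <- function(chrom_length, resize_polygons_size = 0.003) {
  chrom_length * resize_polygons_size
}

chrom_length <- 30000000  # made-up chromosome length of 30 Mb

default_width <- resized_width(chrom_length)        # 0.3% of the chromosome
wider_width   <- resized_width(chrom_length, 0.01)  # 1% of the chromosome

print(c(default = default_width, wider = wider_width))
```

For a 30 Mb chromosome this corresponds to 90 kb with the default and 300 kb with `resize_polygons_size = 0.01`.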
# Multiple genomes
Comparing two genomes is nice, but more might be better.
`parse_syri()` can handle multiple inputs in one go when those are
provided as a list:

``` r
file_list <- list.files("data", full.names = TRUE)
syri_order <- data.frame(bin_id = c("col", "ler", "cvi", "eri"))
dat <- parse_syri(file_list, order = syri_order)
```

Making a plot from this works the same way as making a plot of only one
comparison. The order of sequences is set via the `order` argument to
`parse_syri()`.

``` r
gggenomes::gggenomes(seqs = dat$seqs,
links = dat$links) +
geom_polygon(
data = dat$polys %>% filter(direct) %>% filter(type == "SYN"),
aes(
x = x,
y = y,
fill = type,
group = link_grp
),
alpha = 0.6
) +
geom_polygon(
data = dat$polys %>% filter(direct) %>% filter(type != "SYN"),
aes(
x = x,
y = y,
fill = type,
group = link_grp
),
alpha = 0.8
) +
geom_seq(linewidth = 1) +
geom_bin_label(size=7) +
syri_plot_fills +
ggtitle("Synteny between Col - Ler - Cvi - Eri")
```
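With several input files, a mismatch between file names and `order` is easy to overlook. The base R sketch below checks a file list against the `A_on_B.syri.out` naming scheme and against the `bin_id` column. The file names are hypothetical stand-ins for the contents of `data/`, and the check is independent of whatever `parse_syri()` does internally.

```r
# Hypothetical file names following the A_on_B.syri.out scheme
file_list <- c("data/col_on_ler.syri.out",
               "data/ler_on_cvi.syri.out",
               "data/cvi_on_eri.syri.out")
syri_order <- data.frame(bin_id = c("col", "ler", "cvi", "eri"))

# Files that do not match the expected naming scheme
bad_names <- file_list[!grepl("^.+_on_.+\\.syri\\.out$", basename(file_list))]

# Strip directory and suffix, then split "A_on_B" into its two genome names
pairs <- sub("\\.syri\\.out$", "", basename(file_list))
genome_pairs <- do.call(rbind, strsplit(pairs, "_on_", fixed = TRUE))
colnames(genome_pairs) <- c("genomeA", "genomeB")

# Genomes named in the files but missing from `order`, and vice versa
genomes_in_files <- unique(c(genome_pairs))
missing_in_order <- setdiff(genomes_in_files, syri_order$bin_id)
unused_in_order  <- setdiff(syri_order$bin_id, genomes_in_files)

print(genome_pairs)
```

If `bad_names`, `missing_in_order` and `unused_in_order` are all empty, the file list and the `order` table at least agree on which genomes are involved.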
# Keeping chromosomes separate
By default, `parse_syri()` takes all chromosomes from the same genome
(`bin_id`) and puts them on one axis, adding space between them as
needed (see [spacing](#Spacing)).

Sometimes, it might be useful to have each chromosome on its own
coordinate system instead. This can be done by making use of the
[`chroms`](#Selecting-chromosomes) argument to read each chromosome
individually and then combining the results. Below is an example for the
col-ler-cvi-eri data used above and included in `data/`.

``` r
file_list <- list.files("data", full.names = TRUE)
syri_order <- data.frame(bin_id = c("col", "ler", "cvi", "eri"))
chromosomes <- c("Chr1", "Chr2", "Chr3", "Chr4", "Chr5")
# Parse each chromosome separately
dat_tmp <- lapply(chromosomes,
                  \(chrom) parse_syri(file_list, order = syri_order, chroms = chrom))
# Collect the combined tables in a fresh list
dat <- list()
# Bind sequences
dat$seqs <- lapply(seq_along(chromosomes), \(l) pluck(dat_tmp, l, "seqs")) %>%
  bind_rows()
# Create y coordinates for sequences
seq_pos <- left_join(dat$seqs, syri_order %>%
mutate(y = rev(1:length(bin_id))),
by = join_by(bin_id))
# Bind links
dat$links <- lapply(1:length(chromosomes), \(l) pluck(dat_tmp, l, "links")) %>%
bind_rows()
# Bind polygons
dat$polys <- lapply(1:length(chromosomes), \(l) pluck(dat_tmp, l, "polys")) %>%
bind_rows()
# Add seq_id column to polygons, only keep polygons that connect the same chromosome
dat$polys <- dat$polys %>%
mutate(
Chr_grp1 = str_extract_all(link, "Chr[0-9]*", simplify = TRUE)[, 1],
Chr_grp2 = str_extract_all(link, "Chr[0-9]*", simplify = TRUE)[, 2]
) %>%
filter(Chr_grp1 == Chr_grp2) %>%
mutate(seq_id = Chr_grp1)
```

Note that `geom_segment()` should be used to draw chromosomes, since
`geom_seq()` would again place the chromosomes onto a single axis.

``` r
gggenomes::gggenomes(seqs = dat$seqs,
links = dat$links) +
geom_polygon(
data = dat$polys %>% filter(direct) %>% filter(type == "SYN"),
aes(
x = x,
y = y,
fill = type,
group = link_grp
),
alpha = 0.6
) +
geom_polygon(
data = dat$polys %>% filter(direct) %>% filter(type != "SYN"),
aes(
x = x,
y = y,
fill = type,
group = link_grp
),
alpha = 0.8
) +
geom_segment(aes(x = 0, xend = length, y=y, yend=y), data = seq_pos) +
geom_bin_label(size=7,
# Avoid overly long extension of x to the left
expand_left = 1e-2,
nudge_left = 5e-3) +
facet_wrap(~seq_id, ncol = 1, scales = "free_x") +
syri_plot_fills +
theme(strip.background = element_rect(fill = "white")) +
ggtitle("Synteny between Col - Ler - Cvi - Eri")
```
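The `seq_pos` table used by `geom_segment()` above carries one y coordinate per genome, derived from the order of `bin_id`. The base R sketch below reproduces that step on a made-up `seqs` table to show which y value each genome receives. It mirrors the `mutate()` and `left_join()` calls from the example, with `merge()` swapped in for `left_join()`, and the sequence lengths are invented.

```r
syri_order <- data.frame(bin_id = c("col", "ler", "cvi", "eri"))

# Same idea as the mutate() call in the example, written in base R:
# the first genome in `order` gets the largest y value, the last gets 1
syri_order$y <- rev(seq_along(syri_order$bin_id))

# Toy stand-in for dat$seqs with made-up lengths
seqs <- data.frame(
  bin_id = c("col", "ler", "cvi", "eri"),
  seq_id = "Chr1",
  length = c(30000000, 29000000, 31000000, 30000000)
)

# Base R counterpart of the left_join(): one y coordinate per sequence
seq_pos <- merge(seqs, syri_order, by = "bin_id", sort = FALSE)

print(seq_pos)
```

Here `col` ends up at y = 4 and `eri` at y = 1, so each chromosome segment is drawn at the y value of its genome.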
# Contributing
If you encounter any problems, please open an issue.
If you have suggestions for improvement, please open a pull request.