https://github.com/mbhall88/lrge

Genome size estimation from long read overlaps
https://github.com/mbhall88/lrge
bioinformatics estimate genome-size genomics library overlap size
Last synced: 3 months ago
JSON representation
Genome size estimation from long read overlaps
Host: GitHub
URL: https://github.com/mbhall88/lrge
Owner: mbhall88
License: mit
Created: 2024-10-21T02:54:15.000Z (9 months ago)
Default Branch: main
Last Pushed: 2024-12-10T06:18:57.000Z (7 months ago)
Last Synced: 2025-03-29T04:29:13.442Z (4 months ago)
Topics: bioinformatics, estimate, genome-size, genomics, library, overlap, size
Language: Rust
Homepage: https://doi.org/10.1101/2024.11.27.625777
Size: 91.7 MB
Stars: 51
Watchers: 4
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project

README

        # LRGE

[![check](https://github.com/mbhall88/lrge/actions/workflows/check.yml/badge.svg)](https://github.com/mbhall88/lrge/actions/workflows/check.yml)

[![test](https://github.com/mbhall88/lrge/actions/workflows/test.yml/badge.svg)](https://github.com/mbhall88/lrge/actions/workflows/test.yml)

[![DOI:10.1101/2024.11.27.625777](https://img.shields.io/badge/citation-10.1101/2024.11.27.625777-blue)][doi]

**L**ong **R**ead-based **G**enome size **E**stimation from overlaps

LRGE (pronounced "large") is a command line tool for estimating genome size from long read overlaps. The tool is built 

on top of the [`liblrge`][liblrge] Rust library, which is also available as a standalone library for use in other projects.

> Hall, M. B.; Coin, L. J. M. Genome Size Estimation from Long Read Overlaps. bioRxiv 2024, 2024.11.27.625777. doi:[10.1101/2024.11.27.625777][doi].

## Table of Contents

- [Installation](#installation)

- [Usage](#usage)

- [Method](#method)

- [Results](#results)

- [Benchmark](#benchmark)

- [Alternatives](#alternatives)

- [Citation](#citation)

 

## Installation

- [Precompiled binary](#precompiled-binary)

- [Conda](#conda)

- [Cargo](#cargo)

- [Container](#container)

  - [Apptainer](#apptainer)

  - [Docker](#docker)

- [Build from source](#build-from-source)

### Precompiled binary

![GitHub Downloads (all assets, all releases)](https://img.shields.io/github/downloads/mbhall88/lrge/total)

![GitHub Release](https://img.shields.io/github/v/release/mbhall88/lrge)

```shell

curl -sSL lrge.mbh.sh | sh

# or with wget

wget -nv -O - lrge.mbh.sh | sh

```

You can also pass options to the script like so

```

$ curl -sSL lrge.mbh.sh | sh -s -- --help

install.sh [option]

Fetch and install the latest version of lrge, if lrge is already

installed it will be updated to the latest version.

Options

        -V, --verbose

                Enable verbose output for the installer

        -f, -y, --force, --yes

                Skip the confirmation prompt during installation

        -p, --platform

                Override the platform identified by the installer [default: apple-darwin]

        -b, --bin-dir

                Override the bin installation directory [default: /usr/local/bin]

        -a, --arch

                Override the architecture identified by the installer [default: aarch64]

        -B, --base-url

                Override the base URL used for downloading releases [default: https://github.com/mbhall88/lrge/releases]

        -h, --help

                Display this help message

```

### Conda

![Conda Version](https://img.shields.io/conda/vn/bioconda/lrge)

![Conda Platform](https://img.shields.io/conda/pn/bioconda/lrge)

![Conda Downloads](https://img.shields.io/conda/dn/bioconda/lrge)

```sh

conda install -c bioconda lrge

```

### Cargo

![Crates.io Version](https://img.shields.io/crates/v/lrge)

![Crates.io Total Downloads](https://img.shields.io/crates/d/lrge)

```sh

cargo install lrge

```

### Container

Docker images are hosted on the GitHub Container registry.

#### Apptainer

Prerequisite: [`apptainer`][apptainer] (previously Singularity)

```shell

$ URI="docker://ghcr.io/mbhall88/lrge:latest"

$ apptainer exec "$URI" lrge --help

```

The above will use the latest version. If you want to specify a version then use a

[tag][ghcr] like so.

```shell

$ VERSION="0.1.3"

$ URI="docker://ghcr.io/mbhall88/lrge:${VERSION}"

```

#### Docker

Prerequisite: [`docker`][docker]

```shell

$ docker pull ghcr.io/mbhall88/lrge:latest

$ docker run ghcr.io/mbhall88/lrge:latest lrge --help

```

You can find all the available tags [here][ghcr].

### Build from source

```shell

$ git clone https://github.com/mbhall88/lrge.git

$ cd lrge

$ cargo build --release

$ target/release/lrge -h

```

---

## Usage

> [!IMPORTANT]  

> The default values were calibrated from bacterial genomes, so you may need to adjust them if you are working with larger

genomes. See below for more details.

Estimate the genome size of a set of *Mycobacterium tuberculosis* ONT [reads](https://www.ebi.ac.uk/ena/browser/view/SRR28370649) 

([true genome size](https://www.ebi.ac.uk/ena/browser/view/CP149484): 4.40 Mbp / 4405449 bp).

```

$ wget -O reads.fq.gz "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR283/049/SRR28370649/SRR28370649_1.fastq.gz"

$ lrge -t 8 reads.fq.gz

[2024-11-22T03:49:53Z INFO  lrge] Running two-set strategy with 10000 target reads and 5000 query reads

[2024-11-22T03:50:10Z INFO  lrge] Estimated genome size: 4.43 Mbp (IQR: 3.16 Mbp - 4.99 Mbp)

4426642

[2024-11-22T03:50:10Z INFO  lrge] Done!

```

The size estimate is printed to stdout, but you can also save it to a file with the `-o` flag.

```

$ lrge -t 8 reads.fq.gz -o size.txt

[2024-11-22T03:49:53Z INFO  lrge] Running two-set strategy with 10000 target reads and 5000 query reads

[2024-11-22T03:50:10Z INFO  lrge] Estimated genome size: 4.43 Mbp (IQR: 3.16 Mbp - 4.99 Mbp)

[2024-11-22T03:50:10Z INFO  lrge] Done!

$ cat size.txt

4426642

```

By default, LRGE uses the [two-set strategy](#two-set-strategy) with 10,000 target reads (`-T`) and 5,000 query reads 

(`-Q`). You can use the [all-vs-all strategy](#all-vs-all-strategy) by specifying the number of reads to use with the `-n` flag.

In [the paper][doi], we ran LRGE on three eukaoryotic genomes: *Arabidopsis thaliana* (125 Mbp), *Drosophila melanogaster* 

(143 Mbp), and *Saccharomyces cerevisiae* (12 Mbp). We used 50,000 query and 100,000 target reads for *A. thaliana* and 

*D. melanogaster*, and 10,000 query and 20,000 target reads for *S. cerevisiae*.

### Library

You can also use the `liblrge` library in your Rust projects. This allows you to estimate genome size within your own 

applications - without needing to call out to `lrge`. For more details on how to use the library, see the [documentation](https://www.docs.rs/liblrge) or the 

[source code](./liblrge).

### Standard options

```

$ lrge -h

Genome size estimation from long read overlaps

Usage: lrge [OPTIONS] 

Arguments:

    Input FASTQ file

Options:

  -o, --output       Output file for the estimate [default: -]

  -T, --target          Target number of reads to use (for two-set strategy; default) [default: 10000]

  -Q, --query           Query number of reads to use (for two-set strategy; default) [default: 5000]

  -n, --num             Number of reads to use (for all-vs-all strategy)

  -P, --platform   Sequencing platform of the reads [default: ont] [possible values: ont, pb]

  -t, --threads         Number of threads to use [default: 1]

  -C, --keep-temp            Don't clean up temporary files

  -D, --temp            Temporary directory for storing intermediate files

  -s, --seed            Random seed to use - making the estimate repeatable

  -q, --quiet...             `-q` only show errors and warnings. `-qq` only show errors. `-qqq` shows nothing

  -v, --verbose...           `-v` show debug output. `-vv` show trace output

  -h, --help                 Print help (see more with '--help')

  -V, --version              Print version

```

### Full usage

Estimate genome size of PacBio reads

```

$ lrge -P pb -t 8 reads.fq

```

Don't remove the intermidiate read and overlap files

```

$ lrge -C reads.fq

```

Use the [all-vs-all strategy](#all-vs-all-strategy) with 10,000 reads

```

$ lrge -n 10000 reads.fq

```

Fix the seed so that subsequent runs return the same size estimate

```

$ lrge -s 123 reads.fq

```

By default, we take the median of the *finite* estimates to get the final genome size estimate. If you want to include 

infinite estimates in the calculation

```

$ lrge -8 reads.fq

```

If you don't want the estimate to be rounded to the nearest integer 🤓

```

$ lrge --float-my-boat reads.fq

```

In [the paper][doi], we suggest using the 15th and 65th percentiles of the estimates to get a ~92% confidence interval. 

However, you can change these

```

$ lrge --q1 0.25 --q3 0.75 reads.fq

```

If you want to see the estimate for each read, turn on trace level logging

```

$ lrge -vv reads.fq

```

By default, the intermediate files are stored in a temporary directory. You can specify a different temporary 

directory

```

$ lrge -D ./mytemp/ reads.fq

```

If you have Illumina data, try GenomeScope2 or Mash (see [alternatives](#alternatives) for more details).

---

```

$ lrge --help

Genome size estimation from long read overlaps

Usage: lrge [OPTIONS] 

Arguments:

  

          Input FASTQ file

Options:

  -o, --output 

          Output file for the estimate

          [default: -]

  -T, --target 

          Target number of reads to use (for two-set strategy; default)

          [default: 10000]

  -Q, --query 

          Query number of reads to use (for two-set strategy; default)

          [default: 5000]

  -n, --num 

          Number of reads to use (for all-vs-all strategy)

  -P, --platform 

          Sequencing platform of the reads

          [default: ont]

          [possible values: ont, pb]

  -t, --threads 

          Number of threads to use

          [default: 1]

  -C, --keep-temp

          Don't clean up temporary files

  -D, --temp 

          Temporary directory for storing intermediate files

  -s, --seed 

          Random seed to use - making the estimate repeatable

  -8, --inf

          Take the estimate as the median of all estimates, *including infinite estimates*

  -f, --float-my-boat

          I neeeeeed that precision! Output the estimate as a floating point number

      --q1 

          The lower quantile to use for the estimate

          [default: 0.15]

      --q3 

          The upper quantile to use for the estimate

          [default: 0.65]

  -q, --quiet...

          `-q` only show errors and warnings. `-qq` only show errors. `-qqq` shows nothing

  -v, --verbose...

          `-v` show debug output. `-vv` show trace output

  -h, --help

          Print help (see a summary with '-h')

  -V, --version

          Print version

```

## Method

For a full description of the method, see the [paper][doi].

### Two-set strategy

The two-set strategy is the default method used by LRGE. It involves randomly selecting a two distinct subsets of reads 

from the input. One subset is deemed the target set ($T$) and the other the query set ($Q$). Each read $q_i$ in $Q$ is overlapped 

against $T$ and a genome size ($\textbf{GS}$) estimate is generated for that read ($\textbf{GS}_{T,q_i}$). The estimate is calculated based on 

the number of overlaps of $q_i$ with reads in $T$ ($\lvert \textbf{ov}(T \setminus q_i,q_i \rvert$), according to the formula:

```math

\textbf{GS}_{T,q_i} \approx \lvert T \setminus q_i \rvert \frac{\ell_{q_i} + \overline{\ell}_{T \setminus q_i} - 2 \cdot \textbf{OT}}{\lvert \textbf{ov}(T \setminus q_i,q_i) \rvert}

```

where $\vert T \setminus q_i \vert$ is the total size of the target set minus the read $q_i$, $\ell_{q_i}$ is the length of read $q_i$, $\overline{\ell}_{T \setminus q_i}$ is 

the average length of reads in $T$ minus $q_i$, and $\textbf{OT}$ is the overlap threshold (minimum chain score in minimap2, which 

defaults to 100 for overlaps). See [the paper][doi] for more formal/rigorous definitions.

Ultimately, the genome size estimate is the median of the finite estimates for each read in $Q$.

We use this strategy as the default as it is the most computationally efficient and the accuracy is comparable to the 

all-vs-all strategy. We suggest a smaller number of query reads than target reads, as this will speed things up and as 

we take the median of the estimates, the number of query reads (over a certain point) should not affect the accuracy of 

the estimate all that much.

### All-vs-all strategy

The all-vs-all strategy involves overlapping some random subset (`-n`) of reads in the input against each other. The 

genome size estimate for each read is calculated as above.

This strategy is *generally* more computationally expensive than the two-set strategy, but it can be more accurate. Though 

we did not find the difference to be statistically significant in our tests.

## Results

We compared LRGE to three other methods: GenomeScope2, Mash, and Raven ([see below](#alternatives) for more info). We ran 

each method on 3370 read sets from PacBio or ONT data. Each of these samples is associated with a RefSeq assembly, so the 

true size was taken as the size of the RefSeq assembly. You can find the metadata for the samples [here](./paper/config/bacteria_lr_runs.filtered.tsv).

The full results are available in the [paper][doi] and [here](./paper/results/estimates/estimates.tsv). Here is a brief summary of how LRGE compares to other methods.

![Results](./paper/results/figures/method_absolute_relative_error.png)

This compares the absolute relative error as a percentage. The relative error ($\epsilon_{\text{rel}}$) is calculated as:

```math

    \epsilon_{\text{rel}} = \frac{\hat{G} - G}{G} \cdot 100

```

where $G$ is the true genome size, and $\hat{G}$ is the estimated genome size. For example, a $\epsilon_{\text{rel}}$ of 50% 

is out (higher or lower) by 50% of the true genome size. So if the true genome size is 1 Mbp, a $\epsilon_{\text{rel}}$ of 50% 

would be 1.5 Mbp or 0.5 Mbp. 

The following figure shows the (non-absolute) relative error for the same methods to give an 

indication of which methods tend to over or underestimate.

![Results](./paper/results/figures/platform_relative_error.png)

## Benchmark

For the full details of the methods benchmarked, see the [paper][doi]. However, here is a brief summary of the results.

![Benchmark](./paper/results/figures/method_cpu_memory.png)

The statistical annotations above the violins are coloured by the method which has the lowest mean value for the given 

metric.

## Alternatives

The methods we compare against are:

[GenomeScope2](https://github.com/tbenavi1/genomescope2.0): to get estimates from GenomeScope2, you need to first generate 

a k-mer spectrum. We used [KMC](https://github.com/refresh-bio/KMC) for this. You can find a Python script that takes 

reads, generates a k-mer spectrum, and estimates genome size in [`genomescope.py`](./paper/workflow/scripts/genomescope.py). The list of parameters used 

can also be found in the [workflow config](./paper/config/config.yaml).

[Mash](https://github.com/marbl/Mash): we used `mash sketch` on the reads, which prints out the estimated genome size in 

the logging output. You can find the options used in the [workflow config](./paper/config/config.yaml).

[Raven](https://github.com/lbcb-sci/raven): Raven essentially just assembles the reads - *REALLLLY* fast 🚀

You can find the full details of how we compared methods in the [workflow](./paper/workflow/rules/estimate.smk).

## Citation

If you use LRGE in your research, please cite the following [paper][doi]:

```bibtex

@article{hall_genome_2024,

	title = {Genome size estimation from long read overlaps},

	url = {https://biorxiv.org/content/early/2024/12/02/2024.11.27.625777.abstract},

	doi = {10.1101/2024.11.27.625777},

	journal = {bioRxiv},

	author = {Hall, Michael B and Coin, Lachlan J M},

	month = jan,

	year = {2024},

	pages = {2024.11.27.625777},

}

```

[apptainer]: https://github.com/apptainer/apptainer

[docker]: https://docs.docker.com/

[doi]: https://doi.org/10.1101/2024.11.27.625777

[ghcr]: https://github.com/mbhall88/lrge/pkgs/container/lrge

[liblrge]: https://www.docs.rs/liblrge

[quay.io]: https://quay.io/repository/mbhall88/lrge
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/mbhall88/lrge

Awesome Lists containing this project

README