{"id":22519540,"url":"https://github.com/mbhall88/lrge","last_synced_at":"2025-04-06T05:15:59.001Z","repository":{"id":264584639,"uuid":"875878949","full_name":"mbhall88/lrge","owner":"mbhall88","description":"Genome size estimation from long read overlaps","archived":false,"fork":false,"pushed_at":"2024-12-10T06:18:57.000Z","size":96206,"stargazers_count":51,"open_issues_count":0,"forks_count":2,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-03-29T04:29:13.442Z","etag":null,"topics":["bioinformatics","estimate","genome-size","genomics","library","overlap","size"],"latest_commit_sha":null,"homepage":"https://doi.org/10.1101/2024.11.27.625777","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mbhall88.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-21T02:54:15.000Z","updated_at":"2025-03-20T01:14:24.000Z","dependencies_parsed_at":null,"dependency_job_id":"b3349b25-948e-46fb-aadd-667261cbcf4a","html_url":"https://github.com/mbhall88/lrge","commit_stats":null,"previous_names":["mbhall88/lrge"],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mbhall88%2Flrge","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mbhall88%2Flrge/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mbhall88%2Flrge/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mbhall88%2Flrge/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/owners/mbhall88","download_url":"https://codeload.github.com/mbhall88/lrge/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247436285,"owners_count":20938533,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","estimate","genome-size","genomics","library","overlap","size"],"created_at":"2024-12-07T04:20:59.923Z","updated_at":"2025-04-06T05:15:58.916Z","avatar_url":"https://github.com/mbhall88.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LRGE\n\n[![check](https://github.com/mbhall88/lrge/actions/workflows/check.yml/badge.svg)](https://github.com/mbhall88/lrge/actions/workflows/check.yml)\n[![test](https://github.com/mbhall88/lrge/actions/workflows/test.yml/badge.svg)](https://github.com/mbhall88/lrge/actions/workflows/test.yml)\n[![DOI:10.1101/2024.11.27.625777](https://img.shields.io/badge/citation-10.1101/2024.11.27.625777-blue)][doi]\n\n**L**ong **R**ead-based **G**enome size **E**stimation from overlaps\n\nLRGE (pronounced \"large\") is a command line tool for estimating genome size from long read overlaps. The tool is built \non top of the [`liblrge`][liblrge] Rust library, which is also available as a standalone library for use in other projects.\n\n\u003e Hall, M. B.; Coin, L. J. M. Genome Size Estimation from Long Read Overlaps. bioRxiv 2024, 2024.11.27.625777. 
doi:[10.1101/2024.11.27.625777][doi].\n\n## Table of Contents\n\n- [Installation](#installation)\n- [Usage](#usage)\n- [Method](#method)\n- [Results](#results)\n- [Benchmark](#benchmark)\n- [Alternatives](#alternatives)\n- [Citation](#citation)\n \n\n## Installation\n\n- [Precompiled binary](#precompiled-binary)\n- [Conda](#conda)\n- [Cargo](#cargo)\n- [Container](#container)\n  - [Apptainer](#apptainer)\n  - [Docker](#docker)\n- [Build from source](#build-from-source)\n\n### Precompiled binary\n\n![GitHub Downloads (all assets, all releases)](https://img.shields.io/github/downloads/mbhall88/lrge/total)\n![GitHub Release](https://img.shields.io/github/v/release/mbhall88/lrge)\n\n```shell\ncurl -sSL lrge.mbh.sh | sh\n# or with wget\nwget -nv -O - lrge.mbh.sh | sh\n```\n\nYou can also pass options to the script like so\n\n```\n$ curl -sSL lrge.mbh.sh | sh -s -- --help\ninstall.sh [option]\n\nFetch and install the latest version of lrge, if lrge is already\ninstalled it will be updated to the latest version.\n\nOptions\n        -V, --verbose\n                Enable verbose output for the installer\n\n        -f, -y, --force, --yes\n                Skip the confirmation prompt during installation\n\n        -p, --platform\n                Override the platform identified by the installer [default: apple-darwin]\n\n        -b, --bin-dir\n                Override the bin installation directory [default: /usr/local/bin]\n\n        -a, --arch\n                Override the architecture identified by the installer [default: aarch64]\n\n        -B, --base-url\n                Override the base URL used for downloading releases [default: https://github.com/mbhall88/lrge/releases]\n\n        -h, --help\n                Display this help message\n```\n\n### Conda\n\n![Conda Version](https://img.shields.io/conda/vn/bioconda/lrge)\n![Conda Platform](https://img.shields.io/conda/pn/bioconda/lrge)\n![Conda Downloads](https://img.shields.io/conda/dn/bioconda/lrge)\n\n```sh\nconda 
install -c bioconda lrge\n```\n\n### Cargo\n\n![Crates.io Version](https://img.shields.io/crates/v/lrge)\n![Crates.io Total Downloads](https://img.shields.io/crates/d/lrge)\n\n```sh\ncargo install lrge\n```\n\n### Container\n\nDocker images are hosted on the GitHub Container registry.\n\n#### Apptainer\n\nPrerequisite: [`apptainer`][apptainer] (previously Singularity)\n\n```shell\n$ URI=\"docker://ghcr.io/mbhall88/lrge:latest\"\n$ apptainer exec \"$URI\" lrge --help\n```\n\nThe above will use the latest version. If you want to specify a version then use a\n[tag][ghcr] like so.\n\n```shell\n$ VERSION=\"0.1.3\"\n$ URI=\"docker://ghcr.io/mbhall88/lrge:${VERSION}\"\n```\n\n#### Docker\n\nPrerequisite: [`docker`][docker]\n\n```shell\n$ docker pull ghcr.io/mbhall88/lrge:latest\n$ docker run ghcr.io/mbhall88/lrge:latest lrge --help\n```\n\nYou can find all the available tags [here][ghcr].\n\n### Build from source\n\n```shell\n$ git clone https://github.com/mbhall88/lrge.git\n$ cd lrge\n$ cargo build --release\n$ target/release/lrge -h\n```\n\n---\n\n## Usage\n\n\u003e [!IMPORTANT]  \n\u003e The default values were calibrated from bacterial genomes, so you may need to adjust them if you are working with larger\ngenomes. 
See below for more details.\n\nEstimate the genome size of a set of *Mycobacterium tuberculosis* ONT [reads](https://www.ebi.ac.uk/ena/browser/view/SRR28370649) \n([true genome size](https://www.ebi.ac.uk/ena/browser/view/CP149484): 4.40 Mbp / 4405449 bp).\n\n```\n$ wget -O reads.fq.gz \"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR283/049/SRR28370649/SRR28370649_1.fastq.gz\"\n$ lrge -t 8 reads.fq.gz\n[2024-11-22T03:49:53Z INFO  lrge] Running two-set strategy with 10000 target reads and 5000 query reads\n[2024-11-22T03:50:10Z INFO  lrge] Estimated genome size: 4.43 Mbp (IQR: 3.16 Mbp - 4.99 Mbp)\n4426642\n[2024-11-22T03:50:10Z INFO  lrge] Done!\n```\n\nThe size estimate is printed to stdout, but you can also save it to a file with the `-o` flag.\n\n```\n$ lrge -t 8 reads.fq.gz -o size.txt\n[2024-11-22T03:49:53Z INFO  lrge] Running two-set strategy with 10000 target reads and 5000 query reads\n[2024-11-22T03:50:10Z INFO  lrge] Estimated genome size: 4.43 Mbp (IQR: 3.16 Mbp - 4.99 Mbp)\n[2024-11-22T03:50:10Z INFO  lrge] Done!\n$ cat size.txt\n4426642\n```\n\nBy default, LRGE uses the [two-set strategy](#two-set-strategy) with 10,000 target reads (`-T`) and 5,000 query reads \n(`-Q`). You can use the [all-vs-all strategy](#all-vs-all-strategy) by specifying the number of reads to use with the `-n` flag.\n\nIn [the paper][doi], we ran LRGE on three eukaryotic genomes: *Arabidopsis thaliana* (125 Mbp), *Drosophila melanogaster* \n(143 Mbp), and *Saccharomyces cerevisiae* (12 Mbp). We used 50,000 query and 100,000 target reads for *A. thaliana* and \n*D. melanogaster*, and 10,000 query and 20,000 target reads for *S. cerevisiae*.\n\n\n### Library\n\nYou can also use the `liblrge` library in your Rust projects. This allows you to estimate genome size within your own \napplications - without needing to call out to `lrge`. 
For more details on how to use the library, see the [documentation](https://www.docs.rs/liblrge) or the \n[source code](./liblrge).\n\n### Standard options\n\n```\n$ lrge -h\nGenome size estimation from long read overlaps\n\nUsage: lrge [OPTIONS] \u003cINPUT\u003e\n\nArguments:\n  \u003cINPUT\u003e  Input FASTQ file\n\nOptions:\n  -o, --output \u003cOUTPUT\u003e      Output file for the estimate [default: -]\n  -T, --target \u003cINT\u003e         Target number of reads to use (for two-set strategy; default) [default: 10000]\n  -Q, --query \u003cINT\u003e          Query number of reads to use (for two-set strategy; default) [default: 5000]\n  -n, --num \u003cINT\u003e            Number of reads to use (for all-vs-all strategy)\n  -P, --platform \u003cPLATFORM\u003e  Sequencing platform of the reads [default: ont] [possible values: ont, pb]\n  -t, --threads \u003cINT\u003e        Number of threads to use [default: 1]\n  -C, --keep-temp            Don't clean up temporary files\n  -D, --temp \u003cDIR\u003e           Temporary directory for storing intermediate files\n  -s, --seed \u003cINT\u003e           Random seed to use - making the estimate repeatable\n  -q, --quiet...             `-q` only show errors and warnings. `-qq` only show errors. `-qqq` shows nothing\n  -v, --verbose...           `-v` show debug output. `-vv` show trace output\n  -h, --help                 Print help (see more with '--help')\n  -V, --version              Print version\n```\n\n### Full usage\n\nEstimate genome size of PacBio reads\n\n```\n$ lrge -P pb -t 8 reads.fq\n```\n\nDon't remove the intermediate read and overlap files\n\n```\n$ lrge -C reads.fq\n```\n\nUse the [all-vs-all strategy](#all-vs-all-strategy) with 10,000 reads\n\n```\n$ lrge -n 10000 reads.fq\n```\n\nFix the seed so that subsequent runs return the same size estimate\n\n```\n$ lrge -s 123 reads.fq\n```\n\nBy default, we take the median of the *finite* estimates to get the final genome size estimate. 
If you want to include \ninfinite estimates in the calculation\n\n```\n$ lrge -8 reads.fq\n```\n\nIf you don't want the estimate to be rounded to the nearest integer 🤓\n\n```\n$ lrge --float-my-boat reads.fq\n```\n\nIn [the paper][doi], we suggest using the 15th and 65th percentiles of the estimates to get a ~92% confidence interval. \nHowever, you can change these\n\n```\n$ lrge --q1 0.25 --q3 0.75 reads.fq\n```\n\nIf you want to see the estimate for each read, turn on trace level logging\n\n```\n$ lrge -vv reads.fq\n```\n\nBy default, the intermediate files are stored in a temporary directory. You can specify a different temporary \ndirectory\n\n```\n$ lrge -D ./mytemp/ reads.fq\n```\n\nIf you have Illumina data, try GenomeScope2 or Mash (see [alternatives](#alternatives) for more details).\n\n---\n\n```\n$ lrge --help\nGenome size estimation from long read overlaps\n\nUsage: lrge [OPTIONS] \u003cINPUT\u003e\n\nArguments:\n  \u003cINPUT\u003e\n          Input FASTQ file\n\nOptions:\n  -o, --output \u003cOUTPUT\u003e\n          Output file for the estimate\n\n          [default: -]\n\n  -T, --target \u003cINT\u003e\n          Target number of reads to use (for two-set strategy; default)\n\n          [default: 10000]\n  -Q, --query \u003cINT\u003e\n          Query number of reads to use (for two-set strategy; default)\n\n          [default: 5000]\n\n  -n, --num \u003cINT\u003e\n          Number of reads to use (for all-vs-all strategy)\n\n  -P, --platform \u003cPLATFORM\u003e\n          Sequencing platform of the reads\n\n          [default: ont]\n          [possible values: ont, pb]\n\n  -t, --threads \u003cINT\u003e\n          Number of threads to use\n\n          [default: 1]\n\n  -C, --keep-temp\n          Don't clean up temporary files\n\n  -D, --temp \u003cDIR\u003e\n          Temporary directory for storing intermediate files\n\n  -s, --seed \u003cINT\u003e\n          Random seed to use - making the estimate repeatable\n\n  -8, --inf\n          Take the 
estimate as the median of all estimates, *including infinite estimates*\n\n  -f, --float-my-boat\n          I neeeeeed that precision! Output the estimate as a floating point number\n\n      --q1 \u003cFLOAT\u003e\n          The lower quantile to use for the estimate\n\n          [default: 0.15]\n\n      --q3 \u003cFLOAT\u003e\n          The upper quantile to use for the estimate\n\n          [default: 0.65]\n\n  -q, --quiet...\n          `-q` only show errors and warnings. `-qq` only show errors. `-qqq` shows nothing\n\n  -v, --verbose...\n          `-v` show debug output. `-vv` show trace output\n\n  -h, --help\n          Print help (see a summary with '-h')\n\n  -V, --version\n          Print version\n```\n\n\n## Method\n\nFor a full description of the method, see the [paper][doi].\n\n### Two-set strategy\n\nThe two-set strategy is the default method used by LRGE. It involves randomly selecting two distinct subsets of reads \nfrom the input. One subset is deemed the target set ($T$) and the other the query set ($Q$). Each read $q_i$ in $Q$ is overlapped \nagainst $T$ and a genome size ($\\textbf{GS}$) estimate is generated for that read ($\\textbf{GS}_{T,q_i}$). The estimate is calculated based on \nthe number of overlaps of $q_i$ with reads in $T$ ($\\lvert \\textbf{ov}(T \\setminus q_i,q_i) \\rvert$), according to the formula:\n\n```math\n\\textbf{GS}_{T,q_i} \\approx \\lvert T \\setminus q_i \\rvert \\frac{\\ell_{q_i} + \\overline{\\ell}_{T \\setminus q_i} - 2 \\cdot \\textbf{OT}}{\\lvert \\textbf{ov}(T \\setminus q_i,q_i) \\rvert}\n```\n\nwhere $\\lvert T \\setminus q_i \\rvert$ is the total size of the target set minus the read $q_i$, $\\ell_{q_i}$ is the length of read $q_i$, $\\overline{\\ell}_{T \\setminus q_i}$ is \nthe average length of reads in $T$ minus $q_i$, and $\\textbf{OT}$ is the overlap threshold (minimum chain score in minimap2, which \ndefaults to 100 for overlaps). 
See [the paper][doi] for more formal/rigorous definitions.\n\nUltimately, the genome size estimate is the median of the finite estimates for each read in $Q$.\n\nWe use this strategy as the default as it is the most computationally efficient and the accuracy is comparable to the \nall-vs-all strategy. We suggest a smaller number of query reads than target reads, as this will speed things up. Because \nwe take the median of the estimates, the number of query reads (beyond a certain point) should not affect the accuracy of \nthe estimate much.\n\n### All-vs-all strategy\n\nThe all-vs-all strategy involves overlapping some random subset (`-n`) of reads in the input against each other. The \ngenome size estimate for each read is calculated as above.\n\nThis strategy is *generally* more computationally expensive than the two-set strategy, but it can be more accurate, though \nwe did not find the difference to be statistically significant in our tests.\n\n## Results\n\nWe compared LRGE to three other methods: GenomeScope2, Mash, and Raven ([see below](#alternatives) for more info). We ran \neach method on 3370 read sets from PacBio or ONT data. Each of these samples is associated with a RefSeq assembly, so the \ntrue size was taken as the size of the RefSeq assembly. You can find the metadata for the samples [here](./paper/config/bacteria_lr_runs.filtered.tsv).\n\nThe full results are available in the [paper][doi] and [here](./paper/results/estimates/estimates.tsv). Here is a brief summary of how LRGE compares to other methods.\n\n![Results](./paper/results/figures/method_absolute_relative_error.png)\n\nThis compares the absolute relative error as a percentage. The relative error ($\\epsilon_{\\text{rel}}$) is calculated as:\n\n```math\n    \\epsilon_{\\text{rel}} = \\frac{\\hat{G} - G}{G} \\cdot 100\n```\n\nwhere $G$ is the true genome size, and $\\hat{G}$ is the estimated genome size. 
For example, a $\\epsilon_{\\text{rel}}$ of 50% \nis out (higher or lower) by 50% of the true genome size. So if the true genome size is 1 Mbp, a $\\epsilon_{\\text{rel}}$ of 50% \nwould be 1.5 Mbp or 0.5 Mbp. \n\nThe following figure shows the (non-absolute) relative error for the same methods to give an \nindication of which methods tend to over or underestimate.\n\n![Results](./paper/results/figures/platform_relative_error.png)\n\n\n## Benchmark\n\nFor the full details of the methods benchmarked, see the [paper][doi]. However, here is a brief summary of the results.\n\n![Benchmark](./paper/results/figures/method_cpu_memory.png)\n\nThe statistical annotations above the violins are coloured by the method which has the lowest mean value for the given \nmetric.\n\n## Alternatives\n\nThe methods we compare against are:\n\n[GenomeScope2](https://github.com/tbenavi1/genomescope2.0): to get estimates from GenomeScope2, you need to first generate \na k-mer spectrum. We used [KMC](https://github.com/refresh-bio/KMC) for this. You can find a Python script that takes \nreads, generates a k-mer spectrum, and estimates genome size in [`genomescope.py`](./paper/workflow/scripts/genomescope.py). The list of parameters used \ncan also be found in the [workflow config](./paper/config/config.yaml).\n\n[Mash](https://github.com/marbl/Mash): we used `mash sketch` on the reads, which prints out the estimated genome size in \nthe logging output. 
You can find the options used in the [workflow config](./paper/config/config.yaml).\n\n[Raven](https://github.com/lbcb-sci/raven): Raven essentially just assembles the reads - *REALLLLY* fast 🚀\n\nYou can find the full details of how we compared methods in the [workflow](./paper/workflow/rules/estimate.smk).\n\n## Citation\n\nIf you use LRGE in your research, please cite the following [paper][doi]:\n\n```bibtex\n@article{hall_genome_2024,\n\ttitle = {Genome size estimation from long read overlaps},\n\turl = {https://biorxiv.org/content/early/2024/12/02/2024.11.27.625777.abstract},\n\tdoi = {10.1101/2024.11.27.625777},\n\tjournal = {bioRxiv},\n\tauthor = {Hall, Michael B and Coin, Lachlan J M},\n\tmonth = jan,\n\tyear = {2024},\n\tpages = {2024.11.27.625777},\n}\n```\n\n[apptainer]: https://github.com/apptainer/apptainer\n[docker]: https://docs.docker.com/\n[doi]: https://doi.org/10.1101/2024.11.27.625777\n[ghcr]: https://github.com/mbhall88/lrge/pkgs/container/lrge\n[liblrge]: https://www.docs.rs/liblrge\n[quay.io]: https://quay.io/repository/mbhall88/lrge","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmbhall88%2Flrge","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmbhall88%2Flrge","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmbhall88%2Flrge/lists"}