{"id":28458887,"url":"https://github.com/gatb/mindthegap","last_synced_at":"2025-10-15T07:33:07.170Z","repository":{"id":88841297,"uuid":"56671921","full_name":"GATB/MindTheGap","owner":"GATB","description":"MindTheGap is a SV caller for short read sequencing data dedicated to insertion variants (all sizes and types). It can also be used as a local assembly tool.","archived":false,"fork":false,"pushed_at":"2022-04-20T14:46:02.000Z","size":1412,"stargazers_count":37,"open_issues_count":1,"forks_count":12,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-06-17T06:36:29.656Z","etag":null,"topics":["bioinformatics","debruijn-graph","gatb","genomics","structural-variants"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GATB.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2016-04-20T08:59:52.000Z","updated_at":"2025-02-13T06:54:43.000Z","dependencies_parsed_at":"2023-06-19T05:10:16.393Z","dependency_job_id":null,"html_url":"https://github.com/GATB/MindTheGap","commit_stats":null,"previous_names":[],"tags_count":10,"template":false,"template_full_name":null,"purl":"pkg:github/GATB/MindTheGap","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GATB%2FMindTheGap","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GATB%2FMindTheGap/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GATB%2FMindTheGap/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GATB%2FMindTheGap/manifests","owner_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/owners/GATB","download_url":"https://codeload.github.com/GATB/MindTheGap/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GATB%2FMindTheGap/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263111297,"owners_count":23415421,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","debruijn-graph","gatb","genomics","structural-variants"],"created_at":"2025-06-07T00:40:23.841Z","updated_at":"2025-10-15T07:33:07.039Z","avatar_url":"https://github.com/GATB.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MindTheGap \n\n| **Linux** | **Mac OSX** |\n|-----------|-------------|\n[![Build Status](https://ci.inria.fr/gatb-core/view/MindTheGap-gitlab/job/tool-mindthegap-build-debian7-64bits-gcc-4.7-gitlab/badge/icon)](https://ci.inria.fr/gatb-core/view/MindTheGap/job/tool-mindthegap-build-debian7-64bits-gcc-4.7/) | [![Build Status](https://ci.inria.fr/gatb-core/view/MindTheGap-gitlab/job/tool-mindthegap-build-macos-10.9.5-gcc-4.2.1-gitlab/badge/icon)](https://ci.inria.fr/gatb-core/view/MindTheGap/job/tool-mindthegap-build-macos-10.9.5-gcc-4.2.1/)\n\n[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/mindthegap/README.html)\n\n[![License](http://img.shields.io/:license-affero-blue.svg)](http://www.gnu.org/licenses/agpl-3.0.en.html)     \n\n# What is MindTheGap ?\n\nMindTheGap  performs detection and assembly of **DNA insertion 
variants** in NGS read datasets with respect to a reference genome. It is designed to call insertions of any size, whether they are novel or duplicated, homozygous or heterozygous in the donor genome. It takes as input a set of reads and a reference genome. It outputs two sets of FASTA sequences: one is the set of breakpoints of detected insertion sites, the other is the set of assembled insertions for each breakpoint.\n\n**New !** MindTheGap can also be used as a **genome assembly finishing tool**: it can fill the gaps between a set of input contigs without any a priori on their relative order and orientation. It outputs the results in a gfa file. It is notably integrated as an essential step in the targeted assembly tool **MinYS** (MineYourSymbiont in metagenomics datasets, see [https://github.com/cguyomar/MinYS](https://github.com/cguyomar/MinYS)).\n\nMindTheGap is a [Genscale](http://team.inria.fr/genscale/) tool, built upon the [GATB](http://gatb.inria.fr/) C++ library, and developed by:\n* Claire Lemaitre\n* Cervin Guyomar\n* Wesley Delage\n* Guillaume Rizk\n* Former developers: Rayan Chikhi, Pierre Marijon. \n\n# Installation instructions\n\n## Requirements\n\nCMake 3.1+; see http://www.cmake.org/cmake/resources/software.html\n\nC++/11 capable compiler (e.g. gcc 4.7+, clang 3.5+, Apple/clang 6.0+)\n\n## Getting the latest source code with git\n\n    # get a local copy of MindTheGap source code\n    git clone --recursive https://github.com/GATB/MindTheGap.git\n    \n    # compile the code\n    cd MindTheGap\n    sh INSTALL\n    # the binary file is located in directory build/bin/\n    ./build/bin/MindTheGap -help\n\nNote: when updating your local repository with `git pull`, if you see that thirdparty/gatb-core has changed, you have to run also : `git submodule update`. 
\n\n## Installing a stable release\n\nRetrieve a binary archive file from one of the official MindTheGap releases (see the \"Releases\" tab on the GitHub web page); the file name is `MindTheGap-vX.Y.Z-bin-Linux.tar.gz` (for Linux) or `MindTheGap-vX.Y.Z-bin-Darwin.tar.gz` (for macOS).\n\n    tar -zxf MindTheGap-vX.Y.Z-bin-Darwin.tar.gz\n    cd MindTheGap-vX.Y.Z-bin-Darwin\n    chmod u+x bin/MindTheGap\n    ./bin/MindTheGap -help\n\nIf the software does not run properly on your system, consider installing it from its source code. Retrieve the source archive file `MindTheGap-vX.Y.Z-Source.tar.gz`.\n\n    tar -zxf MindTheGap-vX.Y.Z-Source.tar.gz\n    cd MindTheGap-vX.Y.Z-Source\n    sh INSTALL\n    # the binary file is located in directory build/bin/\n    ./build/bin/MindTheGap -help\n\n## Using conda or docker\n\nMindTheGap is also distributed as a [Bioconda package](https://anaconda.org/bioconda/mindthegap):\n\n    conda install -c bioconda mindthegap\n\nOr pull the docker image of MindTheGap (warning: this image needs to be updated to the latest releases):\n\n    docker pull clemaitr/mindthegap\n\n## Small run example\n\n```\nMindTheGap find -in data/reads_r1.fastq,data/reads_r2.fastq -ref data/reference.fasta -out example\nMindTheGap fill -graph example.h5 -bkpt example.breakpoints -out example\n```\n\n\n\n# USER MANUAL\t \n\n## Description\n\nMindTheGap is a software tool that performs integrated detection and assembly of **genomic insertion variants** in NGS read datasets with respect to a reference genome. It is designed to call insertions of any size, whether they are novel or duplicated, homozygous or heterozygous in the donor genome. \n\nAlternatively, and since release 2.1.0, MindTheGap can also be used as a **genome assembly finishing tool**. It is integrated as an essential step in the **targeted assembly** tool [MinYS (MineYourSymbiont in metagenomics datasets)](https://github.com/cguyomar/MinYS). 
It is also part of a gap-filling pipeline dedicated to linked-read data (10X Genomics): [MTG-link](https://github.com/anne-gcd/MTG-Link).\n\n**Insertion variant detection**\n\nIt takes as input a set of reads and a reference genome. Its main output is a VCF file giving, for each insertion variant, its insertion site location on the reference genome, a single insertion sequence or a set of candidate insertion sequences (when there are assembly ambiguities), and its genotype in the sample. \n\nFor a detailed user manual specific to insertion variants see [doc/MindTheGap_insertion_caller.md](doc/MindTheGap_insertion_caller.md).\n\n**Genome assembly gap-filling** (New feature !)\n\nWhen given a set of reads and a set of contigs as input, MindTheGap tries to fill the gaps between all pairs of contigs by de novo local assembly, without any prior knowledge of their relative order and orientation. It outputs the results in a gfa file. \n\nFor a detailed user manual specific to contig gap-filling see [doc/MindTheGap_assembly.md](doc/MindTheGap_assembly.md).\n\n**Performance**\n\nMindTheGap performs de novo assembly using the [GATB](http://gatb.inria.fr) C++ library and is inspired by algorithms from Minia. Hence, the computational resources required to run MindTheGap are significantly lower than those of other assemblers (for instance, it uses less than 6 GB of main memory to analyze a full human NGS dataset).\n\n\nFor more details on the method and some recent results, see the [web page](http://gatb.inria.fr/software/mind-the-gap/).\n\t\n## Usage and examples\n\nMindTheGap is composed of two main modules: breakpoint detection (`find` module) and the local assembly of insertions or gaps (`fill` module). Both steps are implemented in a single executable, MindTheGap, and can be run independently by specifying the module name as follows:\n\n    MindTheGap \u003cmodule\u003e [module options] \n\n1. 
**Basic command lines**\n\n        #Find module:\n        MindTheGap find (-in \u003creads.fq\u003e | -graph \u003cgraph.h5\u003e) -ref \u003creference.fa\u003e [options]\n        #To get help:\n        MindTheGap find -help\n\t    \n        #Fill module:\n        MindTheGap fill (-in \u003creads.fq\u003e | -graph \u003cgraph.h5\u003e) (-bkpt \u003cbreakpoints.fa\u003e | -contig \u003ccontigs.fa\u003e) [options]\n        #To get help:\n        MindTheGap fill -help\n\n2. **Examples**\n\n   These examples can be run with the small datasets in directory `data/`\n   \n\t**Example for insertion variant calling:**\n   \n\t    #find\n\t    build/bin/MindTheGap find -in data/reads_r1.fastq,data/reads_r2.fastq -ref data/reference.fasta -out example\n\t    # 3 files are generated: \n\t    #   example.h5 (de bruijn graph), \n\t    #   example.othervariants.vcf (SNPs and deletion variants), \n\t    #   example.breakpoints (breakpoints of insertion variants).\n\t    \n\t    #fill\n\t    build/bin/MindTheGap fill -graph example.h5 -bkpt example.breakpoints -out example\n\t    # 3 files are generated:\n\t    #   example.insertions.fasta (insertion sequences)\n\t    #   example.insertions.vcf (insertion variants)\n\t    #   example.info.txt (log file)\n\t\n\t**Example for gap-filling between contigs:**\n\t\n\t```\n\tbuild/bin/MindTheGap fill -in data/contig-reads.fasta.gz -contig data/contigs.fasta -abundance-min 3 -out contig_example\n\t# 4 files are generated\n\t#   contig_example.h5 (de bruijn graph)\n\t#   contig_example.insertions.fasta (gap-filling sequences)\n\t#   contig_example.gfa (genome graph)\n\t#   contig_example.info.txt (log file)\n\t```\n\t\n\tThe usage of the `fill` module is a little bit different depending on the type of gap-filling : assembling insertion variants (using the `-bkpt`option with a breakpoint file) or gap-filling between contigs (using the `-contig` option with a contig fasta file). \n\n## Details\n\n1. 
**Input sequencing read data**\n\t\n\tFor both modules, read dataset(s) are first indexed in a De Bruijn graph. The input format of read dataset(s) is either the read files themselves (option `-in`), or the already computed de bruijn graph in hdf5 format (.h5) (option `-graph`).   \n\tNOTE: options `-in` and `-graph` are mutually exclusive, and one of these is mandatory.\n\t\n\tIf the input is composed of several read files, they can be provided as a list of file paths separated by a comma or as a \"file of file\" (fof), that is a text file containing on each line the path to each read file. All read files will be treated as if concatenated in a single sample. The read file format can be fasta, fastq or gzipped. \n\t\n2. **de Bruijn graph creation options**\n\n   In addition to input read set(s), the de Bruijn graph creation uses two main parameters, `-kmer-size` and `-abundance-min`: \n\n   * `-kmer-size`: the k-mer size [default '31']. By default, the largest kmer-size allowed is 128. To use k\u003e128, you will need to re-compile MindTheGap as follows: \n\n        ```\n     cd build/\n     cmake -DKSIZE_LIST=\"32 64 96 256\" ..\n     make\n     ```\n\n     To go back to default, replace 256 by 128. Note that increasing the range between two consecutive kmer-sizes in the list can have an impact on the size of the output h5 files (but none on the results).\n\n   * `-abundance-min`: the minimal abundance threshold, k-mers having less than this number of occurrences are discarded from the graph [default 'auto', ie. automatically inferred from the dataset]. \n\n   * `-abundance-max`: the maximal abundance threshold, k-mers having more than this number of occurrences are discarded from the graph [default '2147483647' ie. no limit].\n\n3. **Computational resources options**\n\n    Additional options are related to computational runtime and memory:\n    \n    * `-nb-cores`: number of cores to be used for computation [default '0', ie. 
all available cores will be used].\n    * `-max-memory`: max RAM memory for the graph creation (in MBytes)  [default '2000']. Increasing the memory will speed up the graph creation phase.\n    * `-max-disk`: max usable disk space for the graph creation (in MBytes)  [default '0', ie. automatically set]. Kmers are counted by writing temporary files on the disk, to speed up the counting you can increase the usable disk space.\n    \n4. **MindTheGap Output**\n\n    All the output files are prefixed either by a default name: \"MindTheGap_Expe-[date:YY:MM:DD-HH:mm]\" or by a user defined prefix (option `-out` of MindTheGap).\n    \n    The main results files are output by the Fill module, these are:\n    \n    * an **insertion variant file** (`.insertions.vcf`) in vcf format, in the case of insertion variant detection (for insertions \u003e2 bp).\n\n    * an **assembly graph file** (`.gfa`) in GFA format, in the case of contig gap-filling. It contains the original contigs and the obtained gap-fill sequences (nodes of the graph), together with their overlapping relationships (arcs of the graph).\n\n    Additional output files are:\n    \n\t* a graph file (`.h5`), output by both MindTheGap modules. This is a binary file containing the de Bruijn graph data structure. To obtain information stored in it, you can use the utility program `dbginfo` located in your bin directory or in ext/gatb-core/bin/.\n  \n    * Files output specifically by `MindTheGap find`:\n    \n    \t* a breakpoint file (`.breakpoints`) in fasta format. \n    \n\t\t* a variant file (`.othervariants.vcf`) in vcf format. It contains SNPs, deletions and very small insertions (1-2 bp).\n  \n    * Files output specifically by `MindTheGap fill`:\n    \n\t\t* a sequence file (`.insertions.fasta`) in fasta format. It contains the inserted sequences (for insertions \u003e2 bp) or contig gap-fills that were successfully assembled. 
\n  \n\t\t* a log file (`.info.txt`), a tabular file with some information about the filling process for each breakpoint/gap-fill. \n  \n\t\t* with option `-extend`, an additional sequence file (`.extensions.fasta`) in fasta format. It contains sequence extensions for failed insertion or gap-filling assemblies, i.e. when the target kmer was not found, the first contig immediately after the source kmer is output.\n  \n\nOther optional parameters and details on input and output file formats are given in [doc/MindTheGap_insertion_caller.md](doc/MindTheGap_insertion_caller.md) and [doc/MindTheGap_assembly.md](doc/MindTheGap_assembly.md), depending on the usage.\n\n\n\n## Utility programs\n\nEither in your `bin/` directory or in `ext/gatb-core/bin/`, you can find additional utility programs:\n* `dbginfo`: to get information about a graph stored in a .h5 file\n* `dbgh5`: to build a graph from read set(s) and obtain a .h5 file\n* `h5dump`: to extract data stored in a .h5 file\n\n\n\n## Reference\n\nIf you use MindTheGap, please cite: \n\nMindTheGap: integrated detection and assembly of short and long insertions. Guillaume Rizk, Anaïs Gouin, Rayan Chikhi and Claire Lemaitre. Bioinformatics 2014 30(24):3451-3457. http://bioinformatics.oxfordjournals.org/content/30/24/3451\n\n[Web page](https://gatb.inria.fr/software/mind-the-gap/) with some updated results.\n\nMindTheGap was also evaluated in a recent benchmark exploring many different genomic features (size, nature, repeat context, junctional homology at breakpoints) of human insertion variants. Among the SV callers tested, MindTheGap was the only tool able to output sequence-resolved insertions for many types of insertions. Read more: [Towards a better understanding of the low recall of insertion variants with short-read based variant callers.](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-07125-5) Delage W, Thevenon J, Lemaitre C. 
*BMC Genomics* **2020**, 21(1):762.\n\n\n# Contact\n\nTo contact a developer, request help, or for any feedback on MindTheGap, please use the issue form of github: https://github.com/GATB/MindTheGap/issues\n\nYou can see all issues concerning MindTheGap [here](https://github.com/GATB/MindTheGap/issues) and GATB [here](https://www.biostars.org/t/GATB/).\n\nIf you do not have any github account, you can also send an email to claire dot lemaitre at inria dot fr\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgatb%2Fmindthegap","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgatb%2Fmindthegap","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgatb%2Fmindthegap/lists"}