{"id":22165034,"url":"https://github.com/lumc/hutspot","last_synced_at":"2025-07-26T11:31:52.628Z","repository":{"id":45051208,"uuid":"175206933","full_name":"LUMC/hutspot","owner":"LUMC","description":"Multisample DNAseq variant calling pipeline for use in diagnostics","archived":false,"fork":false,"pushed_at":"2022-01-12T12:21:49.000Z","size":6273,"stargazers_count":3,"open_issues_count":2,"forks_count":2,"subscribers_count":4,"default_branch":"master","last_synced_at":"2024-03-26T16:20:32.659Z","etag":null,"topics":["biocontainers","bioinformatics","ngs-analysis","ngs-pipeline","singularity","snakemake","snakemake-workflows"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LUMC.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-03-12T12:31:07.000Z","updated_at":"2024-03-26T16:20:32.660Z","dependencies_parsed_at":"2022-08-29T21:50:21.766Z","dependency_job_id":null,"html_url":"https://github.com/LUMC/hutspot","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LUMC%2Fhutspot","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LUMC%2Fhutspot/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LUMC%2Fhutspot/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LUMC%2Fhutspot/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LUMC","download_url":"https://codeload.github.com/LUMC/hutspot/tar.gz/refs/heads/master","host":{"n
ame":"GitHub","url":"https://github.com","kind":"github","repositories_count":227674005,"owners_count":17802303,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["biocontainers","bioinformatics","ngs-analysis","ngs-pipeline","singularity","snakemake","snakemake-workflows"],"created_at":"2024-12-02T05:12:42.766Z","updated_at":"2024-12-02T05:12:46.094Z","avatar_url":"https://github.com/LUMC.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3251553.svg)](https://doi.org/10.5281/zenodo.3251553)\n[![Continuous Integration](https://github.com/LUMC/hutspot/actions/workflows/ci.yml/badge.svg)](https://github.com/LUMC/hutspot/actions/workflows/ci.yml)\n\n# Hutspot\n\nThis is a multi-sample DNA variant calling pipeline based on Snakemake, bwa and\nthe GATK HaplotypeCaller.\n\n## Features\n* Any number of samples is supported\n* Whole-genome calling, regardless of wet-lab library preparation.\n* Follows modern best practices\n    * Each sample is individually called as a GVCF.\n    * A VCF is then produced by genotyping the individual GVCFs separately\n      for each sample.\n* Data parallelization for calling and genotyping steps.\n    * Using the `scatter_size` setting in the configuration file, the reference\n      genome is split into chunks, and each chunk can be processed\n      independently. 
The default value of 1 billion will scatter the human\n      reference genome into 6 chunks.\n* Reasonably fast.\n    * 96 exomes in \u003c 24 hours.\n* No unnecessary jobs\n* Calculate coverage metrics if a `bedfile` is specified.\n* Fully containerized rules through singularity and biocontainers. Legacy\nconda environments are no longer available.\n\n# Installation\n\nThis repository contains a [conda](https://conda.io/docs/)\nenvironment file that you can use to install all dependencies in a\nconda environment:\n\n```bash\nconda env create -f environment.yml\n```\n\n## Singularity\n\nWe highly recommend the use of the containerized rules through\n[singularity](https://www.sylabs.io/singularity/).\n\nThis option does require you to install singularity on your system. As this\nusually requires administrative privileges, singularity is not contained\nwithin our provided conda environment file.\n\nIf you want to use singularity, make sure you install version 3 or higher.\n\n### Debian\nIf you happen to use Debian buster, singularity 3.0.3 comes straight out\nof the box with a simple:\n\n```bash\nsudo apt install singularity-container\n```\n\n### Docker\n\nYou can run singularity within a docker container. 
Please note that\nthe container **MUST** run in privileged mode for this to work.\n\nWe have provided our own container that includes singularity and snakemake\n[here](https://hub.docker.com/r/lumc/singularity-snakemake).\n\n### Manual install\n\nIf you don't use Debian buster and cannot run a privileged docker container,\nyou - unfortunately :-( - will have to install singularity manually.\nPlease see the installation instructions\n[here](https://github.com/sylabs/singularity/blob/master/INSTALL.md) on how\nto do that.\n\n\n## Operating system\n\nHutspot was tested on Ubuntu 16.04 only.\nIt should work reasonably well on most modern Linux distributions.\n\n# Requirements\n\nFor every sample you wish to analyze, we require one or more paired-end\nreadgroups in fastq format. They must be compressed with either `gzip` or\n`bgzip`.\n\nThe configuration must be passed to the pipeline through a configuration file.\nThis is a json file listing the samples and their associated readgroups\nas well as the other settings to be used.\nAn example config json can be found [here](config/example.json), and a\njson schema describing the configuration file can be found [here](config/schema.json).\nThis json schema can also be used to validate your configuration file.\n\n## Reference files\n\nThe following reference files **must** be provided in the configuration:\n\n1. `reference`: A reference genome, in fasta format. Must be indexed with\n   `samtools faidx`.\n2. `dbsnp`: A dbSNP VCF file\n3. `known_sites`: One or more VCF files with known sites for base\n    recalibration\n\nThe following reference files **may** be provided:\n\n1. `targetsfile`: Bed file of the targets of the capture kit. Used to calculate coverage.\n2. `baitsfile`: Bed file of the baits of the capture kit. Used to calculate picard HsMetrics.\n3. `refflat`: A refFlat file to calculate coverage over transcripts.\n4. `scatter_size`: Size of the chunks to split the variant calling into.\n5. 
`female_threshold`: Fraction of reads between X and the autosomes to call as\n    female.\n\n\n# How to run\n\nAfter installing and activating the main conda environment, as described above,\nthe pipeline can be started with:\n\n```bash\nsnakemake -s Snakefile \\\n--use-singularity \\\n--configfile tests/data/config/sample_config.json\n```\n\nThis would start all jobs locally. Obviously this is not what one would\nregularly do for a normal pipeline run. How to submit jobs on a cluster is\ndescribed later. Let's first move on to the necessary configuration values.\n\n## Configuration values\nThe required and optional outputs are specified in the json schema located in\n`config/schema.json`. Before running, the content of the `--configfile` is\nvalidated against this schema.\n\nThe following configuration values are **required**:\n\n| configuration | description |\n| ------------- | ----------- |\n| `reference` | Absolute path to fasta file |\n| `samples` | One or more samples, with associated fastq files |\n| `dbsnp` | Path to dbSNP VCF file|\n| `known_sites` | Path to one or more VCF files with known sites. Can be the same as the `dbsnp` file|\n\n\nThe following configuration options are **optional**:\n\n| configuration | description |\n| ------------- | ----------- |\n| `targetsfile` | Bed file of the targets of the capture kit. Used to calculate coverage |\n| `baitsfile` | Bed file of the baits of the capture kit. Used to calculate picard HsMetrics |\n| `female_threshold` | Float between 0 and 1 that signifies the threshold of the ratio between coverage on X/overall coverage that 'calls' a sample as female. Default = 0.6 |\n| `scatter_size` | The size of chunks to divide the reference into for parallel execution. Default = 1000000000 |\n| `coverage_threshold` | One or more threshold coverage values. 
For each value, a sample-specific bed file will be created that contains the regions where the coverage is above the threshold |\n| `restrict_BQSR` | Restrict GATK BaseRecalibration to a single chromosome. This is faster, but the recalibration is possibly less reliable |\n| `multisample_vcf` | Create a true multisample VCF file, in addition to the regular per-sample VCF files |\n\n\n## Cluster configuration\n\nTo run on a cluster, snakemake needs to be called with some extra arguments.\nAdditionally, it needs a cluster yaml file describing resources per job.\n\nIf you run on a cluster with drmaa support, an environment variable named\n`DRMAA_LIBRARY_PATH` must be set in the executing shell environment. This variable\npoints to the `.so` file of the DRMAA library.\n\nAn sge-cluster.yml is bundled with this pipeline in the cluster directory.\nIt is optimized for SGE clusters, where the default vmem limit is 4G.\nIf you run SLURM, or any other cluster system, you will have to write your own\ncluster yaml file. Please see the [snakemake documentation](http://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#cluster-configuration)\nfor details on how to do so. Given the provided sge-cluster.yml, activating the\ncluster mode can be done as follows:\n\n```bash\nsnakemake -s Snakefile \\\n--cluster-config cluster/sge-cluster.yml \\\n--drmaa ' -pe \u003cPE_NAME\u003e {cluster.threads} -q all.q -l h_vmem={cluster.vmem} -cwd -V -N hutspot'\n```\n\n## Limitations\nSample names should be unique, and not overlap (such as `sample1` and\n`sample10`). 
This is due to the way output files are parsed by multiQC:\nwhen sample names overlap, the json output for picard DuplicationMetrics cannot\nbe parsed unambiguously.\n\n## Binding additional directories under singularity\n\nIn singularity mode, snakemake binds its own location in the container.\nThe current working directory is also visible directly in the container.\n\nIn many cases, this is not enough, and will result in `FileNotFoundError`s.\nE.g., suppose you run your pipeline in `/runs`, but your fastq files live in\n`/fastq` and your reference genome lives in `/genomes`. We would have to bind\n`/fastq` and `/genomes` in the container.\n\nThis can be accomplished with `--singularity-args`, which accepts a simple\nstring of arguments passed to singularity. E.g. in the above example,\nwe could do:\n\n```bash\nsnakemake -s Snakefile \\\n--use-singularity \\\n--singularity-args ' --bind /fastq:/fastq --bind /genomes:/genomes '\n```\n\n## Summing up\n\nTo sum up, a full pipeline run on a cluster would be called as:\n\n```bash\nsnakemake -s Snakefile \\\n--use-singularity \\\n--singularity-args ' --bind /some_path:/some_path ' \\\n--cluster-config cluster/sge-cluster.yml \\\n--drmaa ' -pe \u003cPE_NAME\u003e {cluster.threads} -q all.q -l h_vmem={cluster.vmem} -cwd -V -N hutspot' \\\n--rerun-incomplete \\\n--jobs 200 \\\n-w 120 \\\n--max-jobs-per-second 30 \\\n--restart-times 2 \\\n--configfile config.json\n```\n\n# Graph\n\nBelow you can see the rule graph of the pipeline. The main variant calling flow\nis highlighted in red. This only shows dependencies\nbetween rules, and not between jobs. 
The actual job graph is considerably\nmore complex, as nearly all rules are duplicated by sample and some\n(the scatter jobs) additionally by chunk.\n\nAs a rough estimate of the total number of jobs in pipeline you can use\nthe following formula:\n\n\u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=n_{jobs}\u0026space;=\u0026space;4\u0026plus;(21*n_{samples})\u0026plus;(n_{samples}*n_{beds})\u0026plus;(n_{samples}*n_{chunks})\u0026plus;n_{chunks}\" target=\"_blank\"\u003e\u003cimg src=\"https://latex.codecogs.com/gif.latex?n_{jobs}\u0026space;=\u0026space;4\u0026plus;(21*n_{samples})\u0026plus;(n_{samples}*n_{beds})\u0026plus;(n_{samples}*n_{chunks})\u0026plus;n_{chunks}\" title=\"n_{jobs} = 4+(21*n_{samples})+(n_{samples}*n_{beds})+(n_{samples}*n_{chunks})+n_{chunks}\" /\u003e\u003c/a\u003e\n\n\u003c!---\nNote: math doesn't work on github. The following _does_ work in gitlab\n```math\njobs = 4+(21*n_{samples})+(1*n_{samples}*n_{beds})+(1*n_{samples}*n_{chunks})+(1*n_{chunks})\n```\n---\u003e\n\nThis gives about 12,000 jobs for a 96-sample run with 2 bed files and 100 chunks.\n\nNOTE: the graph will only render if your markdown viewer supports `plantuml`.\nHaving trouble viewing the graph? 
See [this](img/rulegraph.svg) static SVG instead.\n\n```plantuml\ndigraph snakemake_dag {\n    graph[bgcolor=white, margin=0];\n    node[shape=box, style=rounded, fontname=sans,                 fontsize=10, penwidth=2];\n    edge[penwidth=2, color=grey];\n\t0[label = \"all\", color = \"0.30 0.6 0.85\", style=\"rounded\"];\n\t1[label = \"multiqc\", color = \"0.60 0.6 0.85\", style=\"rounded\"];\n\t2[label = \"merge_stats\", color = \"0.17 0.6 0.85\", style=\"rounded\"];\n\t3[label = \"bai\", color = \"0.09 0.6 0.85\", style=\"rounded\"];\n\t4[label = \"genotype_gather\\nsample: micro\", color = \"0.06 0.6 0.85\", style=\"rounded\"];\n\t5[label = \"gvcf_gather\\nsample: micro\", color = \"0.32 0.6 0.85\", style=\"rounded\"];\n\t6[label = \"fastqc_raw\\nsample: micro\", color = \"0.00 0.6 0.85\", style=\"rounded\"];\n\t7[label = \"fastqc_merged\", color = \"0.11 0.6 0.85\", style=\"rounded\"];\n\t8[label = \"fastqc_postqc\", color = \"0.02 0.6 0.85\", style=\"rounded\"];\n\t9[label = \"stats_tsv\", color = \"0.45 0.6 0.85\", style=\"rounded\"];\n\t10[label = \"collectstats\", color = \"0.24 0.6 0.85\", style=\"rounded\"];\n\t11[label = \"vcfstats\\nsample: micro\", color = \"0.52 0.6 0.85\", style=\"rounded\"];\n\t12[label = \"markdup\", color = \"0.47 0.6 0.85\", style=\"rounded\"];\n\t13[label = \"scatterregions\", color = \"0.56 0.6 0.85\", style=\"rounded\"];\n\t14[label = \"merge_r1\\nsample: micro\", color = \"0.65 0.6 0.85\", style=\"rounded\"];\n\t15[label = \"merge_r2\\nsample: micro\", color = \"0.26 0.6 0.85\", style=\"rounded\"];\n\t16[label = \"cutadapt\", color = \"0.22 0.6 0.85\", style=\"rounded\"];\n\t17[label = \"fqcount_preqc\", color = \"0.37 0.6 0.85\", style=\"rounded\"];\n\t18[label = \"fqcount_postqc\", color = \"0.58 0.6 0.85\", style=\"rounded\"];\n\t19[label = \"mapped_reads_bases\", color = \"0.43 0.6 0.85\", style=\"rounded\"];\n\t20[label = \"unique_reads_bases\", color = \"0.34 0.6 0.85\", style=\"rounded\"];\n\t21[label = 
\"fastqc_stats\", color = \"0.13 0.6 0.85\", style=\"rounded\"];\n\t22[label = \"covstats\", color = \"0.39 0.6 0.85\", style=\"rounded\"];\n\t23[label = \"align\", color = \"0.49 0.6 0.85\", style=\"rounded\"];\n\t24[label = \"create_markdup_tmp\", color = \"0.41 0.6 0.85\", style=\"rounded,dashed\"];\n\t25[label = \"sickle\", color = \"0.19 0.6 0.85\", style=\"rounded\"];\n\t26[label = \"genome\", color = \"0.62 0.6 0.85\", style=\"rounded\"];\n\t1 -\u003e 0\n\t2 -\u003e 0\n\t3 -\u003e 0\n\t4 -\u003e 0\n\t5 -\u003e 0\n\t6 -\u003e 0\n\t7 -\u003e 0\n\t8 -\u003e 0\n\t9 -\u003e 1\n\t10 -\u003e 2\n\t11 -\u003e 2\n\t12 -\u003e 3\n\t13 -\u003e 4\n\t13 -\u003e 5\n\t14 -\u003e 7\n\t15 -\u003e 7\n\t16 -\u003e 8\n\t2 -\u003e 9\n\t17 -\u003e 10\n\t18 -\u003e 10\n\t19 -\u003e 10\n\t20 -\u003e 10\n\t21 -\u003e 10\n\t22 -\u003e 10\n\t4 -\u003e 11\n\t23 -\u003e 12\n\t24 -\u003e 12\n\t25 -\u003e 16\n\t14 -\u003e 17\n\t15 -\u003e 17\n\t16 -\u003e 18\n\t23 -\u003e 19\n\t12 -\u003e 20\n\t7 -\u003e 21\n\t8 -\u003e 21\n\t12 -\u003e 22\n\t26 -\u003e 22\n\t16 -\u003e 23\n\t24 -\u003e 23\n\t14 -\u003e 25\n\t15 -\u003e 25\n}\n```\n\nLICENSE\n=======\n\nAGPL-3.0\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flumc%2Fhutspot","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flumc%2Fhutspot","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flumc%2Fhutspot/lists"}