{"id":19434487,"url":"https://github.com/mlin/glnext","last_synced_at":"2025-02-25T06:24:42.746Z","repository":{"id":221598285,"uuid":"421269634","full_name":"mlin/GLnext","owner":"mlin","description":"Scalable gVCF merging and joint variant calling","archived":false,"fork":false,"pushed_at":"2024-02-29T22:41:39.000Z","size":15319,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-03-02T22:26:05.539Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Kotlin","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mlin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2021-10-26T03:45:34.000Z","updated_at":"2024-02-13T21:12:28.000Z","dependencies_parsed_at":"2024-03-02T22:37:19.823Z","dependency_job_id":null,"html_url":"https://github.com/mlin/GLnext","commit_stats":null,"previous_names":["mlin/glnext"],"tags_count":44,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlin%2FGLnext","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlin%2FGLnext/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlin%2FGLnext/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlin%2FGLnext/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mlin","download_url":"https://codeload.github.com/mlin/GLnext/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240614702,"owners_count":19829396,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-10T14:46:35.872Z","updated_at":"2025-02-25T06:24:42.655Z","avatar_url":"https://github.com/mlin.png","language":"Kotlin","funding_links":[],"categories":[],"sub_categories":[],"readme":"# GLnext\n\n**NOTICE: this project is public for our collaborators; it's not yet ready for general use!**\n\nGLnext is a scalable tool for gVCF merging and joint variant calling in population-scale sequencing. It's a successor to [GLnexus](https://github.com/dnanexus-rnd/GLnexus), but shares no code and:\n\n* runs on Apache Spark at scale\n* simplifies the project VCF (pVCF) to represent only one ALT allele per line\n* generates [spVCF](https://github.com/mlin/spVCF) natively (decodes to standard pVCF)\n\n## Building\n\n**First check our [Releases](https://github.com/mlin/GLnext/releases) for a prebuilt JAR file!**\n\nRequirements: x86-64 platform, Linux or macOS, JDK 11+, Apache Maven.\n\n```\ngit clone --recursive https://github.com/mlin/GLnext.git\ncd GLnext\nmvn package\n```\n\nand find the JAR file under `target/`.\n\nTo run some basic tests,\n\n```\nexport SPARK_HOME=/path/to/spark-3.3.4-bin-hadoop3\nprove -v test/dv1KGP.t\n```\n\n## Running GLnext\n\nGeneral requirements: x86-64 platform, Linux or macOS, Java 11+, Spark 3.3.x\n\nCompatibility with other Spark versions is not assured. Also, the JAR uses native libraries for x86-64 only.\n\n### Local\n\n```\nexport SPARK_HOME=/path/to/spark-3.3.4-bin-hadoop3\n\n_JAVA_OPTIONS=\"\n    -Dspark.default.parallelism=$(nproc)\n    -Dspark.sql.shuffle.partitions=$(nproc)\n\" $SPARK_HOME/bin/spark-submit --master 'local[*]' \\\n    GLnext-XXXX.jar --config DeepVariant.WGS \\\n    /path/to/sample1.gvcf.gz /path/to/sample2.gvcf.gz ... \\\n    /path/to/outputs/myCohort\n```\n\nThe output spvcf.gz files, one per chromosome, are saved to the output directory set in the last argument. Its last path component (myCohort) is used in the individual output filenames.\n\nTo decode the spvcf.gz files to standard vcf.gz, download the [spvcf](https://github.com/mlin/spVCF) utility and run each file through `bgzip -dc myCohort_XXXX.spvcf.gz | spvcf decode | bgzip -@4 \u003e myCohort_XXXX.vcf.gz`. Or all at once:\n\n```\ncd /path/to/outputs/myCohort\nwget https://github.com/mlin/spVCF/releases/download/v1.3.2/spvcf\nchmod +x spvcf\nls -1 *.spvcf.gz | parallel -t '\n    bgzip -dc {} | ./spvcf decode -q | bgzip \u003e $(basename {} .spvcf.gz).vcf.gz\n'\n```\n\nHowever, spVCF decoding is usually fast enough to run on-the-fly, piping into downstream analysis tools, instead of storing the much larger vcf.gz files.\n\nIf you have a lot of input samples, then prepare a manifest file with one gVCF path per line, and pass GLnext `--manifest manifestFile.txt` instead of the individual paths. And, you'll probably hit out-of-memory errors until you edit the `_JAVA_OPTIONS` to increase the partitioning or (as always with Spark) tune [many other settings](https://spark.apache.org/docs/3.3.4/configuration.html#memory-management).\n\n### Google Cloud Dataproc\n\nFirst, upload to Google Cloud Storage:\n\n1. GLnext JAR file\n1. gvcf.gz input files\n1. gVCF manifest file with one gs:// URI per line\n\nThen:\n\n```\ngcloud dataproc batches submit spark \\\n    --region=us-west1 --version=1.1 \\\n    --jars=gs://MYBUCKET/GLnext-XXXX.jar \\\n    --class=net.mlin.GLnext.SparkApp \\\n    --properties=spark.default.parallelism=256,spark.sql.shuffle.partitions=256,spark.reducer.fetchMigratedShuffle.enabled=true \\\n    -- \\\n    --config DeepVariant.WGS \\\n    --manifest gs://MYBUCKET/in/gvcf_manifest.txt \\\n    gs://MYBUCKET/out/myCohort\n```\n\nThe spvcf.gz files are saved to the storage folder set in the last argument. You may then decide how and when to `spvcf decode` them to standard VCF; perhaps piping into downstream analysis tools, or using your preferred batch workflow runner.\n\n### DNAnexus\n\nBuild the [DNAnexus Spark App](https://documentation.dnanexus.com/developer/apps/developing-spark-apps):\n\n```\ndx build dx/GLnext\n```\n\nAnd see [dx/GLnext/README.md](dx/GLnext/README.md) for detailed usage instructions.\n\n## Default pVCF representation\n\nIn the GLnext \\[s\\]pVCF, all \"sites\" (lines) represent only one ALT allele, written in [normalized](https://genome.sph.umich.edu/wiki/Variant_Normalization) form. Distinct overlapping ALT alleles are presented on nearby lines. In a genotype entry, if the sample has one or more copies of an overlapping ALT allele *other than* the one presented on the current line, then the `GT` is either half-called or non-called (`./0` `./1` or `./.`) and the `OL` field is set to the overlapping ALT copy number.\n\nExperience has shown that this approach is closer to the typical practice of statistical analyses on large cohorts, compared to [multiallelic sites](https://github.com/dnanexus-rnd/GLnexus/wiki/Reading-GLnexus-pVCFs). It's less optimized for family-level analyses focused on compound heterozygote genotypes (1/2 etc.).\n\nBy default, GLnext keeps only `GT` and `DP` in pVCF entries deriving from gVCF reference bands. Other QC fields like `GQ`, `PL`, etc. are not very meaningful *when derived from reference bands*, and omitting them reduces the file size considerably. The tool can be reconfigured to propagate them if needed (see below). Beyond that choice, GLnext generates spVCF losslessly, *without* the rounding of `DP` values in spVCF's \"squeeze\" feature. If that's palatable, then the GLnext spVCF can be re-encoded with `spvcf decode | spvcf encode --squeeze` to reduce its size further.\n\n## Options\n\n**Configurations.** Available settings of `--config`:\n\n* `DeepVariant.WGS`\n* `DeepVariant.AllQC.WGS`\n* `DeepVariant.WES`\n* `DeepVariant.AllQC.WES`\n\nThe WGS and WES settings provide different calibrations for the joint genotype revision calculations (identical to GLnexus).\n\nThe AllQC configurations keep all QC values from reference bands, as discussed above. This should be paired with the `spvcf decode --with-missing-fields` argument to make all FORMAT fields explicit.\n\n**Allele quality filtering.** Unlike GLnexus, GLnext does not apply any variant quality filters by default: any ALT allele with at least one copy called is included in the output. Compared to traditional multiallelic pVCF, the impact of many lower-quality variants is mitigated by the combination of our biallelic representation and spVCF encoding. \n\nNonetheless, quality filters may be practically desirable at a certain scale, and can be enabled by setting Java options/properties:\n\n```\n-Dconfig.override.discovery.minQUAL1=10 -Dconfig.override.discovery.minQUAL2=5\n```\n\nThese thresholds include alleles with at least one copy called with Phred QUAL≥10, *or* at least two copies with QUAL≥5. (Analogous to GLnexus min_AQ1 and min_AQ2.)\n\n**Region filter.** To limit the output spVCF to variants contained within given regions:\n\n* BED file: `--filter-bed myExomeKit.bed`\n* Contigs: `--filter-contigs chr1,chr2,chr3,chr4`\n\nIf both are supplied, then they're intersected: only variants in *both* a BED region and a filter contig will be called.\n\n**Output file splitting.** By default, the app generates one spvcf.gz output file per contig. For larger cohorts where per-contig files are themselves unwieldy, a BED file can be given to guide further splitting of the spVCF output files with: `--split-bed GRCh38_60Mbp_shards.bed`. The BED regions must fully cover the contigs to be processed without any gaps or overlaps.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmlin%2Fglnext","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmlin%2Fglnext","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmlin%2Fglnext/lists"}