{"id":20810138,"url":"https://github.com/lh3/chm-eval","last_synced_at":"2026-03-14T21:03:53.503Z","repository":{"id":139699580,"uuid":"58572845","full_name":"lh3/CHM-eval","owner":"lh3","description":null,"archived":false,"fork":false,"pushed_at":"2020-06-24T15:25:22.000Z","size":537,"stargazers_count":53,"open_issues_count":3,"forks_count":8,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-01-18T14:29:08.729Z","etag":null,"topics":["bioinformatics","genomics","variant-calling"],"latest_commit_sha":null,"homepage":null,"language":"TeX","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lh3.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2016-05-11T19:03:16.000Z","updated_at":"2024-11-30T09:03:00.000Z","dependencies_parsed_at":null,"dependency_job_id":"1bc66c3d-ede9-47ab-b280-404881fc7178","html_url":"https://github.com/lh3/CHM-eval","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2FCHM-eval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2FCHM-eval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2FCHM-eval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2FCHM-eval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lh3","download_url":"https://codeload.github.com/lh3/CHM-eval/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243158284,"owners_count":20245652,"
icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","genomics","variant-calling"],"created_at":"2024-11-17T20:19:54.861Z","updated_at":"2025-12-24T21:33:22.180Z","avatar_url":"https://github.com/lh3.png","language":"TeX","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Getting Started\n```sh\n# Download and install evaluation suite (Linux only)\ncurl -L https://github.com/lh3/CHM-eval/releases/download/v0.4/CHM-evalkit-20180221.tar \\\n    | tar xf -\n# Call CHM1-CHM13 variants in GRCh37 coordinates (will take a while...)\nwget -qO- ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR134/ERR1341796/CHM1_CHM13_2.bam \\\n    | freebayes -f hs37.fa - \u003e CHM1_CHM13_2.raw.vcf\n# Filter (use your own filters if you like)\nCHM-eval.kit/run-flt -o CHM1_CHM13_2.flt CHM1_CHM13_2.raw.vcf\n# Distance-based evaluation\nCHM-eval.kit/run-eval -g 37 CHM1_CHM13_2.flt.vcf.gz | sh\nmore CHM1_CHM13_2.flt.summary\n# Evaluating allele and genotype accuracy (Java required)\nCHM-eval.kit/rtg format -o hs37.sdf hs37.fa   # if you haven't done this before\nCHM-eval.kit/run-eval -g 37 -s hs37.sdf CHM1_CHM13_2.flt.vcf.gz | sh\nmore CHM1_CHM13_2.flt.rtg.summary\n```\n\n## Introduction\n\nCHM-eval, aka Syndip, is a benchmark dataset for evaluating the accuracy of\nsmall variant callers. It is constructed from the PacBio assemblies of two\nindependent [CHM][CHM] cell lines using procedures largely orthogonal to the\nmethodology used for short-read variant calling, which makes it more\ncomprehensive and less biased in comparison to existing benchmark datasets. 
The\nfollowing figure briefly explains how this dataset was generated:\n\n![](CHM-workflow.png)\n\nThe truth data can be downloaded from the [release page][release]. The package\ncontains the list of confident regions, *phased* variant calls including\nthousands of long insertions/deletions, and evaluation scripts (see below).\nIllumina short reads sequenced from the two cell lines and from the\nexperimental mixture of the two cell lines are available via project\n[PRJEB13208][ena] at ENA.\n\n```\nCHM-eval.kit\n|-- 00README.md              # this file\n|-- 01ori\n|   |-- func-37d5.bed.gz -\u003e func-37m.bed.gz\n|   |-- func-37m.bed.gz      # coding and conserved regions (from EnsEMBL) in GRCh37\n|   |-- func-38.bed.gz       # coding and conserved regions in GRCh38\n|   |-- syndip.m37d5.bed.gz  # confident regions including poly-A (for alignment against GRCh37+decoy)\n|   |-- syndip.m37m.bed.gz   # for alignment against GRCh37 primary assembly without decoy\n|   `-- syndip.m38.bed.gz    # for alignment against GRCh38 primary assembly\n|-- RTG-LICENSE.txt\n|-- RTG.jar                  # rtg-tools v3.8.4 (for evaluating allele/genotype accuracy)\n|-- full.37d5.bed.gz         # whole-genome confident regions excluding poly-A (against GRCh37+decoy)\n|-- full.37d5.vcf.gz         # whole-genome phased variant calls, including filtered\n|-- full.37m.bed.gz          # for alignment against GRCh37 without decoy\n|-- full.37m.vcf.gz\n|-- full.38.bed.gz           # for alignment against GRCh38\n|-- full.38.vcf.gz\n|-- func.37d5.bed.gz         # intersection of full.37d5.bed.gz and 01src/func-37d5.bed.gz\n|-- func.37m.bed.gz\n|-- func.38.bed.gz\n|-- hapdip.js                # script for evaluating distance-based accuracy\n|-- htsbox                   # htsbox-r345; auxiliary tool\n|-- k8                       # k8 javascript shell, for running hapdip.js\n|-- rtg                      # rtg portal script\n|-- rtg.cfg\n|-- run-eval                 # key evaluation script\n|-- 
sdust30-37d5.bed.gz      # low-complexity regions identified with SDUST at T=30\n|-- sdust30-37m.bed.gz -\u003e sdust30-37d5.bed.gz\n|-- sdust30-38.bed.gz\n|-- um35-hs37d5.bed.gz       # universal mask for GRCh37+decoy; for 35bp reads (Mallick et al, 2016)\n`-- um75-hs37d5.bed.gz       # for 75bp or longer reads\n```\n\nIf you use this dataset, please cite:\n\n\u003e Li H, Bloom JM, Farjoun Y, Fleharty M, Gauthier L, Neale B, MacArthur D\n\u003e (2018) A synthetic-diploid benchmark for accurate variant-calling\n\u003e evaluation. *Nat Methods*, **15**:595-597. [PMID:30013044]\n\n[CHM]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4729092/\n[ena]: https://www.ebi.ac.uk/ena/data/view/PRJEB13208\n[release]: https://github.com/lh3/CHM-eval/releases\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Fchm-eval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flh3%2Fchm-eval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Fchm-eval/lists"}