{"id":13752561,"url":"https://github.com/lh3/hickit","last_synced_at":"2025-08-11T05:10:03.866Z","repository":{"id":139699667,"uuid":"128789419","full_name":"lh3/hickit","owner":"lh3","description":"TAD calling, phase imputation, 3D modeling and more for diploid single-cell Hi-C (Dip-C) and general Hi-C","archived":false,"fork":false,"pushed_at":"2021-02-04T01:47:43.000Z","size":4723,"stargazers_count":107,"open_issues_count":17,"forks_count":11,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-05-07T08:12:15.588Z","etag":null,"topics":["bioinformatics","genomics","hi-c"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lh3.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2018-04-09T15:00:38.000Z","updated_at":"2025-04-28T03:01:47.000Z","dependencies_parsed_at":null,"dependency_job_id":"44d83220-7edb-4f0b-bada-b93778775720","html_url":"https://github.com/lh3/hickit","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/lh3/hickit","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fhickit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fhickit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fhickit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fhickit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lh3","download_url":"https://codeload.github.com/lh3/hickit/tar.gz/refs/heads/master","sbom_url":"https://repos.
ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fhickit/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":269832883,"owners_count":24482330,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-11T02:00:10.019Z","response_time":75,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","genomics","hi-c"],"created_at":"2024-08-03T09:01:07.551Z","updated_at":"2025-08-11T05:10:03.821Z","avatar_url":"https://github.com/lh3.png","language":"C","funding_links":[],"categories":["Ranked by starred repositories"],"sub_categories":[],"readme":"## \u003ca name=\"start\"\u003e\u003c/a\u003eGetting Started\n```sh\n# Download precompiled binaries for Linux\ncurl -L https://github.com/lh3/hickit/releases/download/v0.1/hickit-0.1_x64-linux.tar.bz2 | tar -jxf -\ncd hickit-0.1_x64-linux\n\n# Map Dip-C reads and extract contacts (skip if you use your own pipeline)\n./seqtk mergepe read1.fq.gz read2.fq.gz | ./pre-dip-c - | bwa mem -5SP -p hs37d5.fa - | gzip \u003e aln.sam.gz\n./k8 hickit.js vcf2tsv phased.vcf \u003e phased_SNP.tsv   # extract phased SNPs from VCF\n./k8 hickit.js sam2seg -v phased_SNP.tsv aln.sam.gz | ./k8 hickit.js chronly - | ./k8 hickit.js bedflt par.bed - | gzip \u003e contacts.seg.gz # for male\n#./k8 hickit.js sam2seg -v phased_SNP.tsv aln.sam.gz | ./k8 hickit.js chronly -y - | gzip \u003e contacts.seg.gz # for female\n./hickit -i 
contacts.seg.gz -o - | bgzip \u003e contacts.pairs.gz  # optional\n\n# Impute phases (-i also works with contacts.seg.gz)\n./hickit -i contacts.pairs.gz -u -o - | bgzip \u003e impute.pairs.gz\n./hickit -i contacts.pairs.gz --out-val=impute.val     # estimate imputation accuracy by holdout\n# Infer 3D structure\n./hickit -i impute.pairs.gz -Sr1m -c1 -r10m -c5 -b4m -b1m -b200k -D5 -b50k -D5 -b20k -O imput.3dg\n\n# 2D contact map in PNG (bin size determined by the image width)\n./hickit -i impute.pairs.gz --out-png impute.png\n# Compute CpG density (optional)\n./hickit.js gfeat -r hs37d5.fa.gz imput.3dg | gzip \u003e imput.cpg.3dg.gz\n# Visualize 3D structure (requiring a graphics card)\n./hickit-gl -I imput.cpg.3dg.gz --view\n```\n\n## Table of Contents\n\n* [Getting Started](#start)\n* [Introduction](#intro)\n* [Installation](#install)\n* [Users' Guide](#guide)\n  - [Terminologies](#term)\n  - [File formats](#format)\n    - [The pairs format](#pairs)\n    - [The 3dg format](#3dg)\n    - [The seg format](#seg)\n  - [Generating contacts in the pairs format](#gen-pairs)\n    - [Aligning Hi-C reads](#aln-hic)\n    - [Extracting contact pairs](#extract-pairs)\n  - [Imputing missing phases (diploid single-cell Hi-C only)](#impute)\n  - [Inferring 3D structures (single-cell only)](#infer-3d)\n* [Related Projects](#related)\n* [Limitations](#limit)\n\n## \u003ca name=\"intro\"\u003e\u003c/a\u003eIntroduction\n\nHickit is a set of tools initially developed to process diploid single-cell\nHi-C data. It extracts contact pairs from read alignment, identifies phases of\ncontacts overlapping with SNPs of known phases, imputes missing phases, infers\nthe 3D structure of a single cell and visualizes the structure. Part of the\nhickit functionality also works with bulk Hi-C data. 
In particular, hickit\nimplements a fast (untested) binning-free TAD calling algorithm and an\nefficient neighboring contacts counter which can be adapted to ultrafast loop\ncalling.\n\n## \u003ca name=\"install\"\u003e\u003c/a\u003eInstallation\n\nHickit depends on [zlib][zlib]. The command-line tools can be compiled by\ntyping `make` in the source code directory. The 3D viewer further requires\nOpenGL and GLUT and can be compiled with `make gl=1`.\n\n## \u003ca name=\"guide\"\u003e\u003c/a\u003eUsers' Guide\n\nHickit keeps one list of contacts and sometimes one 3D structure in memory. It\nhas two types of command-line switches: actions and settings. An action switch\nmodifies the in-memory bulk data or outputs them; a setting switch changes\nparameters but doesn't modify bulk data. Hickit applies switches sequentially\nas they appear on the command line. As such, **the order of command-line switches\noften affects the final result**.\n\nThe following command line does imputation and multiple rounds of 3D\nreconstruction altogether:\n```sh\nhickit -i in.pairs -u -o imput.pairs -Sr1m -c1 -r10m -c5 -b4m -b1m -b200k -D5 -b50k -D5 -b20k -O out.3dg\n```\nIt reads input contacts (action `-i`), imputes missing phases (action `-u`) and\noutputs imputed contacts (action `-o`), which are still stored in memory. Then\nhickit separates the two homologous chromosomes (action `-S`), filters isolated\ncontacts in two rounds (`-r1m -c1 -r10m -c5`, where `-r` is a setting and `-c`\nis an action), and applies five rounds of 3D modeling (the five `-b` actions)\nwith each at a higher resolution. 
The final structure has a resolution of 20kb and is written\nto file `out.3dg` (action `-O`).\n\nThis long command line can be decomposed into shorter ones by keeping more\nintermediate files:\n```sh\nhickit -i in.pairs -u -o imput.pairs\nhickit -i imput.pairs -Sr1m -c1 -r10m -c5 -o imput.flt.pairs\nhickit -i imput.flt.pairs -b4m -b1m -b200k -O coarse.3dg\nhickit -i imput.flt.pairs -I coarse.3dg -D5 -b50k -D5 -b20k -O out.3dg\n```\nIt is also possible to output intermediate files by using more output actions\nin the long command line.\n\n### \u003ca name=\"term\"\u003e\u003c/a\u003eTerminologies\n\nA *contact* is a pair of chromosomal coordinates that are supposed to be close\nto each other, inferred from Hi-C or other 3C technologies. A *contact pair* or\nsometimes a *pair* is taken as a synonym of contact. A *leg* is one of the two\nchromosomal coordinates in a contact pair.\n\n### \u003ca name=\"format\"\u003e\u003c/a\u003eFile formats\n\n#### \u003ca name=\"pairs\"\u003e\u003c/a\u003eThe pairs format\n\nHickit takes the [pairs format][pairs-fmt] as the primary data format to store\nraw contact pairs, binned pairs and phasing information. It uses `phase1` and\n`phase2` columns to store phasing. For example,\n```txt\n#columns: readID chr1 pos1 chr2 pos2 strand1 strand2 phase1 phase2\n.   1   3194588 1   4266988 -   +   .   0\n.   1   3195262 1   7393633 +   +   .   .\n.   1   3201962 1   6016262 +   -   1   .\n```\nPhase imputation estimates the probabilities of the four possible phases in a\ndiploid genome, which are written to the p00, p01, p10 and p11 columns like\n```txt\n#columns: readID chr1 pos1 chr2 pos2 strand1 strand2 p00 p01 p10 p11\n.   1   3194588 1   4266988 -   +   0.990   0.000   0.010   0.000\n.   1   3195262 1   7393633 +   +   0.605   0.005   0.005   0.385\n.   
1   3201962 1   6016262 +   -   0.000   0.000   0.010   0.990\n```\n\n#### \u003ca name=\"3dg\"\u003e\u003c/a\u003eThe 3dg format\n\nHickit describes the 3D genomic coordinates in the following format:\n```txt\n1a  3360000 0.377249    -0.280691   -0.861085   0.030120\n1a  3560000 0.406092    -0.173746   -0.795618   0.032090\n1a  3580000 0.429502    -0.092491   -0.822528   0.027910\n```\nwhere each line consists of chr name, start position, X, Y and Z coordinates.\nThe 6th column optionally stores a feature value (CpG density in this example).\nHickit's 3D viewer may color chromosomes by this column if present.\n\n#### \u003ca name=\"seg\"\u003e\u003c/a\u003eThe seg format\n\nThis is an intermediate format used by hickit to store raw contacts directly\ninferred from read alignment. It is generally advised to convert this format to\npairs with:\n```sh\n./hickit --dup-dist=0 --min-leg-dist=0 -i contacts.seg.gz -o contacts.pairs\n```\n\n### \u003ca name=\"gen-pairs\"\u003e\u003c/a\u003eGenerating contacts in the pairs format\n\nIf you have your own pipeline to produce contact pairs, please ignore this\nsection.\n\n#### \u003ca name=\"aln-hic\"\u003e\u003c/a\u003eAligning Hi-C reads\n\nIf you have normal Hi-C reads, you can align directly with [bwa-mem][bwa]:\n```sh\nbwa mem -5SP hs37d5.fa read1.fq.gz read2.fq.gz | gzip \u003e aln.sam.gz\n```\nNote that the hickit pipeline only works with bwa-mem or minimap2 because most\nother aligners do not produce chimeric alignments.\n\nIf you have Dip-C reads, you need to preprocess the reads with `pre-dip-c` from\nthe [pre-pe][pre-pe] repository and then align with [bwa-mem][bwa]:\n```sh\nseqtk mergepe read1.fq.gz read2.fq.gz | pre-dip-c - | bwa mem -5SP -p hs37d5.fa - | gzip \u003e aln.sam.gz\n```\n\n#### \u003ca name=\"extract-pairs\"\u003e\u003c/a\u003eExtracting contact pairs\n\nWhen you don't have phasing information, you can generate contact pairs with\n```sh\nhickit.js sam2seg aln.sam.gz | hickit.js chronly - | gzip \u003e contacts.seg.gz\nhickit -i 
contacts.seg.gz -o - | bgzip \u003e contacts.pairs.gz\n```\nWhen you have phased SNPs in VCF, you can generate contact pairs with the phase columns\n```sh\nhickit.js vcf2tsv NA12878_phased.vcf.gz \u003e phased_SNP.tsv\nhickit.js sam2seg -v phased_SNP.tsv aln.sam.gz | hickit.js chronly - | hickit.js bedflt par.bed - | gzip \u003e contacts.seg.gz\nhickit -i contacts.seg.gz -o - | bgzip \u003e contacts.pairs.gz\n```\nwhere `hickit.js chronly` filters out non-chromosomal contigs and\n`phased_SNP.tsv` lists phased SNPs and looks like\n```\nchr1    1010717 C       T\nchr1    1011531 T       C\nchr1    1013136 C       G\n```\nNote that the above is for **male** samples. Here the pseudoautosomal regions (PARs, coordinates supplied in `par.bed`) are excluded from analysis. For **female** samples, the part `hickit.js chronly - | hickit.js bedflt par.bed -` should be replaced by `hickit.js chronly -y -` to remove the Y chromosome instead.\n\n### \u003ca name=\"impute\"\u003e\u003c/a\u003eImputing missing phases (diploid single-cell Hi-C only)\n\nBecause SNPs are sparse, only a tiny fraction of contacts are fully phased. To\nimpute missing phases, you can run\n```sh\nhickit -i contacts.pairs.gz -u -o - | bgzip \u003e impute.pairs.gz\n```\nThe output is still in the pairs format. The last four columns give the\npseudo-probability of four possible phases, inferred by an EM-like algorithm. A\nnumber 0.75 or above is generally considered reliable based on held-out\nvalidation, which can be performed with\n```sh\nhickit -i contacts.pairs.gz --out-val impute.val\n```\nThis command line holds out 10% of legs with known phases, imputes them back\nfrom other contacts and then estimates the accuracy. 
The output is TAB-delimited,\nwith each line consisting of the probability threshold, sensitivity of\nintra-chromosome contacts close to the diagonal, accuracy of such contacts,\nsensitivity of off-diagonal contacts, accuracy of such contacts, sensitivity of\nall contacts and accuracy of all contacts.\n\n### \u003ca name=\"infer-3d\"\u003e\u003c/a\u003eInferring 3D structures (single-cell only)\n\nThe following command line is used to infer the 3D structures of data published\nin the Dip-C paper.\n```sh\nhickit -i impute.pairs.gz -Sr1m -c1 -r10m -c5 -b4m -b1m -b200k -D5 -b50k -D5 -b20k -O out.3dg\n```\nIt filters isolated contacts and then iteratively infers structures in multiple\nrounds. Each round takes the previous structure as the baseline and infers a\nstructure of higher resolution.\n\nTo crudely check the quality of a 3D structure, we encourage you to compute the CpG\ndensity with\n```sh\nhickit.js gfeat -r hs37d5.fa impute.3dg.gz | gzip \u003e impute.cpg.3dg.gz\n```\nFor PBMC and LCL cells, we typically see low-CpG regions placed at the\nperiphery, which leads to a magenta ball (on the left; image produced by the\n`--view` action of hickit). For these cell types, a problematic inference\noften has large areas of greens (high CpG density; on the bottom).\n\n\u003cimg src=\"doc/pbmc_05.png\" alt=\"pbmc_05\" /\u003e\n\nIt should be noted that although cells of the same type are generally\nassociated with some features (e.g. low-CpG regions at the periphery), the\nspatial adjacencies of chromosomes are often distinct. Don't be surprised if\nthe 3D structures of two cells look very different.\n\n## \u003ca name=\"related\"\u003e\u003c/a\u003eRelated Projects\n\n[Dip-c][dip-c-repo] is the primary pipeline used in the Dip-C paper (in\nreview) and has deeply influenced the development of hickit. Hickit in turn\noptimizes and simplifies multiple steps in the dip-c pipeline. 
It can reproduce\nseveral main conclusions in the paper and occasionally improve the structure.\nHickit also learns from [nuc\\_dynamics][nuc-dyn] on single-cell 3D genome\nmodeling.\n\n## \u003ca name=\"limit\"\u003e\u003c/a\u003eLimitations\n\nHickit was originally developed for single-cell diploid Hi-C data. Although\nsome of its functionality potentially works with bulk Hi-C, it is not well\ntested. Please raise issues or contact me if you want to try hickit on bulk\nHi-C and run into trouble. I would really appreciate it.\n\n[zlib]: http://zlib.net\n[pre-pe]: https://github.com/lh3/pre-pe\n[bwa]: https://github.com/lh3/bwa\n[pairs-fmt]: https://github.com/4dn-dcic/pairix/blob/master/pairs_format_specification.md\n[nuc-dyn]: https://github.com/tjs23/nuc_dynamics\n[dip-c-repo]: https://github.com/tanlongzhi/dip-c\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Fhickit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flh3%2Fhickit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Fhickit/lists"}