https://github.com/tanghaibao/allhic
Genome scaffolding based on HiC data in heterozygous and high ploidy genomes
- Host: GitHub
- URL: https://github.com/tanghaibao/allhic
- Owner: tanghaibao
- License: bsd-3-clause
- Created: 2017-11-06T22:19:28.000Z (about 7 years ago)
- Default Branch: main
- Last Pushed: 2021-05-24T04:39:23.000Z (over 3 years ago)
- Last Synced: 2024-10-13T02:24:32.167Z (3 months ago)
- Topics: contigs, genetic-algorithm, genome-assembly, genome-scaffolding, genomics, golang, heterozygous, hi-c, inter-contig-links, lachesis, pipeline, polyploid
- Language: Jupyter Notebook
- Homepage:
- Size: 28.6 MB
- Stars: 59
- Watchers: 4
- Forks: 12
- Open Issues: 10
Metadata Files:
- Readme: README.md
- License: LICENSE
# ALLHIC: Genome scaffolding based on Hi-C data

[![Github Actions](https://github.com/tanghaibao/allhic/workflows/build/badge.svg)](https://github.com/tanghaibao/allhic/actions)

|         |                                                               |
| ------- | ------------------------------------------------------------- |
| Authors | Haibao Tang ([tanghaibao](http://github.com/tanghaibao))      |
|         | Xingtan Zhang ([tangerzhang](https://github.com/tangerzhang)) |
| Email   |                                                               |
| License | [BSD](http://creativecommons.org/licenses/BSD/)               |

## Introduction

**We currently recommend only using this program in a scripted pipeline, as detailed
[here](https://github.com/tangerzhang/ALLHiC/wiki).**

ALLHiC can be used to scaffold genomic contigs based on Hi-C data, and is
particularly effective for auto-polyploid or heterozygous diploid genomes.

## Installation

The easiest way to install allhic is to download the latest binary from
the [releases](https://github.com/tanghaibao/allhic/releases) page and make sure to
`chmod +x` the resulting binary.

If you are using [go](https://github.com/golang/go), you can build from source with:

```console
go get -u -t -v github.com/tanghaibao/allhic/...
go install github.com/tanghaibao/allhic/cmd/allhic
```

## Usage

### Extract

Extract does a fair amount of preprocessing: 1) extracts inter-contig links into a more compact form, specifically into `.clm`; 2) extracts intra-contig links and builds a distribution; 3) counts the restriction sites to be used in normalization (similar to LACHESIS); 4) bundles the inter-contig links into pairs of contigs.

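As a rough illustration of step 3, the sketch below counts occurrences of a restriction motif in each contig. It is a minimal sketch and not ALLHiC's implementation: the function names `count_sites` and `count_table` are hypothetical, and the default motif `GATC` is assumed only from the `counts_GATC` file names used in this README.

```python
def count_sites(seq, motif="GATC"):
    """Count (possibly overlapping) occurrences of motif in seq, ignoring case."""
    seq, motif = seq.upper(), motif.upper()
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        # Resume one base later so that overlapping hits are also counted.
        start = seq.find(motif, start + 1)
    return count


def count_table(contigs, motif="GATC"):
    """Map each contig name to a (site count, sequence length) tuple."""
    return {name: (count_sites(seq, motif), len(seq)) for name, seq in contigs.items()}


if __name__ == "__main__":
    # Toy sequences for illustration only (hypothetical contig names).
    toy = {"ctg1": "GATCAAGATC", "ctg2": "acgtacgt"}
    for name, (sites, length) in count_table(toy).items():
        print(name, sites, length)
```

Counts like these can serve as a per-contig normalization factor, since a contig with more restriction sites is expected to collect more Hi-C links.
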
```console
allhic extract tests/test.bam tests/seq.fasta.gz
```

### Prune

This prune step is **optional** for typical inbred diploid genomes.
However, pruning will improve the assembly quality of polyploid genomes.
Prune the pairs file to remove allelic/cross-allelic links.

```console
allhic prune tests/Allele.ctg.table tests/test.pairs.txt
```

Please see the help string of `allhic prune` for the format of
`Allele.ctg.table`.

### Partition

Given a target `k`, the number of partitions, the goal of partitioning
is to separate all the contigs into `k` clusters. As with all
clustering algorithms, there is an optimization goal here. The
LACHESIS algorithm is a hierarchical clustering algorithm using
average links, which is the same method used by ALLHIC.

![networkbefore](images/graph-s.png)
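
To make the idea of average-link hierarchical clustering concrete, here is a minimal pure-Python sketch. It is only an illustration under stated assumptions: the function name `average_linkage` and the toy link matrix are hypothetical, link weights are used raw (no restriction-site normalization), and the naive loop favours clarity over speed.

```python
def average_linkage(links, k):
    """Merge clusters by highest average pairwise link until k clusters remain.

    links: symmetric square list of lists holding non-negative link weights.
    Returns a sorted list of clusters, each a sorted list of contig indices.
    """
    clusters = [[i] for i in range(len(links))]

    def avg_link(a, b):
        # Average link weight over all contig pairs spanning the two clusters.
        total = sum(links[i][j] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > max(k, 1):
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = avg_link(clusters[x], clusters[y])
                if best is None or score > best[0]:
                    best = (score, x, y)
        _, x, y = best
        merged = sorted(clusters[x] + clusters[y])
        clusters = [c for i, c in enumerate(clusters) if i not in (x, y)] + [merged]
    return sorted(clusters)


if __name__ == "__main__":
    # Contigs 0-1 and 2-3 are strongly linked; cross links are weak.
    toy = [
        [0, 10, 1, 1],
        [10, 0, 1, 1],
        [1, 1, 0, 8],
        [1, 1, 8, 0],
    ]
    print(average_linkage(toy, 2))
```

With `k = 2`, the toy matrix separates into the two strongly linked pairs, which mirrors the before/after network pictures in this section.
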
![networkafter](images/graph-s.partitioned.png)

```console
allhic partition tests/test.counts_GATC.txt tests/test.pairs.txt
```

Critically, if you have applied the pruning step above, use the "pruned" pairs:

```console
allhic partition tests/test.counts_GATC.txt tests/test.pairs.prune.txt
```

### Optimize

Given a set of Hi-C contacts between contigs, as specified in the
clmfile, reconstruct the highest scoring ordering and orientations
for these contigs.

Optimize uses a genetic algorithm (GA) to search for the best scoring solution.
GA has been successfully applied to genome scaffolding tasks in the past
(see ALLMAPS; [Tang et al. _Genome Biology_, 2015](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0573-1)).

![ga](images/test-movie.gif)

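
The following toy sketch shows how a genetic algorithm can search over contig orderings. It is not ALLHiC's scoring function or operator set: the score (link weight divided by separation in the ordering), the inversion mutation, the order crossover, and all function names are assumptions chosen for illustration, and contig orientations are ignored.

```python
import random


def score(order, links):
    """Sum of link weight divided by separation in the ordering (higher is better)."""
    pos = {contig: i for i, contig in enumerate(order)}
    n = len(order)
    return sum(
        links[a][b] / abs(pos[a] - pos[b])
        for a in range(n)
        for b in range(a + 1, n)
    )


def mutate(order, rng):
    """Return a copy of order with one random segment inverted."""
    i, j = sorted(rng.sample(range(len(order)), 2))
    return order[:i] + order[i:j + 1][::-1] + order[j + 1:]


def crossover(p1, p2, rng):
    """Order crossover: keep a slice of p1 in place, fill the rest in p2's order."""
    i, j = sorted(rng.sample(range(len(p1)), 2))
    kept = p1[i:j + 1]
    kept_set = set(kept)
    rest = [c for c in p2 if c not in kept_set]
    return rest[:i] + kept + rest[i:]


def optimize(links, pop_size=20, generations=200, seed=0):
    """Elitist GA: the better half survives, the other half is bred from it."""
    rng = random.Random(seed)
    n = len(links)
    population = [rng.sample(range(n), n) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda o: score(o, links), reverse=True)
        survivors = population[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(survivors), rng.choice(survivors), rng), rng)
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    return max(population, key=lambda o: score(o, links))


if __name__ == "__main__":
    # Five contigs forming a chain: only neighbours i and i+1 share links.
    chain = [[10 if abs(i - j) == 1 else 0 for j in range(5)] for i in range(5)]
    best = optimize(chain)
    print(best, score(best, chain))
```

Because survivors are carried over unchanged, the best score in the population never decreases from one generation to the next.
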
```console
allhic optimize tests/test.counts_GATC.2g1.txt tests/test.clm
allhic optimize tests/test.counts_GATC.2g2.txt tests/test.clm
```

### Build

Build genome release, including `.agp` and `.fasta` output.
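
For readers unfamiliar with `.agp`, the sketch below turns an ordered, oriented list of contigs into AGP-style rows. It is a hedged illustration only: the function name `agp_lines`, the input tuple layout, the fixed 100 bp gap, and the gap fields (`U`, `scaffold`, `yes`, `proximity_ligation`) are assumptions based on the general NCBI AGP conventions, not a description of what `allhic build` writes.

```python
def agp_lines(obj, tour, gap=100):
    """Yield tab-separated AGP rows for one scaffold.

    obj:  name of the scaffold or chromosome being built.
    tour: list of (contig_name, length, orientation), orientation '+' or '-'.
    gap:  assumed fixed gap length placed between adjacent contigs.
    """
    pos, part = 1, 1
    for idx, (name, length, orient) in enumerate(tour):
        if idx > 0:
            # Gap row of unknown size (type U) between neighbouring contigs.
            row = [obj, pos, pos + gap - 1, part, "U", gap,
                   "scaffold", "yes", "proximity_ligation"]
            yield "\t".join(map(str, row))
            pos += gap
            part += 1
        # Component row: the whole contig (1..length) in the given orientation.
        row = [obj, pos, pos + length - 1, part, "W", name, 1, length, orient]
        yield "\t".join(map(str, row))
        pos += length
        part += 1


if __name__ == "__main__":
    # Hypothetical contig names and lengths.
    for line in agp_lines("chr1", [("ctgA", 500, "+"), ("ctgB", 300, "-")]):
        print(line)
```

Each row records where a contig or gap sits on the assembled object, which is the information needed to rebuild the `.fasta` from the original contigs.
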
```console
allhic build tests/test.counts_GATC.2g?.tour tests/seq.fasta.gz tests/asm-2g.chr.fasta
```

### Plot

Use [d3.js](https://d3js.org/) to visualize the heatmap.
```console
allhic plot tests/test.bam tests/test.counts_GATC.2g1.tour
```

![allhicplot](images/allhic-plot-s.png)

## Pipeline
Please see detailed steps in a scripted pipeline [here](https://github.com/tangerzhang/ALLHiC/wiki).
## WIP features
- [x] Add partition split inside "partition"
- [x] Use clustering when k = 1
- [x] Isolate matrix generation to "plot"
- [x] Add "pipeline" to simplify execution
- [x] Make "build" to merge subgroup tours
- [x] Provide better error messages for "file not found"
- [ ] Plot the boundary of the contigs in "plot" using genome.json
- [ ] Add dot plot to "plot"
- [ ] Compare numerical output with Lachesis
- [ ] Improve Ler0 results
- [ ] Translate "prune" from C++ code to golang
- [ ] Add test suites

## Reference

Xingtan Zhang, Shengcheng Zhang, Qian Zhao, Ray Ming & Haibao Tang. Assembly of
allele-aware, chromosomal-scale autopolyploid genomes based on Hi-C data. (2019) _Nature
Plants._ [link](https://www.nature.com/articles/s41477-019-0487-8)