Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/lpryszcz/pyscaf
Genome assembly scaffolding using information from paired-end/mate-pair libraries, long reads, and synteny to closely related species.
https://github.com/lpryszcz/pyscaf
genome long-reads reference scaffolding synteny
Last synced: 3 days ago
JSON representation
Genome assembly scaffolding using information from paired-end/mate-pair libraries, long reads, and synteny to closely related species.
- Host: GitHub
- URL: https://github.com/lpryszcz/pyscaf
- Owner: lpryszcz
- License: gpl-3.0
- Created: 2016-03-18T08:54:14.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2018-12-06T14:32:57.000Z (about 6 years ago)
- Last Synced: 2024-12-09T03:44:20.219Z (13 days ago)
- Topics: genome, long-reads, reference, scaffolding, synteny
- Language: Python
- Size: 3.07 MB
- Stars: 24
- Watchers: 6
- Forks: 11
- Open Issues: 10
-
Metadata Files:
- Readme: README.rst
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
README
.. contents:: Table of Contents
pyScaf
======pyScaf orders contigs from genome assemblies utilising several types of information:
- paired-end (PE) and/or mate-pair libraries (`NGS-based mode <#NGS-based scaffolding>`_)
- long reads (`Scaffolding based on long reads <#Scaffolding based on long reads>`_)
- synteny to the genome of some related species (`Reference-based scaffolding <#Reference-based-scaffolding>`_)=================
Scaffolding modes
=================NGS-based scaffolding
~~~~~~~~~~~~~~~~~~~~~
This is under development... Stay tuned.Scaffolding based on long reads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this mode, pyScaf aligns long reads onto the contigs, identifies the reads the connects two or more contigs and join adjacent contigs.Long reads are aligned locally onto contigs, ignoring:
- matches not satisfying cut-offs (``--identity`` and ``--overlap``)
- suboptimal matches (only best match of each query to reference is kept)
- and removing overlapping matches on reference.**Note, this is experimental implementation.**
Reference-based scaffolding
~~~~~~~~~~~~~~~~~~~~~~~~~~~
In reference-based mode, pyScaf uses synteny to the genome of closely related species in order to order contigs and estimate distances between adjacent contigs.Contigs are aligned locally onto reference chromosomes, ignoring:
- matches not satisfying cut-offs (``--identity`` and ``--overlap``)
- suboptimal matches (only best match of each query to reference is kept)
- and removing overlapping matches on reference.In preliminary tests, pyScaf performed superbly on simulated heterozygous genomes based on *C. parapsilosis* (13 Mb; CANPA) and *A. thaliana* (119 Mb; ARATH) chromosomes, reconstructing correctly all chromosomes always for CANPA and nearly always for ARATH (`Figures in dropbox `_, `CANPA table `_, `ARATH table `_).
Runs took ~0.5 min for CANPA on ``4 CPUs`` and ~2 min for ARATH on ``16 CPUs``.**Important remarks:**
- Reduce your assembly before (fasta2homozygous.py) as any redundancy will likely break the synteny.
- pyScaf works better with contigs than scaffolds, as scaffolds are often affected by mis-assemblies (no *de novo assembler* / scaffolder is perfect...), which breaks synteny.
- pyScaf works very well if divergence between reference genome and assembled contigs is below 20% at nucleotide level.
- pyScaf deals with large rearrangements ie. deletions, insertion, inversions, translocations. **Note however, this is experimental implementation!**
- Consider closing gaps after scaffolding.=====
Usage
=====
Dependencies
~~~~~~~~~~~~
- `LAST v700+ `_
- `FastaIndex `_Parameters
~~~~~~~~~~
Given reference genome, the program generates pairwise genome alignment (dotplots) by default.- Genral options:
-h, --help show this help message and exit
-f FASTA, --fasta FASTA
assembly FASTA file
-o OUTPUT, --output OUTPUT
output stream [scaffolds.fa]
-t THREADS, --threads THREADS
max no. of threads to run [4]
--log LOG output log to [stderr]
--dotplot
generate dotplot as [png]
--version show program's version number and exit- Reference-based scaffolding options:
-r REF, --ref REF, --reference REF
reference FastA file
--identity IDENTITY min. identity [0.33]
--overlap OVERLAP min. overlap [0.66]
-g MAXGAP, --maxgap MAXGAP
max. distance between adjacent contigs [0.01 * assembly_size]
--norearrangements high identity mode (rearrangements not allowed)- Long read-based scaffolding options (EXPERIMENTAL!):
-n LONGREADS, --longreads LONGREADS
FastQ/FastA file(s) with PacBio/ONT reads- NGS-based scaffolding options (!NOT IMPLEMENTED!):
-i FASTQ, --fastq FASTQ
FASTQ PE/MP files
-j JOINS, --joins JOINS
min pairs to join contigs [5]
-a LINKRATIO, --linkratio LINKRATIO
max link ratio between two best contig pairs [0.7]
-l LOAD, --load LOAD align subset of reads [0.2]
-q MAPQ, --mapq MAPQ min mapping quality [10]Test run
~~~~~~~~
To perform reference-based assembly, provide assembled contigs and reference genome in FastA format.
Dotplots of below runs can be found in `docs `_.
If you wish to skip dotplot generation (ie. no X11 on your system), provide ``--dotplot ''`` parameter... code-block:: bash
# scaffold homogenised assembly (reduced contigs)
./pyScaf.py -f test/contigs.reduced.fa -r test/ref.fa -o test/contigs.reduced.ref.fa# scaffold reduced contigs using global mode (no norearrangements allowed)
./pyScaf.py -f test/contigs.reduced.fa -r test/ref.fa -o test/contigs.reduced.ref.global.fa --norearrangements# scaffold heterozygous assembly (de novo assembled contigs)
./pyScaf.py -f test/contigs.fa -r test/ref.fa -o test/contigs.ref.fa# scaffold reduced contigs using long reads
## pacbio
./pyScaf.py -f test/contigs.reduced.fa -n test/pacbio.fq.gz -o test/contigs.reduced.pacbio.fa
## nanopore
./pyScaf.py -f test/contigs.reduced.fa -n test/nanopore.fa.gz -o test/contigs.reduced.nanopore.fa# generate dotplot
lastdb test/ref.fa
lastal -f TAB test/ref.fa test/contigs.reduced.pacbio.fa | last-dotplot - test/contigs.reduced.pacbio.fa.ref.png
lastal -f TAB test/ref.fa test/contigs.reduced.nanopore.fa | last-dotplot - test/contigs.reduced.nanopore.fa.ref.png# clean-up
#rm test/contigs.{,reduced.}fa.* test/ref.fa.* test/*.{nanopore,pacbio,ref}* test/*.log================
Proof of concept
================
pyScaf is under heavy development right now.
Nevertheless, both the reference-based mode and long-read mode are functional and produces meaningful assemblies.
pyScaf has been implemented in `Redundans `_.For more info, have a look in `workbook `_.