https://github.com/lpryszcz/pyscaf

Genome assembly scaffolding using information from paired-end/mate-pair libraries, long reads, and synteny to closely related species.
genome long-reads reference scaffolding synteny

Host: GitHub
URL: https://github.com/lpryszcz/pyscaf
Owner: lpryszcz
License: gpl-3.0
Created: 2016-03-18T08:54:14.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2018-12-06T14:32:57.000Z (over 6 years ago)
Last Synced: 2024-12-09T03:44:20.219Z (6 months ago)
Topics: genome, long-reads, reference, scaffolding, synteny
Language: Python
Size: 3.07 MB
Stars: 24
Watchers: 6
Forks: 11
Open Issues: 10
Metadata Files:
- Readme: README.rst
- Changelog: CHANGELOG.md
- License: LICENSE

.. contents:: Table of Contents

pyScaf
======

pyScaf orders contigs from genome assemblies utilising several types of information:

- paired-end (PE) and/or mate-pair libraries (`NGS-based mode <#NGS-based scaffolding>`_)

- long reads (`Scaffolding based on long reads <#Scaffolding based on long reads>`_)

- synteny to the genome of some related species (`Reference-based scaffolding <#Reference-based-scaffolding>`_)

=================
Scaffolding modes
=================

NGS-based scaffolding

~~~~~~~~~~~~~~~~~~~~~

This mode is under development and not implemented yet. Stay tuned.

Scaffolding based on long reads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this mode, pyScaf aligns long reads onto the contigs, identifies the reads that connect two or more contigs, and joins adjacent contigs.

Long reads are aligned locally onto contigs, and the following matches are discarded:

- matches not satisfying the cut-offs (``--identity`` and ``--overlap``)

- suboptimal matches (only the best match of each query to the reference is kept)

- matches overlapping other matches on the reference

**Note: this is an experimental implementation.**
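
The idea of finding reads that connect contigs can be illustrated with a minimal sketch. It is not pyScaf's actual code: the function name ``contig_links`` and the ``(read, contig, read_start)`` tuples are assumptions made for illustration, and contig orientation and distances are ignored.

.. code-block:: python

    from collections import Counter, defaultdict

    def contig_links(hits):
        """Count reads supporting each pair of contigs adjacent along a read.

        hits: iterable of (read, contig, read_start) tuples, one per retained match.
        (Hypothetical input format, not pyScaf's internal representation.)
        """
        by_read = defaultdict(list)
        for read, contig, read_start in hits:
            by_read[read].append((read_start, contig))
        links = Counter()
        for matches in by_read.values():
            # order the contigs by where they align on the read
            contigs = [c for _, c in sorted(matches)]
            for a, b in zip(contigs, contigs[1:]):
                if a != b:
                    # store the pair in a canonical order, ignoring orientation
                    links[tuple(sorted((a, b)))] += 1
        return links

    hits = [
        ("read1", "ctgA", 0), ("read1", "ctgB", 5200),
        ("read2", "ctgB", 100), ("read2", "ctgA", 4000),
        ("read3", "ctgB", 0),  # matches a single contig, so it links nothing
    ]
    print(dict(contig_links(hits)))

Pairs supported by several reads would be the natural candidates for joining; how many reads pyScaf requires is not stated here.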

Reference-based scaffolding
~~~~~~~~~~~~~~~~~~~~~~~~~~~

In reference-based mode, pyScaf uses synteny to the genome of a closely related species to order contigs and estimate the distances between adjacent contigs.

Contigs are aligned locally onto reference chromosomes, and the following matches are discarded:

- matches not satisfying the cut-offs (``--identity`` and ``--overlap``)

- suboptimal matches (only the best match of each query to the reference is kept)

- matches overlapping other matches on the reference
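
The filtering and ordering steps above can be sketched as follows. This is an illustration under stated assumptions, not pyScaf's implementation: the ``Hit`` record and both function names are hypothetical, overlap is taken to mean the aligned fraction of the contig, and overlapping matches are resolved greedily by score.

.. code-block:: python

    from collections import namedtuple

    # Hypothetical alignment record; the field names are illustrative only.
    Hit = namedtuple("Hit", "contig contig_len chrom start end identity score")

    def filter_hits(hits, min_identity=0.33, min_overlap=0.66):
        # 1. apply the identity and overlap cut-offs
        passed = [h for h in hits
                  if h.identity >= min_identity
                  and (h.end - h.start) / h.contig_len >= min_overlap]
        # 2. keep the best-scoring match of each contig
        best = {}
        for h in passed:
            if h.contig not in best or h.score > best[h.contig].score:
                best[h.contig] = h
        # 3. accept hits by decreasing score, skipping those that overlap
        #    an already accepted hit on the same chromosome
        kept = []
        for h in sorted(best.values(), key=lambda h: h.score, reverse=True):
            if all(h.chrom != k.chrom or h.end <= k.start or h.start >= k.end
                   for k in kept):
                kept.append(h)
        return kept

    def order_with_gaps(hits):
        """Yield (chrom, contig, gap_to_previous_contig) in reference order."""
        prev = None
        for h in sorted(hits, key=lambda h: (h.chrom, h.start)):
            same_chrom = prev is not None and prev.chrom == h.chrom
            yield h.chrom, h.contig, (h.start - prev.end if same_chrom else 0)
            prev = h

    hits = [
        Hit("c1", 1000, "chr1", 0, 950, 0.95, 900),
        Hit("c2", 1000, "chr1", 1200, 2150, 0.90, 850),
        Hit("c3", 1000, "chr1", 900, 1800, 0.80, 700),   # overlaps c1 and c2
        Hit("c4", 1000, "chr1", 3000, 3300, 0.99, 300),  # covers 30% of the contig
    ]
    print(list(order_with_gaps(filter_hits(hits))))

The default cut-offs mirror the ``--identity`` and ``--overlap`` defaults listed under Parameters; the gap between two neighbours is simply the reference distance between the end of one match and the start of the next.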

In preliminary tests on simulated heterozygous genomes based on *C. parapsilosis* (13 Mb; CANPA) and *A. thaliana* (119 Mb; ARATH) chromosomes, pyScaf reconstructed all chromosomes correctly in all cases for CANPA and in nearly all cases for ARATH (figures in Dropbox, CANPA table, ARATH table).

Runs took ~0.5 min for CANPA on ``4 CPUs`` and ~2 min for ARATH on ``16 CPUs``. 

**Important remarks:**

- Reduce your assembly beforehand (``fasta2homozygous.py``), as any redundancy will likely break the synteny.

- pyScaf works better with contigs than with scaffolds, as scaffolds are often affected by mis-assemblies (no *de novo* assembler / scaffolder is perfect...), which break synteny.

- pyScaf works very well if the divergence between the reference genome and the assembled contigs is below 20% at the nucleotide level.

- pyScaf deals with large rearrangements, i.e. deletions, insertions, inversions and translocations. **Note, however, that this is an experimental implementation!**

- Consider closing gaps after scaffolding. 

=====
Usage
=====

Dependencies
~~~~~~~~~~~~

- LAST v700+

- FastaIndex

Parameters
~~~~~~~~~~

When a reference genome is given, the program generates pairwise genome alignment plots (dotplots) by default.

- General options:

  -h, --help            show this help message and exit
  -f FASTA, --fasta FASTA
                        assembly FASTA file
  -o OUTPUT, --output OUTPUT
                        output stream [scaffolds.fa]
  -t THREADS, --threads THREADS
                        max no. of threads to run [4]
  --log LOG             output log to [stderr]
  --dotplot
                        generate dotplot as [png]
  --version             show program's version number and exit

- Reference-based scaffolding options:

  -r REF, --ref REF, --reference REF
                        reference FastA file
  --identity IDENTITY   min. identity [0.33]
  --overlap OVERLAP     min. overlap  [0.66]
  -g MAXGAP, --maxgap MAXGAP
                        max. distance between adjacent contigs [0.01 * assembly_size]
  --norearrangements    high identity mode (rearrangements not allowed)

- Long read-based scaffolding options (EXPERIMENTAL!):

  -n LONGREADS, --longreads LONGREADS
                        FastQ/FastA file(s) with PacBio/ONT reads

- NGS-based scaffolding options (!NOT IMPLEMENTED!):

  -i FASTQ, --fastq FASTQ
                        FASTQ PE/MP files
  -j JOINS, --joins JOINS
                        min pairs to join contigs [5]
  -a LINKRATIO, --linkratio LINKRATIO
                        max link ratio between two best contig pairs [0.7]
  -l LOAD, --load LOAD  align subset of reads [0.2]
  -q MAPQ, --mapq MAPQ  min mapping quality [10]
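
To make the ``--maxgap`` default concrete, here is a small sketch of how ``0.01 * assembly_size`` could be computed from a FASTA file. The helper names are hypothetical and pyScaf may derive the assembly size differently; only the 0.01 factor comes from the option help above.

.. code-block:: python

    def assembly_size(fasta_lines):
        """Sum sequence lengths over FASTA lines, skipping header lines."""
        return sum(len(line.strip()) for line in fasta_lines
                   if not line.startswith(">"))

    def default_maxgap(size, fraction=0.01):
        """Default max. distance between adjacent contigs, in bases."""
        return int(fraction * size)

    # toy assembly: two contigs of 100 and 200 bases
    fasta = [">contig1\n", "ACGT" * 25 + "\n", ">contig2\n", "ACGT" * 50 + "\n"]
    size = assembly_size(fasta)
    print(size, default_maxgap(size))

For a real assembly, pass an open file handle instead of the list, for example ``assembly_size(open("test/contigs.fa"))``.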

Test run
~~~~~~~~

To perform reference-based scaffolding, provide the assembled contigs and a reference genome in FastA format.

Dotplots of the runs below can be found in docs.

If you wish to skip dotplot generation (e.g. no X11 on your system), pass ``--dotplot ''``.

.. code-block:: bash

    # scaffold homogenised assembly (reduced contigs)
    ./pyScaf.py -f test/contigs.reduced.fa -r test/ref.fa -o test/contigs.reduced.ref.fa

    # scaffold reduced contigs using global mode (no rearrangements allowed)
    ./pyScaf.py -f test/contigs.reduced.fa -r test/ref.fa -o test/contigs.reduced.ref.global.fa --norearrangements

    # scaffold heterozygous assembly (de novo assembled contigs)
    ./pyScaf.py -f test/contigs.fa -r test/ref.fa -o test/contigs.ref.fa

    # scaffold reduced contigs using long reads
    ## pacbio
    ./pyScaf.py -f test/contigs.reduced.fa -n test/pacbio.fq.gz -o test/contigs.reduced.pacbio.fa
    ## nanopore
    ./pyScaf.py -f test/contigs.reduced.fa -n test/nanopore.fa.gz -o test/contigs.reduced.nanopore.fa

    # generate dotplot
    lastdb test/ref.fa
    lastal -f TAB test/ref.fa test/contigs.reduced.pacbio.fa | last-dotplot - test/contigs.reduced.pacbio.fa.ref.png
    lastal -f TAB test/ref.fa test/contigs.reduced.nanopore.fa | last-dotplot - test/contigs.reduced.nanopore.fa.ref.png

    # clean-up
    #rm test/contigs.{,reduced.}fa.* test/ref.fa.* test/*.{nanopore,pacbio,ref}* test/*.log
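
When many assemblies need scaffolding, the commands above can be assembled from Python. The sketch below only builds the argument list from options documented under Parameters; ``pyscaf_command`` is a hypothetical helper, not part of pyScaf, and the script path is an assumption.

.. code-block:: python

    import shlex

    def pyscaf_command(fasta, output, ref=None, longreads=None,
                       threads=4, dotplot=True, script="./pyScaf.py"):
        """Build the argument list for one pyScaf run (hypothetical helper)."""
        cmd = [script, "-f", fasta, "-o", output, "-t", str(threads)]
        if ref:
            cmd += ["-r", ref]          # reference-based scaffolding
        if longreads:
            cmd += ["-n", longreads]    # long read-based scaffolding
        if not dotplot:
            # an empty value skips dotplot generation (e.g. no X11)
            cmd += ["--dotplot", ""]
        return cmd

    cmd = pyscaf_command("test/contigs.reduced.fa", "test/contigs.reduced.ref.fa",
                         ref="test/ref.fa", dotplot=False)
    print(shlex.join(cmd))
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)

Passing the list to ``subprocess.run`` avoids shell quoting problems, including the empty ``--dotplot`` value.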

================
Proof of concept
================

pyScaf is under heavy development right now.

Nevertheless, both the reference-based mode and the long-read mode are functional and produce meaningful assemblies.

pyScaf has been implemented in Redundans.

For more info, have a look at the workbook.
