Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/adamtaranto/corset-tools

Companion scripts for annotation of Corset generated transcript clusters.
https://github.com/adamtaranto/corset-tools

Last synced: 3 months ago
JSON representation

Companion scripts for annotation of Corset generated transcript clusters.

Host: GitHub
URL: https://github.com/adamtaranto/corset-tools
Owner: Adamtaranto
License: gpl-2.0
Created: 2015-04-29T04:06:54.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2020-08-07T01:18:24.000Z (over 4 years ago)
Last Synced: 2023-10-20T16:09:58.515Z (over 1 year ago)
Language: Python
Size: 42 KB
Stars: 6
Watchers: 3
Forks: 4
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

#Corset-tools
Companion scripts for working with transcript clusters generated by Corset.

##Fetch Cluster Sequences
Given a list of clusters of interest, a transcript fasta, and a Corset mapping file,
**fetchClusterSeqs.py** will return a fasta file of transcripts belonging to those
clusters (with cluster names appended).

This script is intended for use, following differential expression analysis, to
extract and annotate transcript belonging to DE clusters.

###Usage
fetchClusterSeqs.py -i transcripts.fa -t myfavClusters.csv -o transcriptsWithclusterTags.fa -c clusters.txt

###Options
-i, --inFasta Multi fasta to extract subset from.

-t, --targetClust Comma delimited file with target cluster names in column one.

-c, --clustMap Corset transcript-to-cluster mapping file.

-h, --help Show help message and exit.

-o, --outFasta Directory for new sequence file to be written to. i.e. output/newfile.fa

-l --longest If set only report the longest transcript per cluster

## Cross Cluster Counter
When using Corset to cluster transcriptomes from two species/isolates/cultivars/etc.,
sequence variation is likely to exist between transcriptomes and it is important to check
that your Corset clustering has not excessively split clusters based on transcriptome set.

The **crossClustCount.py** script will take a corset cluster mapping (transcript --> cluster),
as well as fasta files for those transcriptomes, and calculate the number of members from
each source transcriptome in each cluster.

Beware of clusters that only contain transcripts from one transcript set when running
differential expression analyses - these may create a false differential expression signal.

If you have many clusters in which reads from only one transcriptome set are present, you
might consider adjusting your --score-min threshold in Bowtie2 to be more tolerant of
variation when mapping reads between transcriptomes.

###Usage
crossClustCount.py -X transcripts_X.fa -Y transcripts_Y.fa -x SetX_Label -y SetY_Label -o outfile.tab -c clusters.txt

###Arguments
-h, --help Show help message and exit.

-X, --transFastaX Fasta file for transcriptome X.

-Y, --transFastaY Fasta file for transcriptome Y.

-x, --transNameX Unique lable for transcriptome X.

-y, --transNameY Unique lable for transcriptome Y.

-o, --outFile Location for summary file to be written to.

-c, --clustMap Corset transcript-to-cluster mapping file.

## Transcriptome Distance Calculator
When using Corset to cluster transcriptomes from two different isolates/individuals/cultivars/species
natural sequence variation is likely to occur between input datasets. For equivalent transcripts
to cluster correctly it is essential that reads are able to cross-map to their eqivalent location
in the non-self transcriptome.

In Bowtie2, the minimum score for a 'valid' read alignment is set via the --score-min option.
The format is [type,intercept,slope], with the default setting -L,-0.6,-0.6.

The equation for a 100 bp read would be -0.6 + (-0.6 * 100) = -60.6

The default Bowtie 2 penalty for a mismatched base is -6. An alignment gap of length 2 receives a
penalty of -11 by default (-5 for the gap open, -3 for the first extension, -3 for the second extension).

Therefore a 100bp read alignment is still considered valid with 5 mismatches (-5 * 5 = -30),
or one gap + one mismatch (-6 + (-5 + -3) = -14).

Failing to set a minimum score threshold that is below (more negative than) the expected distance between
homologous transcripts will likely result in transcript clusters that split by source transcriptome; or if
homologous clusters are predicted, lower mapping efficiency between sets may create a false differential
expression signal in later analyses (genes that are more variable between isolate two isolates may appear
to be differentially expressed when exposed to the same treatment condition.)

Conversely, setting --score-min to high will likely result in excessive collapse of paralogues.

The **Transcriptome Distance Calculator** compares equivalent transcript pairs from two transcriptomes
and determines the distance (gaps and mismatches) for each pair, and whether mapping would be successful
for a given read length and --score-min setting.

Transcript pairs may be provided as a tab delimited list; alternatively, the script can perform a reciprocal
best blast analysis if provided with tab formated blast results for SetA vs SetB, and SetB vs SetA.

*Pending feature*: Advise user of the minimum alignment score that must be tolerated in order to allow
cross-mapping for 95% of the transcript population.

###Usage
Pairs provided:
transDistCalc.py -r 100 -i -0.6 -s -0.6 -n pairnames.txt -a FASTA_A.fa -b FASTA_B.fa -g -5
-e -3 -m -6 -o alignmentreports.txt

Get pairs from reciprocal best blast:
transDistCalc.py -r 100 -i -0.6 -s -0.6 -x BLAST_AvsB.tab -y BLAST_BvsA.tab -a FASTA_A.fa -b FASTA_B.fa -g -5
-e -3 -m -6 -o alignmentreports.txt -w

###Options
-h, --help Show help message and exit.

-r, --readLength Length of reads intended for mapping to transcriptomes.
(after quality trimming).

-i, --scoreMinIntercept Intercept value for Bowtie2 --score-min function.

-s, --scoreMinSlope Slope value for Bowtie2 --score-min function.

-n, --pairNames Two column tab delimited file containing names for
sequences from fastaA and fastaB to be aligned.

-x, --blastAvB Blast tab result file for fastaA query against fastaB
subject.

-y, --blastBvA Blast tab result file for fastaB query against fastaA
subject.

-a, --fastaA Multifasta set A.

-b, --fastaB Multifasta set B.

-g, --gapOpen Penalty for opening a gap.

-e, --gapExtend Penalty for extending a gap.

-m, --mismatch Pentaly for a mismatched position.

-o, --outFile Write output table to file.

-w, --recipFile Write reciprocal blast pairs to file. True if flag is
set.