https://github.com/lkremer/tecov
A pipeline to determine the repeat content of genes and their neighboring regions.
- Host: GitHub
- URL: https://github.com/lkremer/tecov
- Owner: LKremer
- License: mit
- Created: 2017-06-02T13:24:35.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2017-06-02T13:25:14.000Z (over 8 years ago)
- Last Synced: 2025-03-31T21:27:58.204Z (7 months ago)
- Topics: bioinformatics-pipeline, genome, genomics, repeatmasker, transposable-elements
- Language: Python
- Size: 6.84 KB
- Stars: 7
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
TEcov, a pipeline to determine the repeat content of genes and their neighboring regions.
=========

Requires a gene annotation in ".gff" format and a [RepeatMasker](http://www.repeatmasker.org/)
repeat annotation in ".out" format. Reports the repeat content (% of base pairs covered) of

- genic regions (exons, introns, untranslated regions)
- promoter regions (upstream regions)
- flanking regions (up- and downstream regions)

The results can be used to study the influence of repetitive content such as
transposable elements (TEs) on gene expression or gene family evolution.

Requirements
------------

TEcov requires [Python3.4+](https://www.python.org/downloads/) and the Python package pyfaidx:

`pip3 install pyfaidx`

You also need to [install bedtools](http://bedtools.readthedocs.io/en/latest/content/installation.html).
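
Before running the pipeline it can help to confirm that all three requirements are actually available. The snippet below is a minimal sketch and not part of TEcov: the function name `check_tecov_requirements` is hypothetical, and it assumes that bedtools is installed as an executable called `bedtools` on your `PATH`.

```python
import importlib.util
import shutil
import sys


def check_tecov_requirements():
    """Return a dict mapping each requirement to True (found) or False.

    Hypothetical helper, not shipped with TEcov.
    """
    return {
        # TEcov states that it needs Python 3.4 or newer.
        "python>=3.4": sys.version_info >= (3, 4),
        # pyfaidx only has to be importable; it is not imported here.
        "pyfaidx": importlib.util.find_spec("pyfaidx") is not None,
        # Assumption: the bedtools executable is named "bedtools".
        "bedtools": shutil.which("bedtools") is not None,
    }


if __name__ == "__main__":
    for name, found in check_tecov_requirements().items():
        print("{:<12} {}".format(name, "found" if found else "MISSING"))
```

If any entry is reported as missing, install it as described above before calling `TEcov.py`.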

Usage
------------

```
usage: TEcov.py [-h] [-f FLANK_SIZE] [-j JOBNAME] [-m MIN_BASEPAIRS]
                [--gene_gff_feat GENE_GFF_FEAT] [--cds_gff_feat CDS_GFF_FEAT]
                species_gff te_annotation genome

A pipeline to determine the repeat content (e.g. content of transposable
elements) of genes and their neighboring regions. Requires a gene annotation
and a repeatmasker repeat annotation. Reports the repeat content (%bp covered)
of genic regions (exons, introns, UTRs), promoter (upstream) regions and
flanking (up- and downstream) regions.

positional arguments:
  species_gff           path to the species' genome annotation (.gff, only
                        longest isoforms!)
  te_annotation         path to the species' TE annotation (repeatmasker .out
                        but NOT .gff)
  genome                path to the species' genome (.fasta).

optional arguments:
  -h, --help            show this help message and exit
  -f FLANK_SIZE, --flank_size FLANK_SIZE
                        how long the flanking regions should be (in bp,
                        default is 10.000 bp)
  -j JOBNAME, --jobname JOBNAME
                        name a directory to which all intermediate files will
                        be dumped
  -m MIN_BASEPAIRS, --min_basepairs MIN_BASEPAIRS
                        percentages get meaningless when the overlapped
                        sequences are very short. This value defines the
                        minimun sequence length (in bp). Shorter sequences
                        will be "NA" (default: 200 bp)
  --gene_gff_feat GENE_GFF_FEAT
                        the name of the GFF feature that denotes the location
                        of genes (default: "gene")
  --cds_gff_feat CDS_GFF_FEAT
                        the name of the GFF feature that denotes the location
                        of CDSs (default: "cds")
```
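
To make the reported numbers more concrete, here is a minimal sketch of what "% of base pairs covered" means for a flanking region, and how the `--flank_size` and `--min_basepairs` options could enter the calculation. It is an illustration only, not TEcov's implementation (TEcov relies on bedtools for the interval arithmetic). The function names are hypothetical, coordinates are assumed to be 0-based and half-open, and it is assumed that the `--min_basepairs` threshold applies to the length of the region being evaluated.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of counting bases twice.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def flanking_regions(gene_start, gene_end, flank_size=10000,
                     strand="+", chrom_length=None):
    """Return (upstream, downstream) flanks of a gene as (start, end) tuples.

    Flanks are clipped to the chromosome; on the minus strand the
    upstream flank lies to the right of the gene.
    """
    left = (max(0, gene_start - flank_size), gene_start)
    right_end = gene_end + flank_size
    if chrom_length is not None:
        right_end = min(right_end, chrom_length)
    right = (gene_end, right_end)
    return (left, right) if strand == "+" else (right, left)


def percent_covered(region, repeats, min_basepairs=200):
    """Return the % of bp in `region` covered by `repeats`.

    Returns "NA" if the region is shorter than `min_basepairs`
    (assumed reading of the --min_basepairs option).
    """
    region_start, region_end = region
    length = region_end - region_start
    if length < min_basepairs:
        return "NA"
    covered = 0
    for start, end in merge_intervals(repeats):
        overlap = min(end, region_end) - max(start, region_start)
        if overlap > 0:
            covered += overlap
    return 100.0 * covered / length


# Made-up repeat annotations on one chromosome; the first two overlap.
repeats = [(500, 1500), (1200, 2000), (12000, 12500)]
upstream, downstream = flanking_regions(10000, 11000, flank_size=10000)

print(percent_covered(upstream, repeats))    # 15.0
print(percent_covered(downstream, repeats))  # 5.0
print(percent_covered((0, 150), repeats))    # NA
```

Merging the repeat intervals first matters because RepeatMasker annotations can overlap; without it, bases covered by two repeats would be counted twice and the percentage could exceed 100.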