Extraction of GMODTools from Chado repository: https://github.com/hexylena/gmodtools

# GMOD Tools

GMODTools is a collection of Perl scripts and modules for use with
GMOD Chado genome databases, currently focused on loading and
extracting sequences and annotations.

bulkfiles.pl is a command-line program for Bio::GMOD::Bulkfiles.
It generates bulk genome annotation files from a Chado genome
database, including Fasta, GFF, DNA, Blast indices,
a standard genome webpage, and Chado database overview tables.

gff2biomart.pl creates tables for BioMart from genome GFF annotations.
It generates .sql, .txt, and .xml table files suited to BioMart. With such
a BioMart dataset you can select genome regions with the available
annotations, exclude others, and download feature tables or
sequences of the selection set. For instance, select the regions
with Mosquito gene homologs but no D. melanogaster gene homologs,
or select regions with gene predictions but no known homolog.

Other sequence functions are in the library module
GMOD::Chado::SeqUtils, which provides common methods for these Chado
sequence scripts; their main programs are in the bin/ folder, as
summarized below.

## bulkfiles.pl

A command-line program for Bio::GMOD::Bulkfiles.

This program generates bulk genome annotation files from a Chado genome
database, including Fasta, GFF, DNA, Blast indices, a standard genome
webpage, and Chado database overview tables.

```sh
# get the bulkfiles software
curl -O http://eugenes.org/gmod/GMODTools/GMODTools-1.0.zip
unzip GMODTools-1.0.zip
# -- OR clone from the git repository --
git clone https://github.com/gmod/GMODTools

# load a genome chado db into a Postgres database
curl -O http://sgdlite.princeton.edu/download/sgdlite/sgdlite.sql.gz
createdb sgdlite
(gunzip -c sgdlite.sql.gz | psql -d sgdlite -f -) >& log.load

# extract bulk files from the database
cd GMODTools
perl -Ilib bin/bulkfiles.pl -conf sgdbulk -make
```

To create a new genome database release configuration, see
conf/bulkfiles/bulkfiles_template.xml.

For example output, see http://insects.eugenes.org/genome/

## Bio::GMOD::Bulkfiles

Produces bulk sequence and feature files from a Chado genome database for
public distribution.

This generates Fasta, GFF, DNA, and other bulk genome annotation files at
ftp://flybase.net/genomes/Drosophila_melanogaster/current/ (and for other
species). It works with several FlyBase Chado dbs and with the SGDLite
Chado db.

Bulkfiles is mostly self-contained, but uses a few BioPerl parts plus
XML::Simple for configuration files. All of the organism/database-specific
logic should be in these configuration files (see GMODTools/conf/bulkfiles/
and the sketch after the list below):

- fbbulk-r4.xml, sgdbulk1.xml .. -- organism/database/release specific options
- chadofeatsql.xml -- chado db sql calls to dump features
- chadofeatconv.xml -- feature conversion options
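
Since the organism/database-specific logic lives in these XML files, a quick
way to see what a given release configuration defines is to parse it with
XML::Simple, the same module Bulkfiles itself uses. This is a minimal
inspection sketch, not part of Bulkfiles; the file path points at one of the
configs shipped in this package, and no assumptions are made about its schema:

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);
use Data::Dumper;

# Parse one of the shipped release configs into a plain Perl structure and
# dump it, so you can see which options a config file actually sets.
my $conf = XMLin('conf/bulkfiles/sgdbulk1.xml');
print Dumper($conf);
```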

### Changelog

- 2008-June, version 1.2b
  - Rationalized -strand, revcomp usage for CDS exons. Problems here were
    caused by BioPerl version changes (in Location/Split and
    Seq/LargePrimarySeq). Bulkfiles now handles -strand, multi-exon CDS in a
    direct way, hopefully independent of BioPerl version changes.
- 2008-May, version 1.2
  - Adding (in progress) a Genbank Submission table writer, 'bulkfiles
    -format=genbanktbl', with output suited to NCBI submission per these
    specifications: http://www.ncbi.nlm.nih.gov/Genbank/eukaryotic_genome_submission.html
  - Adding Genbank2Chado database xml configs for testing (anogam, ...)
  - Improvements (in progress) in chado sql efficiency (table joins are
    mostly what causes problems with large dbs).
- 2007-Oct, version 1.1
  - No chromosome/scaffold/golden_path files. This change is needed to
    handle partially assembled genomes with many (1,000s to 100,000s of)
    scaffolds. Set the flag no_csomesplit=1 to use this (it should become
    the default).
  - Gene Ontology association file; see the go_association tags in the
    configurations, formatted as described at
    http://geneontology.org/GO.format.annotation.shtml
  - Validate main variables in the chado database: ${golden_path},
    ${seq_ontology}. This step, on by default, checks that the database
    contains the values set in the configuration for chromosome type,
    sequence ontology CV name, and any other critical variables. If the
    check fails, the db is inspected for the real values.
  - Miscellaneous bugs cured and configuration updates; tables/overview is
    now active again.

### Outputs

- DNA files (full chromosomes) in raw and Fasta formats
- GFF (v3) feature files (readable with standard GFF3 parsers; see the
  sketch after this list)
- FFF (v1) feature files (used in FlyBase; each complex feature on one line,
  using GenBank/EMBL locations)
- Fasta sequences for each selected feature set, with headers from the
  feature files
- BLAST index files (NCBI)
- Chado database overview tables
- A standard genome webpage for access to the bulk data
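
The GFF (v3) output is plain GFF3, so any standard parser can read it back,
for instance the BioPerl that Bulkfiles already depends on. The file name
below is only an example, not an actual Bulkfiles output name:

```perl
use strict;
use warnings;
use Bio::Tools::GFF;

# Read a GFF3 bulk file back with BioPerl and print a one-line summary
# per feature. 'chrI.gff' is an example file name.
my $gffio = Bio::Tools::GFF->new(-file => 'chrI.gff', -gff_version => 3);
while (my $feature = $gffio->next_feature()) {
  printf "%s\t%s\t%d..%d\n",
    $feature->seq_id, $feature->primary_tag, $feature->start, $feature->end;
}
$gffio->close();
```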

### Usage

```perl
use Bio::GMOD::Bulkfiles;

my $bulkfiles = Bio::GMOD::Bulkfiles->new(
  configfile  => 'fbbulk-r4',   # data-release config file, required param
  debug       => 1,
  showconfig  => 0,
  failonerror => 1,
);

$bulkfiles->dumpFeatures();           # extract feature tables from chado sql db
$bulkfiles->sortNSplitByChromosome(); # collate and separate by chromosome
$bulkfiles->dumpChromosomeBases();    # extract chromosome dna files

# produce various output-format bulk files from the above
my $result = $bulkfiles->makeFiles(
  formats     => [qw(fff gff fasta blast gnomap)],
  chromosomes => [qw(all)],
);
```
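
With failonerror set as above, a driver script may still want to trap
failures itself and report which stage died. The sketch below uses only the
calls shown above; the config name 'sgdbulk1' is just an example from
conf/bulkfiles/:

```perl
use strict;
use warnings;
use Bio::GMOD::Bulkfiles;

# Run the same pipeline stage by stage, reporting which step failed.
my $bulkfiles = Bio::GMOD::Bulkfiles->new(
  configfile  => 'sgdbulk1',   # example release config from conf/bulkfiles/
  failonerror => 1,
);

for my $step (qw(dumpFeatures sortNSplitByChromosome dumpChromosomeBases)) {
  eval { $bulkfiles->$step() };
  die "Bulkfiles step $step failed: $@" if $@;
}

my $result = eval {
  $bulkfiles->makeFiles(formats => [qw(gff fasta)], chromosomes => [qw(all)]);
};
die "makeFiles failed: $@" if $@;
```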

### Why Bulkfiles?

(rather than using other middleware layers to the Chado db: chadoxml,
chadodbi, bioperl, ...)

The general logic is

1. dump all chado db features using simple (and quick) sql, to common
intermediate table files, and chromosome dna to raw files. The feature info is
simple: type, location, name/id, and a few attributes (db_xrefs,..)

2. postprocess these table files to create the various public-use formats (the
time-consuming and configurable part), organized into per-chromosome files. A
toy sketch of this stage follows below.
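
As a toy illustration of this second stage (not the actual Bulkfiles code),
here is what a postprocessing pass over a hypothetical tab-separated
intermediate dump might look like, emitting GFF3. The column layout
(chromosome, type, start, end, strand, name, dbxref) is an assumption made
for the sketch:

```perl
use strict;
use warnings;

# Toy converter: read a hypothetical intermediate feature table on STDIN
# (chromosome, type, start, end, strand, name, dbxref per line, tab-separated)
# and write GFF3 lines on STDOUT.
print "##gff-version 3\n";
while (my $line = <STDIN>) {
  chomp $line;
  my ($chr, $type, $start, $end, $strand, $name, $dbxref) = split /\t/, $line;
  my $attrs = "ID=$name";
  $attrs .= ";Dbxref=$dbxref" if defined $dbxref && length $dbxref;
  print join("\t", $chr, 'chado_dump', $type, $start, $end, '.', $strand, '.',
             $attrs), "\n";
}
```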

Here are some reasons we take this approach:

1. using simple sql to dump all db features to intermediate table allows easy
checks that all features get to bulk files
2. simple sql dump is fast (30 - 60 min for full fly genome), reliable in
getting all mapped features by keeping logic simple
3. process table output in stages - better debugging of steps in process, and
can split processing among computers
4. the stages are loosely coupled - one can go back, tweak configurations and
get a new output w/o redoing the complete extraction process.
5. convert one common feature table + dna to several output formats in one
step, or repeatedly as needed.
6. combine features from several chado dbs (flybase now has 3 chado dbs for
d.mel genome features), and add other sources like flybase cytology features.
7. fairly complex and data-specific configurations are needed - moving
that to config files keeps code reusable.
8. each genome chado database has different policy and choices with respect to
feature, vocabulary and other data. A highly configurable tool, with data
extraction and correction methods that are separate and tunable is needed to
adapt to such variation in genome databases.

## gff2biomart.pl

Creates tables for BioMart from genome GFF annotations.

### Example BioMart : DroSpeGe

BioMart : DroSpeGe provides a tool for mining homologies and annotations of
twelve Drosophila species genomes. You can select genome regions with the
available annotations, exclude others, and download tables or sequences of
the selection set. For instance, select the regions with Mosquito gene
homologs but no D. melanogaster gene homologs, or select regions with gene
predictions but no known homolog.

DroSpeGe's BioMart is built with the GMOD tool gff2biomart.pl. It converts
most genome GFF annotations into a BioMart datamine. Example data sets from
this tool are at http://insects.eugenes.org/BioMart/martview

### Outputs

The script generates .sql, .txt, and .xml table files suited to BioMart
(MySQL, BioMart version 0.3 tested). It is a simple script without
special requirements, basically a data transformer that writes new
files formatted for a BioMart database. Components created:

1. Chromosome region__main tables for BioMart, with chromosomes broken
into Kb bins/regions (5 Kb by default). Features that overlap each region
are tabulated (see the binning sketch after this list).

2. Per-feature feature__dm link tables that store feature attributes
(id, dbxref, match stats, ..) and add a __main column feature_bool to
indicate where features lie.

3. A __chromosome__dm table with DNA residues for Fasta output.

4. main_biomart.xml and sequence_biomart.xml configurations for BioMart
usage.
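
As a rough illustration of the region binning in point 1 (not the script's
actual code), this sketch computes which 5 Kb bins a feature overlaps; the
bin size and zero-based indexing are assumptions:

```perl
use strict;
use warnings;
use POSIX qw(floor);

# Which 5 Kb region bins does a feature with 1-based coordinates overlap?
my $BIN = 5000;

sub bins_for {
  my ($start, $end) = @_;
  my $first = floor(($start - 1) / $BIN);
  my $last  = floor(($end   - 1) / $BIN);
  return ($first .. $last);    # zero-based bin indices
}

# A feature spanning 4,900..10,100 overlaps bins 0, 1 and 2.
print join(",", bins_for(4900, 10100)), "\n";
```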

### In BioMart

- Filter (include, exclude) features that exist in regions, including
joint filters (has homology to human but not to fly or worm genes; has
a predicted gene but no homology; any such feature-type comparison).
- Output four kinds of attributes: a feature table, per-feature sequence,
a region table, and per-region sequence.

Please install and use BioMart before trying to load the outputs of this
script.

### Version

This is alpha-level code, not yet fully tested.

## cgi-bin/genomepathmap.pl

This works with the genome directory generated by Bulkfiles
to provide standard URL access to genome data. It transforms
/genome/species/release/{dna,protein,mrna,ncrna,feature}
web URLs into bulk data file paths.
It supports gzipped and plain data returns.
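
A minimal sketch of the kind of URL-to-path mapping this performs; the
on-disk directory layout here is an assumption, not taken from the script:

```perl
use strict;
use warnings;

# Map a /genome/<species>/<release>/<datatype> URL path onto a bulk-data
# file-system path. The layout under $docroot is illustrative only.
sub genome_path {
  my ($url_path, $docroot) = @_;
  my ($species, $release, $datatype) =
    $url_path =~ m{^/genome/([^/]+)/([^/]+)/(dna|protein|mrna|ncrna|feature)$}
      or return;
  return "$docroot/$species/$release/$datatype";
}

print genome_path('/genome/daphnia_pulex/current/dna', '/data/genomes'), "\n";
```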

## GMOD::Chado::SeqUtils

GMOD::Chado::SeqUtils is based on GMOD release 0.001 (January 2004) and can
be expected to change or become obsolete. Besides the contents of this
package, running it requires the Perl modules installed with the GMOD
release, especially Bio::Chado::LoadDBI/AutoDBI and their ancestor classes
from CPAN (Class::DBI, Ima::DBI, etc.).

GMOD::Chado::SeqUtils was written for use with the nascent Daphnia genome
database, wFleaBase, at http://eugenes.org/daphnia/. See also further
information at http://eugenes.org/daphnia/database

## gmod_init_db.pl

**from GMOD 0.001; needs updating to be useful**

A simple script to create the Chado database and tables.

Loads a file of miscellaneous sequences (cDNA, EST, microsatellites) into the
Chado db, generating public IDs. Sequences are assumed to be non-genomic and
not located.

## gmod_dump_seq.pl

**from GMOD 0.001; needs updating to be useful**

Prints sequences from a Chado DB.

Dumps a sequence file of Chado features, with feature props, synonyms, and
dbxrefs. Select by organism, by 'pub' (= input file), by sequence type, or by
feature props. Needed for the nascent Daphnia wFleaBase to use sequence
public IDs.

## gmod_list_db.pl

**from GMOD 0.001; needs updating to be useful**

Summarizes feature entries in a Chado database, by publication (input
file), by Sequence Ontology type (cDNA, EST, etc.), and by organism. Add
other categories as needed.

## Authors

Don Gilbert, Feb 2004