https://github.com/carrascomj/gowsh

Homology searcher of Gene Ontologies based on web scraping and heuristics.

# GOWSH

Perl homology searcher based on web scraping and heuristic approaches. It is meant to look up homologues in HomoloGene,
Ensembl and [Inparanoid](http://inparanoid.sbc.su.se/cgi-bin/index.cgi) after running the bidirectional best hit (BDBH) algorithm.

## Getting Started

Clone the repository locally:

    git clone https://github.com/carrascomj/gowsh

Add the scripts to your PATH (in your bash initialization file, e.g., ~/.bashrc):

    export PATH=$PATH:"path/to/gowsh/bin"

The program requires additional Perl modules, which can be installed with [cpanm](https://metacpan.org/pod/cpanm) if they are not already present:

    cpanm JSON Data::Dumper Bio::SeqIO LWP::Simple File::Basename Getopt::Long XML::Parser

Alternatively, WebAPIsGOWSH can be installed as a regular Perl package (from the 'gowsh/' directory):

    perl Makefile.PL
    make
    make install

Finally, formatdb and blast+ are both required.

## Usage

gowsh.pl is the main script. The program takes command-line arguments with
the following options:

    gowsh.pl --gfile|go|glist "path_to_file|GOid|list" --tfile|torg "path_to_file|organism"
        [--modfile|modorg "path_to_file|organism"] [--out "outfile"] [--preserve]

* --gfile path_to_file: input, genes in multiFASTA format
* --go GOid: input, Gene Ontology ID (as in AmiGO)
* --glist list: input, blank-separated list of gene IDs
* --tfile path_to_file: multiFASTA file containing the proteins of the target organism's genome
* --torg organism: target organism name (genus and species)
* --modfile path_to_file: optional, multiFASTA file containing the proteins of the model organism's genome
* --modorg organism: optional, model organism name (genus and species)
* --out "outfile": optional, name of the output file; default "GOWSH_output.txt"
* --preserve: optional; if given, (nearly) all generated files are preserved.
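
As an illustration of the options above, here is a minimal sketch of how they could be declared with Getopt::Long (one of the modules listed under Getting Started). The `parse_args` helper, the hash keys and the validation rules (exactly one input option, at least one target option) are assumptions made for this example; they are not taken from gowsh.pl itself.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: parse a list of arguments into a hash of options.
sub parse_args {
    my @argv = @_;

    # Defaults as documented in the option list above.
    my %opt = (out => 'GOWSH_output.txt', preserve => 0);

    GetOptionsFromArray(
        \@argv, \%opt,
        'gfile=s', 'go=s', 'glist=s',    # input: exactly one expected (assumption)
        'tfile=s', 'torg=s',             # target organism
        'modfile=s', 'modorg=s',         # model organism (optional)
        'out=s', 'preserve',
    ) or die "Invalid options\n";

    my @inputs = grep { defined $opt{$_} } qw(gfile go glist);
    die "Give exactly one of --gfile, --go or --glist\n" unless @inputs == 1;
    die "Give --tfile or --torg\n"
        unless defined $opt{tfile} || defined $opt{torg};

    return \%opt;
}

# Example with the arguments used in the test command of this README.
my $opt = parse_args(
    '--go', '0048507',
    '--modorg', 'arabidopsis thaliana',
    '--torg', 'oryza sativa',
);
print "$_ = $opt->{$_}\n" for sort keys %$opt;
```

Collecting the options in a single hash keeps the defaults (output name, `--preserve` off) in one place.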

## Running the test

The script can be tested with the following command:

    gowsh.pl --go 0048507 --modorg "arabidopsis thaliana" --torg "oryza sativa"

You can compare the output with the file "t/GOWSH_outputq1.tsv".

When run, the program parses the input, downloads both genomes from NCBI and tries to match homologues.
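
The homologue-matching step builds on the bidirectional best hit (BDBH) idea named in the introduction. Below is a minimal sketch of that idea, assuming BLAST tabular output (`-outfmt 6`, where column 12 is the bit score) for both search directions. The function names and the toy identifiers are hypothetical, and this is not the code GOWSH actually runs.

```perl
use strict;
use warnings;

# Best hit (highest bit score) per query from BLAST tabular lines.
# Columns assumed: qseqid sseqid pident length mismatch gapopen
#                  qstart qend sstart send evalue bitscore
sub best_hits {
    my @lines = @_;
    my (%best, %score);
    for my $line (@lines) {
        chomp(my $row = $line);
        next if $row =~ /^\s*(?:#|$)/;    # skip comments and blank lines
        my ($query, $subject, $bits) = (split /\t/, $row)[0, 1, 11];
        if (!exists $score{$query} || $bits > $score{$query}) {
            $score{$query} = $bits;
            $best{$query}  = $subject;
        }
    }
    return \%best;
}

# Keep only the pairs whose best hits point at each other.
sub bidirectional_best_hits {
    my ($fwd, $rev) = @_;
    my %pairs;
    for my $gene (keys %$fwd) {
        my $hit = $fwd->{$gene};
        $pairs{$gene} = $hit
            if exists $rev->{$hit} && $rev->{$hit} eq $gene;
    }
    return \%pairs;
}

# Toy data with made-up identifiers (model vs target, then target vs model).
my @model_vs_target = map { join "\t", @$_ } (
    [qw(ath1 osa1 80.0 100 20 0 1 100 1 100 1e-50 200)],
    [qw(ath1 osa2 60.0 100 40 0 1 100 1 100 1e-20 90)],
    [qw(ath2 osa2 70.0 100 30 0 1 100 1 100 1e-30 120)],
);
my @target_vs_model = map { join "\t", @$_ } (
    [qw(osa1 ath1 80.0 100 20 0 1 100 1 100 1e-50 200)],
    [qw(osa2 ath1 60.0 100 40 0 1 100 1 100 1e-20 90)],
);

my $pairs = bidirectional_best_hits(
    best_hits(@model_vs_target),
    best_hits(@target_vs_model),
);
print "$_\t$pairs->{$_}\n" for sort keys %$pairs;
```

With the toy data, only the pair ath1/osa1 is reported: ath2's best hit is osa2, but osa2's best hit is ath1, so that pair is not reciprocal.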

## What I Learned

This code was developed as a project for one of the subjects of my BSc in Biotechnology (UPM). To sum up, I learned the following concepts:
* Web scraping biological information using Perl and [mygene API](http://mygene.info/v3/api#/).
* Use of Entrez [E-utilities](https://www.ncbi.nlm.nih.gov/books/NBK25499/) programmatic access API from NCBI.
* Use of [Ensembl REST](http://www.ensembl.org/index.html) API.
* Running BLAST locally using [blast+](https://www.ncbi.nlm.nih.gov/pubmed/20003500?dopt=Citation).
* Heuristic algorithms to account for homology.
* How to build a Perl package.
* How to write a README.md.
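
To make the E-utilities item above more concrete, here is a small sketch that builds an `esearch` request URL for the protein records of an organism. The `esearch.fcgi` endpoint and its `db`, `term`, `retmax` and `retmode` parameters are part of the public E-utilities API. The helper names and the choice of the protein database are assumptions for illustration, and may differ from the queries GOWSH sends.

```perl
use strict;
use warnings;

my $EUTILS = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils';

# Percent-encode everything except RFC 3986 unreserved characters
# (ASCII input assumed; hypothetical helper to avoid extra dependencies).
sub uri_escape_simple {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $text;
}

# Build an esearch URL that lists protein records for an organism.
sub esearch_url {
    my ($organism, $retmax) = @_;
    my %param = (
        db      => 'protein',
        term    => "$organism\[Organism]",
        retmax  => $retmax // 20,
        retmode => 'json',
    );
    my $query = join '&',
        map { "$_=" . uri_escape_simple($param{$_}) } sort keys %param;
    return "$EUTILS/esearch.fcgi?$query";
}

print esearch_url('oryza sativa', 5), "\n";

# Fetching the result needs network access, for example with LWP::Simple:
#   use LWP::Simple qw(get);
#   my $json = get(esearch_url('oryza sativa', 5));
```

Building the URL separately from fetching it makes the query easy to inspect or paste into a browser before any request is sent.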