https://github.com/carrascomj/gowsh

Homology searcher of Gene Ontologies based on web scraping and heuristics.

Host: GitHub
URL: https://github.com/carrascomj/gowsh
Owner: carrascomj
License: mit
Created: 2019-01-25T22:16:10.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2019-02-05T17:27:18.000Z (over 6 years ago)
Last Synced: 2025-02-14T18:38:10.544Z (3 months ago)
Topics: homology, perl
Language: Perl
Homepage:
Size: 240 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: Changes
- License: LICENSE.txt

# GOWSH

Perl homology searcher based on web scraping and heuristic approaches. It is intended to look up homologues in HomoloGene,
Ensembl and [Inparanoid](http://inparanoid.sbc.su.se/cgi-bin/index.cgi) after running the bidirectional best hit (BDBH) algorithm.
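
As a rough illustration of the BDBH idea (not gowsh's actual implementation): two proteins form a bidirectional best hit when each one is the other's top-scoring BLAST hit. The sketch below assumes two BLAST+ tabular result files (`-outfmt 6`, where column 1 is the query, column 2 the subject and column 12 the bit score). The function name `bdbh` and the file names are hypothetical.

```shell
# bdbh FORWARD.tsv REVERSE.tsv
# Prints "model_id<TAB>target_id" for every bidirectional best hit.
# Hypothetical helper: it only illustrates the principle.
bdbh() {
    awk -F '\t' '
        # First file (model vs. target): keep the highest-scoring hit per query.
        NR == FNR {
            if (!($1 in fscore) || $12 + 0 > fscore[$1]) {
                fscore[$1] = $12 + 0
                fwd[$1] = $2
            }
            next
        }
        # Second file (target vs. model): same rule in the opposite direction.
        {
            if (!($1 in rscore) || $12 + 0 > rscore[$1]) {
                rscore[$1] = $12 + 0
                rev[$1] = $2
            }
        }
        # Report a pair only when each side is the best hit of the other.
        END {
            for (q in fwd) {
                t = fwd[q]
                if ((t in rev) && rev[t] == q) {
                    print q "\t" t
                }
            }
        }
    ' "$1" "$2" | sort
}
```

It would be called as `bdbh model_vs_target.tsv target_vs_model.tsv`. When two hits tie on bit score, the one listed first in the file wins, which is a simplification.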

## Getting Started

Clone the repository locally:

    git clone https://github.com/carrascomj/gowsh

Add the scripts to your path (in your bash initialization file; e.g., ~/.bashrc):

    export PATH=$PATH:"path/to/gowsh/bin"

The program requires additional packages, which can be installed with [cpanm](https://metacpan.org/pod/cpanm) if they are not already present:

    cpanm JSON Data::Dumper Bio::SeqIO LWP::Simple File::Basename Getopt::Long XML::Parser

Alternatively, WebAPIsGOWSH can be installed as a regular Perl package (from the 'gowsh/' directory):

    perl Makefile.PL
    make
    make install

Finally, formatdb and blast+ are both required.
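
Before the first run it can help to confirm that the external tools are reachable from the shell. The helper below is a minimal sketch: `check_tools` is a hypothetical name, and the assumption that gowsh calls the binaries as `formatdb` and `blastp` is not confirmed by this README.

```shell
# check_tools NAME...
# Reports every command that cannot be found on PATH and returns
# non-zero if at least one is missing. Hypothetical helper.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example (binary names are assumptions):
#   check_tools perl formatdb blastp || echo "install the missing tools first"
```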

## Usage

`gowsh.pl` is the main script. It accepts the following command-line options:

* `--gfile path_to_file`: input, genes as multiFASTA
* `--go GOid`: input, Gene Ontology ID (as in AmiGO)
* `--glist list`: input, blank-separated list of gene IDs
* `--tfile path_to_file`: multiFASTA containing the proteins of the target organism's genome
* `--torg organism`: target organism name (genus and species)
* `--modfile path_to_file`: optional, multiFASTA containing the proteins of the model organism's genome
* `--modorg organism`: optional, model organism name (genus and species)
* `--out "outfile"`: optional, name of the output file; default "GOWSH_output.txt"
* `--preserve`: optional; if given, (nearly) all generated files are preserved

## Running the test

The script can be tested with the following command:

    gowsh.pl --go 0048507 --modorg "arabidopsis thaliana" --torg "oryza sativa"

The program will then parse the input, download both genomes from NCBI and try to match homologues.

You can compare the output with the file "t/GOWSH_outputq1.tsv".
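
To compare a fresh run against the reference file, one option is an order-insensitive line comparison. This is only a sketch: `same_records` is a hypothetical helper, and it assumes the run wrote its results to the default "GOWSH_output.txt" and that line order carries no meaning.

```shell
# same_records FILE_A FILE_B
# Succeeds when both files contain the same lines, in any order;
# otherwise prints the differing lines. Hypothetical helper.
same_records() {
    a=$(mktemp) && b=$(mktemp) || return 2
    sort "$1" > "$a"
    sort "$2" > "$b"
    diff "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return "$status"
}

# Example, run from the repository root:
#   same_records GOWSH_output.txt t/GOWSH_outputq1.tsv && echo "outputs match"
```

Because the program queries online databases, a new run may legitimately differ from the stored reference.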

## What I Learned

This code was developed as a project for one of the subjects of my BSc in Biotechnology (UPM). In summary, I learned the following:
* Web scraping biological information using Perl and the [mygene API](http://mygene.info/v3/api#/).
* Using the Entrez [E-utilities](https://www.ncbi.nlm.nih.gov/books/NBK25499/) programmatic access API from NCBI.
* Using the [Ensembl REST](http://www.ensembl.org/index.html) API.
* Running BLAST locally using [blast+](https://www.ncbi.nlm.nih.gov/pubmed/20003500?dopt=Citation).
* Heuristic algorithms to account for homology.
* How to build a Perl package.
* How to write a README.md.
