https://github.com/greenelab/django-genes


Genes
*****

Genes is a Django app to represent genes.

Download and Install
====================

This package is registered on PyPI as ``django-genes`` and can be
installed with ``pip``:

::

    pip install django-genes

``pip`` will also install any of the following dependencies that are
missing from your system:

* ``django`` 1.8 or later (Django web framework)

* ``django-organisms`` (``Organisms`` model, which is required by the
  ``Genes`` model)

* ``django-haystack`` (see the ``Search Indexes and Data Template``
  section)

* ``django-fixtureless`` (for unit tests, see ``tests.py``)

Quick Start
===========

1. Add **'genes'** and **'organisms'** to your ``INSTALLED_APPS``
   setting like this:

::

    INSTALLED_APPS = (
        ...
        'organisms',
        'genes',
    )

2. Run ``python manage.py migrate`` to create the database tables for
   the ``genes`` and ``organisms`` models.

3. **(Optional)** This step is only needed if you have django-tastypie
   installed to provide a REST API for your project and would like API
   endpoints for ``django-organisms`` and ``django-genes``.

Add the following to your project's ``urls.py`` file:

::

    # There are probably already other imports here, such as:
    # from django.conf.urls import url, patterns, include

    # If you have not already done so, import the tastypie API:
    from tastypie.api import Api

    # Import the API Resources for Organisms and Genes. CrossRefResource
    # and CrossRefDBResource are assumed to live in genes.api alongside
    # GeneResource; import them only if you register them below.
    from organisms.api import OrganismResource
    from genes.api import GeneResource, CrossRefResource, CrossRefDBResource

    # If you have not already done so, initialize your API and
    # add the Organism and Gene Resources to it. You can also register
    # the CrossRefResource and CrossRefDBResource if you want to have
    # API endpoints for them as well.
    v0_api = Api()
    v0_api.register(OrganismResource())
    v0_api.register(GeneResource())
    v0_api.register(CrossRefResource())
    v0_api.register(CrossRefDBResource())

    # In the urlpatterns, include the urls for this api.
    # Note: patterns() was removed in Django 1.10; on newer versions
    # use a plain list of url() entries instead.
    urlpatterns = patterns('',
        ...
        (r'^api/', include(v0_api.urls))
    )

Search Indexes and Data Template
================================

The module ``search_indexes.py`` can be used by **django-haystack**
(https://github.com/django-haystack/django-haystack) to search genes.
It specifies which Gene fields are included in the search index and
how they are weighted. The ``text`` field refers to a document that is
built for the search engine to index. The data template for this
document is located at:
``genes/templates/search/indexes/gene_text.txt``.

For more information, see:
http://django-haystack.readthedocs.org/en/latest/tutorial.html#handling-data

Usage of Management Commands
============================

This app includes five management commands in its
``management/commands/`` subdirectory:

1. genes_add_xrdb
-----------------

..

This command adds cross-reference databases for genes. It **must**
be called for every new cross-reference database so that the gene
and cross-reference objects in the database can be populated. It
requires two arguments:

* name: the name of the database

* URL: the URL for that database, with the string '_REPL_' added at
  the end of the URL

For example, this command adds Ensembl as a cross-reference
database:

::

    python manage.py genes_add_xrdb --name=Ensembl --URL="http://www.ensembl.org/Gene/Summary?g=_REPL_"

And this command adds MIM as a cross-reference database:

::

    python manage.py genes_add_xrdb --name=MIM --URL=http://www.ncbi.nlm.nih.gov/omim/_REPL_
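
The URL templates above can be illustrated with a minimal Python
sketch. It assumes that ``_REPL_`` marks the position where a gene's
cross-reference identifier is substituted to build a link; the helper
name ``build_xref_url`` and the identifier used are hypothetical and
are not part of ``django-genes``.

```python
def build_xref_url(url_template, identifier, placeholder="_REPL_"):
    """Return url_template with the placeholder swapped for identifier.

    Hypothetical helper for illustration only; it is not the package's API.
    """
    if placeholder not in url_template:
        raise ValueError("URL template has no %s placeholder" % placeholder)
    return url_template.replace(placeholder, identifier)


# The same template that is passed to genes_add_xrdb for Ensembl.
ensembl_template = "http://www.ensembl.org/Gene/Summary?g=_REPL_"

# "ENSG_EXAMPLE" is a made-up identifier used only to show the substitution.
print(build_xref_url(ensembl_template, "ENSG_EXAMPLE"))
```

Rejecting a template without the placeholder is a design choice of
this sketch: it surfaces a mistyped ``--URL`` value early instead of
silently producing identical links for every gene.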

2. genes_load_geneinfo
----------------------

..

This command parses gene info file(s) and saves the corresponding
gene objects into the database. It takes 2 required arguments and 5
optional arguments:

* (Required) geneinfo_file: location of the gene info file;

* (Required) taxonomy_id: taxonomy ID of the organism whose genes
  are being populated;

* (Optional) gi_tax_id: alternative taxonomy ID for some organisms
  (such as S. cerevisiae);

* (Optional) symbol_col: symbol column in the gene info file. Default
  is 2;

* (Optional) systematic_col: systematic column in the gene info file.
  Default is 3;

* (Optional) alias_col: the column containing gene aliases. If a
  hyphen '-' or blank space ' ' is passed, symbol_col will be used.
  Default is 4;

* (Optional) put_systematic_in_xrdb: name of the cross-reference
  database for which you want to use organism systematic IDs as
  CrossReference IDs. This is useful for Pseudomonas, for example,
  as systematic IDs are saved into the PseudoCAP cross-reference
  database.

The following example shows how to download a gzipped human gene
info file from the NIH FTP server and populate the database from
this file.

::

    # Create a temporary data directory:
    mkdir data

    # Download a gzipped human gene info file into data directory:
    wget -P data/ -N ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz

    # Unzip downloaded file:
    gunzip -c data/Homo_sapiens.gene_info.gz > data/Homo_sapiens.gene_info

    # Call genes_load_geneinfo to populate the database:
    python manage.py genes_load_geneinfo --geneinfo_file=data/Homo_sapiens.gene_info --taxonomy_id=9606 --systematic_col=3 --symbol_col=2
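
To make the column arguments more concrete, here is a rough Python
sketch of how one line of a gene info file might be split into the
fields named above. It is not the command's actual implementation. It
assumes that the file is tab-delimited, that the column defaults
(2, 3 and 4) are zero-based indexes, that aliases are separated by
``|``, and that ``-`` means "no value"; the function name and the
sample row are made up for illustration.

```python
def parse_gene_info_line(line, symbol_col=2, systematic_col=3, alias_col=4):
    """Split one tab-delimited gene info line into a dict of fields.

    Column arguments are treated as zero-based indexes (an assumption).
    """
    fields = line.rstrip("\n").split("\t")
    raw_aliases = fields[alias_col]
    # Assumption: "-" marks a missing value and aliases are "|"-separated.
    aliases = [] if raw_aliases == "-" else raw_aliases.split("|")
    return {
        "symbol": fields[symbol_col],
        "systematic_name": fields[systematic_col],
        "aliases": aliases,
    }


# Made-up row: tax_id, gene ID, symbol, systematic name, aliases.
sample = "9606\t12345\tGENE1\tLOC_0001\tALIAS1|ALIAS2\n"
print(parse_gene_info_line(sample))
```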

3. genes_load_uniprot
---------------------

..

This command populates the database with UniProtKB identifiers. It
takes one argument:

* uniprot_file: location of a file mapping UniProtKB IDs to Entrez
  and Ensembl IDs

**Important:** Before calling this command, make sure that both
Ensembl and Entrez identifiers have been loaded into the database.

After downloading the gzipped file, use the ``zgrep`` command to
extract only the lines needed (the original file is quite large),
then run this command:

::

    wget -P data/ -N ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/idmapping.dat.gz
    zgrep -e "GeneID" -e "Ensembl" data/idmapping.dat.gz > data/uniprot_entrez_ensembl.txt
    python manage.py genes_load_uniprot --uniprot_file=data/uniprot_entrez_ensembl.txt
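
If ``zgrep`` is not available, the filtering step above can be
approximated in Python. The sketch below mirrors the ``zgrep`` call by
keeping every line that contains the substring ``GeneID`` or
``Ensembl``; the function names are hypothetical, and the sample rows
are made up to resemble a three-column, tab-delimited mapping file.

```python
import gzip

WANTED = ("GeneID", "Ensembl")


def filter_idmapping(lines, wanted=WANTED):
    """Yield only the lines that contain one of the wanted substrings."""
    for line in lines:
        if any(keyword in line for keyword in wanted):
            yield line


def filter_gzipped(src_path, dest_path):
    """Stream a gzipped mapping file and write the matching lines."""
    with gzip.open(src_path, "rt") as src, open(dest_path, "w") as dest:
        dest.writelines(filter_idmapping(src))


# Made-up rows: accession, ID type, ID.
sample = [
    "P00001\tGeneID\t111\n",
    "P00001\tRefSeq\tNP_000001\n",
    "P00001\tEnsembl\tENSG_EXAMPLE\n",
]
print("".join(filter_idmapping(sample)), end="")
```

Like the ``zgrep`` patterns, this is a substring match, so any ID type
whose name merely contains ``Ensembl`` is kept as well. Streaming line
by line avoids loading the whole (large) file into memory.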

4. genes_load_wb
----------------

..

This command populates the database with WormBase identifiers. It
takes the following arguments:

* (Required) wb_url: URL of the WormBase xrefs file;

* (Optional) db_name: the name of the cross-reference database.
  Default is 'WormBase'.

The WormBase cross-reference database must be added with the
``genes_add_xrdb`` command (see command #1) before this command is
run to populate the WormBase identifiers. Here is an example:

::

    # Find latest version of WormBase here:
    # http://www.wormbase.org/about/release_schedule#102--10-1
    python manage.py genes_load_wb --wb_url=ftp://ftp.wormbase.org/pub/wormbase/releases/WS243/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS243.xrefs.txt.gz

5. genes_load_gene_history
--------------------------

..

This management command reads an input gene history file and finds
all genes whose tax_id matches the input taxonomy ID. If a gene
already exists in the database, its Gene record is marked as
obsolete; if not, a new obsolete Gene record is created in the
database.

The command accepts 2 required arguments and 3 optional arguments:

* (Required) gene_history_file: input gene history file. A gzipped
  example file can be found at:
  ftp://ftp.ncbi.nih.gov/gene/DATA/gene_history.gz

* (Required) tax_id: taxonomy ID assigned by NCBI to a certain
  organism. Genes of other organisms in the input file will be
  skipped.

* (Optional) tax_id_col: column number of tax_id in the input file.
  Default is 1.

* (Optional) discontinued_id_col: column number of the discontinued
  GeneID in the input file. Default is 3.

* (Optional) discontinued_symbol_col: column number of the gene's
  discontinued symbol in the input file. Default is 4.

Note that column numbers in the last three arguments all start from
1, **not** 0.

For example, to add obsolete genes whose tax_id is 208964 from the
file "gene_history", use a command like this:

::

    # Download file into your data directory:
    cd /data_dir; wget ftp://ftp.ncbi.nih.gov/gene/DATA/gene_history.gz

    # Unzip the downloaded file into "gene_history"
    gunzip gene_history.gz

    # Run management command:
    python manage.py genes_load_gene_history /data_dir/gene_history 208964 --tax_id_col=1 --discontinued_id_col=3 --discontinued_symbol_col=4

(Here ``--tax_id_col=1 --discontinued_id_col=3
--discontinued_symbol_col=4`` can be omitted because they match the
default values.)
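
The one-based column convention can be illustrated with a short Python
sketch. It is not the command's implementation; it only shows how
one-based column arguments translate into list indexes when selecting
rows for a given taxonomy ID. It assumes a tab-delimited file whose
header line starts with ``#``; the function name and sample rows are
made up.

```python
def iter_discontinued(lines, tax_id, tax_id_col=1,
                      discontinued_id_col=3, discontinued_symbol_col=4):
    """Yield (discontinued_id, discontinued_symbol) for matching rows.

    Column arguments are one-based, so 1 is subtracted before indexing.
    """
    for line in lines:
        if line.startswith("#"):
            # Assumption: header/comment lines start with "#".
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[tax_id_col - 1] != str(tax_id):
            # Rows for other organisms are skipped.
            continue
        yield (fields[discontinued_id_col - 1],
               fields[discontinued_symbol_col - 1])


# Made-up rows: tax_id, current ID, discontinued ID, symbol, date.
sample = [
    "#tax_id\tGeneID\tDiscontinued_GeneID\tDiscontinued_Symbol\tDate\n",
    "208964\t-\t111\tgeneA\t20050101\n",
    "9606\t-\t222\tgeneB\t20050101\n",
]
for gene_id, symbol in iter_discontinued(sample, 208964):
    print(gene_id, symbol)
```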