https://github.com/jorainer/mirtarbase-db

R package providing the validated miRNA-target gene interactions (MTIs) from the miRTarBase database.

# miRTarBase R package

This package provides the miRNA-target gene interactions (MTIs) from miRTarBase.

## Installation

Clone the git repository and build the package using:

    R CMD build mirtarbase-db

To install the package, call:

    R CMD INSTALL mirtarbase.db_*

After that you can use the package like any other R package. The package also provides a vignette with some examples, which can be opened with the `openVignette()` command after loading the library in R (`library(mirtarbase.db)`).

## Create the database file

The code in this section creates the `SQLite` database for the package. The workflow is as follows: first, download the XLS spreadsheet from the miRTarBase web site, open it in Excel and export it as a tab-delimited text file. Then read this file into R, format it for our requirements (i.e. rename the columns) and store it in a `SQLite` database.

We assume that the Excel file has been downloaded and exported as a tab-delimited text file to `orig_files/txt`. Next we read this file, rename the columns and export it again as a tab-delimited text file (overwriting the original one).

It is a little more complicated to process this file due to the following problems:

- The Experiments column contains 5' RACE and 5" RACE, and `R` interprets these characters as quotes. Thus we have to read the file without specifying quotes.
- The support type column contains *Weak* but also *WeaK*; the latter has to be replaced.
- Some Entrezgene IDs are NA.

All in all, we hope not to encounter further issues in the next release.

    library( RSQLite )
    ## the XLS file contains ' and " inside cells, thus we have to set quote=""
    MTI <- read.table( "orig_files/txt/miRTarBase_MTI.txt", sep="\t", as.is=TRUE, header=TRUE, check.names=FALSE, quote="" )
    ## re-formatting column names.
    CN <- tolower( colnames( MTI ) )
    CN <- gsub( CN, pattern="\"", replacement="" )
    CN <- gsub( CN, pattern="[(|)]", replacement="" )
    CN <- gsub( CN, pattern=" ", replacement="_", fixed=TRUE )
    colnames( MTI ) <- CN
    ## now I want to get rid of ALL \", ", ' in the table!
    character.CN <- CN[ !CN %in% c( "target_gene_entrez_gene_id", "references_pmid" ) ]
    ## loop through all these columns and replace "\""
    for( current.CN in character.CN ){
        MTI[ , current.CN ] <- gsub( MTI[ , current.CN ], pattern="\"", replacement="" )
    }
    ## the double quotes are gone; now remove all remaining ' in experiments
    MTI[ , "experiments" ] <- gsub( MTI[ , "experiments" ], pattern="'", replacement="", fixed=TRUE )
    ## fixing the WeaK thing.
    MTI[ , "support_type" ] <- gsub( MTI[ , "support_type" ], pattern="WeaK", replacement="Weak", fixed=TRUE )
    ## fix NA entrezids
    which.NAs <- which( is.na( MTI$target_gene_entrez_gene_id ) )
    ## for all of these, check if there is another entry for the same gene and species.
    for( idx in which.NAs ){
        tmp <- MTI[ MTI$target_gene==MTI[ idx, "target_gene" ] &
                    MTI$species_target_gene==MTI[ idx, "species_target_gene" ], , drop=FALSE ]
        if( length( tmp[ !is.na( tmp$target_gene_entrez_gene_id ), "target_gene_entrez_gene_id" ] ) > 0 ){
            MTI[ idx, "target_gene_entrez_gene_id" ] <-
                unique( tmp[ !is.na( tmp$target_gene_entrez_gene_id ), "target_gene_entrez_gene_id" ] )
        }
    }
    ## exporting the file again.
    ## write.table( MTI, file="orig_files/txt/miRTarBase_MTI.txt", sep="\t", row.names=FALSE )
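
The quoting issue and the clean-up steps can be tried out on a tiny inline table. This is a minimal sketch with made-up rows (the IDs and values are not taken from miRTarBase), and the header names `miRTarBase ID`, `Experiments` and `Support Type` are assumptions about the exported file.

```r
## Toy input mimicking the export: one cell holds a single quote (5' RACE),
## another a double quote (5" RACE). All values are made up.
txt <- paste(
    "miRTarBase ID\tExperiments\tSupport Type",
    "MIRT000001\t5' RACE//Western blot\tFunctional MTI (WeaK)",
    "MIRT000002\t5\" RACE\tFunctional MTI",
    sep = "\n"
)

## With quote="" the quote characters are kept as ordinary text.
toy <- read.table(text = txt, sep = "\t", header = TRUE, as.is = TRUE,
                  check.names = FALSE, quote = "")

## Same kind of clean-up as above, applied to the toy table.
colnames(toy) <- gsub(" ", "_", tolower(colnames(toy)), fixed = TRUE)
toy$experiments <- gsub("[\"']", "", toy$experiments)
toy$support_type <- gsub("WeaK", "Weak", toy$support_type, fixed = TRUE)
toy
```

Both rows are read as three fields each, which is what the `quote=""` setting is meant to guarantee.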

We could also create the database in a cleaner way, i.e. as a *real* relational database with a table for the miRNA target gene interactions, a table with the evidence (Report) and one relating the two to each other. The question remains whether that would improve its performance.
Note that each publication can have more than one report.

    ## check whether a "report" (i.e. the combination of PubMed ID, experiments
    ## and support type) is unique across all miRNAs or specific for a miRNA.
    Report <- data.frame( matrix( ncol=4, nrow=0 ), stringsAsFactors=FALSE )
    colnames( Report ) <- c( "report_id", "experiments", "support_type", "references_pmid" )
    mti2report <- matrix( ncol=2, nrow=0 )
    colnames( mti2report ) <- c( "mti_id", "report_id" )
    for( i in seq_len( nrow( MTI ) ) ){
        current.rep <- MTI[ i, c( "experiments", "support_type", "references_pmid" ) ]
        ## check if it's in Report
        idx <- which( rownames( Report )==paste0( current.rep[ 1, ], collapse="-" ) )
        if( length( idx ) > 0 ){
            ## have it already
            #cat( "have already the report\n" )
            report.id <- Report[ idx, "report_id" ]
        }else{
            rownames( current.rep ) <- paste0( current.rep[ 1, ], collapse="-" )
            report.id <- paste0( "MTIR", sprintf( "%04d", ( nrow( Report )+1 ) ) )
            current.rep <- cbind( report_id=report.id, current.rep, stringsAsFactors=FALSE )
            Report <- rbind( Report, current.rep )
        }
        ## add it to mti2report.
        mti2report <- rbind( mti2report, c( MTI[ i, 1 ], report.id ) )
    }
    ## at last define the MTI table.
    MTI.table <- unique( MTI[ , !( colnames( MTI ) %in% colnames( Report ) ) ] )

    ## ok, now we do have all tables.
    con <- dbConnect( dbDriver( "SQLite" ), dbname="inst/extdata/db/mirtarbase_rel.db" )
    if( dbExistsTable( con, "mti" ) ){
        dbRemoveTable( con, "mti" )
    }
    if( dbExistsTable( con, "report" ) ){
        dbRemoveTable( con, "report" )
    }
    if( dbExistsTable( con, "mti2report" ) ){
        dbRemoveTable( con, "mti2report" )
    }
    ## mti
    dbWriteTable( con, name="mti", MTI.table, row.names=FALSE )
    dbGetQuery( con, "create index mirna_idx on mti (mirna);" )
    dbGetQuery( con, "create index target_gene_idx on mti (target_gene);" )
    dbGetQuery( con, "create index target_gene_entrez_idx on mti (target_gene_entrez_gene_id);" )
    ## report
    dbWriteTable( con, name="report", Report, row.names=FALSE )
    dbGetQuery( con, "create index report_id_idx on report (report_id);" )
    ## mti2report
    dbWriteTable( con, name="mti2report", data.frame( mti2report, stringsAsFactors=FALSE ), row.names=FALSE )
    dbGetQuery( con, "create index mti2report_report_idx on mti2report (report_id);" )
    dbGetQuery( con, "create index mti2report_mti_idx on mti2report (mti_id);" )

    dbDisconnect( con )
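
The split into three tables can also be written without a row-by-row loop. The following is a rough sketch of such a vectorized alternative on a toy table. It is not the code used to build the package database, all values are made up, and the ID column name `mirtarbase_id` is an assumption.

```r
## Toy MTI table using the column names produced above (values are made up).
MTI <- data.frame(
    mirtarbase_id = c("MIRT000001", "MIRT000001", "MIRT000002"),
    mirna = c("hsa-miR-1", "hsa-miR-1", "hsa-miR-2"),
    target_gene = c("GENE1", "GENE1", "GENE2"),
    experiments = c("Western blot", "Microarray", "Western blot"),
    support_type = c("Functional MTI", "Functional MTI (Weak)", "Functional MTI"),
    references_pmid = c(111, 222, 111),
    stringsAsFactors = FALSE
)
report.cols <- c("experiments", "support_type", "references_pmid")

## One row per distinct report; IDs follow the MTIR0001 pattern used above.
Report <- unique(MTI[, report.cols])
Report <- data.frame(report_id = sprintf("MTIR%04d", seq_len(nrow(Report))),
                     Report, stringsAsFactors = FALSE)
rownames(Report) <- NULL

## Build a key per row from the report columns and look up the report ID.
key <- function(x) do.call(paste, c(x[, report.cols], sep = "|"))
mti2report <- data.frame(
    mti_id = MTI$mirtarbase_id,
    report_id = Report$report_id[match(key(MTI), key(Report))],
    stringsAsFactors = FALSE
)

## The MTI table keeps everything that is not part of a report.
MTI.table <- unique(MTI[, !(colnames(MTI) %in% report.cols)])
```

In this toy case the first and third rows share one report, so `Report` has two rows while `mti2report` keeps one row per input row.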

Next we create the database for the package. This requires a working installation of `SQLite`.

    con <- dbConnect( dbDriver( "SQLite" ), dbname="inst/extdata/db/mirtarbase.db" )
    if( dbExistsTable( con, "mirtarbase" ) ){
        dbRemoveTable( con, "mirtarbase" )
    }
    dbWriteTable( con, name="mirtarbase", MTI, row.names=FALSE )
    dbGetQuery( con, "create index mirna_idx on mirtarbase (mirna);" )
    dbGetQuery( con, "create index target_gene_idx on mirtarbase (target_gene);" )
    dbGetQuery( con, "create index target_gene_entrez_idx on mirtarbase (target_gene_entrez_gene_id);" )
    dbDisconnect( con )

## Development

At present (2014-07-28), miRTarBase can only be exported as an XLS spreadsheet. This file contains one line per MTI and publication, but some rows are missing compared to the online version:

- MTI MIRT000140: online: 2 references, XLS sheet: 1 row (one reference missing in the XLS sheet).
- MTI MIRT001206: online: 8 references, XLS sheet: 3 rows (several rows missing).
- MTI MIRT003413: online: 4 references, XLS sheet: 2 rows (2 rows missing):
  - PMID: 18328430: experiments: Luciferase reporter assay//Microarray//Western blot; evidence: Functional MTI. Online: also Other listed as experiment.
  - PMID: 19422085: experiments: Luciferase reporter assay//Microarray//qRT-PCR//Western blot; evidence: Functional MTI. Online: also Other listed as experiment.

Thus, an MTI between the same miRNA and target gene (for the same species!) is stored in several rows in the XLS sheet. Each row, however, seems to be related to one publication, and the field *experiments* seems to list all experiments performed in that publication.

One question is whether the XLS sheet should be stored as-is in the database, or whether a clean relational database should be created. The benefit of the latter approach would be to allow more specific queries, e.g. all MTIs based on a certain support type.

It would be possible to create a relational database with 3 tables, one describing the MTI, one listing the experiments performed in a publication to validate this MTI and one allowing for possible n:m relations (although it's not clear whether these really exist… yet).
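
As an illustration of such a specific query, here is a minimal sketch that selects all MTIs with a given support type from a three-table layout. It assumes tables named `mti`, `report` and `mti2report` with the columns shown, uses an in-memory database, and all rows are made up.

```r
library(DBI)

## In-memory database with toy rows (all values are made up).
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mti", data.frame(
    mirtarbase_id = c("MIRT000001", "MIRT000002"),
    mirna = c("hsa-miR-1", "hsa-miR-2"),
    target_gene = c("GENE1", "GENE2"),
    stringsAsFactors = FALSE))
dbWriteTable(con, "report", data.frame(
    report_id = c("MTIR0001", "MTIR0002"),
    experiments = c("Western blot", "Microarray"),
    support_type = c("Functional MTI", "Functional MTI (Weak)"),
    references_pmid = c(111, 222),
    stringsAsFactors = FALSE))
dbWriteTable(con, "mti2report", data.frame(
    mti_id = c("MIRT000001", "MIRT000001", "MIRT000002"),
    report_id = c("MTIR0001", "MTIR0002", "MTIR0002"),
    stringsAsFactors = FALSE))

## All MTIs that have at least one report with the requested support type.
res <- dbGetQuery(con, "
    select distinct mti.mirtarbase_id, mti.mirna, mti.target_gene
    from mti
    join mti2report on mti.mirtarbase_id = mti2report.mti_id
    join report on mti2report.report_id = report.report_id
    where report.support_type = 'Functional MTI'
    order by mti.mirtarbase_id;")
dbDisconnect(con)
res
```

With a single flat table the same question can be answered with a plain `where` clause, so the relational layout mainly pays off once reports are shared between MTIs.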

### The MTI class

The MTI class represents a miRNA target gene interaction. There should only be one MTI for a miRNA target gene combination, with multiple pieces of evidence and possibly multiple publications. The unique identifier for an MTI is the identifier used in miRTarBase (e.g. *MIRT001206*).

    class MTI
     L_ class Report
         L_ experiments: lists all experiments that have been performed.
         L_ pmid: returns the PMID of the report.
         L_ supportedBy: lists the support type(s) (evidence).
     L_ reports: returns the Report(s) of the MTI.
     L_ experiments: returns all experiments (of the Report(s)).
     L_ pmids: returns the PMIDs of the Report(s).
     L_ supportedBy
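
A layout like this could be expressed with S4 classes roughly as follows. This is only a sketch: the slot names and the toy values are hypothetical, and the class definitions in the package may differ.

```r
library(methods)

## Hypothetical slot names; the package's real class definitions may differ.
setClass("Report",
         representation(experiments = "character",
                        pmid = "numeric",
                        support_type = "character"))
setClass("MTI",
         representation(id = "character",
                        mature_mirna = "character",
                        target_gene = "character",
                        report = "list"))

## Accessors defined for both classes; the MTI versions collect the
## values from all Report objects stored in the MTI.
setGeneric("pmid", function(object, ...) standardGeneric("pmid"))
setMethod("pmid", "Report", function(object, ...) object@pmid)
setMethod("pmid", "MTI", function(object, ...)
    unlist(lapply(object@report, pmid)))

setGeneric("experiments", function(object, ...) standardGeneric("experiments"))
setMethod("experiments", "Report", function(object, ...) object@experiments)
setMethod("experiments", "MTI", function(object, ...)
    unique(unlist(lapply(object@report, experiments))))

## Toy objects with made-up values.
r1 <- new("Report", experiments = c("Luciferase reporter assay", "Western blot"),
          pmid = 1, support_type = "Functional MTI")
r2 <- new("Report", experiments = c("Microarray", "Western blot"),
          pmid = 2, support_type = "Functional MTI (Weak)")
mti <- new("MTI", id = "MIRT000000", mature_mirna = "hsa-miR-1",
           target_gene = "GENE1", report = list(r1, r2))

pmid(mti)
experiments(mti)
```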

## Changelog

- version 0.2.2 (2014-08-12):
  - `getMti` will use `mclapply` to query the database and process its results if the number of mature miRNA sequences is larger than 20.
- version 0.2.1 (2014-08-01):
  - Added some functions to retrieve annotations from the mirbase package.
  - Added the vignette.
- version 0.2.0 (2014-08-01):
  - Fixed some serious problems in the XLS sheet from miRTarBase. Some MTIs were missed in the previous database versions.
  - Removed functions `getMtiForGene` and `getMtiForMiRNA` and replaced both with `getMti`.
- version 0.1.0 (2014-07-29): it's done: the first *release*. Basic functionality, no vignette, no S4 objects (yet).
- version 0.0.1 (2014-07-28): added database.

## TODOs

### DONE Create the sqlite database file from the tab delimited txt file.

- State "DONE" from "TODO" [2014-07-28 Mon 13:59]

### DONE Create a class representing an interaction (?).

- State "DONE" from "TODO" [2014-07-31 Thu 11:00]

What would be the benefits from this?

- Object oriented approach (which would be helpful if fetching data were not the only function of the package).

Did that. That's the `MTI` object.

### DONE Create some helper functions to retrieve information from the DB.

- State "DONE" from "TODO" [2014-07-28 Mon 15:20]

### DONE Create a txt file with all information (version, date etc.) from miRTarBase and read it when calling mirtarbase()

- State "DONE" from "TODO" [2014-07-28 Mon 15:20]

### DONE Implement the function to get MTIs for a specified gene.

- State "DONE" from "TODO" [2014-07-29 Tue 10:52]

### DONE Create a vignette for this package.

- State "DONE" from "TODO" [2014-08-04 Mon 08:35]

### CANCELED Include also the phenomiR database?

- State "CANCELED" from "DONE" [2014-07-31 Thu 10:59]
  miRTarBase also includes data from other databases and will (hopefully) be updated regularly.
- State "DONE" from "TODO" [2014-07-31 Thu 10:59]

This only makes sense if phenomiR provides additional information.
Will not do that, since the miRTarBase maintainers claim that they provide the most data and also include other databases.

### DONE implement a central `getMti` function.

- State "DONE" from "TODO" [2014-08-01 Fri 11:06]

This function should take either one or more gene or miRNA IDs as input and return a list of MTI objects.
Split the data.frame by miRTarBase IDs and create one MTI for each resulting data.frame, with one Report object per row.
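
The splitting step can be sketched with `split()` and `lapply()`. In this illustration plain lists stand in for the `MTI` and `Report` objects, and the toy rows and the column name `mirtarbase_id` are assumptions.

```r
## Toy query result: one row per MTI and publication (values are made up).
res <- data.frame(
    mirtarbase_id = c("MIRT000001", "MIRT000001", "MIRT000002"),
    mirna = c("hsa-miR-1", "hsa-miR-1", "hsa-miR-2"),
    target_gene = c("GENE1", "GENE1", "GENE2"),
    experiments = c("Western blot", "Microarray", "Western blot"),
    support_type = c("Functional MTI", "Functional MTI (Weak)", "Functional MTI"),
    references_pmid = c(111, 222, 333),
    stringsAsFactors = FALSE
)
report.cols <- c("experiments", "support_type", "references_pmid")

## One data.frame per miRTarBase ID.
by.id <- split(res, res$mirtarbase_id)

## One list (standing in for an MTI object) per ID, holding one
## report entry (standing in for a Report object) per row.
mtis <- lapply(by.id, function(x) {
    list(id = x$mirtarbase_id[1],
         mirna = x$mirna[1],
         target_gene = x$target_gene[1],
         reports = lapply(seq_len(nrow(x)), function(i) as.list(x[i, report.cols])))
})

names(mtis)
length(mtis[["MIRT000001"]]$reports)
```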

### DONE Implement functions to create MTI objects from a data.frame (and *vice versa*). [4/4]

- State "DONE" from "TODO" [2014-07-31 Thu 10:58]

These should be put into *convertfunctions.R*.

- [X] data.frame2report
- [X] report2data.frame
- [X] data.frame2mti
- [X] mti2data.frame

### DONE Implement all methods for the `Report` class [4/4].

- State "DONE" from "TODO" [2014-07-31 Thu 11:46]

- [X] show
- [X] experiments
- [X] supportedBy
- [X] pmid

### DONE Implement all methods for the `MTI` class [12/12].

- State "DONE" from "TODO" [2014-07-31 Thu 11:46]

- [X] show
- [X] id
- [X] reports
- [X] experiments
- [X] supportedBy
- [X] pmid
- [X] matureMirna
- [X] mirnaSpecies
- [X] targetGene
- [X] targetGeneSpecies
- [X] targetGeneEntrezid
- [X] countReports

### DONE Implement a function that retrieves additional miRNA annotations for a MTI.

- State "DONE" from "TODO" [2014-08-01 Fri 15:09]

The idea is to get the pre-miRNA or miRNA family for a given mature miRNA.
This should then also be exported if MTIs are exported as a `data.frame`.

### DONE Implement additional functions [2/2]

- State "DONE" from "TODO" [2014-08-01 Fri 12:03]
- [X] getAvailableExperiments: retrieves a unique list of experiments.
- [X] getPmids: retrieves a unique list of PubMed IDs.
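
Since the *experiments* field lists several experiments separated by `//`, a unique list can be derived by splitting on that separator. The sketch below only illustrates the idea on two example strings; it is not the implementation of `getAvailableExperiments`, and the variable names are hypothetical.

```r
## Two experiments strings in the "//"-separated format of the XLS sheet.
experiments <- c(
    "Luciferase reporter assay//Microarray//Western blot",
    "Luciferase reporter assay//Microarray//qRT-PCR//Western blot"
)

## Split on the literal separator and keep each experiment once,
## in order of first appearance.
available <- unique(unlist(strsplit(experiments, "//", fixed = TRUE)))
available
```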