Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/arnaudbelcour/rhea_mapper
[Not maintained and not updated] Draft metabolic network reconstruction using Rhea database, Uniprot and SPARQL queries.
https://github.com/arnaudbelcour/rhea_mapper
metabolic-network sparql-query
Last synced: 4 days ago
JSON representation
[Not maintained and not updated] Draft metabolic network reconstruction using Rhea database, Uniprot and SPARQL queries.
- Host: GitHub
- URL: https://github.com/arnaudbelcour/rhea_mapper
- Owner: ArnaudBelcour
- License: gpl-3.0
- Created: 2020-10-19T15:17:39.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2023-04-27T12:17:29.000Z (over 1 year ago)
- Last Synced: 2024-11-10T01:48:02.370Z (2 months ago)
- Topics: metabolic-network, sparql-query
- Language: Python
- Homepage:
- Size: 112 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# rhea_mapper
**WARNING:**
**This tool was a test during my PhD. Currently it is not working and I will not update it, so don't use it. It is kept for test purpose.**
Rhea_mapper is a Python3 (Python >= 3.6) tool to create draft metabolic network from genomes, sparql query on Rhea or Uniprot and from tsv files containing list of Rhea reactions or Uniprot protein IDs.
## Table of contents
- [rhea_mapper](#rhea_mapper)
- [Table of contents](#table-of-contents)
- [License](#license)
- [Requirements](#requirements)
- [Installation with pip](#installation-with-pip)
- [Usage](#usage)
- [database](#database)
- [genome](#genome)
- [sparql](#sparql)
- [taxonomy](#taxonomy)
- [from_file](#from_file)## License
This project is licensed under the GNU General Public License - see the [LICENSE](https://github.com/ArnaudBelcour/rhea_mapper/blob/master/LICENSE) file for details.
## Requirements
Python 3 (Python 3.6 is tested). rhea_mapper uses a certain number of Python dependencies:
* [biopython](https://github.com/biopython/biopython) to handle fasta files.
* [cobra](https://github.com/opencobra/cobrapy) to create sbml files.
* [rdflib](https://github.com/RDFLib/rdflib) to load rhea.rdf and query it.
* [sparqlwrapper](https://github.com/RDFLib/sparqlwrapper) to query Uniprot and Rhea sparql endpoint.And another tool:
* [OrthoFinder](https://github.com/davidemms/OrthoFinder) to find orthologs to uniprot with experimental evidence linked to rhea.## Installation with pip
```
pip install rhea_mapper
```## Usage
rhea_mapper has 4 main commands: genome, sparql, database and from_file.
````
usage: rhea_mapper [-h] [-v] {database,genome,sparql,taxonomy,from_file} ...Create Rhea sbml files from genomes. For specific help on each subcommand use:
rhea_mapper {cmd} --helpoptional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exitsubcommands:
valid subcommands:{database,genome,sparql,taxonomy,from_file}
database Create rhea_mapper database [internet connection
required]
genome metabolic network reconstruction from genome
sparql Using a SPARQL query for reaction ID on Rhea or
protein ID on Uniprot, create the corresponding
metabolic network
taxonomy With a taxonomy, query Uniprot to create the
corresponding metabolic network
from_file Using a tsv file with list of Rhea reaction or Uniprot
protein, create the corresponding metabolic networkRequires: Orthofinder for genome if you provide a fasta
````### database
This command downloads and creates the files needed by rhea_mapper to work. As it downlaods files, this funciton requires an internet connection.
By using the RDF file of Rhea, rhea_mapper will create a SBMl file, with some specificity:
- the direction of reactions can be: arbitrary, follow MetaCyc reaction direction or follow KEGG reaction direction.
- for the variable stoichiometric coefficient (n, n-1, n+1, 2n) in reaction, n has been set to 2.
- genes linked to Rhea reactions with experimental evidence are associated to these reactions.````
usage: rhea_mapper database [-h] -d DATABASECreate rhea_mapper database [internet connection required]
optional arguments:
-h, --help show this help message and exit
-d DATABASE, --database DATABASE
Database folder, if not existing it will be created
````It will create an output_folder containing:
- ncbi_taxonomy.dmp: taxonomy file from the NCBI containing for each taxon ID the corresponding taxon name.
- rhea.rdf: the RDF file of Rhea database.
- rhea2ec.tsv: mapping between Rhea reaction and EC number.
- rhea2uniprot_sprot.tsv: mapping between Rhea reaction and Uniprot protein ID.
- uniprot_sprot.fasta: fasta containg all amino-acids sequences from Swissprot.
- uniprot_ec_evidence.tsv: tsv file containing Uniprot protein ID linked to EC number with experimental evidence.
- uniprot_go_evidence.tsv: tsv file containing Uniprot protein ID linked to GO terms with experimental evidence.
- uniprot_rhea_evidence.tsv: tsv file containing Uniprot protein ID linked to Rhea reaction with experimental evidence.
- uniprot_rhea_evidence.fasta: fasta file containing protein linked to Rhea reaction with experimental evidence.
- rhea.sbml: SBML file created by rhea_mapper using the rhea.rdf file and the uniprot_rhea_evidence.tsv (to have gene association).
- version.json: contains the version number of Rhea database and Uniprot database in this folder.Rhea files come from the [Rhea download page](ftp://ftp.expasy.org/databases/rhea/rdf/).
uniprot_sprot.fasta is downloaded from the [Uniprot download page](https://www.uniprot.org/downloads). It is the `Reviewed (Swiss-Prot)`.
uniprot_rhea_evidence.* files are created with a SPARQL query on the [Uniprot SPARQL endpoint](https://sparql.uniprot.org/sparql).
### genome
This genome takes as input folders with subfolders containing a tsv file with annotation (gene linked to EC number) and a fasta file (proteome of the species of interest).
Then it will map the EC to Rhea reactions.
in a second step it will take the proteome to search for orthologs against a database of Uniprot protein linked to Rhea reactions with experimental evidence.
Then a sbml file will be created.
````
usage: rhea_mapper genome [-h] [-f FOLDER] -o OUTPUT -d DATABASE [-c CPU]metabolic network reconstruction from genome
optional arguments:
-h, --help show this help message and exit
-f FOLDER, --folder FOLDER
Input folder containing either annotation, genomes or
both
-o OUTPUT, --output OUTPUT
Output folder
-d DATABASE, --database DATABASE
Database folder, if not existing it will be created
-c CPU, --cpu CPU cpu number for multi-process
````The structure of the input_folder is:
````
input_folder
├── organism_1
│ └── organism_1.gbk
├── organism_2
│ └── organism_2.tsv
│ └── organism_2.fasta
├── organism_3
│ └── organism_3.tsv
├── organism_4
│ └── organism_4.fasta
..
└── organism_n
└── organism_n.gbk
````GenBank files will be converted into the format use by `rhea_mapper genome` meaning a tsv file + a fasta file. It is also possible to give only a fasta file or a tsv file.
The fasta file must contain amino-acids sequences.
The tsv file must be structured like:
| gene | EC_Number |
|--------|-----------|
| gene_1 | X.X.X.X |
| ... | |
| gene_n | X.X.X.X |### sparql
Using a sparql query, rhea_mapper will query either a rhea rdf file from its database, the uniprot sparql endpoint or the rhea sparql endpoint.
A Uniprot query must contain the predicate of protein Id from uniprot (`?predicate a up:Protein`). The set of proteins find by the sparql query will then be used to create a draft metabolic network. Rhea_mapper will use the reconstruction method of `genome` to have a sbml file.
For the Rhea query, it must a set of Rhea reactions (`?predicate rdfs:subClassOf rh:Reaction`). Then using these IDs, rhea_mapper will create the corresponding sbml files.
````
usage: rhea_mapper sparql [-h] [-s SPARQL] -e ENDPOINT -d DATABASE -o OUTPUT
[-c CPU]Using a SPARQL query for reaction ID on Rhea or protein ID on Uniprot, create
the corresponding metabolic networkoptional arguments:
-h, --help show this help message and exit
-s SPARQL, --sparql SPARQL
SPARQL query file
-e ENDPOINT, --endpoint ENDPOINT
Endpoint either 'uniprot' or 'rhea' or 'rhea_endpoint
-d DATABASE, --database DATABASE
Database folder, if not existing it will be created
-o OUTPUT, --output OUTPUT
Output folder
-c CPU, --cpu CPU cpu number for multi-process
````### taxonomy
From an organism name or a taxonomy group, rhea_mapper will query Uniprot to retrieve the proteins associated to this organism/taxonomy and will create a draft metabolic network form it.
````
usage: rhea_mapper taxonomy [-h] -d DATABASE -o OUTPUT [-c CPU]
[--organism ORGANISM] [--taxonomy TAXONOMY]With a taxonomy, query Uniprot to create the corresponding metabolic network
optional arguments:
-h, --help show this help message and exit
-d DATABASE, --database DATABASE
Database folder, if not existing it will be created
-o OUTPUT, --output OUTPUT
Output folder
-c CPU, --cpu CPU cpu number for multi-process
--organism ORGANISM Name of organism or group of organism to query uniprot
for Protein
--taxonomy TAXONOMY TSV file with taxonomy name of organism
````### from_file
Using a tsv file with either Rhea reactions ID or Uniprot protein IDs, rhea_mapper will create draft metabolic network.
For the Rhea reactions, it will create a sbml file.
For the Uniprot protein, it will use the reconstruction method of `genome` to have a sbml file.
````
usage: rhea_mapper from_file [-h] -l LIST -d DATABASE -o OUTPUT [-c CPU] -r
REFERENCEUsing a tsv file with list of Rhea reaction or Uniprot protein, create the
corresponding metabolic networkoptional arguments:
-h, --help show this help message and exit
-l LIST, --l LIST File containing list of Rhea reacitons IDs or protein
IDs
-d DATABASE, --database DATABASE
Database folder, if not existing it will be created
-o OUTPUT, --output OUTPUT
Output folder
-c CPU, --cpu CPU cpu number for multi-process
-r REFERENCE, --reference REFERENCE
Database for reference either 'uniprot' or 'rhea'
````