https://github.com/maastrichtu-ids/bio2rdf
⚕️ Bio2RDF is an open-source project that uses Semantic Web technologies to build and provide the largest network of Linked Data for the Life Sciences.
- Host: GitHub
- URL: https://github.com/maastrichtu-ids/bio2rdf
- Owner: MaastrichtU-IDS
- Created: 2019-07-03T11:40:00.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2020-11-18T11:09:35.000Z (about 4 years ago)
- Last Synced: 2024-04-18T08:58:57.024Z (10 months ago)
- Topics: bio2rdf, linked-data, rdf
- Language: Common Workflow Language
- Homepage:
- Size: 44.2 MB
- Stars: 4
- Watchers: 5
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# CWL workflows for RDF Bio2RDF conversion
The [Common Workflow Language](https://www.commonwl.org/) (CWL) is used to describe workflows that transform heterogeneous structured data (CSV, TSV, RDB, XML, JSON) to the [Bio2RDF](http://bio2rdf.org/) RDF vocabularies. The user defines [SPARQL queries](https://github.com/MaastrichtU-IDS/bio2rdf/blob/master/mapping/pharmgkb/drugs.rq) to transform the generic RDF generated from the input data structure (by AutoR2RML or xml2rdf) into the target Bio2RDF vocabulary.
* Install [Docker](https://docs.docker.com/install/) to run the modules.
* Install [cwltool](https://github.com/common-workflow-language/cwltool#install) to get `cwl-runner`, which runs workflows of Docker modules (a quick install check is sketched below).
* Those workflows use Data2Services modules; see the [data2services-pipeline](https://github.com/MaastrichtU-IDS/data2services-pipeline) project.
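A quick way to check the prerequisites, assuming Python 3 and `pip` are available (the `cwltool` package provides the `cwl-runner` entry point):

```shell
# Install cwltool, which provides the cwl-runner entry point
pip install cwltool
# Verify that both tools are on the PATH
docker --version
cwl-runner --version
```

---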
## Clone the bio2rdf project
```shell
git clone https://github.com/MaastrichtU-IDS/bio2rdf
```
## Go to the cloned project
```shell
cd bio2rdf
```

## Define absolute path and dataset folder as environment variables
In terminal type:
```shell
export ABS_PATH=/path/to/your/bio2rdf/folder
export DATASET=dataset_name
```
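For example, assuming the repository was cloned to `/data/bio2rdf` (the path used in the examples below) and you want to process the `pharmgkb` dataset:

```shell
export ABS_PATH=/data/bio2rdf
export DATASET=pharmgkb
```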
The dataset folders inside the `input`, `output`, `mapping`, and `support` directories must have the same name as the `DATASET` variable.

## Start services
[Apache Drill](https://github.com/amalic/apache-drill) and [GraphDB](https://github.com/MaastrichtU-IDS/graphdb/) services must be running before executing CWL workflows.
```shell
# Start Apache Drill sharing volume with this repository.
# Here shared locally at /data/bio2rdf
docker run -dit --rm -v ${ABS_PATH}:/data:ro -p 8047:8047 -p 31010:31010 --name drill umids/apache-drill
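
# Once the container is up, the Drill web UI should respond on port 8047
# (optional sanity check; /status returns "Running!" when Drill is healthy)
curl -s http://localhost:8047/status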
```

## Run your choice of triple store
#### For GraphDB:
* GraphDB needs to be built locally. To do so:
* Download GraphDB as a stand-alone server free version (zip): https://ontotext.com/products/graphdb/.
* Put the downloaded `.zip` file in the GraphDB repository (cloned from [GitHub](https://github.com/MaastrichtU-IDS/graphdb/)).
* Run `docker build -t graphdb --build-arg version=CHANGE_ME .` in the GraphDB repository.

```shell
# GraphDB needs to be downloaded manually and built.
# Here shared locally at /data/graphdb and /data/graphdb-import
docker build -t graphdb --build-arg version=8.11.0 .
docker run -d --rm --name graphdb -p 7200:7200 -v ${ABS_PATH}/graphdb:/opt/graphdb/home -v ${ABS_PATH}/graphdb-import:/root/graphdb-import graphdb
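
# The GraphDB Workbench should now be reachable at http://localhost:7200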
```

#### For Virtuoso:
```shell
docker run --rm --name virtuoso \
-p 8890:8890 -p 1111:1111 \
-e DBA_PASSWORD=dba \
-e SPARQL_UPDATE=true \
-e DEFAULT_GRAPH=http://bio2rdf.com/bio2rdf_vocabulary: \
-v ${ABS_PATH}/virtuoso:/data \
-v /tmp:/tmp \
-d tenforce/virtuoso:1.3.2-virtuoso7.2.1
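
# The Virtuoso SPARQL endpoint should now be available at http://localhost:8890/sparql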
```

---
## Prepare workflow
### For GraphDB:

### Create a GraphDB repository
After starting the GraphDB Docker container, go to http://localhost:7200/ and create a new repository that will store your data.
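Alternatively, a repository can be created from the command line through GraphDB's REST API; a minimal sketch, assuming a repository configuration file `repo-config.ttl` (a hypothetical name; see the GraphDB documentation for its format):

```shell
# Create a GraphDB repository from a repository-config Turtle file
curl -X POST http://localhost:7200/rest/repositories \
     -H "Content-Type: multipart/form-data" \
     -F "config=@repo-config.ttl"
```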
### Modify the workflow config file
Navigate to the folder of the dataset you want to convert, following this path: `support/<dataset_name>/graphdb-workflow/config-workflow.yml`.
Edit `config-workflow.yml` and set the following parameters depending on your environment:
* `sparql_triplestore_url`: the URL of the running GraphDB instance
* `sparql_triplestore_repository`: the repository name you created in the previous step
* `working_directory`: the path of the `bio2rdf` folder
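For example, for a GraphDB instance running locally (the repository id `bio2rdf` is a placeholder; if the workflow runs on the same Docker network as GraphDB, use the container name instead, e.g. `http://graphdb:7200`):

```yaml
sparql_triplestore_url: http://localhost:7200
sparql_triplestore_repository: bio2rdf
working_directory: /data/bio2rdf
```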
### For Virtuoso:
Edit the config file in the folder of the dataset you want to convert, following this path: `support/<dataset_name>/virtuoso-workflow/config-workflow.yml`.
* Change the working directory to the `bio2rdf` cloned project:
  * `working_directory`: the path of the `bio2rdf` folder
* Change these parameters according to your environment if you are using a different Virtuoso endpoint:
  * `sparql_triplestore_url: http://virtuoso:8890/sparql`
  * `triple_store_password: dba`
  * `triple_store_username: dba`

## Run with [CWL](https://www.commonwl.org/)
* Go to the `bio2rdf` root folder (the root of the cloned repository), e.g. `/data/bio2rdf`, to run the CWL workflows.
* You will need to put the SPARQL mapping queries in `/mapping/$dataset_name` and provide 3 parameters:
  * `--outdir`: the [output directory](https://github.com/MaastrichtU-IDS/bio2rdf/tree/master/output/pharmgkb) for files produced by the workflow (except the downloaded source files, which go automatically to `/input`).
    * e.g. `${ABS_PATH}/output/${DATASET}`
  * The `.cwl` [workflow file](https://github.com/MaastrichtU-IDS/bio2rdf/blob/master/support/pharmgkb/virtuoso-workflow/workflow.cwl)
    * e.g. `${ABS_PATH}/support/${DATASET}/virtuoso-workflow/workflow.cwl`
  * The `.yml` [configuration file](https://github.com/MaastrichtU-IDS/bio2rdf/blob/master/support/pharmgkb/virtuoso-workflow/config-workflow.yml) with all parameters required to run the workflow
    * e.g. `${ABS_PATH}/support/${DATASET}/virtuoso-workflow/config-workflow.yml`
* 3 types of workflows can be run depending on the input data:
  * Convert XML to RDF
  * Convert CSV to RDF
  * Convert CSV to RDF and split a property

### Convert CSV/TSV with [AutoR2RML](https://github.com/amalic/autor2rml)
```shell
cwl-runner --outdir ${ABS_PATH}/output/${DATASET} ${ABS_PATH}/support/${DATASET}/virtuoso-workflow/workflow.cwl ${ABS_PATH}/support/${DATASET}/virtuoso-workflow/config-workflow.yml
```

See the [config file](https://github.com/MaastrichtU-IDS/bio2rdf/blob/master/support/sider/virtuoso-workflow/config-workflow.yml):
```yaml
working_directory: /media/apc/DATA1/Maastricht/bio2rdf-github
dataset: sider
# R2RML params
input_data_jdbc: "jdbc:drill:drillbit=drill:31010"
sparql_tmp_graph_uri: "https://w3id.org/data2services/graph/autor2rml/sider"
sparql_base_uri: "https://w3id.org/data2services/"
#autor2rml_column_header: "test1,test2,test3,test5"

# RdfUpload params
sparql_triplestore_url: http://graphdb:7200
sparql_triplestore_repository: sider
sparql_transform_queries_path: /data/mapping/sider/transform
sparql_insert_metadata_path: /data/mapping/sider/metadata
```

### Convert CSV/TSV with [AutoR2RML](https://github.com/amalic/autor2rml) and split a property (Will be available shortly)
### Run in the background
Use `nohup` to run the workflow in the background; all terminal output will be written to `nohup.out`.
```shell
nohup cwl-runner --outdir ${ABS_PATH}/output/${DATASET} ${ABS_PATH}/support/${DATASET}/virtuoso-workflow/workflow.cwl ${ABS_PATH}/support/${DATASET}/virtuoso-workflow/config-workflow.yml &
```
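To follow the progress while the workflow runs:

```shell
tail -f nohup.out
```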