# rg.databio.org server overview

This repository contains the files to build and archive genome assets to serve with [refgenieserver](https://github.com/refgenie/refgenieserver) at http://rg.databio.org.

The whole process is scripted, starting from this repository. From here, we do this basic workflow:

1. Download raw input files for assets (FASTA files, GTF files etc.)
2. Configure refgenie
3. Build assets with `refgenie build` in a local refgenie instance
4. Archive assets with `refgenieserver archive`
5. Upload archives to S3
6. Deploy assets to the active server on AWS

# Adding an asset to this server

## Overview of metadata structure

The metadata is located in the [asset_pep](asset_pep) folder, which contains a [PEP](https://pep.databio.org) with metadata for each asset. The contents are:

- `assets.csv` - The primary sample_table. Each row is an asset.
- `recipe_inputs.csv` - The subsample_table. This provides a way to define each individual value passed to any of the 3 arguments of the `refgenie build` command: `--assets`, `--params`, and `--files`.
- `refgenie_build_cfg.yaml` - The config file, which defines a subproject (used to download the input data) and additional project settings.
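
For orientation, here is a minimal sketch of how a PEP config can tie these tables together (the real `refgenie_build_cfg.yaml` in this repository contains additional settings and the `getfiles` subproject):

```yaml
pep_version: 2.0.0
sample_table: assets.csv            # one asset per row
subsample_table: recipe_inputs.csv  # recipe inputs, matched to assets by sample_name
```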

## Step 1: Add the asset to the asset table

To add an asset, you will need to add a row in `assets.csv`. Follow these directions:

- `genome` - the human-readable genome (namespace) you want to serve this asset under
- `asset` - the human-readable asset name you want to serve this asset under. It must be identical to the name of the asset recipe. Use `refgenie list` to see [available recipes](http://refgenie.databio.org/en/latest/build/)

Your asset will be retrievable from the server with `refgenie pull {genome}/{asset_name}`.
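
For example, to serve a `bowtie2_index` asset under the `hg38` namespace, the new row would look something like this (a hypothetical excerpt; the real table may contain additional columns):

```
genome,asset
hg38,bowtie2_index
```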

## Step 2: Add any required inputs to the recipe_inputs table

Next, we need to add the source for each item required by your recipe. You can see what the recipe requires by using `-q` or `--requirements`, like this: `refgenie build {genome}/{recipe} -q`. If your recipe doesn't require any inputs, then you're done. If it requires any inputs (which can be one or more of the following: _assets_, _files_, _parameters_), then you need to specify these in the `recipe_inputs.csv` table.
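
For instance, to see what a (hypothetical) `bowtie2_index` asset for `hg38` needs:

```
# prints the recipe's required inputs (bowtie2_index requires a fasta asset)
refgenie build hg38/bowtie2_index -q
```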

For each required input, you add a row to `recipe_inputs.csv`. Follow these directions:

- `sample_name` - must match the `genome` and `asset` values in the `assets.csv` file, formatted this way: `{genome}-{asset}`. This is how we match inputs to assets.

Next you will need to fill in 3 columns:

- `input_type` - one of the following: _files_, _params_, or _assets_
- `input_id` - must match the recipe requirement. Again, use `refgenie build {genome}/{recipe} -q` to learn the ids
- `input_value` - the value for the input, e.g. a URL in the case of _files_
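
Putting it together, a `fasta` asset for `hg38` might be wired up like this (hypothetical rows; the URL is a placeholder):

```
sample_name,input_type,input_id,input_value
hg38-fasta,files,fasta,http://example.com/hg38.fa.gz
```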

## Step 3: See if you did it right!

**Validate the PEP with [`eido`](http://eido.databio.org/en/latest/)**

The command below (run from the `asset_pep` directory) validates the PEP against a remote schema. Any PEP issues will result in a `ValidationError`:

```
eido validate refgenie_build_cfg.yaml -s http://schema.databio.org/refgenie/refgenie_build.yaml
```

# Deploying assets onto the server

## Setup

In this guide we'll use environment variables to keep track of where stuff goes.

- `BASEDIR` points to our parent folder where we'll do all the building/archiving
- `GENOMES` points to pipeline output (referenced in the project config)
- `REFGENIE_RAW` points to a folder where the downloaded raw files are kept
- `REFGENIE` points to the refgenie config file
- `REFGENIE_ARCHIVE` points to the location where we'll store the actual archives

```
export SERVERNAME=rg.databio.org
export BASEDIR=$PROJECT/deploy/$SERVERNAME
export GENOMES=$BASEDIR/genomes
export REFGENIE_RAW=/project/shefflab/www/refgenie_$SERVERNAME
export REFGENIE=$BASEDIR/$SERVERNAME/config/refgenie_config.yaml
export REFGENIE_ARCHIVE=$GENOMES/archive
mkdir -p $BASEDIR
cd $BASEDIR
```

To start, clone this repository:

```
git clone git@github.com:refgenie/$SERVERNAME.git
```

## Step 1: Download input files

Many of the assets require input files, and we have to make sure those files are available locally. In the `recipe_inputs.csv` file, we have entered these files as remote URLs, so the first step is to download them. We have created a subproject called `getfiles` for this purpose. To programmatically download all the files required by `refgenie build`, run the following using [looper](http://looper.databio.org):

```
cd $SERVERNAME
mkdir -p $REFGENIE_RAW
looper run asset_pep/refgenie_build_cfg.yaml -p local --amend getfiles --sel-attr asset --sel-incl fasta
```

Check the status with `looper check`, optionally adding `--itemized` for per-sample results:

```
looper check asset_pep/refgenie_build_cfg.yaml --amend getfiles --sel-attr asset --sel-incl fasta
```

## Step 2: Refgenie genome configuration file initialization

This repository comes with a genome configuration file already defined in the [`config`](config) directory, but if you have not initialized refgenie yet or want to start over, you can initialize the config like this:

```
refgenie init -c $REFGENIE -f $GENOMES -u http://awspds.refgenie.databio.org/rg.databio.org/ -a $REFGENIE_ARCHIVE -b refgenie_config_archive.yaml
```
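
With `$REFGENIE` exported as above, you can sanity-check the fresh config by listing the local assets (the list will be empty at this point):

```
refgenie list -c $REFGENIE
```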

## Step 3: Build assets

Once files are present locally, we can run `refgenie build` on each asset specified in the sample_table (`assets.csv`). We have to submit fasta assets first:

### Option A: Leveraging [_MapReduce_](http://refgenie.databio.org/en/latest/build/#build-assets-concurrently) programming model for concurrent builds

Since we're about to build multiple assets concurrently, we will first build the assets with the `--map` option, which stores the metadata for each build in a separate, newly created genome configuration file. This avoids any conflicts between concurrent asset builds.

Subsequently, we'll run `refgenie build` with the `--reduce` option to combine the metadata into a single genome configuration file.

Refgenie _doesn't_ account for asset dependencies. Therefore, since we have assets that depend on other assets, we need to take care of the dependencies ourselves:

1. `refgenie build --map` all fasta assets to establish genome namespaces
2. Wait until jobs are completed, call `refgenie build --reduce`
3. `refgenie build --map` all other top-level assets, e.g. fasta_txome, gencode_gtf
4. Wait until jobs are completed, call `refgenie build --reduce`
5. `refgenie build --map` all derived assets, e.g. bowtie2_index, bwa_index
6. Wait until jobs are completed, call `refgenie build --reduce`

```
looper run asset_pep/refgenie_build_cfg.yaml -p bulker_slurm --sel-attr asset --sel-incl fasta
```

This will create one job for each _asset_. Monitor job progress with `looper check`:

```
looper check asset_pep/refgenie_build_cfg.yaml --sel-attr asset --sel-incl fasta --itemized
```

The _Reduce_ procedure is quick, so there's no need to submit the job to the cluster; just run it locally:

```
refgenie build --reduce
```

This takes care of the first two points; repeat the map/reduce steps above for the remaining asset groups, as shown below.
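
For example, the second round (steps 3 and 4) reuses the same commands with a different asset selection; assuming `fasta_txome` is one of your top-level assets:

```
looper run asset_pep/refgenie_build_cfg.yaml -p bulker_slurm --sel-attr asset --sel-incl fasta_txome
# wait for the jobs to finish (monitor with looper check), then combine the metadata:
refgenie build --reduce
```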

### Option B: Building _all_ assets with [Snakemake](https://snakemake.readthedocs.io/en/stable/)

Alternatively, you can use the Snakemake workflow in the [`snakemake_workflow`](./snakemake_workflow) directory. This workflow uses Snakemake's inherent rule dependency handling to encode the `refgenie build` asset dependencies.

#### Configuration

**Genome and assets**

By default, all the genomes and all the assets specified in the asset PEP will be built. This can be restricted using the Snakemake workflow configuration file ([`config.yaml`](./snakemake_workflow/config.yaml)); to build everything, leave [`config.yaml`](./snakemake_workflow/config.yaml) empty.

To specify which genomes to build you need to specify them as a list in [`config.yaml`](./snakemake_workflow/config.yaml), like so:

```yaml
genomes_to_process:
- hg38
- mm10
```

To specify which assets to exclude from building you need to specify them as a list in [`config.yaml`](./snakemake_workflow/config.yaml), like so:

```yaml
assets_to_exclude:
- bwa_index
- ensembl_gtf
```

In addition to the config file, these values [can be overridden via the command line](https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#standard-configuration).
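
For example, to restrict a run to `hg38` without editing the file, something like this should work (recent Snakemake versions parse `--config` values into Python literals):

```
snakemake reduce_all --profile slurm --jobs 8 --config 'genomes_to_process=["hg38"]'
```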

**Compute resources**

There is a pre-configured [SLURM Snakemake profile](https://github.com/Snakemake-Profiles/slurm) included in this repository, which specifies the default SLURM settings; these are adjusted on the fly based on the asset/genome characteristics. To use it, specify the profile with the `--profile slurm` option.

You also need to set the maximum number of cluster jobs running in parallel with `--jobs`.

#### Inspect assets to be built

You can generate a [DAG](https://snakemake.readthedocs.io/en/stable/tutorial/basics.html#step-4-indexing-read-alignments-and-visualizing-the-dag-of-jobs) of the assets to be built with the `snakemake --dag` command:

```
snakemake reduce_all --dag | dot -Tsvg > dag.svg
```
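
A plain dry run is also useful to list the jobs that would be submitted, without executing anything:

```
snakemake reduce_all --dry-run
```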

#### Execution

To execute the Snakemake workflow, which will submit the jobs to the cluster, run the following:

```
cd snakemake_workflow
snakemake reduce_all --profile slurm --jobs 8
```

where `reduce_all` is the name of the target rule to execute.

## Step 4. Archive assets

Assets are built locally now, but to serve them, we must archive them using `refgenieserver`. The general command is `refgenieserver archive -c <path/to/config.yaml>`. Since the archive process is generally lengthy, it makes sense to submit this job to a cluster. We can use looper to do that.

To start over completely, remove the archive config file with:

```
rm config/refgenie_config_archive.yaml
```

Then submit the archiving jobs with `looper run`:

```
looper run asset_pep/refgenieserver_archive_cfg.yaml -p bulker_local --sel-attr asset --sel-incl fasta
```

Check progress with `looper check`:

```
looper check asset_pep/refgenieserver_archive_cfg.yaml --sel-attr asset --sel-incl fasta
```

## Step 5. Upload archives to S3

Now the archives should be built, so we'll sync them to AWS. Use the refgenie credentials (here selected with `--profile refgenie`, which should be preconfigured with `aws configure`):

```
aws s3 sync $REFGENIE_ARCHIVE s3://awspds.refgenie.databio.org/rg.databio.org/ --profile refgenie
```
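
If you want to preview the transfer first, `aws s3 sync` supports a `--dryrun` flag that lists the operations without performing them:

```
aws s3 sync $REFGENIE_ARCHIVE s3://awspds.refgenie.databio.org/rg.databio.org/ --profile refgenie --dryrun
```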

## Step 6. Deploy server

Now everything is ready to deploy. If using refgenieserver directly, you'd run `refgenieserver serve config/refgenieserver_archive_cfg`. We're hosting this repository on AWS and use GitHub Actions to trigger deploy jobs that push the updates to AWS ECS whenever a change is detected in the config file.

```
# presumably shell aliases for: git add -A; git commit -m "Deploy to ECS"; git push origin HEAD
ga -A; gcm "Deploy to ECS"; gpoh
```