https://github.com/databio/databio_genomes

A list of lab genome assets to be built with refgenie
https://github.com/databio/databio_genomes

Last synced: 4 months ago
JSON representation

A list of lab genome assets to be built with refgenie

Host: GitHub
URL: https://github.com/databio/databio_genomes
Owner: databio
Created: 2019-08-05T16:28:42.000Z (almost 7 years ago)
Default Branch: master
Last Pushed: 2020-06-18T17:02:30.000Z (about 6 years ago)
Last Synced: 2025-01-15T15:08:42.934Z (over 1 year ago)
Size: 110 KB
Stars: 0
Watchers: 8
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# Databio genomes overview

This repository contains the files to build and archive our labs's reference genome assets to serve with [`refgenieserver`](https://github.com/databio/refgenieserver) at http://refgenomes.databio.org.

The whole process is scripted, starting from this repository. From here, we download the input data (FASTA files, GTF files etc.), use `refgenie build` to create all of these assets in a local refgenie instance, and then use `refgenieserver archive` to build the server archives, and finally serve them with a refgenieserver instance by calling `refgenieserver serve`.

# Asset PEP

The [asset_pep](asset_pep) folder contains a [PEP](https://pep.databio.org) with metadata for each asset. The contents are:

- `assets.csv` - The primary sample_table. Each each row is an asset.
- `recipe_inputs.csv` - The subsample_table. This provides a way to define each individual value passed to any of the 3 arguments of the `refgenie build` command: `--assets`, `--params`, and `--files`.
- `refgenie_build_cfg.yaml` -- config file that defines a subproject (which is used to download the input data) and additional project settings.

Below are instructions for: 1) adding a new asset to this PEP, which will deploy that asset at http://refgenomes.databio.org; 2) processing this PEP to build, archive, and deploy on the server.

## Adding an asset to this PEP

### Step 1: Add the asset to the asset table.

To add an asset, you will need to add a row in `assets.csv`. Follow these directions:

- `genome` - the human-readable genome (namespace) you want to serve this asset under
- `asset` - the human-readble asset name you want to serve this asset under. It is identical to the asset recipe. Use `refgenie list` to see [available recipes](http://refgenie.databio.org/en/latest/build/)

Your asset will be retrievable from the server with `refgenie pull {genome}/{asset_name}`.

### Step 2: Add any required inputs to the recipe_inputs table

Next, we need to add the source for each item required by your recipe. You can see what the recipe requires by using `-q` or `--requirements`, like this: `refgenie build {genome}/{recipe} -q`. If your recipe doesn't require any inputs, then you're done. If it requires any inputs (which can be one or more of the following: *assets*, *files*, *parameters*), then you need to specify these in the `recipe_inputs.csv` table.

For each required input, you add a row to `recipe_inputs.csv`. Follow these directions:
- `sample_name` - must match the `genome` and `asset` value in the `assets.csv` file. Format it this way: `-`. This is how we match inputs to assets.

Next you will need to fill in 3 columns:
- `input_type` which is one of the following: *files*, *params* or *assets*
- `intput_id` must match the recipe requirement. Again, use `refgenie build / -q` to learn the ids
- `input_value` value for the input, e.g. URL in case of *files*

### Step 3: See if you did it well!

**Validate the PEP with [`eido`](http://eido.databio.org/en/latest/)**

The command below validates the PEP aginst a remote schema. Any PEP issues will result in a `ValidationError`:

```
eido validate refgenie_build_cfg.yaml -s http://schema.databio.org/refgenie/refgenie_build.yaml
```

## Building assets using this PEP

### Step 1: Download input files

Many of the assets require some input files, and we have to make sure we have those files locally. In the `recipe_inputs.csv` file, we have entered these files as remote URLs, so the first step is to download them. We have created a subproject called `getfiles` for this: To programmatically download all the files required by `refgenie build`, run from this directory using [looper](http://looper.databio.org):

```
looper run refgenie_build_cfg.yaml -p local --amend getfiles
```

### Step 2: Build assets

Once files are present locally, we can run `refgenie build` on each asset specified in the sample_table (`assets.csv`):

```
looper run refgenie_build_cfg.yaml
```

This will create one job for each *asset*.

### Step 3. Archive assets

Assets are built locally now, but to serve them, we must archive them using `refgenieserver`. The command is simple:

```
refgenieserver archive -c
```

Since the archivization process is generally lengthy, it makes sense to submit this job to the cluster. Since you have [divvy](http://divvy.databio.org/en/latest/) installed (with looper), you can easily create a SLURM submission script with `divvy write`:

```
divvy write -o archive_job.sbatch --code 'refgenieserver archive -c ' ...
```
for example:
```
divvy write -o archive_job.sbatch \
--code 'refgenieserver archive -c $PROJECT/genomes_staging/genomes.yaml' \
--mem 12000 \
--cores 8 \
--logfile $HOME/refgenieserver_archive.log \
--jobname refgenieserver_archive \
--time 01-00:00:00
```
and submit it with:
```
sbatch archive_job.sbatch
```

### Step 4. Serve assets

```
refgenieserver serve genomes.yaml
```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/databio/databio_genomes

Awesome Lists containing this project

README