{"id":22104438,"url":"https://github.com/refgenie/aws_igenomes","last_synced_at":"2025-03-24T02:41:25.074Z","repository":{"id":41987321,"uuid":"398050348","full_name":"refgenie/aws_igenomes","owner":"refgenie","description":"configuration for an iGenomes refgenieserver","archived":false,"fork":false,"pushed_at":"2022-05-03T11:13:31.000Z","size":584,"stargazers_count":1,"open_issues_count":1,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-01-29T08:43:26.581Z","etag":null,"topics":["reference-data","refgenie","server"],"latest_commit_sha":null,"homepage":"http://igenomes.databio.org","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/refgenie.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-08-19T19:14:55.000Z","updated_at":"2022-04-11T20:11:23.000Z","dependencies_parsed_at":"2022-08-12T01:31:08.420Z","dependency_job_id":null,"html_url":"https://github.com/refgenie/aws_igenomes","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/refgenie%2Faws_igenomes","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/refgenie%2Faws_igenomes/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/refgenie%2Faws_igenomes/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/refgenie%2Faws_igenomes/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/refgenie","download_url":"https://codeload.github.com/refgenie/aws_igenomes/tar.gz/refs/heads/master","host":{"name":"G
itHub","url":"https://github.com","kind":"github","repositories_count":245200683,"owners_count":20576673,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["reference-data","refgenie","server"],"created_at":"2024-12-01T06:31:49.889Z","updated_at":"2025-03-24T02:41:25.056Z","avatar_url":"https://github.com/refgenie.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"**Unreleased requirements:**\n\n- attmap@dev\n- ubiquerg@dev\n- peppy@dev\n- looper@dev\n- pypiper@dev\n- refgenconf@dev\n- refgenie@dev\n\n# rg.databio.org server overview\n\nThis repository contains the files to build and archive genome assets to serve with [refgenieserver](https://github.com/refgenie/refgenieserver) at http://rg.databio.org.\n\nThe whole process is scripted, starting from this repository. From here, we follow this basic workflow:\n\n1. Download raw input files for assets (FASTA files, GTF files, etc.)\n2. Configure refgenie\n3. Build assets with `refgenie build` in a local refgenie instance\n4. Archive assets with `refgenieserver archive`\n5. Upload archives to S3\n6. Deploy assets to the active server on AWS.\n\n# Adding an asset to this server\n\n## Overview of metadata structure\n\nThe metadata is located in the [asset_pep](asset_pep) folder, which contains a [PEP](https://pep.databio.org) with metadata for each asset. The contents are:\n\n- `assets.csv` - The primary sample_table. Each row is an asset.\n- `recipe_inputs.csv` - The subsample_table. 
This provides a way to define each individual value passed to any of the 3 arguments of the `refgenie build` command: `--assets`, `--params`, and `--files`.\n- `refgenie_build_cfg.yaml` - The config file that defines a subproject (which is used to download the input data) and additional project settings.\n\n## Step 1: Add the asset to the asset table.\n\nTo add an asset, you will need to add a row in `assets.csv`. Follow these directions:\n\n- `genome` - the human-readable genome (namespace) you want to serve this asset under\n- `asset` - the human-readable asset name you want to serve this asset under. It is identical to the name of the asset recipe. Use `refgenie list` to see [available recipes](http://refgenie.databio.org/en/latest/build/)\n\nYour asset will be retrievable from the server with `refgenie pull {genome}/{asset_name}`.\n\n## Step 2: Add any required inputs to the recipe_inputs table\n\nNext, we need to add the source for each item required by your recipe. You can see what the recipe requires by using `-q` or `--requirements`, like this: `refgenie build {genome}/{recipe} -q`. If your recipe doesn't require any inputs, then you're done. If it requires any inputs (which can be one or more of the following: _assets_, _files_, _parameters_), then you need to specify these in the `recipe_inputs.csv` table.\n\nFor each required input, you add a row to `recipe_inputs.csv`. Follow these directions:\n\n- `sample_name` - must match the `genome` and `asset` values in the `assets.csv` file. Format it this way: `\u003cgenome\u003e-\u003casset\u003e`. This is how we match inputs to assets.\n\nNext you will need to fill in 3 columns:\n\n- `input_type` - one of the following: _files_, _params_ or _assets_\n- `input_id` - must match the recipe requirement. Again, use `refgenie build \u003cgenome\u003e/\u003casset\u003e -q` to learn the ids\n- `input_value` - the value for the input, e.g. 
a URL in the case of _files_\n\n## Step 3: See if you did it well!\n\n**Validate the PEP with [`eido`](http://eido.databio.org/en/latest/)**\n\nThe command below validates the PEP against a remote schema. Any PEP issues will result in a `ValidationError`:\n\n```\neido validate refgenie_build_cfg.yaml -s http://schema.databio.org/refgenie/refgenie_build.yaml\n```\n\n# Deploying assets onto the server\n\n## Setup\n\nIn this guide we'll use environment variables to keep track of where stuff goes.\n\n- `BASEDIR` points to our parent folder where we'll do all the building/archiving\n- `GENOMES` points to pipeline output (referenced in the project config)\n- `REFGENIE_RAW` points to a folder where the downloaded raw files are kept\n- `REFGENIE` points to the refgenie config file\n- `REFGENIE_ARCHIVE` points to the location where we'll store the actual archives\n\n```\nexport SERVERNAME=aws_igenomes\nexport BASEDIR=$PROJECT/deploy/$SERVERNAME\nexport GENOMES=$BASEDIR/genomes\nexport REFGENIE_RAW=/project/shefflab/www/refgenie_$SERVERNAME\nexport REFGENIE=$BASEDIR/$SERVERNAME/config/refgenie_config.yaml\nexport REFGENIE_ARCHIVE=$GENOMES/archive\nmkdir $BASEDIR\ncd $BASEDIR\n```\n\nTo start, clone this repository:\n\n```\ngit clone git@github.com:refgenie/$SERVERNAME.git\n```\n\n## Step 1: Download input files\n\nMany of the assets require some input files, and we have to make sure we have those files locally. In the `recipe_inputs.csv` file, we have entered these files as remote URLs, so the first step is to download them. 
We have created a subproject called `getfiles` for this. To programmatically download all the files required by `refgenie build`, run the following from this directory using [looper](http://looper.databio.org):\n\n```\ncd $SERVERNAME\nmkdir -p $REFGENIE_RAW\nlooper run asset_pep/refgenie_build_cfg.yaml -p local --amend getfiles --sel-attr asset --sel-incl fasta\n```\n\nCheck the status with `looper check` / `looper check --itemized`:\n\n```\nlooper check asset_pep/refgenie_build_cfg.yaml --amend getfiles --sel-attr asset --sel-incl fasta\n```\n\n## Step 2: Refgenie genome configuration file initialization\n\nThis repository comes with a genome configuration file already defined in the [`config`](config) directory, but if you have not initialized refgenie yet or want to start over, you can first initialize the config like this:\n\n```\nrefgenie init -c $REFGENIE -f $GENOMES -u http://awspds.refgenie.databio.org/aws_igenomes/ -a $REFGENIE_ARCHIVE -b refgenie_config_archive.yaml\n```\n\n## Step 3: Add recipes and asset classes\n\nAdd asset classes and recipes for assets to be built.\n\nFor example, clone the recipes repository, iterate over the files, and add them one by one:\n\n```bash\ngit clone https://github.com/refgenie/recipes.git $CODE\n\nfor file in `ls $CODE/recipes/asset_classes`; do\n    refgenie asset_class add --source $CODE/recipes/asset_classes/$file --force\ndone\n\nfor file in `ls $CODE/recipes/recipes`; do\n    refgenie recipe add --source $CODE/recipes/recipes/$file --force\ndone\n```\n\n## Step 4: Build assets\n\nOnce files are present locally, we can run `refgenie build` on each asset specified in the sample_table (`assets.csv`). 
We have to submit fasta assets first.\n\n### Leveraging the [_MapReduce_](http://refgenie.databio.org/en/latest/build/#build-assets-concurrently) programming model for concurrent builds\n\nSince we're about to build multiple assets concurrently, we will first build the assets with the `--map` option to store the metadata in a separate, newly created genome configuration file. This avoids any conflicts in concurrent asset builds.\n\nSubsequently, we'll run `refgenie build` with the `--reduce` option to combine the metadata into a single genome configuration file.\n\nRefgenie _doesn't_ account for asset dependencies. Therefore, as we have assets that depend on other assets, we need to take care of the dependencies ourselves:\n\n1. `refgenie build --map` all fasta assets to establish genome namespaces\n2. Wait until jobs are completed, then call `refgenie build --reduce`\n3. `refgenie build --map` all other top-level assets, e.g. fasta_txome, gencode_gtf\n4. Wait until jobs are completed, then call `refgenie build --reduce`\n5. `refgenie build --map` all derived assets, e.g. bowtie2_index, bwa_index\n6. Wait until jobs are completed, then call `refgenie build --reduce`\n\n```\nlooper run asset_pep/refgenie_build_cfg.yaml -p bulker_slurm --sel-attr asset --sel-incl fasta\n```\n\nThis will create one job for each _asset_. 
Monitor job progress with `looper check`:\n\n```\nlooper check asset_pep/refgenie_build_cfg.yaml --sel-attr asset --sel-incl fasta --itemized\n```\n\nThe _Reduce_ procedure is quick, so there's no need to submit the job to the cluster; just run it locally:\n\n```\nrefgenie build --reduce\n```\n\nThis takes care of the first two points. Repeat the above steps for the other assets.\n\n\u003c!-- ```\ngrep CANCELLED ../genomes/submission/*.log\nll ../genomes/submission/*.log\ngrep error ../genomes/submission/*.log\ngrep maximum ../genomes/submission/*.log\n\nll ../genomes/data/*/*/*/_refgenie_build/*.flag\nll ../genomes/data/*/*/*/_refgenie_build/*failed.flag\nll ../genomes/data/*/*/*/_refgenie_build/*completed.flag\nll ../genomes/data/*/*/*/_refgenie_build/*running.flag\nll ../genomes/data/*/*/*/_refgenie_build/*completed.flag | wc -l\ncat ../genomes/submission/*.log\n```\n\nTo run all the asset types:\n\n```\nlooper run asset_pep/refgenie_build_cfg.yaml -p bulker_slurm\n``` --\u003e\n\n## Step 5: Archive assets\n\nAssets are built locally now, but to serve them, we must archive them using `refgenieserver`. The general command is `refgenieserver archive -c \u003cpath/to/genomes.yaml\u003e`. Since the archive process is generally lengthy, it makes sense to submit this job to a cluster. 
We can use looper to do that.\n\nTo start over completely, remove the archive config file with:\n\n```\nrm config/refgenie_config_archive.yaml\n```\n\nThen submit the archiving jobs with `looper run`:\n\n```\nlooper run asset_pep/refgenieserver_archive_cfg.yaml -p bulker_local --sel-attr asset --sel-incl fasta\n```\n\nCheck progress with `looper check`:\n\n```\nlooper check asset_pep/refgenieserver_archive_cfg.yaml --sel-attr asset --sel-incl fasta\n```\n\n\u003c!--\n```\nll ../genomes/archive_logs/submission/*.log\ngrep Wait ../genomes/archive_logs/submission/*.log\ngrep Error ../genomes/archive_logs/submission/*.log\ncat ../genomes/archive_logs/submission/*.log\n``` --\u003e\n\n## Step 6: Upload archives to S3\n\nNow the archives should be built, so we'll sync them to AWS. Use the refgenie credentials (here added with `--profile refgenie`, which should be preconfigured with `aws configure`).\n\n```\naws s3 sync $REFGENIE_ARCHIVE s3://awspds.refgenie.databio.org/aws_igenomes/ --profile refgenie\n```\n\n## Step 7: Deploy server\n\nNow everything is ready to deploy. If using refgenieserver directly, you'll run `refgenieserver serve config/refgenieserver_archive_cfg`. We're hosting this repository on AWS and use GitHub Actions to trigger deploy jobs that push the updates to AWS ECS whenever a change is detected in the config file.\n\n```\nga -A; gcm \"Deploy to ECS\"; gpoh\n```\n\nlooper run asset_pep/refgenie_build_cfg_auto.yaml -p bulker_slurm --sel-attr asset --sel-incl star_index --command-extra=\"-R\"\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frefgenie%2Faws_igenomes","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frefgenie%2Faws_igenomes","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frefgenie%2Faws_igenomes/lists"}