https://github.com/nextstrain/forecasts-ncov

SARS-CoV-2 variant growth rates and frequency forecasts

bioinformatics forecasts nextstrain pango-lineages pathogen sars-cov-2 sars-cov-2-variants


# Forecasts SARS-CoV-2

> :warning: **WARNING: This is an alpha release.** Output file format and address may change at any time

This repo forms the basis of our continually-updated modelling of SARS-CoV-2 variant frequencies.
Broadly speaking, the moving pieces in this repo are:

- Data ingest, which produces TSV files of sequence counts. See [./ingest/README](https://github.com/nextstrain/forecasts-ncov/blob/main/ingest/README.md) for more details.
- Variant modelling, which is detailed in this README. The models themselves are defined in the [evofr](https://github.com/blab/evofr) repo.
- The `./viz/` directory contains a web-app which visualises the latest model outputs. See [./viz/README](https://github.com/nextstrain/forecasts-ncov/blob/main/viz/README.md) for more details. Currently this web-app is available at [nextstrain.github.io/forecasts-ncov/](https://nextstrain.github.io/forecasts-ncov/).

## Automated pipeline

The automated pipeline runs daily, driven by scheduled jobs and by triggers from upstream data ingests.
We use GitHub Actions to schedule these jobs, often with one job triggering another upon completion.

- Case counts are fetched from [external data sources](./ingest/README.md#data-sources) daily at 8 AM PST.
- Raw metadata/sequences are fetched and cleaned via [nextstrain/ncov-ingest].
  - See the [GISAID](https://github.com/nextstrain/ncov-ingest/blob/master/.github/workflows/fetch-and-ingest-gisaid-master.yml) and [open](https://github.com/nextstrain/ncov-ingest/blob/master/.github/workflows/fetch-and-ingest-genbank-master.yml) data workflows for their daily scheduled times.
- The [nextstrain/ncov-ingest] pipelines trigger the clade counts jobs once the latest curated data has been uploaded to S3.
  - The GISAID and open data ingest pipelines have different run times, so their clade counts jobs are triggered at different times.
- Clade counts jobs trigger the model runs once the counts data has been uploaded to S3.
- Model results are uploaded to S3 as dated files where the date indicates the ***run*** date.

### Inputs

See [available counts files](./ingest/README.md#outputs) for the input case counts and clade counts files.

### Outputs

The model results for GISAID data are stored at `s3://nextstrain-data/files/workflows/forecasts-ncov/gisaid`.
The model results for open (GenBank) data are stored at `s3://nextstrain-data/files/workflows/forecasts-ncov/open`.

The latest results are stored as `latest_results.json`. Previously uploaded results are kept as dated files whose names end in `_results.json`, prefixed with the run date.

#### Summary of Available files

| Data Provenance | Variant Classification | Geographic Resolution | Model | Address |
| --------------- | ---------------------- | --------------------- | ------ | -------------------------------------------------------------------------------------------------------------------- |
| GISAID | Nextstrain clades | Global | MLR | `https://data.nextstrain.org/files/workflows/forecasts-ncov/gisaid/nextstrain_clades/global/mlr/latest_results.json` |
| | Pango lineages | | | `https://data.nextstrain.org/files/workflows/forecasts-ncov/gisaid/pango_lineages/global/mlr/latest_results.json` |
| open (GenBank) | Nextstrain clades | | | `https://data.nextstrain.org/files/workflows/forecasts-ncov/open/nextstrain_clades/global/mlr/latest_results.json` |
| | Pango lineages | | | `https://data.nextstrain.org/files/workflows/forecasts-ncov/open/pango_lineages/global/mlr/latest_results.json` |
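The addresses in the table follow a regular pattern, so they can be assembled from the table columns. The Python sketch below assumes that this pattern holds for every combination listed above; the function name `latest_results_url` is illustrative and not part of this repo.

```python
BASE_URL = "https://data.nextstrain.org/files/workflows/forecasts-ncov"


def latest_results_url(data_provenance, variant_classification,
                       geo_resolution="global", model="mlr"):
    """Build the address of a latest_results.json file from the table columns."""
    return "/".join([
        BASE_URL,
        data_provenance,         # "gisaid" or "open"
        variant_classification,  # "nextstrain_clades" or "pango_lineages"
        geo_resolution,          # only "global" appears in the table
        model,                   # only "mlr" appears in the table
        "latest_results.json",
    ])


print(latest_results_url("gisaid", "nextstrain_clades"))
```

The returned address can be passed to any HTTP client (for example `urllib.request.urlopen`) to download the JSON.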

## Installation

Please follow [installation instructions](https://docs.nextstrain.org/en/latest/install.html#installation-steps) for Nextstrain's software tools.

## Usage

To run the pipeline for all available data generated by ingest:

```bash
nextstrain build .
```

To run the pipeline for a specific data provenance, variant classification and geographic resolution (e.g. `gisaid`, `nextstrain_clades` and `global` only):

```bash
nextstrain build . --configfile config/config.yaml --config data_provenances=gisaid variant_classifications=nextstrain_clades geo_resolutions=global
```
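To run several targeted combinations one after another, the command above can be generated programmatically. This is a sketch, not part of the repo: the helper name `build_commands` is hypothetical, and it only reuses the flags shown in the command above, with one value per config key.

```python
import itertools
import shlex


def build_commands(data_provenances, variant_classifications, geo_resolutions):
    """Return one `nextstrain build` argument list per requested combination."""
    commands = []
    for provenance, classification, resolution in itertools.product(
        data_provenances, variant_classifications, geo_resolutions
    ):
        commands.append([
            "nextstrain", "build", ".",
            "--configfile", "config/config.yaml",
            "--config",
            f"data_provenances={provenance}",
            f"variant_classifications={classification}",
            f"geo_resolutions={resolution}",
        ])
    return commands


# Dry run: print each command instead of executing it.
for argv in build_commands(["gisaid", "open"], ["nextstrain_clades"], ["global"]):
    print(shlex.join(argv))
```

To execute the commands rather than print them, pass each argument list to `subprocess.run(argv, check=True)`.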

### Optional uploads

To run the pipeline that uploads the model results to S3 and sends Slack notifications:

```bash
nextstrain build . --configfile config/config.yaml config/optional.yaml
```

OR

Run the GitHub Action workflow named "Run models" to run the pipeline on AWS Batch.

## Configuration

The `data_provenances`, `variant_classifications` and `geo_resolutions` are required configs for the pipeline.

The currently available options for `data_provenances` are:

- `gisaid`
- `open`

The currently available options for `variant_classifications` are:

- `nextstrain_clades`
- `pango_lineages`

The currently available options for `geo_resolutions` are:

- `global`
- `usa`
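A configuration can be checked against these options before starting a run. The sketch below is an assumption about how such a check might look; `check_config` is a hypothetical helper, and the pipeline itself may validate its config differently.

```python
ALLOWED_OPTIONS = {
    "data_provenances": {"gisaid", "open"},
    "variant_classifications": {"nextstrain_clades", "pango_lineages"},
    "geo_resolutions": {"global", "usa"},
}


def check_config(config):
    """Return a list of problems; an empty list means the config looks valid."""
    problems = []
    for key, allowed in ALLOWED_OPTIONS.items():
        if key not in config:
            problems.append(f"missing required config: {key}")
            continue
        values = config[key]
        if isinstance(values, str):
            # Accept a single value as well as a list of values.
            values = [values]
        for value in values:
            if value not in allowed:
                problems.append(f"unsupported {key} value: {value}")
    return problems


print(check_config({
    "data_provenances": ["gisaid"],
    "variant_classifications": ["nextstrain_clades"],
    "geo_resolutions": ["global"],
}))
```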

### Data Prep Configurations

The `prepare_data` params in `config/config.yaml` are used to subset the full
case counts and clade counts data to a specific date range, set of locations, and set of clades.

### Model configurations

The specific model configurations are housed in separate config YAML files for each model.
These separate config files must be provided in the main config as `mlr_config` and `renewal_config` in order to run the models.
By default, the model config files used are `config/mlr-config.yaml` and `config/renewal-config.yaml`.
Note the inputs and outputs for the models are overridden in the Snakemake pipeline to conform to the Snakemake input/output framework.

### Clade and Lineage colours

Model JSONs are post-processed by `./scripts/modify-lineage-colours-and-order.py`.
For `nextstrain_clades` this sets the colours and display names.
For `pango_lineages` this orders lineages based on their full (unaliased) pango designation, and sets colours based on the associated nextstrain clade.

When new clades are added, please modify the `CLADES` definitions in the script accordingly.
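To illustrate what ordering by the full (unaliased) Pango designation means, here is a minimal sketch. It is not the code from the script: the `ALIASES` table is a two-entry illustrative subset (a real run would need the complete Pango alias table), the function names are hypothetical, and every component after the letter prefix is assumed to be numeric.

```python
# Illustrative subset of Pango aliases; the full alias table is much larger.
ALIASES = {
    "BA": "B.1.1.529",
    "JN": "B.1.1.529.2.86.1",
}


def unalias(lineage):
    """Expand the leading alias of a lineage name, e.g. BA.2 -> B.1.1.529.2."""
    prefix, _, rest = lineage.partition(".")
    full_prefix = ALIASES.get(prefix, prefix)
    return f"{full_prefix}.{rest}" if rest else full_prefix


def sort_key(lineage):
    """Sort by letter prefix, then numerically by each dotted component."""
    letters, *numbers = unalias(lineage).split(".")
    return (letters, [int(n) for n in numbers])


def order_lineages(lineages):
    return sorted(lineages, key=sort_key)


print(order_lineages(["JN.1", "BA.2", "BA.2.86", "BA.1"]))
```

Sorting on the unaliased name places `JN.1` after `BA.2.86`, its parent lineage, whereas a plain alphabetical sort of the aliased names would separate them.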

### Environment variables

No environment variables are required for open data.
However, the following environment variables are required for the GISAID data:

- `AWS_DEFAULT_REGION`
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`

#### Uploads

If running the pipeline with uploads to S3, the following environment variables are required (regardless of data provenance):

- `AWS_DEFAULT_REGION`
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`

#### Slack notifications

If running the pipeline with Slack notifications, the following environment variables are required:

- `SLACK_CHANNELS`
- `SLACK_TOKEN`
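The requirements above can be checked before launching a run. The following sketch assumes only the rules stated in this section; `missing_env_vars` is a hypothetical helper and not part of the pipeline.

```python
import os

AWS_VARS = ["AWS_DEFAULT_REGION", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]
SLACK_VARS = ["SLACK_CHANNELS", "SLACK_TOKEN"]


def missing_env_vars(data_provenance, upload=False, slack=False, env=None):
    """Return the required environment variables that are unset or empty."""
    env = os.environ if env is None else env
    required = []
    # AWS credentials are needed for GISAID data and for any S3 upload.
    if data_provenance == "gisaid" or upload:
        required += AWS_VARS
    if slack:
        required += SLACK_VARS
    return [name for name in required if not env.get(name)]


missing = missing_env_vars("gisaid", upload=True, slack=True)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```

Passing a plain dictionary as `env` makes it possible to check a planned environment without modifying the current process.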

[nextstrain/ncov-ingest]: https://github.com/nextstrain/ncov-ingest

## USA-specific models

Using a modified version of this workflow, we produce USA-specific clade frequency estimates to contribute to [the SARS-CoV-2 variant nowcast hub](https://github.com/reichlab/variant-nowcast-hub/).
To run this version of the workflow, provide the additional `config/variant_hub.yaml` configuration file as shown below and specify the `push_all_hub_submission` workflow target.
These additional configuration details tell the workflow to run models for states in the USA, produce a parquet file with posterior samples of clade frequencies per location and date, and push the resulting file to a new branch in [the Nextstrain organization's fork of the variant-nowcast-hub repository](https://github.com/nextstrain/variant-nowcast-hub/).

To run this USA-specific workflow locally up through the preparation of each model's parquet file, run the following command.

```bash
nextstrain build \
--docker \
. \
--configfile config/config.yaml config/variant_hub.yaml \
-p \
--forceall \
prepare_all_hub_submissions
```

To run the full workflow, which pushes model parquet files to the Nextstrain organization's fork of the hub repository, run the workflow manually through GitHub Actions.