https://github.com/nextstrain/forecasts-ncov

SARS-CoV-2 variant growth rates and frequency forecasts

bioinformatics forecasts nextstrain pango-lineages pathogen sars-cov-2 sars-cov-2-variants

Host: GitHub
URL: https://github.com/nextstrain/forecasts-ncov
Owner: nextstrain
Created: 2022-04-12T18:00:49.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2024-12-17T23:07:28.000Z (over 1 year ago)
Last Synced: 2024-12-18T00:19:21.880Z (over 1 year ago)
Topics: bioinformatics, forecasts, nextstrain, pango-lineages, pathogen, sars-cov-2, sars-cov-2-variants
Language: Python
Homepage: https://nextstrain.org/sars-cov-2/forecasts/
Size: 2.06 MB
Stars: 7
Watchers: 12
Forks: 2
Open Issues: 18
Metadata Files:
- Readme: README.md

# Forecasts SARS-CoV-2

> :warning: **WARNING: This is an alpha release.** Output file format and address may change at any time

This repo forms the basis of our continually-updated modelling of SARS-CoV-2 variant frequencies.
Broadly speaking, the moving pieces in this repo are:

- Data ingest, which produces TSV files of sequence counts. See [./ingest/README](https://github.com/nextstrain/forecasts-ncov/blob/main/ingest/README.md) for more details.
- Variant modelling, which is detailed in this README. The models themselves are defined in the [evofr](https://github.com/blab/evofr) repo.
- The `./viz/` directory contains a web-app which visualises the latest model outputs. See [./viz/README](https://github.com/nextstrain/forecasts-ncov/blob/main/viz/README.md) for more details. Currently this web-app is available at [nextstrain.github.io/forecasts-ncov/](https://nextstrain.github.io/forecasts-ncov/).

## Automated pipeline

The automated pipeline runs daily, based on scheduled jobs and triggers from upstream data ingests.
We use GitHub actions to schedule these jobs, often with one job triggering another upon completion.

- Case counts are fetched from [external data sources](./ingest/README.md#data-sources) daily at 8 AM PST.
- Raw metadata/sequences are fetched and cleaned via [nextstrain/ncov-ingest].
    - See [GISAID](https://github.com/nextstrain/ncov-ingest/blob/master/.github/workflows/fetch-and-ingest-gisaid-master.yml) and [open](https://github.com/nextstrain/ncov-ingest/blob/master/.github/workflows/fetch-and-ingest-genbank-master.yml) data workflows for their daily scheduled times.
- The [nextstrain/ncov-ingest] pipelines trigger the clade counts jobs once the latest curated data has been uploaded to S3.
    - The GISAID and open data ingest pipelines have different run times, so their clade counts jobs are triggered at different times.
- Clade counts jobs trigger the model runs once the counts data has been uploaded to S3.
- Model results are uploaded to S3 as dated files where the date indicates the ***run*** date.

### Inputs

See [available counts files](./ingest/README.md#outputs) for the input case counts and clade counts files.

### Outputs

The model results for GISAID data are stored at `s3://nextstrain-data/files/workflows/forecasts-ncov/gisaid`.
The model results for open (GenBank) data are stored at `s3://nextstrain-data/files/workflows/forecasts-ncov/open`.

The latest results are stored as `latest_results.json`. Previously uploaded results are kept as dated files whose names end in `_results.json`, where the date prefix is the run date.

#### Summary of Available files

| Data Provenance | Variant Classification | Geographic Resolution | Model | Address |
| --------------- | ---------------------- | --------------------- | ------ | -------------------------------------------------------------------------------------------------------------------- |
| GISAID | Nextstrain clades | Global | MLR | `https://data.nextstrain.org/files/workflows/forecasts-ncov/gisaid/nextstrain_clades/global/mlr/latest_results.json` |
| | Pango lineages | | | `https://data.nextstrain.org/files/workflows/forecasts-ncov/gisaid/pango_lineages/global/mlr/latest_results.json` |
| open (GenBank) | Nextstrain clades | | | `https://data.nextstrain.org/files/workflows/forecasts-ncov/open/nextstrain_clades/global/mlr/latest_results.json` |
| | Pango lineages | | | `https://data.nextstrain.org/files/workflows/forecasts-ncov/open/pango_lineages/global/mlr/latest_results.json` |
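
The addresses in the table follow a single pattern, so they can be assembled programmatically. The following is a minimal Python sketch, not part of this repo: the helper names `latest_results_url` and `fetch_latest_results` are hypothetical, and the only thing assumed about a results file is that it is valid JSON (possibly served gzip-encoded).

```python
import gzip
import json
from urllib.request import urlopen

BASE_URL = "https://data.nextstrain.org/files/workflows/forecasts-ncov"


def latest_results_url(data_provenance, variant_classification,
                       geo_resolution="global", model="mlr"):
    """Build the address of a latest_results.json file, following the table above."""
    return (f"{BASE_URL}/{data_provenance}/{variant_classification}/"
            f"{geo_resolution}/{model}/latest_results.json")


def fetch_latest_results(url):
    """Download and parse a results file (requires network access)."""
    with urlopen(url) as response:
        body = response.read()
        # urllib does not decompress automatically, so handle gzip if the
        # server declares it.
        if response.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
    return json.loads(body)
```

For example, `latest_results_url("open", "pango_lineages")` reproduces the last address in the table. The structure of the parsed JSON is not described here, so inspect the top-level keys before relying on any field.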

## Installation

Please follow [installation instructions](https://docs.nextstrain.org/en/latest/install.html#installation-steps) for Nextstrain's software tools.

## Usage

To run the pipeline for all available data generated by ingest:

```bash
nextstrain build .
```

To run the pipeline for specific data provenance, variant classification and geo resolution (e.g. gisaid, nextstrain_clades and global only):

```bash
nextstrain build . --configfile config/config.yaml --config data_provenances=gisaid variant_classifications=nextstrain_clades geo_resolutions=global
```
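
When scripting several runs, the same command can be assembled from Python. This sketch only mirrors the command above for a single value per config key. `build_command` and `run_build` are hypothetical helpers, not part of this repo, and running them assumes the `nextstrain` CLI is installed and on the `PATH`.

```python
import shlex
import subprocess


def build_command(data_provenance, variant_classification, geo_resolution,
                  configfile="config/config.yaml"):
    """Return the argv list for one `nextstrain build` run, as shown above."""
    return [
        "nextstrain", "build", ".",
        "--configfile", configfile,
        "--config",
        f"data_provenances={data_provenance}",
        f"variant_classifications={variant_classification}",
        f"geo_resolutions={geo_resolution}",
    ]


def run_build(data_provenance, variant_classification, geo_resolution):
    """Print and execute the command; raises CalledProcessError on failure."""
    cmd = build_command(data_provenance, variant_classification, geo_resolution)
    print("Running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)
```

Calling `run_build("gisaid", "nextstrain_clades", "global")` from the repo root is intended to be equivalent to the shell command above.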

### Optional uploads

To run the pipeline that uploads the model results to S3 and sends Slack notifications:

```bash
nextstrain build . --configfile config/config.yaml config/optional.yaml
```

Run the GitHub Action workflow named "Run models" to run the pipeline on AWS Batch.

## Configuration

The `data_provenances`, `variant_classifications` and `geo_resolutions` are required configs for the pipeline.

The currently available options for `data_provenances` are:

- `gisaid`
- `open`

The currently available options for `variant_classifications` are:

- `nextstrain_clades`
- `pango_lineages`

The currently available options for `geo_resolutions` are:

- `global`
- `usa`
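
The options listed above can be checked before launching a run. The following is a minimal sketch, assuming the config has already been loaded into a Python dict and that each key holds either a single string or a list of strings. `validate_config` is a hypothetical helper, and any validation the pipeline itself performs may differ.

```python
AVAILABLE_OPTIONS = {
    "data_provenances": {"gisaid", "open"},
    "variant_classifications": {"nextstrain_clades", "pango_lineages"},
    "geo_resolutions": {"global", "usa"},
}


def validate_config(config):
    """Return a list of problems found in a config dict; empty means OK."""
    problems = []
    for key, allowed in AVAILABLE_OPTIONS.items():
        if key not in config:
            problems.append(f"missing required config: {key}")
            continue
        values = config[key]
        # Accept a single string as shorthand for a one-element list.
        if isinstance(values, str):
            values = [values]
        for value in values:
            if value not in allowed:
                problems.append(f"unsupported {key} value: {value}")
    return problems
```

An empty list means all three required configs are present and use supported values.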

### Data Prep Configurations

The `prepare_data` params in `config/config.yaml` are used to subset the full
case counts and clade counts data to a specific date range, locations, and clades.

### Model configurations

The specific model configurations are housed in separate config YAML files for each model.
These separate config files must be provided in the main config as `mlr_config` and `renewal_config` in order to run the models.
By default, the model config files used are `config/mlr-config.yaml` and `config/renewal-config.yaml`.
Note the inputs and outputs for the models are overridden in the Snakemake pipeline to conform to the Snakemake input/output framework.

### Clade and Lineage colours

Model JSONs are post-processed by `./scripts/modify-lineage-colours-and-order.py`.
For `nextstrain_clades` this sets the colours and display names.
For `pango_lineages` this orders lineages based on their full (unaliased) pango designation, and sets colours based on the associated nextstrain clade.

When new clades are added, please modify the `CLADES` definitions in the script accordingly.
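
To illustrate what ordering by the full (unaliased) pango designation means, here is a minimal Python sketch. It is not the code from the script above: the function names are hypothetical, and `ALIASES` is a two-entry illustrative subset of the Pango alias key, whereas a real run would need the complete key.

```python
# Illustrative subset of the Pango alias key (assumption: a real run would
# load the complete key instead of hard-coding entries).
ALIASES = {
    "BA": "B.1.1.529",
    "JN": "B.1.1.529.2.86.1",
}


def unalias(lineage):
    """Expand an aliased prefix, e.g. 'BA.2' -> 'B.1.1.529.2'."""
    prefix, _, rest = lineage.partition(".")
    full_prefix = ALIASES.get(prefix, prefix)
    return f"{full_prefix}.{rest}" if rest else full_prefix


def sort_key(lineage):
    """Sort by root letters, then by the numeric components as integers."""
    letters, *numbers = unalias(lineage).split(".")
    return (letters, tuple(int(n) for n in numbers))


def order_lineages(lineages):
    """Return lineages ordered by their unaliased designation."""
    return sorted(lineages, key=sort_key)
```

Comparing the numeric components as integers places `BA.10` after `BA.2`, which a plain string sort would not, and unaliasing first places `JN.1` directly after its parent `BA.2.86`.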

### Environment variables

No environment variables are required for open data.
However, the following environment variables are required for the gisaid data:

- `AWS_DEFAULT_REGION`
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`

#### Uploads

If running the pipeline with uploads to S3, the following environment variables are required (regardless of data provenance):

- `AWS_DEFAULT_REGION`
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`

#### Slack notifications

If running the pipeline with Slack notifications, the following environment variables are required:

- `SLACK_CHANNELS`
- `SLACK_TOKEN`
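
The rules above can be turned into a quick pre-flight check. This is a minimal sketch that encodes only the requirements listed in this section. The helper names `required_env_vars` and `missing_env_vars` are hypothetical and not part of this repo.

```python
import os

AWS_VARS = ["AWS_DEFAULT_REGION", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]
SLACK_VARS = ["SLACK_CHANNELS", "SLACK_TOKEN"]


def required_env_vars(data_provenance, upload=False, slack=False):
    """List the variables needed for a given run configuration."""
    required = []
    # AWS credentials are needed for gisaid data and for any S3 upload.
    if data_provenance == "gisaid" or upload:
        required += AWS_VARS
    if slack:
        required += SLACK_VARS
    return required


def missing_env_vars(data_provenance, upload=False, slack=False, environ=None):
    """Return required variables that are unset or empty in the environment."""
    environ = os.environ if environ is None else environ
    return [name
            for name in required_env_vars(data_provenance, upload, slack)
            if not environ.get(name)]
```

Running `missing_env_vars("gisaid", upload=True, slack=True)` before a build reports anything that still needs to be exported.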

[nextstrain/ncov-ingest]: https://github.com/nextstrain/ncov-ingest

## USA-specific models

Using a modified version of this workflow, we produce USA-specific clade frequency estimates to contribute to [the SARS-CoV-2 variant nowcast hub](https://github.com/reichlab/variant-nowcast-hub/).
To run this version of the workflow, provide the additional `config/variant_hub.yaml` configuration file as shown below and specify the `push_all_hub_submission` workflow target.
These additional configuration details tell the workflow to run models for states in the USA, produce a parquet file with posterior samples of clade frequencies per location and date, and push the resulting file to a new branch in [the Nextstrain organization's fork of the variant-nowcast-hub repository](https://github.com/nextstrain/variant-nowcast-hub/).

To run this USA-specific workflow locally up through the preparation of each model's parquet file, run the following command.

```bash
nextstrain build \
    --docker \
    . \
    --configfile config/config.yaml config/variant_hub.yaml \
    -p \
    --forceall \
    prepare_all_hub_submissions
```

To run the full workflow which pushes model parquet files to the Nextstrain organization's fork of the hub repository, run the workflow manually through GitHub Actions.
