https://github.com/quiltdata/quilt-package-metadata-athena
A comprehensive tutorial outlining
athena metadata quilt-data quilt-packages quilt3 s3 sql
- Host: GitHub
- URL: https://github.com/quiltdata/quilt-package-metadata-athena
- Owner: quiltdata
- Created: 2024-05-03T00:05:36.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-08-12T17:11:53.000Z (almost 2 years ago)
- Last Synced: 2025-04-10T19:04:59.263Z (about 1 year ago)
- Topics: athena, metadata, quilt-data, quilt-packages, quilt3, s3, sql
- Language: Jupyter Notebook
- Homepage: https://open.quiltdata.com/b/quilt-example/packages/examples/ccle-rnaseq-metadata-nfquilt-athena
- Size: 654 KB
- Stars: 5
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# Streamlining NGS Insights: From Raw Data to Athena Queries with Quilt Packages and Metadata
In this tutorial series, we demonstrate the benefit of packaging raw -omics data in Quilt packages with attached sample-level metadata. Annotating packages with workflow-standardized metadata enables the creation of AWS Athena tables that join sample-level metadata with pipeline outputs (e.g. Nextflow) from your processed NGS data. Joining these two data sources in Athena lets users efficiently query large datasets across multiple processing runs and cohorts using SQL.
For example, use an SQL query within a Jupyter Notebook to generate a table of EGFR expression across all colon cancer cell lines, where "colon cancer" represents a piece of sample-level metadata from the raw data Quilt packages, and "EGFR expression" is a piece of processed data from packaged Nextflow pipeline outputs.
The ultimate goal of this demo is to provide an end-to-end framework, from raw data to analysis, to maximize the utility of your NGS data, and make querying your datasets fast & easy (no more searching through directories & file systems to find specific sample or run IDs!).
## Dataset & Nextflow Pipeline
For the purpose of this tutorial, we are using a subset of publicly available RNA-sequencing data generated by the [Cancer Cell Line Encyclopedia (CCLE)](https://sites.broadinstitute.org/ccle/) initiative.
RNA-sequencing data is processed using the [`nf-core/rna-seq` Nextflow pipeline](https://nf-co.re/rnaseq/3.14.0) with the [`nf-quilt`](https://docs.quiltdata.com/examples/nextflow) plugin to package pipeline outputs into a Quilt package with pipeline parameters as metadata.
Although focussed on bulk RNA-seq data, this tutorial is generalizable: the core principles apply across data types and can be reproduced with your in-house datasets.
## Project Outline
We have generated a series of 4 core tutorials (+ 1 optional) demonstrating a framework to go from raw NGS data to annotated Quilt data packages with metadata & Nextflow pipeline outputs, to enable quick data access & queries through AWS Athena.
### 1. Annotated Quilt Packages for Raw NGS Data & Metadata
`00_curate_raw_ccle_rnaseq_data.ipynb` (optional)
`01_create_metadata_workflow_schema.ipynb`
`02_generate_raw_data_pkgs_with_metadata.ipynb`
Raw data is either generated in-house by an instrument or, as in this demo, curated from a public source. Here, we downloaded raw RNA-sequencing data as FASTQ files from the Sequence Read Archive (SRA). The raw sequencing data was then packaged into Quilt packages, one package per sample.
Sample-level metadata describing both biological (tumor type, patient age, histology ...) and technical (sequencer used, library kit, freezing media used for storage ...) features of the sample were obtained from SRA and attached as metadata to each Quilt package housing raw data.
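As a rough sketch of this packaging step, the snippet below builds one package per sample with `quilt3` and attaches sample-level metadata. The package namespace, sample ID, metadata field names, registry, and workflow name are hypothetical placeholders, not values taken from the notebooks.

```python
from pathlib import Path


def package_name(sample_id, namespace="ccle-raw"):
    """Build a package name of the form namespace/sample (hypothetical naming scheme)."""
    return f"{namespace}/{sample_id.lower()}"


def build_metadata(sample_id, biological, technical):
    """Combine biological and technical annotations into a single metadata dict."""
    return {"sample_id": sample_id, **biological, **technical}


def push_sample_package(sample_id, fastq_dir, metadata, registry, workflow=None):
    """Package one sample's FASTQ directory and push it with its metadata attached."""
    import quilt3  # imported lazily so the helpers above work without quilt3 installed

    pkg = quilt3.Package()
    pkg.set_dir(".", str(Path(fastq_dir)))  # add every file in the directory
    pkg.set_meta(metadata)                  # package-level metadata
    push_kwargs = {"registry": registry, "message": f"Raw FASTQs for {sample_id}"}
    if workflow is not None:
        push_kwargs["workflow"] = workflow  # name of a workflow configured in the bucket
    return pkg.push(package_name(sample_id), **push_kwargs)


# Hypothetical sample and annotations, for illustration only.
meta = build_metadata(
    "SRR0000001",
    biological={"tumor_type": "colon"},
    technical={"sequencer": "Illumina HiSeq 2000"},
)
print(package_name("SRR0000001"), meta)
```

Calling `push_sample_package(...)` requires AWS credentials and write access to the target bucket; the helper functions above it run without either.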
Quilt workflows & metadata schemas were used to ensure the integrity of the metadata across samples -- a key step to maximize the utility of sample metadata in downstream analysis! No more Tumor vs. tumor vs. tumour...!!
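To illustrate why a schema prevents the "Tumor vs. tumor vs. tumour" problem, here is a minimal sketch of the kind of check a metadata schema enforces. The schema fragment is written in JSON Schema style, but the field names and allowed values are hypothetical, and the checker only handles `required` and `enum`; it is a simplified stand-in for a full JSON Schema validator, not the validation Quilt itself performs.

```python
# Hypothetical schema fragment in JSON Schema style; fields and values are illustrative.
SAMPLE_SCHEMA = {
    "type": "object",
    "required": ["sample_id", "tumor_type"],
    "properties": {
        "sample_id": {"type": "string"},
        "tumor_type": {"enum": ["colon", "lung", "breast"]},
    },
}


def find_violations(metadata, schema):
    """Return a list of problems, checking only `required` and `enum` rules."""
    problems = []
    for key in schema.get("required", []):
        if key not in metadata:
            problems.append(f"missing required field: {key}")
    for key, rules in schema.get("properties", {}).items():
        allowed = rules.get("enum")
        if allowed is not None and key in metadata and metadata[key] not in allowed:
            problems.append(f"{key}={metadata[key]!r} is not one of {allowed}")
    return problems


# A capitalized value is rejected because only the lowercase spelling is allowed.
print(find_violations({"sample_id": "S1", "tumor_type": "Colon"}, SAMPLE_SCHEMA))
```

Rejecting non-conforming values at packaging time means every downstream query can filter on a single, predictable spelling.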
### 2. Tractable Nextflow Pipeline Processing with `nf-quilt`
`03_run_nfcore_rnaseq_with_nfquilt.ipynb`
The Nextflow `nf-core/rnaseq` pipeline, in conjunction with `nf-quilt`, was used to process raw sequencing data (FASTQ files) and generate per-sample expression values. Samples were processed together in batches (called "runs"), mirroring common practice in NGS centers, where multiple samples on a sequencing flow cell are pre-processed at the same time. The `nf-quilt` plugin automatically packages Nextflow pipeline output into a Quilt package and appends detailed pipeline run metadata to the package.
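A pipeline run of this kind might be launched from a notebook roughly as follows. This sketch only assembles the command line; the samplesheet path, bucket, and package name are hypothetical, and the `quilt+s3://` output URI follows the form described in the `nf-quilt` documentation rather than anything confirmed by the notebooks.

```python
import shlex


def build_nextflow_command(samplesheet, bucket, package, revision="3.14.0", profile="docker"):
    """Assemble a `nextflow run` command whose output directory is a Quilt package URI."""
    outdir = f"quilt+s3://{bucket}#package={package}"
    return [
        "nextflow", "run", "nf-core/rnaseq",
        "-r", revision,            # pipeline revision
        "-profile", profile,       # execution profile
        "-plugins", "nf-quilt",    # enable the nf-quilt plugin
        "--input", samplesheet,    # samplesheet listing the FASTQ files in this run
        "--outdir", outdir,        # outputs are written to this Quilt package
    ]


# Hypothetical run: placeholder samplesheet, bucket, and package names.
cmd = build_nextflow_command("run01_samplesheet.csv", "my-quilt-bucket", "ccle-processed/run01")
print(" ".join(shlex.quote(part) for part in cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would start the pipeline, assuming Nextflow and the chosen container runtime are installed.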
### 3. Metadata & Pipeline Results Data Lake
`04_athena_metadata_nfcore_output.ipynb`
To enable useful data searches, we must align the sample-level metadata attached to the raw data packages with the pipeline outputs. In this demo, the primary data generated by the pipeline is expression tables. With Athena, it's possible to integrate sample metadata and pipeline output tables, enabling quick queries and easy slicing and dicing of large datasets.
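The join described above can be tried locally before any Athena tables exist. In the sketch below, an in-memory SQLite database stands in for Athena, and the table names, column names, and values are toy placeholders invented for illustration; the real tables are defined in the notebook.

```python
import sqlite3

# In-memory SQLite stands in for Athena; all tables and values are toy placeholders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample_metadata (sample_id TEXT, cell_line TEXT, tumor_type TEXT);
CREATE TABLE gene_expression (sample_id TEXT, run_id TEXT, gene_name TEXT, tpm REAL);
INSERT INTO sample_metadata VALUES
    ('S1', 'LINE_A', 'colon'), ('S2', 'LINE_B', 'colon'), ('S3', 'LINE_C', 'lung');
INSERT INTO gene_expression VALUES
    ('S1', 'run01', 'EGFR', 12.5), ('S2', 'run02', 'EGFR', 3.0),
    ('S3', 'run02', 'EGFR', 40.0), ('S1', 'run01', 'TP53', 8.0);
""")

# Sample metadata (tumor type) filters the rows; pipeline output (expression) supplies the values.
QUERY = """
SELECT m.cell_line, e.run_id, e.tpm
FROM gene_expression AS e
JOIN sample_metadata AS m ON m.sample_id = e.sample_id
WHERE m.tumor_type = ? AND e.gene_name = ?
ORDER BY m.cell_line
"""

rows = conn.execute(QUERY, ("colon", "EGFR")).fetchall()
for cell_line, run_id, tpm in rows:
    print(cell_line, run_id, tpm)
```

The shared `sample_id` column is what makes the join possible, which is why consistent sample identifiers across raw data packages and pipeline outputs matter.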
### 4. Efficiently Query & Analyze Pipeline Outputs Alongside Sample Metadata
`06_query_athena_data_and_perform_analysis`
Once Athena is enabled, the world (or data in this case...) is the Computational Biologist's oyster! Computational biologists can now use Athena to run SQL queries that quickly and efficiently pull the subsets of data they need for analysis. Queries can be performed directly in Jupyter notebooks, enabling seamless data loading upstream of analysis.
In contrast, without Athena capabilities, comp bio folks would have to figure out which samples they want by loading a master metadata table from somewhere, do some detective work to track down where the output tables for those samples live, and load those files one by one.
Additionally, Athena tables are compatible with interactive dashboards (e.g. Tableau, Spotfire, QuickSight), allowing you to keep track of the number of samples, which samples, or other accounting metrics that may be helpful beyond computational teams (business development, project management) in a "no-code" manner.
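To make the notebook-based querying described in this section concrete, here is a minimal sketch of running an Athena query through the `boto3` Athena client. The function name is hypothetical and the notebooks may use a different client library; the database name and S3 output location are supplied by the caller.

```python
import time


def run_athena_query(client, sql, database, output_location, poll_seconds=1.0):
    """Submit a query, wait for it to finish, and return rows as lists of strings."""
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until Athena reports a terminal state.
    while True:
        status = client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")

    # Only the first page of results is read here; for SELECT queries the
    # first returned row holds the column names.
    results = client.get_query_results(QueryExecutionId=query_id)
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in results["ResultSet"]["Rows"]
    ]
```

In a notebook, this would be called with `client = boto3.client("athena")` and valid AWS credentials; large result sets would additionally need pagination.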
## Pre-Requisites
The tutorials are fully executable Jupyter Notebooks. To run them, you will need the following:
1. Python >=3.7
2. Required Python packages: `pip install -r requirements.txt`
3. AWS credentials
4. Quilt Open Data Account
5. Nextflow Tower Account (optional)
## Questions?
We love to help! Please reach out to the Quilt Data team with any comments or questions. Let's get your data up to snuff together!
- Laura Richards: laura@quiltdata.io
- Simon Kohnstamm: simon@quiltdata.io
- Kevin Moore: kevin@quiltdata.io