{"id":18565643,"url":"https://github.com/quiltdata/quilt-package-metadata-athena","last_synced_at":"2025-07-08T17:35:56.147Z","repository":{"id":240392751,"uuid":"795279136","full_name":"quiltdata/quilt-package-metadata-athena","owner":"quiltdata","description":"A comprehensive tutorial outlining ","archived":false,"fork":false,"pushed_at":"2024-08-12T17:11:53.000Z","size":670,"stargazers_count":5,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-10T19:04:59.263Z","etag":null,"topics":["athena","metadata","quilt-data","quilt-packages","quilt3","s3","sql"],"latest_commit_sha":null,"homepage":"https://open.quiltdata.com/b/quilt-example/packages/examples/ccle-rnaseq-metadata-nfquilt-athena","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/quiltdata.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-03T00:05:36.000Z","updated_at":"2024-11-05T05:09:43.000Z","dependencies_parsed_at":"2024-05-19T23:24:46.197Z","dependency_job_id":"4aac19e7-97a5-4eed-ad5a-fb8e0832e92e","html_url":"https://github.com/quiltdata/quilt-package-metadata-athena","commit_stats":null,"previous_names":["laura-quilt/quilt-package-metadata-athena","quiltdata/quilt-package-metadata-athena"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/quiltdata/quilt-package-metadata-athena","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quiltdata%2Fquilt-package-metadata-athena","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/
repositories/quiltdata%2Fquilt-package-metadata-athena/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quiltdata%2Fquilt-package-metadata-athena/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quiltdata%2Fquilt-package-metadata-athena/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/quiltdata","download_url":"https://codeload.github.com/quiltdata/quilt-package-metadata-athena/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quiltdata%2Fquilt-package-metadata-athena/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264315090,"owners_count":23589705,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["athena","metadata","quilt-data","quilt-packages","quilt3","s3","sql"],"created_at":"2024-11-06T22:19:37.406Z","updated_at":"2025-07-08T17:35:56.125Z","avatar_url":"https://github.com/quiltdata.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Streamlining NGS Insights: From Raw Data to Athena Queries with Quilt Packages and Metadata\n\nIn this tutorial series, we demonstrate the benefit of packaging raw -omics data in Quilt packages with attached sample-level metadata. Annotating packages with workflow-standardized metadata enables the creation of AWS Athena tables, joining sample-level metadata \u0026 pipeline outputs (e.g. Nextflow) from your processed NGS data. 
Joining these two data sources in Athena lets users efficiently query large datasets across multiple processing runs and cohorts using SQL.\n\nFor example, use an SQL query within a Jupyter Notebook to generate a table of EGFR expression across all colon cancer cell lines, where \"colon cancer\" is a piece of sample-level metadata from the raw data Quilt packages, and \"EGFR expression\" is a piece of processed data from packaged Nextflow pipeline outputs.\n\nThe ultimate goal of this demo is to provide an end-to-end framework, from raw data to analysis, to maximize the utility of your NGS data, and make querying your datasets fast \u0026 easy (no more searching through directories \u0026 file systems to find specific sample or run IDs!).\n\n## Dataset \u0026 Nextflow Pipeline\n\nFor the purpose of this tutorial, we are using a subset of publicly available RNA-sequencing data generated by the [Cancer Cell Line Encyclopedia (CCLE)](https://sites.broadinstitute.org/ccle/) initiative.  \n\nRNA-sequencing data is processed using the [`nf-core/rnaseq` Nextflow pipeline](https://nf-co.re/rnaseq/3.14.0) with the [`nf-quilt`](https://docs.quiltdata.com/examples/nextflow) plugin to package pipeline outputs into a Quilt package with pipeline parameters as metadata.  \n  \nAlthough focussed on bulk RNA-seq data, this tutorial is generalizable: the core principles apply across data types, and the steps can be reproduced with your in-house datasets.\n\n## Project Outline\n\nWe have generated a series of 4 core tutorials (+ 1 optional) demonstrating a framework to go from raw NGS data to annotated Quilt data packages with metadata \u0026 Nextflow pipeline outputs, to enable quick data access \u0026 queries through AWS Athena.\n\n### 1. 
Annotated Quilt Packages for Raw NGS Data \u0026 Metadata\n\n`00_curate_raw_ccle_rnaseq_data.ipynb` (optional)  \n`01_create_metadata_workflow_schema.ipynb`  \n`02_generate_raw_data_pkgs_with_metadata.ipynb`  \n\nRaw data is either generated in-house by an instrument or, as in this demo, curated from a public source. Here, we downloaded raw RNA-sequencing data in the form of fastqs from the Sequence Read Archive (SRA). Raw sequencing data was then packaged into Quilt packages, one package per sample. \n\nSample-level metadata describing both biological (tumor type, patient age, histology ...) and technical (sequencer used, library kit, freezing media used for storage ...) features of the sample was obtained from SRA and attached as metadata to each Quilt package housing raw data. \n\nQuilt workflows \u0026 metadata schemas were used to ensure the integrity of the metadata across samples -- a key step to maximize the utility of sample metadata in downstream analysis! No more Tumor vs. tumor vs. tumour...!!\n\n### 2. Tractable Nextflow Pipeline Processing with `nf-quilt`\n\n`03_run_nfcore_rnaseq_with_nfquilt.ipynb`  \n\nThe Nextflow `nf-core/rnaseq` pipeline, in conjunction with `nf-quilt`, was used to process raw sequencing data (fastqs) and generate per-sample expression values. Samples were processed together in batches (called \"runs\"), mirroring common practice in NGS centers, where multiple samples on a sequencing flow cell are pre-processed at the same time. The `nf-quilt` plugin automatically packages Nextflow pipeline output into a Quilt package, and appends detailed pipeline run metadata to the package.\n\n\n### 3. Metadata \u0026 Pipeline Results Data Lake\n\n`04_athena_metadata_nfcore_output.ipynb`  \n\nTo enable useful data searches, we must align the sample-level metadata attached to the raw data packages with the pipeline outputs. In this demo, the primary outputs generated by the pipeline are expression tables. 
With Athena, it's possible to join sample metadata \u0026 pipeline output tables, enabling quick queries and slicing and dicing of large datasets.\n\n### 4. Efficiently Query \u0026 Analyze Pipeline Outputs Alongside Sample Metadata\n\n`06_query_athena_data_and_perform_analysis`  \n\nOnce Athena is enabled, the world (or data in this case...) is the Computational Biologist's oyster! Computational biologists can now use Athena to run SQL queries that pull the desired subsets of data for their analysis quickly and efficiently. Queries can be performed directly in Jupyter notebooks, enabling seamless data loading upstream of analysis. \n\nIn contrast, without Athena capabilities, comp bio folks would have to figure out which samples they want by loading a master metadata table somewhere, perform some detective work to track down where the output tables of their desired samples live, and load those files 1-by-1. \n\nAdditionally, Athena tables are compatible with interactive dashboards (e.g. Tableau, Spotfire, QuickSight), allowing you to keep track of the number of samples, which samples, or other accounting metrics that may be helpful beyond computational teams (business development, project management) in a \"no-code\" manner.\n\n## Pre-Requisites\n\nThe tutorials are in the form of Jupyter Notebooks, and are fully executable. To run the notebooks, you need the following:\n\n1. Python \u003e=3.7 \n2. Required Python packages: `pip install -r requirements.txt`\n3. AWS credentials\n4. Quilt Open Data Account\n5. Nextflow Tower Account (optional)\n\n## Questions?\n\nWe love to help! Please reach out to the Quilt Data team with any comments or questions. 
Let's get your data up to snuff together!\n\n- Laura Richards: laura@quiltdata.io\n- Simon Kohnstamm: simon@quiltdata.io\n- Kevin Moore: kevin@quildata.io","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquiltdata%2Fquilt-package-metadata-athena","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fquiltdata%2Fquilt-package-metadata-athena","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquiltdata%2Fquilt-package-metadata-athena/lists"}