{"id":15051498,"url":"https://github.com/bioconductor/genomicdatacommons","last_synced_at":"2025-05-16T06:04:15.423Z","repository":{"id":9308384,"uuid":"60694062","full_name":"Bioconductor/GenomicDataCommons","owner":"Bioconductor","description":"Provide R access to the NCI Genomic Data Commons portal.","archived":false,"fork":false,"pushed_at":"2025-05-12T18:25:24.000Z","size":4619,"stargazers_count":89,"open_issues_count":20,"forks_count":24,"subscribers_count":17,"default_branch":"devel","last_synced_at":"2025-05-12T19:41:41.259Z","etag":null,"topics":["api-client","bioconductor","bioinformatics","cancer","core-services","data-science","genomics","nci","r","tcga","vignette"],"latest_commit_sha":null,"homepage":"http://bioconductor.github.io/GenomicDataCommons/","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Bioconductor.png","metadata":{"files":{"readme":"README.md","changelog":"NEWS.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2016-06-08T11:51:11.000Z","updated_at":"2025-05-12T18:24:47.000Z","dependencies_parsed_at":"2024-01-09T00:23:05.903Z","dependency_job_id":"ab7b6261-2bda-46cf-a5aa-61661a381f8c","html_url":"https://github.com/Bioconductor/GenomicDataCommons","commit_stats":{"total_commits":544,"total_committers":14,"mean_commits":"38.857142857142854","dds":"0.31433823529411764","last_synced_commit":"33f339ce3c7e394b46b6d0166afad65ddc19c9d6"},"previous_names":[],"tags_count":42,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bioconductor%2FGenomicDataCommons","tags_url":"https://repos.
ecosyste.ms/api/v1/hosts/GitHub/repositories/Bioconductor%2FGenomicDataCommons/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bioconductor%2FGenomicDataCommons/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bioconductor%2FGenomicDataCommons/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Bioconductor","download_url":"https://codeload.github.com/Bioconductor/GenomicDataCommons/tar.gz/refs/heads/devel","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254478186,"owners_count":22077675,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["api-client","bioconductor","bioinformatics","cancer","core-services","data-science","genomics","nci","r","tcga","vignette"],"created_at":"2024-09-24T21:36:27.047Z","updated_at":"2025-05-16T06:04:14.515Z","avatar_url":"https://github.com/Bioconductor.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# GenomicDataCommons\n\n\u003c!-- badges: start --\u003e\n\n[![R-CMD-check](https://github.com/Bioconductor/GenomicDataCommons/workflows/R-CMD-check/badge.svg)](https://github.com/Bioconductor/GenomicDataCommons/actions)\n\u003c!-- badges: end --\u003e\n\n# What is the GDC?\n\nFrom the [Genomic Data Commons (GDC)\nwebsite](https://gdc.nci.nih.gov/about-gdc):\n\nThe National Cancer Institute’s (NCI’s) Genomic Data Commons (GDC) is a\ndata sharing platform that promotes precision medicine in oncology. 
It\nis not just a database or a tool; it is an expandable knowledge network\nsupporting the import and standardization of genomic and clinical data\nfrom cancer research programs.\n\nThe GDC contains NCI-generated data from some of the largest and most\ncomprehensive cancer genomic datasets, including The Cancer Genome Atlas\n(TCGA) and Therapeutically Applicable Research to Generate Effective\nTherapies (TARGET). For the first time, these datasets have been\nharmonized using a common set of bioinformatics pipelines, so that the\ndata can be directly compared.\n\nAs a growing knowledge system for cancer, the GDC also enables\nresearchers to submit data, and harmonizes these data for import into\nthe GDC. As more researchers add clinical and genomic data to the GDC,\nit will become an even more powerful tool for making discoveries about\nthe molecular basis of cancer that may lead to better care for patients.\n\nThe [data model for the GDC is\ncomplex](https://gdc.cancer.gov/developers/gdc-data-model/gdc-data-model-components),\nbut it is worth a quick overview. The data model is encoded as a so-called\nproperty graph. Nodes represent entities such as Projects, Cases,\nDiagnoses, Files (various kinds), and Annotations. The relationships\nbetween these entities are maintained as edges. Both nodes and edges may\nhave Properties that supply instance details. 
The GDC API exposes these\nnodes and edges in a somewhat simplified set of\n[RESTful](https://en.wikipedia.org/wiki/Representational_state_transfer)\nendpoints.\n\n# Quickstart\n\nThis software is available at Bioconductor.org and can be downloaded via\n`BiocManager::install`.\n\nTo report bugs or problems, either [submit a new\nissue](https://github.com/Bioconductor/GenomicDataCommons/issues) or\nsubmit a `bug.report(package='GenomicDataCommons')` from within R (which\nwill redirect you to the new issue on GitHub).\n\n## Installation\n\nInstallation can be achieved via Bioconductor’s `BiocManager` package.\n\n``` r\nif (!require(\"BiocManager\"))\n    install.packages(\"BiocManager\")\n\nBiocManager::install('GenomicDataCommons')\n```\n\n``` r\nlibrary(GenomicDataCommons)\n```\n\n## Check basic functionality\n\n``` r\nstatus()\n#\u003e $commit\n#\u003e [1] \"4dd3680528a19ed33cfc83c7d049426c97bb903b\"\n#\u003e \n#\u003e $data_release\n#\u003e [1] \"Data Release 34.0 - July 27, 2022\"\n#\u003e \n#\u003e $status\n#\u003e [1] \"OK\"\n#\u003e \n#\u003e $tag\n#\u003e [1] \"3.0.0\"\n#\u003e \n#\u003e $version\n#\u003e [1] 1\n```\n\n## Find data\n\nThe following code builds a `manifest` that can be used to guide the\ndownload of raw data. 
Here, filtering finds gene expression files\nquantified as raw counts using `STAR` from ovarian cancer patients.\n\n``` r\nge_manifest \u003c- files() |\u003e\n    filter( cases.project.project_id == 'TCGA-OV') |\u003e\n    filter( type == 'gene_expression' ) |\u003e\n    filter( analysis.workflow_type == 'STAR - Counts') |\u003e\n    manifest(size = 5)\nge_manifest\n#\u003e                                     id data_format     access                                                                   file_name\n#\u003e 1 7c69529f-2273-4dc4-b213-e84924d78bea         TSV       open d6472bd0-b4e2-4ed1-a892-e1702c195dc7.rna_seq.augmented_star_gene_counts.tsv\n#\u003e 2 0eff4634-f8c4-4db9-8a7c-331b21689bae         TSV       open 42165baf-b32c-4fc4-8b04-29c5b4e76de0.rna_seq.augmented_star_gene_counts.tsv\n#\u003e 3 7d74b4c5-6391-4b3e-95a3-020ea0869e86         TSV controlled   accf08d4-a784-4908-831a-7a08d4c5f0f5.rna_seq.star_splice_junctions.tsv.gz\n#\u003e 4 dc2aeea4-3cd0-4623-92f4-bbbc962851cc         TSV controlled   8ab508b9-2993-4e66-b8f9-81e32e936d4a.rna_seq.star_splice_junctions.tsv.gz\n#\u003e 5 0cf852be-d2e3-4fde-bba8-c93efae2961a         TSV       open 93831282-1dd1-49a3-acd7-dae2a49ca62e.rna_seq.augmented_star_gene_counts.tsv\n#\u003e                           submitter_id           data_category       acl            type file_size                 created_datetime                           md5sum\n#\u003e 1 7085a70b-2f63-4402-9e53-70f091f26fcb Transcriptome Profiling      open gene_expression   4254435 2021-12-13T20:53:42.329364-06:00 19d5596bba8949f4c138793608497d56\n#\u003e 2 f0d44930-b1ad-447a-86b9-27d0285954b9 Transcriptome Profiling      open gene_expression   4257461 2021-12-13T20:47:24.326497-06:00 d89d71b7c028c1643d7a3ee7857d8e01\n#\u003e 3 e6473134-6d65-414c-9f52-2c25057fac7d Transcriptome Profiling phs000178 gene_expression   3109435 2021-12-13T21:03:56.008440-06:00 fb8332d6413c44a9de02a1cbe6b018aa\n#\u003e 4 f99b93a9-70cb-44f8-bd1f-4edeee4425a4 
Transcriptome Profiling phs000178 gene_expression   4607701 2021-12-13T21:02:23.944851-06:00 26231bed1ef67c093d3ce2b39def81cd\n#\u003e 5 fb4d7abe-b61a-4f35-9700-605f1bc1512f Transcriptome Profiling      open gene_expression   4265694 2021-12-13T20:50:55.234254-06:00 050763aabd36509f954137fbdc4eeb00\n#\u003e                   updated_datetime                              file_id                      data_type    state experimental_strategy\n#\u003e 1 2022-01-19T14:47:28.965154-06:00 7c69529f-2273-4dc4-b213-e84924d78bea Gene Expression Quantification released               RNA-Seq\n#\u003e 2 2022-01-19T14:47:07.478144-06:00 0eff4634-f8c4-4db9-8a7c-331b21689bae Gene Expression Quantification released               RNA-Seq\n#\u003e 3 2022-01-19T14:01:15.621847-06:00 7d74b4c5-6391-4b3e-95a3-020ea0869e86 Splice Junction Quantification released               RNA-Seq\n#\u003e 4 2022-01-19T14:01:15.621847-06:00 dc2aeea4-3cd0-4623-92f4-bbbc962851cc Splice Junction Quantification released               RNA-Seq\n#\u003e 5 2022-01-19T14:47:07.036781-06:00 0cf852be-d2e3-4fde-bba8-c93efae2961a Gene Expression Quantification released               RNA-Seq\n```\n\n## Download data\n\nThis code block downloads the 5 gene expression files specified in the\nquery above. Using multiple processes to do the download very\nsignificantly speeds up the transfer in many cases. The following\ncompletes in about 15 seconds.\n\n``` r\nlibrary(BiocParallel)\nregister(MulticoreParam())\ndestdir \u003c- tempdir()\nfnames \u003c- lapply(ge_manifest$id,gdcdata)\n```\n\nIf the download had included controlled-access data, the download above\nwould have needed to include a `token`. Details are available in [the\nauthentication section below](#authentication).\n\n## Metadata queries\n\nHere we use a couple of ad-hoc helper functions to handle the output of\nthe query. 
See the `inst/script/README.Rmd` folder for the source.\n\nFirst, create a `data.frame` from the clinical data:\n\n``` r\nexpands \u003c- c(\"diagnoses\",\"annotations\",\n             \"demographic\",\"exposures\")\nclinResults \u003c- cases() |\u003e\n    GenomicDataCommons::select(NULL) |\u003e\n    GenomicDataCommons::expand(expands) |\u003e\n    results(size=6)\ndemoDF \u003c- filterAllNA(clinResults$demographic)\nexposuresDF \u003c- bindrowname(clinResults$exposures)\n```\n\n``` r\ndemoDF[, 1:4]\n#\u003e                                      cause_of_death         race gender              ethnicity\n#\u003e 2525bfef-6962-4b7f-8e80-6186400ce624           \u003cNA\u003e not reported female           not reported\n#\u003e 126507c3-c0d7-41fb-9093-7deed5baf431 Cancer Related not reported female           not reported\n#\u003e c43ac461-9f03-44bc-be7d-3d867eb708a0           \u003cNA\u003e not reported female           not reported\n#\u003e a59a90d9-f1b0-49dd-9c97-bcaa6ba55d44 Cancer Related not reported   male           not reported\n#\u003e 59122a43-606a-4669-806b-6747e0ac9985           \u003cNA\u003e        white   male not hispanic or latino\n#\u003e 4447a969-e5c8-4291-b83c-53a0f7e77cbc Cancer Related        white female not hispanic or latino\n```\n\n``` r\nexposuresDF[, 1:4]\n#\u003e                                       submitter_id                 created_datetime    alcohol_intensity pack_years_smoked\n#\u003e 2525bfef-6962-4b7f-8e80-6186400ce624 C3N-03839-EXP 2019-12-30T10:23:07.190853-06:00 Lifelong Non-Drinker                NA\n#\u003e 126507c3-c0d7-41fb-9093-7deed5baf431 C3N-01518-EXP 2018-06-21T14:27:48.817254-05:00 Lifelong Non-Drinker                NA\n#\u003e c43ac461-9f03-44bc-be7d-3d867eb708a0 C3N-03933-EXP 2019-03-14T08:23:14.054975-05:00 Lifelong Non-Drinker                NA\n#\u003e a59a90d9-f1b0-49dd-9c97-bcaa6ba55d44 C3N-02695-EXP 2019-03-14T08:23:14.054975-05:00   Occasional Drinker              16.8\n#\u003e 
59122a43-606a-4669-806b-6747e0ac9985 C3L-03642-EXP 2019-06-24T07:53:15.534197-05:00 Lifelong Non-Drinker              39.0\n#\u003e 4447a969-e5c8-4291-b83c-53a0f7e77cbc C3L-03728-EXP 2019-06-24T07:53:15.534197-05:00 Lifelong Non-Drinker                NA\n```\n\nNote that the diagnoses data has multiple lines per patient:\n\n``` r\ndiagDF \u003c- bindrowname(clinResults$diagnoses)\ndiagDF[, 1:4]\n#\u003e                                      ajcc_pathologic_stage                 created_datetime tissue_or_organ_of_origin age_at_diagnosis\n#\u003e 2525bfef-6962-4b7f-8e80-6186400ce624             Stage IIB 2019-07-22T06:40:02.183501-05:00          Head of pancreas            19956\n#\u003e 126507c3-c0d7-41fb-9093-7deed5baf431          Not Reported 2018-12-03T12:05:16.846188-06:00             Temporal lobe            26312\n#\u003e c43ac461-9f03-44bc-be7d-3d867eb708a0             Stage III 2019-03-14T10:37:34.405260-05:00       Floor of mouth, NOS            25635\n#\u003e a59a90d9-f1b0-49dd-9c97-bcaa6ba55d44          Not Reported 2019-03-14T10:37:34.405260-05:00       Floor of mouth, NOS            16652\n#\u003e 59122a43-606a-4669-806b-6747e0ac9985          Not Reported 2019-07-22T06:40:02.183501-05:00          Upper lobe, lung            23384\n#\u003e 4447a969-e5c8-4291-b83c-53a0f7e77cbc          Not Reported 2019-05-07T07:41:33.411909-05:00              Frontal lobe            29326\n```\n\n# Basic design\n\nThis package design is meant to have some similarities to the\n“tidyverse” approach of dplyr. Roughly, the functionality for finding\nand accessing files and metadata can be divided into:\n\n1.  Simple query constructors based on GDC API endpoints.\n2.  A set of verbs that when applied, adjust filtering, field selection,\n    and faceting (fields for aggregation) and result in a new query\n    object (an endomorphism)\n3.  
A set of verbs that take a query and return results from the GDC\n\nIn addition, there are auxiliary functions for asking the GDC API for\ninformation about available and default fields, slicing BAM files, and\ndownloading actual data files. Here is an overview of functionality[^1].\n\n-   Creating a query\n    -   `projects()`\n    -   `cases()`\n    -   `files()`\n    -   `annotations()`\n-   Manipulating a query\n    -   `filter()`\n    -   `facet()`\n    -   `select()`\n-   Introspection on the GDC API fields\n    -   `mapping()`\n    -   `available_fields()`\n    -   `default_fields()`\n    -   `grep_fields()`\n    -   `available_values()`\n    -   `available_expand()`\n-   Executing an API call to retrieve query results\n    -   `results()`\n    -   `count()`\n    -   `response()`\n-   Raw data file downloads\n    -   `gdcdata()`\n    -   `transfer()`\n    -   `gdc_client()`\n-   Summarizing and aggregating field values (faceting)\n    -   `aggregations()`\n-   Authentication\n    -   `gdc_token()`\n-   BAM file slicing\n    -   `slicing()`\n\n[^1]: See individual function and methods documentation for specific\n    details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbioconductor%2Fgenomicdatacommons","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbioconductor%2Fgenomicdatacommons","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbioconductor%2Fgenomicdatacommons/lists"}