https://github.com/vjcitn/bioccuratedccdg_anvil

Manage concepts related to PFB import from Gen3 to AnVIL; example of CCDG

Host: GitHub
URL: https://github.com/vjcitn/bioccuratedccdg_anvil
Owner: vjcitn
Created: 2022-02-17T12:25:21.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2022-02-17T12:26:12.000Z (over 3 years ago)
Last Synced: 2025-01-09T13:46:31.346Z (5 months ago)
Size: 1.95 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# BiocCuratedCCDG_ANVIL

Manage concepts related to PFB import from Gen3 to AnVIL; example of CCDG

# AnVIL Workspace Description

## Concept

Given a PFB "export to terra" from Gen3, produce a DataFrame instance that can be used as a 'sample metadata' resource for molecular data lodged in the workspace.

Here are the fields of "sample" data from CCDG:

```
entity:sample_id	pfb:created_datetime	pfb:project_id	pfb:sample_provider	pfb:specimen_id	pfb:state	pfb:subject	pfb:submitter_id	pfb:updated_datetime
```
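As a small illustration of how these exported names map to plain column names, the sketch below splits the header shown above and strips the namespace prefix from each field. Stripping `entity:` as well as `pfb:` is an assumption made here for illustration; the function in the next section only removes `pfb:`.

```r
# Header line copied from the "sample" export shown above (tab-separated).
header <- paste(
  "entity:sample_id", "pfb:created_datetime", "pfb:project_id",
  "pfb:sample_provider", "pfb:specimen_id", "pfb:state", "pfb:subject",
  "pfb:submitter_id", "pfb:updated_datetime",
  sep = "\t"
)

# Split into individual field names.
fields <- strsplit(header, "\t", fixed = TRUE)[[1]]

# Drop a leading "entity:" or "pfb:" prefix (the "entity:" part is an
# assumption for this illustration only).
bare <- sub("^(entity|pfb):", "", fields)
print(bare)
```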

The function given below reads the sample, subject, and sequencing tables and joins them: subject fields are attached to each sample by subject_id, and sequencing records are then matched to samples by sample_id.

## colData construction

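```
build_colData = function() {
  # AnVIL supplies avtable(); dplyr supplies mutate() and the joins
  require(AnVIL)
  require(dplyr)

  # strip the "pfb:" prefix from column names
  simplify_names = function(.data) {
    names(.data) = gsub("^pfb:", "", names(.data))
    .data
  }

  # read the three tables from the current workspace
  samp = avtable(table="sample")
  subj = avtable(table="subject")
  seqtab = avtable(table="sequencing")

  # retained but disabled: split samples by tissue type
  #alltypes = unique(samp$`pfb:tissue_type`)
  #tissframes = lapply(na.omit(alltypes), function(x) samp |> filter(`pfb:tissue_type` == x))
  #names(tissframes) = na.omit(alltypes)
  #library(MultiAssayExperiment)
  # by stages
  #t2 = lapply(tissframes, function(x) mutate(x, subject_id=`pfb:subject`))

  # attach subject fields to each sample
  t3 = left_join(mutate(samp, subject_id=`pfb:subject`), subj, by="subject_id")
  # key each sequencing record by the sample it refers to
  sqsq = mutate(seqtab, sample_id=`pfb:sample`)
  # keep sequencing records that have a matching sample, then clean the names
  inner_join(sqsq, t3, by="sample_id") |> simplify_names()
}
```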
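The join sequence can be tried outside a workspace with small stand-in tables. The sketch below is only an illustration: the three tibbles and all of their values are invented, and only the column naming pattern follows the tables used by `build_colData`.

```r
library(dplyr)

# Toy stand-ins for the workspace tables; every value here is invented.
samp <- tibble(
  sample_id = c("s1", "s2", "s3"),
  `pfb:subject` = c("u1", "u1", "u2"),
  `pfb:project_id` = "projA"
)
subj <- tibble(
  subject_id = c("u1", "u2"),
  `pfb:sex` = c("female", "male"),
  `pfb:project_id` = "projA"
)
seqtab <- tibble(
  sequencing_id = c("q1", "q2"),
  `pfb:sample` = c("s1", "s3"),
  `pfb:project_id` = "projA"
)

# Same join sequence as build_colData(): subjects onto samples,
# then sequencing records onto the combined sample/subject table.
t3 <- left_join(mutate(samp, subject_id = `pfb:subject`), subj, by = "subject_id")
sqsq <- mutate(seqtab, sample_id = `pfb:sample`)
toy <- inner_join(sqsq, t3, by = "sample_id")

# Strip the "pfb:" prefix, as simplify_names() does.
names(toy) <- gsub("^pfb:", "", names(toy))
print(names(toy))
```

In this toy case the inner join keeps only the two sequencing records with a matching sample, and the column that all three tables share shows up three times: once without a suffix and once each with `.x` and `.y`.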

The result of running the above function in this workspace has fields:

```
> names(ccdg_full_coldata)
 [1] "sequencing_id"              "total_reads"                "file_format"                "project_id"
 [5] "estimated_library_size"     "data_format"                "sequencing_assay"           "mean_coverage_per_base"
 [9] "date_data_generation"       "data_type"                  "file_size"                  "md5sum"
[13] "experimental_strategy"      "file_name"                  "submitter_id"               "object_id"
[17] "file_state"                 "fragment_length_mean"       "file_type"                  "duplication_rate_of_mapped"
[21] "data_category"              "state"                      "reference_genome_build"     "created_datetime"
[25] "ga4gh_drs_uri"              "file_md5sum"                "updated_datetime"           "sample"
[29] "sample_id"                  "project_id.x"               "submitter_id.x"             "sample_provider"
[33] "state.x"                    "created_datetime.x"         "specimen_id"                "updated_datetime.x"
[37] "subject"                    "subject_id"                 "sex"                        "project_id.y"
[41] "participant_id"             "submitter_id.y"             "state.y"                    "dbgap_submission"
[45] "created_datetime.y"         "dbgap_study_id"             "updated_datetime.y"         "project"
```
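Given the join order in `build_colData`, the unsuffixed duplicates (such as `project_id`) appear to come from the sequencing table, the `.x` versions from the sample table, and the `.y` versions from the subject table. A minimal sketch of making that provenance explicit in the names follows; the helper `relabel` and the `sample_`/`subject_` prefixes are hypothetical and not part of the repository.

```r
# A few of the field names from the listing above.
nms <- c("project_id", "project_id.x", "project_id.y",
         "state", "state.x", "state.y", "sex")

# Hypothetical helper: turn the ".x"/".y" join suffixes into
# table-of-origin prefixes.
relabel <- function(nms) {
  nms <- sub("^(.*)\\.x$", "sample_\\1", nms)
  sub("^(.*)\\.y$", "subject_\\1", nms)
}

print(relabel(nms))
```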
