{"id":16520890,"url":"https://github.com/jhrcook/tidygraph-oncotree","last_synced_at":"2025-10-25T14:49:04.662Z","repository":{"id":115870895,"uuid":"247693085","full_name":"jhrcook/tidygraph-oncotree","owner":"jhrcook","description":"A tidygraph of the MSK OncoTree","archived":false,"fork":false,"pushed_at":"2020-03-16T13:00:14.000Z","size":8155,"stargazers_count":7,"open_issues_count":0,"forks_count":3,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-08T17:14:24.894Z","etag":null,"topics":["cancer","cancer-genetics","oncotree","r","rlang","tidygraph","tidyverse"],"latest_commit_sha":null,"homepage":"https://joshuacook.netlify.com/post/tidygraph-oncotree/","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jhrcook.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2020-03-16T12:11:12.000Z","updated_at":"2025-01-26T21:39:54.000Z","dependencies_parsed_at":null,"dependency_job_id":"4781c5fa-7ea6-407e-af76-c17214a6e7ef","html_url":"https://github.com/jhrcook/tidygraph-oncotree","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/jhrcook/tidygraph-oncotree","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jhrcook%2Ftidygraph-oncotree","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jhrcook%2Ftidygraph-oncotree/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jhrcook%2Ftidygraph-oncotree/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jhrcook%2Ftidygraph-oncotree/manifests","owner_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/owners/jhrcook","download_url":"https://codeload.github.com/jhrcook/tidygraph-oncotree/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jhrcook%2Ftidygraph-oncotree/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":280971731,"owners_count":26422675,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-25T02:00:06.499Z","response_time":81,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cancer","cancer-genetics","oncotree","r","rlang","tidygraph","tidyverse"],"created_at":"2024-10-11T16:53:33.392Z","updated_at":"2025-10-25T14:49:04.634Z","avatar_url":"https://github.com/jhrcook.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"---\ntitle: \"Parsing OncoTree into a Tidygraph\"\nauthor: \"Joshua Cook\"\ndate: \"3/16/2020\"\noutput:\n  md_document:\n    variant: gfm\n---\n\n```{r setup, include=FALSE}\nknitr::opts_chunk$set(echo = TRUE, comment = \"#\u003e\")\n```\n\n## Introduction\n\nCancers are often first classified by their tissue of origin, but there are several types of cancer for each tissue.\nFurther, each of these can have several subdivisions.\nFor example, head and neck cancers can be further divided into seven cancers, including head and neck squamous cell carcinoma (HNSC).\nHNSC itself has six subtypes, too.\nThis 
hierarchy can be represented in a directed acyclic graph (DAG), as shown below.\n\n![](./assets/oncotree_online_example.png)\n\nThe [OncoTree](http://oncotree.mskcc.org/#/home) is a DAG of cancer subtypes maintained by one of the leading cancer research institutes, [Memorial Sloan Kettering](https://www.mskcc.org) (MSK).\n\nFor one of my projects in [lab](https://www.haigislab.org), I am dealing with many types of cancers from a variety of studies.\nThus, I want to use OncoTree to organize the types and provide relational information of the cancers.\nFor instance, depending on what I want to analyze, I may want the most specific subtype of cancer possible, or maybe I want the cancer grouped by their first level on the OncoTree (e.g. \"Head and Neck\").\n\nTherefore, I decided to parse the OncoTree into a ['tidygraph'](https://cran.r-project.org/web/packages/tidygraph/index.html), a \"tidy\" way to manage graph structures.\nThe following is a tutorial on how I did this.\nThe GitHub repository for this analysis is available at [jhrcook/tidygraph-oncotree](https://github.com/jhrcook/tidygraph-oncotree).\n\n## Setup\n\nI will load the packages 'tidyverse', 'tidygraph', and 'ggraph', and also use 'httr' and 'jsonlite', but call functions directly from their namespace.\nThe 'ggraph' package is a \"grammar of graphics\" for graph structures - it is used for plotting graphs at the end of this tutorial.\n\n```{r}\nlibrary(ggraph)\nlibrary(tidygraph)\nlibrary(tidyverse)\n```\n\n\n## Sending requests to OncoTree's API\n\nThe OncoTree has a fairly simple API (shown below).\nWe are interested in acquiring the full tree so we will use the \"/api/tumorTypes/tree\" endpoint.\n\n![](./assets/oncotree-api.png)\n\nWe can use the package ['httr'](https://cran.r-project.org/web/packages/httr/index.html) to send a \"get\" request to the OncoTree API.\nChecking the status code indicates how the request went, 200 representing success.\n\n```{r}\noncotree_res \u003c- 
httr::GET(\"http://oncotree.mskcc.org/api/tumorTypes/tree\")\noncotree_res$status_code\n```\n\nThe request returns a list of lists containing meta data on the request and the actual OncoTree data.\n\n```{r}\nhttr::headers(oncotree_res)\n```\n\nThe OncoTree data is a JSON in the `content` list of the response.\nIn case you want this data for other projects, the JSON can be written to file using the `write()` function.\n\n```{r}\nwrite(rawToChar(oncotree_res$content), file.path(\"oncotree.json\"))\n```\n\n## Parsing the OncoTree JSON\n\nThe OncoTree is organized as a highly nested list.\nTo get insight into the structure we must turn the JSON into a list of lists so we can parse it in R.\n\n```{r}\noncotree_json \u003c- jsonlite::fromJSON(rawToChar(oncotree_res$content))\n```\n\nFrom here, I just played around with the list until I got an understanding of the structure.\n\n```{r}\nnames(oncotree_json)\n```\n\n```{r}\nnames(oncotree_json$TISSUE)\n```\n\n```{r}\noncotree_json$TISSUE$code\n```\n\n```{r}\noncotree_json$TISSUE$name\n```\n\n```{r}\noncotree_json$TISSUE$level\n```\n\nAlmost all of the parts of `oncotree_json$TISSUE` contain information about the first level of the graph.\nThe `children` section, though, contains all of the tissues that we can see on the OncoTree web application.\n\n```{r}\nnames(oncotree_json$TISSUE$children)\n```\n\nEach of these children \"nodes\" had the same information, and children of their own, and so on.\n\n```{r}\nnames(oncotree_json$TISSUE$children$HEAD_NECK)\n```\n\n```{r}\nnames(oncotree_json$TISSUE$children$HEAD_NECK$children)\n```\n\nTherefore, we can tell that the JSON is a nested list of the nodes in the DAG.\nAnd all we need to do is implement a graph-traversing function to extract all of the information.\nSince it is nested and we are building a graph, this strongly suggests we will need a recursive algorithm.\n\n## Building the Tidygraph\n\n### Data needed for the tidygraph\n\nTo figure out what to do first, I often find 
it helpful to figure out what my output should look like.\nTo make a `tidygraph`, I will need an *edge list* and a *node list*.\nThe first is a two-column table with names \"from\" and \"to\" populated by names of the nodes where each row indicates an edge of the graph.\nThe node list is optional and contains any other information about each node, one row per node.\n\nHere is a mock example of the data frames we want out of our recursive algorithm.\n\n```{r}\n# edge list\nedge_list \u003c- tibble::tribble(\n    ~ from, ~ to,\n       \"A\",  \"B\",\n       \"B\",  \"C\",\n       \"C\",  \"D\",\n       \"A\",  \"D\",\n       \"B\",  \"E\",\n       \"E\",  \"D\"\n)\nedge_list\n```\n\n```{r}\n# node list\nnode_list \u003c- tibble(name = LETTERS[1:5], values = round(runif(5), 2))\nnode_list\n```\n\nThe edge list can be turned into a `tidygraph` object using the `as_tbl_graph()` function.\nI explicitly set the `directed` parameter to `TRUE` even though it is the default value.\n\n```{r}\ngr \u003c- as_tbl_graph(edge_list, directed = TRUE)\ngr\n```\n\nTo add the node information, I join the node list table by the \"name\" column.\nThe `%N\u003e%` infix operates just like the 'magrittr' pipe `%\u003e%` except it also activates the nodes of the `tidygraph` object so that the `full_join()` operates on the nodes and not the edges.\nI am not able to fully describe the 'tidygraph' API in this tutorial, but see vignettes by the creator, Thomas Pedersen ([\\@thomasp85](https://twitter.com/thomasp85)), for a good introduction: [Data Imaginist - tidygraph](https://www.data-imaginist.com/tags/tidygraph).\n\n```{r}\ngr %N\u003e%\n    full_join(node_list, by = \"name\")\n```\n\nNow we just need to figure out how to build an edge list and node list from the nested JSON.\n\n### Extracting OncoTree from the JSON\n\n\u003e Note: Below I demonstrate how to create the final algorithm as if it was a linear process - it in fact took me about an hour and a half of toying with the functions to 
get the desired result. If it is not easy to grasp right away, don't worry, it wasn't for me either.\n\nSo we know our algorithm will be recursive, which means we will pass the first node to a function once and this function will call itself from within.\nTherefore, let's create a function `add_children_to_dag()` that takes a node and an edge list and returns an edge list.\n\n```{r}\nadd_children_to_dag \u003c- function(node, el) {\n    return(el)\n}\n```\n\n#### Extracting node information\n\nFirst, we can deal with the node information because that can be extracted right away and stashed in a global variable, no recursion needed.\n\n```{r}\n# The node information.\nNODE_INFO \u003c- tibble()\n\n# Build a tibble of the node information from OncoTree.\nextract_node_info \u003c- function(node) {\n    NODE_INFO \u003c\u003c- bind_rows(\n        NODE_INFO,\n        tibble(\n            code = node$code,\n            description = node$name,\n            tissue = ifelse(is.null(node$tissue), \"tissue\", node$tissue),\n            main_type = ifelse(is.null(node$mainType), \"tissue\", node$mainType),\n            color = ifelse(is.null(node$color), \"Black\", node$color),\n            level = node$level\n        )\n    )\n}\n\n\nadd_children_to_dag \u003c- function(node, el) {\n    # Add node information to the `NODE_INFO` global variable.\n    extract_node_info(node)\n    \n    return(el)\n}\n```\n\nNow, the node information is extracted and added to a global variable `NODE_INFO` using the `extract_node_info()` function.\nIt just pulls out some of the useful information from the JSON for each node and binds it with the existing data frame.\n\nNow we can run the first experiment to make sure everything is working properly.\nThe initial edge list is just an empty `tibble()`.\n\n```{r}\nadd_children_to_dag(oncotree_json[[1]], tibble())\n```\n\nNothing is done to the `el` variable in the `add_children_to_dag()` function yet, so it returns an empty data frame.\nHowever, the 
`NODE_INFO` data frame should have the information for the first level of OncoTree.\n\n```{r}\nNODE_INFO\n```\n\nSuccess!\n\n#### Add the node's children to the edge list\n\nNow we need to start on the hard part, the recursive traversal of the DAG in the JSON.\nTo begin, we should add an if-statement to check if the node has children.\nIf it does, then we need to add the connections from this node to the children to the edge list `el`.\n\nThis is easily done by binding the existing `el` with a new tibble with \"from\" and \"to\" columns containing the name of the current node (from) and the names of the children (to).\n\nLet's start with that and make sure it works.\n\n```{r}\nadd_children_to_dag \u003c- function(node, el) {\n    # Add node information to the `NODE_INFO` global variable.\n    extract_node_info(node)\n    \n    if (length(names(node$children)) \u003e 0) {\n        # Add this node and children to edge list.\n        el \u003c- bind_rows(el, tibble(from = node$code, to = names(node$children)))\n    } \n    return(el)\n}\n```\n\n```{r}\nadd_children_to_dag(oncotree_json[[1]], tibble())\n```\n\nGreat! 
We can see that the connections from `\"TISSUE\"` to each of the top-level cancer groups were successfully added to the edge list.\n\nNow we need to apply this function to each of the children of this node.\nI do this with `map()` from the 'purrr' package (attached along with 'tidyverse').\nIt works similarly to `lapply()` from base R, but is a bit easier to manage, in my opinion (and it has some other useful helpers and capabilities that we don't use here).\n\nBasically, each child node is passed to `add_children_to_dag()` along with the edge list.\nThe node information for each child will be extracted and, if they have children, they will be sent through `add_children_to_dag()`, too.\nEach time, an edge list is returned.\n\nThis is a recursive process and will naturally visit every node, building up the edge list through every \"leaf\" on the tree.\n\n```{r}\nadd_children_to_dag \u003c- function(node, el) {\n    # Add node information to the `NODE_INFO` global variable.\n    extract_node_info(node)\n    \n    if (length(names(node$children)) \u003e 0) {\n        # Add this node and children to edge list.\n        nodes_el \u003c- bind_rows(el, tibble(from = node$code, to = names(node$children)))\n        \n        # Repeat for children nodes.\n        childrens_el \u003c- map(node$children, add_children_to_dag, el = el)\n       \n        # Combine into one edge list.\n        el \u003c- bind_rows(nodes_el, childrens_el)\n    } \n    return(el)\n}\n```\n\nAnd that's it!\nWe should now be able to create the entire DAG.\n\n```{r}\n# reset `NODE_INFO`\nNODE_INFO \u003c- tibble()\n\n# Run the recursive algorithm.\noncotree_el \u003c- add_children_to_dag(oncotree_json[[1]], tibble())\noncotree_el\n```\n\n```{r}\nNODE_INFO\n```\n\n#### Create the tidygraph\n\nJust as before, the `tidygraph` object can be constructed from the edge list, followed by joining the node list information.\n\n```{r}\noncotree_gr \u003c- as_tbl_graph(oncotree_el, directed = TRUE) %N\u003e%\n    
full_join(NODE_INFO, by = c(\"name\" = \"code\"))\n```\n\nThis tidygraph object can now be saved for future reference.\n\n```{r}\nsaveRDS(oncotree_gr, \"msk_oncotree_tidygraph.rds\")\n```\n\n\n## Visualization\n\nTo make the visualization a bit better, I mapped the colors extracted from the JSON to colors available in R.\n\n```{r}\noncotree_colors \u003c- tibble::tribble(\n             ~ color,      ~ new_color,\n             \"black\",          \"black\",\n             \"Black\",          \"black\",\n              \"Blue\",           \"blue\",\n              \"Cyan\",           \"cyan\",\n           \"DarkRed\",        \"darkred\",\n         \"Gainsboro\",        \"#DCDCDC\",\n              \"Gray\",           \"grey\",\n             \"Green\",          \"green\",\n           \"HotPink\",        \"hotpink\",\n         \"LightBlue\",      \"lightblue\",\n       \"LightSalmon\",    \"lightsalmon\",\n      \"LightSkyBlue\",   \"lightskyblue\",\n       \"LightYellow\",    \"lightyellow\",\n         \"LimeGreen\",      \"limegreen\",\n    \"MediumSeaGreen\", \"mediumseagreen\",\n            \"Orange\",         \"orange\",\n         \"PeachPuff\",        \"#FFDAB9\",\n            \"Purple\",         \"purple\",\n               \"Red\",            \"red\",\n       \"SaddleBrown\",           \"tan4\",\n              \"Teal\",        \"#008080\",\n             \"White\",          \"white\",\n            \"Yellow\",         \"yellow\"\n)\n\noncotree_gr \u003c- oncotree_gr %N\u003e%\n    left_join(oncotree_colors, by = \"color\") %\u003e%\n    select(-color) %\u003e%\n    dplyr::rename(color = new_color)\n```\n\nFinally, we can use the 'ggraph' package to create some visualizations of the DAG.\nAgain, I am unable to fully explain 'ggraph' here, but the package vignettes are very good: [Data Imaginist - ggraph](https://www.data-imaginist.com/tags/ggraph).\n\nThe first plot below shows the OncoTree graph spreading radially from the center, each layer representing a further 
subdivision of the cancer type.\nThe colors roughly correspond to the tissue of origin.\n\n```{r, fig.width=7, fig.height=7}\noncotree_gr %N\u003e%\n    ggraph(layout = \"tree\", circular = TRUE) +\n    geom_edge_diagonal(color = \"grey50\", alpha = 0.5) +\n    geom_node_point(aes(color = color)) +\n    scale_color_identity() +\n    theme_graph()\n```\n\nEach of the subgraphs can also be separated by tissue of origin.\n\n```{r, fig.width=7, fig.height=20, warning=FALSE, message=FALSE}\noncotree_gr %N\u003e%\n    filter(name != \"TISSUE\") %\u003e%\n    morph(to_components) %\u003e%\n    mutate(grp = tissue[which.max(level)]) %\u003e%\n    unmorph() %\u003e%\n    ggraph(layout = \"tree\") +\n    facet_nodes(~ grp, ncol = 4, scales = \"free\") +\n    geom_edge_diagonal(color = \"grey50\", alpha = 0.5) +\n    geom_node_point(aes(color = color)) +\n    scale_color_identity() +\n    theme_graph()\n```\n\nAnd below is an example of the subdivisions of lung cancer.\n\n```{r, fig.width=7, fig.height=8}\noncotree_gr %N\u003e%\n    filter(name != \"TISSUE\") %\u003e%\n    morph(to_components) %\u003e%\n    mutate(grp = tissue[which.max(level)]) %\u003e%\n    unmorph() %N\u003e%\n    filter(grp == \"Lung\") %\u003e%\n    ggraph(layout = \"tree\") +\n    geom_edge_diagonal(color = \"grey50\") +\n    geom_node_label(aes(label = name, fill = color), size = 3,\n                    repel = FALSE, label.r = unit(0.1, \"lines\")) +\n    scale_fill_identity() +\n    coord_flip() +\n    scale_y_reverse() +\n    theme_graph()\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjhrcook%2Ftidygraph-oncotree","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjhrcook%2Ftidygraph-oncotree","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjhrcook%2Ftidygraph-oncotree/lists"}