{"id":20046477,"url":"https://github.com/zachcp/substrate_prediction_comparison","last_synced_at":"2025-10-30T05:53:23.356Z","repository":{"id":146170320,"uuid":"185405821","full_name":"zachcp/substrate_prediction_comparison","owner":"zachcp","description":"Compare AS5 and AS5 substrate Predictions","archived":false,"fork":false,"pushed_at":"2019-05-07T14:40:11.000Z","size":372,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-02-24T04:13:01.410Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zachcp.png","metadata":{"files":{"readme":"Readme.Rmd","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-05-07T13:17:22.000Z","updated_at":"2021-02-27T08:00:51.000Z","dependencies_parsed_at":null,"dependency_job_id":"982e6e2a-6cbc-40d1-8769-d7ed0f547d70","html_url":"https://github.com/zachcp/substrate_prediction_comparison","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zachcp%2Fsubstrate_prediction_comparison","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zachcp%2Fsubstrate_prediction_comparison/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zachcp%2Fsubstrate_prediction_comparison/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zachcp%2Fsubstrate_prediction_comparison/manifests","owner_url":"https://repos.ecosy
ste.ms/api/v1/hosts/GitHub/owners/zachcp","download_url":"https://codeload.github.com/zachcp/substrate_prediction_comparison/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241476435,"owners_count":19968916,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-13T11:24:12.609Z","updated_at":"2025-10-30T05:53:18.312Z","avatar_url":"https://github.com/zachcp.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"---\ntitle: \"Antismash 4 vs Antismash 5 Predictions\"\noutput: \n  github_document:\n    toc: true \n---\n\n\n```{r echo=FALSE, warning=FALSE, message=FALSE}\n\n#options(warn=-1)\n\nlibrary(dplyr)\nlibrary(data.table)\nlibrary(ggplot2)\n\n## --------------------------------------------------------------------------------------------------------------\n## Data Loading/Munging \n\ncolumns \u003c- c(\"mibig\", \"AD_domain_idx\", \"AD_domain_id\", \"code_type\", \"prediction\" )\n\nas4 \u003c- data.table::fread(\"output/domains_as4.txt\", col.names = columns)\nas5 \u003c- data.table::fread(\"output/domains_as5.txt\", col.names = columns)\n\n# fix domain IDs\nas5$AD_domain_id \u003c-  gsub(\"AMP-binding\\\\.\", \"A\", as5$AD_domain_id)\n\n\nas4$code_type \u003c-  gsub(\"Stachelhaus code\", \"stachelhaus_predictions_4\", as4$code_type)\nas4$code_type \u003c-  gsub(\"NRPSpredictor3 SVM\", \"nrpspredictor3_svm_single\", as4$code_type)\n\nas5$code_type \u003c-  gsub(\"stachelhaus_predictions\", \"stachelhaus_predictions_5\", as5$code_type)\nas5$code_type \u003c-  
gsub(\"single_amino_pred\", \"nrpspredictor2_single\", as5$code_type)\n\n\nas4 \u003c- as4  %\u003e% filter(\n  code_type %in% c(\"stachelhaus_predictions_4\", \"nrpspredictor3_svm_single\", \n                   \"stachelhaus_predictions_5\", \"nrpspredictor2_single\"))\n\nas5 \u003c- as5  %\u003e% filter(\n  code_type %in% c(\"stachelhaus_predictions_4\", \"nrpspredictor3_svm_single\", \n                   \"stachelhaus_predictions_5\", \"nrpspredictor2_single\",\n                   \"physicochemical_class\"))\n\n\nas4_wide \u003c-  data.table::dcast(as4, mibig+AD_domain_idx+AD_domain_id~code_type)\nas5_wide \u003c-  data.table::dcast(as5, mibig+AD_domain_idx+AD_domain_id~code_type)\n\n\n\n#all_data \u003c- dplyr::full_join(as4_wide, as5_wide, by=c(\"mibig\", \"AD_domain_idx\", \"AD_domain_id\")) %\u003e% \nall_data \u003c- dplyr::full_join(as4_wide, as5_wide, by=c(\"mibig\", \"AD_domain_id\")) %\u003e% \n  group_by(mibig) %\u003e% \n  add_count() %\u003e%\n  ungroup() %\u003e%\n  arrange(mibig, AD_domain_id )\n\n\n\nall_data$compare_stach1 \u003c- as.logical(purrr::map2(all_data$stachelhaus_predictions_4, all_data$stachelhaus_predictions_5,\n             ~.x == .y))\n\nall_data$compare_stach2 \u003c- as.logical(purrr::map2(all_data$stachelhaus_predictions_4, all_data$stachelhaus_predictions_5,\n                                    ~grepl(.x, .y)))\n\nall_data$compare_nrpspred \u003c- as.logical(purrr::map2(all_data$nrpspredictor2_single, all_data$nrpspredictor3_svm_single,\n                                                  ~.x == .y))\n\n```\n\n\n\n## Executive Summary\n\nThere are differences in Adenylation domain substrate predictions between AS4 and AS5 due to the different programs used for substrate identification. Using the two common measures of substrate prediction, Stachelhaus and NRPSPredictor (2/3) we can see that most of the differences between the two programs are due to instances where one of the programs does not generate a call. 
This suggests that the differences may simply be due to different acceptance thresholds between AS4 and AS5. However, there is also a small percentage of sequences that are predicted to be different substrates altogether, which is a potential concern if you are working with clusters containing this type of domain. \n\n\n\n\n\n\n## Intro\n\nI've been watching the Antismash team develop version 5 [on github](https://github.com/antismash/antismash) and have been very\nimpressed with the refactoring process. There seem to be major upgrades across the board - in the front end HTML (more interactivity, new cluster rule\nfeatures, changed tabbed layout for clusterblast and substrate predictions); in the refactoring of the code itself (modularized, type hints, new `Record` handling),\nas well as the Dockerfile (more data required outside of the application which will allow smaller images and sideloading/reuse of large datasets). Really, it's quite an update - big kudos to the whole team and Kblin and SJShaw in particular. \n\nClearly I'm a big fan so I decided to kick the tires.  After exploring for a bit I noticed that a few of the Adenylation domain substrates for clusters that I have worked on are not being called identically in Antismash 4 (AS4) and Antismash 5 (AS5). I wondered how prevalent this problem was, so I took a reasonably large public dataset, ran AS4 and AS5 on it, and compared their predictions.  AS4 used to offer a larger number of prediction programs while AS5 has narrowed down to 1 (or 2 depending on how you count). While the AS5 approach has its benefits in the form of speed/efficiency, I wonder if the more limited substrate predictions of AS5 might cause us to miss or mis-predict certain substrates.\n\n\n## My Approach\n\n1. Download Mibig GBKS and convert to fasta\n2. Download AS4 and AS5 docker images\n3. Download AS5 sample data\n4. Run AS4 and AS5 on each of the gbks in Mibig\n5. 
Parse the results from the output GBK (AS4) or JSON (AS5) files.\n6. Explore the results here.\n\nYou can reproduce the data yourself, although the domain files are already available in the `output/` directory.\n\n```bash\n# get Mibig and run against AS4 and AS5 using their docker image\n#\n# there are some software deps I use you may need\n# parallel, docker, biopython\nmake download\nmake runsmash\noutput/domains_as4.txt\n```\n\n\n## The Data\n\n![](images/as4_as5.png)\n\nThe substrate information for AS4 and AS5 differs. We can parse this information out of the AS4 gbk files and the AS5 json files. As you can see in the image above, AS4 contains specificity predictions for Stachelhaus, NRPSpredictor3, and a few other programs. AS5 has NRPSPredictor2 outputs as well as the Stachelhaus prediction.  I retrieved data from the fields in red in order to look for AS4 Stachelhaus \u003c---\u003e AS5 Stachelhaus differences as well as NRPSPredictor2 \u003c---\u003e NRPSPredictor3 SVM differences. \n\nMy parser scripts are in `scripts` and after pulling out the data and renaming a few columns, I join the AS4 and AS5 data together to create the final analysis. 
To compare the substrate predictions I performed the following checks of equality.\n\n\n```{r ,eval=FALSE}\n# compare Stachelhaus calls directly\nall_data$compare_stach1 \u003c- \n  as.logical(purrr::map2(all_data$stachelhaus_predictions_4, all_data$stachelhaus_predictions_5,\n             ~.x == .y))\n\n# use grep to compare any of the substrates predicted in AS4 against AS5\n# example: grepl(\"leu|d-leu\", \"leu\") -\u003e TRUE\nall_data$compare_stach2 \u003c- \n  as.logical(purrr::map2(all_data$stachelhaus_predictions_4, all_data$stachelhaus_predictions_5,\n                                    ~grepl(.x, .y)))\n\n# compare the nrpspredictor calls directly\nall_data$compare_nrpspred \u003c- \n  as.logical(purrr::map2(all_data$nrpspredictor2_single, all_data$nrpspredictor3_svm_single,\n                                                  ~.x == .y))\n```\n\n\nThe data, including the equality checks, are now all in a single table with one row for each domain. There are `r nrow(all_data)` Adenylation domains in this dataset. \nIt looks like this:\n\n```{r, echo=FALSE}\nhead(all_data)\n```\n\n\n## Stachelhaus Findings\n\n\n**Are AS4 Stachelhaus values identical to AS5 Stachelhaus values?**\n\n\n```{r echo=FALSE}\ntable(all_data$compare_stach1)\n```\n\n\n**Are AS4 Stachelhaus values identical to AS5 Stachelhaus values?** (use grep to match multiple AS4 values to a single AS5 value)\n\n```{r echo=FALSE}\ntable(all_data$compare_stach2)\n```\n\n\n**What are the non-matching values?** \n\nWhat are the twenty most common AS4 values when AS4 and AS5 do not match? Most are `no-call`s where AS4 didn't predict a value.  \n\n```{r, echo=FALSE}\nall_data %\u003e% \n  filter(compare_stach2 == FALSE) %\u003e% \n  .$stachelhaus_predictions_4 %\u003e% \n  table() %\u003e%  \n  sort(decreasing=TRUE) %\u003e% \n  .[1:20]\n```\n\n\nWhat are the twenty most common AS5 values when AS4 and AS5 do not match? Most are `no-call`s where AS4 didn't predict a value.  
\n\nwith `as4 no_calls`\n\n```{r, echo=FALSE}\nall_data %\u003e% \n  filter(compare_stach2 == FALSE) %\u003e% \n  .$stachelhaus_predictions_5 %\u003e% \n  table() %\u003e%  \n  sort(decreasing=TRUE) %\u003e% \n  .[1:20]\n```\n\nwithout `as4 no_calls`\n\n```{r, echo=FALSE}\nall_data %\u003e% \n  filter(compare_stach2 == FALSE, \n         stachelhaus_predictions_4 != \"no_call\") %\u003e% \n  .$stachelhaus_predictions_5 %\u003e% \n  table() %\u003e%  \n  sort(decreasing=TRUE) %\u003e% \n  .[1:20]\n```\n\n\n**What were the AS4 values that changed to Phe in AS5?**\n\n\nThere are 60 Phe differences. What are the AS4 calls when `AS5=\"phe\"`?  Many are hydrophobic residues that become Phe. There are also a few charged-residue predictions, where you might not expect a change.\n\n```{r , echo=FALSE}\n\n## Look at the Stachelhaus Calls first\n## \n## of the non no-calls, what are the other problems?\nnot_no_call \u003c- all_data %\u003e% filter(compare_stach2 == FALSE, stachelhaus_predictions_4 != \"no_call\")\n#sort(table(not_no_call$stachelhaus_predictions_5))\n#not_no_call  %\u003e% filter(stachelhaus_predictions_5  == \"phe\")\n\n\nnot_no_call  %\u003e% \n  filter(stachelhaus_predictions_5  == \"phe\") %\u003e% \n  .$stachelhaus_predictions_4 %\u003e%\n  table() %\u003e%\n  sort(decreasing=TRUE)\n\n```\n\n\n\n\n## NRPS Predictor Findings\n\n**Are AS4 NRPSPredictor3 values identical to AS5 NRPSPredictor2 values?**\n\n\nThere is 80% concordance between V2 and V3. Most of the differences are due to N/A values. 
Concordance rises to \u003e90% if we include only those cases where a call is made.\n\n```{r , echo=FALSE}\n\n#table(all_data$compare_stach1) # 897/1903 T/F\n#table(all_data$compare_stach2) # 1598/1320 T/F\ntable(all_data$compare_nrpspred) # 2249/551 T/F\n\n```\n\n\nWhat are the NRPSPredictor2 values when there is a discrepancy?\n\n```{r, echo=FALSE}\n\nall_data %\u003e% \n  filter(compare_nrpspred == FALSE) %\u003e% \n  .$nrpspredictor2_single %\u003e% \n  table() %\u003e%  \n  sort(decreasing=TRUE) %\u003e% \n  .[1:20]\n\n```\n\nWhat are the NRPSPredictor3 values when there is a discrepancy?\n\n```{r, echo=FALSE}\n\nall_data %\u003e% \n  filter(compare_nrpspred == FALSE) %\u003e% \n  .$nrpspredictor3_svm_single %\u003e% \n  table() %\u003e%  \n  sort(decreasing=TRUE) %\u003e% \n  .[1:20]\n\n```\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzachcp%2Fsubstrate_prediction_comparison","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzachcp%2Fsubstrate_prediction_comparison","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzachcp%2Fsubstrate_prediction_comparison/lists"}