{"id":19917147,"url":"https://github.com/austinv11/erc-pipeline","last_synced_at":"2025-05-03T06:30:45.085Z","repository":{"id":49044544,"uuid":"363197150","full_name":"austinv11/ERC-Pipeline","owner":"austinv11","description":"The pipeline in this repository allows for the generation of evolutionary rate correlations (ERCs). ERCs use a phylogeny-based approach to look for evolutionary signatures of potential protein-protein interaction. Since ERCs can be fairly straightforward to produce, they can be used as an easy method of discovering candidate protein-protein interactions.","archived":false,"fork":false,"pushed_at":"2021-09-15T14:19:41.000Z","size":7474,"stargazers_count":3,"open_issues_count":1,"forks_count":6,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-07T11:51:30.563Z","etag":null,"topics":["erc","protein"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/austinv11.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-04-30T16:23:24.000Z","updated_at":"2024-09-09T09:38:28.000Z","dependencies_parsed_at":"2022-09-09T05:20:29.923Z","dependency_job_id":null,"html_url":"https://github.com/austinv11/ERC-Pipeline","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/austinv11%2FERC-Pipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/austinv11%2FERC-Pipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/austinv11%2FERC-Pipeline/releases","manifests_ur
l":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/austinv11%2FERC-Pipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/austinv11","download_url":"https://codeload.github.com/austinv11/ERC-Pipeline/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252154732,"owners_count":21702982,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["erc","protein"],"created_at":"2024-11-12T21:48:55.301Z","updated_at":"2025-05-03T06:30:43.680Z","avatar_url":"https://github.com/austinv11.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ERC-Pipeline\n\nThe pipeline in this repository allows for the generation of evolutionary rate correlations (ERCs). ERCs use a \nphylogeny-based approach to look for evolutionary signatures of potential protein-protein interaction. Since ERCs can\nbe fairly straightforward to produce, they can be used as an easy method of discovering candidate protein-protein \ninteractions.\n\n## Installation\n### Environment Requirements\nThe pipeline should be compatible with Python version 3.7 or higher. \nR scripts were written for R 3.6.1 (\"Action of the Toes\") but are optional.\nNote that the pipeline was made to take advantage of \nmultiple cores. 
We recommend at least 5 available cores and 16GB of RAM for large data sets.\n\n### Python Package Dependencies\nThe external Python dependencies can be installed using pip (most packages can also be found on conda/bioconda):\n`pip install -r requirements.txt`\n\nNote that `uvloop` and `pygraphviz` are *optional* dependencies so if they fail to install on your system, you can still\nrun the pipeline. Just note that `pygraphviz` is required for network diagrams.\n\n### R Package Requirements\nAll R packages are optional. They can be installed using the `install.packages()` function in base R.\n\n* If using time-corrected partial correlation calculations, `ppcor`\n* If using any of the `generate_figs.R` script code, `ggplot2, readxl, writexl, ggvenn, dplyr, gridExtra, ape, picante, EnvStats, cowplot, patchwork, and Cairo` \n\n### External Tooling Requirements\nAll of the following are *required* to be installed.\n* MAFFT (https://mafft.cbrc.jp/alignment/software/) - For generating alignments.\n* trimAl (http://trimal.cgenomics.org/) - For trimming alignments.\n* IQ-TREE 2 (http://www.iqtree.org/) - For calculating phylogenetic trees.\n\n### Preparation of Data\n1. Time-scaled phylogeny (Newick format)\n    * If not specified, the pipeline will use the Mammalian time-scaled phylogeny generated using TimeTree \n      (http://timetree.org/) for all the Mammalian taxa available in OrthoDB v10 (https://www.orthodb.org/). \n      **Implementation Note:** The species names in this case are formatted in all caps with spaces replaced with \n      underscores (i.e. `Homo sapiens` is formatted as `HOMO_SAPIENS`). You can find the full file in \n      [data/finished_mam_timetree.nwk](./data/finished_mam_timetree.nwk).\n2. Protein sequences (FASTA format)\n    * All proteins of interest should be provided as FASTA formatted sequences (these are assumed to be unaligned) in a\n      directory. 
Note that the title of each sequence must be the taxon name for each sequence corresponding to the \n      time-scaled phylogeny (e.g. if you have a human protein sequence, the FASTA sequence title must be exactly \n      `HOMO_SAPIENS` if using the default tree).\n    * This pipeline does not attempt to disambiguate paralogous sequences. Each protein sequence should therefore be single-copy;\n      any multi-copy sequences in the FASTA files will be ignored.\n      \n\n## Calculating ERCs on the command line\nAssuming the sequences are in a directory called `SEQUENCE_DIRECTORY`, the time-scaled phylogeny is a file called\n`PHYLOGENY.nwk`, and you wish to run calculations in a directory called `OUTPUT_DIRECTORY`, run:\n\n`python3 cli.py --timetree PHYLOGENY.nwk --sequences SEQUENCE_DIRECTORY --wd OUTPUT_DIRECTORY`\n\n* Note that if you are using the default mammalian phylogeny, you can omit the `--timetree PHYLOGENY.nwk` argument. \n  Additionally, you can then use the 20MY or 30MY ERC calculations described in Varela et al. 2021 using\n  `--erc-type 20my` or `--erc-type 30my`, respectively. \n  \n* If you wish to use the time-corrected correlation method, you can add the `--erc-type bt` argument (requires R).\n\n* If you want to only run ERCs on specific pairs of proteins, you can pass the \n  `--align-pair /path/to/align1.fasta /path/to/align2.fasta` option as many times as needed.\n\n* To specify the number of CPU cores to run with, use the `-n` argument.\n\n* If your sequence data are already prepared, you can use `--skip-align` and `--skip-trim` to skip alignment and \n  trimming of alignments, respectively.\n  \n* If your sequences are titled based on non-readable identifiers instead of protein names/symbols (for example: \n  OrthoDB ids), you can create a tab-separated file with 2 columns: \"alignment_identifier\" and \"readable_name\". 
This\n  will be used to replace the original alignment names with the readable names if you pass the path to this file to the\n  argument: `--id2name /path/to/file.tsv`. The annotations used in the publication can be used like so: \n  `--id2name data/id2name.tsv`\n  \n* If you have previously run the ERC-Pipeline in another directory, you can include the proteins from the previous run\n  in the current run with the `--previous-run /path/to/previous/wd` flag. You can pass it multiple times to include data\n  from multiple runs. If you wish to include the (30MY) ERCs from the publication, you can do it like this: \n  `--previous-run data/30my_erc_results` Note: If you wish to do this with the published previous 30MY ERC data, make \n  sure to clone this repository with git-lfs installed (https://git-lfs.github.com/).\n  \n* To automatically archive intermediate FASTA files, pass the `--archive` argument.\n\n* If you want to run ERCs along pieces of the alignments, add the `--segment` argument. This can be modified using the \n  `--slide` argument, which rather than splitting the alignment into kmers, will normalize the data using a sliding \n  window based on kmers. You can also use `--kmer K` to change the size of the kmers (replace \"K\" with the number). \n  \n* If you want to just prepare all your data and not calculate ERCs, you can pass the `--prepare` flag.\n\n## Calculating ERCs in Python\nThe pipeline is easily accessible in Python. 
For practical examples of usage, check out the source of `cli.py`.\n\nSetting up the environment:\n```python\nfrom pipeline import ErcWorkspace\n\n# The working dir and tree args are the only required arguments\nworkspace = ErcWorkspace(\"directory/to/run/in\", \"path/to/tree/topology.nwk\")  \n```\nNote that this immediately changes the Python's runtime current working directory to the working directory passed as the\nfirst argument.\n\nDefinitions of the additional optional arguments;\n* segmented: If True, run ERC on pieces of the alignments\n* segment_size: The size of the pieces\n* internal_requirement: The minimum required size of internal branches to perform correlations with (only applies to \n  normal ERCs and by default only all terminal branches are included in ERC calculations)\n* include_terminal: If you pass an internal_requirement argument, should terminal branches also be included in ERC \n  calculations?\n* recalculate: If True, even if ERCs have been previously calculated for tree pairs in the environment, recalculate the \n  correlation results.\n* sliding_window: If True, ERCs run on pieces of alignments are normalized with a sliding window.\n* skip_align: If True, skip alignment of the input sequences.\n* skip_trim: If True, skip trimming of alignments.\n* archive: If True, save intermediate files into archives.\n* taxon_set: Pass a list of taxa to perform calculations with, by default all taxa are considered.\n* time_corrected: If True, instead of standard Spearman's test, run Spearman's partial correlations controlling for time.\n* id2name: A dictionary that if passed, will be used to convert protein ids to readable names in outputs.\n* cores: Number of cores to use for calculations.\n* prepare: If True, calculate intermediate files and then stop.\n* skip_qc: If True, skip the QC step (still currently in development).\n\nInclude previously run ERCs (assuming you ran the pipeline in directory: `old/erc_dir`):\n```python\nfrom pipeline import 
register_previous_run\n\nregister_previous_run(\"old/erc_dir/\")\n```\n\nAdd sequences to the workspace:\n```python\n# Add alignments for all-by-all calculations\nworkspace.add_alignment(\"path/to/alignment.fasta\")\n\n# Add specific pairs of alignments to run calculations on\nworkspace.add_alignment(\"path/to/alignment1.fasta\", \"path/to/alignment2.fasta\")\n\n# Add alignments that should be concatenated together when run for ERCs\nworkspace.add_concatenated_alignment([\"path/to/alignment1.fasta\", \"path/to/alignment2.fasta\"], \"concatenated_name\")\n\n# You can also specify pairs of concatenated alignments\nworkspace.add_concatenated_alignment([\"path/to/alignment1.fasta\", \"path/to/alignment2.fasta\"], \"concatenated_name\",\n                                     [\"path/to/alignment3.fasta\", \"path/to/alignment4.fasta\"], \"concatenated_name2\")\n```\n\nRun the ERCs:\n```python\n# If running outside of a coroutine:\nimport asyncio\nasyncio.get_event_loop().run_until_complete(workspace.run())\n\n# If inside a coroutine:\nawait workspace.run()\n```\n\nFollowing calculations, you can load the data into a `networkx` `Graph` object using:\n```python\nnet = workspace.generate_network()\n```\nIn this network, each protein that was run is a node, with edges connecting the nodes based on ERCs. 
Each\nedge has the `rho` and `p` properties representing the ERC results.\n\nSave data:\n```python\n# Save the data to an excel spreadsheet\nworkspace.export_results(\"filename.xlsx\")\n```\n\nAccess Mammalian TimeTree:\n```python\nfrom pipeline import _10mya_cutoff, _20mya_cutoff, _30mya_cutoff, prune_tree\nfrom utilities import _self_path, safe_phylo_read\nimport os.path as osp\n\ntree = safe_phylo_read(\"path/to/file.nwk\")  # Read general newick trees\ntimetree = safe_phylo_read(osp.join(_self_path(), 'data', 'finished_mam_timetree.nwk'))  # Read the mammalian TimeTree\n\n# Prune mammalian tree to the 10my, 20my, and 30my taxa sets respectively\n_10my = prune_tree(timetree, _10mya_cutoff)\n_20my = prune_tree(timetree, _20mya_cutoff)\n_30my = prune_tree(timetree, _30mya_cutoff)\n```\n\n### Misc. Features in Python\nImproved async performance:\n```python\n# If the uvloop package is installed (unix only), you can run the following to improve performance before running ERCs:\nfrom utilities import try_hook_uvloop\n\ntry_hook_uvloop()\n```\n\nExtract rate data from trees:\n```python\nfrom pipeline import get_rates\n\n# Calculate rates for the trees passed. First argument is the PhyloTree object for the species topology.\n# Second arg determines whether to prune the tree topologies so all trees have the same topologies\n# Third arg is a list of taxa to limit calculations to\n# The remaining args are the trees for proteins of interest\ntaxa, rates = get_rates(tree_topology_object, True, None, tree1, tree2)\n# Taxa is a list of taxa names\n# Rates is a list of lists. The first list are all the time units, the following lists are the rates. 
\n# With the indices of each element corresponding to the taxon in the matching index of the taxa list.\n```\n\nConvert rate data to correlations:\n```python\nfrom pipeline import get_rates, rates_to_correlation\n\nrate_info = get_rates(tree_topology_object, True, None, tree1, tree2)\nrho, p = rates_to_correlation(rate_info)\n```\n\nRun an enrichment analysis on protein sets:\n```python\nfrom pipeline import enrich_network\n\n# Where protein_symbols is a list of protein identifiers (ids or protein symbols),\n# background_protein_symbols is a list of protein identifiers from the background set (ids or protein symbols),\n# id2name is a dictionary (can be empty) mapping ids to symbols, and the last argument is the base file name for \n# enrichment reports.\nenrich_network(protein_symbols, background_protein_symbols, id2name, \"enrichment_results\")\n```\n\n*The following features assume you generated a network object from your ERCs*\n\nGenerate Reciprocal-Rank 20 Networks (RRNs)\n```python\nfrom utilities import rrn_nets\n\n# Each step represents an intermediate RRN result\nstep1, step2, step3 = rrn_nets(net, \"central_protein\")  # You can pass multiple central proteins\n```\n\nExport a network diagram (requires pygraphviz):\n```python\nfrom utilities import graphviz_network_plot\n\n# You can pass the following optional arguments:\n# * highlight: A dictionary mapping proteins -\u003e color code for nodes\n# * circo: If True, use the circo graphviz layout instead of the default\n# * highlight_by_font: If True, highlighted nodes will have the text highlighted instead of the background\ngraphviz_network_plot(my_network, \"my_network.png\")\n```\n\n## Citations\nPlease cite this specific implementation of ERCs with the following reference:\n```\nVarela, Austin A., Sammy Cheng, and John H. Werren. 2021. 
\"Novel ACE2 protein interactions relevant to COVID-19 predicted by evolutionary rate correlations.\" PeerJ 2021 9:e12159 https://doi.org/10.7717/peerj.12159\n```\n\nPlease cite the general ERC method with the following reference:\n```\nYan, Zhichao, Gongyin Ye, and John H. Werren. \"Evolutionary rate correlation between mitochondrial-encoded and mitochondria-associated nuclear-encoded proteins in insects.\" Molecular biology and evolution 36.5 (2019): 1022-1036.\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faustinv11%2Ferc-pipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faustinv11%2Ferc-pipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faustinv11%2Ferc-pipeline/lists"}