{"id":33051379,"url":"https://github.com/conchoecia/odp","last_synced_at":"2026-02-17T06:06:29.013Z","repository":{"id":147603678,"uuid":"329185461","full_name":"conchoecia/odp","owner":"conchoecia","description":"oxford dot plots","archived":false,"fork":false,"pushed_at":"2026-01-19T23:46:54.000Z","size":47493,"stargazers_count":162,"open_issues_count":36,"forks_count":11,"subscribers_count":6,"default_branch":"main","last_synced_at":"2026-01-21T19:10:13.975Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/conchoecia.png","metadata":{"files":{"readme":"docs/README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2021-01-13T03:42:41.000Z","updated_at":"2026-01-18T07:40:42.000Z","dependencies_parsed_at":"2024-04-16T01:44:45.558Z","dependency_job_id":"f4c2cd85-7131-4d61-a51f-ac0a95de750a","html_url":"https://github.com/conchoecia/odp","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/conchoecia/odp","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/conchoecia%2Fodp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/conchoecia%2Fodp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/conchoecia%2Fodp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/conchoecia%2Fodp/manifests","owner_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/owners/conchoecia","download_url":"https://codeload.github.com/conchoecia/odp/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/conchoecia%2Fodp/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29535934,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-17T05:00:25.817Z","status":"ssl_error","status_checked_at":"2026-02-17T04:57:16.126Z","response_time":100,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-11-14T03:00:27.111Z","updated_at":"2026-02-17T06:06:28.948Z","avatar_url":"https://github.com/conchoecia.png","language":"Python","funding_links":[],"categories":["Comparative"],"sub_categories":[],"readme":"\u003ca href=\"https://www.nature.com/articles/s41586-023-05936-6\"\u003e\u003cimg src=\"manuscript_DOI.svg\" alt=\"Manuscript\"  height=\"20\"\u003e\u003c/a\u003e\n[![DOI](https://zenodo.org/badge/329185461.svg)](https://zenodo.org/badge/latestdoi/329185461)\n\n\u003cpicture\u003e\n  \u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"odp_logo_black_400px.png\"\u003e\n  \u003cimg      alt=\"logo for ODP\"                    src=\"odp_logo_white_400px.jpg\"\u003e\n\u003c/picture\u003e\n\n# odp - Oxford dot plots\n\n## \u003ca name=\"started\"\u003e\u003c/a\u003eGetting 
Started\n```sh\n#install\ngit clone https://github.com/conchoecia/odp.git\n# NOTE: The make step will automatically use all of the cores on\n#       your current machine. If using a slurm cluster be sure to\n#       request all of the threads on that node. If you need to use\n#       fewer cores, run `make -f Makefile_1core` instead.\ncd odp \u0026\u0026 make\n# make a config.yaml file for your odp analysis\ncp odp/example_configs/CONFIG_odp.yaml ./config.yaml\n# modify the config file to include your own data\nvim config.yaml\n# run the pipeline\nsnakemake -r -p --snakefile odp/scripts/odp\n# currently there is no man page, see https://github.com/conchoecia/odp/ for instructions\n```\n\n## \u003ca name=\"quickstart\"\u003e\u003c/a\u003eQuick Start\n\n-----\n\n### Oxford Dot Plots or ALG-genome comparisons\n\n\u003ca href=\"https://github.com/conchoecia/odp#make-macrosynteny-plots-between-two-or-more-genomes\"\u003e\n  \u003cpicture\u003e\n    \u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"dotplot_black_small.png\"\u003e\n    \u003cimg alt=\"Example Oxford Dot Plot report\"       src=\"dotplot_white_small_300KB.jpg\"\u003e\n  \u003c/picture\u003e\n\u003c/a\u003e\n\n[CLICK HERE or on THE FIGURE^](#macrosynuse) if your goal is to make an Oxford dot plot report between\ntwo or more combinations of genomes, OR you want to compare your\ngenomes to:\n\n- the chordate linkage groups (CLGs) from [Simakov et al 2020](https://www.nature.com/articles/s41559-020-1156-z)\n- the BCnS ancestral linkage groups from [Simakov et al 2022](https://www.science.org/doi/10.1126/sciadv.abi5884)\n- or the pre-metazoan linkage groups from [Schultz et al 2023](https://www.nature.com/articles/s41586-023-05936-6)\n\n-----\n\n### Ribbon Diagrams\n\n\u003ca href=\"https://github.com/conchoecia/odp#make-ribbon-diagrams-of-conserved-linkages-between-genomes\"\u003e\n  \u003cpicture\u003e\n    \u003csource media=\"(prefers-color-scheme: dark)\" 
srcset=\"ribbon_diagram_black_1000px.png\"\u003e\n    \u003cimg alt=\"Example ribbon diagram of animals with the BCnS ALGs\" src=\"ribbon_diagram_white_1000px.jpeg\"\u003e\n  \u003c/picture\u003e\n\u003c/a\u003e\n\n[CLICK HERE or on THE FIGURE^](#ribbondiagram) if your goal is to make ribbon\ndiagrams of conserved linkages between genomes.\n\n-----\n\n## Table of Contents\n\n- [Getting Started](#started)\n- [Quick Start](#quickstart)\n- [Users' Guide](#uguide)\n  - [Installation](#install)\n    - [Python Requirements](#python)\n    - [Other Requirements](#otherreq)\n  - [General usage](#general)\n    - [Input File Requirements](#inputspec)\n    - [`.chrom` file specifications](#chromspec)\n    - [Help generating `.chrom` files](#chromhelp)\n  - [Use cases](#cases)\n    - [If you want to analyze chordate linkage groups](#clgsection)\n    - [Make macrosynteny plots between two or more genomes](#macrosynuse)\n    - [Make ribbon diagrams of conserved linkages between genomes](#ribbondiagram)\n    - [Find and characterize ancestral linkage groups](#alganalysis)\n      - [ALGs part 1 - Ortholog finding in 3+ species](#nwayreciprocalbest)\n      - [ALGs part 2 - Find significantly numerous groups of orthologs](#groupby)\n      - [ALGs part 3 - Filter groups of orthologs](#gbfilter)\n      - [ALGs part 4 - Annotate groups of orthologs](#gbannotate)\n      - [ALGs part 5 - Merge groupby files](#gbmerge)\n      - [ALGs part 6 - Find orthologs in more species](#gbtohmm)\n      - [ALGs part 7 - Plot mixing of select linkage groups](#plotmixing)\n    - [Determine which clade is sister](#4speciesphylogeny)\n- [Citing odp](#cite)\n\n## \u003ca name=\"uguide\"\u003e\u003c/a\u003eUsers' Guide\n\nOdp is a protein-based synteny analysis software suite that is useful for\ncomparing the evolution of chromosomes between two or more species. 
Use cases\ninclude (1) plotting synteny relationships between two genome assemblies, (2)\ninferring evolutionary relationships using chromosome synteny information, and\n(3) determining groups of ancestrally linked genes given a set of species'\nchromosome-scale genomes.\n\nThis software was visually modelled on the dotplots found in [Simakov, Oleg, et al. \"Deeply conserved synteny resolves early events in vertebrate evolution.\" Nature ecology \u0026 evolution 4.6 (2020):820-830.](https://www.nature.com/articles/s41559-020-1156-z), and was further\nexpanded to determine the phylogenetic tree topology of animals in\n[Schultz, D.T., et al. (2023)](https://www.nature.com/articles/s41586-023-05936-6).\n\nThis software fills a niche in that it automates comparisons of chromosome-scale\ngenomes, an increasingly important task as the genomes of more non-model\norganisms become available.\n\nFor the aims above, and for comparisons with *two* species, this software works by:\n1. Identifying orthologs.\n  - For comparisons between two species, this program finds reciprocal-best protein matches using diamond blastp. The pipeline performs comparisons between all *n* species in the config file. Compute time scales quadratically with the number of species *O(n*\u003csup\u003e2\u003c/sup\u003e*)*.\n2. Identifying homologous scaffolds between species.\n  - The measure D of each pairwise comparison is calculated to determine synteny block cutoffs in the cases of complex rearrangements. See section [6.4 Identification of 17 Chordate Linkage Groups (page 11)](https://static-content.springer.com/esm/art%3A10.1038%2Fs41559-020-1156-z/MediaObjects/41559_2020_1156_MOESM1_ESM.pdf) of the Supplementary Information PDF from [Simakov et al. (2020).](https://www.nature.com/articles/s41559-020-1156-z).\n  -  The homology of scaffolds between two species is inferred by calculating the pairwise p-value using Fisher's exact test. 
See section [6.5 Significance testing of blocks of conserved synteny (page 12)](https://static-content.springer.com/esm/art%3A10.1038%2Fs41559-020-1156-z/MediaObjects/41559_2020_1156_MOESM1_ESM.pdf) of the Supplementary Information PDF from [Simakov et al. (2020).](https://www.nature.com/articles/s41559-020-1156-z).\n3. Identifying orthologs in the context of known ancestral linkage groups.\n  - Finding which orthologs correspond to previously identified ancestral linkage groups (ALGs), such as the Bilateria-Cnidaria-Sponge ALGs from [Simakov et al 2022](https://www.science.org/doi/10.1126/sciadv.abi5884).\n4. Plotting the genome assembly, reciprocal best protein hits, the ALGs, and D for both species using matplotlib.\n\nFor comparisons between *three or more* species, the software can:\n1. Identify ALGs of a given number of species. Warning - each additional species added increases the stringency of the ALG identification.\n2. Test phylogenetic hypotheses with species quartets.\n\n## \u003ca name=\"install\"\u003e\u003c/a\u003eInstallation\n\nOdp and its dependencies are developed for a unix environment (linux, Mac OS X)\nrunning bash as the shell. You can download the software with this command:\n\n```\ngit clone https://github.com/conchoecia/odp.git\ncd odp \u0026\u0026 make\n```\n\n### \u003ca name=\"python\"\u003e\u003c/a\u003ePython Requirements\n\nYour active python environment must be python 3. This software is implemented in\n[`snakemake`](https://snakemake.readthedocs.io/en/stable/). 
Specific python\npackages within the pipeline that must be available in your python installation\nare:\n\n```\nsnakemake\nmatplotlib\nnetworkx\nscipy\npandas\nnumpy\nseaborn\n```\n\nIf you have `conda` and are not sure whether you have the required packages,\nI recommend `conda install snakemake matplotlib pandas numpy seaborn`.\n\n### \u003ca name=\"otherreq\"\u003e\u003c/a\u003eOther Requirements\n\nDirect calls to these programs must also be available in your environment.\nFuture versions of `odp` may bundle these software packages directly to avoid\nthese requirements.\n\n- [diamond](https://github.com/bbuchfink/diamond)\n- [blastp](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web\u0026PAGE_TYPE=BlastDocs\u0026DOC_TYPE=Download)\n- awk\n\n## \u003ca name=\"general\"\u003e\u003c/a\u003eGeneral Usage\n\nOdp requires, at minimum, the genome assembly sequence file, a sequence file of proteins found in the genome, and a file specifying the protein coordinates in the genome. The paths to these files for each sample are specified in a `.yaml` configuration file.\n\nA minimal working example of a config file that is set up to compare the genomes of _C. elegans_ and _H. 
sapiens_ looks like this:\n\n```yaml\n# this file is called config.yaml\nignore_autobreaks: True       # Skip steps to find breaks in synteny blocks\ndiamond_or_blastp: \"diamond\"  # \"diamond\" or \"blastp\"\nplot_LGs: True                # Plot the ALGs based on the installed databases\nplot_sp_sp: True              # Plot the synteny between two species, if False just generates .rbh files\n\nspecies:\n  Celegans:\n    proteins: /path/to/proteins_in_Cel_genome.fasta\n    chrom: /path/to/Cel_genome_annotation.chrom\n    genome: /path/to/Cel_genome_assembly.fasta\n  Homosapiens:\n    proteins: /path/to/Human_prots.fasta\n    chrom: /path/to/Human_annotation.chrom\n    genome: /path/to/Human_genome_assembly.fasta\n```\n\nYou can perform a comparison between these two genomes with:\n\n```\nsnakemake --snakefile odp/scripts/odp\n```\n\n### \u003ca name=\"inputspec\"\u003e\u003c/a\u003eInput file requirements\n\nThe file formats that are needed for the three files per genome are:\n1. A genome assembly in [`.fasta` format](https://en.wikipedia.org/wiki/FASTA_format).\n2. Protein sequences in [`.fasta` format](https://en.wikipedia.org/wiki/FASTA_format).\n3. A file which details where the proteins are located in the genome, in [`.chrom` format](#chromspec). Using a `.gff` or `.gtf` file is currently not supported, but support is planned.\n\n### \u003ca name=\"chromspec\"\u003e\u003c/a\u003e`.chrom` file specifications\n\nThe `.chrom` file format has 5 tab-delimited fields. Each line details the location of a protein on a scaffold. The requirements for each field are:\n1. `protein_header` - the string here must match the header of a protein in the protein fasta.\n2. `scaffold_header` - the string here must match the header of a sequence in the genome assembly fasta.\n3. `strand` - must be `+` or `-`.\n4. `start` - the numerically smallest position, in basepair coordinates, of the CDS of the protein. 
Like the start coords in a `bed` or `GFF` file, not necessarily the position of the start codon. Can often be found in a GFF3 or GTF.\n5. `stop` - same as #4, but the stop position.\n\nFor example, the following `.chrom` file details four proteins that exist on two scaffolds. Two of the proteins are on the negative strand. The first protein, `BFGX8636T1`, has its start codon from the first position of scaffold 1, and the last codon ends at base 1246.\n\n```\nBFGX8636T1      sca1    +       1       1246\nBFGX0001T1      sca1    -       2059    2719\nBFGX0002T1      sca2    +       6491    12359\nBFGX0003T1      sca2    -       12899   18848\n```\n\n### \u003ca name=\"chromhelp\"\u003e\u003c/a\u003eHelp generating `.chrom` files\n\nA `.chrom` file can usually easily be generated from a genome annotation, such as a [`GFF3`](https://m.ensembl.org/info/website/upload/gff3.html) or [`GTF/GFF2`](https://uswest.ensembl.org/info/website/upload/gff.html) file. If you are working with NCBI GFFs, CDS entries have a predictable format that enables us to compile all of the information required for a chrom file: `NC_000001.11\tBestRefSeq\tCDS\t65565\t65573\t.\t+\t0\tID=cds-NP_001005484.2;Parent=rna-NM_001005484.2;Dbxref=CCDS:CCDS30547.1,Ensembl:ENSP00000493376.2,GeneID:79501,Genbank:NP_001005484.2,HGNC:HGNC:14825;Name=NP_001005484.2;gbkey=CDS;gene=OR4F5;product=olfactory receptor 4F5;protein_id=NP_001005484.2;tag=MANE Select`. There are many CDS lines per gene, so a special parsing program is required.\n\nThe program bundled with `odp`, [scripts/NCBIgff2chrom.py](scripts/NCBIgff2chrom.py), parses gzipped/uncompressed NCBI GFFs and gets the full protein span in the genome. Running [scripts/NCBIgff2chrom.py](scripts/NCBIgff2chrom.py) on the [human GFF from NCBI](https://www.ncbi.nlm.nih.gov/genome/?term=human) with the command `python NCBIgff2chrom.py GCF_000001405.39_GRCh38.p13_genomic.gff.gz` results in a legal `.chrom` file with all of the proteins from the annotation. 
This file can be easily filtered later on.\n\n```\nNP_001005484.2  NC_000001.11    +       65565   69037\nXP_024307731.1  NC_000001.11    -       358067  399041\nXP_024307730.1  NC_000001.11    -       358153  399041\nNP_001005221.2  NC_000001.11    -       450740  450740\nXP_011540840.1  NC_000001.11    -       586839  611112\n```\n\n----\n\n## \u003ca name=\"cases\"\u003e\u003c/a\u003eUse cases\n\n### \u003ca name=\"clgsection\"\u003e\u003c/a\u003e Use Case - If you want to analyze chordate linkage groups\n\nThe preinstalled ALGs are the Bilaterian-Cnidarian-Sponge Linkage Groups (BCnS\nLGs) that are discussed in [Simakov et al.(2022)](https://www.science.org/doi/full/10.1126/sciadv.abi5884). If you want to\nanalyze your genomes in the context of the Chordate Linkage Groups (CLGs), then\nplease compile them first by changing directories to where you installed the software, then running this command.\n\n```sh\ncd odp \u0026\u0026 make CLGs_v1.0\n```\n\nBe warned that this will take a long time as there are 25 thousand gene groups\nfor which HMMs must be built. The final directory will occupy 6.2Gb on disk. By\ndefault this command will use all of the threads available on the machine you\nare using: `make CLGs_v1.0`. 
To use only one core, run `make -f Makefile_1core\nCLGs_v1.0`.\n\n----\n\n### \u003ca name=\"macrosynuse\"\u003e\u003c/a\u003eUse Case - Make macrosynteny plots between two or more genomes\n\n\u003ca href=\"https://github.com/conchoecia/odp#make-macrosynteny-plots-between-two-or-more-genomes\"\u003e\n  \u003cpicture\u003e\n    \u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"dotplot_black_small.png\"\u003e\n    \u003cimg alt=\"Example Oxford Dot Plot report\"       src=\"dotplot_white_small_300KB.jpg\"\u003e\n  \u003c/picture\u003e\n\u003c/a\u003e\n\nProgram: `odp/scripts/odp`\nInput: `config.yaml` file with the species you wish to compare.\nOutput:\n  - Tables of reciprocal best blastp hits between species.\n  - PDF figures of Oxford dot plots comparing genome macrosynteny.\n  - PDF figures of plots showing the degree of homology between chromosomes.\n  - Oxford dot plots colored by predefined gene IDs, or by orthologs in other species\n\n`config.yaml` format for running `odp/scripts/odp`:\n\n```yaml\n# this file is called config.yaml\nignore_autobreaks: True       # Skip steps to find breaks in synteny blocks\ndiamond_or_blastp: \"diamond\"  # \"diamond\" or \"blastp\"\nplot_LGs: True                # Plot the ALGs based on the installed databases\nplot_sp_sp: True              # Plot the synteny between two species, if False just generates .rbh files\n\nspecies:\n  Celegans:\n    proteins: /path/to/proteins_in_Cel_genome.fasta # required field\n    chrom:    /path/to/Cel_annot.chrom              # required field\n    genome:   /path/to/Cel_genome_assembly.fasta    # required field\n\n    genus: \"Caenorhabditis\" # This is an optional field\n    species: \"elegans\" # This is an optional field\n\n    minscafsize: 1000000 # optional field. 
Sets minimum scaffold size to plot.\n\n    manual_breaks:    # optional field, tells the software to treat breaks\n      - \"I:50000\"     #  as separate units for calculating the homology p-values\n      - \"IV:9000000\"  #  with Fisher's exact test. Useful for plotting centromeres.\n      - \"II:99009\"    #  Here, we tell the software that Cel chroms I, IV, II have breaks.\n\n    plotorder:    # This optional field tells the software to only plot the scaffolds\n      - \"I\"       #  listed here, and to do it in this order. This is useful for plotting\n      - \"II\"      #  comparisons between two species where you want a specific order for\n      - \"III\"     #  both species.\n\n  Homosapiens:\n    proteins: /path/to/Human_prots.fasta\n    chrom:    /path/to/Human_annotation.chrom\n    genome:   /path/to/Human_genome_assembly.fasta\n```\n\nRun the pipeline with the command `snakemake -r -p --snakefile odp/scripts/odp`. The output files will be located in the folder `synteny_analysis/`. This folder contains the following subfolders:\n  - `db`\n    - blastp and diamond databases for searches.\n  - `step0-blastp_results`\n    - blastp/diamond searches between all the genomes. 
Also has files with the reciprocal best hits.\n  - `step0-chromsize`\n    - chromosome information used later for plotting.\n  - `step1-rbh`\n    - reciprocal best hits files for the analyses.\n  - `step2-figures`\n    - `ALG-species_plots`\n      - If there are ALGs installed in the `LG_db`, then the ALG-species plots will appear here.\n    - `synteny_coloredby_*`\n      - If there are ALGs installed in the `LG_db`, then the species-species synteny plots will appear here.\n    - `synteny_nocolor`\n      - Two-species synteny plots appear here regardless of what is in `LG_db`.\n\n-----\n\n### \u003ca name=\"ribbondiagram\"\u003e\u003c/a\u003eUse Case - Make ribbon diagrams of conserved linkages between genomes\n\n\u003ca href=\"https://github.com/conchoecia/odp#make-ribbon-diagrams-of-conserved-linkages-between-genomes\"\u003e\n  \u003cpicture\u003e\n    \u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"ribbon_diagram_black_1000px.png\"\u003e\n    \u003cimg alt=\"Example ribbon diagram of animals with the BCnS ALGs\" src=\"ribbon_diagram_white_1000px.jpeg\"\u003e\n  \u003c/picture\u003e\n\u003c/a\u003e\n\nCurrently `odp` has a script that plots a ribbon diagram of conserved linkages\nbetween two species pairs. To make a ribbon plot, you first must follow the\ninstructions above to run `odp` to [make macrosynteny plots between two or more genomes](#macrosynuse). 
\nThis will generate the `.rbh` files necessary to make the plots.\n\n```sh\n# make a new directory to store the ribbon plot\nmkdir new_ribbon_plot \u0026\u0026 cd new_ribbon_plot\n# copy the example ribbon plot config file to this directory\ncp ${YOUR_ODP_INSTALL_PATH}/odp/example_configs/CONFIG_rbh_to_ribbon.yaml ./config.yaml\n# edit the config file using the instructions you find there\nvim config.yaml\n# run the script to make your figure\nsnakemake --snakefile ${YOUR_ODP_INSTALL_PATH}/odp/scripts/odp_rbh_to_ribbon\n# your file will be saved as output.pdf\n```\n\nA minimal working example of the config file is below. Again, more details are in\n[`CONFIG_rbh_to_ribbon.yaml`](../example_configs/CONFIG_rbh_to_ribbon.yaml):\n\n```yaml\n# There are several options for how to sort the chromosomes.\n# More information is available in the config file.\nchr_sort_order: optimal-chr-or\n\n# Tells the program whether to plot the non-significant interactions.\nplot_all: True\n\n# Specifies which species will be plotted from the top-to-bottom.\nspecies_order:\n  - BFI\n  - HCA\n  - BFL\n  - EMU\n  - CLA\n\nrbh_directory: \u003cYOUR_PATH_TO_ODP_RESULTS\u003e/step2-figures/synteny_coloredby_BCnS_LGs/\n\n# Only two species are shown here for brevity,\n#  but please include the species information for all the species you wish to plot.\nspecies:\n  BFI:\n    proteins: \"/path/to/BFI.pep\"\n    chrom:    \"/path/to/BFI.chrom\"\n    genome:   \"/path/to/BFI.fasta\"\n\n  HCA:\n    proteins:  \"/path/to/HCA.pep\"\n    chrom:     \"/path/to/HCA.chrom\"\n    genome:    \"/path/to/HCA.fasta\"\n```\n\nRIBBON DIAGRAM CAVEATS:\n- This script plots pairwise reciprocal-best blastp hits between species pairs,\n  even if this ortholog does not include a gene in the next species to be plotted.\n  For example, in the plot above there are fewer conserved orthologs between the\n  ctenophore _Hormiphora_ and the cnidarian _Rhopilema_ than between the two\n  ctenophores _Hormiphora_ and 
_Bolinopsis_.\n- Given the above point, it is not currently possible to plot only orthologs\n  that have a sequence in every species in the figure, such as seen here in\n  [Figure 1 of Simakov et al 2022](https://www.science.org/doi/10.1126/sciadv.abi5884#F1).\n  Please open a feature request if you would like this feature.\n\n-----\n\n### \u003ca name=\"alganalysis\"\u003e\u003c/a\u003eUse Case - Find and characterize ancestral linkage groups\n\nFinding ancestral linkage groups of proteins for a group of species is a useful\nway to characterize what the genome at the ancestral node of that clade may have\nlooked like, and to analyze how the genomes have evolved since that node.  See\n[Simakov et al. (2022)](https://www.science.org/doi/full/10.1126/sciadv.abi5884)\nfor an example of how this concept was used to determine the ancestral number of\nchromosomal linkage groups in the common ancestor of sponges, cnidarians, and\nbilaterians.\n\nThe current implementation of this pipeline uses multiple steps to perform these\nanalyses and determine the ALGs. For future versions of odp, we plan to\nconsolidate this analysis into a single step.\n\n#### \u003ca name=\"nwayreciprocalbest\"\u003e\u003c/a\u003eALGs part 1 - Ortholog finding in 3+ species\n\nFor this analysis, `blastp` or `diamond` searches are performed among the _n_\nspecies that you specify. Orthologs are kept only when proteins in the _n_\nspecies are reciprocal best hits of each other. These are found by loading the\n`blast` results into a graph structure and finding\n[bidirectional complete graphs](https://en.wikipedia.org/wiki/Complete_graph) of `blastp`\nhits. This process is highly conservative; therefore, as the number of genomes _n_\nincreases, the number of highly conserved orthologs decreases.\n\nProgram: `odp/scripts/odp_nway_rbh`\nInput: `config.yaml`, same as for `odp/scripts/odp`, but with some modifications.\nOutput:\n  - a file that contains the reciprocal best hits for the species included. 
This is called a `.rbh` file.\n\n`config.yaml` format for running `odp/scripts/odp_nway_rbh`:\n\n```yaml\n# this file is called config.yaml\n\n# the number of species you want to be included in each analysis\nnways: 3\n# How you want to identify the orthologs [diamond|blastp]\nsearch_method: diamond\n# What analyses you want to produce. Saves on some compute.\n#  Must match headers of `xaxisspecies`. Order doesn't matter.\nanalyses:\n  - [\"Celegans\", \"Homosapiens\", \"Dmel\"]\n  - [\"Celegans\", \"Homosapiens\", \"Mmus\"]\n  - [\"Celegans\", \"Dmel\", \"Mmus\"]\n  # - [\"Homosapiens\", \"Dmel\", \"Mmus\"]   # You can comment out lines if you would like\n\nspecies:\n  Celegans:\n    proteins: /path/to/proteins_in_Cel_genome.fasta\n    chrom: /path/to/Cel_genome_annotation.chrom\n    genome: /path/to/Cel_genome_assembly.fasta\n  Homosapiens:\n    proteins: /path/to/Human_prots.fasta\n    chrom: /path/to/Human_annotation.chrom\n    genome: /path/to/Human_genome_assembly.fasta\n  Dmel:\n    proteins: /path/to/drosophila_prots.fasta\n    chrom: /path/to/drosophila_annotation.chrom\n    genome: /path/to/drosophila_genome_assembly.fasta\n  Mmus:\n    proteins: /path/to/mouse_prots.fasta\n    chrom: /path/to/mouse_annotation.chrom\n    genome: /path/to/mouse_genome_assembly.fasta\n```\n\nThe results of these analyses are found in `odp_nway_rbh/rbh/`. The reciprocal best hits file contains a variable number of columns, but always contains the columns:\n\n* A unique identifier for each ortholog\n* The protein ID for each species in that ortholog\n* The scaffold ID on which that protein resides in that species\n* The scaffold coordinates at which that protein resides in that species\n\nCurrently, the naming convention for these files is `[Sp.1]_[Sp.2]...[Sp.N]_reciprocal_best_hits.rbh`. 
This format is used in downstream steps to parse the headers found in the file.\n\n#### \u003ca name=\"groupby\"\u003e\u003c/a\u003eALGs part 2 - Find significantly numerous groups of orthologs\n\nThe output of the previous program, the `.rbh` file, has one ortholog per line. In this step, we will group the orthologs together based on whether they exist on the same set of scaffolds in each species. For example, all of the orthologs that exist on:\n  * _C. elegans_ chromosome I\n  * _D. melanogaster_ chromosome 3\n  * and human *chromosome 17*\n\nwill be one group. All of the orthologs that exist on a slightly different set of chromosomes will be another group, for example:\n  * _C. elegans_ chromosome I\n  * _D. melanogaster_ chromosome 3\n  * and *human chromosome 16*\n\nThe groups are saved in a tab-delimited file called a `.groupby` file. Each line\nis one group, and gene ids, scaffolds on which they reside, and genome\ncoordinates are saved in python-type lists in single columns.\n\nThe number of groups found in this analysis, and the number of genes found in\neach group, depend on the degree of shared macrosynteny between the species used\nin the analysis. Distantly related species, or species with fast-evolving\ngenomes, will have many groups, each with few genes. Closely related species or\nspecies with slowly-evolving genomes will have fewer groups, with more genes per\ngroup. Regardless of the relationships between the species, there will be a log\ndecay of group sizes given the phenomenon of single genes translocating to other\nchromosomes.\n\nFor each group of orthologs *G*, we can estimate the false discovery rate *α* of\nfinding a group with *i* or fewer genes given the real genomes in these\ncomparisons. 
We estimate this false discovery rate by producing randomized\nversions of the genomes by shuffling the gene IDs in the `.chrom` file,\nmeasuring whether a group of *i* or fewer genes was present, then repeating this\nmeasurement hundreds of millions of times.\n\n#### \u003ca name=\"gbfilter\"\u003e\u003c/a\u003eALGs part 3 - Filter groups of orthologs\n\nThe groups of reciprocal best hits, as well as the newly-calculated false\ndiscovery rates, are saved in the resulting `.groupby` file. This can be\nmanually or programmatically filtered to only keep groups with certain\nproperties, or groups with a significantly low false discovery rate.\n\nThis is performed automatically by `odp_nway_rbh`, but can also be performed with the\nscript `odp_groupby_filter` by specifying the `.groupby` file and\nthe acceptable false discovery rate cutoff.\n\nThis process can also be performed in a spreadsheet editor, such as\nGoogle Sheets, Apple Numbers, or Microsoft Excel.\n\nAfter removing the rows that have a less-than-significant false discovery rate,\ncontinue on to the next step to annotate the groups of orthologs.\n\n#### \u003ca name=\"gbannotate\"\u003e\u003c/a\u003eALGs part 4 - Annotate groups of orthologs\n\nAt this stage the resulting rows are groups of orthologous genes that are\npresent on the same set of chromosomes in the species under consideration, and\nhave been since the common ancestor of these species. In other words, these are\nancestral linkage groups (ALGs) for this clade.\n\nIt is useful at this stage to assign names to each of the rows in the `group` column of the `.groupby` file. There is some precedent for these naming conventions, see [Simakov et al. (2020)](https://www.nature.com/articles/s41559-020-1156-z) and [Simakov et al. (2022)](https://www.science.org/doi/full/10.1126/sciadv.abi5884). 
So, if your analysis includes animal genomes then it may be helpful to include some of the species from these publications.\n\nIt is not necessary that each row has its own unique group ID. However, doing so will help plot mixing in downstream analyses.\n\n#### \u003ca name=\"gbmerge\"\u003e\u003c/a\u003eALGs part 5 - Merge `.groupby`/`.rbh` files\n\nIn this section let’s consider a few species to compare:\n  - Unicellular (+colonial multicellular) outgroups of animals:\n    - the ichthyosporean _Creolimax fragrantissima_ (CFR)\n    - the filasterean amoeba _Capsaspora owczarzaki_ (COW)\n    - the choanoflagellate _Monosiga brevicollis_ (MBR)\n  - The animals:\n    - the ctenophore _Hormiphora californensis_ (HCA)\n    - the sponge _Ephydatia muelleri_ (EMU)\n    - the jellyfish _Rhopilema esculentum_ (RES)\n\nIt is desirable to merge the `.groupby` files from the searches of multiple species if the evolutionary distance between the outgroup and the other species is extreme. For example, there is very little synteny between animals and their unicellular holozoan relatives, and merging multiple searches enables the discovery of more ancestrally linked genes.\n\nWe can perform the ALG-finding steps 1-4 described above for the following three analyses:\n  - `CFR-HCA-EMU-RES`\n  - `COW-HCA-EMU-RES`\n  - `MBR-HCA-EMU-RES`\n\nEach row in the `.groupby` files for these analyses will contain one gene per species in the analysis. 
It is possible that many of the orthologs will also have proteins in two or more unicellular outgroups, so we now run `odp_groupby_to_rbh` to unwrap each `.groupby` file to a `.rbh` file, then `odp_rbh_merge` to join the `.rbh` files on the species `HCA`, `EMU`, and `RES`.\n\nEach ortholog (row) in the resulting `.rbh` file will have a gene for each animal species (`HCA`, `EMU`, `RES`), and will contain a gene from between one and three of the unicellular species (`CFR`, `COW`, `MBR`).\n\nThe notation we use to refer to an `.rbh` file created by merging other `.rbh` files uses parentheses to note the species that may have missing data, and unmodified text to note the species that will always have a gene for each ortholog. The analysis discussed above is notated as `(CFR-COW-MBR)-HCA-EMU-RES`.\n\n#### \u003ca name=\"gbtohmm\"\u003e\u003c/a\u003eALGs part 6 - Find orthologs in more species\n\nSteps 1-4 of finding ALGs rely on using only a few species (perhaps 3-5) to avoid loss of orthologs due to the stringent ortholog selection process. [Step 5 - Merge `.groupby`/`.rbh` files, discussed above,](#gbmerge) enables the inclusion of more genes by allowing for missing data in select groups. Then, by constructing hidden Markov models (HMMs) of the orthologs, we can search for orthologs in more species.\n\nThe script `odp_rbh_to_hmm` reads in a `.rbh` file and constructs one HMM per ortholog (row). The models are then searched against the proteins of every additional species that is included in the `config.yaml` file. The best protein for each HMM is selected, and only proteins with a significant match are kept. 
Missing data are permissible in this step, so it is not guaranteed that every ortholog will have an identifiable protein in every species added in this step.\n\nThe output of this pipeline is another `.rbh` file, now with the proteins of the additional species identified with the HMM.\n\n#### \u003ca name=\"plotmixing\"\u003e\u003c/a\u003eALGs part 7 - Plot mixing of select linkage groups\n\nIf you have followed the above steps, you now have a `.rbh` file with orthologs that have been annotated by group, and that includes many additional species thanks to the merging and HMM search steps.\n\nIf you are using `odp` to look for phylogenetically diagnostic fusion-then-mixing events, then it is useful to plot linkage groups to visualize the extent of mixing of those groups. The script `odp_rbh_plot_mixing` does that. The outputs of this script are PNG and PDF files of the orthologs in two groups plotted in the chromosome coordinates for each species. This script also estimates the degree of intermixing of two groups of genes on the chromosomes on which they coexist.\n\n### \u003ca name=\"4speciesphylogeny\"\u003e\u003c/a\u003eDetermine which clade is sister\n\nThe module `odp_genome_rearrangement_simulation` was developed to help answer the question of whether ctenophores or sponges are the sister clade of all other animals. This script requires one species that is the known outgroup, and one species nested in the phylogeny with a known position relative to all other species. In our study, we performed analyses in which the filasterean amoeba _Capsaspora owczarzaki_ or the choanoflagellate _Salpingoeca rosetta_ was the outgroup species. The species with the known phylogenetic position was the fire jellyfish _Rhopilema esculentum_. 
The program uses these genomes to polarize the relationships between the two genomes in an unresolved polytomy, in this case the ctenophore _Hormiphora californensis_ and the sponge _Ephydatia muelleri_.\n\nThe program `odp_genome_rearrangement_simulation` does the following:\n1. Finds linkage groups that simultaneously satisfy all of these requirements:\n  - The linkage groups are on separate chromosomes in the outgroup species\n  - The linkage groups are on separate chromosomes in one of the unplaced species\n  - The linkage groups are fused and mixed on single chromosomes in the other unplaced species\n  - The linkage groups are fused and mixed on single chromosomes in the species with a known phylogenetic position\n2. Quantifies the number of fusion events identified in step 1, and calculates the number of genes per linkage group, and the number of linkage groups participating in those events.\n3. For each species in the analysis, tests whether the phylogenetically informative fusion-then-mixing events seen in the real genomes are due to randomness or to true biological signal. This is done as follows:\n  - One simulation shuffles the protein IDs in the genome of one species. Step 1 above is run on this shuffled genome and the three other observed genomes. In other words, the phylogenetically informative fusion-then-mixing events are identified given the new shuffled genome.\n  - Step 2 above is run on the events found above.\n  - These steps are performed millions of times to estimate how many times we see a genome configuration that has at least as many genes, linkage groups, and fusion events participating in phylogenetically diagnostic fusion-then-mixing events.\n\nThe output of this program is a set of histograms showing the different measured parameters from the simulations (grey bars), plotted with the parameters observed from the real genomes (red vertical bars). 
These plots show whether the sister clade hypotheses seen in the real data can be explained by a highly rearranged state in any of the genomes in the analysis.\n\n\n## \u003ca name=\"cite\"\u003e\u003c/a\u003eCiting odp\n\nIf you use `odp` in your work, please cite the following paper:\n\n[Schultz, D.T., Haddock, S.H.D., Bredeson, J.V., Green, R.E., Simakov, O. \u0026 Rokhsar, D.S. Ancient gene linkages support ctenophores as sister to other animals. Nature (2023). https://doi.org/10.1038/s41586-023-05936-6](https://www.nature.com/articles/s41586-023-05936-6)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fconchoecia%2Fodp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fconchoecia%2Fodp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fconchoecia%2Fodp/lists"}