{"id":23787796,"url":"https://github.com/schultzm/havic","last_synced_at":"2025-06-15T19:35:57.182Z","repository":{"id":201757124,"uuid":"112992426","full_name":"schultzm/havic","owner":"schultzm","description":"Detect Hepatitis A Virus Infection Clusters","archived":false,"fork":false,"pushed_at":"2021-09-13T05:05:42.000Z","size":9435,"stargazers_count":1,"open_issues_count":5,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-01-01T15:17:39.504Z","etag":null,"topics":["cluster-analysis","phylogenomics-pipeline","transmission","viral-genomics"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/schultzm.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2017-12-04T03:56:38.000Z","updated_at":"2024-10-18T22:36:21.000Z","dependencies_parsed_at":"2024-03-04T12:52:05.391Z","dependency_job_id":null,"html_url":"https://github.com/schultzm/havic","commit_stats":null,"previous_names":["schultzm/havic"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/schultzm%2Fhavic","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/schultzm%2Fhavic/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/schultzm%2Fhavic/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/schultzm%2Fhavic/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/schultzm","download_url":"https://codeload.github.com/schultzm/havic/tar.gz/refs/heads/master","host
":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240004724,"owners_count":19732631,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cluster-analysis","phylogenomics-pipeline","transmission","viral-genomics"],"created_at":"2025-01-01T15:17:42.743Z","updated_at":"2025-02-21T11:27:19.879Z","avatar_url":"https://github.com/schultzm.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"---\ntitle: havic user manual\ndate: 21 September 2020\nbibliography: paper.bib\ncsl: harvard-the-university-of-melbourne.csl\n---\n\n# havic\n\n[![Github All Releases](https://img.shields.io/github/downloads/schultzm/havic/total.svg)]()  \n[![CircleCI](https://circleci.com/gh/schultzm/havic.svg?style=svg\u0026circle-token=9d17418bb752aa29e07f95b09af106aef7cc6b02)](https://app.circleci.com/pipelines/github/schultzm/havic)\n\nDetect **H**epatitis **A** **V**irus **I**nfection **C**lusters from virus consensus sequences.  `havic` allows objective, fast and automated detection of infection clusters from clinical virus samples.  \n\n## Overview\n\n`havic` is a bioinformatics pipeline for detecting infection clusters in clinical Hepatitis A Virus (HAV) samples from DNA or cDNA sequence data.  The pipeline is written in `python3` and uses `ruffus` to connect a number of open-source software tools to achieve this task.  The user feeds `havic` some query files via a `yaml` config file, waits for the program to run and then checks the output folder for results.  
`havic` allows fast and objective detection of infection clusters in clinical virus sample sequences.  The figure below is a schematic representation of the pipeline.  \n\n![Pipeline](https://github.com/schultzm/havic/blob/master/havic/data/pipeline_graph.svg?raw=true)\n\nThe user can change the parameters of a `havic` run by editing a config file.  The above pipeline is summarised briefly here.  \n\n- create output directory to receive results files\n- QC query sequences\n  - collect queries into a single set\n  - discard duplicate sequences based on seqIDs\n    - seqIDs are sequence headers up until the first space character\n    - duplicate seqIDs are reported to file\n  - replace 'troublesome' characters in sequence headers (character replacements reported to file)\n- map query sequences to reference sequence\n  - reverse complement as required\n- extract alignment from mapping file\n- optionally, trim sequences to target region\n- perform Maximum Likelihood phylogenetic inference on alignment\n- pick infection clusters based on tree and alignment\n- optionally, visualise:\n  - a phylogenetic tree next to alignment with samples of interest and infection clusters highlighted\n  - a heatmap of genetic distances between samples in alignment with infection clusters highlighted\n\n`havic` has been optimised for analysis of the VP1/P2A amplicon, which is the genomic marker recommended by the Hepatitis A Virus Network ([HAVNet](https://www.rivm.nl/en/havnet)) [protocol](https://github.com/schultzm/havic/blob/master/havic/data/Typing_protocol_HAVNET_VP1P2A_a1a.pdf).  The VP1/P2A region is shown here in the context of the HAV genome:\n\n![Amplicon](https://github.com/schultzm/havic/blob/master/havic/data/VP1P2A.png?raw=true \"The HAV genome with HAVNet amplicon, sourced from RIVM\")\n\nThe bed coordinates of the HAVNet VP1/P2A amplicon are 2915 to 3374.  
\n\n## Installation\n\nInstallation of `havic` requires [Miniconda](https://docs.conda.io/en/latest/miniconda.html) and [git](https://git-scm.com/downloads).  After installing these packages, simply do:\n\n    git clone https://github.com/schultzm/havic.git\n    cd havic\n    . install.sh\n\nThe installation process will take up to 30 minutes with verbose output printed to screen during the install.  If the installation fails, read the screen output to determine the error via traceback.  Submit installation issues to github.  Installation has been tested via continuous integration on CircleCI and tested inside a conda environment.  At installation time, a test suite is run.  The suite analyses pre-packaged HAV amplicon data, HAV whole genome sequence (WGS) data and measles WGS data.  After the install, the user is free to delete the test output folder if desired using `rm -r havic/havic_test_results/`.\n\n## Usage\n\n### Quickstart\n\nAfter installing, activate the conda environment by doing `conda activate havic_env`.  The most basic usage of `havic` is to type `havic` on the command line and hit enter/return.  If the install has worked correctly, the user should see:\n\n    usage: havic [-h]  ...\n\n    optional arguments:\n    -h, --help  show this help message and exit\n\n    Sub-commands help:\n    \n        detect    Detect infection clusters from cDNA or DNA consensus sequences.\n        version   Print version.\n        test      Run havic test using pre-packaged example data.\n\nThe program is accessed via three subcommands, with help via the `-h` suffix.  \n\n`havic detect` is the main sub-command.  Use this for detecting infection clusters from user-specified cDNA or DNA consensus sequences.  \n`havic version` will print the installed version to `stdout`.  \n`havic test` will run `havic detect` on a pre-packaged test dataset.  
If successful, the analyst should see `ok` at the end of each test.\n\n### Example usage\n\nThe results in this example were obtained using the command `havic test`.  Let's walk-through this test analysis of HAV VP1/P2A amplicons, using the same `config.yaml` as `havic test`.  With the user's own config.yaml file, the command would be `havic detect path/to/config.yaml`.  \n\n#### Editing the `yaml` file for parsing by `havic detect`\n\n`havic detect` receives instructions from a `yaml` config file via the command `havic detect path/to/yaml.yaml`.  The `test.yaml` file from `havic/havic/data/hav_amplicon.yaml` is presented below as an example:\n\n    ---\n    FORCE_OVERWRITE_AND_RE_RUN:\n      Yes # Yes for full re-run, No to start from an interrupted run,\n\n    DEFAULT_REFS:\n      Yes # Yes if using havic pre-packaged SUBJECT test data, No otherwise\n\n    DEFAULT_QUERIES:\n      Yes # Yes if using havic pre-packaged QUERY test data, No otherwise\n\n    SUBJECT_FILE: # the \"SUBJECT\" sequence in BLAST terms, i.e., reference genome\n      data/NC_001489.fa # relative or absolute paths to fasta file\n      # if DEFAULT_REFS is Yes, path will be prefixed to use pre-packaged data\n\n    SUBJECT_TARGET_REGION: # the target region of the genome to focus on\n      data/havnet_amplicon.fa # in fasta format, relative or absolute paths okay\n      # if DEFAULT_REFS is Yes, path will be prefixed to use pre-packaged data\n\n    OUTDIR: # the parent directory for the results folders\n      havic_test_results/amplicon # relative or absolute path to parent result folder\n\n    TREE_ROOT:\n      midpoint # sequence name to root iqtree on, or midpoint for midpoint root\n\n    RUN_PREFIX:\n      HAV_amplicon_\n\n    PLOTS:\n      Yes # Yes to make plots (slow for large runs), No otherwise.\n\n    MAPPER_SETTINGS:\n      executable:\n        minimap2 # https://github.com/lh3/minimap2\n      other:\n        -c --cs --secondary=no\n      k_mer: # select an odd number, between 3 and 
27 inclusive\n        -k 5 # 5 has been good for the HAV amplicon seqs, adjust sensibly\n\n    IQTREE2_SETTINGS: # http://www.iqtree.org/doc/iqtree-doc.pdf\n      executable:\n        iqtree # command to call iqtree2\n      other:\n        '-T AUTO -m MFP+FO --ufboot 1000 -pers 0.2 -nstop 500'\n\n    CLUSTER_PICKER_SETTINGS: # https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4228337/\n      executable:\n        ClusterPicker\n      coarse_subtree_support: # divide tree into subtrees at/above this threshold\n        70\n      fine_cluster_support: # branch support minimum value for clusters of tips\n        95\n      distance_fraction: # float please, genetic distance\n        0.01 # (e.g., 1 SNP in 100 bp = 0.01)\n      large_cluster_threshold:\n        15\n      distance_method:\n        valid # options are ambiguity, valid, gap, or abs\n\n    HIGHLIGHT_TIP:\n      - CmvAXJTIqH # Specify tip name to highlight in final plot\n      - CCHkiFhcxG # Specify tip name to highlight in final plot\n      - PAvYXhYkLM # Specify tip name to highlight in final plot\n\n    TRIM_SEQS: # these sequences will be trimmed to length of SUBJECT_AMPLICON\n      - AY644337_55443_seq_1 # these are sequences in the QUERY_FILES\n      - RIVM-HAV171_64913_seq_2_MapsOutsideTrimRegionSoEmpty\n      - nDNLdjtgha#HashInSeqName\n      - '' # give it nothing\n      - xyzyx # give it a non-name\n\n    QUERY_FILES:\n      - data/example1.fa # relative or absolute paths to fasta files\n      - data/example2.fa\n      - xyz # to test a dud file name\n      - '' # to test an empty file name (which would return a folder, not file)\n    ...\n\nBefore starting a run, `cd` to a working directory (preferably not inside the git cloned folder).  Either copy the above `yaml` file, or use `wget https://raw.githubusercontent.com/schultzm/havic/master/havic/data/hav_amplicon.yaml`.  For more information on the `yaml` standard, refer to [https://yaml.org/](https://yaml.org/).  
\n\nLet's go through the `yaml` step-by-step.\n\n##### Opening and closing fields, nesting, special characters\n\n- `yaml` code blocks open and close with `---` and `...`, respectively.  Ensure your file includes these lines.  \n- Indents are two spaces; to increase a nesting level, start a new line and indent by a further two spaces.  \n- Special characters or numbers in tip names or folders can be correctly parsed by enclosing values in single-quotes to allow string interpretation of values.\n\n##### Force overwrite and re-run\n\n    FORCE_OVERWRITE_AND_RE_RUN:\n      Yes\n\n`havic` manages tasks via [`ruffus`](https://code.google.com/archive/p/ruffus/), and out-of-date stages of the pipeline will be re-run as required.  To start a new run or force overwrite files in the OUTDIR, set `FORCE_OVERWRITE_AND_RE_RUN` to `Yes`.  Otherwise, to resume an interrupted run from where it stopped, set it to `No`.  \n\n##### Default Subject and Queries\n\n    DEFAULT_SUBJECT:\n      Yes # Yes if using havic pre-packaged SUBJECT (i.e., 'reference') sequence and region test data, No otherwise\n\n    DEFAULT_QUERIES:\n      Yes # Yes if using havic pre-packaged QUERY test data, No otherwise\n\nIf `DEFAULT_SUBJECT` is set to `Yes`, `havic` will prefix the filepaths in `SUBJECT_FILE` and `SUBJECT_TARGET_REGION` with the `havic` install path (using `pkg_resources.resource_filename`) for the pre-packaged data.  The same logic applies for `DEFAULT_QUERIES`.  To specify a custom path to `SUBJECT_FILE` and `SUBJECT_TARGET_REGION`, set `DEFAULT_SUBJECT` to `No`.  To specify custom `QUERY_FILES`, set `DEFAULT_QUERIES` to `No`.  \n\n##### Subject/Reference sequence\n\n    SUBJECT_FILE: # the \"SUBJECT\" sequence in BLAST terms, i.e., reference genome\n      data/NC_001489.fa # relative or absolute paths to fasta file\n      # if DEFAULT_SUBJECT is Yes, path will be prefixed to use pre-packaged data\n\n`havic` will use this fasta sequence as the subject/reference sequence.  
If a different reference is required, change the path value.  The subject sequence may only be a single consensus sequence.\n\n##### Subject target region\n\n    SUBJECT_TARGET_REGION: # the target region of the genome to focus on\n      data/havnet_amplicon.fa # in fasta format, relative or absolute paths okay\n      # if DEFAULT_REFS is Yes, path will be prefixed to use pre-packaged data\n\nThis region will guide trimming of the alignment.  In this example, the VP1/P2A region is the target region.  Sequences whose names are listed in `TRIM_SEQS` will be trimmed to match the boundaries of this region.  A sequence is used here instead of a bed coordinates file because the exact boundaries of the target region in the final alignment are not always obvious.  After mapping this region to the subject sequence, the boundaries become obvious.  Automatic delineation of this region removes the need for the analyst to manually search for and define the boundaries.  
\n\n###### Output files\n\nStage number | Stage name | File or directory name\n---:|:---|:---\n1 | create_outdir | `havic_test_results/amplicon`\n2 | compile_input_fasta | `HAV_amplicon_duplicate_seqs.txt`\n2 | compile_input_fasta | `HAV_amplicon_seq_id_replace.tsv`\n2 | compile_input_fasta | `HAV_amplicon_tmpfasta.fa`\n3 | map_input_fasta_to_ref | `HAV_amplicon_map.bam`\n3 | map_input_fasta_to_ref | `HAV_amplicon_map.bam.bai`\n4 | bam2fasta | `HAV_amplicon_map.bam2fasta.R`\n4 | bam2fasta | `HAV_amplicon_map.bam2fasta.Rout`\n4 | bam2fasta | `HAV_amplicon_map.stack.fa`\n5 | get_cleaned_fasta | `HAV_amplicon_map.stack.trimmed.fa`\n6 | run_iqtree | `HAV_amplicon_map.stack.trimmed.fa.bionj`\n6 | run_iqtree | `HAV_amplicon_map.stack.trimmed.fa.ckp.gz`\n6 | run_iqtree | `HAV_amplicon_map.stack.trimmed.fa.contree`\n6 | run_iqtree | `HAV_amplicon_map.stack.trimmed.fa.iqtree`\n6 | run_iqtree | `HAV_amplicon_map.stack.trimmed.fa.log`\n6 | run_iqtree | `HAV_amplicon_map.stack.trimmed.fa.mldist`\n6 | run_iqtree | `HAV_amplicon_map.stack.trimmed.fa.model.gz`\n6 | run_iqtree | `HAV_amplicon_map.stack.trimmed.fa.splits.nex`\n6 | run_iqtree | `HAV_amplicon_map.stack.trimmed.fa.treefile`\n6 | run_iqtree | `HAV_amplicon_map.stack.trimmed.fa.ufboot`\n6 | run_iqtree | `HAV_amplicon_map.stack.trimmed.fa.uniqueseq.phy`\n7 | root_iqtree | `HAV_amplicon_map.stack.trimmed.fa.rooted.treefile`\n8 | clusterpick_from_rooted_iqtree_and_cleaned_fasta | `HAV_amplicon_map.stack.trimmed.fa_HAV_amplicon_map.stack.trimmed.fa.rooted_clusterPicks_cluster4_sequenceList.txt`\n8 | clusterpick_from_rooted_iqtree_and_cleaned_fasta | `HAV_amplicon_map.stack.trimmed.fa_HAV_amplicon_map.stack.trimmed.fa.rooted_clusterPicks.fas`\n8 | clusterpick_from_rooted_iqtree_and_cleaned_fasta | `HAV_amplicon_map.stack.trimmed.fa.rooted_clusterPicks_list.txt`\n8 | clusterpick_from_rooted_iqtree_and_cleaned_fasta | `HAV_amplicon_map.stack.trimmed.fa.rooted_clusterPicks_log.txt`\n8 | 
clusterpick_from_rooted_iqtree_and_cleaned_fasta | `HAV_amplicon_map.stack.trimmed.fa.rooted_clusterPicks.nwk`\n8 | clusterpick_from_rooted_iqtree_and_cleaned_fasta | `HAV_amplicon_map.stack.trimmed.fa.rooted_clusterPicks.nwk.figTree`\n9 | summarise_cluster_assignments | `HAV_amplicon_map.stack.trimmed.fa.rooted_clusterPicks_summarised.txt`\n10 | plot_results_ggtree | `HAV_amplicon_map.stack.trimmed.fa_SNPcountsOverAlignLength.csv`\n10 | plot_results_ggtree | `HAV_amplicon_map.stack.trimmed.fa_SNPdists.csv`\n10 | plot_results_ggtree | `HAV_amplicon_map.stack.trimmed.fa_SNPdists.pdf`\n10 | plot_results_ggtree | `HAV_amplicon_map.stack.trimmed.fa.rooted.treefile_1percent_divergence_valid_msa.pdf`\n10 | plot_results_ggtree | `HAV_amplicon_map.stack.trimmed.fa.Rplot.R`\n10 | plot_results_ggtree | `HAV_amplicon_map.stack.trimmed.fa.Rplot.Rout`\n11 | pipeline_printout_graph | `pipeline_graph.svg`\n\n##### Setting the location of the tree root\n\n    TREE_ROOT:\n      midpoint # sequence name to root iqtree on, or midpoint for midpoint root\n\nFor visual representation only, the tree root is set to orientate the plot in `HAV_amplicon_map.stack.trimmed.fa.rooted.treefile_1percent_divergence_valid_msa.pdf`.  The tree root does not affect cluster definitions.  \n\n##### Set the prefix of output filenames\n\n    RUN_PREFIX:\n      HAV_amplicon_\n\nTo facilitate tracking of output files, the user is able to specify a custom prefix for output files.  \n\n##### Draw results plots\n\n    PLOTS:\n      Yes # Yes to make plots (slow for large runs), No otherwise.\n\nThis setting controls the drawing of output plots.  
The plots (shown below) are helpful to understand how the multiple sequence alignment affects tree topology, cluster detection and pairwise SNP distances.\n\n![Heatmap](https://github.com/schultzm/havic/blob/master/havic/data/_heatmap_SNPs.png?raw=true \"Pairwise genetic distances and ClusterPicker clusters\")\n\n![Tree](https://github.com/schultzm/havic/blob/master/havic/data/tree_MSA_clusters.png?raw=true \"Maximum Likelihood tree with bootstrap support, ClusterPicker clusters, and Multiple Sequence Alignment\")\n\n##### Input query files\n\nInput query sequences should be in fasta format with one sequence per sample.  Multiple samples may be included per file, and/or multiple files may be passed to `havic`.  Query sequences within files will be reverse complemented as necessary during their mapping to the subject/reference.  If the query sequence files are named `batch1.fa`, `batch2.fa`, `batch3.fa`,  edit the `QUERY_FILES` section of the `yaml` file as follows:\n\n    QUERY_FILES:\n      - batch1.fa # relative or absolute paths to fasta files\n      - batch2.fa\n      - batch3.fa\n\n##### Trimming sequences to genomic region of interest\n\nTo trim input queries to the reference VP1/P2A amplicon, list the sequence name of the query under `TRIM_SEQS`, otherwise ignore this section.  \n\n##### Executables settings\n\n    MAPPER_SETTINGS: # https://github.com/lh3/minimap2\n\n    IQTREE2_SETTINGS: # http://www.iqtree.org/doc/iqtree-doc.pdf\n\n    CLUSTER_PICKER_SETTINGS: # https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4228337/\n\nUse these variables to set parameters for `Minimap2`, `IQ-Tree2` and `ClusterPicker`.  For further information, refer to the user manuals for each software in the above links.  
\n\n##### Highlighting samples of interest\n\nTo highlight query sequences in the final plots, list the sequence names under `HIGHLIGHT_TIP` in the `yaml`, otherwise ignore this section.\n\n\n    HIGHLIGHT_TIP:\n      - CmvAXJTIqH # Specify tip name to highlight in final plot\n      - CCHkiFhcxG # Specify tip name to highlight in final plot\n      - PAvYXhYkLM # Specify tip name to highlight in final plot\n\nSamples listed under HIGHLIGHT_TIP will be annotated in the final tree plot with a red dot, as shown below.  \n\n![Tree](https://github.com/schultzm/havic/blob/master/havic/data/highlight_tip.png?raw=true \"Tip CmvAXJTIqH highlighted as requested under HIGHLIGHT_TIP\")\n![Tree](https://github.com/schultzm/havic/blob/master/havic/data/highlight_tips.png?raw=true \"Tips CCHkiFhcxG and PAvYXhYkLM highlighted as requested under HIGHLIGHT_TIP\")\n\n##### Trim sequences to SUBJECT_TARGET_REGION\n\n    TRIM_SEQS: # these sequences will be trimmed to length of SUBJECT_AMPLICON\n      - AY644337_55443_seq_1 # these are sequences in the QUERY_FILES\n      - RIVM-HAV171_64913_seq_2_MapsOutsideTrimRegionSoEmpty\n      - nDNLdjtgha#HashInSeqName\n\nSometimes query sequences are whole genome, off target, or longer than the target regions.  By supplying those sequence names here, `havic` will trim the aligned sequence to the SUBJECT_TARGET_REGION.  This list may be long, which is why it is placed toward the end of the `yaml` file.  \n\n##### Query sequences\n\n    QUERY_FILES:\n    - data/example1.fa # relative or absolute paths to fasta files\n    - data/example2.fa\n    - xyz # to test a dud file name\n    - '' # to test an empty file name (which would return a folder, not file)\n\nProvide relative or absolute paths to files containing query sequences.  Each sample may only consist of a single sequence.  Each file may contain one or more samples.  Multiple files may be input to `havic` via this option.  
\n\n### Tips and tricks\n\n#### Filter samples to subtype and analyse by subtype\n\nFor larger datasets, when runtimes are prohibitive, it is preferable to perform analyses by subtype.  HAV sub-genotypes (or 'subtypes') infecting humans are IA, IB, IIA, IIB, IIIA and IIIB.  Typically, the minimum genetic divergence between the subtypes is around 0.076 (i.e., at least about 7.6 nucleotides in every 100 differ between subtypes in pairwise comparisons).  `havic` can be used to approximately type samples.  Here we describe the process to subset data for analysis of a single VP1/P2A query sequence in the context of thousands of VP1/P2A sequences obtained from NCBI GenBank.\n\n##### Run in fast mode to determine the subset\n\nFirst we need to run the analysis in fast mode to obtain the subtype for the query sequence.  Within the `havic` pipeline, this will require tweaking the settings for `IQTree` and, consequently, `ClusterPicker`.  A run in fast mode might look like:\n\n    IQTREE2_SETTINGS: # http://www.iqtree.org/doc/iqtree-doc.pdf\n      executable:\n        iqtree\n      other:\n        '-T 3 -m GTR+I+G --fast -bnni -alrt 2000'\n\n    CLUSTER_PICKER_SETTINGS: # https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4228337/\n      executable:\n        ClusterPicker\n      coarse_subtree_support: # divide tree into subtrees at/above this threshold\n        70\n      fine_cluster_support: # branch support minimum value for clusters of tips\n        80 # as we are using -alrt in iqtree, use a value of 80 here instead of 95 as we would for --ufboot values\n      distance_fraction: # float please, genetic distance\n        0.076 # genetic distance between subtypes is roughly equal to this\n      large_cluster_threshold:\n        15\n      distance_method:\n        valid # options are ambiguity, valid, gap, or abs\n\nThe fast-mode `iqtree` options `-T 3 -m GTR+I+G --fast -bnni -alrt 2000` are explained more thoroughly in the IQTree2 User Manual.  
Briefly, compute time is reduced by _not_ using `AUTO` to search for the best threading strategy and _not_ searching for the best-fit model.  Working with a short amplicon of 460 bp, we can safely choose three threads (`-T 3`).  We pre-emptively opt for the highly parameterised GTR+I+G model, and use `-bnni` to compensate for any severe model violations.  In `--fast` mode, we also need to use an alternative to the `--ufboot` branch support method, so we use the `-alrt` single branch test.  \n\nTo reiterate, our aim for the fast analysis is to find clades that approximately correspond to HAV subtype, and then pick the subtype/clade that contains our novel query sequence.  Given that we have used `-alrt` as a proxy for branch support, we need to lower our acceptance threshold for branch support.  That is, in `ClusterPicker` we set `fine_cluster_support` to 80 (as opposed to 95 for `UFBoot`) to find the well supported clades (please refer to the IQTree manual for further advice on this), and we increase the genetic divergence to 0.076 or 7.6% to cluster the subtypes.  \n\nTo subset the dataset, after running `havic` in fast mode, open the output file `\u003cRUN_PREFIX\u003emap.stack.trimmed.fa.rooted_clusterPicks.nwk.figTree` in `FigTree`.  Search for the sample of interest.  Select the appropriate subset to give context to the sample of interest.  Note that sample names are sanitised by `havic` to remove problematic characters from fasta headers.  Original sample names are in `\u003cRUN_PREFIX\u003eseq_id_replace.tsv`.  Use the list of original sample names to subset the input data and modify the `yaml` file accordingly.\n\n##### Re-run the analysis in slow mode using the subset data\n\nAfter selecting the subset of interest, re-run the analysis in 'slow' mode at least three times.  
\n\nA re-run in slow mode might look like:\n\n    IQTREE2_SETTINGS: # http://www.iqtree.org/doc/iqtree-doc.pdf\n      executable:\n        iqtree # command to call iqtree2\n      other: # threads\n        '-T AUTO -m MFP+FO --ufboot 1000 -pers 0.2 -nstop 200'\n\n    CLUSTER_PICKER_SETTINGS: # https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4228337/\n      executable:\n        ClusterPicker\n      coarse_subtree_support: # divide tree into subtrees at/above this threshold\n        70\n      fine_cluster_support: # UFBoot minimum value for clusters of tips\n        95\n      distance_fraction: # float please, genetic distance\n        0.01 # (e.g., 1 SNP in 100 bp = 0.01)\n      large_cluster_threshold:\n        15\n\n\n#### Re-run from specified stage\n\nTo run the pipeline from a user-specified stage, delete or re-prefix the files in the output directory and set `FORCE_OVERWRITE_AND_RE_RUN` to `No`.  For example, to re-run the pipeline from the ClusterPicker stage, first set `FORCE_OVERWRITE_AND_RE_RUN` to `No` and then delete the files from stage 8 onwards in the **Output files** table (above).  Alternatively, re-prefix the files from stage 8 onwards with an underscore.  Note that when `FORCE_OVERWRITE_AND_RE_RUN` is set to `Yes`, all files in `OUTDIR` with the prefix as per `RUN_PREFIX` will be deleted.  \n\n#### Input whole genome sequences (or do not trim the MSA)\n\nDuring development of `havic`, it was recognised that HAV surveillance will move to whole genome sequencing in the near future.  To improve utility of `havic` over the coming years, `havic` is written to allow the user to pass in any query and subject sequences.  Prior to phylogenetic analysis, query sequences whose headers are listed under `TRIM_SEQS` will be trimmed to the subject target region given by `SUBJECT_TARGET_REGION`.  To avoid cropping the alignment, either set the value of `SUBJECT_TARGET_REGION` to `SUBJECT_FILE` or set `TRIM_SEQS` to `''`.  
\n\n#### Include the reference/subject sequence in the final alignment\n\nTo include the subject sequence in the final alignment, just add the path to the subject file to the list in the `QUERY_FILES` block.\n\n#### Overcome biases in results\n\nAs `havic` implements ML phylogenetic inference (via `IQ-Tree2`), there is a chance of arriving at a local optimum; hence, **the analysis should be run multiple times (\u003e3) to more completely explore tree space**.  Epidemiological conclusions should be based on the consensus of multiple runs and patient metadata (e.g., contact tracing, travel history).  \n\n## Limitations\n\nInsertions in alleles relative to the reference will be deleted in the alternative allele during output to alignment.  For example, if the `REF` has `ACCCCCCCCT` and the `ALT` has `ACCCCCCCCCCT`, the final alignment will be:\n\n    \u003eREF\n    ACCCCCCCCT\n    \u003eALT\n    ACCCCCCCCT\n\nNote the deletion of `CC` from `ALT` in the alignment above.  For reference, a CIGAR string containing insertions of this kind (`1I`, `7I`) looks like `4644M1I8M1D3M7I287M1D10760M`.\n\n\n## Release history\n\nPre-release.  \n\n## Frequently Asked Questions\n\n_Why the name_ `havic`_?_\n\n`havic` is an acronym for **H**epatitis **A** **V**irus **I**nfection **C**luster (HAVIC); the **VIC** also acknowledges that the development team hails from Victoria, Australia.\n\n_Who is_ `havic` _for?_\n\n`havic` is for molecular epidemiologists working in public health laboratories who want to discover infection clusters in their virus sample cDNA or DNA sequences.  \n\n_What is `havic` for?_\n\n`havic` is for bioinformatic analysis of Hepatitis A Virus genome sequences.  It takes fasta files as input (`QUERIES`), maps the `QUERIES` to a reference (`SUBJECT`), extracts the alignment from the binary alignment map (bam) file, infers a phylogenetic tree from the alignment, and picks infection clusters within the `QUERIES` using the tree and alignment as evidence.  
Theoretically, `havic` can be used on other viral genomes, though testing on non-HAV samples has so far been limited to Measles and SARS-CoV-2.\n\n_How do you define_ `SUBJECT` _and_ `QUERY` _sequences?_\n\nTo maintain consistency with already established methods, SUBJECT ([BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web\u0026PAGE_TYPE=BlastDocs\u0026DOC_TYPE=References) nomenclature) is used interchangeably with REFERENCE, REF or reference allele ([.vcf standard](https://www.internationalgenome.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40/)).  `SUBJECT` is the backbone onto which all `QUERY` sequences will be mapped.  `QUERY` (BLAST nomenclature) is used interchangeably with `ALTERNATE` or `ALT` or alternate allele (`.vcf` standard).  \n\n_Can havic be used with a custom_ `SUBJECT` _sequence?_\n\nYes.  The havic pipeline is expected to work for any non-segmented virus genome.\n\n_Can the `SUBJECT` file consist of multiple contigs?_\n\nNo.  The `SUBJECT` sequence needs to be a single consensus sequence from a single sample.  \n\n_Can input_ `QUERY` _samples consist of multiple consensus sequences from the same sample?_\n\nNo.  A `QUERY` file may NOT consist of multiple contigs from the same sample.  However, a `QUERY` file may consist of multiple sequences, one sequence from each sample.  \n\n_Can input_ `QUERY` _files consist of multiple sequences?_\n\nYes.  A `QUERY` file may either be a single consensus sequence from a single sample, or multiple samples with a single consensus sequence for each sample.  A single `QUERY` file can be input to `havic`, but the program is designed to accept as many `QUERY` files as you wish to feed it.  \n\n_What's all this talk about consensus sequences?  I'm used to talking about contigs._\n\nIn the 2020 pandemic era, virus genome sequencing is dominated by tiled-PCR-amplicon Illumina paired-end sequencing and/or Oxford Nanopore Technologies (ONT) long read sequencing.  
The typically low input nucleic acid quantity from clinical samples means that Illumina sequencing of tiled PCR amplicons is the preferred method for whole genome sequencing of clinical virus samples.  Tiled amplicon Illumina sequencing allows mapping of reads from a single sample to a single reference, with the final sample genome sequence called as the consensus variants against the reference, padded by inter-variant reference bases.  The final sample sequence is not produced from a de novo assembly of reads, so it is referred to as a consensus sequence.  Further, in diagnostic laboratories worldwide, quantitative Reverse Transcriptase Real-time PCR (qRT-PCR, qPCR or sometimes just RT-PCR) is used to detect positive cases.  Due to difficulties associated with whole genome sequencing, diagnostic laboratories often use Sanger sequencing of PCR products to call the strain of virus.  `havic` was originally written to discover and characterise outbreak clusters from short amplicon Sanger sequences, but is now also capable of analysing virus whole genome consensus sequences.  \n\n_Will_ `havic` _work on organisms other than viruses?_\n\nProbably.  `havic` has been designed and tested specifically to work on Hepatitis A Virus (HAV, genome size ~7.5kb) genomes.  However, `havic` should work on any non-segmented virus genome, and successful test analyses have been performed on Measles (~15.9kb) and SARS-CoV-2 (~30kb) genomes.  Ultimately it is up to the analyst to decide whether `havic`'s treatment of the data makes biological sense.  \n\n_What is the minimum number of sequences that can be analysed using_ `havic`_?_\n\nThe answer is 3.  To obtain context sequences for the query sample/s, go to NCBI's GenBank or RIVM's HAVNet.  
It is recommended to use [entrez e-utils](https://www.ncbi.nlm.nih.gov/books/NBK179288/) for obtaining large numbers of sequences and associated metadata.\n\n## Glossary\n\nAcronym | Expansion\n---|---\nHAV | Hepatitis A Virus\nMSA | Multiple Sequence Alignment\nHAVNet | Hepatitis A Virus Network\nML | Maximum Likelihood\nNCBI | National Center for Biotechnology Information\nRIVM | Rijksinstituut voor Volksgezondheid en Milieu\nPCR | Polymerase Chain Reaction\ncDNA | complementary DNA\nWGS | Whole genome sequence/ing\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fschultzm%2Fhavic","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fschultzm%2Fhavic","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fschultzm%2Fhavic/lists"}