{"id":35037949,"url":"https://github.com/zheminzhou/etoki","last_synced_at":"2025-12-27T08:03:30.029Z","repository":{"id":50477862,"uuid":"114418850","full_name":"zheminzhou/EToKi","owner":"zheminzhou","description":"all methods related to Enterobase","archived":false,"fork":false,"pushed_at":"2024-09-04T14:38:13.000Z","size":244956,"stargazers_count":38,"open_issues_count":8,"forks_count":18,"subscribers_count":10,"default_branch":"master","last_synced_at":"2024-09-05T19:55:36.034Z","etag":null,"topics":["assembly","genotype","mlst","phylo","phylogeny","spades"],"latest_commit_sha":null,"homepage":"https://enterobase.warwick.ac.uk","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zheminzhou.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-12-15T23:18:42.000Z","updated_at":"2024-09-04T14:38:17.000Z","dependencies_parsed_at":"2023-02-10T12:01:33.374Z","dependency_job_id":"3dd83113-7e5f-42c8-9234-14f8295be88c","html_url":"https://github.com/zheminzhou/EToKi","commit_stats":{"total_commits":486,"total_committers":10,"mean_commits":48.6,"dds":"0.16049382716049387","last_synced_commit":"08d73828a54c172ecf67cc51b6133d27ee7ca9de"},"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/zheminzhou/EToKi","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zheminzhou%2FEToKi","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zheminzhou%2FEToKi/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhemi
nzhou%2FEToKi/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zheminzhou%2FEToKi/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zheminzhou","download_url":"https://codeload.github.com/zheminzhou/EToKi/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zheminzhou%2FEToKi/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28075691,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-27T02:00:05.897Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["assembly","genotype","mlst","phylo","phylogeny","spades"],"created_at":"2025-12-27T08:03:05.652Z","updated_at":"2025-12-27T08:03:30.023Z","avatar_url":"https://github.com/zheminzhou.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# EToKi (Enterobase Tool Kit)\nAll methods related to Enterobase data analysis pipelines.\n\n# INSTALLATION:\n\nEToKi was developed and tested in both Python 2.7 and Python 3.5. 
EToKi depends on several Python libraries: \n~~~~~~~~~~\nete3\nnumba\nnumpy\npandas\npsutil\nsklearn\n~~~~~~~~~~\n\nAll libraries can be installed using pip (note that sklearn is published on PyPI as scikit-learn): \n\n~~~~~~~~~~\npip install ete3 numba numpy pandas scikit-learn psutil\n~~~~~~~~~~\nEToKi also calls the following 3rd party programs for different pipelines:\n\n~~~~~~~~~~\nraxml\nfasttree\nrapidnj\nbbmap\nmmseqs\nncbi-blast\nusearch\nspades\nmegahit\nsamtools\npilon\ngatk\nbwa\nbowtie2\nminimap2\nkraken2 \u0026 minikraken2\nlastal \u0026 lastdb\npilercr\ntrf\n~~~~~~~~~~\n\nAll 3rd party programs except for usearch can be automatically installed using the *configure* command:\n~~~~~~~~~~\npython EToKi.py configure --install --download_krakenDB\n~~~~~~~~~~\n\nNOTE: This has only been tested on Ubuntu 16.04 but is expected to run on other 64-bit Linux systems. \n \nUsearch is a commercial program and allows free use of the 32-bit version for individuals. Please download it from [https://www.drive5.com/usearch/](https://www.drive5.com/usearch/)\n\nAfter it is downloaded, pass its executable file to EToKi using **--usearch**\n\n~~~~~~~~~~\npython EToKi.py configure --usearch /path/to/usearch\n~~~~~~~~~~\n\nYou can also run both **--install** and **--usearch** at the same time:\n~~~~~~~~~~\npython EToKi.py configure --install --download_krakenDB --usearch /path/to/usearch\n~~~~~~~~~~\n\nNote that **--download_krakenDB** will download the minikraken2 database, which is about 8GB in size. Alternatively, you can use **--link_krakenDB** to pass a different Kraken database to EToKi.\n~~~~~~~~~~\npython EToKi.py configure --install --link_krakenDB /path/to/krakenDB --usearch /path/to/usearch\n~~~~~~~~~~\n\nYou can also use pre-installed 3rd party programs in EToKi by passing their absolute paths into the program using **--path**. 
This argument can be specified multiple times in the same command:\n~~~~~~~~~~\npython EToKi.py configure --path fasttree=/path/to/fasttree --path raxml=/path/to/raxml\n~~~~~~~~~~\n  \n\n# Quick Start (with examples)\n\n### Trim genomic reads\n~~~~~~~~~~~\npython EToKi.py prepare --pe examples/S_R1.fastq.gz,examples/S_R2.fastq.gz -p examples/prep_out\n~~~~~~~~~~~\n### Merge and trim metagenomic reads\n~~~~~~~~~~~\npython EToKi.py prepare --pe examples/S_R1.fastq.gz,examples/S_R2.fastq.gz -p examples/meta_out --noRename --merge\n~~~~~~~~~~~\n### Assemble genomic reads using SPAdes\n~~~~~~~~~~~\npython EToKi.py assemble --pe examples/prep_out_L1_R1.fastq.gz,examples/prep_out_L1_R2.fastq.gz --se examples/prep_out_L1_SE.fastq.gz -p examples/asm_out\n~~~~~~~~~~~\n### Assemble genomic reads using MEGAHIT\n~~~~~~~~~~~\npython EToKi.py assemble --se examples/meta_out_L1_MP.fastq.gz \\\n--pe examples/meta_out_L1_R1.fastq.gz,examples/meta_out_L1_R2.fastq.gz --se examples/meta_out_L1_SE.fastq.gz \\\n-p examples/asm_out2 --assembler megahit\n~~~~~~~~~~~\n### Map reads onto reference, with pre-filtering with ingroups and outgroups\n~~~~~~~~~~~\npython EToKi.py assemble --se examples/meta_out_L1_MP.fastq.gz --metagenome \\\n--pe examples/meta_out_L1_R1.fastq.gz,examples/meta_out_L1_R2.fastq.gz --se examples/meta_out_L1_SE.fastq.gz \\\n-p examples/map_out -r examples/GCF_000010485.1_ASM1048v1_genomic.fna.gz \\\n-i examples/GCF_000214765.2_ASM21476v3_genomic.fna.gz -o examples/GCF_000005845.2_ASM584v2_genomic.fna.gz\n~~~~~~~~~~~\n### Prepare reference alleles and a local database for 7 Gene MLST scheme\n~~~~~~~~~~~\npython EToKi.py MLSTdb -i examples/Escherichia.Achtman.alleles.fasta -r examples/Escherichia.Achtman.references.fasta -d examples/Escherichia.Achtman.convert.tab\n~~~~~~~~~~~\n### Calculate 7 Gene MLST genotype for a queried genome\n~~~~~~~~~~~\ngzip -cd examples/GCF_001566635.1_ASM156663v1_genomic.fna.gz \u003e examples/GCF_001566635.1_ASM156663v1_genomic.fna 
\u0026\u0026 \\\npython EToKi.py MLSType -i examples/GCF_001566635.1_ASM156663v1_genomic.fna -r examples/Escherichia.Achtman.references.fasta -k G749 -o stdout -d examples/Escherichia.Achtman.convert.tab\n~~~~~~~~~~~\n### Run EBEis (EnteroBase Escherichia in silico serotyping)\n~~~~~~~~~~~\npython EToKi.py EBEis -t Escherichia -q examples/GCF_000010485.1_ASM1048v1_genomic.fna -p SE15\n~~~~~~~~~~~\n### Cluster sequences into similarity-based groups \n~~~~~~~~~~~\npython EToKi.py clust -p examples/Escherichia.Achtman.alleles_clust -i examples/Escherichia.Achtman.alleles.fasta -d 0.95 -c 0.95\n~~~~~~~~~~~\n### Do a joint BLASTn-like search using BLASTn, uSearch (uBLASTp), minimap2 and mmseqs\n~~~~~~~~~~~\npython EToKi.py uberBlast -q examples/Escherichia.Achtman.alleles.fasta -r examples/GCF_001566635.1_ASM156663v1_genomic.fna -o examples/G749_7Gene.bsn --blastn --ublast --minimap --mmseq -s 2 -f\n~~~~~~~~~~~\n### Align multiple genomes onto one reference\n~~~~~~~~~~~\npython EToKi.py align -r GCF_000010485:examples/GCF_000010485.1_ASM1048v1_genomic.fna.gz -p examples/phylo_out \\\nGCF_000005845:examples/GCF_000005845.2_ASM584v2_genomic.fna.gz \\\nGCF_000214765:examples/GCF_000214765.2_ASM21476v3_genomic.fna.gz \\\nGCF_001566635:examples/GCF_001566635.1_ASM156663v1_genomic.fna.gz\n~~~~~~~~~~~\n### Build ML tree using RAxML and place all SNPs onto branches in the tree\n~~~~~~~~~~~\ncd examples \u0026\u0026 python ../EToKi.py phylo -t snp2mut -p phylo_out -s phylo_out.matrix.gz --ng \u0026\u0026 cd ..\n~~~~~~~~~~~\n\n# USAGE:\nThe first argument passed into EToKi specifies the command to be called and the rest are the parameters for that command. 
To see all the commands available in EToKi, use:\n\u003e python EToKi.py -h\n\nAnd to see the parameters for an individual command, use:\n\u003e python EToKi.py \\\u003ccommand\\\u003e -h\n\n## configure - install and/or configure 3rd party programs\nSee the INSTALLATION section or the help page below.\n~~~~~~~~~~~~~~\nusage: EToKi.py configure [-h] [--install] [--usearch USEARCH]\n                          [--download_krakenDB]\n                          [--link_krakenDB KRAKEN_DATABASE] [--path PATH]\n\nInstall or modify the 3rd party programs.\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --install             install 3rd party programs\n  --usearch USEARCH     usearch is required for ortho and MLSType. A 32-bit\n                        version of usearch can be downloaded from\n                        https://www.drive5.com/usearch/\n  --download_krakenDB   When specified, miniKraken2 (8GB) will be downloaded\n                        into the EToKi folder. You can also use\n                        --link_krakenDB to use a pre-installed kraken2\n                        database.\n  --link_krakenDB KRAKEN_DATABASE\n                        Kraken is optional in the assemble module. You can\n                        specify your own database here\n  --path PATH, -p PATH  Specify path to the 3rd party programs manually.\n                        Format: \u003cprogram\u003e=\u003cpath\u003e. 
This parameter can be\n                        specified multiple times\n~~~~~~~~~~~~~~~~~\n\n## prepare - trim, collapse, downsize and rename the short reads\n~~~~~~~~~~~~~\nusage: EToKi.py prepare [-h] [--pe PE] [--se SE] [-p PREFIX] [-q READ_QUAL]\n                        [-b MAX_BASE] [-m MEMORY] [--noTrim] [--merge]\n                        [--noRename]\n\nEToKi.py prepare\n(1) Concatenates reads of the same library together.\n(2) Merge pair-end sequences for metagenomic reads (bbmap).\n(3) Trims sequences based on base-qualities (bbduk).\n(4) Removes potential adapters and barcodes (bbduk).\n(5) Limits total amount of reads to be used.\n(6) Renames reads using sequential numbers.\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --pe PE               comma delimited files of PE reads from the same library.\n                        e.g. --pe a_R1.fq.gz,a_R2.fq.gz,b_R1.fq.gz,b_R2.fq.gz\n                        This can be specified multiple times for different libraries.\n  --se SE               comma delimited files of SE reads from the same library.\n                        e.g. --se c_SE.fq.gz,d_SE.fq.gz\n                        This can be specified multiple times for different libraries.\n  -p PREFIX, --prefix PREFIX\n                        prefix for the outputs. Default: EToKi_prepare\n  -q READ_QUAL, --read_qual READ_QUAL\n                        Minimum quality to be kept in bbduk. Default: 6\n  -b MAX_BASE, --max_base MAX_BASE\n                        Total amount of bases (in BPs) to be kept.\n                        Default as -1 for no restriction.\n                        Suggest to use ~100X coverage for de novo assembly.\n  -m MEMORY, --memory MEMORY\n                        maximum amount of memory to be used in bbduk. 
Default: 30g\n  --noTrim              Do not do quality trim using bbduk\n  --merge               Try to merge PE reads by their overlaps using bbmap\n  --noRename            Do not rename reads\n~~~~~~~~~~~~~~~~\n\n## assemble - *de novo* or reference-guided assembly for genomic or metagenomic reads\n**EToKi assemble** is a joint method for both *de novo* assembly and reference-guided assembly. \n* *de novo* assembly approach calls either SPAdes (default) or MEGAHIT (default for metagenomic data) on short reads that have been cleaned up using **EToKi prepare**, and uses Pilon to polish the assembled scaffolds and evaluate the reliability of consensus bases of the scaffolds. \n\n* Reference-guided assembly is also called \"reference mapping\". Short reads are aligned to a user-specified reference genome using minimap2. Nucleotide bases of the reference genome are updated using Pilon, according to the consensus base calls of the covered reads. Non-specific metagenomic reads of closely related species can sometimes also align to the reference genome and confuse consensus calling. Two arguments, **--outgroup** and **--ingroup**, are given to pre-filter these non-specific reads and obtain clean SNP calls. 
\n~~~~~~~~~~~~~~~~~\nusage: EToKi.py assemble [-h] [--pe PE] [--se SE] [--pacbio PACBIO] [--ont ONT] [-p PREFIX] [-a ASSEMBLER] [-r REFERENCE] [-k KMERS] [-m MAPPER] [-d MAX_DIFF] [-i INGROUP] [-o OUTGROUP] [-S SNP] [-c CONT_DEPTH]\n                         [--excluded EXCLUDED] [--metagenome] [--numPolish NUMPOLISH] [--reassemble] [--onlySNP] [--noQuality] [--onlyEval] [--kraken]\n\nEToKi.py assemble\n(1.1) Assembles short reads into assemblies, or\n(1.2) Maps them onto a reference.\nAnd\n(2) Polishes consensus using polish,\n(3) Removes low level contaminations.\n(4) Estimates the base quality of the consensus.\n(5) Predicts taxonomy using Kraken.\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --pe PE               comma delimited two files of PE reads.\n  --se SE               one file of SE read.\n  --pacbio PACBIO       one file of pacbio read.\n  --ont ONT             one file of nanopore read.\n  -p PREFIX, --prefix PREFIX\n                        prefix for the outputs. Default: EToKi_assemble\n  -a ASSEMBLER, --assembler ASSEMBLER\n                        Assembler used for de novo assembly.\n                        Disabled if you specify a reference.\n                        Default: spades for single colony isolates, megahit for metagenome.\n                         Long reads will always be assembled with Flye\n  -r REFERENCE, --reference REFERENCE\n                        Reference for read mapping. Specify this for reference mapping module.\n  -k KMERS, --kmers KMERS\n                        relative lengths of kmers used in SPAdes. 
Default: 30,50,70,90\n  -m MAPPER, --mapper MAPPER\n                        aligner used for read mapping.\n                        options are: miminap (default), bwa or bowtie2\n  -d MAX_DIFF, --max_diff MAX_DIFF\n                        Maximum proportion of variations allowed for a aligned reads.\n                        Default: 0.1 for single isolates, 0.05 for metagenome\n  -i INGROUP, --ingroup INGROUP\n                        Additional references presenting intra-population genetic diversities.\n  -o OUTGROUP, --outgroup OUTGROUP\n                        Additional references presenting genetic diversities outside of the studied population.\n                        Reads that are more similar to outgroups will be excluded from analysis.\n  -S SNP, --SNP SNP     Exclusive set of SNPs. This will overwrite the polish process.\n                        Required format:\n                        \u003ccont_name\u003e \u003csite\u003e \u003cbase_type\u003e\n                        ...\n  -c CONT_DEPTH, --cont_depth CONT_DEPTH\n                        Allowed range of read depth variations relative to average value.\n                        Default: 0.2,2.5\n                        Contigs with read depths outside of this range will be removed from the final assembly.\n  --excluded EXCLUDED   A name of the file that contains reads to be excluded from the analysis.\n  --metagenome          Reads are from metagenomic samples\n  --numPolish NUMPOLISH\n                        Number of Pilon polish iterations. Default: 1\n  --reassemble          Do local re-assembly in PILON. Suggest to use this flag with long reads.\n  --onlySNP             Only modify substitutions during the PILON polish.\n  --noQuality           Do not estimate base qualities.\n  --onlyEval            Do not run assembly/mapping. 
Only evaluate assembly status.\n  --kraken              Run kmer based species predicton on contigs.\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n## ortho  - pan-genome (and wgMLST scheme) prediction\n**EToKi ortho** has now been migrated to a [separate repository](https://github.com/zheminzhou/PEPPA) and renamed as **PEPPA**. \n\n## MLSTdb - Set up exemplar alleles and database for MLST schemes\n**EToKi MLSTdb** converts existing allelic sequences into two files: (1) a multi-fasta file of exemplar allelic sequences and (2) a lookup table for the **EToKi MLSType** method. \n* The exemplar alleles are defined as: \n   1. Over 40% identity to the allelic sequences of a reference genome specified by **--refstrain**\n   2. Less than 90% identity between different exemplar sequences of the same locus\n   3. Identity to sequences of any different locus that is at least 10% less than the similarity to sequences of the same locus.\n~~~~~~~~~~~\nusage: EToKi.py MLSTdb [-h] -i ALLELEFASTA [-r REFSET] [-d DATABASE]\n                       [-s REFSTRAIN] [-x MAX_IDEN] [-m MIN_IDEN] [-p PARALOG]\n                       [-c COVERAGE] [-e]\n\nMLSTdb. 
Create reference sets of alleles for nomenclature.\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -i ALLELEFASTA, --input ALLELEFASTA\n                        [REQUIRED] A single file contains all known alleles in\n                        a MLST scheme.\n  -r REFSET, --refset REFSET\n                        [DEFAULT: No ref allele] Output - Reference alleles\n                        used for MLSType.\n  -d DATABASE, --database DATABASE\n                        [DEFAULT: No allele DB] Output - A lookup table of all\n                        alleles.\n  -s REFSTRAIN, --refstrain REFSTRAIN\n                        [DEFAULT: None] A single file contains alleles from\n                        the reference genome.\n  -x MAX_IDEN, --max_iden MAX_IDEN\n                        [DEFAULT: 0.9 ] Maximum identities between resulting\n                        refAlleles.\n  -m MIN_IDEN, --min_iden MIN_IDEN\n                        [DEFAULT: 0.4 ] Minimum identities between refstrain\n                        and resulting refAlleles.\n  -p PARALOG, --paralog PARALOG\n                        [DEFAULT: 0.1 ] Minimum differences between difference\n                        loci.\n  -c COVERAGE, --coverage COVERAGE\n                        [DEFAULT: 0.7 ] Proportion of aligned regions between\n                        alleles.\n  -e, --relaxEnd        [DEFAULT: False ] Allow changed ends (for pubmlst).\n~~~~~~~~~~~\n\n## MLSType - MLST nomenclature using a local set of references\n**EToKi MLSType** identifies allelic sequences in a queried genome by comparing it with the exemplar alleles generated by **MLSTdb**. 
\n ~~~~~~~~~~\nusage: EToKi.py MLSType [-h] -i GENOME -r REFALLELE -k UNIQUE_KEY\n                        [-d DATABASE] [-o OUTPUT] [-q] [-f] [-m MIN_IDEN]\n                        [-p MIN_FRAG_PROP] [-l MIN_FRAG_LEN] [-x INTERGENIC]\n                        [--overlap_prop OVERLAP_PROP]\n                        [--overlap_iden OVERLAP_IDEN] [--max_dist MAX_DIST]\n                        [--diag_diff DIAG_DIFF] [--max_diff MAX_DIFF]\n\nMLSType. Find and designate MLST alleles from a queried assembly.\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -i GENOME, --genome GENOME\n                        [REQUIRED] Input - filename for genomic assembly.\n  -r REFALLELE, --refAllele REFALLELE\n                        [REQUIRED] Input - fasta file for reference alleles.\n  -k UNIQUE_KEY, --unique_key UNIQUE_KEY\n                        [REQUIRED] An unique identifier for the assembly.\n  -d DATABASE, --database DATABASE\n                        [OPTIONAL] Input - lookup table of existing alleles.\n  -o OUTPUT, --output OUTPUT\n                        [DEFAULT: No output] Output - filename for the\n                        generated alleles. 
Specify to STDOUT for screen\n                        output.\n  -q, --query_only      [DEFAULT: False] Do not submit new allele, only query.\n  -f, --force           [DEFAULT: False] Force to accept low quality alleles.\n  -m MIN_IDEN, --min_iden MIN_IDEN\n                        [DEFAULT: 0.65 ] Minimum identities between refAllele\n                        and genome.\n  -p MIN_FRAG_PROP, --min_frag_prop MIN_FRAG_PROP\n                        [DEFAULT: 0.6 ] Minimum covereage of a fragment.\n  -l MIN_FRAG_LEN, --min_frag_len MIN_FRAG_LEN\n                        [DEFAULT: 50 ] Minimum length of a fragment.\n  -x INTERGENIC, --intergenic INTERGENIC\n                        [DEFAULT: -1,-1 ] Call alleles in intergenic region if\n                        the distance between two closely located loci fall\n                        within the range defined by the two numbers. Suggest\n                        to use 50,500. This is diabled by default with minus\n                        numbers.\n  --overlap_prop OVERLAP_PROP\n                        [DEFAULT: 0.5 ] Given two hits, if \u003coverlap_prop\u003e of\n                        their regions overlap, and the sequence identities of\n                        one hits is \u003coverlap_iden\u003e lower than the other. The\n                        hit with lower identities will be removed.\n  --overlap_iden OVERLAP_IDEN\n                        [DEFAULT: 0.05 ] Given two hits, if \u003coverlap_prop\u003e of\n                        their regions overlap, and the sequence identities of\n                        one hits is \u003coverlap_iden\u003e lower than the other. 
The\n                        hit with lower identities will be removed.\n  --max_dist MAX_DIST   [DEFAULT: 300 ] Consider two closely located hits as a\n                        synteny block if their coordinates in both queried\n                        genomes and reference gene are seperated by no more\n                        than \u003cmax_dist\u003e bps.\n  --diag_diff DIAG_DIFF\n                        [DEFAULT: 1.2 ] Consider two closely located hits as a\n                        synteny block if, after merged, its covered region in\n                        the queried genome is no more than \u003cdiag_diff\u003e folds\n                        of the region in the reference gene.\n  --max_diff MAX_DIFF   [DEFAULT: 200 ] Consider two closely located hits as a\n                        synteny block if, after merged, the lengths of its\n                        covered regions in the queried genome and the\n                        reference gene are differed by no more than \u003cmax_diff\u003e\n                        bps.\n ~~~~~~~~~~\n\n## align - align multiple queried genomes to a single reference\n~~~~~~~~~~~\nusage: EToKi.py align [-h] -r REFERENCE [-p PREFIX] [-a] [-m] [-l] [-c CORE]\n                      [-n N_PROC]\n                      queries [queries ...]\n\nAlign multiple genomes onto a single reference.\n\npositional arguments:\n  queries               queried genomes. Use \u003cTag\u003e:\u003cFilename\u003e format to feed\n                        in a tag for each genome. Otherwise filenames will be\n                        used as tags for genomes.\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -r REFERENCE, --reference REFERENCE\n                        [REQUIRED; INPUT] reference genomes to be aligned\n                        against. 
Use \u003cTag\u003e:\u003cFilename\u003e format to assign a tag\n                        to the reference.\n  -p PREFIX, --prefix PREFIX\n                        [OUTPUT] prefix for all outputs.\n  -a, --alignment       [OUTPUT] Generate core genomic alignments in FASTA\n                        format\n  -m, --matrix          [OUTPUT] Do not generate core SNP matrix\n  -l, --last            Activate to use LAST as aligner. [DEFAULT: minimap2]\n  -c CORE, --core CORE  [PARAM] percentage of presences for core genome.\n                        [DEFAULT: 0.95]\n  -n N_PROC, --n_proc N_PROC\n                        [PARAM] number of processes to use. [DEFAULT: 5]\n~~~~~~~~~~~\n\n## phylo - infer phylogeny and ancestral states from genomic alignments \n~~~~~~~~~~~\nusage: EToKi.py phylo [-h] [--tasks TASKS] --prefix PREFIX\n                      [--alignment ALIGNMENT] [--snp SNP] [--tree TREE]\n                      [--ancestral ANCESTRAL] [--core CORE] [--n_proc N_PROC]\n\nEToKi phylo runs to:\n(1) Generate SNP matrix from alignment (-t matrix)\n(2) Calculate ML phylogeny from SNP matrix using RAxML (-t phylogeny)\n(3) Workout the nucleotide sequences of internal nodes in the tree using ML estimation (-t ancestral or -t ancestral_proportion for ratio frequencies)\n(4) Place mutations onto branches of the tree (-t mutation)\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --tasks TASKS, -t TASKS\n                        Tasks to call. 
Allowed tasks are:\n                        matrix: generate SNP matrix from alignment.\n                        phylogeny: generate phylogeny from SNP matrix.\n                        ancestral: generate AS (ancestral state) matrix from SNP matrix and phylogeny\n                        ancestral_proportion: generate possibilities of AS for each site\n                        mutation: assign SNPs into branches from AS matrix\n\n                        You can run multiple tasks by sending a comma delimited task list.\n                        There are also some pre-defined task combo:\n                        all: matrix,phylogeny,ancestral,mutation\n                        aln2phy: matrix,phylogeny [default]\n                        snp2anc: phylogeny,ancestral\n                        mat2mut: ancestral,mutation\n  --prefix PREFIX, -p PREFIX\n                        prefix for all outputs.\n  --alignment ALIGNMENT, -m ALIGNMENT\n                        aligned sequences in either fasta format or Xmfa format. Required for \"matrix\" task.\n  --snp SNP, -s SNP     SNP matrix in specified format. Required for \"phylogeny\" and \"ancestral\" if alignment is not given\n  --tree TREE, -z TREE  phylogenetic tree. Required for \"ancestral\" task\n  --ancestral ANCESTRAL, -a ANCESTRAL\n                        Inferred ancestral states in a specified format. Required for \"mutation\" task\n  --core CORE, -c CORE  Core genome proportion. Default: 0.95\n  --n_proc N_PROC, -n N_PROC\n                        Number of processes. Default: 7.\n~~~~~~~~~~~\n\n\n## EB*Eis* - *in silico* serotype prediction for *Escherichia coli* \u0026 *Shigella spp.*\n**EB*Eis*** is a BLASTn based prediction tool for the O and H antigens of *Escherichia coli* and *Shigella*. It uses essential genes (*wzx, wzy, wzt \u0026 wzm* for O; *fliC* for H) as markers. **EB*Eis*** uses a database built from two sources:\n1. 
[SeroTypeFinder](https://bitbucket.org/genomicepidemiology/serotypefinder_db/src)\n2. O-antigen gene sequences reported in [DebRoy et al., PLoS ONE, 2016](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0147434#pone.0147434.ref011)\n~~~~~~~~~~~\nusage: EToKi.py EBEis [-h] -q QUERY [-t TAXON] [-p PREFIX]\n\nEnteroBase Escherichia in silico serotyping\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -q QUERY, --query QUERY\n                        file name for the queried assembly in multi-FASTA format.\n  -t TAXON, --taxon TAXON\n                        Taxon database to compare with. \n                        Only support Escherichia (default) for the moment.\n  -p PREFIX, --prefix PREFIX\n                        prefix for intermediate files. Default: EBEis\n~~~~~~~~~~~\n\n## isCRISPOL - *in silico* prediction of CRISPOL array for *Salmonella enterica* serovar Typhimurium\nCRISPOL is an oligo-based Typhimurium sub-typing method described in [Fabre et al., PLoS ONE, 2012](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0036995). We use the direct repeats (DRs) and spacers in the Typhimurium CRISPR array to predict CRISPOL types from genomic assemblies.\n~~~~~~~~~~~\nusage: EToKi.py isCRISPOL [-h] [N [N ...]]\n\nin silico Typhimurium subtyping using CRISPOL scheme (Fabre et al., PLoS ONE, 2012)\n\npositional arguments:\n  N           FASTA files containing assemblies of S. enterica Typhimurium.\n\noptional arguments:\n  -h, --help  show this help message and exit\n~~~~~~~~~~~\n\n## uberBlast - Use BLASTn, uBLASTp, minimap2 and/or mmseqs to identify similar sequences\n**EToKi uberBlast** is also internally called by **EToKi ortho** to align exemplar genes to queried genomes, using both BLASTn and uSearch-uBLASTp. Amino acid alignments are converted back to nucleotide sequences, meaning that genome coordinates remain consistent across different methods. 
\n\n* minimap2 --- Fastest alignment at the nucleotide level. High accuracy for identities \u003e= 90%, but loses sensitivity quickly at lower identities. \n* blastn --- Fast alignment at the nucleotide level. Loses sensitivity for identities \u003c 80%\n* mmseqs --- Amino acid based alignment for identities \u003e= 70% (open source)\n* uBLASTp --- Amino acid based alignment for identities \u003c 50% (commercial software)\n~~~~~~~~~~~\nusage: EToKi.py uberBlast [-h] -r REFERENCE -q QUERY [-o OUTPUT] [--blastn]\n                          [--ublast] [--ublastSELF] [--minimap] [--minimapASM]\n                          [--mmseq] [--min_id MIN_ID] [--min_cov MIN_COV]\n                          [--min_ratio MIN_RATIO] [-s RE_SCORE] [-f]\n                          [--filter_cov FILTER_COV]\n                          [--filter_score FILTER_SCORE] [-m]\n                          [--merge_gap MERGE_GAP] [--merge_diff MERGE_DIFF]\n                          [-O] [--overlap_length OVERLAP_LENGTH]\n                          [--overlap_proportion OVERLAP_PROPORTION]\n                          [-e FIX_END] [-t N_THREAD] [-p]\n\nFive different alignment methods.\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -r REFERENCE, --reference REFERENCE\n                        [INPUT; REQUIRED] filename for the reference. This is\n                        normally a genomic assembly.\n  -q QUERY, --query QUERY\n                        [INPUT; REQUIRED] filename for the query. This can be\n                        short-reads or genes or genomic assemblies.\n  -o OUTPUT, --output OUTPUT\n                        [OUTPUT; Default: None] save result to a file or to\n                        screen (stdout). Default do nothing.\n  --blastn              Run BLASTn. Slowest. Good for identities between [80,\n                        100]\n  --ublast              Run uBLAST in tBLASTn mode. Fast. 
Good for identities\n                        between [30-100]\n  --ublastSELF          Run uBLAST in tBLASTn mode. Fast. Good for identities\n                        between [30-100]\n  --minimap             Run minimap. Fast. Good for identities between\n                        [90-100]\n  --minimapASM          Run minimap on assemblies. Fast. Good for identities\n                        between [90-100]\n  --mmseq               Run mmseq2 in tBLASTn mode. Fast. Good for identities\n                        between [70-100]\n  --min_id MIN_ID       [DEFAULT: 0.3] Minimum identity before reScore for an\n                        alignment to be kept\n  --min_cov MIN_COV     [DEFAULT: 40] Minimum length for an alignment to be\n                        kept\n  --min_ratio MIN_RATIO\n                        [DEFAULT: 0.05] Minimum length for an alignment to be\n                        kept, proportional to the length of the query\n  -s RE_SCORE, --re_score RE_SCORE\n                        [DEFAULT: 0] Re-interpret alignment scores and\n                        identities. 
0: No rescore; 1: Rescore with\n                        nucleotides; 2: Rescore with amino acid; 3: Rescore\n                        with codons\n  -f, --filter          [DEFAULT: False] Remove secondary alignments if they\n                        overlap with any other regions\n  --filter_cov FILTER_COV\n                        [DEFAULT: 0.9]\n  --filter_score FILTER_SCORE\n                        [DEFAULT: 0]\n  -m, --linear_merge    [DEFAULT: False] Merge consective alignments\n  --merge_gap MERGE_GAP\n                        [DEFAULT: 300]\n  --merge_diff MERGE_DIFF\n                        [DEFAULT: 1.2]\n  -O, --return_overlap  [DEFAULT: False] Report overlapped alignments\n  --overlap_length OVERLAP_LENGTH\n                        [DEFAULT: 300] Minimum overlap to report\n  --overlap_proportion OVERLAP_PROPORTION\n                        [DEFAULT: 0.6] Minimum overlap proportion to report\n  -e FIX_END, --fix_end FIX_END\n                        [FORMAT: L,R; DEFAULT: 0,0] Extend alignment to the\n                        edges if the un-aligned regions are \u003c= [L,R]\n                        basepairs.\n  -t N_THREAD, --n_thread N_THREAD\n                        [DEFAULT: 8] Number of threads to use.\n  -p, --process         [DEFAULT: False] Use processes instead of threads.\n~~~~~~~~~~~\n\n## clust - linear-time clustering of short sequences using mmseqs linclust\n**EToKi clust** is called internally by **EToKi ortho** to cluster seed genes into gene clusters. Given its linear-time complexity, it can cluster millions of gene sequences in minutes. 
\n~~~~~~~~~~~\nusage: EToKi.py clust [-h] -i INPUT -p PREFIX [-d IDENTITY] [-c COVERAGE]\n                      [-t N_THREAD]\n\nGet clusters and exemplars of clusters from gene sequences using mmseqs linclust.\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -i INPUT, --input INPUT\n                        [INPUT; REQUIRED] name of the file containing gene sequneces in FASTA format.\n  -p PREFIX, --prefix PREFIX\n                        [OUTPUT; REQUIRED] prefix of the outputs.\n  -d IDENTITY, --identity IDENTITY\n                        [PARAM; DEFAULT: 0.9] minimum intra-cluster identity.\n  -c COVERAGE, --coverage COVERAGE\n                        [PARAM; DEFAULT: 0.9] minimum intra-cluster coverage.\n  -t N_THREAD, --n_thread N_THREAD\n                        [PARAM; DEFAULT: 8] number of threads to use.\n~~~~~~~~~~~\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzheminzhou%2Fetoki","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzheminzhou%2Fetoki","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzheminzhou%2Fetoki/lists"}