{"id":16713116,"url":"https://github.com/adamtaranto/mimeo","last_synced_at":"2025-07-10T23:32:30.360Z","repository":{"id":57441666,"uuid":"102555615","full_name":"Adamtaranto/mimeo","owner":"Adamtaranto","description":"Scan genomes for internally repeated sequences, elements which are repetitive in another species, or high-identity HGT candidate regions between species.","archived":false,"fork":false,"pushed_at":"2024-11-03T11:49:18.000Z","size":48,"stargazers_count":1,"open_issues_count":0,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-11-03T12:26:57.621Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Adamtaranto.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-09-06T02:57:48.000Z","updated_at":"2024-11-03T11:49:22.000Z","dependencies_parsed_at":"2022-09-06T02:30:11.902Z","dependency_job_id":null,"html_url":"https://github.com/Adamtaranto/mimeo","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Adamtaranto%2Fmimeo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Adamtaranto%2Fmimeo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Adamtaranto%2Fmimeo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Adamtaranto%2Fmimeo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Adamtaranto","download_url":"https://codeload.github.com/Adamtaranto/mimeo/tar.gz/refs/h
eads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225664088,"owners_count":17504439,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-12T20:45:33.696Z","updated_at":"2024-11-21T02:32:26.726Z","avatar_url":"https://github.com/Adamtaranto.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Mimeo\n\n# Table of contents\n\n* [Modules](#modules)\n* [Installing Mimeo](#installing-mimeo)\n* [Example usage](#example-usage)\n* [Standard options](#standard-options)\n  *  [mimeo-self](#mimeo-self)\n  *  [mimeo-x](#mimeo-x)\n  *  [mimeo-map](#mimeo-map)\n  *  [mimeo-filter](#mimeo-filter)\n* [Alternative alignment engines](#importing-alignments)\n* [License](#license)\n\n# Modules\n\nMimeo comprises three tools for parsing repeats from whole-genome alignments:\n\n## mimeo-self  \n\n**Internal repeat finder.** Mimeo-self aligns a genome to itself and extracts high-identity segments above\na coverage threshold. This method is less sensitive to disruption by indels and repeat-directed point mutations than\nk-mer-based methods such as RepeatScout. Reported annotations indicate overlapping segments above the coverage threshold;\nmimeo-self does not attempt to separate nested repeats. Use this tool to identify candidate repeat regions for curated annotation.\n\n## mimeo-x  \n\n**Cross-species repeat finder.** A newly acquired or low-copy transposon may slip past copy-number-based annotation tools. 
Mimeo-x searches for features that are abundant in an external reference genome, allowing for\nannotation of complete elements as they occur in a horizontal-transfer donor species, or of conserved coding segments\nof related transposon families.\n\n## mimeo-map  \n\n**Find all high-identity segments shared between genomes.** Mimeo-map identifies candidate horizontally\ntransferred segments between sufficiently diverged species. When comparing isolates of a single species, aligned segments correspond to directly homologous sequences and internally repetitive features.  \n\n\nIntra- and inter-genomic alignments from Mimeo-self or Mimeo-x can be reprocessed with Mimeo-map to generate annotations of\nunfiltered, uncollapsed alignments. These raw alignment annotations can be used to interrogate repetitive segments for coverage breakpoints corresponding to nested transposons with differing abundances across the genome.  \n\n## mimeo-filter\n\nAn additional tool, **mimeo-filter**, is included to post-filter SSR-rich sequences from FASTA-formatted\ncandidate-repeat libraries.  
\n\n\n# Installing Mimeo\n\nRequirements: \n  * [LASTZ](http://www.bx.psu.edu/~rsharris/lastz/) genome alignment tool from the Miller Lab, Penn State.\n  * [bedtools](http://bedtools.readthedocs.io/en/latest/content/installation.html)\n  * [trf](https://tandem.bu.edu/trf/trf.html)\n\n\nInstall from Bioconda:\n```bash\nconda install mimeo\n```\n\nInstall from PyPI:\n```bash\npip install mimeo\n```\n\nClone and install from this repository:\n```bash\ngit clone https://github.com/Adamtaranto/mimeo.git \u0026\u0026 cd mimeo \u0026\u0026 pip install -e .\n```\n\n# Example usage \n\n### mimeo-self\n\nAnnotate features in genome A which are \u003e 100bp and occur with \u003e=\n80% identity at least 3 times on other scaffolds OR at least 4 times\non the same scaffold.\n\n```bash\nmimeo-self --adir data/A_genome_Split --afasta data/A_genome.fasta \\\n-d MS_outdir --gffout A_genome_Inter3_Intra4_id80_len_100.gff3 \\\n--outfile A_genome_Self_Align.tab --label A_Rep3 --prefix A_Self --minIdt 80 \\\n--minLen 100 --minCov 3 --intraCov 4 --strictSelf\n```\n\nOutput: \n  - MS_outdir/A_genome_Inter3_Intra4_id80_len_100.gff3\n  - MS_outdir/A_genome_Self_Align.tab\n  - data/A_genome_Split/*.fa\n\n### mimeo-x\n\nAnnotate features in genome A which are \u003e 100bp and occur with \u003e=\n80% identity at least 5 times in genome B.\n\n```bash\nmimeo-x --afasta data/A_genome.fasta --bfasta data/B_genome.fasta \\\n-d MX_outdir --gffout B_Rep5_in_A.gff3 --outfile B_Reps_in_A_id80_len100.tab \\\n--label B_Rep5 --prefix B_Rep5 --minIdt 80 --minLen 100 --minCov 5\n```\n\nOutput: \n  - MX_outdir/B_Rep5_in_A.gff3\n  - MX_outdir/B_Reps_in_A_id80_len100.tab\n\n### mimeo-map\n\nAnnotate features in genome A which are \u003e 100bp and occur with \u003e=\n90% identity in genome B. 
No coverage filter, all alignments are reported.\n\n```bash\nmimeo-map --afasta data/A_genome.fasta --bfasta data/B_genome.fasta \\\n-d MM_outdir --gffout B_in_A_id90.gff3 --outfile B_in_A_id90.tab \\\n--label B_90 --prefix B_90 --minIdt 90 --minLen 100 \n```\n\nOutput: \n  - MM_outdir/B_in_A_id90.gff3\n  - MM_outdir/B_in_A_id90.tab\n\n### mimeo-map + SSR filter\n\nAnnotate features in genome A which are \u003e 100bp and occur with \u003e=\n98% identity in genome B. Reuse B to A-genome alignment from the previous run.\n\nFilter out hits which are \u003e= 40% tandem repeats. Write filtered hits\nas tab file and GFF3 annotation.\n\n```bash\nmimeo-map --afasta data/A_genome.fasta --bfasta data/B_genome.fasta \\\n-d MM_outdir --gffout B_in_A_id98_maxSSR40.gff3 --outfile B_in_A_id98.tab \\\n--label B_98 --prefix B_98 --minIdt 98 --minLen 100 \\\n--recycle --maxtandem 40 --writeTRF\n```\n\nOutput: \n  - MM_outdir/B_in_A_id98_maxSSR40.gff3\n  - MM_outdir/B_in_A_id98.tab.trf\n\n### mimeo-filter\n\nFilter sequences comprised of \u003e= 40% short tandem repeats from a multifasta\nlibrary of candidate transposons.\n\n```bash\nmimeo-filter --infile data/candidate_TEs.fa\n```\n\nOutput:\n  - candidate_TEs_filtered.fa\n\n\n# Standard options\n\n### mimeo-self\n\n```\nUsage: mimeo-self [-h] [--adir ADIR] [--afasta AFASTA] [-r] [-d OUTDIR]\n                  [--gffout GFFOUT] [--outfile OUTFILE] [--verbose]\n                  [--label LABEL] [--prefix PREFIX] [--lzpath LZPATH]\n                  [--bedtools BEDTOOLS] [--minIdt MINIDT] [--minLen MINLEN]\n                  [--minCov MINCOV] [--hspthresh HSPTHRESH]\n                  [--intraCov INTRACOV] [--strictSelf]\n\nInternal repeat finder. 
Mimeo-self aligns a genome to itself and extracts\nhigh-identity segments above a coverage threshold.\n\nOptional arguments:\n  -h, --help      Show this help message and exit.\n  --adir          Name of the directory containing sequences from the genome.\n                  Write split files here if providing genome as\n                  multifasta.\n  --afasta        Genome as multifasta.\n  -r, --recycle   Use existing alignment \"--outfile\" if found.\n  -d , --outdir   Write output files to this directory. (Default: cwd)\n  --gffout        Name of GFF3 annotation file.\n  --outfile       Name of alignment result file.\n  --verbose       If set report LASTZ progress.\n  --label         Set annotation TYPE field in gff.\n  --prefix        ID prefix for internal repeats.\n  --lzpath        Custom path to LASTZ executable if not in $PATH.\n  --bedtools      Custom path to bedtools executable if not in $PATH.\n  --minIdt        Minimum alignment identity to report.\n  --minLen        Minimum alignment length to report.\n  --minCov        Minimum depth of aligned segments to report repeat\n                  feature.\n  --hspthresh     Set HSP min score threshold for LASTZ.\n  --intraCov      Minimum depth of aligned segments from the same scaffold\n                  to report feature. 
Used if \"--strictSelf\" mode is\n                  selected.\n  --strictSelf    If set process same-scaffold alignments separately\n                  with the option to use higher \"--intraCov\" threshold.\n                  Sometimes useful to avoid false repeat calls from\n                  staggered alignments over SSRs or short tandem\n                  duplication.\n```\n\n### mimeo-x\n\n```\nUsage: mimeo-x [-h] [--adir ADIR] [--bdir BDIR] [--afasta AFASTA]\n               [--bfasta BFASTA] [-r] [-d OUTDIR] [--gffout GFFOUT]\n               [--outfile OUTFILE] [--verbose] [--label LABEL]\n               [--prefix PREFIX] [--lzpath LZPATH] [--bedtools BEDTOOLS]\n               [--minIdt MINIDT] [--minLen MINLEN] [--minCov MINCOV]\n               [--hspthresh HSPTHRESH]\n\nCross-species repeat finder. Mimeo-x searches for features which are abundant\nin an external reference genome.\n\nOptional arguments:\n  -h, --help      Show this help message and exit.\n  --adir          Name of the directory containing sequences from A genome.\n  --bdir          Name of the directory containing sequences from B genome.\n  --afasta        A genome as multifasta.\n  --bfasta        B genome as multifasta.\n  -r, --recycle   Use existing alignment \"--outfile\" if found.\n  -d , --outdir   Write output files to this directory. 
(Default: cwd)\n  --gffout        Name of GFF3 annotation file.\n  --outfile       Name of alignment result file.\n  --verbose       If set report LASTZ progress.\n  --label         Set annotation TYPE field in GFF.\n  --prefix        ID prefix for B-genome repeats annotated in A-genome.\n  --lzpath        Custom path to LASTZ executable if not in $PATH.\n  --bedtools      Custom path to bedtools executable if not in $PATH.\n  --minIdt        Minimum alignment identity to report.\n  --minLen        Minimum alignment length to report.\n  --minCov        Minimum depth of B-genome hits to report feature in\n                  A-genome.\n  --hspthresh     Set HSP min score threshold for LASTZ.\n```\n\n### mimeo-map\n\n```\nUsage: mimeo-map [-h] [--adir ADIR] [--bdir BDIR] [--afasta AFASTA]\n                 [--bfasta BFASTA] [-r] [-d OUTDIR] [--gffout GFFOUT]\n                 [--outfile OUTFILE] [--verbose] [--label LABEL]\n                 [--prefix PREFIX] [--keeptemp] [--lzpath LZPATH]\n                 [--minIdt MINIDT] [--minLen MINLEN] [--hspthresh HSPTHRESH]\n                 [--TRFpath TRFPATH] [--tmatch TMATCH] [--tmismatch TMISMATCH]\n                 [--tdelta TDELTA] [--tPM TPM] [--tPI TPI]\n                 [--tminscore TMINSCORE] [--tmaxperiod TMAXPERIOD]\n                 [--maxtandem MAXTANDEM] [--writeTRF]\n\nFind all high-identity segments shared between genomes.\n\nOptional arguments:\n  -h, --help      Show this help message and exit.\n  --adir          Name of the directory containing sequences from A genome.\n  --bdir          Name of the directory containing sequences from B genome.\n  --afasta        A genome as multifasta.\n  --bfasta        B genome as multifasta.\n  -r, --recycle   Use existing alignment \"--outfile\" if found.\n  -d, --outdir    Write output files to this directory. (Default: cwd)\n  --gffout        Name of GFF3 annotation file. 
If not set, suppress\n                  output.\n  --outfile       Name of alignment result file.\n  --verbose       If set report LASTZ progress.\n  --label         Set annotation TYPE field in GFF.\n  --prefix        ID prefix for B-genome hits annotated in A-genome.\n  --keeptemp      If set does not remove temp files.\n  --lzpath        Custom path to LASTZ executable if not in $PATH.\n  --minIdt        Minimum alignment identity to report.\n  --minLen        Minimum alignment length to report.\n  --hspthresh     Set HSP min score threshold for LASTZ.\n  --TRFpath       Custom path to TRF executable if not in $PATH.\n  --tmatch        TRF matching weight.\n  --tmismatch     TRF mismatching penalty.\n  --tdelta        TRF indel penalty.\n  --tPM           TRF match probability.\n  --tPI           TRF indel probability.\n  --tminscore     TRF minimum alignment score to report.\n  --tmaxperiod    TRF maximum period size to report.\n  --maxtandem     Max percentage of an A-genome alignment which may be masked by TRF. \n                  If exceeded, the alignment will be discarded.\n  --writeTRF      If set write TRF filtered alignment file for use with\n                  other mimeo modules.\n```\n\n\n### mimeo-filter\n\n```\nUsage: mimeo-filter [-h] --infile INFILE [-d OUTDIR] [--outfile OUTFILE]\n                    [--keeptemp] [--verbose] [--TRFpath TRFPATH]\n                    [--tmatch TMATCH] [--tmismatch TMISMATCH]\n                    [--tdelta TDELTA] [--tPM TPM] [--tPI TPI]\n                    [--tminscore TMINSCORE] [--tmaxperiod TMAXPERIOD]\n                    [--maxtandem MAXTANDEM]\n\nFilter SSR containing sequences from FASTA library of repeats.\n\nOptional arguments:\n  -h, --help            Show this help message and exit.\n  --infile              Name of the directory containing sequences from A genome.\n  -d, --outdir          Write output files to this directory. 
(Default: cwd)\n  --outfile             Name of alignment result file.\n  --keeptemp            If set does not remove temp files.\n  --verbose             If set report LASTZ progress.\n  --TRFpath             Custom path to TRF executable if not in $PATH.\n  --tmatch              TRF matching weight.\n  --tmismatch           TRF mismatching penalty.\n  --tdelta              TRF indel penalty.\n  --tPM                 TRF match probability.\n  --tPI                 TRF indel probability.\n  --tminscore           TRF minimum alignment score to report.\n  --tmaxperiod          TRF maximum period size to report. Note: Setting this\n                        score too high may exclude some LTR retrotransposons.\n                        Optimal len to exclude only SSRs is 10-50bp.\n  --maxtandem           Max percentage of a sequence which may be masked by\n                        TRF. If exceeded, the element will be discarded.\n\n```\n\n\n# Importing alignments\n\nWhole-genome alignments generated by alternative tools (e.g. 
BLAT) can be provided to any of the Mimeo modules\nas a tab-delimited file with the columns:\n\n```\n[1]   name1     = Name of target sequence in genome A\n[2]   strand1   = Strand of alignment in target sequence\n[3]   start1    = 5-prime position of alignment in target (lower value irrespective of strand)\n[4]   end1      = 3-prime position of alignment in target (higher value irrespective of strand)\n[5]   name2     = Name of source sequence in genome B\n[6]   strand2   = Strand of alignment in source\n[7]   start2+   = 5-prime position of alignment in source (lower value irrespective of strand)\n[8]   end2+     = 3-prime position of alignment in source (higher value irrespective of strand)\n[9]   score     = Alignment score as int\n[10]  identity  = Identity of alignment as float\n```\n\nThe file should be sorted by columns 1, 3, and 4.\n\n# License\n\nSoftware provided under the MIT license.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadamtaranto%2Fmimeo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fadamtaranto%2Fmimeo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadamtaranto%2Fmimeo/lists"}