{"id":25977599,"url":"https://github.com/nylander/fastagap","last_synced_at":"2025-03-05T04:38:41.728Z","repository":{"id":146538560,"uuid":"265008499","full_name":"nylander/fastagap","owner":"nylander","description":"Remove or replace gaps in fasta formatted files","archived":false,"fork":false,"pushed_at":"2025-01-18T11:41:18.000Z","size":350,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-18T12:30:01.652Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Perl","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nylander.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-05-18T17:14:08.000Z","updated_at":"2025-01-18T11:41:19.000Z","dependencies_parsed_at":"2025-01-18T12:34:32.117Z","dependency_job_id":null,"html_url":"https://github.com/nylander/fastagap","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nylander%2Ffastagap","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nylander%2Ffastagap/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nylander%2Ffastagap/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nylander%2Ffastagap/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nylander","download_url":"https://codeload.github.com/nylander/fastagap/tar.gz/refs/heads/main","host":{"name
":"GitHub","url":"https://github.com","kind":"github","repositories_count":241966989,"owners_count":20050324,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-05T04:38:41.227Z","updated_at":"2025-03-05T04:38:41.720Z","avatar_url":"https://github.com/nylander.png","language":"Perl","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Fastagap - Report, remove, or replace missing data in fasta\n\n## Description\n\nScripts for handling missing data (gaps) in fasta files.\n\n- [`fastagap.pl`](#fastagap) General script for counting, removing, or replacing missing data in fasta formatted files.\n\n- [`degap_fasta_alignment.pl`](#degap\\_fasta\\_alignment) Script for handling \"aligned\" fasta format (sequences of same length).\n\n- [`plot_missing_data.R`](#plot\\_missing\\_data) Script for plotting a heatmap of the output from `fastagap.pl -c -H *.fas`.\n\nSee below for description.\n\n## Installation\n\nThe script [`fastagap.pl`](#fastagap) requires [perl](https://www.perl.org/) with perl\nmodule [List::MoreUtils](https://metacpan.org/pod/List::MoreUtils).\n\nOn a Debian-based Linux system, the module can be installed using `sudo apt\ninstall -y liblist-moreutils-perl`.  
The script `degap_fasta_alignment.pl` uses\nstandard perl modules, so no extra steps are required.\n\nThe script `fastagap.pl` can also be installed (as `fastagap`) using\n[conda](https://docs.conda.io/en/latest/) from the [bioconda\nchannel](https://bioconda.github.io/).\n\n    $ conda install -c bioconda fastagap\n\nThe script `plot_missing_data.R` requires `R` with R-packages `ggplot2`, `tidyr`.\nFor installation, see instructions on \u003chttps://cran.r-project.org/\u003e.\n\n## Usage\n\n### fastagap\n\n            FILE: fastagap.pl\n\n           USAGE: ./fastagap.pl [OPTIONS] fasta\n\n     DESCRIPTION: Report or replace/remove missing-data characters in fasta.\n\n                  Can identify and manipulate leading, trailing, and\n                  inner gap regions.\n\n                  Reads fasta-formatted files and writes to stdout as\n                  fasta or tab-separated output.\n\n                  Default behaviour (no options used) is to remove all\n                  occurrences of missing data represented by the symbol\n                  '-' from the sequences (corresponds to using the\n                  options -A and -G).\n\n                  If an \"all-gap\" sequence is encountered, it will be\n                  excluded from output.\n\n                  In addition, the script can also filter sequences\n                  on min and/or max lengths.\n\n                  See OPTIONS and EXAMPLES for more details.\n\n         OPTIONS:\n                  -c, --count\n                      Count and report. Do not print sequences.\n\n                  -m, --missing=\u003cstring\u003e\n                      Character \u003cstring\u003e, or Perl regex (experimental)\n                      to use as missing symbol. Default is '-'.\n\n                  -G\n                      Set missing symbol to hyphen ('-'). 
(Default)\n\n                  -N\n                      Set missing symbol to 'N' (case sensitive).\n\n                  -Q\n                      Set missing symbol to '?'.\n\n                  -X\n                      Set missing symbol to 'X'.\n\n                  -H, --no-header\n                      Suppress printing of header in table output (use\n                      together with '-c').\n\n                  -A, --remove-all\n                      Remove all missing symbols from sequences. (Default)\n\n                  -L, --remove-leading\n                      Remove all leading missing symbols from sequences.\n\n                  -T, --remove-trailing\n                      Remove all trailing missing symbols from sequences.\n\n                  -I, --remove-inner\n                      Remove all inner missing symbols from sequences.\n\n                  -E, --remove-empty\n                      Explicitly remove empty sequences, i.e., fasta entries\n                      with header only.\n\n                  -PA, --remove-allp=\u003cnumber\u003e\n                      Remove sequence if total amount of missing data exceeds\n                      \u003cnumber\u003e (in percentage). 
That is, allow at most \u003cnumber\u003e\n                      percent missing data.\n\n                  -PL, --remove-leadingp=\u003cnumber\u003e\n                      Remove sequence if total amount of leading missing data\n                      exceeds \u003cnumber\u003e percent.\n\n                  -PT, --remove-trailingp=\u003cnumber\u003e\n                      Remove sequence if total amount of trailing missing data\n                      exceeds \u003cnumber\u003e percent.\n\n                  -PI, --remove-innerp=\u003cnumber\u003e\n                      Remove sequence if total amount of inner missing data\n                      exceeds \u003cnumber\u003e percent.\n\n                  -PLT, --remove-leadingtrailingp=\u003cnumber\u003e\n                      Remove sequence if the sum of leading and trailing missing\n                      data exceeds \u003cnumber\u003e percent.\n\n                  -a, --replace-all=\u003cchar\u003e\n                      Replace all missing symbols with \u003cchar\u003e in sequences.\n\n                  -l, --replace-leading=\u003cchar\u003e\n                      Replace all leading missing symbols with \u003cchar\u003e in\n                      sequences.\n\n                  -t, --replace-trailing=\u003cchar\u003e\n                      Replace all trailing missing symbols with \u003cchar\u003e in\n                      sequences.\n\n                  -i, --replace-inner=\u003cchar\u003e\n                      Replace all inner missing symbols with \u003cchar\u003e in sequences.\n\n                  -V, --Verbose\n                      Print warnings when replacements are attempted on empty\n                      sequences.\n\n                  -v, --version\n                      Print version number.\n\n                  -w, --wrap=\u003cnr\u003e\n                      Wrap fasta sequence to max length \u003cnr\u003e. 
Default is 60.\n\n                  -d, --decimals=\u003cnr\u003e\n                      Use \u003cnr\u003e decimals for ratios in output. Default is 4.\n\n                  -MIN=\u003cnr\u003e\n                      Print sequence if (unfiltered) length is at least \u003cnr\u003e\n                      positions. This option cannot be combined with the\n                      removal options.\n\n                  -MAX=\u003cnr\u003e\n                      Print sequence if (unfiltered) length is at most \u003cnr\u003e\n                      positions. This option cannot be combined with the\n                      removal options.\n\n                  --tabulate\n                      Print tab-separated output (header tab sequence).\n\n                  -uc\n                      Convert sequence to uppercase. Note that the conversion\n                      is done before applying any (case-sensitive)\n                      removal/replacements.\n\n                  -Z\n                      Shortcut for '-A -N -Q -G -X --noverbose'.\n\n                  -h\n                      Show brief help info.\n\n                  --help\n                      Show more help info.\n\n\n       EXAMPLES:  Remove all missing data ('-')\n\n                      $ ./fastagap.pl data/missing.fasta\n\n                  Count missing data\n\n                      $ ./fastagap.pl -c data/missing.fasta\n\n                  Count only 'N' as missing data\n\n                      $ ./fastagap.pl -c -N data/missing.fasta\n\n                  Count '-' and '?' 
as missing data\n\n                      $ ./fastagap.pl -c -G -Q data/missing.fasta\n\n                  Remove all '?'\n\n                      $ ./fastagap.pl -Q data/missing.fasta\n\n                  Remove all leading and trailing missing data\n\n                      $ ./fastagap.pl -L -T data/missing.fasta\n\n                  Replace leading and trailing missing data with 'N'\n\n                      $ ./fastagap.pl -l=N -t=N data/missing.fasta\n\n                  Replace leading, trailing, and inner missing data\n\n                      $ ./fastagap.pl -l=l -t=t -i=i data/missing.fasta\n\n                  Remove leading and trailing, and replace inner\n                  missing data\n\n                      $ ./fastagap.pl -L -T -i=N data/missing.fasta\n\n                  Remove sequence if total amount of missing data\n                  exceeds 30 percent\n\n                      $ ./fastagap.pl -PA=30 data/missing.fasta\n\n                  Remove sequence if amount of leading and trailing\n                  missing data exceeds 30 percent\n\n                      $ ./fastagap.pl -PLT=30 data/missing.fasta\n\n                  Convert input sequence to uppercase before removal\n\n                      $ ./fastagap.pl -uc -N data/missing.fasta\n\n                  Remove sequence if (unfiltered) length is less than\n                  5 positions\n\n                      $ ./fastagap.pl -MIN=5 data/length.fasta\n\n                  Remove sequence if (unfiltered) length is less than\n                  5 positions or more than 10 positions\n\n                      $ ./fastagap.pl -MIN=5 -MAX=10 data/length.fasta\n\n                  Convert fasta to tab-separated output\n\n                      $ ./fastagap.pl --tabulate data/missing.fasta\n\n    REQUIREMENTS: Perl, and perldoc (for --help)\n\n           NOTES: The software will identify leading gaps as a contiguous region\n                  of missing data starting at the very first sequence 
position.\n                  Trailing gaps are the contiguous gap positions until the very\n                  end of the sequence. \"Inner\" gaps are then any gaps in between.\n                  Some examples:\n\n                      '-AAAAA-'   One leading, one trailing\n                      'A-----A'   No leading/trailing, five inner\n                      'A------'   No leading, six trailing (no inner)\n                      'A-A-A-A'   No leading/trailing, three inner\n                      '-A-A-A-'   One leading, one trailing, two inner\n\n                  When encountering a sequence with all missing data the program\n                  will currently not attempt to replace or remove leading and\n                  trailing gaps. Furthermore, if all data are removed for a fasta\n                  entry, the entry is skipped (deleted) in the output (with a\n                  warning written to stderr if '--verbose' is used).\n\n                  The capacity for supplying a regex to represent the missing \n                  characters is experimental. 
Please check the output carefully.\n\n                  Empty sequences, i.e., fasta entries with only a header, can\n                  be explicitly removed using the '-E' ('--remove-empty') option.\n                  They are also removed implicitly, along with \"all-gaps\"\n                  sequences, when using, e.g., '-A' ('--remove-all').\n\n                  To get tab-separated output instead of fasta, use '--tabulate'.\n\n                  To view the table output more easily in a terminal window,\n                  pipe it through the program 'column':\n\n                      $ ./fastagap.pl -c data/missing.fasta | column -t\n\n### degap\\_fasta\\_alignment\n\nNote: for removing columns with missing data from very large alignments, see\nthe software [dfa](https://github.com/nylander/degap_fasta_alignment).\n\n\n             FILE: degap_fasta_alignment.pl\n\n            USAGE: ./degap_fasta_alignment.pl [--all][--any][--outfile=\u003cfile\u003e] fasta_file\n\n      DESCRIPTION: Removes columns with all gaps from aligned FASTA files.\n                   Default (no option arguments) is to remove columns\n                   where all taxa have gaps, thus preserving the\n                   alignment.\n\n          OPTIONS:\n                   --all\n                     Remove all gap characters from the sequences, thus\n                     not preserving the alignment.\n\n                   --any\n                     Remove all columns containing any gaps from the\n                     sequences, while preserving the alignment.\n\n                   --gap=\u003cchar\u003e\n                     Set the gap symbol to \u003cchar\u003e. Default is '-'.\n\n                   --outfile=\u003cfile\u003e\n                     Print to \u003cfile\u003e. 
Default is to print to STDOUT.\n\n### plot\\_missing\\_data\n\n             FILE: plot_missing_data.R\n\n            USAGE: ./plot_missing_data.R [-h|--help] infile.tsv\n\n      DESCRIPTION: Plots a heatmap showing amounts of missing data\n                   per locus.\n\n           OUTPUT: Heatmap figure in PDF format with file ending `.missing_data.pdf`\n\n          OPTIONS:\n                   -h,--help Show help text\n\n## Summary counts for many files\n\nThe script `fastagap.pl` can be used on several input files containing, e.g.,\ndifferent genes for the same samples (fasta headers):\n\n    $ fastagap.pl -c -H *.fas\n\nThis output can be summarized in different ways. For example, if you have\na number of separate gene files with a varying number of sequences, a quick\ncount of genes per taxon (fasta header) can be obtained with (using GNU awk):\n\n    $ fastagap.pl -c -H *.fas | \\\n        awk 'BEGIN{printf(\"label\\tcount\\t%%\\n\")}\n             {L[$1]++;F[$NF]++}\n             END{for(l in L){printf(\"%s\\t%s\\t%s\\n\",l,L[l],L[l]/length(F))}}'\n\nSimilarly, to count the number of taxa (fasta labels) per gene:\n\n    $ fastagap.pl -c -H *.fas | \\\n        awk 'BEGIN{printf(\"file\\tcount\\t%%\\n\")}\n             {L[$1]++;F[$NF]++}\n             END{for(f in F){printf(\"%s\\t%d\\t%.2f\\n\",f,F[f],F[f]/length(L))}}'\n\nFurthermore, to get a visual overview of the \"completeness\", or amount of missing\ndata for each sample, a heatmap can be useful (see Fig. 1):\n\n    $ fastagap.pl -c -H *.fas \u003e counts.tsv\n    $ plot_missing_data.R counts.tsv\n\n![Heatmap showing missing data per sample for 352 loci.](img/counts.tsv.missing_data.png)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnylander%2Ffastagap","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnylander%2Ffastagap","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnylander%2Ffastagap/lists"}