{"id":21651656,"url":"https://github.com/hoytech/bio-regexp","last_synced_at":"2025-03-20T03:59:53.011Z","repository":{"id":10301162,"uuid":"12423508","full_name":"hoytech/Bio-Regexp","owner":"hoytech","description":"Exhaustive DNA/RNA/protein regexp searches","archived":false,"fork":false,"pushed_at":"2014-01-26T18:10:01.000Z","size":216,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-01-25T05:42:46.466Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Perl","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hoytech.png","metadata":{"files":{"readme":"README.pod","changelog":"Changes","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-08-28T03:16:34.000Z","updated_at":"2024-06-27T02:07:31.000Z","dependencies_parsed_at":"2022-08-31T11:23:08.874Z","dependency_job_id":null,"html_url":"https://github.com/hoytech/Bio-Regexp","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hoytech%2FBio-Regexp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hoytech%2FBio-Regexp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hoytech%2FBio-Regexp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hoytech%2FBio-Regexp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hoytech","download_url":"https://codeload.github.com/hoytech/Bio-Regexp/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244547602,"owners_
count":20470103,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-25T07:49:08.612Z","updated_at":"2025-03-20T03:59:52.984Z","avatar_url":"https://github.com/hoytech.png","language":"Perl","funding_links":[],"categories":[],"sub_categories":[],"readme":"=head1 NAME\n\nBio::Regexp - Exhaustive DNA/RNA/protein regexp searches\n\n=head1 SYNOPSIS\n\n    my @matches = Bio::Regexp-\u003enew-\u003edna\n                             -\u003eadd('A?GCYY[^G]{2,3}GCGC')\n                             -\u003eadd('GAATTC')\n                             -\u003ecircular\n                             -\u003ematch($input);\n\n    ## Example match:\n    {\n      'match' =\u003e 'AGCTCAAAGCGC',\n      'start' =\u003e '0',\n      'end' =\u003e '12',\n      'strand' =\u003e 1,\n      'regexp' =\u003e 'A?GCYY[^G]{2,3}GCGC'\n    }\n\n\n=head1 DESCRIPTION\n\nThis module is for searching inside DNA or RNA or protein sequences. The sequence to be found is specified by a restricted version of regular expressions. The restrictions allow us to manipulate the regexp in various ways described below. As well as regular expression character classes, bases can be expressed in IUPAC short form (which are kind of like character classes themselves).\n\nThe goal of this module is to provide a complete search. 
Given the particulars of a sequence (DNA/RNA/protein, linear molecule/circular plasmid, single/double stranded) it attempts to figure out all of the possible matches without any false-positive or duplicated matches.\n\nIt handles cases where matches overlap in the sequence and cases where the regular expression can match in multiple ways. For circular DNA (plasmids) it will find matches even if they span the arbitrary location in the circular sequence selected as the \"start\". For double-stranded DNA it will find matches on the reverse complement strand as well.\n\nThe typical use case of this module is to search for multiple small patterns in large amounts of input data. Although it is optimised for that task it is also efficient at others. For efficiency, none of the input sequence data is copied at all except to extract matches (but this can be disabled with C\u003cno_substr\u003e) and to implement circular searches (though the amount copied is usually very small).\n\n\n\n=head1 INPUT FORMAT\n\nThe input string passed to C\u003cmatch\u003e must be a nucleotide sequence for now (protein sequences will be supported soon). There must be no line breaks or other whitespace, or any other kind of FASTA-like header/data.\n\nIf your data does not conform to the description above then the results are undefined and you should sanitise your data before using this module.\n\nIf your data is anything other than DNA (the default) you must call one of the type functions like C\u003crna\u003e or C\u003cprotein\u003e:\n\n    my $re = Bio::Regexp-\u003enew-\u003erna-\u003eadd('GAUAUC')-\u003ecompile;\n\nNormally however C\u003cT\u003e and C\u003cU\u003e are both compiled into C\u003c[TU]\u003e so your patterns will work on DNA and RNA. If you wish to prevent this and throw an error while compiling your regexp, call C\u003cstrict_thymine_uracil\u003e.\n\nUnless C\u003cstrict_case\u003e is specified, the case of your patterns and the case of your input doesn't matter. 
I suggest using uppercase everywhere.\n\n\n\n\n=head1 EXHAUSTIVE SEARCH\n\nMost methods of searching nucleotide sequences will only find non-overlapping matches in the input. For example, when searching for the sequence C\u003cAA\u003e in the input C\u003cAAAA\u003e, perl's C\u003cm/AA/g\u003e searches will only return 2 matches:\n\n    AAAA\n    --\n      --\n\nWith this module you get all three matches:\n\n    AAAA\n    --\n     --\n      --\n\nFor DNA data this can be useful for finding the comprehensive set of possible molecules that could exist after a restriction enzyme cleaving.\n\n\n\n\n=head1 INTERBASE COORDINATES\n\nAll offsets returned by this module are in \"interbase coordinates\". Rather than the first base in a sequence being described as \"base 1\" as most biologists might think of it, or even \"base 0\" as computer scientists might, with interbase coordinates the first base is described as the sequence spanning coordinates 0 through 1.\n\nOne of the reasons this is useful is because it allows us to unambiguously specify 0-width sequences like for example endonuclease cut sites. If index-style coordinates are used it is ambiguous whether the cut is before or after.\n\nUnlike with string indices, the start coordinate can be greater than the end coordinate. This happens when C\u003cdouble_stranded\u003e is set (the default for DNA) and the pattern is found on the reverse complement strand. Use C\u003csingle_stranded\u003e if you don't want reverse complement matches.\n\nFor circular inputs, interbase coordinates can also be greater than the length of the input. This is interpreted as wrapping back around to the beginning in a modular arithmetic fashion. Similarly, negative coordinates wrap around to the end of the input. 
\"Out-of-range\" interbase coordinates are only defined for circular inputs and referencing them on linear inputs will throw errors.\n\n\n\n\n=head1 IUPAC SHORT FORMS\n\nFor DNA and RNA, IUPAC incompletely specified nucleotide sequences can be used. These are analogous to regular expression character classes. Just like perl's C\u003c\\s\u003e is short for C\u003c[ \\r\\n\\t]\u003e, in IUPAC form C\u003cV\u003e is short for C\u003c[ACG]\u003e, or C\u003c[^T]\u003e. Unless C\u003cstrict_thymine_uracil\u003e is in effect this will actually be like C\u003c[^TU]\u003e for both DNA and RNA inputs.\n\nSee L\u003cwikipedia|http://en.wikipedia.org/wiki/Nucleic_acid_notation\u003e for the list of IUPAC short forms.\n\n\n\n=head1 ADDING MULTIPLE SEARCH PATTERNS\n\nAn important feature of this module is that any number of regular expressions can be combined into one so that many patterns can be searched for simultaneously while doing a single pass over the data.\n\nDoing a single pass is generally more efficient because of memory locality and has other positive side-effects. For instance, we can also scan a strand's reverse complement during the pass and therefore avoid copying and reversing the input (which may be quite large).\n\nThis module should be able to support quite a large number of simultaneous search patterns although I have some ideas for future optimisations if they prove necessary. Large numbers of patterns may come in handy when building a list of all restriction enzymes that don't cut a target sequence, or finding all PCR primer sites accounting for IUPAC expanded primers.\n\nMultiple patterns can be added at once simply by calling C\u003cadd()\u003e multiple times before attempting a C\u003cmatch\u003e (or a C\u003ccompile\u003e):\n\n    my $re = Bio::Regexp-\u003enew;\n\n    $re-\u003eadd($_) for ('GAATTC', 'CCWGG');\n\n    my @matches = $re-\u003ematch($input);\n\nWhich pattern matched is returned as the C\u003cregexp\u003e key in the returned match results. 
You should probably have a hash of all your patterns so that you can look them up while processing matches. The way this is implemented is similar to the very useful L\u003cRegexp::Assemble\u003e except without the hacks needed for ancient perl versions.\n\nWhen matching, only a single pass will be made over the data so as to find all possible locations that either of the added sequences could have matched. Large numbers of patterns should be fairly efficient because the perl 5.10+ regular expression engine uses a trie data structure for such patterns (and 5.10 is the minimum required perl for other reasons).\n\n\n\n\n\n=head1 CIRCULAR INPUTS\n\nIf the C\u003ccircular\u003e method is called, the search sequence C\u003cGAATTC\u003e will match the following input:\n\n    ATTCGGGGGGGGGGGGGGGGGGA\n    ----                 --\n\nThe C\u003cstart\u003e and C\u003cend\u003e coordinates for one of the matches will be 21 and 27. Since the input's length is only 23, we know that it must have wrapped around. In this case there will be another match of coordinates at 27 and 21 because C\u003cGAATTC\u003e is a palindromic sequence.\n\nIn order to make this efficient even with really long input sequences, this module copies only the maximum length your search pattern could possibly be. Being able to figure out the minimum and maximum sequence lengths is one of the reasons why the types of regular expressions you can use with this module are limited.\n\n\n\n=head1 SEE ALSO\n\nL\u003cBio-Regexp github repo|https://github.com/hoytech/Bio-Regexp\u003e\n\nPresentation about Bio::Regexp and more: L\u003cGetting the most out of regular expressions|http://hoytech.github.io/regexp-presentation/\u003e\n\nL\u003cBio::Tools::SeqPattern\u003e from the BioPerl distribution also allows the manipulation of patterns but is less advanced than this module. Also, the way L\u003cBio::Tools::SeqPattern\u003e reverses a regular expression in order to match the reverse complement is... wow. Just wow. 
:)\n\nL\u003cBio::Grep\u003e is an interface to various programs that search biological sequences. L\u003cBio::Grep::Backend::RE\u003e is probably the most comparable to this module.\n\nL\u003cBio::DNA::Incomplete\u003e\n\n\n=head1 AUTHOR\n\nDoug Hoyte, C\u003c\u003c \u003cdoug@hcsw.org\u003e \u003e\u003e\n\n\n=head1 COPYRIGHT \u0026 LICENSE\n\nCopyright 2013 Doug Hoyte.\n\nThis module is licensed under the same terms as perl itself.\n\n\n=cut\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhoytech%2Fbio-regexp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhoytech%2Fbio-regexp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhoytech%2Fbio-regexp/lists"}