{"id":13482517,"url":"https://github.com/knowitall/openregex","last_synced_at":"2025-03-27T13:32:01.134Z","repository":{"id":2128642,"uuid":"3071700","full_name":"knowitall/openregex","owner":"knowitall","description":"An efficient and flexible token-based regular expression language and engine.","archived":false,"fork":false,"pushed_at":"2014-03-20T14:28:32.000Z","size":1014,"stargazers_count":75,"open_issues_count":1,"forks_count":16,"subscribers_count":25,"default_branch":"master","last_synced_at":"2024-10-30T16:40:53.837Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"lgpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/knowitall.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2011-12-29T23:27:34.000Z","updated_at":"2024-09-21T14:56:08.000Z","dependencies_parsed_at":"2022-07-14T08:09:00.017Z","dependency_job_id":null,"html_url":"https://github.com/knowitall/openregex","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/knowitall%2Fopenregex","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/knowitall%2Fopenregex/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/knowitall%2Fopenregex/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/knowitall%2Fopenregex/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/knowitall","download_url":"https://codeload.github.com/knowitall/openregex/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com"
,"kind":"github","repositories_count":245854474,"owners_count":20683359,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T17:01:02.848Z","updated_at":"2025-03-27T13:32:00.737Z","avatar_url":"https://github.com/knowitall.png","language":"Java","funding_links":[],"categories":["Packages","函式庫"],"sub_categories":["Libraries","書籍"],"readme":"# OpenRegex\n\nOpenRegex is written by Michael Schmitz at the Turing Center\n\u003chttp://turing.cs.washington.edu/\u003e.  It is licensed under the GNU Lesser General Public License (LGPL).\nPlease see the LICENSE file for more details.\n\n\n## Introduction\n\nOpenRegex is an efficient and flexible token-based regular expression language\nand engine.  Most regular expression implementations are closed to run only\nover characters.  Although this is the most common application for regular\nexpressions, OpenRegex does not have this restriction.  OpenRegex is open to\nany sequence of user-defined objects.\n\n\n## Applied to Natural Language\n\nFor example, OpenRegex is used in the R2A2 extension to ReVerb, an open-domain\ninformation extractor, to determine argument boundaries.  In this case, tokens\nare words in English sentences with additional information (the string of the\nword, the part-of-speech tag, and the chunk tag).\n\n    case class WordToken(string: String, postag: String, chunk: String)\n\nNow that we have defined our token, we can build up a sentence (an NLP library\nsuch as OpenNLP can help out here).  
We will also need to define a way to\ntranslate each token in the expression (text between \u003cangle brackets\u003e) into\nan expression that can be applied to a word token.\n\n```\n  def compile(string: String): RegularExpression[WordToken] = {\n    // create a parser for a regular expression language that has\n    // the same token representation\n    val parser =\n      new RegularExpressionParser[WordToken]() {\n        // Translate a string \"part=value\" into a BaseExpression that\n        // checks whether the part of a WordToken has value 'value'.\n        override def factory(string: String): BaseExpression[WordToken] = {\n          new BaseExpression[WordToken](string) {\n            val Array(part, quotedValue) = string.split(\"=\")\n            val value = quotedValue.drop(1).take(quotedValue.size - 2)\n            override def apply(entity: WordToken) = {\n              part match {\n                case \"string\" =\u003e entity.string equalsIgnoreCase value\n                case \"postag\" =\u003e entity.postag equalsIgnoreCase value\n                case \"chunk\" =\u003e entity.chunk equalsIgnoreCase value\n              }\n            }\n          }\n        }\n      }\n\n    parser.parse(string)\n  }\n```\n\nNow we can compile a regular expression and apply it to a sentence.  Consider\nthe following pattern.  The first line defines a non-matching group that\nmatches a determiner (\"a\", \"an\", or \"the\").  The second line matches a sequence\nof part-of-speech tags (\"JJ\" is adjective, \"NNP\" is proper noun, and \"NN\" is\ncommon noun).\n\n    (?:\u003cstring='a'\u003e | \u003cstring='an'\u003e | \u003cstring='the'\u003e)?\n    \u003cpostag=\"JJ\"\u003e* \u003cpostag='NNP'\u003e+ \u003cpostag='NN'\u003e+ \u003cpostag='NNP'\u003e+\n\nWe can try applying it to a couple of sentences.\n\n1.  The US president Barack Obama is travelling to Mexico.\n\n```\n    regex.find(sentence).groups.get(0) matches \"The US president Barack Obama\"\n```\n\n2.  
If all the ice melted from the frigid Earth continent Antarctica, sea\n    levels would rise hundreds of feet.\n\n```\n    regex.find(sentence).groups.get(0) matches \"the frigid Earth continent Antarctica\"\n```\n\nWe may want to pull out the text from certain parts of our match.  We can do\nthis with either named or unnamed groups.  Consider the following new form of\nthe pattern and the sentence in example 2.\n\n```\n      (?:\u003cstring=\"a\"\u003e | \u003cstring=\"an\"\u003e | \u003cstring=\"the\"\u003e)? \u003cpostag=\"JJ\"\u003e*\n      (\u003carg1\u003e:\u003cpostag='NNP'\u003e+) (\u003crel\u003e:\u003cpostag='NN'\u003e+) (\u003carg2\u003e:\u003cpostag='NNP'\u003e+)\n\n      regex.find(sentence).groups.get(0) matches \"the frigid Earth continent Antarctica\"\n      regex.find(sentence).groups.get(1) matches \"Earth\"\n      regex.find(sentence).groups.get(2) matches \"continent\"\n      regex.find(sentence).groups.get(3) matches \"Antarctica\"\n\n      regex.find(sentence).group(\"arg1\") matches \"Earth\"\n      regex.find(sentence).group(\"rel\")  matches \"continent\"\n      regex.find(sentence).group(\"arg2\") matches \"Antarctica\"\n```\n\n## Supported Constructs\n\nThe regular expression library supports the following constructs.\n\n```\n    | alternation\n    ? option\n    * Kleene-star\n    + plus\n    ^ beginning\n    $ end\n    {x,y}     match at least x but not more than y times\n    ()        matching groups\n    (?:)      non-matching groups\n    (\u003cname\u003e:) named groups\n```\n\nMost of these operators work the same as in java.util.regex.  Presently,\nhowever, alternation binds to its immediate neighbors.  
This means that `\u003ca\u003e \u003cb\u003e | \u003cc\u003e`\nmeans `\u003ca\u003e (?:\u003cb\u003e | \u003cc\u003e)` whereas in Java it would mean `(?:\u003ca\u003e \u003cb\u003e) | \u003cc\u003e`.\nThis may change in a future release so it is advised that the\nalternation arguments be made explicit with non-matching groups.\n\nAll operators are greedy, and there are no non-greedy counterparts.\nBackreferences are not supported because the underlying representation only\nsupports regular languages (backreferences are not regular).\n\n\n## Simple Java Example\n\nThe NLP example is rather complex but it shows the power of OpenRegex.  For a\nsimpler example, look at RegularExpressions.word.  This is a static factory\nmethod for a simple word-based regular expression where only the string is\nconsidered.  This factory is used in the test cases.\n\nYou can also play around with RegularExpressions.word by running the main\nmethod in RegularExpression and specifying an expression with arg1.\n\n    sbt 'run-main edu.washington.cs.knowitall.regex.RegularExpression \"\u003cthe\u003e \u003cfat\u003e* \u003ccows\u003e \u003care\u003e \u003cmooing\u003e (?:\u003cloudly\u003e)?\"'\n\n\n## Logic Expressions\n\nIncluded is an engine for parsing and evaluating logic expressions.  For\nexample, you might want to extend the NLP regular expression language to be\nable to check multiple fields in a single regular expression token.  If you\nassumed each regular expression token to be a logic expression, you could\nwrite patterns such as the following.\n\n```\n    \u003cstring=\"the\" \u0026 postag=\"DT\"\u003e \u003cpostag=\"JJ\"\u003e \u003cstring=\"earth\" | postag=\"NNP\"\u003e\n```\n\nExtending the regular expression in this way is easy.  
It only involves\nrewriting the apply method in BaseExpression inside the compile method.\nMost of the code below existed before--now it's just moved outside the\napply method.\n\n```\n    val logic = new LogicExpressionParser[WordToken] {\n      override def factory(expr: String) = {\n        new Arg.Pred[WordToken](expr) {\n          val Array(part, quotedValue) = expr.split(\"=\")\n          val value = quotedValue.drop(1).take(quotedValue.size - 2)\n          override def apply(entity: WordToken) = part match {\n            case \"string\" =\u003e entity.string == value\n            case \"postag\" =\u003e entity.postag == value\n            case \"chunk\" =\u003e entity.chunk == value\n          }\n        }\n      }\n    }.parse(value)\n\n    override def apply(entity: WordToken) = {\n      logic.apply(entity)\n    }\n```\n\nPlay around with logic expressions by using the main method in LogicExpression.\n\n    sbt 'run-main edu.washington.cs.knowitall.logic.LogicExpression'\n\nYou can enter logic expressions such as \"true \u0026 false\" or \"true | false\" and\nhave them evaluated interactively.\n\n\n## Implementation\n\nRegular expressions are evaluated using a Thompson NFA, which is fast and does not have\nthe pathological cases that most regular expression libraries have.  For more\ninformation about Thompson NFAs in comparison to recursive backtracking, read\nhttp://swtch.com/~rsc/regexp/regexp1.html.  Future work may involve compiling\nNFAs to DFAs.\n\n\n## Future Work\n\n1.  Compile to DFA.\n2.  Use parser combinators for parsing regular expressions.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fknowitall%2Fopenregex","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fknowitall%2Fopenregex","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fknowitall%2Fopenregex/lists"}