{"id":15546850,"url":"https://github.com/curious-odd-man/rgxgen","last_synced_at":"2025-04-04T23:09:12.156Z","repository":{"id":54520642,"uuid":"218779355","full_name":"curious-odd-man/RgxGen","owner":"curious-odd-man","description":"Regex: generate matching and non matching strings based on regex pattern.","archived":false,"fork":false,"pushed_at":"2024-04-25T19:58:14.000Z","size":34926,"stargazers_count":92,"open_issues_count":12,"forks_count":13,"subscribers_count":2,"default_branch":"dev","last_synced_at":"2025-04-04T23:09:00.909Z","etag":null,"topics":["generat","generator","java","maven","regex","regex-pattern","regexp","regular-expression","regular-expressions","text-generation"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/curious-odd-man.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-10-31T14:01:14.000Z","updated_at":"2025-03-31T15:29:03.000Z","dependencies_parsed_at":"2024-04-25T20:52:08.366Z","dependency_job_id":"617cdb1b-81d9-4b4e-8338-533f3124816b","html_url":"https://github.com/curious-odd-man/RgxGen","commit_stats":{"total_commits":376,"total_committers":2,"mean_commits":188.0,"dds":0.005319148936170248,"last_synced_commit":"4aa1542760f0f9af1619080d3d68ce38f54deef9"},"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/curious-odd-man%2FRgxGen","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/curious-odd-man%2FRgxGen/ta
gs","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/curious-odd-man%2FRgxGen/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/curious-odd-man%2FRgxGen/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/curious-odd-man","download_url":"https://codeload.github.com/curious-odd-man/RgxGen/tar.gz/refs/heads/dev","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247261612,"owners_count":20910108,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["generat","generator","java","maven","regex","regex-pattern","regexp","regular-expression","regular-expressions","text-generation"],"created_at":"2024-10-02T13:05:03.781Z","updated_at":"2025-04-04T23:09:12.139Z","avatar_url":"https://github.com/curious-odd-man.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Regex: generate matching and non-matching strings\n\nThis is a java library that, given a regex pattern, allows to:\n\n1. Generate matching strings\n1. Iterate through unique matching strings\n1. 
Generate not matching strings\n\n# Table of contents\n\n[Status](https://github.com/curious-odd-man/RgxGen#status) \u003cbr\u003e\n[Try it now](https://github.com/curious-odd-man/RgxGen#try-it-now) \u003cbr\u003e\n[Usage](https://github.com/curious-odd-man/RgxGen#usage) \u003cbr\u003e\n[Supported Syntax](https://github.com/curious-odd-man/RgxGen#supported-syntax) \u003cbr\u003e\n[Configuration](https://github.com/curious-odd-man/RgxGen#configuration) \u003cbr\u003e\n[Limitations](https://github.com/curious-odd-man/RgxGen#limitations) \u003cbr\u003e\n[Other similar libraries](https://github.com/curious-odd-man/RgxGen#other-tools-to-generate-values-by-regex-and-why-this-might-be-better) \u003cbr\u003e\n[Support](https://github.com/curious-odd-man/RgxGen#support)\n\n## Status\n\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg?style=plastic)](https://opensource.org/licenses/Apache-2.0)\n[![Maven Central](https://maven-badges.herokuapp.com/maven-central/com.github.curious-odd-man/rgxgen/badge.svg?style=plastic)](https://search.maven.org/search?q=a:rgxgen)\n[![javadoc](https://javadoc.io/badge2/com.github.curious-odd-man/rgxgen/javadoc.svg?style=plastic)](https://javadoc.io/doc/com.github.curious-odd-man/rgxgen)\n\nBuild status:\n\n|                                                             Latest Release                                                             |                                                           Latest snapshot                                                           |\n|:--------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------:|\n|    [![Build Status](https://travis-ci.com/curious-odd-man/RgxGen.svg?branch=master)](https://travis-ci.com/curious-odd-man/RgxGen)     |    [![Build 
Status](https://travis-ci.com/curious-odd-man/RgxGen.svg?branch=dev)](https://travis-ci.com/curious-odd-man/RgxGen)     |\n| [![codecov](https://codecov.io/gh/curious-odd-man/RgxGen/branch/master/graph/badge.svg)](https://codecov.io/gh/curious-odd-man/RgxGen) | [![codecov](https://codecov.io/gh/curious-odd-man/RgxGen/branch/dev/graph/badge.svg)](https://codecov.io/gh/curious-odd-man/RgxGen) |\n\n## Try it now!!!\n\nFollow the link to Online IDE with created project: [JDoodle](https://www.jdoodle.com/ia/X63).\nEnter your pattern and see the results.\n\n## Usage\n\n### Maven dependency\n\n#### The Latest RELEASE:\n\n[mvnrepository.com](https://mvnrepository.com/artifact/com.github.curious-odd-man/rgxgen)\n\n```xml\n\n\u003cdependency\u003e\n    \u003cgroupId\u003ecom.github.curious-odd-man\u003c/groupId\u003e\n    \u003cartifactId\u003ergxgen\u003c/artifactId\u003e\n    \u003cversion\u003e2.0\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n### Code:\n\nNote - RgxGen is not thread safe - there were reports on errors - see [#91](https://github.com/curious-odd-man/RgxGen/issues/91).\n\n```java\npublic class Main {\n    public static void main(String[] args) {\n        RgxGen rgxGen = RgxGen.parse(\"[^0-9]*[12]?[0-9]{1,2}[^0-9]*\");       // Create generator\n        String s = rgxGen.generate();                                        // Generate new random value\n        Optional\u003cBigInteger\u003e estimation = rgxGen.getUniqueEstimation();      // The estimation (not accurate, see Limitations) how much unique values can be generated with that pattern.\n        StringIterator uniqueStrings = rgxGen.iterateUnique();               // Iterate over unique values (not accurate, see Limitations)\n        String notMatching = rgxGen.generateNotMatching();                   // Generate not matching string\n    }\n}\n```\n\n```java\npublic class Main {\n    public static void main(String[] args) {\n        RgxGen rgxGen = RgxGen.parse(\"[^0-9]*[12]?[0-9]{1,2}[^0-9]*\");  
     // Create generator\n        Random rnd = new Random(1234);\n        String s = rgxGen.generate(rnd);                                     // Generate first value\n        String s1 = rgxGen.generate(rnd);                                    // Generate second value\n        String s2 = rgxGen.generate(rnd);                                    // Generate third value\n        String notMatching = rgxGen.generateNotMatching(rnd);                // Generate not matching string\n        // On each launch s, s1 and s2 will be the same\n    }\n}\n```\n\n## Supported syntax\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eSupported syntax\u003c/b\u003e\u003c/summary\u003e\n\n|                        Pattern | Description                                                                                                                          |\n|-------------------------------:|--------------------------------------------------------------------------------------------------------------------------------------|\n|                            `.` | Any symbol. See below details - `Dot pattern generated symbols` section.                                                             
|\n|                            `?` | One or zero occurrences                                                                                                              |\n|                            `+` | One or more occurrences                                                                                                              |\n|                            `*` | Zero or more occurrences                                                                                                             |\n|                           `\\r` | Carriage return `CR` character                                                                                                       |\n|                           `\\t` | Tab `\t` character                                                                                                                    |\n|                           `\\n` | Line feed `LF` character.                                                                                                            |\n|                           `\\d` | A digit. Equivalent to `[0-9]`                                                                                                       |\n|                           `\\D` | Not a digit. Equivalent to `[^0-9]`                                                                                                  |\n|                           `\\s` | Configurable. By default: Space or Tab. See `WHITESPACE_DEFINITION` property.                                                        |\n|                           `\\S` | Anything, but Carriage Return, Space, Tab, Newline, Vertical Tab, Form Feed                                                          |\n|                           `\\w` | Any word character. Equivalent to `[a-zA-Z0-9_]`                                                                                     |\n|                           `\\W` | Anything but a word character. 
Equivalent to `[^a-zA-Z0-9_]`                                                                         |\n|                           `\\i` | Places same value as capture group with index `i`. `i` is any integer number.                                                        |\n|                  `\\Q` and `\\E` | Any characters between `\\Q` and `\\E`, including metacharacters, will be treated as literals.                                         |\n|                  `\\b` and `\\B` | These characters are ignored. No validation is performed!                                                                            |\n|          `\\xXX` and `\\x{XXXX}` | Hexadecimal value of unicode characters 2 or 4 hexadecimal digits                                                                    |\n|                       `\\uXXXX` | Hexadecimal value of unicode characters 4 hexadecimal digits                                                                         |\n|                      `\\p{...}` | Any character in class. See details below before use.                                                                                |\n|                      `\\P{...}` | Any character not in class. See details below before use.                                                                            |\n|              `{a}` and `{a,b}` | Repeat a; or min a max b times. Use {n,} to repeat at least n times.                                                                 |\n|                        `[...]` | Single character from ones that are inside brackets. `[a-zA-Z]` (dash) also supported                                                |\n|                       `[^...]` | Single character except the ones in brackets. 
`[^a]` - any symbol except 'a'                                                         |\n|                           `()` | To group multiple characters for the repetitions                                                                                     |\n| `foo(?=bar)` and `(?\u003c=foo)bar` | Limited support. Positive lookahead and lookbehind. These are equivalent to `foobar`. Please see `Lookahead and Lookbehind` section. |\n| `foo(?!bar)` and `(?\u003c!foo)bar` | Limited support. Negative lookahead and lookbehind. Please see `Lookahead and Lookbehind` section.                                   |\n|        \u003ccode\u003e(a\u0026#124;b)\u003c/code\u003e | Alternatives                                                                                                                         |\n|                             \\\\ | Escape character (use \\\\\\\\ (double backslash) to generate single \\ character)                                                        |\n\nRgxGen treats any other characters as literals - those are generated as is.\n\n\u003c/details\u003e\n\n## Configuration\n\nRgxGen can be configured per instance.\n\nPlease refer to the following enum for all available\nproperties: [`com.github.curiousoddman.rgxgen.config.RgxGenOption`](src/main/java/com/github/curiousoddman/rgxgen/config/RgxGenOption.java).\n\n### Create and Use Configuration Properties\n\nUse `new RgxGenProperties()` to create properties object.\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eCode example\u003c/b\u003e\u003c/summary\u003e\n\n```java\npublic class Main {\n    public static void main(String[] args) {\n        // Create properties object (RgxGenProperties extends java.util.Properties)\n        RgxGenProperties properties = new RgxGenProperties();\n        // Set value \"20\" for INFINITE_PATTERN_REPETITION option in properties\n        RgxGenOption.INFINITE_PATTERN_REPETITION.setInProperties(properties, 20);\n        // ... 
now properties can be passed to RgxGen\n        RgxGen rgxGen_3 = RgxGen.parse(properties, \"my-cool-pattern\");\n    }\n}\n```\n\n\u003c/details\u003e\n\n## Limitations\n\n### Dot pattern generated symbols\n\nIn regex dot `.` means any symbol.\n\nBy default, this would generate any value in a range defined in `ASCII_SYMBOL_RANGE`\nhere [`com.github.curiousoddman.rgxgen.parsing.dflt.ConstantsProvider.java`](src/main/java/com/github/curiousoddman/rgxgen/parsing/dflt/ConstantsProvider.java)\ni.e.: any character starting from `space` to `~`.\n\nYou can modify range of allowed values using `DOT_MATCHES_ONLY` configuration property.\n\nFor example:\n\n```java\npublic class Main {\n    public static void main(String[] args) {\n        RgxGenProperties properties = new RgxGenProperties();\n        RgxGenOption.DOT_MATCHES_ONLY.setInProperties(properties, RgxGenCharsDefinition.of(\"abc\"));\n        RgxGen rgxGen = RgxGen.parse(properties, \".\");\n        String generatedValue = rgxGen.generate();      // Will produce either \"a\" or \"b\" or \"c\".\n    }\n}\n```\n\n### Lookahead and Lookbehind\n\nCurrently, these two have very limited support. 
Please refer\nto [#63](https://github.com/curious-odd-man/RgxGen/issues/63).\nI'm currently working on the solution, but I cannot say when I will come up with something.\n\n### Estimation\n\n`rgxGen.getUniqueEstimation()` - might not be accurate, because it does not count actual unique values, but only counts\ndifferent states of each building block of the expression.\nFor example: `\"(a{0,2}|b{0,2})\"`  will be estimated as 6, though the actual number of unique values is 5.\nThat is because the left and right alternatives can produce the same value.\nAt the same time `\"(|(a{1,2}|b{1,2}))\"` will be correctly estimated as 5, though it will generate the same values.\n\n### Uniqueness\n\nFor similar reasons as with estimation - the requested unique values iterator can contain duplicates.\n\n### Infinite patterns\n\nBy design `a+`, `a*` and `a{n,}` patterns in regex imply that an unbounded number of characters can be matched.\nWhen generating data, that would mean values of infinite length might be generated.\nIt is highly doubtful anyone would require a string of infinite length, thus I've artificially limited repetitions in\nsuch patterns to 100 symbols, when generating random values.\nThis value can be changed - please refer to [configuration](https://github.com/curious-odd-man/RgxGen#configuration)\nsection.\n\nIn contrast, when generating **unique values** - the maximum number of repetitions is `Integer.MAX_VALUE`.\n\nUse `a{n,m}` if you require some specific number of repetitions.\nIt is suggested to avoid using such infinite patterns to generate data based on regex.\n\n### Not matching values generation\n\nThe general rule is - I am trying to generate not matching strings of the same length as matching strings would be, though\nit is not always possible.\nFor example pattern `.` - any symbol - would yield an empty string as the not matching string.\nAnother example `a{0,2}` - for this pattern not matching string would be an empty string, but I would only generate\nthe resulting strings of 1 or 2 
symbols long.\nI chose these approaches because they are predictable and, probably, desirable for users.\n\n#### Which values are used in non-matching generation\n\nWhenever a non-matching result is requested, with either `RgxGen.parse(\".\").generateNotMatching()` method or with a pattern\nlike `\"[^a-z]\"` - the generator has to choose which characters count as not matching the mentioned characters.\nFor example - for `\"[^a-z]\"` - any unicode character except the ones in a range `a-z` would be ok. Though that would\ninclude non-printable, all kinds of blank characters and all the different weird unicode characters. I see that as\nnot an expected behavior. Thus, I have defined 2 different universe ranges of characters that are used - one for\nthe ASCII only characters and another - for unicode characters.\n\nThese ranges are defined here:\n\n- ASCII: `ASCII_SYMBOL_RANGE` constant\n  in [`com.github.curiousoddman.rgxgen.parsing.dflt.ConstantsProvider.java`](src/main/java/com/github/curiousoddman/rgxgen/parsing/dflt/ConstantsProvider.java)\n- Unicode: `UNICODE_SYMBOL_RANGE` constant\n  in [`com.github.curiousoddman.rgxgen.parsing.dflt.ConstantsProvider.java`](src/main/java/com/github/curiousoddman/rgxgen/parsing/dflt/ConstantsProvider.java)\n\n`UNICODE_SYMBOL_RANGE` is currently used ONLY when Character Classes are used (`\\p{}` or `\\P{}` patterns).\nBy default `ASCII_SYMBOL_RANGE` is used.\n\nTo generate not matching characters I take one of the aforementioned constant ranges and subtract the characters provided in\nthe pattern - the resulting range is the one that is used for non-matching generation.\nFor example for pattern `\"[^a-z]\"` `ASCII_SYMBOL_RANGE` will be used as a universe.\nThe result then will be `ASCII_SYMBOL_RANGE` except `A-z` = `space - @` union `{ - ~`\n\n### Unicode Categories\n\nBe warned - unicode categories might provide unexpectedly wrong results depending on the Java version used:\n[#99](https://github.com/curious-odd-man/RgxGen/issues/99). 
To be absolutely sure that on your java version patterns are\ngenerated correctly I suggest running RgxGen tests with your java version.\n\nTo create categories I've used Java (corretto-17.0.10) `Pattern.compile()` to split characters into categories.\nUnfortunately there were several character categories that are not supported by Java `Pattern.compile()` as a result\nthese are not currently supported.\n\nFor complete list of characters per category please refer to [this](data/categories) directory.\nEach file represents one category. Each line in a file describes one symbol from the category.\n\nSupported keys for categories can be found\nin [`com.github.curiousoddman.rgxgen.model.UnicodeCategory`](src/main/java/com/github/curiousoddman/rgxgen/model/UnicodeCategory.java)\n\n## Other tools to generate values by regex and why this might be better\n\nThere are 2 more libraries available to achieve same goal:\n\n1. https://github.com/mifmif/Generex\n1. http://code.google.com/p/xeger\n\nThough I found they have the following issues:\n\n1. All of them build graph which can easily produce OOM exception. For example pattern `a{60000}`,\n   or [IPV6 regex pattern](https://stackoverflow.com/questions/53497/regular-expression-that-matches-valid-ipv6-addresses).\n1. Alternatives - only 2 alternatives gives equal probability of each alternative to appear in generated values. For\n   example: `(a|b)` the probability of a and b is equal. For `(a|b|c)` it would be expected to have a or b or c with\n   probability 33.(3)% each. Though really the probabilities are a=50%, and b=25% and c=25% each. For longer\n   alternatives you might never get the last alternative.\n1. They are quite slow\n1. Lightweight. 
This library does not have any dependencies.\n\n## Support\n\nI plan to support this library, so you're welcome to open issues or reach me by e-mail in case of any questions.\nAny suggestions, feature requests or bug reports are welcome!\n\nPlease vote up my answer on [StackOverflow](https://stackoverflow.com/a/58813696/4174003) to help others find this\nlibrary.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcurious-odd-man%2Frgxgen","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcurious-odd-man%2Frgxgen","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcurious-odd-man%2Frgxgen/lists"}