https://github.com/mtumilowicz/java11-regex

Overview of java regex API.
https://github.com/mtumilowicz/java11-regex
java regex regex-match regex-pattern
Last synced: 12 months ago
JSON representation
Overview of java regex API.
Host: GitHub
URL: https://github.com/mtumilowicz/java11-regex
Owner: mtumilowicz
Created: 2018-12-08T08:53:48.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2018-12-08T23:24:37.000Z (over 7 years ago)
Last Synced: 2025-06-24T08:47:25.688Z (about 1 year ago)
Topics: java, regex, regex-match, regex-pattern
Language: Java
Homepage:
Size: 83 KB
Stars: 1
Watchers: 0
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project

README

          [![Build Status](https://travis-ci.com/mtumilowicz/java11-regex.svg?branch=master)](https://travis-ci.com/mtumilowicz/java11-regex)

# java11-regex

Overview of java regex API.

_Reference_: https://docs.oracle.com/javase/10/docs/api/java/util/regex/Pattern.html  

_Reference_: https://stackoverflow.com/questions/5319840/greedy-vs-reluctant-vs-possessive-quantifiers  

_Reference_: https://javascript.info/regexp-groups#example  

_Reference_: https://stackoverflow.com/questions/6664151/difference-between-b-and-b-in-regex  

_Reference_: https://stackoverflow.com/questions/4250062/what-is-the-difference-between-and-a-and-z-in-regex

# preface

A regular expression is a way to describe a pattern in a sequence 

of characters.

## characters

|Construct   |Matches   |

|---|---|

|`x`    |The character x   |

|`\\`   |The backslash character   |

|`\t`   |The tab character (`'\u0009'`)   |

|`\n`   |The newline (line feed) character (`'\u000A'`)   |

## character classes

|Construct   |Matches   |

|---|---|

|`[abc]`   |a, b, or c (simple class)   |

|`[^abc]`   |Any character except a, b, or c (negation)   |

|`[a-zA-Z]`   |a through z or A through Z, inclusive (range)   |

|`[a-d[m-p]]`   |a through d, or m through p: `[a-dm-p]` (union)   |

|`[a-z&&[def]]`   |d, e, or f (intersection)   |

|`[a-z&&[^bc]]`   |a through z, except for b and c: `[ad-z]` (subtraction)   |

|`[a-z&&[^m-p]]`   |a through z, and not m through p: `[a-lq-z]`(subtraction)   |

## predefined character classes

|Construct   |Matches   |

|---|---|

|`.`  |   Any character (may or may not match line terminators)|

|`\d` |   A digit: `[0-9]`|

|`\D` |   A non-digit: `[^0-9]`|

|`\s` |   A whitespace character: `[ \t\n\x0B\f\r]`|

|`\S` |   A non-whitespace character: `[^\s]`|

|`\w` |   A word character: `[a-zA-Z_0-9]`|

|`\W` |   A non-word character: `[^\w]`|

## boundary matchers

|Construct   |Matches   |

|---|---|

|`^`    |The beginning of a line|

|`$`    |The end of a line|

|`\b`   |A word boundary|

|`\B`   |A non-word boundary|

|`\A`   |The beginning of the input|

|`\z`   |The end of the input|

## linebreak matcher

`\R` - Any Unicode linebreak sequence

## greedy, reluctant, possessive

* A **greedy** quantifier first matches as much as possible and 

then "backtracs" one by one element towards the beginning.

* A **reluctant** or "non-greedy" quantifier first matches 

as little as possible then goes one by one element towards 

the end.

* A **possessive** quantifier is just like the greedy 

quantifier, but it doesn't backtrack.

### greedy quantifiers

|Construct   |Matches   |

|---|---|

|`X?`       |X, once or not at all|

|`X*`       |X, zero or more times|

|`X+`       |X, one or more times|

|`X{n}`     |X, exactly n times|

|`X{n,}`    |X, at least n times|

|`X{n,m}`   |X, at least n but not more than m times|

### reluctant quantifiers

|Construct   |Matches   |

|---|---|

|`X??`        |X, once or not at all|

|`X*?`        |X, zero or more times|

|`X+?`        |X, one or more times|

|`X{n}?`      |X, exactly n times|

|`X{n,}?`     |X, at least n times|

|`X{n,m}?`    |X, at least n but not more than m times|

### possessive quantifiers

|Construct   |Matches   |

|---|---|

|`X?+`        |X, once or not at all|

|`X*+`        |X, zero or more times|

|`X++`        |X, one or more times|

|`X{n}+`      |X, exactly n times|

|`X{n,}+`     |X, at least n times|

|`X{n,m}+`    |X, at least n but not more than m times|

## logical operators

|Construct   |Matches   |

|---|---|

|`XY`       |X, once or not at all|

|X|Y   |Either X or Y|

|`(X)`        |X, as a capturing group|

* Capturing group example:

    ```

    assertTrue("gogogogo regex".matches("(go)+\\sregex"));

    ```

    starts with "go" once or many times then space then 

    regex

    

## escaping

The backslash character (`'\'`) serves to introduce 

escaped constructs, as defined in the table above, 

as well as to quote characters that otherwise would 

be interpreted as unescaped constructs.

## typical invocation

A typical invocation sequence is:

```

Pattern p = Pattern.compile("a*b");

Matcher m = p.matcher("aaaaab");

boolean b = m.matches();

```

or

```

boolean b = Pattern.matches("a*b", "aaaaab");

```

But note that this method compiles an expression and matches an 

input sequence against it in a single invocation, so it is

equivalent to the three statements above, though for 

repeated matches it is less efficient since it does not 

allow the compiled pattern to be reused.

# project description

## string regex methods

* `public boolean matches(String regex)`

    ```

    // hour

    assertTrue("1:11".matches("[0-2]?\\d:[0-5]\\d"));

    assertTrue("9:11".matches("[0-2]?\\d:[0-5]\\d"));

    assertTrue("0:11".matches("[0-2]?\\d:[0-5]\\d"));

    

    // minute

    assertTrue("1:00".matches("[0-2]?\\d:[0-5]\\d"));

    assertTrue("9:01".matches("[0-2]?\\d:[0-5]\\d"));

    assertTrue("0:59".matches("[0-2]?\\d:[0-5]\\d"));

    assertFalse("0:60".matches("[0-2]?\\d:[0-5]\\d"));

    

    // hh

    assertTrue("00:00".matches("[0-2]?\\d:[0-5]\\d"));

    assertTrue("01:00".matches("[0-2]?\\d:[0-5]\\d"));

    assertTrue("21:00".matches("[0-2]?\\d:[0-5]\\d"));

    assertFalse("30:00".matches("[0-2]?\\d:[0-5]\\d"));

    ```

    * **remark**: `matches` is equivalent to the three 

    statements below, though for repeated matches it is 

    less efficient since it does not allow the compiled 

    pattern to be reused.

        ```

        Pattern p = Pattern.compile("[0-2]?\\d:[0-5]\\d");

        Matcher m = p.matcher("11:11");

        ```

* `public String[] split(String regex)`

    ```

    var namesContainer = "Michal--|--Marcin--|--Wojtek--|--Ania";

    String[] names = namesContainer.split("--\\|--");

   

    assertThat(names, is(new String[]{"Michal", "Marcin", "Wojtek", "Ania"}));

    ```

* `public String[] split(String regex, int limit)` - 

limit is size of returned array

    ```

    String[] names = "Michal--|--Marcin--|--Wojtek--|--Ania".split("--\\|--", 3);

    

    assertThat(names, is(new String[]{"Michal", "Marcin", "Wojtek--|--Ania"}));

    ```

* `public String replaceAll(String regex, String replacement)`

    ```

    String transformed = "Michal--|--Marcin--|--Wojtek--|--Ania".replaceAll("--\\|--", "|");

    

    assertThat(transformed, is("Michal|Marcin|Wojtek|Ania"));

    ```

* `public String replaceFirst(String regex, String replacement)`

    ```

    String transformed = "Michal--|--Marcin--|--Wojtek--|--Ania".replaceFirst("--\\|--", "|");

    

    assertThat(transformed, is("Michal|Marcin--|--Wojtek--|--Ania"));

    ```

## pattern methods

* `public static Pattern compile(String regex)`

    * very handy is method `public Predicate asMatchPredicate()`

    (since java11) which creates a predicate that 

    tests if this pattern matches a given input string

    ```

    var emailPattern = Pattern.compile("[_.\\w]+@([\\w]+\\.)+[\\w]{2,20}");

    

    List emails;

    

    try (var reader = Files.newBufferedReader(Path.of("emails.txt"))) {

        emails = reader.lines()

                .filter(emailPattern.asMatchPredicate())

                .collect(Collectors.toList());

    }

    

    assertThat(emails, hasSize(5));

    assertThat(emails, contains(

            "michaltumilowicz@tlen.pl",

            "michal_tumilowicz@tlen.pl",

            "MichalTumilowicz@gmail.com",

            "a.b_cD@a.b.c.d.pl",

            "m12@wp.com.pl"));

    ```

    * where `[_.\\w]+@([\\w]+\\.)+[\\w]{2,20}` is:

        * `[_.\\w]+` - either (`_`, `.`, letter/digit) once or more times

        * then `@`

        * `([\\w]+\\.)+` - (letter/digit once or more times with single dot) once or many times

        * `[\\w]{2,20}` - letter/digits twice to twenty times

* `public static Pattern compile(String regex, int flags)`

    * useful flags: 

        * `Pattern.MULTILINE` - 

        In multiline mode the expressions `^` and `$` match

        just after or just before, respectively, a line terminator or the end of

        the input sequence.  By default these expressions only match at the

        beginning and the end of the entire input sequence.

        * `Pattern.CASE_INSENSITIVE`

* `public static boolean matches(String regex, CharSequence input)`

    ```

    public static boolean matches(String regex, CharSequence input) {

        Pattern p = Pattern.compile(regex);

        Matcher m = p.matcher(input);

        return m.matches();

    }

    ```

## boundary matchers

* `\b` - A word boundary

    * end

        ```

        var txt = "catmania thiscat thiscatmania";

        

        String replaced = txt.replaceAll("cat\\b", "-");

        

        assertThat(replaced, is("catmania this- thiscatmania"));

        ```

    * beginning

        ```

        var txt = "catmania thiscat thiscatmania";

        

        String replaced = txt.replaceAll("\\bcat", "-");

        

        assertThat(replaced, is("-mania thiscat thiscatmania"));

        ```

* `\B` - A non-word boundary

    * not end

        ```

        var txt = "catmania thiscat thiscatmania";

        String replaced = txt.replaceAll("cat\\B", "-");

        assertThat(replaced, is("-mania thiscat this-mania"));

        ```

    * not beginning

        ```        

        var txt = "catmania thiscat thiscatmania";

        

        String replaced = txt.replaceAll("\\Bcat", "-");

        

        assertThat(replaced, is("catmania this- this-mania"));

        ```

    * neither beginning nor end

        ```

        var txt = "catmania thiscat thiscatmania";

        

        String replaced = txt.replaceAll("\\Bcat\\B", "-");

        

        assertThat(replaced, is("catmania thiscat this-mania"));

        ```

* `\A` - The beginning of the input and `\z` - The end of the input vs

`^` and `$`

    * it differs only when `Pattern.MULTILINE` was set:

        ```

        var pattern1 = Pattern.compile("^Michal$");

        var pattern2 = Pattern.compile("\\AMichal\\z");

        var pattern_multiline1 = Pattern.compile("^Michal$", Pattern.MULTILINE);

        var pattern_multiline2 = Pattern.compile("\\AMichal\\z", Pattern.MULTILINE);

        

        var txt = "Michal\nMarcin\nAnia";

        

        // matches

        assertFalse(pattern1.matcher(txt).matches());

        assertFalse(pattern2.matcher(txt).matches());

        assertFalse(pattern_multiline1.matcher(txt).matches());

        assertFalse(pattern_multiline2.matcher(txt).matches());

        

        // find

        assertFalse(pattern1.matcher(txt).find());

        assertFalse(pattern2.matcher(txt).find());

        assertTrue(pattern_multiline1.matcher(txt).find());

        assertFalse(pattern_multiline2.matcher(txt).find());

        ```
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/mtumilowicz/java11-regex

Awesome Lists containing this project

README