https://github.com/Mirkoddd/Sift
A Fluent Regex Builder for Java
https://github.com/Mirkoddd/Sift
android-library builder-pattern fluent-api hibernate-validator jakarta-validation java java-library regex regex-builder regex-dsl regular-expressions thread-safe type-safe validation
Last synced: 10 days ago
JSON representation
A Fluent Regex Builder for Java
- Host: GitHub
- URL: https://github.com/Mirkoddd/Sift
- Owner: Mirkoddd
- License: apache-2.0
- Created: 2026-02-18T15:52:59.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-04-09T13:41:49.000Z (about 1 month ago)
- Last Synced: 2026-04-09T15:26:45.957Z (about 1 month ago)
- Topics: android-library, builder-pattern, fluent-api, hibernate-validator, jakarta-validation, java, java-library, regex, regex-builder, regex-dsl, regular-expressions, thread-safe, type-safe, validation
- Language: Java
- Homepage:
- Size: 828 KB
- Stars: 77
- Watchers: 0
- Forks: 5
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
Awesome Lists containing this project
- fucking-awesome-java - Sift - Type-safe, AST-based Regex Builder focused on readability and ReDoS prevention. (Projects / Utility)
- awesome-java - Sift - Type-safe, AST-based Regex Builder focused on readability and ReDoS prevention. (Projects / Utility)
README
# Sift
[](https://adoptium.net/)
[](LICENSE)
[](https://github.com/mirkoddd/Sift/actions)
[](https://github.com/mirkoddd/Sift/actions)
**The Type-Safe Regex Builder for Java. If it compiles, it works.**
---
## The Problem
You've seen this before. Someone writes a regex, it works, and six months later nobody — including the author — can read it:
```java
// What does this even do?
Pattern p = Pattern.compile("^(?=[\\p{Lu}])[\\p{L}\\p{Nd}_]{3,15}+[0-9]?$");
```
You add a character class, break the balance of brackets, and find out at runtime. You copy a regex from Stack Overflow, miss an escape, and watch it fail silently in production. You duplicate the same validation pattern across DTOs and forget to update one of them.
**There is a better way.**
---
## The Solution
Sift is a fluent DSL that turns regex construction into readable, self-documenting Java code. Its state machine enforces grammatical correctness at **compile time** — if your pattern compiles, it is structurally valid.
```java
// The same pattern, written with Sift:
String regex = Sift.fromStart()
.exactly(1).upperCaseLettersUnicode() // Must start with an uppercase letter
.then()
.between(3, 15).wordCharactersUnicode().withoutBacktracking() // ReDoS-safe
.then()
.optional().digits() // May end with a digit
.andNothingElse()
.shake();
// Result: ^[\p{Lu}][\p{L}\p{Nd}_]{3,15}+[0-9]?$
```
Your IDE guides every step. Wrong transitions simply do not exist as methods.
---
## Installation
[](https://central.sonatype.com/artifact/com.mirkoddd/sift-core)
[](https://central.sonatype.com/artifact/com.mirkoddd/sift-annotations)
[](https://central.sonatype.com/artifact/com.mirkoddd/sift-engine-re2j)
[](https://central.sonatype.com/artifact/com.mirkoddd/sift-engine-graalvm)
**Gradle:**
```groovy
// Core engine — zero external dependencies
implementation 'com.mirkoddd:sift-core:'
// Optional: Jakarta Validation / Hibernate Validator integration
implementation 'com.mirkoddd:sift-annotations:'
// Optional: Engine RE2J
implementation 'com.mirkoddd:sift-engine-re2j:'
// Optional: Engine GraalVM
implementation 'com.mirkoddd:sift-engine-graalvm:'
```
**Maven:**
```xml
com.mirkoddd
sift-core
latest
com.mirkoddd
sift-annotations
latest
com.mirkoddd
sift-engine-graalvm
latest
com.mirkoddd
sift-engine-re2j
latest
```
> Sift targets **Java 8 bytecode** for maximum compatibility — including legacy Spring Boot 2.x and Android.
---
## Core Concepts
### Entry Points
| Method | Generates | Use when |
|---|---|---|
| `Sift.fromStart()` | `^...` | Validating from the start of a line (affected by `MULTILINE` flag) |
| `Sift.fromAbsoluteStart()` | `\A...` | Validating from the absolute start of the string (CRLF/Multi-Line safe) |
| `Sift.fromAnywhere()` | `...` | Building reusable fragments or searching within text |
| `Sift.fromWordBoundary()` | `\b...` | Matching whole words |
| `Sift.fromPreviousMatchEnd()` | `\G...` | Iterative parsing |
| `Sift.filteringWith(flag)` | `(?i)...` | Global flags (case-insensitive, multiline, dotall) |
### Terminal Methods
| Method | Effect |
|---|---|
| `.shake()` | Returns the raw regex `String` |
| `.sieve()` | Compiles with the default JDK engine → `SiftCompiledPattern` |
| `.sieveWith(engine)` | Compiles with a custom engine → `SiftCompiledPattern` |
| `.andNothingElse()` | Appends `$` and seals the pattern — affected by `MULTILINE` flag and trailing newlines |
| `.andNothingElseAbsolutely()` | Appends `\z` — absolute end of string, completely immune to multi-line and CRLF injection bypasses |
| `.andNothingElseBeforeFinalNewline()` | Appends `\Z` — end of string, or just before a final `\n` |
---
## Examples
### 1. Modular Composition — The LEGO Brick Approach
The real power of Sift is the ability to name your building blocks and compose them. Every `Sift.fromAnywhere()` call returns a reusable `SiftPattern` that can be embedded anywhere without carrying unwanted anchors.
```java
// Define named building blocks
SiftPattern year = Sift.fromAnywhere().exactly(4).digits();
SiftPattern month = Sift.fromAnywhere().exactly(2).digits();
SiftPattern day = Sift.fromAnywhere().exactly(2).digits();
SiftPattern dash = Sift.fromAnywhere().character('-');
// Compose them into a date block
SiftPattern dateBlock = year.followedBy(dash, month, dash, day);
// Embed inside a larger pattern
String logRegex = Sift.fromStart()
.of(dateBlock)
.followedBy(' ')
.then().oneOrMore().anyCharacter()
.andNothingElse()
.shake();
// Result: ^[0-9]{4}-[0-9]{2}-[0-9]{2} .+$
```
> **Root vs Fragment:** Patterns built with `fromStart()` or `fromAbsoluteStart()`, or closed with `andNothingElse()`, `andNothingElseAbsolutely()`, or `andNothingElseBeforeFinalNewline()` become `SiftPattern` — they are sealed and cannot be embedded.
---
### 2. Data Extraction — Beyond Pattern Matching
Sift patterns are not just validators. They are fully equipped extraction tools.
```java
// Define a structured pattern with named groups
NamedCapture yearGroup = SiftPatterns.capture("year", Sift.exactly(4).digits());
NamedCapture monthGroup = SiftPatterns.capture("month", Sift.exactly(2).digits());
NamedCapture dayGroup = SiftPatterns.capture("day", Sift.exactly(2).digits());
SiftPattern> datePattern = Sift.fromStart()
.namedCapture(yearGroup)
.followedBy('-')
.then().namedCapture(monthGroup)
.followedBy('-')
.then().namedCapture(dayGroup)
.andNothingElse();
// Extract structured data directly — no Matcher boilerplate
Map fields = datePattern.extractGroups("2026-03-13");
// → { "year": "2026", "month": "03", "day": "13" }
// Extract all matches from a larger text
List prices = Sift.fromAnywhere()
.oneOrMore().digits()
.sieve()
.extractAll("Order: 3 items at 25 and 40 euros");
// → ["3", "25", "40"]
// Stream results lazily for large inputs
Sift.fromAnywhere().oneOrMore().lettersUnicode()
.streamMatches(largeText)
.filter(word -> word.length() > 5)
.forEach(System.out::println);
```
**Full extraction API:**
| Method | Returns | Description |
|---|---|---|
| `containsMatchIn(input)` | `boolean` | Is there at least one match? |
| `matchesEntire(input)` | `boolean` | Does the entire string match? |
| `extractFirst(input)` | `Optional` | First match, or empty |
| `extractAll(input)` | `List` | All matches |
| `extractGroups(input)` | `Map` | Named groups from first match |
| `extractAllGroups(input)` | `List>` | Named groups from all matches |
| `replaceFirst(input, replacement)` | `String` | Replace first match |
| `replaceAll(input, replacement)` | `String` | Replace all matches |
| `splitBy(input)` | `List` | Split around matches |
| `streamMatches(input)` | `Stream` | Lazy stream of all matches |
---
### 3. Jakarta Validation Integration
Stop duplicating regex logic across your DTOs. Define a rule once, reuse it everywhere with `@SiftMatch`.
```java
// 1. Define a reusable rule
public class PromoCodeRule implements SiftRegexProvider {
@Override
public String getRegex() {
return Sift.fromStart()
.atLeast(4).letters()
.then()
.exactly(3).digits()
.andNothingElse()
.shake();
}
}
// 2. Apply it declaratively — pattern is compiled once at bootstrap
public record ApplyPromoRequest(
@SiftMatch(
value = PromoCodeRule.class,
flags = { SiftMatchFlag.CASE_INSENSITIVE },
message = "Invalid promo code format"
)
String promoCode
) {}
```
---
### 4. ReDoS Mitigation
Sift makes performance-safe patterns easy to express without memorizing obscure syntax.
```java
// Possessive quantifier — prevents catastrophic backtracking
Sift.fromAnywhere()
.oneOrMore().wordCharacters().withoutBacktracking(); // generates \w++
// Atomic group — locks a sub-pattern once matched
SiftPattern safe = Sift.fromAnywhere()
.oneOrMore().digits()
.preventBacktracking(); // wraps in (?>...)
// Lazy quantifier — matches as few characters as possible
Sift.fromAnywhere()
.oneOrMore().anyCharacter().asFewAsPossible(); // generates .+?
```
---
### 5. Alternative Regex Engines
By default, Sift compiles patterns using the standard `java.util.regex` engine via `.sieve()`.
For use cases where the JDK engine is not suitable — such as environments requiring
linear-time guarantees or GraalVM native images — Sift exposes a `SiftEngine` SPI that
accepts any compatible backend.
```java
// Default — uses java.util.regex internally
SiftCompiledPattern pattern = Sift.fromAnywhere()
.oneOrMore().digits()
.sieve();
// Custom engine — e.g. RE2J for linear-time, ReDoS-immune matching
SiftCompiledPattern pattern = Sift.fromAnywhere()
.oneOrMore().digits()
.sieveWith(Re2jEngine.INSTANCE); // sift-engine-re2j module
```
Sift tracks the advanced features used during pattern construction as a `Set`
and passes it to the engine at compile time. If an engine doesn't support a requested
feature — for example, RE2J does not support lookarounds or backreferences — it throws
`UnsupportedOperationException` immediately, before any input is processed.
Available engine modules:
- `sift-core` — includes `JdkEngine` (default, zero dependencies)
- `sift-engine-re2j` — RE2J backend, guarantees linear-time O(n) matching, immune to ReDoS
- `sift-engine-graalvm` — GraalVM TRegex backend, AOT-ready for native images
---
### 6. Lookarounds — Zero-Width Assertions
Lookarounds let you match based on what surrounds a position without consuming those characters. Sift exposes them as readable methods directly on `Connector`, so they flow naturally in the chain.
```java
// Positive lookahead — match "file" only if followed by ".pdf"
SiftPattern pdfFile = Sift.fromAnywhere()
.oneOrMore().wordCharacters()
.mustBeFollowedBy(SiftPatterns.literal(".pdf"));
// Negative lookahead — match a number NOT followed by "%"
SiftPattern absoluteValue = Sift.fromAnywhere()
.oneOrMore().digits()
.notFollowedBy(SiftPatterns.literal("%"));
// Positive lookbehind — match digits only if preceded by "$"
SiftPattern dollarAmount = Sift.fromAnywhere()
.oneOrMore().digits()
.mustBePrecededBy(SiftPatterns.literal("$"));
// Negative lookbehind — match "port" NOT preceded by "pass"
SiftPattern networkPort = Sift.fromAnywhere()
.of(SiftPatterns.literal("port"))
.notPrecededBy(SiftPatterns.literal("pass"));
```
All four lookaround types are available both on `Connector` — for inline chaining — and as standalone factories in `SiftPatterns` for use in composition with `followedByAssertion()` and `precededByAssertion()`.
---
### 7. Ready-Made Patterns — SiftCatalog
`SiftCatalog` provides a curated set of production-ready, ReDoS-safe patterns for common formats. All patterns are `Fragment`-typed — they compose cleanly with your own Sift chains.
```java
// Use standalone
boolean valid = SiftCatalog.email().matchesEntire("user@example.com");
// Or embed inside a larger pattern
String regex = Sift.fromStart()
.of(SiftCatalog.uuid())
.followedBy('/')
.then().of(SiftCatalog.isoDate())
.andNothingElse()
.shake();
```
Available patterns: `uuid()`, `ipv4()`, `macAddress()`, `email()`, `webUrl()`, `isoDate()`,
`iban()`, `jwt()`, `creditCard()`, `base64()`, `base64Url()`.
---
### 8. Recursive & Nested Structures
Parse arbitrarily deep balanced structures with `SiftPatterns.nesting()`.
```java
// Match nested parentheses: ((a)(b))
SiftPattern nested = SiftPatterns.nesting(5)
.using(Delimiter.PARENTHESES)
.containing(Sift.fromAnywhere().oneOrMore().lettersUnicode());
nested.containsMatchIn("((hello)(world))"); // true
```
---
### 9. Conditional Patterns
Sift exposes regex conditional logic — `if/then/else` branches — as a fully type-safe fluent API via `SiftPatterns`.
```java
// Match a price: if preceded by "USD" consume digits only,
// otherwise consume digits followed by a currency symbol
SiftPattern price = SiftPatterns
.ifPrecededBy(SiftPatterns.literal("USD"))
.thenUse(Sift.fromAnywhere().oneOrMore().digits())
.otherwiseUse(
Sift.fromAnywhere().oneOrMore().digits()
.followedBy(Sift.fromAnywhere().character('€'))
);
// Else-if chaining is also supported
SiftPattern format = SiftPatterns
.ifFollowedBy(SiftPatterns.literal("px"))
.thenUse(Sift.fromAnywhere().oneOrMore().digits())
.otherwiseIfFollowedBy(SiftPatterns.literal("%"))
.thenUse(Sift.fromAnywhere().between(1, 3).digits())
.otherwiseNothing(); // No else branch — engine moves forward silently
```
The state machine enforces the correct declaration order at compile time: `ifXxx` → `thenUse` → `otherwiseUse` / `otherwiseIfFollowedBy` / `otherwiseNothing`. An incomplete conditional is not expressible.
---
### 10. Pattern Explanation
Sift can translate any pattern into a human-readable ASCII tree — useful for debugging,
documentation, and onboarding. The explainer supports multiple languages via i18n.
```java
SiftPattern pattern = Sift.fromStart()
.oneOrMore().digits()
.andNothingElse();
// English (default)
System.out.println(pattern.explain());
// Italian
System.out.println(pattern.explain(Locale.ITALIAN));
// Spanish
System.out.println(pattern.explain(new Locale("es")));
```
Output (English):
```
┌─ Starts at the beginning of the line
├─ a digit one or more times
└─ Ends at the end of the line
```
`explain()` is available directly on every `SiftPattern` and delegates to `SiftExplainer`,
which can also be called standalone for more control over the locale resolution.
---
## Why Sift?
| | Raw Java Regex | Sift |
|---|---|---|
| Syntax errors | Discovered at runtime | Impossible to express |
| Readability | Cryptic symbols | Self-documenting method names |
| Reusability | Copy-paste | Named `SiftPattern` fragments |
| Thread safety | Manual | Guaranteed, all patterns are immutable |
| ReDoS protection | Requires expert knowledge | Built-in API |
| Jakarta Validation | Manual `@Pattern` duplication | `@SiftMatch` + `SiftRegexProvider` |
| Regex engine | JDK only | Pluggable (JDK, RE2J, GraalVM, etc...) |
| Dependencies | — | Zero (sift-core) |
| Human-readable explanation | Not possible | `pattern.explain()` with i18n support |
| CRLF / Header Injection | Manual anchor management (`\z`) | Native API (`andNothingElseAbsolutely()`) |
---
## Going Further
- **[Sift Cookbook](COOKBOOK.md)** — Advanced recipes: TSV log parsing, UUID validation, lookarounds, data extraction with named captures, conditional patterns, and more.
- **[Javadoc — sift-core](https://javadoc.io/doc/com.mirkoddd/sift-core)**
- **[Javadoc — sift-annotations](https://javadoc.io/doc/com.mirkoddd/sift-annotations)**
- **[Javadoc — sift-engine-re2j](https://javadoc.io/doc/com.mirkoddd/sift-engine-re2j)**
- **[Javadoc — sift-engine-graalvm](https://javadoc.io/doc/com.mirkoddd/sift-engine-graalvm)**
- **[Changelog](CHANGELOG.md)**
- **[Contributing](CONTRIBUTING.md)**
## Under the Hood
Curious how Sift converts a fluent API chain into a compiled regex without performance overhead? Or want to know how the Type-State pattern catches errors at compile-time?
Check out the [Sift Architecture](ARCHITECTURE.md) to see the magic behind the AST (Abstract Syntax Tree) and learn how easy it is to add your own DSL methods or execution engines.