An open API service indexing awesome lists of open source software.

https://github.com/coregx/coregex

Pure Go production-grade regex engine with SIMD optimizations. Up to 3-3000x+ faster than stdlib.
https://github.com/coregx/coregex

avx2 dfa go golang multi-engine nfa performance pikevm regex regex-engine regexp simd ssse3

Last synced: 29 days ago
JSON representation

Pure Go production-grade regex engine with SIMD optimizations. Up to 3-3000x+ faster than stdlib.

Awesome Lists containing this project

README

          

# coregex

[![GitHub Release](https://img.shields.io/github/v/release/coregx/coregex?style=flat-square&logo=github&color=blue)](https://github.com/coregx/coregex/releases/latest)
[![Go Version](https://img.shields.io/github/go-mod/go-version/coregx/coregex?style=flat-square&logo=go)](https://go.dev/dl/)
[![Go Reference](https://pkg.go.dev/badge/github.com/coregx/coregex.svg)](https://pkg.go.dev/github.com/coregx/coregex)
[![CI](https://img.shields.io/github/actions/workflow/status/coregx/coregex/test.yml?branch=main&style=flat-square&logo=github-actions&label=CI)](https://github.com/coregx/coregex/actions)
[![Go Report Card](https://goreportcard.com/badge/github.com/coregx/coregex?style=flat-square)](https://goreportcard.com/report/github.com/coregx/coregex)
[![codecov](https://codecov.io/gh/coregx/coregex/branch/main/graph/badge.svg)](https://codecov.io/gh/coregx/coregex)
[![License](https://img.shields.io/badge/license-MIT-blue.svg?style=flat-square)](LICENSE)
[![GitHub Stars](https://img.shields.io/github/stars/coregx/coregex?style=flat-square&logo=github)](https://github.com/coregx/coregex/stargazers)
[![GitHub Issues](https://img.shields.io/github/issues/coregx/coregex?style=flat-square&logo=github)](https://github.com/coregx/coregex/issues)
[![GitHub Discussions](https://img.shields.io/github/discussions/coregx/coregex?style=flat-square&logo=github)](https://github.com/coregx/coregex/discussions)

High-performance regex engine for Go. Drop-in replacement for `regexp` with **3-3000x speedup**.\*

\* Typical speedup 15-240x on real-world patterns. 1000x+ achieved on [specific edge cases](https://github.com/kolkov/regex-bench#extreme-speedups-1000-3000x) where prefilters skip entire input (e.g., IP pattern on text with no digits).

## Why coregex?

Go's stdlib `regexp` is intentionally simple — single NFA engine, no optimizations. This guarantees O(n) time but leaves performance on the table.

coregex brings Rust regex-crate architecture to Go:
- **Multi-engine**: 17 strategies — Lazy DFA, PikeVM, OnePass, BoundedBacktracker, and more
- **SIMD prefilters**: AVX2/SSSE3 for fast candidate rejection
- **Reverse search**: Suffix/inner literal patterns run 1000x+ faster
- **O(n) guarantee**: No backtracking, no ReDoS vulnerabilities

## Installation

```bash
go get github.com/coregx/coregex
```

Requires Go 1.25+. Minimal dependencies (`golang.org/x/sys`, `github.com/coregx/ahocorasick`).

## Quick Start

```go
package main

import (
"fmt"
"github.com/coregx/coregex"
)

func main() {
re := coregex.MustCompile(`\w+@\w+\.\w+`)

text := []byte("Contact support@example.com for help")

// Find first match
fmt.Printf("Found: %s\n", re.Find(text))

// Check if matches (zero allocation)
if re.MatchString("test@email.com") {
fmt.Println("Valid email format")
}
}
```

## Performance

Cross-language benchmarks on 6MB input, AMD EPYC ([source](https://github.com/kolkov/regex-bench)):

| Pattern | Go stdlib | coregex | Rust regex | vs stdlib | vs Rust |
|---------|-----------|---------|------------|-----------|---------|
| Literal alternation | 554 ms | 4.5 ms | 0.72 ms | **122x** | 6.2x slower |
| Multi-literal | 1572 ms | 12.4 ms | 5.5 ms | **126x** | 2.2x slower |
| Inner `.*keyword.*` | 238 ms | 0.27 ms | 0.33 ms | **881x** | **1.2x faster** |
| Suffix `.*\.txt` | 239 ms | 1.9 ms | 1.2 ms | **125x** | 1.5x slower |
| Multiline `(?m)^/.*\.php` | 102 ms | 0.34 ms | 0.75 ms | **299x** | **2.2x faster** |
| Email validation | 257 ms | 0.46 ms | 0.31 ms | **557x** | 1.4x slower |
| URL extraction | 256 ms | 0.62 ms | 0.37 ms | **413x** | 1.6x slower |
| IP address | 494 ms | 0.72 ms | 13.5 ms | **685x** | **18.8x faster** |
| Version `\d+.\d+.\d+` | 164 ms | 0.62 ms | 0.79 ms | **263x** | **1.2x faster** |
| Char class `[\w]+` | 478 ms | 42.1 ms | 56.4 ms | **11x** | **1.3x faster** |
| Word repeat `(\w{2,8})+` | 690 ms | 180 ms | 54.7 ms | **3x** | 3.2x slower |

**Where coregex excels:**
- Multiline patterns (`(?m)^/.*\.php`) — **2.2x faster than Rust**, 299x vs stdlib
- IP/phone patterns (`\d+\.\d+\.\d+\.\d+`) — SIMD digit prefilter skips non-digit regions
- Suffix patterns (`.*\.log`, `.*\.txt`) — reverse search optimization (1000x+)
- Inner literals (`.*error.*`, `.*@example\.com`) — bidirectional DFA (900x+)
- Multi-pattern (`foo|bar|baz|...`) — Slim Teddy (≤32), Fat Teddy (33-64), or Aho-Corasick (>64)
- Anchored alternations (`^(\d+|UUID|hex32)`) — O(1) branch dispatch (5-20x)
- Concatenated char classes (`[a-zA-Z]+[0-9]+`) — DFA with byte classes (5-7x)
- **Zero-alloc iterators** (`AllIndex`, `AppendAllIndex`) — 0 heap allocs, up to **30% faster** than FindAll. Email pattern **faster than Rust** with `AppendAllIndex`.

## Features

### Engine Selection

coregex automatically selects the optimal engine:

| Strategy | Pattern Type | Speedup |
|----------|--------------|---------|
| **AnchoredLiteral** | `^prefix.*suffix$` | **32-133x** |
| **MultilineReverseSuffix** | `(?m)^/.*\.php` | **100-552x** ⚡ |
| ReverseInner | `.*keyword.*` | 100-900x |
| ReverseSuffix | `.*\.txt` | 100-1100x |
| BranchDispatch | `^(\d+\|UUID\|hex32)` | 5-20x |
| CompositeSequenceDFA | `[a-zA-Z]+[0-9]+` | 5-7x |
| LazyDFA | IP, complex patterns | 10-150x |
| AhoCorasick | `a\|b\|c\|...\|z` (>64 patterns) | 75-113x |
| CharClassSearcher | `[\w]+`, `\d+` | 4-25x |
| Slim Teddy | `foo\|bar\|baz` (2-32 patterns) | 15-240x |
| Fat Teddy | 33-64 patterns | 60-73x |
| OnePass | Anchored captures | 10x |
| BoundedBacktracker | Small patterns | 2-5x |

### API Compatibility

Drop-in replacement for `regexp.Regexp`:

```go
// stdlib
re := regexp.MustCompile(pattern)

// coregex — same API
re := coregex.MustCompile(pattern)
```

Supported methods:
- `Match`, `MatchString`, `MatchReader`
- `Find`, `FindString`, `FindAll`, `FindAllString`
- `FindIndex`, `FindStringIndex`, `FindAllIndex`
- `FindSubmatch`, `FindStringSubmatch`, `FindAllSubmatch`
- `ReplaceAll`, `ReplaceAllString`, `ReplaceAllFunc`
- `Split`, `SubexpNames`, `NumSubexp`
- `Longest`, `Copy`, `String`

### Zero-Allocation APIs

```go
// Zero allocations — boolean match
matched := re.IsMatch(text)

// Zero allocations — single match indices
start, end, found := re.FindIndices(text)

// Zero allocations — iterator over all matches (Go 1.23+)
for m := range re.AllIndex(data) {
fmt.Printf("match at [%d, %d]\n", m[0], m[1])
}

// Zero allocations — match content iterator
for s := range re.AllString(text) {
fmt.Println(s)
}

// Buffer-reuse — append to caller's slice (strconv.Append* pattern)
var buf [][2]int
for _, chunk := range chunks {
buf = re.AppendAllIndex(buf[:0], chunk, -1)
process(buf)
}
```

### Configuration

```go
config := coregex.DefaultConfig()
config.DFAMaxStates = 10000 // Limit DFA cache
config.EnablePrefilter = true // SIMD acceleration

re, err := coregex.CompileWithConfig(pattern, config)
```

### Thread Safety

A compiled `*Regexp` is safe for concurrent use by multiple goroutines:

```go
re := coregex.MustCompile(`\d+`)

// Safe: multiple goroutines sharing one compiled pattern
var wg sync.WaitGroup
for i := 0; i < 100; i++ {
wg.Add(1)
go func() {
defer wg.Done()
re.FindString("test 123 data") // thread-safe
}()
}
wg.Wait()
```

Internally uses `sync.Pool` (same pattern as Go stdlib `regexp`) for per-search state management.

## Syntax Support

Uses Go's `regexp/syntax` parser:

| Feature | Support |
|---------|---------|
| Character classes | `[a-z]`, `\d`, `\w`, `\s` |
| Quantifiers | `*`, `+`, `?`, `{n,m}` |
| Anchors | `^`, `$`, `\b`, `\B` |
| Groups | `(...)`, `(?:...)`, `(?P...)` |
| Unicode | `\p{L}`, `\P{N}` |
| Flags | `(?i)`, `(?m)`, `(?s)` |
| Backreferences | Not supported (O(n) guarantee) |

## Architecture

```
Pattern → Parse → NFA → Literal Extract → Strategy Select

┌────────────────────────────────────────────┐
│ Engines (17 strategies): │
│ LazyDFA, PikeVM, OnePass, │
│ BoundedBacktracker, ReverseAnchored, │
│ ReverseInner, ReverseSuffix, │
│ ReverseSuffixSet, MultilineReverseSuffix, │
│ AnchoredLiteral, CharClassSearcher, │
│ Teddy, DigitPrefilter, AhoCorasick, │
│ CompositeSearcher, BranchDispatch, Both │
└────────────────────────────────────────────┘

Input → Prefilter (SIMD) → Engine → Match Result
```

> For detailed architecture documentation, see [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md).
> For optimization details, see [docs/OPTIMIZATIONS.md](docs/OPTIMIZATIONS.md).

**SIMD Primitives** (AMD64):
- `memchr` — single byte search (AVX2)
- `memmem` — substring search (SSSE3)
- `Slim Teddy` — multi-pattern search, 2-32 patterns (SSSE3, 9+ GB/s)
- `Fat Teddy` — multi-pattern search, 33-64 patterns (AVX2, 9+ GB/s)

Pure Go fallback on other architectures.

## Battle-Tested

coregex was [tested in GoAWK](https://github.com/benhoyt/goawk/pull/264). This real-world testing uncovered 15+ edge cases that synthetic benchmarks missed.

### Powered by coregex: uawk

[uawk](https://github.com/kolkov/uawk) is a modern AWK interpreter built on coregex:

| Benchmark (10MB) | GoAWK | uawk | Speedup |
|------------------|-------|------|---------|
| Regex alternation | 1.85s | 97ms | **19x** |
| IP matching | 290ms | 99ms | **2.9x** |
| General regex | 320ms | 100ms | **3.2x** |

```bash
go install github.com/kolkov/uawk/cmd/uawk@latest
uawk '/error/ { print $0 }' server.log
```

**We need more testers!** If you have a project using `regexp`, try coregex and [report issues](https://github.com/coregx/coregex/issues).

## Documentation

- [API Reference](https://pkg.go.dev/github.com/coregx/coregex)
- [CHANGELOG](CHANGELOG.md)
- [Contributing](CONTRIBUTING.md)
- [Security Policy](SECURITY.md)

## Comparison

| | coregex | stdlib | regexp2 |
|---|---------|--------|---------|
| Performance | 3-3000x faster | Baseline | Slower |
| SIMD | AVX2/SSSE3 | No | No |
| O(n) guarantee | Yes | Yes | No |
| Backreferences | No | No | Yes |
| API | Drop-in | — | Different |

**Use coregex** for performance-critical code with O(n) guarantee.
**Use stdlib** for simple cases where performance doesn't matter.
**Use regexp2** if you need backreferences (accept exponential worst-case).

## Related

- [uawk](https://github.com/kolkov/uawk) — Ultra AWK interpreter powered by coregex
- [kolkov/regex-bench](https://github.com/kolkov/regex-bench) — Cross-language benchmarks
- [golang/go#26623](https://github.com/golang/go/issues/26623) — Go regexp performance discussion
- [golang/go#76818](https://github.com/golang/go/issues/76818) — Upstream path proposal

**Inspired by:**
- [Rust regex](https://github.com/rust-lang/regex) — Architecture
- [RE2](https://github.com/google/re2) — O(n) guarantees
- [Hyperscan](https://github.com/intel/hyperscan) — SIMD algorithms

## License

MIT — see [LICENSE](LICENSE).

---

**Status:** Pre-1.0 (API may change). Ready for testing and feedback.

[Releases](https://github.com/coregx/coregex/releases) · [Issues](https://github.com/coregx/coregex/issues) · [Discussions](https://github.com/coregx/coregex/discussions)

## Star History





Star History Chart