# 🛡 Semantic Privacy Guard

[![CI](https://github.com/Sushegaad/Semantic-Privacy-Guard/actions/workflows/ci.yml/badge.svg)](https://github.com/Sushegaad/Semantic-Privacy-Guard/actions/workflows/ci.yml)
[![Maven Central](https://img.shields.io/maven-central/v/io.github.sushegaad/semantic-privacy-guard?color=blue&logo=apache-maven)](https://central.sonatype.com/artifact/io.github.sushegaad/semantic-privacy-guard)
[![Coverage](https://img.shields.io/badge/coverage-%E2%89%A580%25-brightgreen)](https://github.com/Sushegaad/Semantic-Privacy-Guard/actions)
[![Java](https://img.shields.io/badge/Java-17%2B-blue?logo=openjdk)](https://openjdk.org/)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue)](LICENSE)
[![Security Policy](https://img.shields.io/badge/security-policy-orange)](SECURITY.md)
[![Live Playground](https://img.shields.io/badge/playground-live-brightgreen)](https://sushegaad.github.io/Semantic-Privacy-Guard/docs/index.html)

> **A Java middleware that intercepts text, identifies PII using a three-layer hybrid pipeline
> (Regex + Naive Bayes ML + Apache OpenNLP NER), and redacts it before it reaches
> an LLM or leaves the corporate network — with stream-based processing for
> memory-efficient handling of large files and log streams.**

---

## 🚀 Live Playground

**[Try it in your browser →](https://sushegaad.github.io/Semantic-Privacy-Guard/docs/index.html)**

Paste any text, choose a redaction mode, and see instant results — 100% client-side, nothing sent to any server.

---

## Why Semantic Privacy Guard?

| Problem | How SPG helps |
|---|---|
| Employees paste customer data into ChatGPT | Intercept prompts at the API gateway layer |
| Cloud PII APIs cost $0.001/call at scale | SPG costs $0/call, runs fully offline |
| LLMs need context; full redaction breaks prompts | Structured tokens like `[EMAIL_1]` preserve sentence structure |
| 2026 EU AI Act: "Privacy by Design" required | SPG redacts PII before text leaves your network, which supports a privacy-by-design posture |
| 50 MB log file = 150–200 MB heap per request | Stream API processes one line at a time — constant memory |
| Naive regex fires on every title-cased word | Three-layer pipeline: regex + Naive Bayes + OpenNLP NER |

### The Disambiguation Advantage

```
"I ate an apple yesterday." → No match (fruit, not a name)
"Contact Apple at (800) 275-2273." → [ORG_1] (company + phone)
"The Gospel of John has 21 chapters" → No match (literary reference)
"Dear John, your SSN is 123-45-6789" → [PERSON_NAME_1] + [SSN_1]
"John Michael Smith confirmed." → [PERSON_NAME_1] (OpenNLP NER)
```

---

## Playground Screenshot

[![Semantic Privacy Guard Playground](docs/playground-screenshot.png)](https://sushegaad.github.io/Semantic-Privacy-Guard/docs/index.html)

*The live playground detecting an SSN and an email address in real time — redacted output, detection table with confidence bars, and reverse map all visible.*

---

## Quick Start

### Maven

```xml
<dependency>
    <groupId>io.github.sushegaad</groupId>
    <artifactId>semantic-privacy-guard</artifactId>
    <version>1.4.0</version>
</dependency>
```

### Gradle

```groovy
implementation 'io.github.sushegaad:semantic-privacy-guard:1.4.0'
```

### One-liner usage

```java
import com.semanticprivacyguard.SemanticPrivacyGuard;
import com.semanticprivacyguard.model.RedactionResult;

SemanticPrivacyGuard spg = SemanticPrivacyGuard.create();

RedactionResult result = spg.redact(
        "Email Alice at alice.doe@acme.com or call (555) 867-5309. SSN: 123-45-6789."
);

System.out.println(result.getRedactedText());
// → "Email [PERSON_NAME_1] at [EMAIL_1] or call [PHONE_1]. SSN: [SSN_1]."

System.out.println(result.getMatchCount()); // → 4
System.out.println(result.getProcessingTimeMs()); // → < 1 ms
```

---

## Stream-Based Processing

Loading a 50 MB log file into a `String` costs ~50 MB on the heap, and with the ML tokenizer's working strings you reach 150–200 MB _per concurrent request_. On a Lambda with 512 MB RAM, 10 concurrent calls would need 1.5–2 GB: an OOM event waiting to happen.

The `StreamProcessor` processes one line at a time. Each line is detected, redacted, written to the output, and immediately eligible for GC. Heap stays bounded by the longest single line — typically under 4 KB.

```java
SemanticPrivacyGuard spg = SemanticPrivacyGuard.create();

// File-to-file: constant heap regardless of file size
StreamRedactionSummary summary =
        spg.redactPath(Path.of("access.log"), Path.of("access.clean.log"));

System.out.println(summary);
// → StreamRedactionSummary[lines=84231, linesWithPII=312, matches=389, timeMs=740]

// InputStream / OutputStream (e.g. in a servlet filter)
try (InputStream in = request.getInputStream();
     OutputStream out = response.getOutputStream()) {
    spg.redactStream(in, out);
}

// Reader / Writer
spg.redactStream(request.getReader(), response.getWriter());

// Lazy Java Stream: integrates with Files.lines()
try (Stream<String> lines = Files.lines(inputPath)) {
    spg.streamProcessor()
       .redactLines(lines)
       .forEach(outputWriter::println);
}
```

Token counters are **document-scoped**: `[EMAIL_1]` on line 3 and `[EMAIL_2]` on line 7 — never two `[EMAIL_1]` tokens in the same document.
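
To make the document-scoped counter concrete, here is a minimal, JDK-only sketch of line-at-a-time redaction. It is not SPG's `StreamProcessor`: the class name `LineRedactorSketch`, the single simplified e-mail regex, and the `redact` signature are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Iterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: one pattern, one line in memory at a time. */
final class LineRedactorSketch {
    // Deliberately simple e-mail pattern (assumption, not SPG's regex).
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /** Redacts e-mails line by line and returns the number of replacements. */
    static int redact(Reader in, PrintWriter out) {
        int counter = 0; // declared outside the loop, so numbering spans the whole document
        Iterator<String> lines = new BufferedReader(in).lines().iterator();
        while (lines.hasNext()) {
            Matcher m = EMAIL.matcher(lines.next());
            StringBuilder redacted = new StringBuilder();
            while (m.find()) {
                counter++;
                m.appendReplacement(redacted,
                        Matcher.quoteReplacement("[EMAIL_" + counter + "]"));
            }
            m.appendTail(redacted);
            out.println(redacted); // the line can be collected once written
        }
        out.flush();
        return counter;
    }

    public static void main(String[] args) {
        StringWriter sink = new StringWriter();
        int matches = redact(
                new StringReader("a@x.com\nnothing here\nb@y.org and c@z.net"),
                new PrintWriter(sink));
        System.out.println(matches + " matches");
        System.out.print(sink);
    }
}
```

Because the counter is never reset between lines, the third line of the sample input yields `[EMAIL_2]` and `[EMAIL_3]` rather than a second `[EMAIL_1]`.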

---

## NLP Integration (Apache OpenNLP)

The third detection layer uses Apache OpenNLP Named Entity Recognition — a Maximum Entropy model trained on large NLP corpora. It excels at cases the Naive Bayes layer struggles with: multi-token person names, compound organisation names, and names appearing in varied syntactic positions.

### Enable NLP

```java
// Option 1: models loaded from the classpath (src/main/resources/models/)
SPGConfig classpathConfig = SPGConfig.builder()
        .nlpEnabled(true)
        .build();

// Option 2: models loaded from the filesystem
SPGConfig filesystemConfig = SPGConfig.builder()
        .nlpEnabled(true)
        .nlpModelsDirectory(Path.of("/opt/nlp-models"))
        .nlpConfidenceThreshold(0.75) // default 0.70
        .build();

SemanticPrivacyGuard spg = SemanticPrivacyGuard.create(filesystemConfig);
```

### NLP Setup — Model Download

OpenNLP models are large binary files not bundled in the jar. Download them from the [Apache OpenNLP model repository](https://opennlp.sourceforge.net/models-1.5/):

```
en-ner-person.bin (required, ~14 MB) — person name NER
en-ner-organization.bin (recommended, ~16 MB) — organisation name NER
en-token.bin (recommended, ~1 MB) — MaxEnt tokenizer
```

Place them on the classpath:

```
src/main/resources/
models/
en-ner-person.bin
en-ner-organization.bin
en-token.bin
```

Or point to a filesystem directory:

```java
.nlpModelsDirectory(Path.of("/opt/nlp-models"))
```

Add the OpenNLP runtime dependency (marked `optional` in SPG — you must add it yourself):

```xml
<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>2.3.3</version>
</dependency>
```

### NLP Detection Types

| Detected by OpenNLP | PIIType | Notes |
|---|---|---|
| Person names | `PERSON_NAME` | Multi-token names, varied positions |
| Organisation names | `ORGANIZATION` | Compound names, acronyms |

NLP results flow through the same `CompositeDetector` de-duplication as heuristic and ML results. When two layers agree on the same span the match is promoted to `DetectionSource.HYBRID` with the higher confidence score.
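
The merge rule described above can be sketched in a few lines. This is a rough illustration rather than `CompositeDetector` itself: the `Match` record, the `Source` enum, and the choice to key on span plus type are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: merges detections that cover the same span. */
final class HybridMergeSketch {
    enum Source { HEURISTIC, ML, NLP, HYBRID }

    /** Hypothetical match shape; SPG's own match type may differ. */
    record Match(int start, int end, String type, Source source, double confidence) {}

    /** Two layers agreeing on a span become one HYBRID match with the higher score. */
    static Match combine(Match a, Match b) {
        Source source = a.source() == b.source() ? a.source() : Source.HYBRID;
        double best = Math.max(a.confidence(), b.confidence());
        return new Match(a.start(), a.end(), a.type(), source, best);
    }

    static List<Match> merge(List<Match> candidates) {
        // Keyed on span and type (an assumption); insertion order is preserved.
        Map<String, Match> bySpan = new LinkedHashMap<>();
        for (Match m : candidates) {
            String key = m.start() + ":" + m.end() + ":" + m.type();
            bySpan.merge(key, m, HybridMergeSketch::combine);
        }
        return new ArrayList<>(bySpan.values());
    }

    public static void main(String[] args) {
        List<Match> merged = merge(List.of(
                new Match(0, 5, "PERSON_NAME", Source.ML, 0.72),
                new Match(0, 5, "PERSON_NAME", Source.NLP, 0.91),
                new Match(10, 27, "EMAIL", Source.HEURISTIC, 0.99)));
        merged.forEach(System.out::println);
    }
}
```

Partially overlapping spans need a separate policy (for example, longest span wins); this sketch only covers exact agreement.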

### Thread Safety with Virtual Threads

`NameFinderME` is not thread-safe. `NLPDetector` uses a `ThreadLocal` to give each thread its own `NameFinderME` wrapper, all sharing the same immutable `TokenNameFinderModel`. Adaptive state is cleared after every `detect()` call, so no state leaks between requests. The class is safe under virtual threads (Project Loom, final in Java 21).
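
The per-thread pattern looks roughly like the sketch below. To stay self-contained it does not use OpenNLP: `Model` and `Finder` are hypothetical stand-ins for the shared immutable model and the stateful finder, and only the `ThreadLocal` wiring is the point.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Illustrative only: one stateful finder per thread, one shared model. */
final class PerThreadFinderSketch {
    /** Stand-in for an immutable model that is safe to share (hypothetical). */
    record Model(List<String> knownNames) {}

    /** Stand-in for a stateful finder that is not thread-safe (hypothetical). */
    static final class Finder {
        private final Model model;
        private int adaptiveHits; // mutable state that must not leak between requests

        Finder(Model model) { this.model = model; }

        boolean find(String token) {
            boolean hit = model.knownNames().contains(token);
            if (hit) adaptiveHits++;
            return hit;
        }

        void clearAdaptiveData() { adaptiveHits = 0; }

        int adaptiveHits() { return adaptiveHits; }
    }

    private final ThreadLocal<Finder> finders;

    PerThreadFinderSketch(Model shared) {
        // Each thread lazily gets its own Finder; all of them read the same Model.
        this.finders = ThreadLocal.withInitial(() -> new Finder(shared));
    }

    boolean detect(String token) {
        Finder finder = finders.get();
        try {
            return finder.find(token);
        } finally {
            finder.clearAdaptiveData(); // reset even if find() throws
        }
    }

    Finder currentThreadFinder() { return finders.get(); }

    public static void main(String[] args) {
        PerThreadFinderSketch detector =
                new PerThreadFinderSketch(new Model(List.of("Alice", "John")));
        System.out.println(detector.detect("Alice"));
        Finder other = CompletableFuture.supplyAsync(detector::currentThreadFinder).join();
        System.out.println(other != detector.currentThreadFinder());
    }
}
```

With very large numbers of virtual threads, each thread still allocates its own finder on first use, so the cost of constructing one is worth measuring.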

---

## PII Types Detected

| Type | Example | Detection method | Severity |
|---|---|---|---|
| `SSN` | `123-45-6789` | Regex + exclusion rules | 10 |
| `CREDIT_CARD` | `4532 0151 1283 0366` | Regex + Luhn checksum | 10 |
| `API_KEY` | `AKIAIOSFODNN7EXAMPLE` | Regex + entropy filter | 9 |
| `PASSWORD` | `password=MyS3cr3t` | Regex (keyword-prefixed) | 9 |
| `MEDICAL_RECORD` | `MRN123456` | Naive Bayes ML | 8 |
| `BANK_ACCOUNT` | `GB29NWBK60161331926819` | Regex (IBAN) | 8 |
| `EMAIL` | `alice@example.com` | Regex | 6 |
| `PHONE` | `(555) 867-5309` | Regex (NANP validated) | 6 |
| `PERSON_NAME` | `Alice Johnson` | Naive Bayes ML + OpenNLP NER | 6 |
| `DATE_OF_BIRTH` | `dob: 03/15/1985` | Regex (context-prefixed) | 6 |
| `IP_ADDRESS` | `192.168.1.100` | Regex (range-validated) | 4 |
| `ORGANIZATION` | `Barclays Bank PLC` | Naive Bayes ML + OpenNLP NER | 3 |
| `COORDINATES` | `51.5074, -0.1278` | Regex (bounds-checked) | 3 |
| `GENERIC_PII` | `EMP-042731` | Custom Pattern Registry | 5 |
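
Two of the validators named in the table, the Luhn checksum and an entropy filter, are standard techniques. The sketch below shows one plausible form of each; the class name, the 12-digit minimum, and the use of Shannon entropy are assumptions, not SPG's actual implementation or thresholds.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative validators; not SPG's internal code. */
final class ValidatorSketch {

    /** Luhn checksum; spaces and dashes between digits are ignored. */
    static boolean luhnValid(String candidate) {
        int sum = 0;
        int digits = 0;
        boolean doubleIt = false; // every second digit, counted from the right, is doubled
        for (int i = candidate.length() - 1; i >= 0; i--) {
            char c = candidate.charAt(i);
            if (c == ' ' || c == '-') continue;
            if (c < '0' || c > '9') return false;
            int d = c - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) d -= 9;
            }
            sum += d;
            doubleIt = !doubleIt;
            digits++;
        }
        return digits >= 12 && sum % 10 == 0; // 12 is an assumed sanity bound
    }

    /** Shannon entropy in bits per character; random-looking keys score high. */
    static double shannonEntropy(String s) {
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : s.toCharArray()) counts.merge(c, 1, Integer::sum);
        double entropy = 0.0;
        for (int n : counts.values()) {
            double p = (double) n / s.length();
            entropy -= p * (Math.log(p) / Math.log(2));
        }
        return entropy;
    }

    public static void main(String[] args) {
        System.out.println(luhnValid("4532 0151 1283 0366")); // true
        System.out.println(luhnValid("4532 0151 1283 0367")); // false
        System.out.println(shannonEntropy("AKIAIOSFODNN7EXAMPLE")
                > shannonEntropy("aaaaaaaaaaaaaaaaaaaa"));    // true
    }
}
```

A regex alone would accept any 16-digit run as a card number. The checksum rejects most random digit strings, and the entropy score helps separate key-like strings from ordinary identifiers.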

---

## API Reference

### `SemanticPrivacyGuard.create()`

```java
SemanticPrivacyGuard spg = SemanticPrivacyGuard.create(); // defaults
SemanticPrivacyGuard spg = SemanticPrivacyGuard.create(config); // custom
```

### `redact(String text)` → `RedactionResult`

Full detection + replacement pass. Returns `getRedactedText()`, `getMatches()`, `getReverseMap()` (token → original, for post-LLM de-tokenisation), `getMatchCount()`, and `getProcessingTimeMs()`.
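
As an illustration of post-LLM de-tokenisation, the sketch below restores original values in a model response from a token-to-original map. It assumes the reverse map is a `Map<String, String>` keyed by the bracketed token, as the printed example in the JSON section suggests; the helper class and its regex are hypothetical and not part of SPG.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: puts original values back into an LLM response. */
final class DetokenizeSketch {
    // Matches tokens shaped like [EMAIL_1] or [PERSON_NAME_12].
    private static final Pattern TOKEN = Pattern.compile("\\[[A-Z_]+_\\d+\\]");

    static String restore(String llmResponse, Map<String, String> reverseMap) {
        Matcher m = TOKEN.matcher(llmResponse);
        StringBuilder restored = new StringBuilder();
        while (m.find()) {
            // Unknown tokens are left untouched rather than dropped.
            String original = reverseMap.getOrDefault(m.group(), m.group());
            m.appendReplacement(restored, Matcher.quoteReplacement(original));
        }
        m.appendTail(restored);
        return restored.toString();
    }

    public static void main(String[] args) {
        Map<String, String> reverseMap = Map.of(
                "[PERSON_NAME_1]", "Alice Johnson",
                "[EMAIL_1]", "alice@example.com");
        System.out.println(restore("Dear [PERSON_NAME_1], we wrote to [EMAIL_1].", reverseMap));
        // Dear Alice Johnson, we wrote to alice@example.com.
    }
}
```

A single pass over the response avoids re-scanning text that has already been restored, which matters if an original value happens to look like a token.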

### `containsPII(String text)` → `boolean`

Fast pre-flight check (~30% faster than `redact()`) for yes/no answers.

### `analyse(String text)` → `List`

Detection without redaction — for audit and reporting pipelines.

### `redactJson(String json)` → `StructuredRedactionOutput`

Redacts PII inside a JSON document. String values are replaced in-place; keys, numbers, booleans, and arrays are preserved. Throws `UnsupportedOperationException` if `jackson-databind` is not on the classpath. Throws `IOException` for malformed JSON.

### `redactXml(String xml)` → `StructuredRedactionOutput`

Redacts PII inside an XML document. Text nodes and attribute values are replaced in-place; element names, structure, and non-string content are preserved. No extra dependency required (uses JDK `javax.xml`). Throws `IOException` for malformed XML.

### `SPGConfig.Builder.addPattern(PIIType, String regex, double confidence, String description)` → `Builder`

Registers a custom regex pattern applied by `HeuristicDetector` after all built-in patterns. Multiple calls accumulate. The 3-arg overload omits the description (defaults to the regex string).

### Stream methods

```java
// InputStream / OutputStream (UTF-8)
StreamRedactionSummary redactStream(InputStream in, OutputStream out)

// Reader / Writer
StreamRedactionSummary redactStream(Reader reader, Writer writer)

// File-to-file
StreamRedactionSummary redactPath(Path inputFile, Path outputFile)

// Access the full StreamProcessor for redactLines(Stream<String>)
StreamProcessor streamProcessor()
```

### Configuration

```java
SPGConfig config = SPGConfig.builder()
        .redactionMode(RedactionMode.TOKEN)     // TOKEN | MASK | BLANK
        .mlConfidenceThreshold(0.70)            // Naive Bayes threshold, default 0.65
        .nlpEnabled(true)                       // enable OpenNLP NER layer (opt-in)
        .nlpModelsDirectory(Path.of("..."))     // null = load from classpath
        .nlpConfidenceThreshold(0.75)           // OpenNLP min probability, default 0.70
        .enabledTypes(Set.of(PIIType.EMAIL,     // null / empty = all types
                             PIIType.SSN))
        .minimumSeverity(6)                     // 1–10; filter low-severity types
        .buildReverseMap(true)                  // disable for slight perf gain
        .heuristicEnabled(true)
        .mlEnabled(true)
        // Custom organisation-specific patterns (see Custom Pattern Registry below)
        .addPattern(PIIType.GENERIC_PII, "EMP-\\d{6}", 0.99, "Employee ID")
        .addPattern(PIIType.GENERIC_PII, "MRN-[A-Z0-9]{8}", 0.98, "Medical Record Number")
        .build();
```

### Redaction Modes

| Mode | Example output | Use case |
|---|---|---|
| `TOKEN` | `[EMAIL_1]` | LLM pipelines — structure preserved |
| `MASK` | `█████████████████` | Logs, audit trails |
| `BLANK` | `[REDACTED]` | Human-readable reports |
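
The three modes differ only in how a detected value is rendered. The sketch below is a hypothetical rendering function, not SPG's `PIITokenizer`; it assumes the mask is as long as the original value, which is how the example row above reads.

```java
/** Illustrative only: renders one detected value in each redaction mode. */
final class RedactionModeSketch {
    enum Mode { TOKEN, MASK, BLANK }

    static String render(Mode mode, String original, String type, int index) {
        return switch (mode) {
            case TOKEN -> "[" + type + "_" + index + "]";       // keeps type and ordinal for the LLM
            case MASK -> "\u2588".repeat(original.length());    // full-block characters, same length
            case BLANK -> "[REDACTED]";                         // no type information at all
        };
    }

    public static void main(String[] args) {
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " -> " + render(mode, "alice@example.com", "EMAIL", 1));
        }
    }
}
```

Only `TOKEN` output can be mapped back to the original value afterwards, since `MASK` and `BLANK` discard the information that identifies which value was replaced.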

---

## Custom Pattern Registry

Register organisation-specific identifiers that built-in heuristics don't cover — employee IDs, medical record numbers, internal reference codes, or any proprietary format.

```java
SPGConfig config = SPGConfig.builder()
        .addPattern(PIIType.GENERIC_PII, "EMP-\\d{6}", 0.99, "Employee ID")
        .addPattern(PIIType.GENERIC_PII, "MRN-[A-Z0-9]{8}", 0.98, "Medical Record Number")
        .addPattern(PIIType.GENERIC_PII, "POL-[A-Z]{2}-\\d{8}", 0.97, "Policy Number")
        .build();

SemanticPrivacyGuard spg = SemanticPrivacyGuard.create(config);

RedactionResult r = spg.redact(
        "Task EMP-042731 relates to policy POL-GB-00123456.");
// → "Task [PII_1] relates to policy [PII_2]."
```

Custom patterns are applied by `HeuristicDetector` after all built-in patterns, so built-in matches always win for overlapping spans. Token counters are document-scoped: two `EMP-` matches in the same call produce `[PII_1]` and `[PII_2]`, never two `[PII_1]` tokens.

Multiple calls to `.addPattern()` accumulate — they do not replace each other.
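
Before registering a pattern it can help to check it against a few known-good and known-bad samples. The helper below is a hypothetical pre-flight check that uses only `java.util.regex`; it is not part of SPG.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative only: sanity-checks a custom regex before it is registered. */
final class PatternCheckSketch {

    /** True if the regex matches every positive sample in full and none of the negatives. */
    static boolean behavesAsExpected(String regex,
                                     List<String> shouldMatch,
                                     List<String> shouldNotMatch) {
        Pattern p = Pattern.compile(regex); // throws PatternSyntaxException if malformed
        return shouldMatch.stream().allMatch(s -> p.matcher(s).matches())
                && shouldNotMatch.stream().noneMatch(s -> p.matcher(s).matches());
    }

    public static void main(String[] args) {
        System.out.println(behavesAsExpected("EMP-\\d{6}",
                List.of("EMP-042731"), List.of("EMP-42", "EMP-0427310")));   // true
        System.out.println(behavesAsExpected("POL-[A-Z]{2}-\\d{8}",
                List.of("POL-GB-00123456"), List.of("POL-gb-00123456")));    // true
    }
}
```

The check uses whole-string matching. If the detector scans running text with find-style matching (an assumption about its internals), `EMP-\\d{6}` would also hit the first six digits of `EMP-0427310`, so a trailing `\\b` may be worth adding to the registered pattern.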

---

## JSON / XML Redaction

Redact PII directly inside structured documents. Text values are replaced in-place; keys, numbers, booleans, and markup structure are preserved exactly.

### JSON

Requires `jackson-databind` on the classpath (not bundled — add it to your own `pom.xml`):

```xml
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.17.0</version>
</dependency>
```

```java
SemanticPrivacyGuard spg = SemanticPrivacyGuard.create();

StructuredRedactionOutput out = spg.redactJson("""
        {
          "name": "Alice Johnson",
          "email": "alice@example.com",
          "account": 12345
        }
        """);

System.out.println(out.getRedactedContent());
// → {"name":"[PERSON_NAME_1]","email":"[EMAIL_1]","account":12345}

System.out.println(out.getMatchCount()); // → 2
System.out.println(out.getReverseMap()); // → {[PERSON_NAME_1]=Alice Johnson, [EMAIL_1]=alice@example.com}
```

### XML

Uses the JDK built-in `javax.xml` — no extra dependency required. XXE injection is hardened by disabling DOCTYPE declarations and external entity loading.

```java
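// Element names below are illustrative
StructuredRedactionOutput out = spg.redactXml("""
        <?xml version="1.0" encoding="UTF-8"?>
        <customer>
          <name>Alice Johnson</name>
          <email>alice@example.com</email>
          <account>12345</account>
        </customer>
        """);

System.out.println(out.getRedactedContent());
// → <name> holds [PERSON_NAME_1], <email> holds [EMAIL_1]; <account>12345</account> is unchanged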
```
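
For readers who want to see what the XXE hardening mentioned above typically involves, here is a minimal sketch using the JDK's `DocumentBuilderFactory`. The feature URIs are the standard ones supported by the JDK's built-in parser; the class name and the exact set of switches are assumptions and may differ from what SPG configures internally.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Illustrative only: a DOM parser that refuses DOCTYPEs and external entities. */
final class HardenedXmlSketch {

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            // Reject any document that carries a DOCTYPE declaration.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            // Belt and braces: never resolve external entities.
            factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
            factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
            factory.setXIncludeAware(false);
            factory.setExpandEntityReferences(false);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("XML rejected: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("<a><b>x</b></a>").getDocumentElement().getTagName());
        try {
            parse("<!DOCTYPE a [<!ENTITY e SYSTEM \"file:///etc/passwd\">]><a>&e;</a>");
        } catch (IllegalArgumentException expected) {
            System.out.println("DOCTYPE rejected");
        }
    }
}
```

Rejecting DOCTYPE declarations outright is the bluntest of these switches and also rules out entity-expansion attacks, at the cost of refusing legitimate documents that declare a DTD.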

`StructuredRedactionOutput` fields:

| Method | Returns |
|---|---|
| `getRedactedContent()` | Redacted JSON or XML string |
| `getReverseMap()` | `Map<String, String>` token → original value |
| `getMatchCount()` | Total PII matches found |
| `hasPII()` | `true` if any PII was detected |

---

## Architecture

```
                 Input text
                      │
                      ▼
┌──────────────────────────────────────────────────┐
│  Layer 1: HeuristicDetector                      │
│  Regex patterns + Luhn checksum + entropy filter │
│  SSN, Email, Phone, CC, IPs, API Keys, Passwords │
└─────────────────────┬────────────────────────────┘
                      │
                      ▼
┌──────────────────────────────────────────────────┐
│  Layer 2: MLDetector                             │
│  Pure-Java Naive Bayes + FeatureExtractor        │
│  Person names, Organisations (context-aware)     │
└─────────────────────┬────────────────────────────┘
                      │
                      ▼
┌──────────────────────────────────────────────────┐
│  Layer 3: NLPDetector (optional, opt-in)         │
│  Apache OpenNLP NameFinderME (MaxEnt NER)        │
│  Multi-token person names, compound org names    │
└─────────────────────┬────────────────────────────┘
                      │
                      ▼
┌──────────────────────────────────────────────────┐
│  CompositeDetector                               │
│  De-duplicate, resolve overlaps, HYBRID merging  │
└─────────────────────┬────────────────────────────┘
                      │
                      ▼
┌──────────────────────────────────────────────────┐
│  PIITokenizer                                    │
│  TOKEN / MASK / BLANK + reverse map              │
└─────────────────────┬────────────────────────────┘
                      │
                      ▼
   RedactionResult / StreamRedactionSummary
```

For stream processing, `StreamProcessor` replaces the final step: lines are processed one at a time, redacted, and written immediately, keeping heap usage constant regardless of document size.

---

## Virtual Threads (Project Loom)

SPG is stateless and thread-safe by design. On Java 21+:

```java
// Handle 10,000 concurrent LLM prompts with zero contention
try (var exec = Executors.newVirtualThreadPerTaskExecutor()) {
    for (String prompt : promptBatch) {
        exec.submit(() -> {
            RedactionResult r = spg.redact(prompt);
            forwardToLLM(r.getRedactedText());
        });
    }
}
```

---

## Performance

| Approach | Throughput | False Positives |
|---|---|---|
| Naive regex (2 patterns) | 580,000 sentences/s | 60% of clean sentences |
| SPG Heuristic-only | 390,000 sentences/s | 20% |
| **SPG Full (H + ML)** | **206,000 sentences/s** | **0%** |
| SPG Full + NLP | ~45,000 sentences/s* | 0% |

\* NLP throughput depends on model size and JVM warmup. Stream processing throughput is I/O-bound rather than CPU-bound. See the [CI benchmark runs](https://github.com/Sushegaad/Semantic-Privacy-Guard/actions) for latest numbers.

---

## Building from Source

```bash
git clone https://github.com/Sushegaad/Semantic-Privacy-Guard.git
cd Semantic-Privacy-Guard

# Compile + test + coverage check (must be ≥ 80%)
mvn verify

# Run benchmarks
mvn test -P benchmark

# Build JAR only
mvn package -DskipTests
```

Requirements: JDK 17+ and Maven 3.8+.

---

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md). Contributions especially welcome for:

- Additional OpenNLP model integrations (dates, locations)
- Additional training examples for the Naive Bayes corpus
- New PII type patterns (medical codes, national IDs)
- Performance benchmarks against real-world log datasets

---

## Security

See [SECURITY.md](SECURITY.md) for the CVE response process and responsible disclosure policy.

The base library has zero runtime dependencies, which keeps its supply-chain attack surface to a minimum. OpenNLP is an optional dependency and is only loaded when explicitly configured. All regex patterns are validated against catastrophic backtracking (ReDoS).

---

## License

Apache License 2.0 — see [LICENSE](LICENSE).

Copyright 2026 Hemant Naik / Sushegaad