https://github.com/datafog/datafog-python
Open source PII detection and anonymization tool: easy-to-use, configurable, and extensible
- Host: GitHub
- URL: https://github.com/datafog/datafog-python
- Owner: DataFog
- License: mit
- Created: 2023-06-22T11:50:50.000Z (almost 2 years ago)
- Default Branch: dev
- Last Pushed: 2024-11-04T02:23:47.000Z (7 months ago)
- Last Synced: 2025-04-23T21:48:28.399Z (29 days ago)
- Topics: ai, data-anonymization, data-preprocessing, devsecaiops, llm-privacy, open-source, pii, pii-detection, privacy, privacy-protection, python
- Language: Python
- Homepage: https://www.datafog.ai
- Size: 78.4 MB
- Stars: 19
- Watchers: 1
- Forks: 5
- Open Issues: 5
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.MD
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Open-source PII Detection & Anonymization.
## Installation
DataFog can be installed via pip:
```bash
pip install datafog
```

# CLI
## 📚 Quick Reference
| Command | Description |
| ------------------- | ------------------------------------ |
| `scan-text` | Analyze text for PII |
| `scan-image` | Extract and analyze text from images |
| `redact-text` | Redact PII in text |
| `replace-text` | Replace PII with anonymized values |
| `hash-text` | Hash PII in text |
| `health` | Check service status |
| `show-config` | Display current settings |
| `download-model` | Get a specific spaCy model |
| `list-spacy-models` | Show available models |
| `list-entities` | View supported PII entities |

---
## 🔍 Detailed Usage
### Scanning Text
To scan and annotate text for PII entities:
```bash
datafog scan-text "Your text here"
```

**Example:**
```bash
datafog scan-text "Tim Cook is the CEO of Apple and is based out of Cupertino, California"
```

### Scanning Images
To extract text from images and optionally perform PII annotation:
```bash
datafog scan-image "path/to/image.png" --operations extract
```

**Example:**
```bash
datafog scan-image "nokia-statement.png" --operations extract
```

To extract text and annotate PII:
```bash
datafog scan-image "nokia-statement.png" --operations scan
```

### Redacting Text
To redact PII in text:
```bash
datafog redact-text "Tim Cook is the CEO of Apple and is based out of Cupertino, California"
```

This should output:

```
[REDACTED] is the CEO of [REDACTED] and is based out of [REDACTED], [REDACTED]
```

### Replacing Text
To replace detected PII:
```bash
datafog replace-text "Tim Cook is the CEO of Apple and is based out of Cupertino, California"
```

This should return something like:

```
[PERSON_B86CACE6] is the CEO of [UNKNOWN_445944D7] and is based out of [UNKNOWN_32BA5DCA], [UNKNOWN_B7DF4969]
```

Note: a unique, randomly generated identifier is created for each detected entity.
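The placeholder format can be sketched in plain Python. This is an illustration only, not DataFog's implementation: the `make_placeholder` helper is hypothetical, and the eight-character uppercase hex suffix is an assumption inferred from the sample output above.

```python
import secrets


def make_placeholder(entity_type: str) -> str:
    """Build a placeholder such as [PERSON_1A2B3C4D] (hypothetical helper)."""
    # token_hex(4) returns 8 random hexadecimal characters.
    suffix = secrets.token_hex(4).upper()
    return f"[{entity_type.upper()}_{suffix}]"


placeholder = make_placeholder("person")
print(placeholder)
```

Because the suffix is random, running the sketch twice on the same input gives different placeholders, which is consistent with the "something like" wording of the sample output.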
### Hashing Text
You can hash detected PII with SHA256 (the default), SHA3-256, or MD5. For privacy reasons, the hashed output does not currently preserve the length of the original entity.
```bash
datafog hash-text "Tim Cook is the CEO of Apple and is based out of Cupertino, California"
```

This generates output that looks like this:

```
5738a37f0af81594b8a8fd677e31b5e2cabd6d7791c89b9f0a1c233bb563ae39 is the CEO of f223faa96f22916294922b171a2696d868fd1f9129302eb41a45b2a2ea2ebbfd and is based out of ab5f41f04096cf7cd314357c4be26993eeebc0c094ca668506020017c35b7a9c, cad0535decc38b248b40e7aef9a1cfd91ce386fa5c46f05ea622649e7faf18fb
```

### Utility Commands
#### 🏥 Health Check
```bash
datafog health
```

#### ⚙️ Show Configuration
```bash
datafog show-config
```

#### 📥 Download Model
```bash
datafog download-model en_core_web_sm
```

#### 📂 Show Model Directory
```bash
datafog show-spacy-model-directory en_core_web_sm
```

#### 📋 List Models
```bash
datafog list-spacy-models
```

#### 🏷️ List Entities
```bash
datafog list-entities
```

---
## ⚠️ Important Notes
- For `scan-image` and `scan-text` commands, use `--operations` to specify different operations. Default is `scan`.
- Process multiple images or text strings in a single command by providing multiple arguments.
- Ensure proper permissions and configuration of the DataFog service before running commands.

---
💡 **Tip:** For more detailed information on each command, use the `--help` option, e.g., `datafog scan-text --help`.
# Python SDK
## Getting Started
To use DataFog, you'll need to create a DataFog client with the desired operations. Here's a basic setup:
```python
from datafog import DataFog

# For text annotation
client = DataFog(operations="scan")

# For OCR (Optical Character Recognition)
ocr_client = DataFog(operations="extract")
```

## Text PII Annotation
Here's an example of how to annotate PII in a text document:
```python
import requests

# Fetch sample medical record
doc_url = "https://gist.githubusercontent.com/sidmohan0/b43b72693226422bac5f083c941ecfdb/raw/b819affb51796204d59987893f89dee18428ed5d/note1.txt"
response = requests.get(doc_url)
text_lines = [line for line in response.text.splitlines() if line.strip()]

# Run annotation
annotations = client.run_text_pipeline_sync(str_list=text_lines)
print(annotations)
```

## OCR PII Annotation
For OCR capabilities, you can use the following:
```python
import asyncio
import nest_asyncio

# Allow the event loop to be re-entered (needed in notebooks, where a loop is already running)
nest_asyncio.apply()


async def run_ocr_pipeline_demo():
    image_url = "https://s3.amazonaws.com/thumbnails.venngage.com/template/dc377004-1c2d-49f2-8ddf-d63f11c8d9c2.png"
    results = await ocr_client.run_ocr_pipeline(image_urls=[image_url])
    print("OCR Pipeline Results:", results)


loop = asyncio.get_event_loop()
loop.run_until_complete(run_ocr_pipeline_demo())
```

Note: The DataFog library uses asynchronous programming for OCR, so make sure to use the `async`/`await` syntax when calling the appropriate methods.
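For readers new to `async`/`await`, the calling pattern can be tried without DataFog installed. In the minimal sketch below, `fake_ocr_pipeline` is a hypothetical stand-in for an asynchronous OCR call; it is not part of the DataFog API and only returns placeholder strings.

```python
import asyncio


async def fake_ocr_pipeline(image_urls):
    """Hypothetical stand-in for an async OCR call; not a DataFog function."""
    await asyncio.sleep(0)  # yield control once, as a real network call would
    return [f"text extracted from {url}" for url in image_urls]


async def main():
    # A coroutine must be awaited from inside another coroutine.
    return await fake_ocr_pipeline(["https://example.com/a.png"])


# In a plain script, asyncio.run() starts an event loop and runs main() to completion.
results = asyncio.run(main())
print(results)
```

`asyncio.run()` raises an error when an event loop is already running, as in a Jupyter notebook. That is the situation the `nest_asyncio` approach above addresses.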
## Text Anonymization
DataFog provides various anonymization techniques to protect sensitive information. Here are examples of how to use them:
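Each technique below rewrites the character spans where PII was detected. As a rough illustration of that mechanic, here is a minimal sketch of redaction over known spans. It is not DataFog's implementation: the `redact_spans` helper is hypothetical, and the spans are located with a plain string search rather than a PII detector.

```python
def redact_spans(text, spans, mask="[REDACTED]"):
    """Replace each (start, end) span with a mask (hypothetical helper)."""
    # Work from the rightmost span so earlier offsets stay valid after each replacement.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + mask + text[end:]
    return text


text = "Tim Cook is the CEO of Apple and is based out of Cupertino, California"
# Stand-in for detector output: locate each entity by simple substring search.
entities = ["Tim Cook", "Apple", "Cupertino", "California"]
spans = [(text.index(e), text.index(e) + len(e)) for e in entities]

redacted = redact_spans(text, spans)
print(redacted)
```

Processing spans from right to left avoids recomputing offsets when the mask and the original entity differ in length.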
### Redacting Text
To redact PII in text:
```python
from datafog import DataFog
from datafog.config import OperationType

client = DataFog(operations=[OperationType.SCAN, OperationType.REDACT])
text = "Tim Cook is the CEO of Apple and is based out of Cupertino, California"
redacted_text = client.run_text_pipeline_sync([text])[0]
print(redacted_text)
```

Output:
```
[REDACTED] is the CEO of [REDACTED] and is based out of [REDACTED], [REDACTED]
```

### Replacing Text
To replace detected PII with unique identifiers:
```python
from datafog import DataFog
from datafog.config import OperationType

client = DataFog(operations=[OperationType.SCAN, OperationType.REPLACE])
text = "Tim Cook is the CEO of Apple and is based out of Cupertino, California"
replaced_text = client.run_text_pipeline_sync([text])[0]
print(replaced_text)
```

Output:
```
[PERSON_B86CACE6] is the CEO of [UNKNOWN_445944D7] and is based out of [UNKNOWN_32BA5DCA], [UNKNOWN_B7DF4969]
```

### Hashing Text
To hash detected PII:
```python
from datafog import DataFog
from datafog.config import OperationType
from datafog.models.anonymizer import HashType

client = DataFog(operations=[OperationType.SCAN, OperationType.HASH], hash_type=HashType.SHA256)
text = "Tim Cook is the CEO of Apple and is based out of Cupertino, California"
hashed_text = client.run_text_pipeline_sync([text])[0]
print(hashed_text)
```

Output:
```
5738a37f0af81594b8a8fd677e31b5e2cabd6d7791c89b9f0a1c233bb563ae39 is the CEO of f223faa96f22916294922b171a2696d868fd1f9129302eb41a45b2a2ea2ebbfd and is based out of ab5f41f04096cf7cd314357c4be26993eeebc0c094ca668506020017c35b7a9c, cad0535decc38b248b40e7aef9a1cfd91ce386fa5c46f05ea622649e7faf18fb
```

You can choose from SHA256 (default), SHA3-256, and MD5 hashing algorithms by specifying the `hash_type` parameter.
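To see how the three algorithms differ in output, the sketch below hashes one entity with Python's standard `hashlib`. It is an illustration under assumptions, not DataFog's code: the `hash_entity` helper and its algorithm names are hypothetical, and DataFog may encode or preprocess values differently, so the digests are not guaranteed to match the sample output above.

```python
import hashlib

# Map illustrative algorithm names to hashlib constructors.
HASHERS = {
    "sha256": hashlib.sha256,
    "sha3_256": hashlib.sha3_256,
    "md5": hashlib.md5,
}


def hash_entity(value: str, algorithm: str = "sha256") -> str:
    """Return the hex digest of a UTF-8 encoded string (hypothetical helper)."""
    return HASHERS[algorithm](value.encode("utf-8")).hexdigest()


for name in HASHERS:
    digest = hash_entity("Tim Cook", name)
    # Digest length depends only on the algorithm, never on the input length.
    print(name, len(digest), digest)
```

SHA256 and SHA3-256 both produce 64 hexadecimal characters and MD5 produces 32, so the output length reveals nothing about the length of the original entity. MD5 is not collision-resistant and is generally a poor choice where security matters.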
## Examples
For more detailed examples, check out our Jupyter notebooks in the `examples/` directory:
- `text_annotation_example.ipynb`: Demonstrates text PII annotation
- `image_processing.ipynb`: Shows OCR capabilities and text extraction from images

These notebooks provide step-by-step guides on how to use DataFog for various tasks.
### Dev Notes
For local development:
1. Clone the repository.
2. Navigate to the project directory:
   ```
   cd datafog-python
   ```
3. Create a new virtual environment (using `.venv` is recommended as it is hardcoded in the justfile):
   ```
   python -m venv .venv
   ```
4. Activate the virtual environment:
   - On Windows:
     ```
     .venv\Scripts\activate
     ```
   - On macOS/Linux:
     ```
     source .venv/bin/activate
     ```
5. Install the development dependencies:
   ```
   pip install -r requirements-dev.txt
   ```
6. Set up the project:
   ```
   just setup
   ```

Now, you can develop and run the project locally.
#### Important Actions:
- **Format the code**:
  ```
  just format
  ```
  This runs `isort` to sort imports.
- **Lint the code**:
  ```
  just lint
  ```
  This runs `flake8` to check for linting errors.
- **Generate coverage report**:
  ```
  just coverage-html
  ```
  This runs `pytest` and generates a coverage report in the `htmlcov/` directory.

We use [pre-commit](https://marketplace.visualstudio.com/items?itemName=elagil.pre-commit-helper) to run checks locally before committing changes. Once installed, you can run:
```
pre-commit run --all-files
```

#### Dependencies
For OCR, we use Tesseract, which is incorporated into the build step. You can find the relevant configurations under `.github/workflows/` in the following files:
- `dev-cicd.yml`
- `feature-cicd.yml`
- `main-cicd.yml`

### Testing
- Python 3.10
## License
This software is published under the [MIT license](https://en.wikipedia.org/wiki/MIT_License).