https://github.com/nxank4/loclean
⚡️ The All-in-One Local AI Data Cleaning Library. No GPU or API keys required.
- Host: GitHub
- URL: https://github.com/nxank4/loclean
- Owner: nxank4
- License: apache-2.0
- Created: 2025-12-11T15:31:18.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2026-01-13T17:00:17.000Z (5 months ago)
- Last Synced: 2026-01-13T22:51:24.551Z (5 months ago)
- Topics: automated-cleaning, data, data-cleaning, data-engineering, data-preprocessing, data-science, data-wrangling, etl, llm, normalization, open-source, polars, privacy-preserving, python, semantic-analysis, slm, structured-data
- Language: Python
- Homepage: https://nxank4.github.io/loclean/
- Size: 477 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Codeowners: .github/CODEOWNERS
- Security: SECURITY.md
README
The All-in-One Local AI Data Cleaner.
# Why Loclean?
> 📚 **Documentation:** [nxank4.github.io/loclean](https://nxank4.github.io/loclean)
Loclean bridges the gap between **Data Engineering** and **Local AI**, designed for production pipelines where privacy and stability are non-negotiable.
## Privacy-First & Zero Cost
Leverage the power of Small Language Models (SLMs) like **Phi-3** and **Llama-3** running locally via `llama.cpp`. Clean sensitive PII, medical records, or proprietary data without a single byte leaving your infrastructure.
## Deterministic Outputs
Forget about "hallucinations" or parsing loose text. Loclean uses **GBNF Grammars** and **Pydantic V2** to force the LLM to output valid, type-safe JSON. If it breaks the schema, it doesn't pass.
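To make the grammar idea concrete, here is a minimal sketch of turning a flat field-to-type mapping into a GBNF grammar string. It is not Loclean's generator: `schema_to_gbnf`, `key_literal`, and `PRIMITIVE_RULES` are hypothetical names, only `str` and `int` fields are handled, and the string rule is simplified (no escape sequences).

```python
# Hypothetical sketch: build a GBNF grammar for a flat JSON object.
# Not Loclean's implementation; only str and int fields are supported.

# GBNF rule name and rule definition for each supported Python type.
PRIMITIVE_RULES = {
    str: ("string", r'string ::= "\"" [^"\\]* "\""'),  # simplified: no escapes
    int: ("integer", 'integer ::= "-"? [0-9]+'),
}


def key_literal(name: str) -> str:
    """Return the GBNF literal that matches the JSON key, quotes included."""
    return '"\\"' + name + '\\""'


def schema_to_gbnf(fields: dict[str, type]) -> str:
    """Emit a grammar that accepts exactly these keys, in this order."""
    pairs = []
    rules = []
    for name, py_type in fields.items():
        rule_name, rule_def = PRIMITIVE_RULES[py_type]
        pairs.append(f'{key_literal(name)} ws ":" ws {rule_name}')
        if rule_def not in rules:  # emit each primitive rule only once
            rules.append(rule_def)
    body = ' ws "," ws '.join(pairs)
    root = f'root ::= "{{" ws {body} ws "}}"'
    return "\n".join([root, *rules, r"ws ::= [ \t\n]*"])


grammar = schema_to_gbnf({"name": str, "price": int, "color": str})
print(grammar)
```

With `llama-cpp-python`, a string like this can be compiled with `LlamaGrammar.from_string(grammar)` and passed as the `grammar` argument at generation time, so the sampler can only emit tokens the grammar allows.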
## Structured Extraction with Pydantic
Extract structured data from unstructured text with guaranteed schema compliance:
```python
from pydantic import BaseModel
import polars as pl
import loclean


class Product(BaseModel):
    name: str
    price: int
    color: str


# Extract from text
item = loclean.extract("Selling red t-shirt for 50k", schema=Product)
print(item.name)  # "t-shirt"
print(item.price)  # 50000

# Extract from DataFrame (default: structured dict for performance)
df = pl.DataFrame({"description": ["Selling red t-shirt for 50k"]})
result = loclean.extract(df, schema=Product, target_col="description")

# Query with Polars Struct (vectorized operations)
result.filter(pl.col("description_extracted").struct.field("price") > 50000)
```
```
The `extract()` function ensures 100% compliance with your Pydantic schema through:
- **Dynamic GBNF Grammar Generation**: Automatically converts Pydantic schemas to GBNF grammars
- **JSON Repair**: Automatically fixes malformed JSON output from LLMs
- **Retry Logic**: Retries with adjusted prompts when validation fails
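The repair and retry steps listed above can be sketched roughly as follows. This is an illustration, not Loclean's code: `generate` is a hypothetical callable standing in for the model call, `repair_json` handles only two common defects, and a plain `isinstance` check stands in for Pydantic validation so the sketch needs nothing beyond the standard library.

```python
import json
import re
from typing import Callable


def repair_json(text: str) -> str:
    """Keep the outermost {...} span and drop trailing commas (naive)."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    # Naive: would also alter a ",}" sequence inside a string value.
    return re.sub(r",\s*([}\]])", r"\1", text[start : end + 1])


def validate(data: dict, fields: dict[str, type]) -> dict:
    """Stand-in for schema validation: every field present with the right type."""
    for name, py_type in fields.items():
        if name not in data:
            raise ValueError(f"missing field {name!r}")
        if not isinstance(data[name], py_type):
            raise ValueError(f"field {name!r} is not {py_type.__name__}")
    return data


def extract_with_retry(
    generate: Callable[[str], str],
    prompt: str,
    fields: dict[str, type],
    max_attempts: int = 3,
) -> dict:
    """Call the model, repair and validate, and re-prompt on failure."""
    current_prompt = prompt
    last_error = None
    for _ in range(max_attempts):
        raw = generate(current_prompt)
        try:
            return validate(json.loads(repair_json(raw)), fields)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError
            last_error = exc
            current_prompt = (
                f"{prompt}\n\nThe previous answer was rejected ({exc}). "
                "Reply with JSON only."
            )
    raise ValueError(f"no valid output after {max_attempts} attempts") from last_error


# Demo with canned replies instead of a real model.
replies = iter([
    'Sure! {"name": "t-shirt", "price": "50k", "color": "red",}',
    '{"name": "t-shirt", "price": 50000, "color": "red",}',
])
seen_prompts = []


def fake_generate(prompt: str) -> str:
    seen_prompts.append(prompt)
    return next(replies)


result = extract_with_retry(
    fake_generate,
    "Selling red t-shirt for 50k",
    {"name": str, "price": int, "color": str},
)
print(result)
```

In the demo, the first reply is repaired into valid JSON but fails the type check on `price`, so the loop re-prompts once and accepts the second reply.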
## Backend Agnostic (Zero-Copy)
Built on **Narwhals**, Loclean supports **Pandas**, **Polars**, and **PyArrow** natively.
* Running Polars? We keep it lazy.
* Running Pandas? We handle it seamlessly.
* **No heavy dependency lock-in.**
# Installation
## Requirements
* Python 3.10, 3.11, 3.12, or 3.13
* No GPU required (runs on CPU by default)
## Basic Installation
**Using pip (recommended):**
```bash
pip install loclean
```
The basic installation includes **local inference** support (via `llama-cpp-python`).
> **📦 Installation Notice:**
> - **Fast (30-60 seconds):** Pre-built wheels are available for most platforms (Linux x86_64, macOS, Windows)
> - **Slow (5-10 minutes):** If you see "Building wheels for collected packages: llama-cpp-python", it's building from source. This is **normal** and only happens when no pre-built wheel is available for your platform. Please be patient - this is not an error!
>
> **💡 To ensure fast installation:**
> ```bash
> pip install --upgrade pip setuptools wheel
> pip install loclean
> ```
> This ensures pip can find and use pre-built wheels when available.
**Using uv (alternative, often faster):**
```bash
uv pip install loclean
```
**Using conda/mamba:**
```bash
conda install -c conda-forge loclean
# or
mamba install -c conda-forge loclean
```
## Optional Dependencies
The basic installation includes local inference support. Loclean uses **Narwhals** for backend-agnostic DataFrame operations, so if you already have **Pandas**, **Polars**, or **PyArrow** installed, the basic installation is sufficient.
**Install DataFrame libraries (if not already present):**
If you don't have any DataFrame library installed, or want to ensure you have all supported backends:
```bash
pip install "loclean[data]"
```
This installs: `pandas>=2.3.3`, `polars>=0.20.0`, `pyarrow>=22.0.0`
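To see whether the `data` extra is needed at all, a quick check of which supported backends are already importable might look like this. `installed_backends` is a hypothetical helper; it uses only the standard library and does not import the backends themselves.

```python
from importlib import metadata, util

BACKENDS = ("pandas", "polars", "pyarrow")


def installed_backends(names: tuple[str, ...] = BACKENDS) -> dict[str, str]:
    """Map each importable package name to its installed version string."""
    found = {}
    for name in names:
        if util.find_spec(name) is None:
            continue  # not importable in this environment
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = "unknown"  # importable, but no distribution metadata
    return found


print(installed_backends() or "No DataFrame backend found")
```

If the result is empty, installing the `data` extra (or any one of the three libraries) gives Loclean a backend to work with.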
**For Cloud API support (OpenAI, Anthropic, Gemini):**
Cloud API support is planned for a future release; only local inference is available today. The `cloud` extra can already be installed:
```bash
pip install "loclean[cloud]"
```
**Install all optional dependencies:**
```bash
pip install "loclean[all]"
```
This installs both `loclean[data]` and `loclean[cloud]`. Useful for production environments where you want all features available.
> **Note for developers:** If you're contributing to Loclean, use the [Development Installation](#development-installation) section below (git clone + `uv sync --dev`), not `loclean[all]`.
## Development Installation
To contribute or run tests locally:
```bash
# Clone the repository
git clone https://github.com/nxank4/loclean.git
cd loclean
# Install with development dependencies (using uv)
uv sync --dev
# Or using pip
pip install -e ".[dev]"
```
# Model Management
Loclean automatically downloads models on first use, but you can pre-download them using the CLI:
```bash
# Download a specific model
loclean model download --name phi-3-mini
# List available models
loclean model list
# Check download status
loclean model status
```
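When several models need to be fetched ahead of time (for example while building a container image), the download command above can be scripted. The sketch below is an assumption-laden convenience wrapper: `download_command` and `predownload` are hypothetical names, and the only CLI call used is the `loclean model download --name` form shown above.

```python
import subprocess
from typing import Callable, Iterable


def download_command(model: str) -> list[str]:
    """Build the argv for the documented download command."""
    return ["loclean", "model", "download", "--name", model]


def predownload(models: Iterable[str], run: Callable = subprocess.run) -> None:
    """Download each model in turn; stop at the first failing command."""
    for model in models:
        run(download_command(model), check=True)


if __name__ == "__main__":
    # Model names taken from the "Available Models" list in this README.
    predownload(["phi-3-mini", "tinyllama"])
```

The `run` parameter exists so the loop can be exercised without invoking the CLI, for instance by passing a function that only records the commands.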
## Available Models
- **phi-3-mini**: Microsoft Phi-3 Mini (3.8B, 4K context) - Default, balanced
- **tinyllama**: TinyLlama 1.1B - Smallest, fastest
- **gemma-2b**: Google Gemma 2B Instruct - Balanced performance
- **qwen3-4b**: Qwen3 4B - Higher quality
- **gemma-3-4b**: Gemma 3 4B - Larger context
- **deepseek-r1**: DeepSeek R1 - Reasoning model
- **lfm2.5**: Liquid LFM2.5-1.2B Instruct (1.17B, 32K context) - Best-in-class 1B scale, optimized for agentic tasks and data extraction
Models are cached in `~/.cache/loclean` by default. You can specify a custom cache directory using the `--cache-dir` option.
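To check what has been cached and how much disk it uses, the directory can be inspected directly. This sketch assumes only the default location stated above; `cached_files` is a hypothetical helper, and nothing is assumed about how files are named or laid out inside the cache.

```python
from pathlib import Path

DEFAULT_CACHE = Path.home() / ".cache" / "loclean"


def cached_files(cache_dir: Path = DEFAULT_CACHE) -> list[tuple[str, int]]:
    """Return (relative path, size in bytes) for every file under the cache."""
    if not cache_dir.is_dir():
        return []  # nothing downloaded yet, or a custom cache dir is in use
    return sorted(
        (str(path.relative_to(cache_dir)), path.stat().st_size)
        for path in cache_dir.rglob("*")
        if path.is_file()
    )


for name, size in cached_files():
    print(f"{size / 1_000_000:10.1f} MB  {name}")
```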
# Quick Start
Loclean is best learned by example. We provide a set of Jupyter notebooks to help you get started:
- **[01-quick-start.ipynb](examples/01-quick-start.ipynb)**: Core features, structured extraction, and Privacy Scrubbing.
- **[02-data-cleaning.ipynb](examples/02-data-cleaning.ipynb)**: Comprehensive data cleaning strategies.
- **[03-privacy-scrubbing.ipynb](examples/03-privacy-scrubbing.ipynb)**: Deep dive into PII redaction.
Check out the **[examples/](examples/)** directory for more details.
# Contributing
We love contributions! Loclean is strictly open-source under the **Apache 2.0 License**.
Please read our **[Contributing Guide](CONTRIBUTION.md)** for details on how to set up your development environment, run tests, and submit Pull Requests.
_Built for the Data Community._