{"id":35137904,"url":"https://github.com/nxank4/loclean","last_synced_at":"2026-01-22T18:44:59.038Z","repository":{"id":328524599,"uuid":"1114586511","full_name":"nxank4/loclean","owner":"nxank4","description":"⚡️ The All-in-One Local AI Data Cleaning Library. No GPU or API keys required.","archived":false,"fork":false,"pushed_at":"2026-01-13T17:00:17.000Z","size":488,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-01-13T22:51:24.551Z","etag":null,"topics":["automated-cleaning","data","data-cleaning","data-engineering","data-preprocessing","data-science","data-wrangling","etl","llm","normalization","open-source","polars","privacy-preserving","python","semantic-analysis","slm","structured-data"],"latest_commit_sha":null,"homepage":"https://nxank4.github.io/loclean/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nxank4.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-11T15:31:18.000Z","updated_at":"2026-01-13T17:00:23.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/nxank4/loclean","commit_stats":null,"previous_names":["nxank4/semantix","nxank4/loclean"],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/nxank4/loclean","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nxank4%2Floclean","tags_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nxank4%2Floclean/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nxank4%2Floclean/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nxank4%2Floclean/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nxank4","download_url":"https://codeload.github.com/nxank4/loclean/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nxank4%2Floclean/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28668315,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-22T17:07:18.858Z","status":"ssl_error","status_checked_at":"2026-01-22T17:05:02.040Z","response_time":144,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automated-cleaning","data","data-cleaning","data-engineering","data-preprocessing","data-science","data-wrangling","etl","llm","normalization","open-source","polars","privacy-preserving","python","semantic-analysis","slm","structured-data"],"created_at":"2025-12-28T10:15:48.232Z","updated_at":"2026-01-22T18:44:59.017Z","avatar_url":"https://github.com/nxank4.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003ca 
href=\"https://github.com/nxank4/loclean\"\u003e\n    \u003cpicture\u003e\n      \u003csource srcset=\"assets/dark-loclean.svg\" media=\"(prefers-color-scheme: dark)\"\u003e\n      \u003csource srcset=\"assets/light-loclean.svg\" media=\"(prefers-color-scheme: light)\"\u003e\n      \u003cimg src=\"assets/light-loclean.svg\" alt=\"Loclean logo\" width=\"200\" height=\"200\"\u003e\n    \u003c/picture\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003eThe All-in-One Local AI Data Cleaner.\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://pypi.org/project/loclean\"\u003e\u003cimg src=\"https://img.shields.io/pypi/v/loclean?color=blue\u0026style=flat-square\" alt=\"PyPI\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://pypi.org/project/loclean\"\u003e\u003cimg src=\"https://img.shields.io/pypi/pyversions/loclean?style=flat-square\" alt=\"Python Versions\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/nxank4/loclean/actions/workflows/ci.yml\"\u003e\u003cimg src=\"https://github.com/nxank4/loclean/actions/workflows/ci.yml/badge.svg\" alt=\"CI Status\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/nxank4/loclean/blob/main/LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/github/license/nxank4/loclean?style=flat-square\" alt=\"License\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/astral-sh/uv\"\u003e\u003cimg src=\"https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json\" alt=\"uv\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://nxank4.github.io/loclean\"\u003e\u003cimg src=\"https://img.shields.io/badge/docs-loclean-blue?style=flat-square\" alt=\"Documentation\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n# Why Loclean?\n\n\u003e 📚 **Documentation:** [nxank4.github.io/loclean](https://nxank4.github.io/loclean)\n\nLoclean bridges the gap between **Data Engineering** and **Local AI**, designed for production pipelines where privacy and stability are 
non-negotiable.\n\n## Privacy-First \u0026 Zero Cost\n\nLeverage the power of Small Language Models (SLMs) like **Phi-3** and **Llama-3** running locally via `llama.cpp`. Clean sensitive PII, medical records, or proprietary data without a single byte leaving your infrastructure.\n\n## Deterministic Outputs\n\nForget about \"hallucinations\" or parsing loose text. Loclean uses **GBNF Grammars** and **Pydantic V2** to force the LLM to output valid, type-safe JSON. If it breaks the schema, it doesn't pass.\n\n## Structured Extraction with Pydantic\n\nExtract structured data from unstructured text with guaranteed schema compliance:\n\n```python\nfrom pydantic import BaseModel\nimport loclean\n\nclass Product(BaseModel):\n    name: str\n    price: int\n    color: str\n\n# Extract from text\nitem = loclean.extract(\"Selling red t-shirt for 50k\", schema=Product)\nprint(item.name)  # \"t-shirt\"\nprint(item.price)  # 50000\n\n# Extract from DataFrame (default: structured dict for performance)\nimport polars as pl\ndf = pl.DataFrame({\"description\": [\"Selling red t-shirt for 50k\"]})\nresult = loclean.extract(df, schema=Product, target_col=\"description\")\n\n# Query with Polars Struct (vectorized operations)\nresult.filter(pl.col(\"description_extracted\").struct.field(\"price\") \u003e 50000)\n```\n\nThe `extract()` function ensures 100% compliance with your Pydantic schema through:\n- **Dynamic GBNF Grammar Generation**: Automatically converts Pydantic schemas to GBNF grammars\n- **JSON Repair**: Automatically fixes malformed JSON output from LLMs\n- **Retry Logic**: Retries with adjusted prompts when validation fails\n\n## Backend Agnostic (Zero-Copy)\n\nBuilt on **Narwhals**, Loclean supports **Pandas**, **Polars**, and **PyArrow** natively.\n\n* Running Polars? We keep it lazy.\n* Running Pandas? 
We handle it seamlessly.\n* **No heavy dependency lock-in.**\n\n# Installation\n\n## Requirements\n\n* Python 3.10, 3.11, 3.12, or 3.13\n* No GPU required (runs on CPU by default)\n\n## Basic Installation\n\n**Using pip (recommended):**\n\n```bash\npip install loclean\n```\n\nThe basic installation includes **local inference** support (via `llama-cpp-python`). \n\n\u003e **📦 Installation Notice:** \n\u003e - **Fast (30-60 seconds):** Pre-built wheels are available for most platforms (Linux x86_64, macOS, Windows)\n\u003e - **Slow (5-10 minutes):** If you see \"Building wheels for collected packages: llama-cpp-python\", it's building from source. This is **normal** and only happens when no pre-built wheel is available for your platform. Please be patient - this is not an error!\n\u003e \n\u003e **💡 To ensure fast installation:**\n\u003e ```bash\n\u003e pip install --upgrade pip setuptools wheel\n\u003e pip install loclean\n\u003e ```\n\u003e This ensures pip can find and use pre-built wheels when available.\n\n**Using uv (alternative, often faster):**\n\n```bash\nuv pip install loclean\n```\n\n**Using conda/mamba:**\n\n```bash\nconda install -c conda-forge loclean\n# or\nmamba install -c conda-forge loclean\n```\n\n## Optional Dependencies\n\nThe basic installation includes local inference support. Loclean uses **Narwhals** for backend-agnostic DataFrame operations, so if you already have **Pandas**, **Polars**, or **PyArrow** installed, the basic installation is sufficient.\n\n**Install DataFrame libraries (if not already present):**\n\nIf you don't have any DataFrame library installed, or want to ensure you have all supported backends:\n\n```bash\npip install loclean[data]\n```\n\nThis installs: `pandas\u003e=2.3.3`, `polars\u003e=0.20.0`, `pyarrow\u003e=22.0.0`\n\n**For Cloud API support (OpenAI, Anthropic, Gemini):**\n\nCloud API support is planned for future releases. 
Currently, only local inference is available:\n\n```bash\npip install loclean[cloud]\n```\n\n**Install all optional dependencies:**\n\n```bash\npip install loclean[all]\n```\n\nThis installs both `loclean[data]` and `loclean[cloud]`. Useful for production environments where you want all features available.\n\n\u003e **Note for developers:** If you're contributing to Loclean, use the [Development Installation](#development-installation) section below (git clone + `uv sync --dev`), not `loclean[all]`.\n\n## Development Installation\n\nTo contribute or run tests locally:\n\n```bash\n# Clone the repository\ngit clone https://github.com/nxank4/loclean.git\ncd loclean\n\n# Install with development dependencies (using uv)\nuv sync --dev\n\n# Or using pip\npip install -e \".[dev]\"\n```\n\n# Model Management\n\nLoclean automatically downloads models on first use, but you can pre-download them using the CLI:\n\n```bash\n# Download a specific model\nloclean model download --name phi-3-mini\n\n# List available models\nloclean model list\n\n# Check download status\nloclean model status\n```\n\n## Available Models\n\n- **phi-3-mini**: Microsoft Phi-3 Mini (3.8B, 4K context) - Default, balanced\n- **tinyllama**: TinyLlama 1.1B - Smallest, fastest\n- **gemma-2b**: Google Gemma 2B Instruct - Balanced performance\n- **qwen3-4b**: Qwen3 4B - Higher quality\n- **gemma-3-4b**: Gemma 3 4B - Larger context\n- **deepseek-r1**: DeepSeek R1 - Reasoning model\n- **lfm2.5**: Liquid LFM2.5-1.2B Instruct (1.17B, 32K context) - Best-in-class 1B scale, optimized for agentic tasks and data extraction\n\nModels are cached in `~/.cache/loclean` by default. You can specify a custom cache directory using the `--cache-dir` option.\n\n# Quick Start\n\nLoclean is best learned by example. 
We provide a set of Jupyter notebooks to help you get started:\n\n- **[01-quick-start.ipynb](examples/01-quick-start.ipynb)**: Core features, structured extraction, and privacy scrubbing.\n- **[02-data-cleaning.ipynb](examples/02-data-cleaning.ipynb)**: Comprehensive data cleaning strategies.\n- **[03-privacy-scrubbing.ipynb](examples/03-privacy-scrubbing.ipynb)**: Deep dive into PII redaction.\n\nCheck out the **[examples/](examples/)** directory for more details.\n\n# Contributing\n\nWe love contributions! Loclean is open source under the **Apache 2.0 License**.\n\nPlease read our **[Contributing Guide](CONTRIBUTION.md)** for details on how to set up your development environment, run tests, and submit pull requests.\n\n_Built for the Data Community._\n\n[![Star History Chart](https://api.star-history.com/svg?repos=nxank4/loclean\u0026type=date\u0026legend=top-left)](https://www.star-history.com/#nxank4/loclean\u0026type=date\u0026legend=top-left)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnxank4%2Floclean","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnxank4%2Floclean","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnxank4%2Floclean/lists"}