{"id":49717191,"url":"https://github.com/rbmuller/scherlok","last_synced_at":"2026-05-15T13:01:10.443Z","repository":{"id":348587441,"uuid":"1198573394","full_name":"rbmuller/scherlok","owner":"rbmuller","description":"A detective for your data. Zero-config data quality monitoring — works with dbt, Postgres, BigQuery, Snowflake. No YAML.","archived":false,"fork":false,"pushed_at":"2026-05-11T13:00:13.000Z","size":2981,"stargazers_count":1,"open_issues_count":13,"forks_count":2,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-11T14:40:36.982Z","etag":null,"topics":["anomaly-detection","bigquery","cli","data-engineering","data-observability","data-quality","dbt","etl","monitoring","open-source","postgres","postgresql","python","snowflake"],"latest_commit_sha":null,"homepage":"https://github.com/rbmuller/scherlok","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rbmuller.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-01T14:51:35.000Z","updated_at":"2026-05-11T13:09:32.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/rbmuller/scherlok","commit_stats":null,"previous_names":["rbmuller/scherlok"],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/rbmuller/scherlok","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rbmuller%2Fscherlok","tags_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/repositories/rbmuller%2Fscherlok/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rbmuller%2Fscherlok/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rbmuller%2Fscherlok/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rbmuller","download_url":"https://codeload.github.com/rbmuller/scherlok/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rbmuller%2Fscherlok/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33067476,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-15T11:35:32.926Z","status":"ssl_error","status_checked_at":"2026-05-15T11:35:31.362Z","response_time":103,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["anomaly-detection","bigquery","cli","data-engineering","data-observability","data-quality","dbt","etl","monitoring","open-source","postgres","postgresql","python","snowflake"],"created_at":"2026-05-08T21:00:46.217Z","updated_at":"2026-05-15T13:01:10.438Z","avatar_url":"https://github.com/rbmuller.png","language":"Python","funding_links":[],"categories":["Data Quality","Testing"],"sub_categories":["Data Profiler"],"readme":"\u003cdiv align=\"center\"\u003e\n\n\u003cimg 
src=\"https://img.shields.io/badge/python-3.10+-blue?logo=python\u0026logoColor=white\" alt=\"Python 3.10+\"\u003e\n\u003cimg src=\"https://img.shields.io/pypi/v/scherlok?color=green\" alt=\"PyPI\"\u003e\n\u003cimg src=\"https://img.shields.io/github/license/rbmuller/scherlok\" alt=\"MIT License\"\u003e\n\u003ca href=\"https://github.com/rbmuller/scherlok/actions/workflows/ci.yml\"\u003e\u003cimg src=\"https://github.com/rbmuller/scherlok/actions/workflows/ci.yml/badge.svg\" alt=\"CI\"\u003e\u003c/a\u003e\n\n\u003cbr\u003e\u003cbr\u003e\n\n\u003cimg src=\"assets/scherlok-logo.png\" alt=\"Scherlok\" width=\"120\"\u003e\n\n\u003ch1\u003eScherlok\u003c/h1\u003e\n\n\u003cp\u003e\u003cstrong\u003eYour data broke in production. Again.\u003c/strong\u003e\u003cbr\u003e\nScherlok makes sure it doesn't happen next time.\u003c/p\u003e\n\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n\u003cimg src=\"examples/demo.svg\" alt=\"Scherlok Demo\" width=\"700\"\u003e\n\n**Zero config. Zero YAML. Zero rules to write.**\u003cbr\u003e\nScherlok learns what \"normal\" looks like, then tells you when something changes.\n\n\u003c/div\u003e\n\n---\n\n## The Problem\n\nEvery data team has the same nightmare:\n\n\u003e A source API silently changes from **dollars to cents**. Revenue dashboards show wrong numbers for **3 weeks** before anyone notices.\n\u003e\n\u003e A column starts returning **NULLs**. A table stops updating. Row counts drop **40% on a Tuesday**. Nobody knows until the CEO asks why the report looks weird.\n\nCurrent tools (Great Expectations, Soda, dbt tests) require you to **define what \"correct\" looks like** before you can detect what's wrong. Hundreds of rules. Dozens of YAML files. 
And you still miss things — because you can't write rules for problems you haven't imagined yet.\n\n## The Solution\n\nScherlok takes the opposite approach: **learn first, then detect.**\n\n```bash\nscherlok connect postgres://user:pass@host/db   # connect once\nscherlok investigate                              # learn your data\nscherlok watch                                    # detect anomalies\n```\n\nThree commands. Five minutes. Done.\n\n## What It Catches\n\n| Anomaly | What Happened | Severity |\n|---------|---------------|----------|\n| **Volume drop** | Row count dropped 40% overnight | CRITICAL |\n| **Volume spike** | 3x more rows than normal | WARNING |\n| **Freshness alert** | Table hasn't updated in 12h (normally every 2h) | CRITICAL |\n| **Schema drift** | Column removed or type changed | CRITICAL |\n| **NULL surge** | NULL rate jumped from 2% to 45% | WARNING |\n| **Distribution shift** | Column mean shifted 5+ standard deviations | WARNING |\n| **Cardinality explosion** | Status column went from 5 values to 500 | CRITICAL |\n\nEvery anomaly is auto-scored: **INFO**, **WARNING**, or **CRITICAL**. No thresholds to configure.\n\n## Works with dbt\n\nAlready running dbt? 
Scherlok complements `dbt test` with **automatic** anomaly detection — no rules to write.\n\n```bash\npip install scherlok[dbt]\n\n# After `dbt run`, point Scherlok at your project\nscherlok dbt --project-dir ./my_dbt_project\n```\n\nScherlok reads `target/manifest.json`, discovers every materialized model (`table`, `incremental`, `view`), auto-resolves the connection from your `profiles.yml`, and profiles each model:\n\n```\nInvestigating 4 dbt models in ./my_dbt_project (postgres)\n  ✓ stg_customers                  (12,345 rows)\n  ✓ stg_orders                     (98,765 rows)\n  ✗ fct_orders                     CRITICAL: Row count dropped 42% (98,765 → 57,283)\n  ✓ dim_customers_inc              (12,300 rows)\n\nSummary: 4 profiled, 1 anomalies (1 critical, 0 warning)\n```\n\nUse it as a CI gate after `dbt run`:\n\n```yaml\n- run: dbt run --target prod\n- run: scherlok dbt --project-dir . --target prod --fail-on critical\n```\n\n**Supported adapters:** `postgres`, `bigquery`, `snowflake`, `mysql`. For others, pass `--connection-string` explicitly.\n\n📖 Full docs: [dbt integration guide →](src/scherlok/dbt/README.md)\n\n## HTML dashboard\n\n![scherlok dashboard](assets/dashboard-screenshot.png)\n\n```bash\nscherlok dashboard --out report.html\n```\n\nOne self-contained HTML file (~28 KB): KPIs, per-table incidents grouped with first-seen timestamps, `+`/`−`/`~` schema-drift diff, sparklines, and full anomaly history. Auto dark/light theme via `prefers-color-scheme`.\n\n📖 Full docs: [dashboard guide →](src/scherlok/dashboard/README.md)\n\n## How It Works\n\n### 1. `investigate` — Learn the patterns\n\n```bash\n$ scherlok investigate\n\n  Profiling 12 tables...\n  ✓ users         — 45,231 rows, 8 columns\n  ✓ orders        — 1,203,847 rows, 15 columns\n  ✓ products      — 892 rows, 12 columns\n  ...\n  Done. Profiles saved.\n```\n\nScherlok profiles every table: row counts, column types, NULL rates, value distributions, freshness cadence, cardinality. 
Stores everything locally in SQLite.\n\n### 2. `watch` — Detect anomalies\n\n```bash\n$ scherlok watch\n\n  Checking 12 tables against learned profiles...\n\n  🔴 CRITICAL  orders    volume_drop     Row count dropped 52% (1,203,847 → 578,412)\n  🟡 WARNING   users     null_increase   Column \"email\": NULL rate 2.1% → 18.7%\n  🔵 INFO      products  distribution    Column \"price\": mean shifted 3.2σ\n\n  3 anomalies detected. Exit code: 1\n```\n\n### 3. Alert — Slack, CI/CD, or both\n\n```bash\n# Slack\nscherlok watch --webhook https://hooks.slack.com/services/...\n\n# Discord\nscherlok watch --webhook https://discord.com/api/webhooks/...\n\n# Microsoft Teams\nscherlok watch --webhook https://outlook.office.com/webhook/...\n\n# Any endpoint (generic JSON payload)\nscherlok watch --webhook https://my-api.com/alerts\n\n# CI/CD gate (fails pipeline on CRITICAL)\nscherlok watch --exit-code --fail-on critical\n```\n\nAuto-detects Slack, Discord, and Teams from the URL and formats the payload accordingly. Any other URL receives a generic JSON payload.\n\n## CI/CD Integration\n\nUse Scherlok as a data quality gate. The `ci` command does it in one line:\n\n```yaml\n# GitHub Actions\n- name: Data quality check\n  run: |\n    pip install scherlok\n    scherlok config --store s3://my-bucket/scherlok/profiles.db\n    scherlok ci ${{ secrets.DATABASE_URL }} \\\n      --webhook ${{ secrets.SLACK_WEBHOOK }} \\\n      --fail-on critical\n```\n\nIf Scherlok detects a critical anomaly, the pipeline fails. 
Bad data never reaches production.\n\n## Email alerts\n\n```bash\nexport SCHERLOK_SMTP_HOST=smtp.gmail.com\nexport SCHERLOK_SMTP_USER=alerts@company.com\nexport SCHERLOK_SMTP_PASSWORD=app-specific-password\n\nscherlok watch --email team@company.com --email cto@company.com\n```\n\n## Connectors\n\n```bash\n# PostgreSQL\nscherlok connect postgres://user:pass@host:5432/db\n\n# BigQuery\npip install scherlok[bigquery]\nscherlok connect bigquery://project-id/dataset-name\n\n# Snowflake\npip install scherlok[snowflake]\nexport SNOWFLAKE_USER=...\nexport SNOWFLAKE_PASSWORD=...\nexport SNOWFLAKE_WAREHOUSE=...\nscherlok connect snowflake://account/database/schema\n\n# MySQL\npip install scherlok[mysql]\nscherlok connect mysql://user:pass@host:3306/dbname\n```\n\n| Database | Status |\n|----------|--------|\n| PostgreSQL | Available |\n| BigQuery | Available |\n| Snowflake | Available |\n| MySQL | Available |\n| DuckDB | Planned |\n\n## Remote Storage\n\nShare profiles across CI runs and team members:\n\n```bash\n# AWS S3\nscherlok config --store s3://my-bucket/scherlok/profiles.db\n\n# Google Cloud Storage\nscherlok config --store gs://my-bucket/scherlok/profiles.db\n\n# Azure Blob Storage\nscherlok config --store az://my-container/scherlok/profiles.db\n```\n\n## Why Not [Other Tool]?\n\n| | Great Expectations | Soda | Monte Carlo | **Scherlok** |\n|---|---|---|---|---|\n| Setup time | Hours | 30 min | Weeks | **5 minutes** |\n| Config required | Hundreds of rules | YAML checks | Dashboard setup | **None** |\n| Anomaly detection | Manual thresholds | Paid feature | Yes | **Yes, free** |\n| Self-hosted | Yes | Limited | No (SaaS) | **Yes** |\n| CI/CD gate | Yes | Yes | No | **Yes** |\n| Price | Free | Freemium | $50-200K/yr | **Free, forever** |\n\n## CLI Reference\n\n```\nscherlok connect \u003curl\u003e          Connect to a database\nscherlok investigate            Profile all tables (learn patterns)\nscherlok watch [-w \u003curl\u003e] [-e \u003cemail\u003e]  Detect 
anomalies and alert\nscherlok ci \u003curl\u003e [opts]        All-in-one CI/CD command (connect + watch + exit code)\nscherlok status                 Quick health dashboard\nscherlok report                 Detailed profile summary\nscherlok history [--days N]     Timeline of past anomalies\nscherlok config --store \u003curl\u003e   Set remote storage\nscherlok version                Show version\n```\n\n## Install\n\n```bash\npip install scherlok\n\n# With BigQuery support\npip install scherlok[bigquery]\n```\n\nRequires Python 3.10+.\n\n### Run via Docker\n\nA pre-built image with every warehouse extra (`dbt`, `bigquery`, `snowflake`) is published to GitHub Container Registry on every release tag:\n\n```bash\ndocker run --rm ghcr.io/rbmuller/scherlok:latest version\n```\n\nMount your project directory and inject connection details the same way your CI does it; the entrypoint is the `scherlok` CLI:\n\n```bash\ndocker run --rm \\\n  -v \"$PWD:/work\" -w /work \\\n  -e SCHERLOK_CONNECTION=postgres://... \\\n  ghcr.io/rbmuller/scherlok:latest watch\n```\n\nThe image is built from `python:3.12-slim` and runs unprivileged (`USER scherlok`).\n\n## Contributing\n\nContributions welcome! See [CONTRIBUTING.md](CONTRIBUTING.md).\n\nWe're especially looking for:\n- New database connectors (such as DuckDB, which is planned)\n- Anomaly detection improvements\n- Documentation and examples\n\n## License\n\n[MIT](LICENSE) — Developed by [Robson Bayer Müller](https://github.com/rbmuller)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frbmuller%2Fscherlok","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frbmuller%2Fscherlok","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frbmuller%2Fscherlok/lists"}