https://github.com/rbmuller/scherlok
A detective for your data. Zero-config data quality monitoring — works with dbt, Postgres, BigQuery, Snowflake. No YAML.
- Host: GitHub
- URL: https://github.com/rbmuller/scherlok
- Owner: rbmuller
- License: mit
- Created: 2026-04-01T14:51:35.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-05-11T13:00:13.000Z (about 1 month ago)
- Last Synced: 2026-05-11T14:40:36.982Z (about 1 month ago)
- Topics: anomaly-detection, bigquery, cli, data-engineering, data-observability, data-quality, dbt, etl, monitoring, open-source, postgres, postgresql, python, snowflake
- Language: Python
- Homepage: https://github.com/rbmuller/scherlok
- Size: 2.84 MB
- Stars: 1
- Watchers: 0
- Forks: 2
- Open Issues: 13
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Security: SECURITY.md
Awesome Lists containing this project
- awesome-dbt - Scherlok - Zero-config data quality CLI that complements `dbt test` with auto-detected anomalies (volume, schema drift, freshness, distribution, cardinality) on every materialized model after `dbt run`. (Data Quality)
- awesome-data-engineering - Scherlok - Zero-config data quality CLI. Profiles every table on first run, then auto-detects anomalies (volume drops, schema drift, freshness misses, distribution shifts) on subsequent runs. No YAML, no rules to write. Works with Postgres, BigQuery, Snowflake, and dbt. (Testing / Data Profiler)
README

**Zero config. Zero YAML. Zero rules to write.**
Scherlok learns what "normal" looks like, then tells you when something changes.
---
## The Problem
Every data team has the same nightmare:
> A source API silently changes from **dollars to cents**. Revenue dashboards show wrong numbers for **3 weeks** before anyone notices.
>
> A column starts returning **NULLs**. A table stops updating. Row counts drop **40% on a Tuesday**. Nobody knows until the CEO asks why the report looks weird.
Current tools (Great Expectations, Soda, dbt tests) require you to **define what "correct" looks like** before you can detect what's wrong. Hundreds of rules. Dozens of YAML files. And you still miss things — because you can't write rules for problems you haven't imagined yet.
## The Solution
Scherlok takes the opposite approach: **learn first, then detect.**
```bash
scherlok connect postgres://user:pass@host/db # connect once
scherlok investigate # learn your data
scherlok watch # detect anomalies
```
Three commands. Five minutes. Done.
## What It Catches
| Anomaly | What Happened | Severity |
|---------|---------------|----------|
| **Volume drop** | Row count dropped 40% overnight | CRITICAL |
| **Volume spike** | 3x more rows than normal | WARNING |
| **Freshness alert** | Table hasn't updated in 12h (normally every 2h) | CRITICAL |
| **Schema drift** | Column removed or type changed | CRITICAL |
| **NULL surge** | NULL rate jumped from 2% to 45% | WARNING |
| **Distribution shift** | Column mean shifted 5+ standard deviations | WARNING |
| **Cardinality explosion** | Status column went from 5 values to 500 | CRITICAL |
Every anomaly is auto-scored: **INFO**, **WARNING**, or **CRITICAL**. No thresholds to configure.
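As a rough illustration of the "distribution shift" row, one plausible way to express a shift in standard deviations is a z-score of the current column mean against the learned baseline. This is a minimal sketch, not Scherlok's actual scoring: the function name `mean_shift_sigmas` and the use of the baseline's population standard deviation are assumptions. The sample data mimics the dollars-to-cents scenario described earlier.

```python
import math
import statistics


def mean_shift_sigmas(baseline: list[float], current: list[float]) -> float:
    """How many baseline standard deviations the current mean has moved.

    Hypothetical helper: Scherlok's real formula is not documented here.
    """
    mu = statistics.fmean(baseline)
    sigma = statistics.pstdev(baseline)
    delta = abs(statistics.fmean(current) - mu)
    if sigma == 0:
        # A constant baseline: any movement at all is an unbounded shift.
        return 0.0 if delta == 0 else math.inf
    return delta / sigma


# Baseline learned in dollars; the source then silently switches to cents.
baseline_prices = [10.0, 12.0, 11.0, 9.0, 13.0]
current_prices = [p * 100 for p in baseline_prices]

shift = mean_shift_sigmas(baseline_prices, current_prices)
print(f"mean shifted {shift:.1f} sigma")
```

A unit change like this produces a shift of hundreds of standard deviations, which is why it can be flagged without a hand-written rule.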
## Works with dbt
Already running dbt? Scherlok complements `dbt test` with **automatic** anomaly detection — no rules to write.
```bash
pip install scherlok[dbt]
# After `dbt run`, point Scherlok at your project
scherlok dbt --project-dir ./my_dbt_project
```
Scherlok reads `target/manifest.json`, discovers every materialized model (`table`, `incremental`, `view`), auto-resolves the connection from your `profiles.yml`, and profiles each model:
```
Investigating 4 dbt models in ./my_dbt_project (postgres)
✓ stg_customers (12,345 rows)
✓ stg_orders (98,765 rows)
✗ fct_orders CRITICAL: Row count dropped 42% (98,765 → 57,283)
✓ dim_customers_inc (12,300 rows)
Summary: 4 profiled, 1 anomalies (1 critical, 0 warning)
```
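The model discovery step described above can be sketched in a few lines. dbt's `manifest.json` keeps its nodes under a top-level `nodes` mapping, and each model node carries a `resource_type` and a `config.materialized` value. The helper names `discover_models` and `load_manifest` are hypothetical, and this is only an approximation of what Scherlok might do internally.

```python
import json
from pathlib import Path

# Materializations the README says are discovered.
MATERIALIZATIONS = {"table", "incremental", "view"}


def discover_models(manifest: dict) -> list[str]:
    """Return the names of dbt models with a supported materialization."""
    models = []
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model":
            continue  # skip tests, seeds, snapshots, ...
        if node.get("config", {}).get("materialized") in MATERIALIZATIONS:
            models.append(node["name"])
    return sorted(models)


def load_manifest(project_dir: str) -> dict:
    """Read target/manifest.json from a dbt project directory."""
    path = Path(project_dir) / "target" / "manifest.json"
    return json.loads(path.read_text())


# A tiny in-memory manifest so the sketch runs without a dbt project.
sample_manifest = {
    "nodes": {
        "model.shop.stg_orders": {
            "resource_type": "model",
            "name": "stg_orders",
            "config": {"materialized": "view"},
        },
        "model.shop.fct_orders": {
            "resource_type": "model",
            "name": "fct_orders",
            "config": {"materialized": "table"},
        },
        "model.shop.tmp_helper": {
            "resource_type": "model",
            "name": "tmp_helper",
            "config": {"materialized": "ephemeral"},
        },
        "test.shop.not_null_orders_id": {
            "resource_type": "test",
            "name": "not_null_orders_id",
            "config": {},
        },
    }
}

print(discover_models(sample_manifest))
```

Ephemeral models and tests are skipped because they do not exist as relations in the warehouse, so there is nothing to profile.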
Use it as a CI gate after `dbt run`:
```yaml
- run: dbt run --target prod
- run: scherlok dbt --project-dir . --target prod --fail-on critical
```
**Supported adapters:** `postgres`, `bigquery`, `snowflake`, `mysql`. For others, pass `--connection-string` explicitly.
📖 Full docs: [dbt integration guide →](src/scherlok/dbt/README.md)
## HTML dashboard

```bash
scherlok dashboard --out report.html
```
One self-contained HTML file (~28 KB): KPIs, per-table incidents grouped with first-seen timestamps, `+`/`−`/`~` schema-drift diff, sparklines, and full anomaly history. Auto dark/light theme via `prefers-color-scheme`.
📖 Full docs: [dashboard guide →](src/scherlok/dashboard/README.md)
## How It Works
### 1. `investigate` — Learn the patterns
```bash
$ scherlok investigate
Profiling 12 tables...
✓ users — 45,231 rows, 8 columns
✓ orders — 1,203,847 rows, 15 columns
✓ products — 892 rows, 12 columns
...
Done. Profiles saved.
```
Scherlok profiles every table: row counts, column types, NULL rates, value distributions, freshness cadence, cardinality. Stores everything locally in SQLite.
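To make the profiled metrics concrete, here is a minimal sketch that computes row count, NULL rate, and cardinality for a table. It uses an in-memory SQLite database as a stand-in for the warehouse, and the function name `profile_table` and the shape of the returned dictionary are assumptions rather than Scherlok's real internals.

```python
import sqlite3


def profile_table(conn: sqlite3.Connection, table: str, columns: list[str]) -> dict:
    """Collect basic per-table and per-column statistics.

    Identifiers are interpolated directly, which is acceptable only
    because this sketch uses trusted, hard-coded names.
    """
    row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    profile = {"row_count": row_count, "columns": {}}
    for col in columns:
        # COUNT(col) ignores NULLs, so the difference is the NULL count.
        non_null, distinct = conn.execute(
            f'SELECT COUNT("{col}"), COUNT(DISTINCT "{col}") FROM "{table}"'
        ).fetchone()
        null_rate = (row_count - non_null) / row_count if row_count else 0.0
        profile["columns"][col] = {"null_rate": null_rate, "cardinality": distinct}
    return profile


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "a@example.com", "active"),
        (2, None, "active"),
        (3, "c@example.com", "inactive"),
        (4, "d@example.com", "active"),
    ],
)

profile = profile_table(conn, "users", ["email", "status"])
print(profile)
```

A later run can recompute the same numbers and compare them with this stored snapshot.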
### 2. `watch` — Detect anomalies
```bash
$ scherlok watch
Checking 12 tables against learned profiles...
🔴 CRITICAL orders volume_drop Row count dropped 52% (1,203,847 → 578,412)
🟡 WARNING users null_increase Column "email": NULL rate 2.1% → 18.7%
🔵 INFO products distribution Column "price": mean shifted 3.2σ
3 anomalies detected. Exit code: 1
```
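The percentage in the `volume_drop` line is the relative change between the learned row count and the current one. A small sketch of that arithmetic, using the numbers from the sample output (the function name `volume_change` is hypothetical):

```python
def volume_change(baseline_rows: int, current_rows: int) -> float:
    """Relative change in row count; negative values mean a drop."""
    if baseline_rows <= 0:
        raise ValueError("baseline row count must be positive")
    return (current_rows - baseline_rows) / baseline_rows


# Row counts from the sample `scherlok watch` output above.
change = volume_change(1_203_847, 578_412)
print(f"orders: row count changed {change:+.0%}")
```

The same calculation applied to the dbt example (98,765 → 57,283) gives a drop of about 42%.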
### 3. Alert — Slack, CI/CD, or both
```bash
# Slack
scherlok watch --webhook https://hooks.slack.com/services/...
# Discord
scherlok watch --webhook https://discord.com/api/webhooks/...
# Microsoft Teams
scherlok watch --webhook https://outlook.office.com/webhook/...
# Any endpoint (generic JSON payload)
scherlok watch --webhook https://my-api.com/alerts
# CI/CD gate (fails pipeline on CRITICAL)
scherlok watch --exit-code --fail-on critical
```
Auto-detects Slack, Discord, and Teams from the URL and formats the payload accordingly. Any other URL receives a generic JSON payload.
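URL-based detection of this kind can be done by inspecting the hostname. The sketch below is an assumption about how such routing might look, not Scherlok's code: the hostnames come from the example URLs above, while `detect_webhook_kind`, `build_payload`, and the generic payload shape are hypothetical. Slack incoming webhooks accept a `text` field and Discord webhooks a `content` field. The Teams branch assumes the simple `text` payload is accepted.

```python
from urllib.parse import urlparse

# Hostnames from the README examples; the mapping itself is an assumption.
KNOWN_HOSTS = {
    "hooks.slack.com": "slack",
    "discord.com": "discord",
    "outlook.office.com": "teams",
}


def detect_webhook_kind(url: str) -> str:
    """Classify a webhook URL by hostname, falling back to 'generic'."""
    host = (urlparse(url).hostname or "").lower()
    return KNOWN_HOSTS.get(host, "generic")


def build_payload(kind: str, message: str) -> dict:
    """Shape the JSON body for the detected service."""
    if kind == "discord":
        return {"content": message}
    if kind in ("slack", "teams"):
        return {"text": message}
    return {"message": message}  # hypothetical generic shape


url = "https://discord.com/api/webhooks/..."
kind = detect_webhook_kind(url)
print(kind, build_payload(kind, "CRITICAL orders volume_drop"))
```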
## CI/CD Integration
Use Scherlok as a data quality gate. The `ci` command does it in one line:
```yaml
# GitHub Actions
- name: Data quality check
  run: |
    pip install scherlok
    scherlok config --store s3://my-bucket/scherlok/profiles.db
    scherlok ci ${{ secrets.DATABASE_URL }} \
      --webhook ${{ secrets.SLACK_WEBHOOK }} \
      --fail-on critical
```
If Scherlok detects a critical anomaly, the pipeline fails. Bad data never reaches production.
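The gate logic behind `--fail-on` can be pictured as a comparison of severities against a threshold. This is a sketch under the assumption that `--fail-on warning` would also fail on CRITICAL (an "at or above" rule). The README only confirms the `critical` case, and the names below are hypothetical.

```python
# Severity levels named in the README, ordered from least to most severe.
SEVERITY_ORDER = {"info": 0, "warning": 1, "critical": 2}


def gate_exit_code(severities: list[str], fail_on: str = "critical") -> int:
    """Return 1 if any anomaly is at or above the fail-on level, else 0."""
    threshold = SEVERITY_ORDER[fail_on.lower()]
    failed = any(SEVERITY_ORDER[s.lower()] >= threshold for s in severities)
    return 1 if failed else 0


detected = ["INFO", "WARNING"]
print(gate_exit_code(detected, fail_on="critical"))
print(gate_exit_code(detected + ["CRITICAL"], fail_on="critical"))
```

A nonzero exit code is what makes the CI step, and therefore the pipeline, fail.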
## Email alerts
```bash
export SCHERLOK_SMTP_HOST=smtp.gmail.com
export SCHERLOK_SMTP_USER=alerts@company.com
export SCHERLOK_SMTP_PASSWORD=app-specific-password
scherlok watch --email team@company.com --email cto@company.com
```
## Connectors
```bash
# PostgreSQL
scherlok connect postgres://user:pass@host:5432/db
# BigQuery
pip install scherlok[bigquery]
scherlok connect bigquery://project-id/dataset-name
# Snowflake
pip install scherlok[snowflake]
export SNOWFLAKE_USER=...
export SNOWFLAKE_PASSWORD=...
export SNOWFLAKE_WAREHOUSE=...
scherlok connect snowflake://account/database/schema
# MySQL
pip install scherlok[mysql]
scherlok connect mysql://user:pass@host:3306/dbname
```
| Database | Status |
|----------|--------|
| PostgreSQL | Available |
| BigQuery | Available |
| Snowflake | Available |
| MySQL | Available |
| DuckDB | Planned |
## Remote Storage
Share profiles across CI runs and team members:
```bash
# AWS S3
scherlok config --store s3://my-bucket/scherlok/profiles.db
# Google Cloud Storage
scherlok config --store gs://my-bucket/scherlok/profiles.db
# Azure Blob Storage
scherlok config --store az://my-container/scherlok/profiles.db
```
## Why Not [Other Tool]?
| | Great Expectations | Soda | Monte Carlo | **Scherlok** |
|---|---|---|---|---|
| Setup time | Hours | 30 min | Weeks | **5 minutes** |
| Config required | Hundreds of rules | YAML checks | Dashboard setup | **None** |
| Anomaly detection | Manual thresholds | Paid feature | Yes | **Yes, free** |
| Self-hosted | Yes | Limited | No (SaaS) | **Yes** |
| CI/CD gate | Yes | Yes | No | **Yes** |
| Price | Free | Freemium | $50-200K/yr | **Free, forever** |
## CLI Reference
```
scherlok connect <connection-string>      Connect to a database
scherlok investigate                      Profile all tables (learn patterns)
scherlok watch [-w <url>] [-e <email>]    Detect anomalies and alert
scherlok ci <connection-string> [opts]    All-in-one CI/CD command (connect + watch + exit code)
scherlok status                           Quick health dashboard
scherlok report                           Detailed profile summary
scherlok history [--days N]               Timeline of past anomalies
scherlok config --store <uri>             Set remote storage
scherlok version                          Show version
```
## Install
```bash
pip install scherlok
# With BigQuery support
pip install scherlok[bigquery]
```
Requires Python 3.10+.
### Run via Docker
A pre-built image with every warehouse extra (`dbt`, `bigquery`, `snowflake`) is published to GitHub Container Registry on every release tag:
```bash
docker run --rm ghcr.io/rbmuller/scherlok:latest version
```
Mount your project directory and inject connection details the same way your CI does it; the entrypoint is the `scherlok` CLI:
```bash
docker run --rm \
-v "$PWD:/work" -w /work \
-e SCHERLOK_CONNECTION=postgres://... \
ghcr.io/rbmuller/scherlok:latest watch
```
The image is built from `python:3.12-slim` and runs unprivileged (`USER scherlok`).
## Contributing
Contributions welcome! See [CONTRIBUTING.md](CONTRIBUTING.md).
We're especially looking for:
- New database connectors (for example DuckDB, which is listed as planned)
- Anomaly detection improvements
- Documentation and examples
## License
[MIT](LICENSE) — Developed by [Robson Bayer Müller](https://github.com/rbmuller)
