https://github.com/v-cth/database_audit
Audit your data quality in seconds
bigquery database dataengineering snowflake
Last synced: 7 months ago
- Host: GitHub
- URL: https://github.com/v-cth/database_audit
- Owner: v-cth
- Created: 2025-10-09T13:44:20.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-11-19T11:16:01.000Z (7 months ago)
- Last Synced: 2025-11-19T13:12:31.646Z (7 months ago)
- Topics: bigquery, database, dataengineering, snowflake
- Language: Python
- Homepage:
- Size: 2.11 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Audit: audit.py
README
# Data Warehouse Table Auditor
**High-performance data quality auditing for BigQuery & Snowflake with automatic relationship detection.**
✅ Find data issues before they cause problems
🔗 Discover table relationships automatically
🎨 Beautiful HTML reports with ER diagrams
---
## 🚀 Quick Start
```bash
# 1. Install uv (one-time setup)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or on Windows: powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
# Or with pip: pip install uv
# 2. Clone and setup
git clone https://github.com/v-cth/database_audit.git
cd database_audit
# 3. Install dependencies (creates venv automatically)
uv sync
# 4. Create config file in current directory
uv run dw_auditor init
# 5. Store your credentials as environment variables in a .env file
#    (use single quotes for passwords with special chars)
cat > .env << 'EOF'
export SNOWFLAKE_ACCOUNT='your-account'
export SNOWFLAKE_USER='your-username'
export SNOWFLAKE_PASSWORD='your-password'
EOF
# 6. Edit audit_config.yaml with your database details
# 7. Run the audit (load env vars first)
source .env
uv run dw_auditor run
# 8. Open the HTML report
open audit_results/audit_run_*/summary.html
```
> **Note**: If you prefer pip, you can still use: `pip install -e .`
---
## ✨ Key Features
- **Quality Checks** - Detect trailing spaces, case duplicates, regex patterns, range violations, future dates, and more
- **Automatic Profiling** - Distributions, top values, quantiles, string lengths, date ranges
- **Fields With Wrong Type** - Detect string columns that contain only dates, integers, or booleans
- **Relationship Detection** - Automatically discover foreign keys
- **Rich HTML Reports** - 4-tab interface (Summary/Insights/Checks/Metadata) with visual gradients and timelines
- **Secure by Design** - Zero data exports, database-native operations via Ibis, PII masking
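
As a rough illustration of what three of these checks look for (trailing spaces, case duplicates, and string columns that hold only integers), here is a minimal Python sketch on in-memory values. The function names are hypothetical, and the tool itself runs its checks in the database via Ibis, so this is not its implementation.

```python
import re
from collections import defaultdict


def trailing_space_values(values):
    """Return string values that end with one or more spaces."""
    return [v for v in values if isinstance(v, str) and v != v.rstrip(" ")]


def case_duplicate_groups(values):
    """Group distinct values that differ only by letter case."""
    groups = defaultdict(set)
    for v in values:
        if isinstance(v, str):
            groups[v.lower()].add(v)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


def only_integers(values):
    """True if every non-null string value parses as an integer."""
    non_null = [v for v in values if v is not None]
    return bool(non_null) and all(re.fullmatch(r"[+-]?\d+", v.strip()) for v in non_null)


cities = ["Paris", "paris", "Lyon ", "Lyon", None]
print(trailing_space_values(cities))         # ['Lyon ']
print(case_duplicate_groups(cities))         # {'paris': ['Paris', 'paris']}
print(only_integers(["1", "42", "-7", None]))  # True
```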
---
## 📋 What You Can Audit
- **Tables & Views** - Tables, views, and materialized views
- **Multiple Schemas** - Audit across datasets/databases in one run
- **Custom Queries** - Audit filtered data (e.g., "last 7 days only")
---
## 🎯 Use Cases
- **Data Migration** - Validate data before/after migrations
- **Post-ETL Quality Gates** - Catch issues in transformation pipelines
- **Schema Discovery** - Fast metadata exploration with `--discover` mode
- **Relationship Mapping** - Understand foreign keys in legacy systems
- **Compliance Audits** - PII detection and masking for governance
---
## 📊 Example Output
### Console
```
📋 Column Summary (All Columns):
==================================================
Column Name    Type        Status     Nulls
--------------------------------------------------
user_id        int64       ✓ OK       0 (0.0%)
email          string      ✗ ERROR    2 (1.2%)
created_at     datetime    ✓ OK       0 (0.0%)
🔍 Issues Found:
⚠️ EMAIL REGEX: 2 values don't match pattern
Examples: 'invalid.email@', 'user@domain'
```
### HTML Report Tabs
1. **Summary** - Overview, primary keys, table metadata
2. **Insights** - Visual distributions with gradient bars, top values
3. **Quality Checks** - Issues with examples and primary key context
4. **Metadata** - Audit config, duration
---
## ⚙️ Configuration Examples
### Minimal Setup
#### BigQuery
```yaml
database:
  backend: "bigquery"
  connection_params:
    default_database: "my-project"
    default_schema: "analytics"
tables:
  - name: users
  - name: orders
```
#### Snowflake
```yaml
database:
  backend: "snowflake"
  connection_params:
    default_database: "MY_DB"
    default_schema: "MY_SCHEMA"
    account: "ACCOUNT"
    user: "USER"
    password: "PWD"
tables:
  - name: users
  - name: orders
```
### Using Environment Variables (Recommended for Credentials)
**Protect sensitive credentials by using environment variables instead of hardcoding them in YAML:**
#### Supported Formats
```yaml
database:
  backend: "snowflake"
  connection_params:
    default_database: "MY_DB"
    default_schema: "MY_SCHEMA"
    account: "${SNOWFLAKE_ACCOUNT}"                  # Basic format
    user: "$SNOWFLAKE_USER"                          # Short format
    password: "${SNOWFLAKE_PASSWORD}"
    warehouse: "${SNOWFLAKE_WAREHOUSE:-COMPUTE_WH}"  # With default value
```
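
To make the three formats concrete, here is a minimal Python sketch of how such placeholders could be expanded. It is not the tool's actual resolver: `expand_env` is a hypothetical helper, and the fallback rule (use the default when the variable is unset or empty) is an assumption borrowed from the shell's `${VAR:-default}` convention.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${VAR}, ${VAR:-default}, and $VAR.
_ENV_PATTERN = re.compile(
    r"\$\{(?P<braced>[A-Za-z_][A-Za-z0-9_]*)(?::-(?P<default>[^}]*))?\}"
    r"|\$(?P<bare>[A-Za-z_][A-Za-z0-9_]*)"
)


def expand_env(value: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Substitute environment variable references in a config string."""
    env = os.environ if env is None else env

    def _replace(match):
        name = match.group("braced") or match.group("bare")
        default = match.group("default")
        current = env.get(name)
        if default is not None:
            # Assumed shell-like behavior: fall back when unset or empty.
            return current if current else default
        if current is None:
            raise KeyError(f"Environment variable {name!r} is not set")
        return current

    return _ENV_PATTERN.sub(_replace, value)


env = {"SNOWFLAKE_ACCOUNT": "acme", "SNOWFLAKE_USER": "bob"}
print(expand_env("${SNOWFLAKE_ACCOUNT}", env))                 # acme
print(expand_env("$SNOWFLAKE_USER", env))                      # bob
print(expand_env("${SNOWFLAKE_WAREHOUSE:-COMPUTE_WH}", env))   # COMPUTE_WH
```

Raising on a missing variable without a default is a design choice in this sketch: it fails before any connection attempt rather than passing a literal `${...}` string to the database driver.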
#### Usage
**Option 1: Using .env file (recommended)**
```bash
# Create .env file (use single quotes for passwords with special chars like $)
cat > .env << 'EOF'
export SNOWFLAKE_ACCOUNT='your-account'
export SNOWFLAKE_USER='your-username'
export SNOWFLAKE_PASSWORD='your-password'
EOF
# Load and run
source .env
dw_auditor run
```
**Option 2: Export directly**
```bash
# Set environment variables (use single quotes for special chars)
export SNOWFLAKE_ACCOUNT='your-account'
export SNOWFLAKE_USER='my_user'
export SNOWFLAKE_PASSWORD='my_password'
# Run audit
dw_auditor run
```
**Option 3: Inline (for one-time use)**
```bash
SNOWFLAKE_PASSWORD='secret' dw_auditor run
```
**Benefits:**
- ✅ Keep credentials out of version control
- ✅ Different credentials per environment (dev/staging/prod)
- ✅ Works with CI/CD secrets management
- ✅ Supports default values: `${VAR:-default}`
### Multi-Schema Auditing
```yaml
tables:
  - name: raw_customers
    schema: raw_data
  - name: stg_customers
    schema: staging
  - name: prod_customers
    schema: production
```
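
One way to read this config is that each table entry resolves to a fully qualified name, with `schema` overriding `default_schema` when present. The Python sketch below illustrates that reading only; the fallback rule and the `qualified_names` helper are assumptions, not the tool's documented behavior.

```python
def qualified_names(tables, default_database, default_schema):
    """Build database.schema.table names, falling back to the default schema."""
    return [
        f"{default_database}.{table.get('schema', default_schema)}.{table['name']}"
        for table in tables
    ]


tables = [
    {"name": "raw_customers", "schema": "raw_data"},
    {"name": "stg_customers", "schema": "staging"},
    {"name": "users"},  # no schema given, so the default applies
]

print(qualified_names(tables, "MY_DB", "MY_SCHEMA"))
# ['MY_DB.raw_data.raw_customers', 'MY_DB.staging.stg_customers', 'MY_DB.MY_SCHEMA.users']
```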
### Custom Quality Checks
```yaml
column_checks:
  tables:
    users:
      email:
        regex_patterns:
          pattern: "^[\\w._%+-]+@[\\w.-]+\\.[a-zA-Z]{2,}$"
          mode: "match"
      age:
        greater_than_or_equal: 18
        less_than: 120
```
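
To show what these two checks would flag, here is a small Python sketch that applies the same regex and range to sample values. It assumes that `mode: "match"` means every value must match the pattern and that nulls are skipped; the helper names are hypothetical, and the tool evaluates its checks in the database rather than in Python.

```python
import re

# Same pattern as in the YAML above, with the YAML escaping removed.
EMAIL_PATTERN = re.compile(r"^[\w._%+-]+@[\w.-]+\.[a-zA-Z]{2,}$")


def regex_violations(values, pattern):
    """Return non-null values that do not match the pattern."""
    return [v for v in values if v is not None and not pattern.search(v)]


def range_violations(values, greater_than_or_equal, less_than):
    """Return non-null values outside [greater_than_or_equal, less_than)."""
    return [
        v for v in values
        if v is not None and not (greater_than_or_equal <= v < less_than)
    ]


emails = ["ada@example.com", "invalid.email@", "user@domain", None]
ages = [17, 18, 45, 119, 120, None]

print(regex_violations(emails, EMAIL_PATTERN))  # ['invalid.email@', 'user@domain']
print(range_violations(ages, 18, 120))          # [17, 120]
```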
### Relationship Detection
```yaml
relationship_detection:
  enabled: true
  confidence_threshold: 0.7      # 70% confidence to detect
  min_confidence_display: 0.5    # Show relationships >= 50%
```
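
The README does not spell out how confidence is computed. As one plausible illustration, the Python sketch below scores a candidate foreign key by the share of its distinct values that also appear in the referenced column, then applies the two thresholds. The scoring formula, function names, and the "detected" / "displayed" / "hidden" labels are all assumptions made for this example.

```python
def overlap_confidence(child_values, parent_values):
    """Fraction of distinct non-null child values found in the parent column."""
    child = {v for v in child_values if v is not None}
    parent = {v for v in parent_values if v is not None}
    if not child:
        return 0.0
    return len(child & parent) / len(child)


def classify(confidence, confidence_threshold=0.7, min_confidence_display=0.5):
    """Map a score to a label using the two configured thresholds."""
    if confidence >= confidence_threshold:
        return "detected"
    if confidence >= min_confidence_display:
        return "displayed"
    return "hidden"


users_id = [1, 2, 3, 4, 5]
orders_user_id = [1, 1, 2, 3, 9, None]

score = overlap_confidence(orders_user_id, users_id)
print(score, classify(score))  # 0.75 detected
```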
**Full configuration guide**: See inline comments in [`audit_config.yaml`](./audit_config.yaml)
---
## 🔧 Advanced Usage
### Initialize Config
```bash
dw_auditor init # Create in current directory (./audit_config.yaml)
dw_auditor init --force # Overwrite existing config
dw_auditor init --path ./my.yaml # Create in custom location
```
### Run Audit
```bash
dw_auditor run # Auto-discover config
dw_auditor run custom.yaml # Use specific config file
dw_auditor run --yes # Auto-confirm prompts
```
### Audit Modes
```bash
dw_auditor run --discover # Metadata only (fast)
dw_auditor run --check # Quality checks only
dw_auditor run --insight # Profiling only
```
---
## 📚 Documentation
- **[Configuration Reference](./audit_config.yaml)** - Inline documentation for all options
- **[Quality Checks Guide](./doc/checks.md)** - All checks with examples
- **[Data Insights Guide](./doc/insights.md)** - All insights with examples
---
## 🛠️ Troubleshooting
### Installation
**Using uv**: Make sure uv is installed: `curl -LsSf https://astral.sh/uv/install.sh | sh`
**Using pip**: You can still install with: `pip install -e .` (reads from pyproject.toml)
### Authentication
**BigQuery**: Use `gcloud auth application-default login` or set `credentials_path` in config
**Snowflake**: Use environment variables for credentials (see Configuration Examples) or `authenticator: externalbrowser` for SSO
**Security**: Always use environment variables for passwords - never commit credentials to git
### Performance
- Sampling is always database-native via Ibis (fast & secure)
- Increase `sample_size` carefully (default: 100,000 rows)
- Use `--discover` for metadata-only scans
### Memory Issues
- Reduce `sample_size` in config
- Audit fewer tables per run
- Disable expensive insights (e.g., reduce `quantiles` count)
---
## 🏗️ Architecture
Built on modern Python data tools:
- **Ibis** - Database abstraction (lazy SQL generation, no data exports)
- **Polars** - Fast DataFrame processing
- **Pydantic** - Type-safe configuration validation
**Design**: All computation happens in your database. No data is exported to files.
---
## 🔐 Security Features
**Built-in security controls to protect sensitive data:**
### 1. **Automatic PII Masking**
- Auto-detects 32+ PII keywords (email, phone, SSN, credit card, etc.)
- Replaces values with `***PII_MASKED***` before analysis
- Customizable keyword list per your compliance needs
```yaml
security:
  mask_pii: true
  custom_pii_keywords: ["employee_id", "internal_code"]
```
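
A minimal Python sketch of keyword-based masking, assuming columns are flagged when their name contains a keyword. The keyword list here is a small illustrative subset rather than the tool's built-in list, and the matching rule and helper names are hypothetical.

```python
PII_MASK = "***PII_MASKED***"
# Illustrative subset only; the tool's built-in list is longer.
DEFAULT_PII_KEYWORDS = ["email", "phone", "ssn", "credit_card"]


def is_pii_column(column_name, keywords):
    """True if the column name contains any keyword (case-insensitive)."""
    name = column_name.lower()
    return any(keyword.lower() in name for keyword in keywords)


def mask_rows(rows, custom_pii_keywords=()):
    """Replace non-null values in PII columns with the mask string."""
    keywords = DEFAULT_PII_KEYWORDS + list(custom_pii_keywords)
    return [
        {
            col: (PII_MASK if val is not None and is_pii_column(col, keywords) else val)
            for col, val in row.items()
        }
        for row in rows
    ]


rows = [{"user_id": 1, "email": "ada@example.com", "employee_id": "E-42"}]
print(mask_rows(rows, custom_pii_keywords=["employee_id"]))
# [{'user_id': 1, 'email': '***PII_MASKED***', 'employee_id': '***PII_MASKED***'}]
```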
### 2. **Zero Data Export Architecture**
- **Database-native queries** - All computation happens in your database (via Ibis)
- **No intermediate files** - Data never written to disk
- **Metadata-only exports** - Reports contain statistics, not raw data
### 3. **Data Minimization**
- **Column filtering** - Exclude sensitive columns entirely
- **Sampling** - Analyze a subset of the data (database-native TABLESAMPLE)
- **Temporary in-memory only** - Data discarded after analysis
### 4. **What's Exported vs Protected**
✅ **Exported** (Safe for Reports):
- Column metadata (names, types, descriptions)
- Statistics (nulls, distinct counts, ranges)
- Quality check results
- Top values (with PII masked)
❌ **Never Exported**:
- Raw column data
- Full table contents
- PII values
- Credentials (use environment variables to keep them out of config files)
---
## 📝 License
MIT License - See [LICENSE](./LICENSE) file
---