{"id":33330255,"url":"https://github.com/v-cth/database_audit","last_synced_at":"2025-11-20T18:02:52.316Z","repository":{"id":325073518,"uuid":"1072999880","full_name":"v-cth/database_audit","owner":"v-cth","description":"Audit your data quality in seconds","archived":false,"fork":false,"pushed_at":"2025-11-19T11:16:01.000Z","size":2216,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-11-19T13:12:31.646Z","etag":null,"topics":["bigquery","database","dataengineering","snowflake"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/v-cth.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":"audit.py","citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-09T13:44:20.000Z","updated_at":"2025-11-19T11:13:10.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/v-cth/database_audit","commit_stats":null,"previous_names":["v-cth/database_audit"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/v-cth/database_audit","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/v-cth%2Fdatabase_audit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/v-cth%2Fdatabase_audit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/v-cth%2Fdatabase_audit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/v-cth%2Fda
tabase_audit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/v-cth","download_url":"https://codeload.github.com/v-cth/database_audit/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/v-cth%2Fdatabase_audit/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":285484486,"owners_count":27179744,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-20T02:00:05.334Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigquery","database","dataengineering","snowflake"],"created_at":"2025-11-20T18:01:28.385Z","updated_at":"2025-11-20T18:02:52.305Z","avatar_url":"https://github.com/v-cth.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Warehouse Table Auditor\n\n**High-performance data quality auditing for BigQuery \u0026 Snowflake with automatic relationship detection.**\n\n✅ Find data issues before they cause problems\n\n🔗 Discover table relationships automatically\n\n🎨 Beautiful HTML reports with ER diagrams\n\n---\n\n## 🚀 Quick Start\n\n```bash\n# 1. Install uv (one-time setup)\ncurl -LsSf https://astral.sh/uv/install.sh | sh\n# Or on Windows: powershell -c \"irm https://astral.sh/uv/install.ps1 | iex\"\n# Or with pip: pip install uv\n\n# 2. 
Clone and set up\ngit clone https://github.com/v-cth/database_audit.git\ncd database_audit\n\n# 3. Install dependencies (creates venv automatically)\nuv sync\n\n# 4. Create config file in current directory\nuv run dw_auditor init\n\n# 5. Store your credentials as environment variables (use single quotes for passwords with special chars)\n# Put the following lines in your .env file:\nexport SNOWFLAKE_ACCOUNT='your-account'\nexport SNOWFLAKE_USER='your-username'\nexport SNOWFLAKE_PASSWORD='your-password'\n\n# 6. Edit audit_config.yaml with your database details\n\n# 7. Run the audit (load env vars first)\nsource .env\nuv run dw_auditor run\n\n# 8. Open the HTML report\nopen audit_results/audit_run_*/summary.html\n```\n\n\n\u003e **Note**: If you prefer pip, you can still use: `pip install -e .`\n\n---\n\n## ✨ Key Features\n\n- **Quality Checks** - Detect trailing spaces, case duplicates, regex patterns, range violations, future dates, and more\n\n- **Automatic Profiling** - Distributions, top values, quantiles, string lengths, date ranges\n\n- **Fields With Wrong Type** - Detect string columns that contain only dates, integers, booleans, etc.\n\n- **Relationship Detection** - Automatically discover foreign keys\n\n- **Rich HTML Reports** - 4-tab interface (Summary/Insights/Checks/Metadata) with visual gradients and timelines\n\n- **Secure by Design** - Zero data exports, database-native operations via Ibis, PII masking\n\n\n---\n\n## 📋 What You Can Audit\n\n- **Tables \u0026 Views** - Tables, views, and materialized views\n- **Multiple Schemas** - Audit across datasets/databases in one run\n- **Custom Queries** - Audit filtered data (e.g., \"last 7 days only\")\n\n---\n\n## 🎯 Use Cases\n\n- **Data Migration** - Validate data before/after migrations\n- **Post-ETL Quality Gates** - Catch issues in transformation pipelines\n- **Schema Discovery** - Fast metadata exploration with `--discover` mode\n- **Relationship Mapping** - Understand foreign keys in legacy systems\n- **Compliance Audits** - PII 
detection and masking for governance\n\n---\n\n## 📊 Example Output\n\n### Console\n```\n📋 Column Summary (All Columns):\n==================================================\nColumn Name          Type        Status      Nulls\n--------------------------------------------------\nuser_id             int64       ✓ OK        0 (0.0%)\nemail               string      ✗ ERROR     2 (1.2%)\ncreated_at          datetime    ✓ OK        0 (0.0%)\n\n🔍 Issues Found:\n⚠️  EMAIL REGEX: 2 values don't match pattern\n   Examples: 'invalid.email@', 'user@domain'\n```\n\n### HTML Report Tabs\n1. **Summary** - Overview, primary keys, table metadata\n2. **Insights** - Visual distributions with gradient bars, top values\n3. **Quality Checks** - Issues with examples and primary key context\n4. **Metadata** - Audit config, duration\n\n---\n\n## ⚙️ Configuration Examples\n\n### Minimal Setup\n#### BigQuery\n```yaml\ndatabase:\n  backend: \"bigquery\"\n  connection_params:\n    default_database: \"my-project\"\n    default_schema: \"analytics\"\n\ntables:\n  - name: users\n  - name: orders\n```\n\n#### Snowflake\n```yaml\ndatabase:\n  backend: \"snowflake\"\n  connection_params:\n    default_database: \"MY_DB\"\n    default_schema: \"MY_SCHEMA\"\n    account: \"ACCOUNT\"\n    user: \"USER\"\n    password: \"PWD\"\n\ntables:\n  - name: users\n  - name: orders\n```\n\n### Using Environment Variables (Recommended for Credentials)\n\n**Protect sensitive credentials by using environment variables instead of hardcoding them in YAML:**\n\n#### Supported Formats\n```yaml\ndatabase:\n  backend: \"snowflake\"\n  connection_params:\n    default_database: \"MY_DB\"\n    default_schema: \"MY_SCHEMA\"\n    account: \"${SNOWFLAKE_ACCOUNT}\"              # Basic format\n    user: \"$SNOWFLAKE_USER\"                      # Short format\n    password: \"${SNOWFLAKE_PASSWORD}\"\n    warehouse: \"${SNOWFLAKE_WAREHOUSE:-COMPUTE_WH}\"  # With default value\n```\n\n#### Usage\n\n**Option 1: Using .env file 
(recommended)**\n```bash\n# Create .env file (use single quotes for passwords with special chars like $)\ncat \u003e .env \u003c\u003c 'EOF'\nexport SNOWFLAKE_ACCOUNT='your-account'\nexport SNOWFLAKE_USER='your-username'\nexport SNOWFLAKE_PASSWORD='your-password'\nEOF\n\n# Load and run\nsource .env\ndw_auditor run\n```\n\n**Option 2: Export directly**\n```bash\n# Set environment variables (use single quotes for special chars)\nexport SNOWFLAKE_ACCOUNT='OOQYWEC-ND51384'\nexport SNOWFLAKE_USER='my_user'\nexport SNOWFLAKE_PASSWORD='my_password'\n\n# Run audit\ndw_auditor run\n```\n\n**Option 3: Inline (for one-time use)**\n```bash\nSNOWFLAKE_PASSWORD='secret' dw_auditor run\n```\n\n**Benefits:**\n- ✅ Keep credentials out of version control\n- ✅ Different credentials per environment (dev/staging/prod)\n- ✅ Works with CI/CD secrets management\n- ✅ Supports default values: `${VAR:-default}`\n\n### Multi-Schema Auditing\n```yaml\ntables:\n  - name: raw_customers\n    schema: raw_data\n  - name: stg_customers\n    schema: staging\n  - name: prod_customers\n    schema: production\n```\n\n### Custom Quality Checks\n```yaml\ncolumn_checks:\n  tables:\n    users:\n      email:\n        regex_patterns:\n          pattern: \"^[\\\\w._%+-]+@[\\\\w.-]+\\\\.[a-zA-Z]{2,}$\"\n          mode: \"match\"\n      age:\n        greater_than_or_equal: 18\n        less_than: 120\n```\n\n### Relationship Detection\n```yaml\nrelationship_detection:\n  enabled: true\n  confidence_threshold: 0.7   # 70% confidence to detect\n  min_confidence_display: 0.5 # Show relationships \u003e= 50%\n```\n\n**Full configuration guide**: See inline comments in [`audit_config.yaml`](./audit_config.yaml)\n\n---\n\n## 🔧 Advanced Usage\n\n### Initialize Config\n```bash\ndw_auditor init                      # Create in current directory (./audit_config.yaml)\ndw_auditor init --force              # Overwrite existing config\ndw_auditor init --path ./my.yaml     # Create in custom location\n```\n\n### Run 
Audit\n```bash\ndw_auditor run                       # Auto-discover config\ndw_auditor run custom.yaml           # Use specific config file\ndw_auditor run --yes                 # Auto-confirm prompts\n```\n\n### Audit Modes\n```bash\ndw_auditor run --discover            # Metadata only (fast)\ndw_auditor run --check               # Quality checks only\ndw_auditor run --insight             # Profiling only\n```\n\n---\n\n## 📚 Documentation\n\n- **[Configuration Reference](./audit_config.yaml)** - Inline documentation for all options\n- **[Quality Checks Guide](./doc/checks.md)** - All checks with examples\n- **[Data Insights Guide](./doc/insights.md)** - All insights with examples\n\n\n\n---\n\n## 🛠️ Troubleshooting\n\n### Installation\n**Using uv**: Make sure uv is installed: `curl -LsSf https://astral.sh/uv/install.sh | sh`\n**Using pip**: You can still install with: `pip install -e .` (reads from pyproject.toml)\n\n### Authentication\n**BigQuery**: Use `gcloud auth application-default login` or set `credentials_path` in config\n**Snowflake**: Use environment variables for credentials (see Configuration Examples) or `authenticator: externalbrowser` for SSO\n**Security**: Always use environment variables for passwords - never commit credentials to git\n\n### Performance\n- Sampling is always database-native via Ibis (fast \u0026 secure)\n- Increase `sample_size` carefully (default: 100,000 rows)\n- Use `--discover` for metadata-only scans\n\n### Memory Issues\n- Reduce `sample_size` in config\n- Audit fewer tables per run\n- Disable expensive insights (e.g., reduce `quantiles` count)\n\n---\n\n## 🏗️ Architecture\n\nBuilt on modern Python data tools:\n- **Ibis** - Database abstraction (lazy SQL generation, no data exports)\n- **Polars** - Fast DataFrame processing\n- **Pydantic** - Type-safe configuration validation\n\n**Design**: All computation happens in your database. 
No data is exported to files.\n\n---\n\n## 🔐 Security Features\n\n**Built-in security controls to protect sensitive data:**\n\n### 1. **Automatic PII Masking**\n- Auto-detects 32+ PII keywords (email, phone, SSN, credit card, etc.)\n- Replaces values with `***PII_MASKED***` before analysis\n- Customizable keyword list per your compliance needs\n\n```yaml\nsecurity:\n  mask_pii: true\n  custom_pii_keywords: [\"employee_id\", \"internal_code\"]\n```\n\n### 2. **Zero Data Export Architecture**\n- **Database-native queries** - All computation happens in your database (via Ibis)\n- **No intermediate files** - Data never written to disk\n- **Metadata-only exports** - Reports contain statistics, not raw data\n\n### 3. **Data Minimization**\n- **Column filtering** - Exclude sensitive columns entirely\n- **Sampling** - Analyze subset of data (database-native TABLESAMPLE)\n- **Temporary in-memory only** - Data discarded after analysis\n\n### 4. **What's Exported vs Protected**\n\n✅ **Exported** (Safe for Reports):\n- Column metadata (names, types, descriptions)\n- Statistics (nulls, distinct counts, ranges)\n- Quality check results\n- Top values (with PII masked)\n\n❌ **Never Exported**:\n- Raw column data\n- Full table contents\n- PII values\n- Credentials (use environment variables to keep them out of config files)\n\n---\n\n## 📝 License\n\nMIT License - See [LICENSE](./LICENSE) file\n\n---\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fv-cth%2Fdatabase_audit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fv-cth%2Fdatabase_audit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fv-cth%2Fdatabase_audit/lists"}