{"id":31542272,"url":"https://github.com/koddachad/dq_tester","last_synced_at":"2025-10-08T08:40:49.038Z","repository":{"id":317632375,"uuid":"1068107591","full_name":"koddachad/dq_tester","owner":"koddachad","description":"A lightweight simple data quality testing tool. ","archived":false,"fork":false,"pushed_at":"2025-10-03T17:22:35.000Z","size":8024,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-05T20:44:16.624Z","etag":null,"topics":["data","database","dataengineering","dataquality","dataqualitycheck"],"latest_commit_sha":null,"homepage":"https://www.kodda.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/koddachad.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-01T21:38:28.000Z","updated_at":"2025-10-05T18:24:57.000Z","dependencies_parsed_at":"2025-10-04T11:45:04.380Z","dependency_job_id":null,"html_url":"https://github.com/koddachad/dq_tester","commit_stats":null,"previous_names":["koddachad/dq_tester"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/koddachad/dq_tester","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/koddachad%2Fdq_tester","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/koddachad%2Fdq_tester/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/koddachad%2Fdq_tester/releases","manifests_url":
"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/koddachad%2Fdq_tester/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/koddachad","download_url":"https://codeload.github.com/koddachad/dq_tester/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/koddachad%2Fdq_tester/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278708028,"owners_count":26031932,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-06T02:00:05.630Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","database","dataengineering","dataquality","dataqualitycheck"],"created_at":"2025-10-04T11:34:55.293Z","updated_at":"2025-10-07T01:55:34.777Z","avatar_url":"https://github.com/koddachad.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DQ Tester\n\nA lightweight, simple data quality testing tool for CSV files and databases.\n\n## Overview\n\nDQ Tester allows you to define data quality checks in SQL and organize them into reusable test catalogs. 
Execute tests against CSV files or databases, and monitor results through an interactive Streamlit dashboard.\n\n**Key Features:**\n- 📝 Simple YAML configuration for tests and catalogs\n- 🗄️ Support for CSV files and databases (via ODBC)\n- 📊 Built-in Streamlit dashboard for monitoring test results\n- 🔍 Customizable SQL-based data quality checks\n- 💾 Test results stored in DuckDB for analysis\n\n## Requirements\n\n- Python 3.9+\n- ODBC driver for your database (currently tested with PostgreSQL)\n\n## Installation\n\n```bash\npip install dq_tester\n```\n\n## Setup with Claude\n\nTo use DQ Tester with Claude:\n\n1. Upload `prompt_custom_instructions.txt` as a project instruction file\n2. Upload the following example files to your project:\n   - `examples/catalog.yaml`\n   - `examples/connections.yaml`\n   - `examples/db_test_plan.yaml`\n   - `examples/file_test_plan.yaml`\n\nClaude will use these files to help you create and manage data quality tests.\n\n## Python API\n\nDQ Tester can be used directly in Python for custom workflows and integrations:\n\n```python\nimport dq_tester\nimport sys\n\nresults = dq_tester.run_tests('examples/catalog.yaml', 'examples/file_test_plan.yaml')\nfailed = [r for r in results if r['status'] == 'FAIL']\n\nif failed:\n    print(f\"{len(failed)} tests failed\")\n    sys.exit(1)\nprint(\"All tests passed\")\n```\n\n## Quick Start\n\n### 1. Configure Database Connections\n\nCreate a `connections.yaml` file:\n\n```yaml\nconnections:\n  - name: sample_db\n    driver: \"PostgreSQL\"\n    server: \"myserver\"\n    database: \"demo_source\"\n    username: \"user\"\n    password: \"password\"\n\n  - name: results_db  # Required: results stored here\n    driver: \"DuckDB\"\n    database: \"./results/test_results.duckdb\"\n```\n\n**Note:** The `results_db` connection is required for storing test results.\n\n### 2. 
Create a Catalog of DQ Checks\n\nCreate a `catalog.yaml` file defining reusable data quality checks:\n\n```yaml\ndq_checks:\n  - name: null_values\n    type: sql\n    sql: |\n      select count(1)\n      from {table_name}\n      where {column_name} is null\n  \n  - name: duplicate_key\n    type: sql\n    sql: |\n      select count(1)\n      from (\n        select {key_cols}\n        from {table_name}\n        group by {key_cols}\n        having count(1) \u003e 1\n      ) t1\n  \n  - name: invalid_email_duckdb\n    type: sql\n    sql: |\n      select count(1)\n      from {table_name}\n      WHERE NOT REGEXP_FULL_MATCH(\n          {column_name},\n          '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{{2,}}$'\n      )\n```\n\n**Parameters in SQL:** Use `{parameter_name}` in your SQL to define parameters. These must be provided in your test plan.\n\n### 3. Create a Test Plan\n\n#### For CSV Files\n\n**Quick Start with csv-to-yaml:**\n\nDQ Tester can generate the YAML structure for your CSV file automatically:\n\n```bash\ndq_tester -a csv-to-yaml --csv-path examples/datasets/customers.csv\n```\n\nThis outputs the column definitions which you can copy into your test plan:\n\n```yaml\nhas_header: true\ndelimiter: ','\ncolumns:\n  - name: Customer Id\n    type: VARCHAR\n  - name: First Name\n    type: VARCHAR\n  - name: Email\n    type: VARCHAR\n  # ... 
etc\n```\n\n**Complete Test Plan:**\n\nCreate a test plan (e.g., `file_test_plan.yaml`) by adding the generated output plus source, file_name, and dq_tests:\n\n```yaml\nsource: csv\nfile_name: examples/datasets/customers.csv\nhas_header: true\ndelimiter: ','\ncolumns:\n  - name: Customer Id\n    type: VARCHAR\n  - name: First Name\n    type: VARCHAR\n  - name: Email\n    type: VARCHAR\n\ndq_tests:\n  - dq_test_name: null_values\n    column_name: '\"Customer Id\"'\n    threshold:\n      type: count\n      operator: \"==\"\n      value: 0\n  \n  - dq_test_name: invalid_email_duckdb\n    column_name: '\"Email\"'\n    threshold:\n      type: count\n      operator: \"==\"\n      value: 0\n  \n  - dq_test_name: total_records\n    threshold:\n      type: count\n      operator: \"\u003e\"\n      value: 1000\n```\n\n**CSV Requirements:**\n- `file_name`: Path to the CSV file\n- `has_header`: Whether the CSV has a header row\n- `delimiter`: Column delimiter (usually `,`)\n- `columns`: Column definitions (types are optional)\n\n#### For Databases\n\nCreate a test plan (e.g., `db_test_plan.yaml`):\n\n```yaml\nsource: database\nconnection_name: sample_db\ntable_name: sales.salesorderdetail\n\ndq_tests:\n  - dq_test_name: null_values\n    column_name: salesorderid\n    threshold:\n      type: count\n      operator: \"==\"\n      value: 0\n  \n  - dq_test_name: duplicate_key\n    key_cols: salesorderdetailid\n    threshold:\n      type: count\n      operator: \"==\"\n      value: 0\n```\n\n**Database Requirements:**\n- `connection_name`: Name from `connections.yaml`\n- `table_name`: Fully qualified table name (e.g., `schema.table`)\n\n## CLI Commands\n\n### Run Tests\n\nExecute a test plan against your data:\n\n```bash\ndq_tester -a run \\\n  -c examples/catalog.yaml \\\n  -t examples/file_test_plan.yaml \\\n  --connections-path connections.yaml\n```\n\n**Required Options:**\n- `-a run` or `--action run`: Action to execute\n- `-c` or `--catalog-path`: Path to catalog YAML file\n- 
`-t` or `--test-plan-path`: Path to test plan YAML file\n\n**Optional:**\n- `--connections-path`: Path to connections YAML file (defaults to searching standard locations)\n\n### Validate Configuration\n\nValidate your catalog and test plan files without running tests:\n\n```bash\ndq_tester -a validate \\\n  -c examples/catalog.yaml \\\n  -t examples/file_test_plan.yaml\n```\n\n**Required Options:**\n- `-a validate` or `--action validate`: Action to execute\n- `-c` or `--catalog-path`: Path to catalog YAML file\n- `-t` or `--test-plan-path`: Path to test plan YAML file\n\nThis checks for:\n- Valid YAML syntax\n- Required fields present\n- Proper test configuration\n- Parameter matching between catalog and test plan\n\n### Generate CSV YAML Structure\n\nAutomatically generate the YAML structure for a CSV file:\n\n```bash\ndq_tester -a csv-to-yaml --csv-path examples/datasets/customers.csv\n```\n\n**Required Options:**\n- `-a csv-to-yaml` or `--action csv-to-yaml`: Action to execute\n- `--csv-path`: Path to the CSV file to analyze\n\nThis analyzes your CSV file and outputs the column definitions with inferred data types. Copy this output into your test plan to get started quickly.\n\n**Output example:**\n```yaml\nhas_header: true\ndelimiter: ','\ncolumns:\n  - name: Index\n    type: BIGINT\n  - name: Customer Id\n    type: VARCHAR\n  - name: Email\n    type: VARCHAR\n  # ... additional columns\n```\n\n### Launch Dashboard\n\nStart the interactive Streamlit dashboard:\n\n```bash\n# Use default connections.yaml locations\ndq_dashboard\n\n# Specify connections file\ndq_dashboard connections.yaml\n\n# Use custom port\ndq_dashboard --port 8080 connections.yaml\n```\n\n**Options:**\n- `connections_file` (positional, optional): Path to connections YAML file\n- `--port`: Port number for the dashboard (default: 8501)\n\n**Default connections.yaml locations:**\n\nIf no connections file is specified, the dashboard searches in order:\n1. `./connections.yaml` (current directory)\n2. 
`~/.dq_tester/connections.yaml` (user home directory)\n3. `/etc/dq_tester/connections.yaml` (system-wide)\n\n### Command Reference\n\n```bash\n# Show help\ndq_tester -h\n\n# Available actions\ndq_tester -a {validate,run,csv-to-yaml}\n\n# Full options\ndq_tester [-h] \n          [-c CATALOG_PATH] \n          [-t TEST_PLAN_PATH] \n          [--csv-path CSV_PATH]\n          [-a {validate,run,csv-to-yaml}]\n          [--connections-path CONNECTIONS_PATH]\n```\n\n## Built-in DQ Checks\n\n### For CSV Files\n- `invalid_records` - Count of records that fail to parse\n- `total_records` - Total number of records\n- `expected_delimiter` - Validates the delimiter used\n- `valid_header` - Validates header structure\n\n### For Databases\n- `total_records` - Total number of records in the table\n\n**Note:** Thresholds for built-in checks can be configured in your test plan.\n\n## Thresholds\n\nTests use thresholds to determine PASS/FAIL status:\n\n### Threshold Types\n- **`count`**: Compare absolute count values\n- **`pct`**: Compare percentage values (0-100)\n\n### Threshold Operators\n- `==`: Equal to\n- `!=`: Not equal to\n- `\u003c`: Less than\n- `\u003c=`: Less than or equal to\n- `\u003e`: Greater than\n- `\u003e=`: Greater than or equal to\n\n### Example Thresholds\n\n```yaml\n# Count-based threshold\nthreshold:\n  type: count\n  operator: \"==\"\n  value: 0\n\n# Percentage-based threshold\nthreshold:\n  type: pct\n  operator: \"\u003c=\"\n  value: 5  # 5% or less\n```\n\n## Test Results\n\nEach test produces one of three statuses:\n\n- **PASS**: The test result meets the threshold criteria (threshold comparison returns TRUE)\n- **FAIL**: The test result does not meet the threshold criteria\n- **ERROR**: The test execution failed (SQL error, connection issue, etc.)\n\nResults are stored in the `results_db` DuckDB database and can be viewed in the dashboard.\n\n## Example Project Structure\n\n```\nmy-dq-project/\n├── connections.yaml\n├── catalog.yaml\n├── test_plans/\n│   
├── customers_test_plan.yaml\n│   └── orders_test_plan.yaml\n└── results/\n    └── test_results.duckdb\n```\n\n## Configuration Reference\n\n### Catalog Configuration\n\n```yaml\ndq_checks:\n  - name: check_name           # Unique name for the check\n    type: sql                  # Currently only 'sql' is supported\n    sql: |                     # SQL query template\n      select count(1)\n      from {table_name}\n      where {parameter_name} condition\n```\n\n### Test Plan Configuration\n\n```yaml\n# For CSV\nsource: csv | database\nfile_name: path/to/file.csv    # CSV only\nhas_header: true | false       # CSV only\ndelimiter: ','                 # CSV only\ncolumns:                       # CSV only\n  - name: column_name\n    type: data_type            # Optional\n\n# For Database\nconnection_name: name          # Database only\ntable_name: schema.table       # Database only\n\ndq_tests:\n  - dq_test_name: check_name   # References catalog\n    parameter_name: value      # Match parameters in SQL\n    threshold:\n      type: count | pct\n      operator: \"==|!=|\u003c|\u003c=|\u003e|\u003e=\"\n      value: number\n```\n\n### Connections Configuration\n\n```yaml\nconnections:\n  - name: connection_name\n    driver: \"PostgreSQL\"       # ODBC driver name\n    server: \"hostname\"\n    database: \"database_name\"\n    username: \"user\"\n    password: \"password\"\n  \n  - name: results_db           # Required for storing results\n    driver: \"DuckDB\"\n    database: \"./path/to/results.duckdb\"\n```\n\n## ODBC Driver Setup\n\n### PostgreSQL\n\n**Linux:**\n```bash\nsudo apt-get install unixodbc unixodbc-dev odbc-postgresql\n```\n\n**macOS:**\n```bash\nbrew install unixodbc psqlodbc\n```\n\n**Windows:**\nDownload and install the PostgreSQL ODBC driver from the official website.\n\n### Other Databases\n\nDQ Tester should work with any database that has an ODBC driver. 
Install the appropriate ODBC driver for your database and configure it in `connections.yaml`.\n\n## Dashboard Features\n\nThe Streamlit dashboard provides:\n\n- 📊 **Key Metrics**: Total tests, pass rate, recent failures, columns tested\n- 🎯 **Status Distribution**: Visual breakdown of PASS/FAIL/ERROR\n- 📈 **Trends Over Time**: Historical test results\n- 🔗 **Connection Health**: Test results by connection\n- 📋 **Object Analysis**: Test results by tested object\n- 🏷️ **Column-Level Health**: Identify problematic columns\n- 🔍 **Cascading Filters**: Filter by connection → object → column\n- 📥 **CSV Export**: Download filtered results\n\n## Examples\n\nComplete examples are available in the `examples/` directory:\n- `catalog.yaml` - Sample DQ check definitions\n- `file_test_plan.yaml` - CSV testing example\n- `db_test_plan.yaml` - Database testing example\n- `connections.yaml` - Connection configuration template\n\n## License\n\nThis project is licensed under the MIT License - see the LICENSE file for details.\n\n## Support\n\nFor questions, issues, or feature requests, please contact: chad@kodda.io\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkoddachad%2Fdq_tester","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkoddachad%2Fdq_tester","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkoddachad%2Fdq_tester/lists"}