{"id":13564288,"url":"https://github.com/sodadata/soda-core","last_synced_at":"2025-05-14T04:08:18.479Z","repository":{"id":36951702,"uuid":"321458274","full_name":"sodadata/soda-core","owner":"sodadata","description":":zap: Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io","archived":false,"fork":false,"pushed_at":"2025-05-12T09:34:47.000Z","size":4061,"stargazers_count":2084,"open_issues_count":137,"forks_count":234,"subscribers_count":14,"default_branch":"main","last_synced_at":"2025-05-12T10:44:58.167Z","etag":null,"topics":["data-contracts","data-engineering","data-governance","data-monitoring","data-observability","data-profiling","data-quality","data-quality-checks","data-quality-monitoring","data-quality-testing","data-reliability","data-testing","data-unit-tests","data-validation","dataquality","datatesting","dbt","pipeline-testing","python","snowflake"],"latest_commit_sha":null,"homepage":"https://go.soda.io/core-docs","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sodadata.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING-DATA-SOURCE.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-12-14T19:59:19.000Z","updated_at":"2025-05-08T22:14:37.000Z","dependencies_parsed_at":"2023-10-14T22:27:49.710Z","dependency_job_id":"6cb7aca8-577f-4de6-af85-1c702ec63680","html_url":"https://github.com/sodadata/soda-core","commit_stats":{"total_commits":764,"total_committers":41,"mean_commits":"18.634146341463413","dds":0.6492146596858639,"last_synced_commit":"5021b74bfd4d7191c64613ad93b70cda1eb910b2"},"previous_names":[],"tags_count":119,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sodadata%2Fsoda-core","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sodadata%2Fsoda-core/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sodadata%2Fsoda-core/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sodadata%2Fsoda-core/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sodadata","download_url":"https://codeload.github.com/sodadata/soda-core/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254069688,"owners_count":22009558,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-contracts","data-engineering","data-governance","data-monitoring","data-observability","data-profiling","data-quality","data-quality-checks","data-quality-monitoring","data-quality-testing","data-reliability","data-testing","data-unit-tests","data-validation","dataquality","datatesting","dbt","pipeline-testing","python","snowflake"],"created_at":"2024-08-01T13:01:29.230Z","updated_at":"2025-05-14T04:08:13.458Z","avatar_url":"https://github.com/sodadata.png","language":"Python","readme":"\n\u003ch1 align=\"center\"\u003eSoda Core\u003c/h1\u003e\n\u003cp align=\"center\"\u003e\u003cb\u003eData quality testing for SQL-, Spark-, and Pandas-accessible data.\u003c/b\u003e\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/sodadata/soda-core/blob/main/LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/license-Apache%202-blue.svg\" alt=\"License: Apache 2.0\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://join.slack.com/t/soda-community/shared_invite/zt-m77gajo1-nXJF7JtbbRht2zwaiLb9pg\"\u003e\u003cimg alt=\"Slack\" src=\"https://img.shields.io/badge/chat-slack-green.svg\"\u003e\u003c/a\u003e\n  \u003ca href=\"#\"\u003e\u003cimg src=\"https://static.pepy.tech/personalized-badge/soda-core?period=total\u0026units=international_system\u0026left_color=black\u0026right_color=green\u0026left_text=Downloads\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003chr /\u003e\n\n\u003e [!IMPORTANT]  \n\u003e **🚀 We're hiring! Are you passionate about open-source and love working on projects like Soda Core? Join our team as a Senior Data Engineer and help shape the future of data quality tools. [Apply now!](https://careers.soda.io/o/senior-data-engineer-python?source=gh-core)**\n\n\u003chr /\u003e\n\n\u0026#10004;  An open-source, CLI tool and Python library for data quality testing\u003cbr /\u003e\n\u0026#10004;  Compatible with the \u003ca href=\"https://docs.soda.io/soda-cl/soda-cl-overview.html\" target=\"_blank\"\u003eSoda Checks Language (SodaCL)\u003c/a\u003e  \u003cbr /\u003e\n\u0026#10004;  Enables data quality testing both in and out of your data pipelines and development workflows\u003cbr /\u003e\n\u0026#10004;  Integrated to allow a Soda scan in a data pipeline, or programmatic scans on a time-based schedule \u003cbr /\u003e\n\n\nSoda Core is a free, open-source, command-line tool and Python library that enables you to use the Soda Checks Language to turn user-defined input into aggregated SQL queries. \n\nWhen it runs a scan on a dataset, Soda Core executes the checks to find invalid, missing, or unexpected data. When your Soda Checks fail, they surface the data that you defined as bad-quality.\n\n#### Soda Library \n\nConsider migrating to **[Soda Library](https://docs.soda.io/soda/quick-start-sip.html)**, an extension of Soda Core that offers more features and functionality, and enables you to connect to a [Soda Cloud](https://docs.soda.io/soda-cloud/overview.html) account to collaborate with your team on data quality.\n* Use [Group by](https://docs.soda.io/soda-cl/group-by.html) and [Group Evolution](https://docs.soda.io/soda-cl/group-evolution.html) configurations to intelligently group check results\n* Leverage [Reconciliation checks](https://docs.soda.io/soda-cl/recon.html) to compare data between data sources for data migration projects.\n* Use [Schema Evolution](https://docs.soda.io/soda-cl/schema.html#define-schema-evolution-checks) checks to automatically validate schemas.\n* Set up [Anomaly Detection](https://docs.soda.io/soda-cl/anomaly-detection.html) checks to automatically learn patterns and discover anomalies in your data.\n\n[Install Soda Library](https://docs.soda.io/soda-library/install.html) and get started with a 45-day free trial.\n\n\u003cbr /\u003e\n\n## Get started\n\nSoda Core currently supports connections to several data sources. See [Compatibility](/docs/installation.md#compatibility) for a complete list.\n\n**Requirements**\n* Python 3.8 or greater\n* Pip 21.0 or greater\n\n\n**Install and run**\n1. To get started, use the install command, replacing `soda-core-postgres` with the package that matches your data source.  See [Install Soda Core](/docs/installation.md) for a complete list.\u003cbr /\u003e\n    ```shell\n    pip install soda-core-postgres\n    ```\n\n2. Prepare a `configuration.yml` file to connect to your data source. Then, write data quality checks in a `checks.yml` file. See [Configure Soda Core](/docs/configuration.md).\n\n3. Run a scan to review checks that passed, failed, or warned during a scan. See [Run a Soda Core scan](/docs/scan-core.md).\n    ```shell\n    soda scan -d your_datasource -c configuration.yml checks.yml\n    ```\n\n#### Example checks\n```yaml\n# Checks for basic validations\nchecks for dim_customer:\n  - row_count between 10 and 1000\n  - missing_count(birth_date) = 0\n  - invalid_percent(phone) \u003c 1 %:\n      valid format: phone number\n  - invalid_count(number_cars_owned) = 0:\n      valid min: 1\n      valid max: 6\n  - duplicate_count(phone) = 0\n\n# Checks for schema changes\nchecks for dim_product:\n  - schema:\n      name: Find forbidden, missing, or wrong type\n      warn:\n        when required column missing: [dealer_price, list_price]\n        when forbidden column present: [credit_card]\n        when wrong column type:\n          standard_cost: money\n      fail:\n        when forbidden column present: [pii*]\n        when wrong column index:\n          model_name: 22\n# Check for freshness \n  - freshness(start_date) \u003c 1d\n\n# Check for referential integrity\nchecks for dim_department_group:\n  - values in (department_group_name) must exist in dim_employee (department_name)\n```\n\u003cbr /\u003e\n\n## Documentation\n\n* [Soda Core](/docs/overview-main.md)\n* [Soda Checks Language (SodaCL)](https://docs.soda.io/soda-cl/soda-cl-overview.html)\n\n","funding_links":[],"categories":["📊 Data Validation \u0026 Quality","Python","Tools","Table of Contents","GenAI Readiness Features"],"sub_categories":["Open Source Tools","Frameworks and Libraries","Data Quality, Observability \u0026 Governance"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsodadata%2Fsoda-core","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsodadata%2Fsoda-core","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsodadata%2Fsoda-core/lists"}