Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/sodadata/soda-core
:zap: Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io
- Host: GitHub
- URL: https://github.com/sodadata/soda-core
- Owner: sodadata
- License: apache-2.0
- Created: 2020-12-14T19:59:19.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2025-01-14T14:12:11.000Z (20 days ago)
- Last Synced: 2025-01-23T15:01:42.297Z (11 days ago)
- Topics: data-contracts, data-engineering, data-governance, data-monitoring, data-observability, data-profiling, data-quality, data-quality-checks, data-quality-monitoring, data-quality-testing, data-reliability, data-testing, data-unit-tests, data-validation, dataquality, datatesting, dbt, pipeline-testing, python, snowflake
- Language: Python
- Homepage: https://go.soda.io/core-docs
- Size: 2.88 MB
- Stars: 1,978
- Watchers: 15
- Forks: 219
- Open Issues: 135
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING-DATA-SOURCE.md
- License: LICENSE
Awesome Lists containing this project
- awesome-data-quality - soda - enables data testing through extended SQL queries. (Table of Contents / Frameworks and Libraries)
README
# Soda Core
Data quality testing for SQL-, Spark-, and Pandas-accessible data.
> [!IMPORTANT]
> **🚀 We're hiring! Are you passionate about open-source and love working on projects like Soda Core? Join our team as a Software Engineer and help shape the future of data quality tools. [Apply now!](https://careers.soda.io/o/software-engineer-data-testing-python-data-engineering-mediorsenior?source=gh-core)**
✔ An open-source CLI tool and Python library for data quality testing
✔ Compatible with the Soda Checks Language (SodaCL)
✔ Enables data quality testing both in and out of your data pipelines and development workflows
✔ Integrated to allow a Soda scan in a data pipeline, or programmatic scans on a time-based schedule
Soda Core is a free, open-source, command-line tool and Python library that enables you to use the Soda Checks Language to turn user-defined input into aggregated SQL queries.
When it runs a scan on a dataset, Soda Core executes the checks to find invalid, missing, or unexpected data. When your Soda Checks fail, they surface the data that you defined as bad-quality.
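To make the idea of "aggregated SQL queries" concrete, here is a minimal conceptual sketch. It is not Soda Core's implementation and does not show the SQL that Soda Core generates. It uses Python's built-in `sqlite3` module and an invented `dim_customer` table to compute several check metrics in a single aggregate query, then compares them against thresholds taken from the example checks in this README. The function names (`build_sample_db`, `compute_metrics`, `evaluate_checks`) and the sample rows are assumptions made purely for illustration.

```python
import sqlite3


def build_sample_db() -> sqlite3.Connection:
    """Create an in-memory table with invented rows (illustrative data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dim_customer ("
        "id INTEGER, birth_date TEXT, number_cars_owned INTEGER)"
    )
    rows = [(i, "1980-01-01", 2) for i in range(1, 11)]
    rows.append((11, None, 3))          # missing birth_date
    rows.append((12, "1990-05-05", 9))  # number_cars_owned outside 1..6
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", rows)
    return conn


def compute_metrics(conn: sqlite3.Connection) -> dict:
    """Compute all metrics in one pass over the table with one aggregate query."""
    query = """
        SELECT
            COUNT(*) AS row_count,
            SUM(CASE WHEN birth_date IS NULL THEN 1 ELSE 0 END) AS missing_birth_date,
            SUM(CASE WHEN number_cars_owned IS NOT NULL
                      AND number_cars_owned NOT BETWEEN 1 AND 6
                     THEN 1 ELSE 0 END) AS invalid_cars
        FROM dim_customer
    """
    row_count, missing, invalid = conn.execute(query).fetchone()
    return {
        "row_count": row_count,
        "missing_count(birth_date)": missing,
        "invalid_count(number_cars_owned)": invalid,
    }


def evaluate_checks(metrics: dict) -> dict:
    """Compare each metric with its threshold; True means the check passes."""
    return {
        "row_count between 10 and 1000": 10 <= metrics["row_count"] <= 1000,
        "missing_count(birth_date) = 0": metrics["missing_count(birth_date)"] == 0,
        "invalid_count(number_cars_owned) = 0": metrics["invalid_count(number_cars_owned)"] == 0,
    }


if __name__ == "__main__":
    results = evaluate_checks(compute_metrics(build_sample_db()))
    for check, passed in results.items():
        print(f"{'PASS' if passed else 'FAIL'}: {check}")
```

Folding several metrics into one `SELECT` means the table is read once no matter how many checks are defined, which is the general motivation for aggregating check queries.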
#### Soda Library
Consider migrating to **[Soda Library](https://docs.soda.io/soda/quick-start-sip.html)**, an extension of Soda Core that offers more features and functionality, and enables you to connect to a [Soda Cloud](https://docs.soda.io/soda-cloud/overview.html) account to collaborate with your team on data quality.
* Use [Group by](https://docs.soda.io/soda-cl/group-by.html) and [Group Evolution](https://docs.soda.io/soda-cl/group-evolution.html) configurations to intelligently group check results.
* Leverage [Reconciliation checks](https://docs.soda.io/soda-cl/recon.html) to compare data between data sources for data migration projects.
* Use [Schema Evolution](https://docs.soda.io/soda-cl/schema.html#define-schema-evolution-checks) checks to automatically validate schemas.
* Set up [Anomaly Detection](https://docs.soda.io/soda-cl/anomaly-detection.html) checks to automatically learn patterns and discover anomalies in your data.

[Install Soda Library](https://docs.soda.io/soda-library/install.html) and get started with a 45-day free trial.
## Get started
Soda Core currently supports connections to several data sources. See [Compatibility](/docs/installation.md#compatibility) for a complete list.
**Requirements**
* Python 3.8 or greater
* Pip 21.0 or greater

**Install and run**
1. To get started, use the install command, replacing `soda-core-postgres` with the package that matches your data source. See [Install Soda Core](/docs/installation.md) for a complete list.
```shell
pip install soda-core-postgres
```
2. Prepare a `configuration.yml` file to connect to your data source. Then, write data quality checks in a `checks.yml` file. See [Configure Soda Core](/docs/configuration.md).
3. Run a scan to review checks that passed, failed, or warned during a scan. See [Run a Soda Core scan](/docs/scan-core.md).
```shell
soda scan -d your_datasource -c configuration.yml checks.yml
```
#### Example checks
```yaml
# Checks for basic validations
checks for dim_customer:
  - row_count between 10 and 1000
  - missing_count(birth_date) = 0
  - invalid_percent(phone) < 1 %:
      valid format: phone number
  - invalid_count(number_cars_owned) = 0:
      valid min: 1
      valid max: 6
  - duplicate_count(phone) = 0

# Checks for schema changes
checks for dim_product:
  - schema:
      name: Find forbidden, missing, or wrong type
      warn:
        when required column missing: [dealer_price, list_price]
        when forbidden column present: [credit_card]
        when wrong column type:
          standard_cost: money
      fail:
        when forbidden column present: [pii*]
        when wrong column index:
          model_name: 22
  # Check for freshness
  - freshness(start_date) < 1d

# Check for referential integrity
checks for dim_department_group:
  - values in (department_group_name) must exist in dim_employee (department_name)
```
## Documentation
* [Soda Core](/docs/overview-main.md)
* [Soda Checks Language (SodaCL)](https://docs.soda.io/soda-cl/soda-cl-overview.html)