Bulk Data Profiling Solution using Great Expectations
- Host: GitHub
- URL: https://github.com/paulf-999/data_profiling_w_great_expectations
- Owner: paulf-999
- Created: 2023-10-13T05:31:51.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-10-13T05:35:19.000Z (over 1 year ago)
- Last Synced: 2025-01-27T23:48:23.437Z (4 months ago)
- Topics: great-expectations, makefile, python3
- Language: HTML
- Size: 37.1 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
# Bulk Data Profiling using Great Expectations
This repository provides a streamlined way to perform data profiling on a list of input tables using Great Expectations. Follow the instructions below to set up and run the data profiling process.
## Prerequisites
Before you begin, ensure you have the following in place:
1. **Configure `.env` File**
* Create the `.env` file by copying and renaming `.env_template`, e.g.:
```bash
cp .env_template .env
```
* Then populate the `.env` file, assigning values to the Snowflake variables listed below:
```jinja
# Snowflake credentials
SNOWFLAKE_USER={{ SF_USER }}
SNOWFLAKE_PASSWORD={{ SF_PASSWORD }}
SNOWFLAKE_ACCOUNT={{ SF_ACCOUNT }}
SNOWFLAKE_REGION={{ SF_REGION }}
SNOWFLAKE_ROLE={{ SF_ROLE }}
SNOWFLAKE_WAREHOUSE={{ SF_WAREHOUSE }}
SNOWFLAKE_DATABASE={{ SF_DATABASE }}
SNOWFLAKE_SCHEMA={{ SF_SCHEMA }}
SNOWFLAKE_HOST={{ SF_HOST }}
INPUT_TABLE={{ SF_EG_TABLE }}
GX_DATA_SRC={{ GX_SOURCE_NAME }}
ROW_COUNT_LIMIT=30
```
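At runtime these values need to end up in the process environment. As a stdlib-only illustration (the project itself may rely on a tool such as python-dotenv, or on Make exporting the variables), a minimal loader could look like this:

```python
import os


def load_env_file(path=".env"):
    """Read simple KEY=VALUE lines from a .env file into os.environ.

    Stdlib-only sketch; not part of this repository.
    """
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blanks, comments, and anything without a KEY=VALUE shape.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    os.environ.update(env)
    return env
```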
2. **Enter list of input tables in `config.yaml`**
* Open `config.yaml`.
* Open `config.yaml`.
* Under the `input_tables` key, list the tables you want to profile, e.g.:
```yaml
input_tables:
- table_name_1
- table_name_2
# Add more input tables as needed
```
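The profiling scripts then read this list back out of `config.yaml`. The real project presumably uses a YAML library such as PyYAML for this; the stdlib-only sketch below handles just the simple shape shown above:

```python
def read_input_tables(path="config.yaml"):
    """Return the list under the `input_tables` key of a config file.

    Simplified, stdlib-only reader for illustration; a YAML library
    (e.g. PyYAML's safe_load) would handle the general case.
    """
    tables = []
    in_section = False
    with open(path) as f:
        for raw in f:
            # Drop trailing comments and whitespace.
            line = raw.split("#", 1)[0].rstrip()
            if line.strip() == "input_tables:":
                in_section = True
                continue
            if in_section:
                stripped = line.strip()
                if stripped.startswith("- "):
                    tables.append(stripped[2:].strip())
                elif stripped:
                    break  # next key ends the input_tables section
    return tables
```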
## Usage
After meeting the above prerequisites, profile your input tables using the following command:
```shell
make all
```
This command will:
1. Create a Python Virtual Environment with the required Python libraries
* See Makefile target `deps`.
2. Create a Great Expectations (GX) project with the list of Snowflake tables you provided
* See Makefile target `install`.
3. Create a data profile and (test) expectation suite, per-input table
* See Makefile target `create_gx_profiler_and_expectation_suite`.
4. Generate GX 'data docs' - i.e., HTML pages to view the content.
* See Makefile target `update_gx_data_docs`.

Feel free to reach out if you encounter any issues or have questions about the process. Happy data profiling!
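The four targets above can also be invoked individually (`make deps`, `make install`, and so on). Purely as an illustration of the sequence — the Makefile already chains these targets itself, so nothing like this exists in the repo — the same flow could be driven from Python:

```python
import subprocess

# The Makefile targets named in the steps above, in the order
# `make all` runs them (per this README).
TARGETS = [
    "deps",
    "install",
    "create_gx_profiler_and_expectation_suite",
    "update_gx_data_docs",
]


def run_pipeline(dry_run=True):
    """Run each Makefile target in sequence instead of `make all`.

    Illustrative only. With dry_run=True the commands are returned
    for inspection rather than executed.
    """
    commands = [["make", target] for target in TARGETS]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands
```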