# sqlmesh-demo
POC of `sqlmesh` with Jaffle Shop data, as an engineer trying to step out of the `dbt` world!
It mimics the [dbt Jaffle Shop project](https://github.com/dbt-labs/jaffle_shop) to get a first impression of this cool tool!
_Environment used_:
- sqlmesh v0.30
- postgres 15 as the gateway
- Windows 11
- VSCode with dbt extensions installed
**Quick final look of this POC:**

## 1. Getting familiar with sqlmesh CLI
```bash
# 1. Create virtual environment & Activate it
python -m venv .env
.\.env\Scripts\activate.bat
# 2. Install sqlmesh and the dependencies if any
pip install -r requirements.txt # sqlmesh
sqlmesh --version # 0.28.0
sqlmesh --help
# Usage: sqlmesh [OPTIONS] COMMAND [ARGS]...
# SQLMesh command line tool.
# Options:
# --version Show the version and exit.
# -p, --paths TEXT Path(s) to the SQLMesh config/project.
# --config TEXT Name of the config object. Only applicable to
# configuration defined using Python script.
# --gateway TEXT The name of the gateway.
# --ignore-warnings Ignore warnings.
# --help Show this message and exit.
# Commands:
# audit Run audits.
# create_external_models Create a schema file containing external model...
# dag Renders the dag using graphviz.
# diff Show the diff between the current context and a...
# evaluate Evaluate a model and return a dataframe with a...
# fetchdf Runs a sql query and displays the results.
# format Format all models in a given directory.
# ide Start a browser-based SQLMesh IDE.
# info Print information about a SQLMesh project.
# init Create a new SQLMesh repository.
# invalidate Invalidates the target environment, forcing its...
# migrate Migrate SQLMesh to the current running version.
# plan Plan a migration of the current context's...
# prompt Uses LLM to generate a SQL query from a prompt.
# render Renders a model's query, optionally expanding...
# rollback Rollback SQLMesh to the previous migration.
# run Evaluates the DAG of models using the built-in...
# table_diff Show the diff between two tables.
# test Run model unit tests.
# ui Start a browser-based SQLMesh UI.
# 3. Initialize the project skeleton with Postgres dialect, and do the very first runs
sqlmesh init postgres
# (repo)
# ├───audits
# │       assert_positive_order_ids.sql
# ├───macros
# │       __init__.py
# ├───models
# │       full_model.sql
# │       incremental_model.sql
# │       seed_model.sql
# ├───seeds
# │       seed_data.csv
# └───tests
#         test_full_model.yaml
sqlmesh info
# Models: 3
# Macros: 11
# Data warehouse connection succeeded
# Test connection succeeded
sqlmesh plan [dev]
# ======================================================================
# Successfully Ran 1 tests against duckdb
# ----------------------------------------------------------------------
# New environment `prod` will be created from `prod`
# Summary of differences against `prod`:
# └── Added Models:
#     ├── sqlmesh_example.seed_model
#     ├── sqlmesh_example.incremental_model
#     └── sqlmesh_example.full_model
# Models needing backfill (missing dates):
# ├── sqlmesh_example.seed_model: 2023-08-25 - 2023-08-25
# ├── sqlmesh_example.incremental_model: 2020-01-01 - 2023-08-25
# └── sqlmesh_example.full_model: 2020-01-01 - 2023-08-25
# Apply - Backfill Tables [y/n]: y
# Creating new model versions ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% • 3/3 • 0:00:00
# All model versions have been created successfully
# Evaluating models ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% • 3/3 • 0:00:00
# All model batches have been executed successfully
# Virtually Updating 'prod' ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% • 0:00:00
# The target environment has been updated successfully
# The target environment has been updated successfully
sqlmesh run
# No models scheduled to run at this time.
sqlmesh audit
# Found 1 audit(s).
# assert_positive_order_ids PASS.
# Finished with 0 audit errors and 0 audits skipped.
# Done.
sqlmesh test
# .
# ----------------------------------------------------------------------
# Ran 1 test in 0.162s
# OK
```
**First impressions**:
- So far so good; the installation took quite a while, especially when installing `pandas`
- Lots of useful CLI commands
- A new concept with `plan` & `apply`, and the (virtual) environment defaults to `prod`
- It creates DuckDB files by default and runs transformations smoothly
- The project skeleton looks similar to dbt's, but not quite; there are new things such as `audit`, and only a single `config.yml` (no more dealing with `dbt_project.yml` and `profiles.yml`)
- Everything is model-based, e.g. for a seed file we need to create a corresponding model .sql file; same for source files
- No global model config
- Each model has its individual config (kind, cron, grain) plus the `SELECT` statement, a similar idea to dbt, but with no Jinja syntax here (see the sketch right after this list)!
- Great Web IDE with data lineage
- Oh wow! A `sqlmesh prompt` command - LLM
- No package ecosystem, but it seems to leverage Python's, which is huge
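As a taste of that Jinja-free syntax, here is (approximately) the `full_model.sql` that `sqlmesh init` generates; the exact defaults may vary by version:

```sql
-- Config and query live in one file: SQLMesh's MODEL DDL instead of Jinja.
MODEL (
  name sqlmesh_example.full_model,
  kind FULL,
  cron '@daily',
  grain item_id,
  audits (assert_positive_order_ids)
);

SELECT
  item_id,
  COUNT(DISTINCT id) AS num_orders
FROM sqlmesh_example.incremental_model
GROUP BY item_id
```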
**Struggling?!**:
- How to run a specific model and debug the compiled SQL code?
- The `sqlmesh run` command only allows running everything within a date range
- The `sqlmesh plan` command seems to be the same. Oh! `sqlmesh plan --select-model ` will do
- The `sqlmesh render ` command seems helpful for viewing the compiled SQL code
- The `sqlmesh evaluate ` command is my goal here, voila!
- How to generate the project documentation site and host it on GitHub Pages? With the `sqlmesh ui` command? It seems impossible as of v0.28?!
- What steps should we perform in CI/CD? I will find out later!
## 2. Model development and Testing
- **Adding seeds & models**
- It is not so quick to add the seeds, because I need to create the corresponding model files, explicitly specifying the list of columns and data types
- `sqlmesh evaluate raw_customers` produces an error: "Cannot find snapshot for 'raw_customers'"
- Oh! I need to pass in the full model name, so I should run `sqlmesh evaluate jf.raw_customers`
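For reference, a minimal sketch of such a seed model file, following the init skeleton's `SEED` kind (the `jf.raw_customers` name is from this repo; the column list is illustrative):

```sql
-- A seed model wraps a CSV file and must declare column types up front.
MODEL (
  name jf.raw_customers,
  kind SEED (
    path '../seeds/raw_customers.csv'
  ),
  columns (
    id INTEGER,
    first_name TEXT,
    last_name TEXT
  )
);
```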
- To see the compiled SQL code, let's use `sqlmesh render jf.stg_customers`
- Try duplicating a model name in the model config, and `sqlmesh plan` will complain: `Error: Duplicate key 'jf.raw_customers' found in UniqueKeyDict. Call dict.update(...) if this is intentional.`
- Auto-completion works very nicely when coding the models
- A built-in linter with `sqlmesh format`; hmm, the result doesn't look as great as I imagined, but it's still OK!
- No incremental run or full refresh run; it deals with a date range or a full load by default (see the sketch just below)
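In practice the date range is baked into the model itself; here is (approximately) the `incremental_model.sql` from the init skeleton:

```sql
-- Incremental-by-time-range: SQLMesh injects @start_ds / @end_ds each run,
-- so "incremental vs. full refresh" is simply a matter of the date range.
MODEL (
  name sqlmesh_example.incremental_model,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column ds
  ),
  start '2020-01-01',
  cron '@daily'
);

SELECT
  id,
  item_id,
  ds
FROM sqlmesh_example.seed_model
WHERE ds BETWEEN @start_ds AND @end_ds
```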
- Great readability experience with zero Jinja code
- Not a great experience typing in the CLI, because the commands are long and easy to mistype, hence time-consuming; but it is fine with the Web IDE
- Aha! The semicolon is an important bit here whenever the statement is not the SQL query itself, e.g. model config, macros
- [Loop or Control flow ops](https://sqlmesh.readthedocs.io/en/stable/concepts/macros/sqlmesh_macros/#control-flow-operators) are cool, even though they take some time to get familiar with
- Found a limitation: `@EACH(@payment_methods, x -> ... as @x_amount)` will fail, but works like a charm when changed to `@EACH(@payment_methods, x -> ... as amount_@x)` (see the sketch after the macro example below)
- Let's write a sqlmesh macro for that:
```python
@macro()
def make_order_amounts(
    evaluator,
    payment_method_values=[],
    column__payment_method: str = "payment_method",
    column__amount: str = "amount",
):
    return [
        f"""SUM(
            CASE
                WHEN {column__payment_method} = {item}
                THEN {column__amount}
                ELSE 0
            END
        ) AS {item.name}_amount"""
        for item in payment_method_values.expressions
    ]
```
```sql
@DEF(payment_methods, ARRAY['credit_card', 'coupon', 'bank_transfer', 'gift_card']);
select @make_order_amounts(@payment_methods)
...
```
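For comparison, a hedged sketch of the plain `@EACH` form that works (expanding the `...` with the same `SUM(CASE ...)` logic as the macro above; `jf.stg_payments` is a hypothetical source):

```sql
@DEF(payment_methods, ARRAY['credit_card', 'coupon', 'bank_transfer', 'gift_card']);

SELECT
  order_id,
  -- Works: literal prefix followed by the macro variable (amount_@x);
  -- the @x_amount placement fails, per the limitation noted above.
  @EACH(
    @payment_methods,
    x -> SUM(CASE WHEN payment_method = x THEN amount ELSE 0 END) AS amount_@x
  )
FROM jf.stg_payments
GROUP BY order_id
```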
- It seems the CLI logs are not exposed anywhere - hard to debug when something goes wrong
- There we go! Let's set the env variable `SQLMESH_DEBUG=1`
- In the `INCREMENTAL_BY_UNIQUE_KEY` model kind, the `unique_key` config is a tuple, e.g. `(key1, key2)`; if I write it as an array `[key1, key2]`, it hangs the `sqlmesh` command(s) (a sketch of the working form follows)
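A minimal sketch of the working tuple form (model, key, and source names are hypothetical):

```sql
-- unique_key must be a tuple; the array form [key1, key2] hangs the CLI.
MODEL (
  name jf.orders_dedup,
  kind INCREMENTAL_BY_UNIQUE_KEY (
    unique_key (order_id, order_date)
  )
);

SELECT
  order_id,
  order_date,
  status
FROM jf.stg_orders
```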
- Plan & Apply on Postgres randomly fail with an error message; running them again succeeds
```bash
Failed to create schema 'jf': duplicate key value violates unique constraint "pg_namespace_nspname_index"
DETAIL: Key (nspname)=(jf) already exists.
Failed to create schema 'jf': duplicate key value violates unique constraint "pg_namespace_nspname_index"
DETAIL: Key (nspname)=(jf) already exists.
```
- This relates to the concurrency config (defaulting to 4 tasks); definitely an issue with the Postgres adapter, but we can work around it by setting it to `1`
- It takes some time to get familiar with the new [Virtual Environment](https://tobikodata.com/virtual-data-environments.html) concept. Several schemas get created during the `plan` operation, including your configured schema & the environment schemas
- Each env schema is physically materialized as follows (see the inspection sketch below):
- Schema names are auto-prefixed with `sqlmesh__`, e.g. `sqlmesh__jf`
- Table names are auto-prefixed with the configured schema name and auto-suffixed with a hash, e.g. `jf__orders__1347386500`
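This layout is visible directly in the Postgres catalog; a couple of standard inspection queries (schema names per the notes above):

```sql
-- The virtual layer (views) and the physical layer (tables) live side by side.
SELECT nspname
FROM pg_namespace
WHERE nspname IN ('jf', 'sqlmesh__jf');

-- Physical tables carry the schema prefix plus a version hash,
-- e.g. jf__orders__1347386500.
SELECT tablename
FROM pg_tables
WHERE schemaname = 'sqlmesh__jf';
```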
- Recommended to [join the Slack community](https://tobikodata.com/slack); the team is very supportive when I ask questions
- **Adding audits and tests**:
- Audits are reusable - `@this_model` refers to the model the audit is attached to (see the sketch just below)
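This repo's own `audits/assert_positive_order_ids.sql` is (approximately) the one from the init skeleton:

```sql
-- A reusable audit: @this_model resolves to whichever model attaches it;
-- the audit fails if the query returns any rows.
AUDIT (
  name assert_positive_order_ids
);

SELECT *
FROM @this_model
WHERE item_id < 0
```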
- Let's get familiar with the `audit` command, e.g. `sqlmesh audit --model jf.orders`
- I need to add it to each model config
- It is NOT fully reusable if the column names vary
- NO! Actually, it allows parameterizing the audit with variables, e.g. `@column is null`, and then the model config uses `audits [assert_not_null(column=order_id)]` (see the parameterized sketch below)
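A hedged sketch of that parameterized form (the audit name and model-config usage follow the note above; the exact argument syntax may differ by version):

```sql
-- Parameterized audit: @column is supplied from each model's config,
-- e.g. audits [assert_not_null(column=order_id)].
AUDIT (
  name assert_not_null
);

SELECT *
FROM @this_model
WHERE @column IS NULL
```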
- Similar idea to the dbt singular test - write the SQL that returns the failing rows
- I can join with other models if needed
- Modifying an audit requires applying a plan first
- Quite easy to recreate the familiar dbt generic tests as audits: `not null`, `unique`, `accepted values`, `relationships`
- Actually, they have those [built-in audits here](https://sqlmesh.readthedocs.io/en/stable/concepts/audits/#built-in-audits)
- The log output of `sqlmesh audit` doesn't seem to mention the attached model:
```log
(.env) C:\Users\DAT\Documents\Sources\sqlmesh-demo>sqlmesh audit
Found 21 audit(s).
assert_positive_order_ids PASS.
assert_not_null PASS.
assert_unique PASS.
assert_not_null PASS.
assert_not_null PASS.
assert_not_null PASS.
assert_not_null PASS.
assert_not_null PASS.
assert_not_null PASS.
assert_not_null PASS.
assert_accepted_values PASS.
assert_unique PASS.
assert_relationships PASS.
assert_not_null PASS.
assert_unique PASS.
assert_not_null PASS.
assert_unique PASS.
assert_accepted_values PASS.
assert_not_null PASS.
assert_unique PASS.
assert_accepted_values PASS.
Finished with 0 audit errors and 0 audits skipped.
Done.
```
- Tests are really unit tests (run with the `duckdb` dialect)
- I need to add yml file(s) to the `(repo)/tests` directory
- Might take time to implement, because it requires manually faking the data: both inputs and outputs
- Let's get familiar with the `test` command, e.g. `sqlmesh test -k test_full`
- There is a risk that newer SQL syntax (from the modern DWHs) might not be supported in DuckDB
- **Additional stuff**:
- DRY with common functions
- Let's try [Python macros](https://sqlmesh.readthedocs.io/en/stable/concepts/macros/sqlmesh_macros/#python-macro-functions)
- So far it's perfect, until trying to pass list arguments -- it just hangs
- Oh! The docs are out of date; the team advised the correct syntax:
```python
@macro()
def make_indicators(evaluator, string_column, string_values):
    return [
        f"CASE WHEN {string_column} = {value} THEN {value} ELSE NULL END AS {string_column.name}_{value.name}"
        for value in string_values.expressions
    ]
```
- The syntax takes a while to get familiar with (same experience as when I started writing Jinja), but the readability is better
- Column Level Security (aka Masking Policy), for example in Snowflake
- Let's try [Pre/Post Statements](https://sqlmesh.readthedocs.io/en/stable/concepts/models/seed_models/#pre-and-post-statements) or [Statements](https://sqlmesh.readthedocs.io/en/stable/concepts/models/overview/#statements); it's better to get an understanding of the [Model Concept](https://sqlmesh.readthedocs.io/en/stable/concepts/models/sql_models/#model-ddl) first
- Voila! I successfully managed it with a sqlmesh macro + pre/post statements
- Here is a sample:
- `(repo)/macros/snow-mask-ddl/schema_name.masking_func_name.sql` -- multiple masking funcs created in the `snow-mask-ddl` dir:
```sql
CREATE MASKING POLICY IF NOT EXISTS @schema.mp_customer_name AS (
    masked_column string
) RETURNS string ->
    CASE
        WHEN CURRENT_ROLE() IN ('ANALYST') THEN masked_column
        WHEN CURRENT_ROLE() IN ('SYSADMIN') THEN SHA2(masked_column)
        ELSE '**********'
    END;
```
- 2 main macros to create & apply the func:
```python
import os

from sqlmesh import macro


@macro()
def apply_masking_policy(evaluator, column: str, func: str, params: str):
    # Fake the "apply" statement for this POC; on real Snowflake this would be
    # an ALTER TABLE ... MODIFY COLUMN ... SET MASKING POLICY ... statement.
    param_list = str(params).split("|")
    return """
        INSERT INTO {table}(id) VALUES ('{value}') --faking sql here
    """.format(
        table=str(func), value=f"{column}-applied-{func}-with-{','.join(param_list)}"
    )


@macro()
def create_masking_policy(evaluator, func: str):
    # Read the CREATE MASKING POLICY DDL template and fill in the schema name.
    ddl_dir = os.path.dirname(os.path.realpath(__file__))
    ddl_file = f"{ddl_dir}/snow-mask-ddl/{func}.sql"
    func_parts = str(func).split(".")
    assert len(func_parts) == 2
    schema = func_parts[0]
    with open(ddl_file, "r") as file:
        content = file.read()
    return content.replace("@schema", schema)
```
- And use it in the model:
```sql
MODEL (
  name jf.customers,
  ...
);

@create_masking_policy(common.mp_customer_name);

/* model sql code here */;

@apply_masking_policy(last_name, common.mp_customer_name, first_name | last_name)
```
- Metadata analysis
- State is stored in the gateway database, under a schema named `sqlmesh` by default
- Snapshots are stored in the `_snapshots` table, one row per model & version
- Seeds are stored in the `_seeds` table, which packs all the seed data into a single column -- there will definitely be some size limitations
- Model schedules are stored in the `_intervals` table (see the inspection sketch after this list)
- Run time per model or per run -- cannot find the info
- SQLMesh macros living in another Python package -- seems not supported
- Column Level Lineage: yes, it is available in the `docs` site by using the `sqlmesh ui` command
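Some hedged inspection queries against that state schema (table names per the notes above; the exact columns are version-dependent, hence `SELECT *`):

```sql
-- One row per model & version.
SELECT * FROM sqlmesh._snapshots LIMIT 5;

-- Which date intervals have been processed / are scheduled per snapshot.
SELECT * FROM sqlmesh._intervals LIMIT 5;

-- Seed data packed into a single column.
SELECT * FROM sqlmesh._seeds LIMIT 5;
```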
## 3. Setup CI/CD
The CI/CD bot is available for GitHub PRs; let's try [getting started](https://sqlmesh.readthedocs.io/en/stable/integrations/github/#getting-started)
Currently only GitHub Actions is supported
- Only 1 job for the CI/CD bot is defined in the workflow; when triggered, the bot runs 4 additional jobs:
- SQLMesh Unit Test
- Behind the scenes, it is `sqlmesh test`
- SQLMesh Has Required Approval
- If your pipeline has approval configs, it will require an approval before doing gapless data deployments to production
- SQLMesh PR Environment Synced
- Behind the scenes, it creates/updates the PR environment `sqlmesh_demo_{pull_request_number}`
- Not sure what command is behind it; guessing `sqlmesh plan sqlmesh_demo_{pull_request_number} --auto-apply`
- SQLMesh Prod Environment Synced
- Deploys to production; recommended to [add an approval setting](https://sqlmesh.readthedocs.io/en/stable/integrations/github/#enforce-that-certain-reviewers-have-approved-of-the-pr-before-it-can-be-merged)
- Requires 2 bits:
- Approval (optional)
- SQLMesh PR Environment Synced done successfully
Check out a sample [PR pipeline](https://github.com/datnguye/sqlmesh-demo/actions/runs/6071256412) here
**Happy Engineering!**