{"id":13904562,"url":"https://github.com/carbonfact/lea","last_synced_at":"2025-07-18T02:31:25.113Z","repository":{"id":200564864,"uuid":"702144171","full_name":"carbonfact/lea","owner":"carbonfact","description":"🏃‍♀️ Minimalist alternative to dbt","archived":false,"fork":false,"pushed_at":"2024-10-22T03:56:47.000Z","size":1048,"stargazers_count":209,"open_issues_count":18,"forks_count":6,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-10-23T05:57:04.465Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/carbonfact.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-08T16:15:46.000Z","updated_at":"2024-10-22T03:56:50.000Z","dependencies_parsed_at":"2023-11-14T20:23:50.152Z","dependency_job_id":"8ff7e8a9-b228-4cfb-a5da-af3b81f1d790","html_url":"https://github.com/carbonfact/lea","commit_stats":null,"previous_names":["carbonfact/lea"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/carbonfact%2Flea","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/carbonfact%2Flea/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/carbonfact%2Flea/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/carbonfact%2Flea/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/carbonfact","download_url":"https://codeload.github.com/carbonfac
t/lea/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":226329255,"owners_count":17607777,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-06T23:00:57.685Z","updated_at":"2025-07-18T02:31:25.102Z","avatar_url":"https://github.com/carbonfact.png","language":"Python","funding_links":[],"categories":["Transformation","Python","\u003ca name=\"Python\"\u003e\u003c/a\u003ePython"],"sub_categories":[],"readme":"\u003ch1\u003elea\u003c/h1\u003e\n\n\u003cimg src=\"https://github.com/carbonfact/lea/assets/8095957/df2bcf1e-fcc9-4111-9897-ec29427aeeaa\" width=\"33%\" align=\"right\" /\u003e\n\n\u003cp\u003e\n\u003c!-- Tests --\u003e\n\u003ca href=\"https://github.com/carbonfact/lea/actions/workflows/unit-tests.yml\"\u003e\n    \u003cimg src=\"https://github.com/carbonfact/lea/actions/workflows/unit-tests.yml/badge.svg\" alt=\"tests\"\u003e\n\u003c/a\u003e\n\n\u003c!-- Code quality --\u003e\n\u003ca href=\"https://github.com/carbonfact/lea/actions/workflows/code-quality.yml\"\u003e\n    \u003cimg src=\"https://github.com/carbonfact/lea/actions/workflows/code-quality.yml/badge.svg\" alt=\"code_quality\"\u003e\n\u003c/a\u003e\n\n\u003c!-- PyPI --\u003e\n\u003ca href=\"https://pypi.org/project/lea-cli\"\u003e\n    \u003cimg src=\"https://img.shields.io/pypi/v/lea-cli.svg?label=release\u0026color=blue\" alt=\"pypi\"\u003e\n\u003c/a\u003e\n\n\u003c!-- License --\u003e\n\u003ca href=\"https://opensource.org/license/apache-2-0/\"\u003e\n    \u003cimg src=\"https://img.shields.io/github/license/carbonfact/lea\" 
alt=\"license\"\u003e\n\u003c/a\u003e\n\u003c/p\u003e\n\nlea is a minimalist alternative to SQL orchestrators like [dbt](https://www.getdbt.com/) and [SQLMesh](https://sqlmesh.com/).\n\nlea aims to be simple and provides sane defaults. We happily use it every day at [Carbonfact](https://www.carbonfact.com/) to manage our BigQuery data warehouse. We will actively maintain it and add features, while welcoming contributions.\n\n- [Examples](#examples)\n- [Installation](#installation)\n- [Configuration](#configuration)\n  - [BigQuery](#bigquery)\n  - [DuckDB](#duckdb)\n  - [MotherDuck](#motherduck)\n  - [DuckLake](#ducklake)\n- [Usage](#usage)\n  - [`lea run`](#lea-run)\n  - [File structure](#file-structure)\n    - [Jinja templating](#jinja-templating)\n  - [Development vs. production](#development-vs-production)\n  - [Selecting scripts](#selecting-scripts)\n  - [Write-Audit-Publish (WAP)](#write-audit-publish-wap)\n  - [Testing while running](#testing-while-running)\n  - [Skipping unmodified scripts during development](#skipping-unmodified-scripts-during-development)\n- [Warehouse specific features](#warehouse-specific-features)\n  - [BigQuery](#bigquery-1)\n    - [Default clustering](#default-clustering)\n    - [Big Blue Pick API](#big-blue-pick-api)\n- [Contributing](#contributing)\n- [License](#license)\n\n## Examples\n\n- [Jaffle shop 🥪](examples/jaffle_shop/)\n- [Incremental 🕐](examples/incremental)\n- [School 🏫](examples/school/)\n- [Compare development to production 👯‍♀️](examples/diff/)\n- [Using MotherDuck 🦆](examples/motherduck/)\n\n## Installation\n\nUse one of the following commands, depending on which warehouse you wish to use:\n\n```sh\npip install lea-cli\n```\n\nThis installs the `lea` command. 
It also makes the `lea` Python library available.\n\n## Configuration\n\nlea is configured via environment variables.\n\n### BigQuery\n\n```sh\n# Required\nLEA_WAREHOUSE=bigquery\n# Required\nLEA_BQ_LOCATION=EU\n# Required\nLEA_BQ_DATASET_NAME=kaya\n# Required, the project where the dataset is located\nLEA_BQ_PROJECT_ID=carbonfact-dwh\n# Optional, allows using a different project for compute\nLEA_BQ_COMPUTE_PROJECT_ID=carbonfact-dwh-compute\n# Not necessary if you're logged in with the gcloud CLI\nLEA_BQ_SERVICE_ACCOUNT=\u003cJSON dump of the service account file\u003e  # not a path ⚠️\n# Defaults to https://www.googleapis.com/auth/bigquery\nLEA_BQ_SCOPES=https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/drive\n# LOGICAL or PHYSICAL, defaults to PHYSICAL\nLEA_BQ_STORAGE_BILLING_MODEL=PHYSICAL\n```\n\n### DuckDB\n\n```sh\n# Required\nLEA_WAREHOUSE=duckdb\n# Required\nLEA_DUCKDB_PATH=duckdb.db\n# Optional\nLEA_DUCKDB_EXTENSIONS=parquet,httpfs\n```\n\n### MotherDuck\n\n```sh\n# Required\nLEA_WAREHOUSE=motherduck\n# Required\nMOTHERDUCK_TOKEN=\u003cget this from https://app.motherduck.com/settings/tokens\u003e\n# Required\nLEA_MOTHERDUCK_DATABASE=bike_sharing\n# Optional\nLEA_DUCKDB_EXTENSIONS=parquet,httpfs\n```\n\n### DuckLake\n\n```sh\n# Required\nLEA_WAREHOUSE=ducklake\n# Required\nLEA_DUCKLAKE_DATA_PATH=gcs://bike-sharing-analytics\n# Required\nLEA_DUCKLAKE_CATALOG_DATABASE=metadata.ducklake\n# Optional\nLEA_DUCKLAKE_S3_ENDPOINT=storage.googleapis.com\n# Optional\nLEA_DUCKDB_EXTENSIONS=parquet,httpfs\n```\n\nDuckLake needs a database to [manage metadata](https://ducklake.select/docs/stable/duckdb/usage/choosing_a_catalog_database), which is what `LEA_DUCKLAKE_CATALOG_DATABASE` is for.\n\n## Usage\n\nThese parameters can be provided in an `.env` file, or directly in the shell. Each command also has an `--env` flag to provide a path to an `.env` file.\n\n### `lea run`\n\nThis is the main command. 
It runs SQL queries stored in the `scripts` directory:\n\n```sh\nlea run\n```\n\nYou can indicate the directory where the scripts are stored:\n\n```sh\nlea run --scripts /path/to/scripts\n```\n\nThe scripts are run concurrently. They are organized in a DAG, which is traversed in a topological order. The DAG's structure is determined [automatically](https://maxhalford.github.io/blog/dbt-ref-rant/) by analyzing the dependencies between queries.\n\n### File structure\n\nEach query is expected to be placed under a schema, represented by a directory. Schemas can have sub-schemas. Here's an example:\n\n```\nscripts/\n    schema_1/\n        table_1.sql\n        table_2.sql\n    schema_2/\n        table_3.sql\n        table_4.sql\n        sub_schema_2_1/\n            table_5.sql\n            table_6.sql\n```\n\nEach script is materialized into a table. The table is named according to the script's name, following the warehouse convention.\n\n#### Jinja templating\n\nSQL queries can be templated with [Jinja](https://jinja.palletsprojects.com/en/3.1.x/). A `.sql.jinja` extension is necessary for lea to recognise them.\n\nYou have access to an `env` variable within the template context, which is simply an access point to `os.environ`.\n\n### Development vs. production\n\nBy default, lea creates an isolation layer with production. The way this is done depends on your warehouse:\n\n- BigQuery: by appending a `_\u003cuser\u003e` suffix to schema names\n- DuckDB: by adding a `_\u003cuser\u003e` suffix to the database file name\n\nIn other words, a development environment is used by default. Use the `--production` flag when executing `lea run` to disable this behaviour, and instead target the production environment.\n\n```sh\nlea run --production\n```\n\nThe `\u003cuser\u003e` is determined automatically from the [login name](https://docs.python.org/3/library/getpass.html#getpass.getuser). 
It can be overridden by setting the `LEA_USERNAME` environment variable.\n\n### Selecting scripts\n\nA single script can be run:\n\n```sh\nlea run --select core.users\n```\n\nSeveral scripts can be run:\n\n```sh\nlea run --select core.users --select core.orders\n```\n\nSimilar to dbt, lea also supports graph operators:\n\n```sh\nlea run --select core.users+   # users and everything that depends on it\nlea run --select +core.users   # users and everything it depends on\nlea run --select +core.users+  # users, everything it depends on, and everything that depends on it\n```\n\nYou can select all scripts in a schema:\n\n```sh\nlea run --select core/  # the trailing slash matters\n```\n\nThis also works with sub-schemas:\n\n```sh\nlea run --select analytics.finance/\n```\n\nThere are thus 8 possible operators:\n\n```\nschema.table    (table by itself)\nschema.table+   (table with its descendants)\n+schema.table   (table with its ancestors)\n+schema.table+  (table with its ancestors and descendants)\nschema/         (all tables in schema)\nschema/+        (all tables in schema with their descendants)\n+schema/        (all tables in schema with their ancestors)\n+schema/+       (all tables in schema with their ancestors and descendants)\n```\n\nCombinations are possible:\n\n```sh\nlea run --select core.users+ --select +core.orders\n```\n\nThere's an Easter egg that allows choosing scripts that have been committed or modified in the current Git branch:\n\n```sh\nlea run --select git\nlea run --select git+  # includes all descendants\n```\n\nThis becomes very handy when using lea in continuous integration.\n\n### Write-Audit-Publish (WAP)\n\n[WAP](https://lakefs.io/blog/data-engineering-patterns-write-audit-publish/) is a data engineering pattern that ensures data consistency and reliability. It's the data engineering equivalent of [blue-green deployment](https://en.wikipedia.org/wiki/Blue%E2%80%93green_deployment) in the software engineering world.\n\nlea follows the WAP pattern by default. 
When you execute `lea run`, it actually creates temporary tables that have an `___audit` suffix. The latter tables are promoted to replace the existing tables, once they have all been materialized without errors.\n\nThis is a good default behavior. Let's say you refresh table `foo`. Then you refresh table `bar` that depends on `foo`. If the refresh of `bar` fails, you're left with a corrupt state. This is what the WAP pattern solves. In WAP mode, when you run `foo`'s script, it creates a `foo___audit` table. If `bar`'s script fails, then the run stops and `foo` is not modified.\n\n### Testing while running\n\nThere is no `lea test` command. Tests are run together with the regular script when `lea run` is executed. The run stops whenever a test fails.\n\nThere are two types of tests:\n\n- Singular tests — these are queries which return failing rows. They are stored in a `tests` directory.\n- Assertion tests — these are comment annotations in the queries themselves:\n  - `#NO_NULLS` — checks that all values in a column are not null.\n  - `#UNIQUE` — checks that a column's values are unique.\n  - `#UNIQUE_BY(\u003cby\u003e)` — checks that a column's values are unique within a group.\n  - `#SET{\u003celements\u003e}` — checks that a column's values are in a set of values.\n\nHere's an example of a query annotated with assertion tests:\n\n```sql\nSELECT\n    -- #UNIQUE\n    -- #NO_NULLS\n    user_id,\n    -- #NO_NULLS\n    address,\n    -- #UNIQUE_BY(address)\n    full_name,\n    -- #SET{'A', 'B', 'AB', 'O'}\n    blood_type\nFROM core.users\n```\n\nYou can run a single test via the `--select` flag:\n\n```sh\nlea run --select tests.check_n_users\n```\n\nOr even run all the tests, as so:\n\n```sh\nlea run --select tests/  # the trailing slash matters\n```\n\n☝️ When you run a script that is not a test, all the applicable tests are run as well. 
For instance, the following command will run the `core.users` script and all the tests that are applicable to it:\n\n```sh\nlea run --select core.users\n```\n\nYou may decide to run all scripts without executing tests, which is obviously not advisable:\n\n```sh\nlea run --unselect tests/\nlea run --select core.users --unselect tests/\n```\n\n### Skipping unmodified scripts during development\n\nWhen you call `lea run`, it generates audit tables, which are then promoted to replace the original tables. This is done to ensure that the data is consistent and reliable. lea skips a script when its audit table already exists and the script hasn't been modified since that audit table was created. This is to avoid unnecessary re-runs of scripts that haven't changed.\n\nFor instance:\n\n1. You execute `lea run` to sync all tables from sources. There are no errors and all tables are materialized.\n2. You modify a script named `core/expenses.sql` depending on `staging/customers.sql` and `staging/orders.sql`\n3. You execute `lea run core.expenses+` to rerun all impacted tables\n4. `core__expenses___audit` is materialized in your data warehouse but the `-- #NO_NULLS` assertion test on a column fails\n5. After reviewing data in `core__expenses___audit`, you edit and fix `core/expenses.sql` to filter out results where NULLs are appearing\n6. You execute `lea run`\n7. The `staging/customers.sql` and `staging/orders.sql` scripts are skipped because they were modified before `staging__customers` and `staging__orders` were last materialized\n8. The `core/expenses.sql` script is run because it was modified after `core__expenses` was last materialized\n9. All audit tables are wiped from the database as the whole DAG has run successfully! 🎉\n\nYou can disable this behavior altogether:\n\n```sh\nlea run --restart\n```\n\n## Warehouse specific features\n\n### BigQuery\n\n#### Default clustering\n\nAt Carbonfact, we cluster most of our tables by customer. 
This is done to optimize query performance and reduce costs. lea allows you to automatically cluster tables that contain a given field:\n\n```sh\nLEA_BQ_DEFAULT_CLUSTERING_FIELDS=account_slug\n```\n\nYou can also specify multiple fields, meaning that tables which contain both fields will be clustered:\n\n```sh\nLEA_BQ_DEFAULT_CLUSTERING_FIELDS=account_slug,brand_slug\n```\n\nFor each table, lea will use the clustering fields it can and ignore the others. With the previous configuration, if your table defines `account_slug` and not `brand_slug`, it will cluster by `account_slug`.\n\n#### Big Blue Pick API\n\n[Big Blue](https://biq.blue/) is a SaaS product to monitor and optimize BigQuery costs. As part of their offering, they provide a [Pick API](https://biq.blue/blog/compute/how-to-implement-bigquery-autoscaling-reservation-in-10-minutes). The idea is that some queries should be run on-demand, while others should be run on a reservation. Big Blue's Pick API suggests which billing model to use for each query.\n\nWe use this at Carbonfact, and so this API is available out of the box in lea. 
You can enable it by setting the following environment variables:\n\n```sh\nLEA_BQ_BIG_BLUE_PICK_API_KEY=\u003cget it from https://your-company.biq.blue/settings.html\u003e\nLEA_BQ_BIG_BLUE_PICK_API_URL=https://pick.biq.blue\nLEA_BQ_BIG_BLUE_PICK_API_ON_DEMAND_PROJECT_ID=on-demand-compute-project-id\nLEA_BQ_BIG_BLUE_PICK_API_REVERVATION_PROJECT_ID=reservation-compute-project-id\n```\n\n## Contributing\n\nFeel free to reach out to [max@carbonfact.com](mailto:max@carbonfact.com) if you want to know more and/or contribute 😊\n\nWe have suggested [some issues](https://github.com/carbonfact/lea/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc+label%3A%22good+first+issue%22) as good places to get started.\n\n## License\n\nlea is free and open-source software licensed under the Apache License, Version 2.0.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcarbonfact%2Flea","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcarbonfact%2Flea","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcarbonfact%2Flea/lists"}