{"id":19756383,"url":"https://github.com/mara/mara-schema","last_synced_at":"2025-04-30T11:33:28.802Z","repository":{"id":37084059,"uuid":"266987849","full_name":"mara/mara-schema","owner":"mara","description":"Mapping of DWH database tables to business entities, attributes \u0026 metrics in Python, with automatic creation of flattened tables","archived":false,"fork":false,"pushed_at":"2023-11-21T15:24:33.000Z","size":4053,"stargazers_count":73,"open_issues_count":8,"forks_count":4,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-04-21T12:22:08.905Z","etag":null,"topics":["data-governance","data-modeling","datawarehousing","metadata","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mara.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-05-26T08:32:24.000Z","updated_at":"2025-03-01T17:58:23.000Z","dependencies_parsed_at":"2023-01-22T14:00:56.773Z","dependency_job_id":null,"html_url":"https://github.com/mara/mara-schema","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mara%2Fmara-schema","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mara%2Fmara-schema/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mara%2Fmara-schema/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mara%2Fmara-schema/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mara","download_url":"https://codeload.github.com
/mara/mara-schema/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251691621,"owners_count":21628358,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-governance","data-modeling","datawarehousing","metadata","python"],"created_at":"2024-11-12T03:15:45.652Z","updated_at":"2025-04-30T11:33:25.349Z","avatar_url":"https://github.com/mara.png","language":"Python","readme":"# Mara Schema\n\n[![Build Status](https://github.com/mara/mara-schema/actions/workflows/build.yaml/badge.svg)](https://github.com/mara/mara-schema/actions/workflows/build.yaml)\n[![PyPI - License](https://img.shields.io/pypi/l/mara-schema.svg)](https://github.com/mara/mara-schema/blob/main/LICENSE)\n[![PyPI version](https://badge.fury.io/py/mara-schema.svg)](https://badge.fury.io/py/mara-schema)\n[![Slack Status](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack\u0026style=social)](https://communityinviter.com/apps/mara-users/public-invite)\n\nPython based mapping of physical data warehouse tables to logical business entities (a.k.a. \"cubes\", \"models\", \"data sets\", etc.). It comes with\n- sql query generation for flattening normalized database tables into wide tables for various analytics front-ends\n- a flask based visualization of the schema that can serve as a documentation of the business definitions of a data warehouse (a.k.a \"data dictionary\" or \"data guide\")\n- the possibility to sync schemas to reporting front-ends that have meta-data APIs (e.g. 
Metabase, Looker, Tableau)\n\n\u0026nbsp;\n\n![Mara Schema overview](https://github.com/mara/mara-schema/raw/main/docs/_static/mara-schema.png)\n\n\u0026nbsp;\n\nHave a look at a real-world application of Mara Schema in the [Mara Example Project 1](https://github.com/mara/mara-example-project-1).\n\n\u0026nbsp;\n\n**Why** should I use Mara Schema?\n\n1. **Definition of analytical business entities as code**: There are many solutions for documenting the company-wide definitions of attributes \u0026 metrics for the users of a data warehouse. These can range from simple spreadsheets or wikis to metadata management tools inside reporting front-ends. However, these definitions can quickly get out of sync when columns are added or changed in the underlying data warehouse. Mara Schema allows definition changes to be deployed together with changes in the underlying ETL processes, so that all definitions stay in sync with the underlying data warehouse schema.\n\n\n2. **Automatic generation of aggregates / artifacts**: When a company wants to enforce a *single source of truth* in their data warehouse, then a heavily normalized Kimball-style [snowflake schema](https://en.wikipedia.org/wiki/Snowflake_schema) is still the weapon of choice. It enforces an agreed-upon unified modelling of business entities across domains and ensures referential consistency. However, snowflake schemas are not ideal for analytics or data science because they require a lot of joins. Most analytical databases and reporting tools nowadays work better with pre-flattened wide tables. 
Creating such flattened tables is an error-prone and dull activity, but with Mara Schema one can automate most of the work in creating flattened data set tables in the ETL.\n\n\u0026nbsp;\n\n## Installation\n\nTo use the library directly, use pip:\n\n```\npip install mara-schema\n```\n\nor\n\n```\npip install git+https://github.com/mara/mara-schema.git\n```\n\n\u0026nbsp;\n\n## Defining entities, attributes, metrics \u0026 data sets\n\nLet's consider the following toy example of a dimensional schema in the data warehouse of a hypothetical e-commerce company:\n\n![Example dimensional star schema](https://github.com/mara/mara-schema/raw/main/docs/_static/example-dimensional-database-schema.svg)\n\nEach box is a database table with its columns, and the lines between tables show the foreign key constraints. That's a classic Kimball style [snowflake schema](https://en.wikipedia.org/wiki/Snowflake_schema) and it requires a proper modelling / ETL layer in your data warehouse. A script that creates these example tables in PostgreSQL can be found in [example/dimensional-schema.sql](https://github.com/mara/mara-schema/blob/main/mara_schema/example/dimensional-schema.sql).\n\nIt's a prototypical data warehouse schema for B2C e-commerce: There are orders composed of individual product purchases (order items) made by customers. There are circular references: Orders have a customer, and customers have a first order. Order items have a product (and thus a product category) and customers have a favourite product category.\n\nThe respective entity and data set definitions for this database schema can be found in the [mara_schema/example](https://github.com/mara/mara-schema/tree/main/mara_schema/example) directory.\n\n\u0026nbsp;\n\nIn Mara Schema, each business relevant table in the dimensional schema is mapped to an [Entity](https://github.com/mara/mara-schema/blob/main/mara_schema/entity.py). In dimensional modelling terms, entities can be both fact tables and dimensions. 
For example, a customer entity can be a dimension of an order items data set (a.k.a. \"cube\", \"model\", \"data mart\") and a customer data set of its own.\n\nHere's a [shortened](https://github.com/mara/mara-schema/blob/main/mara_schema/example/entities/order_item.py) definition of the \"Order item\" entity based on the `dim.order_item` table:\n\n```python\nfrom mara_schema.entity import Entity\n\norder_item_entity = Entity(\n    name='Order item',\n    description='Individual products sold as part of an order',\n    schema_name='dim')\n```\n\nIt assumes that there is an `order_item` table in the `dim` schema of the data warehouse, with `order_item_id` as the primary key. The optional `table_name` and `pk_column_name` parameters can be used when another naming scheme for tables and primary keys is used.\n\n\u0026nbsp;\n\n[Attributes](https://github.com/mara/mara-schema/blob/main/mara_schema/attribute.py) represent facts about an entity. They correspond to the non-numerical columns in a fact or dimension table:\n\n```python\nfrom mara_schema.attribute import Type\n\norder_item_entity.add_attribute(\n    name='Order item ID',\n    description='The ID of the order item in the backend',\n    column_name='order_item_id',\n    type=Type.ID,\n    high_cardinality=True)\n```\n\nThey come with a descriptive name (as shown in reporting front-ends), a description and a `column_name` in the underlying database table.\n\nThere are several parameters for controlling the generation of artifact tables and the visibility in front-ends:\n- Setting `personal_data` to `True` means that the attribute contains personally identifiable information and thus should be hidden from most users.\n- When `high_cardinality` is `True`, then the attribute is hidden in front-ends that cannot deal well with dimensions with a lot of values.\n- The `type` parameter controls how some fields are treated in artifact creation. 
See [mara_schema/attribute.py#L7](https://github.com/mara/mara-schema/blob/main/mara_schema/attribute.py#L7).\n- Setting `important_field` to `True` marks the attribute as a key field of the data set; it is then shown by default in overviews.\n- When `accessible_via_entity_link` is `False`, then the attribute will be hidden in data sets that use the entity as a dimension.\n\n\u0026nbsp;\n\nThe attributes of the dimensions of an entity are recursively linked with the `link_entity` method:\n\n```python\nfrom .order import order_entity\nfrom .product import product_entity\n\norder_item_entity.link_entity(target_entity=order_entity, prefix='')\norder_item_entity.link_entity(target_entity=product_entity)\n```\n\nThis pulls in attributes of other entities that are connected to an entity table via foreign key columns. When the other entity is called \"Foo bar\", then it's assumed that there is a `foo_bar_fk` in the entity table (can be overridden with the `fk_column` parameter). The optional `prefix` controls how linked attributes are named (e.g. \"First order date\" vs \"Order date\") and also helps to disambiguate when there are multiple links from one entity to another.\n\n\u0026nbsp;\n\nOnce all entities and their relationships are established, [Data Sets](https://github.com/mara/mara-schema/blob/main/mara_schema/data_set.py) (a.k.a \"cubes\", \"models\" or \"data marts\") add metrics and attributes from linked entities to an entity:\n\n```python\nfrom mara_schema.data_set import DataSet\n\nfrom ..entities.order_item import order_item_entity\n\norder_items_data_set = DataSet(entity=order_item_entity, name='Order items')\n```\n\n\u0026nbsp;\n\nThere are two kinds of [Metrics](https://github.com/mara/mara-schema/blob/main/mara_schema/metric.py) (a.k.a \"Measures\") in Mara Schema: simple metrics and composed metrics. 
Simple metrics are computed as direct aggregations on an entity table column:\n\n```python\nfrom mara_schema.data_set import Aggregation\n\norder_items_data_set.add_simple_metric(\n    name='# Orders',\n    description='The number of valid orders (orders with an invoice)',\n    column_name='order_fk',\n    aggregation=Aggregation.DISTINCT_COUNT,\n    important_field=True)\n\norder_items_data_set.add_simple_metric(\n    name='Product revenue',\n    description='The price of the ordered products as shown in the cart',\n    aggregation=Aggregation.SUM,\n    column_name='product_revenue',\n    important_field=True)\n```\n\nIn this example the metric \"# Orders\" is defined as the distinct count on the `order_fk` column, and \"Product revenue\" as the sum of the `product_revenue` column.\n\nComposed metrics are built from other metrics (both simple and composed)  like this:\n\n```python\norder_items_data_set.add_composed_metric(\n    name='Revenue',\n    description='The total cart value of the order',\n    formula='[Product revenue] + [Shipping revenue]',\n    important_field=True)\n\norder_items_data_set.add_composed_metric(\n    name='AOV',\n    description='The average revenue per order. Attention: not meaningful when split by product',\n    formula='[Revenue] / [# Orders]',\n    important_field=True)\n```\n\nThe `formula` parameter takes simple algebraic expressions (`+`, `-`, `*`, `/` and parentheses) with the names of the parent metrics in rectangular brackets, e.g. `([a] + [b]) / [c]`.\n\n\u0026nbsp;\n\nWith complex snowflake schemas the graph of linked entities can become rather big. To avoid cluttering data sets with unnecessary attributes, Mara Schema has a way for excluding entire entity links:\n\n```python\ncustomers_data_set.exclude_path(['Order', 'Customer'])\n```\n\nThis means that the customer of the first order of a customer will not be part of the customers data set. 
Similarly, it is possible to limit the list of attributes from a linked entity:\n\n```python\norder_items_data_set.include_attributes(['Order', 'Customer', 'Order'], ['Order date'])\n```\n\nHere only the order date of the first order of the customer of the order will be included in the data set.\n\n\u0026nbsp;\n\n## Visualization\n\nMara Schema comes with an optional Flask-based visualization that documents the metrics and attributes of all data sets:\n\n![Mara schema data set visualization](https://github.com/mara/mara-schema/raw/main/docs/_static/mara-schema-data-set-visualization.png)\n\nWhen made available to business users, this can serve as the \"data dictionary\", \"data guide\" or \"data catalog\" of a company.\n\n\u0026nbsp;\n\n## Artifact generation\n\nThe function `data_set_sql_query` in [mara_schema/sql_generation.py](https://github.com/mara/mara-schema/blob/main/mara_schema/sql_generation.py) can be used to flatten the entities of a data set into a wide data set table:\n\n```python\nfrom mara_schema.sql_generation import data_set_sql_query\n\ndata_set_sql_query(data_set=order_items_data_set, human_readable_columns=True, pre_computed_metrics=False,\n                   star_schema=False, personal_data=False, high_cardinality_attributes=True)\n```\n\nThe resulting SELECT statement can be used for creating a data set table that is tailored for use in Metabase:\n\n```sql\nSELECT\n     order_item.order_item_id AS \"Order item ID\",\n\n    \"order\".order_id AS \"Order ID\",\n    \"order\".order_date AS \"Order date\",\n\n    order_customer.customer_id AS \"Customer ID\",\n\n    order_customer_favourite_product_category.main_category AS \"Customer favourite product category level 1\",\n    order_customer_favourite_product_category.sub_category_1 AS \"Customer favourite product category level 2\",\n\n    order_customer_first_order.order_date AS \"Customer first order date\",\n\n    product.sku AS \"Product SKU\",\n\n    product_product_category.main_category AS \"Product category level 1\",\n    
product_product_category.sub_category_1 AS \"Product category level 2\",\n\n    order_item.order_item_id AS \"# Order items\",\n    order_item.order_fk AS \"# Orders\",\n    order_item.product_revenue AS \"Product revenue\",\n    order_item.revenue AS \"Shipping revenue\"\n\nFROM dim.order_item order_item\nLEFT JOIN dim.\"order\" \"order\" ON order_item.order_fk = \"order\".order_id\nLEFT JOIN dim.customer order_customer ON \"order\".customer_fk = order_customer.customer_id\nLEFT JOIN dim.product_category order_customer_favourite_product_category ON order_customer.favourite_product_category_fk = order_customer_favourite_product_category.product_category_id\nLEFT JOIN dim.\"order\" order_customer_first_order ON order_customer.first_order_fk = order_customer_first_order.order_id\nLEFT JOIN dim.product product ON order_item.product_fk = product.product_id\nLEFT JOIN dim.product_category product_product_category ON product.product_category_fk = product_product_category.product_category_id\n```\n\nPlease note that `data_set_sql_query` only returns SQL SELECT statements; executing them somewhere in the ETL of the data warehouse is up to you. 
[Here](https://github.com/mara/mara-example-project-1/tree/main/app/pipelines/generate_artifacts/metabase.py) is an example for creating data set tables for Metabase using [Mara Pipelines](https://github.com/mara/mara-pipelines).\n\n\u0026nbsp;\n\nThere are several parameters for controlling the output of the `data_set_sql_query` function:\n\n - `human_readable_columns`: Whether to use \"Customer name\" rather than \"customer_name\" as column name\n - `pre_computed_metrics`: Whether to pre-compute composed metrics, counts and distinct counts on row level\n - `star_schema`: Whether to add foreign keys to the tables of linked entities rather than including their attributes\n - `personal_data`: Whether to include attributes that are marked as personal data\n - `high_cardinality_attributes`: Whether to include attributes that are marked to have a high cardinality\n\n![Mara schema SQL generation](https://github.com/mara/mara-schema/raw/main/docs/_static/mara-schema-sql-generation.gif)\n\n\n## Schema sync to front-ends\n\nWhen reporting tools have a Metadata API (e.g. Metabase, Tableau) or can read schema definitions from text files (e.g. Looker, Mondrian), then it's easy to sync definitions with them. 
The [Mara Metabase](https://github.com/mara/mara-metabase) package contains a function for syncing Mara Schema definitions with Metabase and the [Mara Mondrian](https://github.com/mara/mara-mondrian) package contains a generator for a Mondrian schema.\n\nWe welcome contributions for creating Looker LookML files, for syncing definitions with Tableau, and for syncing with any other BI front-end.\n\nAlso, we see a potential for automatically creating data guides in other Wikis or documentation tools.\n\nFor an example of an integration into a flask application, have a look at the [Mara Example Project 1](https://github.com/mara/mara-example-project-1).\n\n\u0026nbsp;\n\n## Links\n\n* Documentation: https://mara-schema.readthedocs.io/\n* Changes: https://mara-schema.readthedocs.io/en/stable/changes.html\n* PyPI Releases: https://pypi.org/project/mara-schema/\n* Source Code: https://github.com/mara/mara-schema\n* Issue Tracker: https://github.com/mara/mara-schema/issues\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmara%2Fmara-schema","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmara%2Fmara-schema","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmara%2Fmara-schema/lists"}