Zillion: Make sense of it all
=============================

[![Generic badge](https://img.shields.io/badge/Status-Alpha-yellow.svg)](https://shields.io/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
![License: MIT](https://img.shields.io/badge/license-MIT-blue)
![Python 3.6+](https://img.shields.io/badge/python-3.6%2B-blue)
[![Downloads](https://static.pepy.tech/badge/zillion)](https://pepy.tech/project/zillion)

**Introduction**
----------------

`Zillion` is a data modeling
and analytics tool that allows combining and
analyzing data from multiple datasources through a simple API. It acts as a semantic layer
on top of your data, writes SQL so you don't have to, and easily bolts onto existing
database infrastructure via SQLAlchemy Core. The `Zillion` NLP extension has experimental
support for AI-powered natural language querying and warehouse configuration.

With `Zillion` you can:

* Define a warehouse that contains a variety of SQL and/or file-like
  datasources
* Define or reflect metrics, dimensions, and relationships in your data
* Run multi-datasource reports and combine the results in a DataFrame
* Flexibly aggregate your data with multi-level rollups and table pivots
* Customize or combine fields with formulas
* Apply technical transformations including rolling, cumulative, and rank
  statistics
* Apply automatic type conversions - e.g. get a "year" dimension for free
  from a "date" column
* Save and share report specifications
* Utilize ad hoc or public datasources, tables, and fields to enrich reports
* Query your warehouse with natural language (NLP extension)
* Leverage AI to bootstrap your warehouse configurations (NLP extension)

**Table of Contents**
---------------------

* [Installation](#installation)
* [Primer](#primer)
    * [Metrics and Dimensions](#metrics-and-dimensions)
    * [Warehouse Theory](#warehouse-theory)
    * [Query Layers](#query-layers)
    * [Warehouse Creation](#warehouse-creation)
    * [Executing Reports](#executing-reports)
    * [Natural Language Querying](#natural-language-querying)
    * [Zillion Configuration](#zillion-configuration)
* [Example - Sales Analytics](#example-sales-analytics)
    * [Warehouse Configuration](#example-warehouse-config)
    * [Reports](#example-reports)
* [Advanced Topics](#advanced-topics)
    * [Subreports](#subreports)
    * [FormulaMetrics](#formula-metrics)
    * [Divisor Metrics](#divisor-metrics)
    * [Aggregation Variants](#aggregation-variants)
    * [FormulaDimensions](#formula-dimensions)
    * [DataSource Formulas](#datasource-formulas)
    * [Type Conversions](#type-conversions)
    * [AdHocMetrics](#adhoc-metrics)
    * [AdHocDimensions](#adhoc-dimensions)
    * [AdHocDataTables](#adhoc-data-tables)
    * [Technicals](#technicals)
    * [Config Variables](#config-variables)
    * [DataSource Priority](#datasource-priority)
* [Supported DataSources](#supported-datasources)
* [Multiprocess Considerations](#multiprocess-considerations)
* [Demo UI / Web API](#demo-ui)
* [Docs](#documentation)
* [How to Contribute](#how-to-contribute)

<a name="installation"></a>

**Installation**
----------------

> **Warning**: This project is in an alpha state and is subject to change. Please test carefully for production usage and report any issues.

```shell
$ pip install zillion

# or

$ pip install zillion[nlp]
```

---

<a name="primer"></a>

**Primer**
----------

The following is meant to give a quick overview of some theory and
nomenclature used in data warehousing with `Zillion`, which will be useful
if you are newer to this area. You can also skip below for a usage [example](#example-sales-analytics) or warehouse/datasource creation [quickstart](#warehouse-creation) options.

In short: `Zillion` writes SQL for you and makes data accessible through a very simple API:

```python
result = warehouse.execute(
    metrics=["revenue", "leads"],
    dimensions=["date"],
    criteria=[
        ("date", ">", "2020-01-01"),
        ("partner", "=", "Partner A")
    ]
)
```

<a name="metrics-and-dimensions"></a>

### **Metrics and Dimensions**

In `Zillion` there are two main types of `Fields` that will be used in
your report requests:

1. `Dimensions`: attributes of data used for labelling, grouping, and filtering
2. `Metrics`: facts and measures that may be broken down along dimensions

A `Field` encapsulates the concept of a column in your data. For example, you
may have a `Field` called "revenue". That `Field` may occur across several
datasources or possibly in multiple tables within a single datasource. `Zillion`
understands that all of those columns represent the same concept, and it can try
to use any of them to satisfy reports requesting "revenue".

Likewise there are two main types of tables used to structure your warehouse:

1. `Dimension Tables`: reference/attribute tables containing only related
dimensions
2. `Metric Tables`: fact tables that may contain metrics and some related
dimensions/attributes

Dimension tables are often static or slowly growing in terms of row count and contain
attributes tied to a primary key. Some common examples would be lists of US Zip Codes or
company/partner directories.

Metric tables are generally more transactional in nature. Some common examples
would be records for web requests, ecommerce sales, or stock market price history.

<a name="warehouse-theory"></a>

### **Warehouse Theory**

If you really want to go deep on dimensional modeling and the drill-across
querying technique `Zillion` employs, I recommend reading Ralph Kimball's
[book](https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/data-warehouse-dw-toolkit/) on data warehousing.

To summarize, [drill-across
querying](https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/drilling-across/)
forms one or more queries to satisfy a report request for `metrics` that may
exist across multiple datasources and/or tables at a particular `dimension` grain.

`Zillion` supports flexible warehouse setups such as
[snowflake](https://en.wikipedia.org/wiki/Snowflake_schema) or
[star](https://en.wikipedia.org/wiki/Star_schema)
schemas, though it isn't
picky about it. You can specify table relationships through a parent-child
lineage, and `Zillion` can also infer acceptable joins based on the presence
of dimension table primary keys. `Zillion` does not support many-to-many relationships at this time, though most analytics-focused scenarios should be able to work around that by adding views to the model if needed.

<a name="query-layers"></a>

### **Query Layers**

`Zillion` reports can be thought of as running in two layers:

1. `DataSource Layer`: SQL queries against the warehouse's datasources
2. `Combined Layer`: a final SQL query against the combined data from the
DataSource Layer

The Combined Layer is just another SQL database (in-memory SQLite by default)
that is used to tie the datasource data together and apply a few additional
features such as rollups, row filters, row limits, sorting, pivots, and technical computations.

<a name="warehouse-creation"></a>

### **Warehouse Creation**

There are multiple ways to quickly initialize a warehouse from a local or remote file:

```python
# Path/link to a CSV, XLSX, XLS, JSON, HTML, or Google Sheet
# This builds a single-table Warehouse for quick/ad-hoc analysis.
url = "https://raw.githubusercontent.com/totalhack/zillion/master/tests/dma_zip.xlsx"
wh = Warehouse.from_data_file(url, ["Zip_Code"]) # Second arg is primary key

# Path/link to a sqlite database
# This can build a single or multi-table Warehouse
url = "https://github.com/totalhack/zillion/blob/master/tests/testdb1?raw=true"
wh = Warehouse.from_db_file(url)

# Path/link to a WarehouseConfigSchema (or pass a dict)
# This is the recommended production approach!
config = "https://raw.githubusercontent.com/totalhack/zillion/master/examples/example_wh_config.json"
wh = Warehouse(config=config)
```

Zillion also provides a helper script to bootstrap a DataSource configuration file for an
existing database. See `zillion.scripts.bootstrap_datasource_config.py`. The bootstrap script requires a connection/database url and output file as arguments. See the `--help` output for more options, including the optional `--nlp` flag that leverages OpenAI to infer configuration information such as column types, table types, and table relationships. The NLP feature requires the NLP extension to be installed as well as the following set in your `Zillion` config file:

* OPENAI_MODEL
* OPENAI_API_KEY

<a name="executing-reports"></a>

### **Executing Reports**

The main purpose of `Zillion` is to execute reports against a `Warehouse`.
At a high level you will be crafting reports as follows:

```python
result = warehouse.execute(
    metrics=["revenue", "leads"],
    dimensions=["date"],
    criteria=[
        ("date", ">", "2020-01-01"),
        ("partner", "=", "Partner A")
    ]
)
print(result.df) # Pandas DataFrame
```

When comparing to writing SQL, it's helpful to think of the dimensions as the
target columns of a **group by** SQL statement. Think of the metrics as the
columns you are **aggregating**. Think of the criteria as the **where
clause**. Your criteria are applied in the DataSource Layer SQL queries.

The `ReportResult` has a Pandas DataFrame with the dimensions as the index and
the metrics as the columns.

A `Report` is said to have a `grain`, which defines the dimensions each metric
must be able to join to in order to satisfy the `Report` requirements. The
`grain` is a combination of **all** dimensions, including those referenced in
criteria or in metric formulas. In the example above, the `grain` would be
`{date, partner}`.
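In plain-Python terms (a conceptual sketch, not `Zillion`'s internals), the grain is the union of the requested dimensions and any fields referenced in criteria:

```python
dimensions = ["date"]
criteria = [
    ("date", ">", "2020-01-01"),
    ("partner", "=", "Partner A"),
]

# The grain is every dimension the report touches: the requested dimensions
# plus any fields referenced in criteria (and in metric formulas, if any).
grain = set(dimensions) | {field for field, _op, _value in criteria}
print(sorted(grain))  # ['date', 'partner']
```

Metric formulas would contribute their referenced dimensions to this set as well.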
Both "revenue" and "leads" must be able to join to those
dimensions for this report to be possible.

These concepts can take time to sink in and obviously vary with the specifics
of your data model, but you will become more familiar with them as you start
putting together reports against your data warehouses.

<a name="natural-language-querying"></a>

### **Natural Language Querying**

With the NLP extension `Zillion` has experimental support for natural language querying of your data warehouse. For example:

```python
result = warehouse.execute_text("revenue and leads by date last month")
print(result.df) # Pandas DataFrame
```

This NLP feature requires a running instance of Qdrant (vector database) and the following values set in your `Zillion` config file:

* QDRANT_HOST
* OPENAI_API_KEY

Embeddings will be produced and stored in both Qdrant and a local cache. The
vector database will be initialized the first time you try to use this by
analyzing all fields in your warehouse. An example docker file to run Qdrant is provided in the root of this repo.

You have some control over how fields get embedded. Namely, in the configuration
for any field you can choose whether to exclude that field from embeddings or
override which embedding text maps to it. All fields are
included by default.
The following example would exclude the `net_revenue` field from being embedded and map `revenue` metric requests to the `gross_revenue` field.

```javascript
{
    "name": "gross_revenue",
    "type": "numeric(10,2)",
    "aggregation": "sum",
    "rounding": 2,
    "meta": {
        "nlp": {
            // enabled defaults to true
            "embedding_text": "revenue" // str or list of str
        }
    }
},
{
    "name": "net_revenue",
    "type": "numeric(10,2)",
    "aggregation": "sum",
    "rounding": 2,
    "meta": {
        "nlp": {
            "enabled": false
        }
    }
},
```

You may also exclude fields via the following warehouse-level configuration settings:

```javascript
{
    "meta": {
        "nlp": {
            "field_disabled_patterns": [
                // list of regex patterns to exclude
                "rpl_ma_5"
            ],
            "field_disabled_groups": [
                // list of "groups" to exclude, assuming you have
                // set a group value in the field's meta dict.
                "No NLP"
            ]
        }
    },
    ...
}
```

If a field is disabled at any of the aforementioned levels it will be ignored. This type of control becomes useful as your data model gets more complex and you want to guide the NLP logic in cases where it could confuse similarly named fields. Any time you adjust which fields are excluded you will want to force recreation of your embeddings collection using the `force_recreate` flag on `Warehouse.init_embeddings`.

> *Note:* This feature is in its infancy. Its usefulness will depend on the
quality of both the input query and your data model (i.e.
good field names)!

<a name="zillion-configuration"></a>

### **Zillion Configuration**

In addition to configuring the structure of your `Warehouse`, which will be
discussed further below, `Zillion` has a global configuration to control some
basic settings. The `ZILLION_CONFIG` environment var can point to a yaml config file. See `examples/sample_config.yaml` for more details on what values can be set. Environment vars prefixed with ZILLION_ can override config settings (e.g. ZILLION_DB_URL will override DB_URL).

The database used to store Zillion report specs can be configured by setting the DB_URL value in your `Zillion` config to a valid database connection string. By default a SQLite DB in /tmp is used.

---

<a name="example-sales-analytics"></a>

**Example - Sales Analytics**
-----------------------------

Below we will walk through a simple hypothetical sales data model that
demonstrates basic `DataSource` and `Warehouse` configuration and then shows
some sample [reports](#example-reports). The data is a simple SQLite database
that is part of the `Zillion` test code.
For reference, the schema is as\nfollows:\n\n```sql\nCREATE TABLE partners (\n  id INTEGER PRIMARY KEY,\n  name VARCHAR NOT NULL UNIQUE,\n  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP\n);\n\nCREATE TABLE campaigns (\n  id INTEGER PRIMARY KEY,\n  name VARCHAR NOT NULL UNIQUE,\n  category VARCHAR NOT NULL,\n  partner_id INTEGER NOT NULL,\n  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP\n);\n\nCREATE TABLE leads (\n  id INTEGER PRIMARY KEY,\n  name VARCHAR NOT NULL,\n  campaign_id INTEGER NOT NULL,\n  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP\n);\n\nCREATE TABLE sales (\n  id INTEGER PRIMARY KEY,\n  item VARCHAR NOT NULL,\n  quantity INTEGER NOT NULL,\n  revenue DECIMAL(10, 2),\n  lead_id INTEGER NOT NULL,\n  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP\n);\n```\n\n\u003ca name=\"example-warehouse-config\"\u003e\u003c/a\u003e\n\n### **Warehouse Configuration**\n\nA `Warehouse` may be created from a JSON or YAML configuration that defines\nits fields, datasources, and tables. The code below shows how it can be done in as little as one line of code if you have a pointer to a JSON/YAML `Warehouse` config.\n\n```python\nfrom zillion import Warehouse\n\nwh = Warehouse(config=\"https://raw.githubusercontent.com/totalhack/zillion/master/examples/example_wh_config.json\")\n```\n\nThis example config uses a `data_url` in its `DataSource` `connect` info that\ntells `Zillion` to dynamically download that data and connect to it as a\nSQLite database. 
This is useful for quick examples or analysis, though in most
scenarios you would put a connection string to an existing database like you
see
[here](https://raw.githubusercontent.com/totalhack/zillion/master/tests/test_mysql_ds_config.json).

The basics of `Zillion's` warehouse configuration structure are as follows:

A `Warehouse` config has the following main sections:

* `metrics`: optional list of metric configs for global metrics
* `dimensions`: optional list of dimension configs for global dimensions
* `datasources`: mapping of datasource names to datasource configs or config URLs

A `DataSource` config has the following main sections:

* `connect`: database connection url or dict of connect params
* `metrics`: optional list of metric configs specific to this datasource
* `dimensions`: optional list of dimension configs specific to this datasource
* `tables`: mapping of table names to table configs or config URLs

> Tip: datasource and table configs may also be replaced with a URL that points
to a local or remote config file.

In this example all four tables in our database are included in the config,
two as dimension tables and two as metric tables. The tables are linked
through a parent->child relationship: partners to campaigns, and leads to
sales. Some tables also utilize the `create_fields` flag to automatically
create `Fields` on the datasource from column definitions.
Other metrics and
dimensions are defined explicitly.

To view the structure of this `Warehouse` after init you can use the `print_info`
method, which shows all metrics, dimensions, tables, and columns that are part
of your data warehouse:

```python
wh.print_info() # Formatted print of the Warehouse structure
```

For a deeper dive into the config schema please see the full
[docs](https://totalhack.github.io/zillion/zillion.configs/).

<a name="example-reports"></a>

### **Reports**

**Example:** Get sales, leads, and revenue by partner:

```python
result = wh.execute(
    metrics=["sales", "leads", "revenue"],
    dimensions=["partner_name"]
)

print(result.df)
"""
              sales  leads  revenue
partner_name
Partner A        11      4    165.0
Partner B         2      2     19.0
Partner C         5      1    118.5
"""
```

**Example:** Let's limit to Partner A and break down by its campaigns:

```python
result = wh.execute(
    metrics=["sales", "leads", "revenue"],
    dimensions=["campaign_name"],
    criteria=[("partner_name", "=", "Partner A")]
)

print(result.df)
"""
               sales  leads  revenue
campaign_name
Campaign 1A        5      2       83
Campaign 2A        6      2       82
"""
```

**Example:** The output below shows rollups at the campaign level within each
partner, and also a rollup of totals at the partner and campaign level.

> *Note:* the output contains a special character to mark DataFrame rollup rows
that were added to the result. The
[ReportResult](https://totalhack.github.io/zillion/zillion.report/#reportresult)
object contains some helper attributes to automatically access or filter
rollups, as well as a `df_display` attribute that returns the result with
friendlier display values substituted for special characters.
The
under-the-hood special character is left here for illustration, but may not
render the same in all scenarios.

```python
from zillion import RollupTypes

result = wh.execute(
    metrics=["sales", "leads", "revenue"],
    dimensions=["partner_name", "campaign_name"],
    rollup=RollupTypes.ALL
)

print(result.df)
"""
                            sales  leads  revenue
partner_name campaign_name
Partner A    Campaign 1A      5.0    2.0     83.0
             Campaign 2A      6.0    2.0     82.0
             􏿿               11.0    4.0    165.0
Partner B    Campaign 1B      1.0    1.0      6.0
             Campaign 2B      1.0    1.0     13.0
             􏿿                2.0    2.0     19.0
Partner C    Campaign 1C      5.0    1.0    118.5
             􏿿                5.0    1.0    118.5
􏿿            􏿿               18.0    7.0    302.5
"""
```

See the `Report`
[docs](https://totalhack.github.io/zillion/zillion.report/#report) for more
information on supported rollup behavior.

**Example:** Save a report spec (not the data):

First you must make sure you have saved your `Warehouse`, as saved reports
are scoped to a particular `Warehouse` ID.
To save a `Warehouse`
you must provide a URL that points to the complete config.

```python
name = "My Unique Warehouse Name"
config_url = <some url pointing to a complete warehouse config>
wh.save(name, config_url) # wh.id is populated after this

spec_id = wh.save_report(
    metrics=["sales", "leads", "revenue"],
    dimensions=["partner_name"]
)
```

> *Note*: If you built your `Warehouse` in python from a list of `DataSources`,
or passed in a `dict` for the `config` param on init, there currently is not
a built-in way to output a complete config to a file for reference when saving.

**Example:** Load and run a report from a spec ID:

```python
result = wh.execute_id(spec_id)
```

This assumes you have saved this report ID previously in the database specified by the DB_URL in your `Zillion` yaml configuration.

**Example:** Unsupported Grain

If you attempt an impossible report, you will get an
`UnsupportedGrainException`. The report below is impossible because it
attempts to break down the leads metric by a dimension that only exists
in a child table. Generally speaking, child tables can join back up to
parents (and "siblings" of parents) to find dimensions, but not the other
way around.

```python
# Fails with UnsupportedGrainException
result = wh.execute(
    metrics=["leads"],
    dimensions=["sale_id"]
)
```

---

<a name="advanced-topics"></a>

**Advanced Topics**
-------------------

<a name="subreports"></a>

### **Subreports**

Sometimes you need subquery-like functionality in order to filter one
report to the results of some other (that perhaps required a different grain).
Zillion provides a simplistic way of doing that by using the `in report` or `not in report`
criteria operations.
There are two supported ways to specify the subreport: passing a
report spec ID or passing a dict of report params.

```python
# Assuming you have saved report 1234 and it has "partner" as a dimension:

result = warehouse.execute(
    metrics=["revenue", "leads"],
    dimensions=["date"],
    criteria=[
        ("date", ">", "2020-01-01"),
        ("partner", "in report", 1234)
    ]
)

# Or with a dict:

result = warehouse.execute(
    metrics=["revenue", "leads"],
    dimensions=["date"],
    criteria=[
        ("date", ">", "2020-01-01"),
        ("partner", "in report", dict(
            metrics=[...],
            dimensions=["partner"],
            criteria=[...]
        ))
    ]
)
```

The criteria field used in `in report` or `not in report` must be a dimension
in the subreport. Note that subreports are executed at `Report` object initialization
time instead of during `execute` -- as such they can not be killed using `Report.kill`.
This may change down the road.

<a name="formula-metrics"></a>

### **Formula Metrics**

In our example above our config included a formula-based metric called "rpl",
which is simply `revenue / leads`. A `FormulaMetric` combines other metrics
and/or dimensions to calculate a new metric at the Combined Layer of
querying. The syntax must match your Combined Layer database, which is SQLite
in our example.

```json
{
    "name": "rpl",
    "aggregation": "mean",
    "rounding": 2,
    "formula": "{revenue}/{leads}"
}
```

<a name="divisor-metrics"></a>

### **Divisor Metrics**

As a convenience, rather than having to repeatedly define formula metrics for
rate variants of a core metric, you can specify a divisor metric configuration on a non-formula metric.
As an example, say you have a `revenue` metric and want to create variants for `revenue_per_lead` and `revenue_per_sale`. You can define your revenue metric as follows:

```json
{
    "name": "revenue",
    "type": "numeric(10,2)",
    "aggregation": "sum",
    "rounding": 2,
    "divisors": {
        "metrics": [
            "leads",
            "sales"
        ]
    }
}
```

See `zillion.configs.DivisorsConfigSchema` for more details on configuration options, such as overriding naming templates, formula templates, and rounding.

<a name="aggregation-variants"></a>

### **Aggregation Variants**

Another minor convenience feature is the ability to automatically generate variants of metrics for different aggregation types in a single field configuration instead of across multiple fields in your config file. As an example, say you have a `sales` column in your data and want to create variants for `sales_mean` and `sales_sum`. You can define your metric as follows:

```json
{
    "name": "sales",
    "aggregation": {
        "mean": {
            "type": "numeric(10,2)",
            "rounding": 2
        },
        "sum": {
            "type": "integer"
        }
    }
}
```

The resulting warehouse would not have a `sales` metric, but would instead have `sales_mean` and `sales_sum`. Note that you can further customize the settings for the generated fields, such as setting a custom name, by specifying that in the nested settings for that aggregation type. In practice this is not a big efficiency gain over just defining the metrics separately, but some may prefer this approach.

<a name="formula-dimensions"></a>

### **Formula Dimensions**

Experimental support exists for `FormulaDimension` fields as well. A `FormulaDimension` can only use other dimensions as part of its formula, and it also gets evaluated in the Combined Layer database.
As an additional restriction, a `FormulaDimension` can not be used in report criteria as those filters are evaluated at the DataSource Layer. The following example assumes a SQLite Combined Layer database:

```json
{
    "name": "partner_is_a",
    "formula": "{partner_name} = 'Partner A'"
}
```

<a name="datasource-formulas"></a>

### **DataSource Formulas**

Our example also includes a metric "sales" whose value is calculated via
formula at the DataSource Layer of querying. Note the following entry in the
`fields` list for the "id" column in the "main.sales" table. These formulas are
in the syntax of the particular `DataSource` database technology, which also
happens to be SQLite in our example.

```json
"fields": [
    "sale_id",
    {"name": "sales", "ds_formula": "COUNT(DISTINCT sales.id)"}
]
```

<a name="type-conversions"></a>

### **Type Conversions**

Our example also automatically created a handful of dimensions from the
"created_at" columns of the leads and sales tables. Support for automatic type
conversions is limited, but for date/datetime columns in supported
`DataSource` technologies you can get a variety of dimensions for free this
way.

The output of `wh.print_info` will show the added dimensions, which are
prefixed with "lead_" or "sale_" as specified by the optional
`type_conversion_prefix` in the config for each table. Some examples of
auto-generated dimensions in our example warehouse include sale_hour,
sale_day_name, sale_month, sale_year, etc.

As an optimization in the where clause of underlying report queries, `Zillion`
will try to apply conversions to criteria values instead of columns.
For example,
it is generally more efficient to query as `my_datetime > '2020-01-01' and my_datetime < '2020-01-02'`
instead of `DATE(my_datetime) = '2020-01-01'`, because the latter can prevent index
usage in many database technologies. The ability to apply conversions to values
instead of columns varies by field and `DataSource` technology as well.

To prevent type conversions, set `skip_conversion_fields` to `true` on your
`DataSource` config.

See `zillion.field.TYPE_ALLOWED_CONVERSIONS` and `zillion.field.DIALECT_CONVERSIONS`
for more details on currently supported conversions.

<a name="adhoc-metrics"></a>

### **Ad Hoc Metrics**

You may also define metrics "ad hoc" with each report request. Below is an
example that creates a revenue-per-lead metric on the fly. These only exist
within the scope of the report, and the name can not conflict with any existing
fields:

```python
result = wh.execute(
    metrics=[
        "leads",
        {"formula": "{revenue}/{leads}", "name": "my_rpl"}
    ],
    dimensions=["partner_name"]
)
```

<a name="adhoc-dimensions"></a>

### **Ad Hoc Dimensions**

You may also define dimensions "ad hoc" with each report request. Below is an
example that creates a dimension that partitions on a particular dimension value on the fly. Ad Hoc Dimensions are a subclass of `FormulaDimension`s and therefore have the same restrictions, such as not being able to use a metric as a formula field.
These dimensions only exist within the scope of the report, and their names
cannot conflict with any existing fields:

```python
result = wh.execute(
    metrics=["leads"],
    dimensions=[{"name": "partner_is_a", "formula": "{partner_name} = 'Partner A'"}]
)
```

<a name="adhoc-tables"></a>

### **Ad Hoc Tables**

`Zillion` also supports creating or syncing ad hoc tables in your database
during `DataSource` or `Warehouse` init. An example of a table config that
does this is shown
[here](https://github.com/totalhack/zillion/blob/master/tests/test_adhoc_ds_config.json).
It uses the table config's `data_url` and `if_exists` params to control the
syncing and/or creation of the "main.dma_zip" table from a remote CSV in a
SQLite database. The same can be done in other database types too.

The potential performance drawbacks of such an approach should be obvious,
particularly if you are initializing your warehouse often or if the remote
data file is large. It is often better to sync and create your data ahead of
time so you have complete schema control, but this method can be very useful
in certain scenarios.

> **Warning**: Be careful not to overwrite existing tables in your database!

<a name="technicals"></a>

### **Technicals**

A variety of technical computations can be applied to metrics to compute
rolling, cumulative, or rank statistics. For example, to compute a 5-point
moving average on revenue one might define a new metric as follows:

```json
{
    "name": "revenue_ma_5",
    "type": "numeric(10,2)",
    "aggregation": "sum",
    "rounding": 2,
    "technical": "mean(5)"
}
```

Technical computations are performed at the Combined Layer, whereas the
"aggregation" is done at the DataSource Layer (hence the need to define both
above).
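To build intuition, here is a rough sketch of what a `mean(5)` technical does to the combined result rows: a trailing moving average over already-aggregated values. This is an illustrative stand-in, not `Zillion`'s actual implementation; the `rolling_mean` helper and the sample `daily_revenue` data are hypothetical, and details such as how partial windows at the start of the series are handled may differ.

```python
# Hypothetical illustration of a "mean(5)" technical (not Zillion's
# actual code): the DataSource Layer has already summed revenue per
# dimension group; the technical then smooths those rows at the
# Combined Layer with a trailing moving average.

def rolling_mean(values, window):
    """Trailing moving average; early rows use a partial window."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(round(sum(chunk) / len(chunk), 2))
    return out

daily_revenue = [100.0, 120.0, 90.0, 110.0, 130.0, 150.0]
revenue_ma_5 = rolling_mean(daily_revenue, 5)
print(revenue_ma_5)  # → [100.0, 110.0, 103.33, 105.0, 110.0, 120.0]
```

The same idea generalizes to the other technical types: "cumsum" would replace the windowed average with a running total, and rank technicals would order the rows instead.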
For more info on how shorthand technical strings are parsed, see the
[parse_technical_string](https://totalhack.github.io/zillion/zillion.configs/#parse_technical_string)
code. For a full list of supported technical types see
`zillion.core.TechnicalTypes`.

Technicals also support two modes: "group" and "all". The mode controls how
the technical computation is applied across the data's dimensions. In "group"
mode, it computes the technical across the last dimension, whereas in "all"
mode it computes the technical across all data without any regard for
dimensions.

The point of this becomes clearer if you try to do a "cumsum" technical
across data broken down by something like ["partner_name", "date"]. If "group"
mode is used (the default in most cases) it will do cumulative sums *within*
each partner over the date ranges. If "all" mode is used, it will do a
cumulative sum across every data row. You can be explicit about the mode by
appending it to the technical string, e.g. "cumsum:all" or "mean(5):group".

---

<a name="config-variables"></a>

### **Config Variables**

If you'd like to avoid putting sensitive connection information directly in
your `DataSource` configs, you can leverage config variables.
In your `Zillion` yaml config you can specify a `DATASOURCE_CONTEXTS` section
as follows:

```yaml
DATASOURCE_CONTEXTS:
  my_ds_name:
    user: user123
    pass: goodpassword
    host: 127.0.0.1
    schema: reporting
```

Then when your `DataSource` config for the datasource named "my_ds_name" is
read, it can use this context to populate variables in your connection URL:

```json
"datasources": {
    "my_ds_name": {
        "connect": "mysql+pymysql://{user}:{pass}@{host}/{schema}"
        ...
    }
}
```

<a name="datasource-priority"></a>

### **DataSource Priority**

On `Warehouse` init you can specify a default priority order for datasources
by name. This comes into play when a report could be satisfied by multiple
datasources. `DataSources` earlier in the list have higher priority. This
would be useful if you wanted to favor a set of faster, aggregate tables that
are grouped in a `DataSource`.

```python
wh = Warehouse(config=config, ds_priority=["aggr_ds", "raw_ds", ...])
```

<a name="supported-datasources"></a>

**Supported DataSources**
-------------------------

`Zillion`'s goal is to support any database technology that SQLAlchemy
supports (pictured below). That said, support and testing levels in `Zillion`
vary at the moment. In particular, the ability to do type conversions,
database reflection, and kill running queries all require some
database-specific code for support. The following list summarizes known
support levels. Your mileage may vary with untested database technologies
that SQLAlchemy supports (they might work just fine, they just haven't been
tested yet).
Please report bugs and help add more support!

* SQLite: supported
* MySQL: supported
* PostgreSQL: supported
* DuckDB: supported
* BigQuery, Redshift, Snowflake, SingleStore, PlanetScale, etc: not tested but would like to support these

SQLAlchemy has connectors to many popular databases. The barrier to
supporting many of these is likely pretty low given the simple nature of the
SQL operations `Zillion` uses.

![SQLAlchemy Connectors](https://github.com/totalhack/zillion/blob/master/docs/images/sqlalchemy_connectors.webp?raw=true)

Note that the above is different from the database support for the Combined
Layer database. Currently only SQLite is supported there; that should be
sufficient for most use cases, but more options will be added down the road.

<a name="multiprocess-considerations"></a>

**Multiprocess Considerations**
-------------------------------

If you plan to run `Zillion` in a multiprocess scenario, whether on a single
node or across multiple nodes, there are a couple of things to consider:

* SQLite DataSources do not scale well and may run into locking issues with multiple processes trying to access them on the same node.
* Any file-based database technology that isn't centrally accessible would be challenging when using multiple nodes.
* Ad Hoc DataSource and Ad Hoc Table downloads should be avoided, as they may conflict/repeat across each process.
Offload this work to an external ETL process that is better suited to manage
those data flows in a scalable production scenario.

Note that you can still use the default SQLite in-memory Combined Layer
database without issue, as it is created on the fly with each report request
and requires no coordination/communication with other processes or nodes.

<a name="demo-ui"></a>

**Demo UI / Web API**
--------------------

[Zillion Web UI](https://github.com/totalhack/zillion-web) is a demo UI and
web API for Zillion that also includes an experimental ChatGPT plugin. See
the README there for more info on installation and project structure. Please
note that the code is light on testing and polish, but is expected to work in
modern browsers. Also note that ChatGPT plugins are quite slow at the moment,
so that feature is currently mostly for fun and not that useful.

---

<a name="documentation"></a>

**Documentation**
-----------------

More thorough documentation can be found [here](https://totalhack.github.io/zillion/).
You can supplement your knowledge by perusing the
[tests](https://github.com/totalhack/zillion/tree/master/tests) directory or
the [API reference](https://totalhack.github.io/zillion/).

---

<a name="how-to-contribute"></a>

**How to Contribute**
---------------------

Please see the
[contributing](https://github.com/totalhack/zillion/blob/master/CONTRIBUTING.md)
guide for more information.
If you are looking for inspiration, adding support and tests for additional
database technologies would be a great help.