{"id":13557492,"url":"https://github.com/mara/mara-example-project-2","last_synced_at":"2026-03-09T15:31:08.858Z","repository":{"id":82623546,"uuid":"128645577","full_name":"mara/mara-example-project-2","owner":"mara","description":"An example mini data warehouse for python project stats, template for new projects","archived":false,"fork":false,"pushed_at":"2020-07-21T07:43:30.000Z","size":25156,"stargazers_count":178,"open_issues_count":2,"forks_count":38,"subscribers_count":14,"default_branch":"master","last_synced_at":"2025-10-25T22:03:54.169Z","etag":null,"topics":["bigquery","data-integration","etl","pypi","sql"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mara.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2018-04-08T13:44:40.000Z","updated_at":"2025-09-02T08:22:53.000Z","dependencies_parsed_at":null,"dependency_job_id":"445016b4-ad11-416b-8b17-aa3e6ac092fb","html_url":"https://github.com/mara/mara-example-project-2","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mara/mara-example-project-2","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mara%2Fmara-example-project-2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mara%2Fmara-example-project-2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mara%2Fmara-example-project-2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mara%2Fmara-example-project-2/manifest
s","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mara","download_url":"https://codeload.github.com/mara/mara-example-project-2/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mara%2Fmara-example-project-2/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30301109,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-09T14:33:48.460Z","status":"ssl_error","status_checked_at":"2026-03-09T14:33:48.027Z","response_time":61,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigquery","data-integration","etl","pypi","sql"],"created_at":"2024-08-01T12:04:23.033Z","updated_at":"2026-03-09T15:31:08.830Z","avatar_url":"https://github.com/mara.png","language":"Python","funding_links":[],"categories":["Python","sql"],"sub_categories":[],"readme":"# Mara Example Project\n\nA runnable app that demonstrates how to build a data warehouse with mara. Combines the [data-integration](https://github.com/mara/data-integration) and [bigquery-downloader](https://github.com/mara/bigquery-downloader) libraries with the [mara-app](https://github.com/mara/mara-app) framework into a project. 
\n\nThe example ETL integrates PyPI download stats and GitHub repo activity metrics into more general Python project activity stats.\n\nThe repository is intended to serve as a template for new projects.\n\n\u0026nbsp;\n\n\n## Example: Python Project Stats\n\nThe project uses two data sources: \n\n1. The [PyPI downloads](https://packaging.python.org/guides/analyzing-pypi-package-downloads/) BigQuery data set at [https://bigquery.cloud.google.com/dataset/the-psf:pypi](https://bigquery.cloud.google.com/dataset/the-psf:pypi) (Google login required). It contains each individual package download together with project and client attributes. \n\n2. The [Github archive](https://www.gharchive.org/) BigQuery data set at [https://bigquery.cloud.google.com/dataset/githubarchive:day](https://bigquery.cloud.google.com/dataset/githubarchive:day). It contains nearly all events that happen to Github repositories.\n\nFrom both data sources, a set of pre-aggregated and filtered CSVs is incrementally downloaded using the queries in [app/bigquery_downloader](app/bigquery_downloader):\n\n```console\n$ gunzip --decompress --stdout data/2018/04/10/pypi/downloads-v1.csv.gz | grep \"\\tflask\\t\\|day_id\" | head -n 11\nday_id\tproject\tproject_version\tpython_version\tinstaller\tnumber_of_downloads\n20180410\tflask\t0.1\t\tbandersnatch\t1\n20180410\tflask\t0.2\t\tbandersnatch\t1\n20180410\tflask\t0.5\t\tbandersnatch\t1\n20180410\tflask\t0.6\t\tbandersnatch\t1\n20180410\tflask\t0.8\t\tbandersnatch\t1\n20180410\tflask\t0.9\t\tbandersnatch\t1\n20180410\tflask\t0.10\t2.6\tpip\t1\n20180410\tflask\t0.11\t\tBrowser\t1\n20180410\tflask\t0.11\t\tbandersnatch\t1\n20180410\tflask\t0.5.1\t\tbandersnatch\t1\n```\n\n```console\n$ gunzip --decompress --stdout data/2018/04/10/github/repo-activity-v1.csv.gz | grep 
\"\\tflask\\t\\|day_id\"\nday_id\tuser\trepository\tnumber_of_forks\tnumber_of_commits\tnumber_of_closed_pull_requests\n20180410\tliks79\tflask\t1\t\t\n20180410\tdengyifan\tflask\t\t1\t\n20180410\txeriok18600\tflask\t\t1\t\n20180410\tmanhhomienbienthuy\tflask\t\t10\n20180410\tdavidism\tflask\t\t49\t\n20180410\tpallets\tflask\t10\t6\t3\n```\n\nThe total size of these (compressed) csv files is 3.5GB for the time range from Jan 2017 to April 2018.\n\n\u0026nbsp;\n\nThen there is the ETL in [app/data_integration/pipelines](app/data_integration/pipelines) that transforms this data into a classic Kimball-like [star schema](https://en.wikipedia.org/wiki/Star_schema):\n\n![Star schema](docs/star-schema.png)\n\nIt shows 4 database schemas, each created by a different pipeline: \n\n- `time`: All days from the beginning of data processing until yesterday,\n- `pypi_dim`: PyPI download counts per project version, installer and day,\n- `gh_dim`: The number of commits, forks and closed pull requests per Github repository and day,\n- `pp_dim`: PyPI and Github metrics merged by day and repository/ project name.\n\nThe overall database size of the data warehouse is roughly 100GB for the timerange mentioned above. 
\n\n\u0026nbsp;\n\nWith this structure in place, it is then possible to run queries like this:\n\n```sql\nSELECT \n  _date, project_name, number_of_downloads, number_of_forks, \n  number_of_commits, number_of_closed_pull_requests \nFROM pp_dim.python_project_activity \n  JOIN pypi_dim.project ON project_id = project_fk \n  JOIN time.day ON day_fk = day_id \nWHERE project_name = 'flask' \nORDER BY day_fk DESC \nLIMIT 10;\n```\n\n```\n   _date    | project_name | number_of_downloads | number_of_forks | number_of_commits | number_of_closed_pull_requests \n------------+--------------+---------------------+-----------------+-------------------+--------------------------------\n 2018-04-10 | flask        |               45104 |              11 |                67 |                              3\n 2018-04-09 | flask        |               57177 |              13 |                45 |                              4\n 2018-04-08 | flask        |               70392 |              13 |                 7 |                               \n 2018-04-07 | flask        |               65060 |              10 |                 7 |                               \n 2018-04-06 | flask        |               70779 |               7 |                11 |                              2\n 2018-04-05 | flask        |               62215 |               6 |                22 |                               \n 2018-04-04 | flask        |               33116 |              11 |                23 |                               \n 2018-04-03 | flask        |               39248 |              15 |                27 |                               \n 2018-04-02 | flask        |               54517 |              14 |                17 |                               \n 2018-04-01 | flask        |               68685 |               4 |                 6 |                               \n(10 rows)\n```\n\n\u0026nbsp;\n\nMara data integration pipelines are visualized and debugged through a web 
ui. Here, the pipeline `github` is run (locally on an old Mac with 2 days of data): \n\n![Mara web ui ETL run](docs/mara-web-ui-etl-run.gif)\n\n\u0026nbsp;\n\nIn production, pipelines are run through a command line interface:\n\n![Mara cli ETL run](docs/mara-cli-etl-run.gif)\n\n\u0026nbsp;\n\nMara ETL pipelines are completely transparent, both to stakeholders in terms of applied business logic and to data engineers in terms of runtime behavior.\n\nThis is the page in the web ui that visualizes the pipeline `pypi`: \n\n![Mara web UI for pipelines](docs/mara-web-ui-pipeline.png)\n\nIt shows \n\n- a graph of all pipeline children with dependencies between them,\n- run times of the pipeline and top child nodes over time,\n- a list of all child nodes with their average run time and cost,\n- system statistics, a timeline and output of the last runs.\n\n\u0026nbsp;\n\nSimilarly, this is the page for the task `pypi/transform_project`:\n\n![Mara web ui for tasks](docs/mara-web-ui-task.png)\n\nIt shows its\n\n- direct upstreams and downstreams,\n- run times over time,\n- all commands of the task,\n- system statistics, a timeline and output of the last runs. \n\n\u0026nbsp;\n\n## Getting started\n\n### System requirements\n\nPython \u003e=3.6, PostgreSQL \u003e=10 and some smaller packages are required to run the example (and mara in general). \n\nMac:\n\n```console\n$ brew install -v python3\n$ brew install -v dialog\n$ brew install -v coreutils\n$ brew install -v graphviz\n```\n\nUbuntu 16.04:\n\n```console\n$ sudo apt install git dialog coreutils graphviz python3 python3-dev python3-venv\n```\n\n\u0026nbsp;\n\nMara does not run on Windows.\n\n\u0026nbsp;\n\nOn Mac, install PostgreSQL with `brew install -v postgresql`. On Ubuntu, follow [these instructions](https://www.postgresql.org/download/linux/ubuntu/). 
Also install the [cstore_fdw](https://github.com/citusdata/cstore_fdw) extension with `brew install cstore_fdw` and build the [postgresql-hll](https://github.com/citusdata/postgresql-hll) extension from source.\n\nTo optimize PostgreSQL for ETL workloads, update your `postgresql.conf` following [this example](docs/postgresql.conf).\n\nStart a database client with `sudo -u postgres psql postgres` and then create a user with `CREATE ROLE root SUPERUSER LOGIN;` (you can use any other name).\n\n\u0026nbsp;\n\n### Installation\n\nClone the repository somewhere. Copy the file [`app/local_setup.py.example`](app/local_setup.py.example) to `app/local_setup.py` and adapt it to your machine.\n\nLog into PostgreSQL with `psql -U root postgres` and create two databases:\n\n```sql\nCREATE DATABASE example_project_dwh;\nCREATE DATABASE example_project_mara;\n```\n\nRun `make` in the root directory of the project. This will \n\n- create a virtual environment in `.venv`,\n- install all packages from [`requirements.txt.freeze`](requirements.txt.freeze) (if you want to create a new `requirements.txt.freeze` from [`requirements.txt`](requirements.txt), then run `make update-packages`),\n- create a number of tables that are needed for running mara.\n\nYou can now activate the virtual environment with \n\n```console\n$ source .venv/bin/activate\n```\n\nTo list all available flask cli commands, run `flask` without parameters.\n\n\u0026nbsp;\n\n### Running the web UI\n\n```console\n$ flask run --with-threads --reload --eager-loading\n```\n\nThe app is now accessible at [http://localhost:5000](http://localhost:5000).\n\n\u0026nbsp;\n\n### Downloading PyPI and Github data from BigQuery\n\nThis step takes many hours to complete. If you don't have time for this (or don't want to go through the hassle of creating Google Cloud credentials), we provide a daily updated copy of the result data sets on S3. 
You can get (and update) the set of result CSVs with \n\n```console\n$ make sync-bigquery-csv-data-sets-from-s3\n```\n\n\u0026nbsp;\n\nTo download the data yourself, follow the instructions in the README of the [mara/bigquery-downloader](https://github.com/mara/bigquery-downloader) package to get a Google Cloud credentials file. Store it in `app/bigquery_downloader/bigquery-credentials.json`. \n\nThen run the downloader in an activated virtual environment with:\n\n```console\n$ flask bigquery_downloader.download_data\n```\n\n\u0026nbsp;\n\n### Running the ETL\n\nFor development, it is recommended to run the ETL from the web UI (see above). In production, use `flask data_integration.ui.run` to run a pipeline or a set of its child nodes. \n\nThe command `data_integration.ui.run_interactively` provides an ncurses-based menu for selecting and running pipelines.\n\n\u0026nbsp;\n\n## Documentation\n\nDocumentation is a work in progress, but the code base is quite small and documented.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmara%2Fmara-example-project-2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmara%2Fmara-example-project-2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmara%2Fmara-example-project-2/lists"}