{"id":13534490,"url":"https://github.com/mara/mara-pipelines","last_synced_at":"2025-05-14T21:06:29.088Z","repository":{"id":37010529,"uuid":"127569316","full_name":"mara/mara-pipelines","owner":"mara","description":"A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow","archived":false,"fork":false,"pushed_at":"2023-12-15T16:14:47.000Z","size":3451,"stargazers_count":2078,"open_issues_count":26,"forks_count":100,"subscribers_count":54,"default_branch":"main","last_synced_at":"2025-04-13T17:46:45.219Z","etag":null,"topics":["data","data-integration","etl","pipeline","postgresql","python"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mara.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-03-31T20:37:22.000Z","updated_at":"2025-04-13T00:12:17.000Z","dependencies_parsed_at":"2023-02-16T18:40:27.601Z","dependency_job_id":"74d3e8a0-f86f-490a-a510-23d9af4b3285","html_url":"https://github.com/mara/mara-pipelines","commit_stats":{"total_commits":151,"total_committers":19,"mean_commits":7.947368421052632,"dds":0.543046357615894,"last_synced_commit":"3ba3d8b312c7e96e8c8fb2d40a7b37936c2a492a"},"previous_names":["mara/data-integration"],"tags_count":40,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mara%2Fmara-pipelines","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mara%2Fmara-pipelines/tags","releases_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/mara%2Fmara-pipelines/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mara%2Fmara-pipelines/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mara","download_url":"https://codeload.github.com/mara/mara-pipelines/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254227611,"owners_count":22035669,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","data-integration","etl","pipeline","postgresql","python"],"created_at":"2024-08-01T07:01:34.293Z","updated_at":"2025-05-14T21:06:24.063Z","avatar_url":"https://github.com/mara.png","language":"Python","funding_links":[],"categories":["Python","DevOps","数据管道和流处理","Data Pipelines \u0026 Streaming","data","Workflow Management/Engines"],"sub_categories":["Data Management"],"readme":"# Mara Pipelines\n\n[![Build \u0026 Test](https://github.com/mara/mara-pipelines/actions/workflows/build.yaml/badge.svg)](https://github.com/mara/mara-pipelines/actions/workflows/build.yaml)\n[![PyPI - License](https://img.shields.io/pypi/l/mara-pipelines.svg)](https://github.com/mara/mara-pipelines/blob/main/LICENSE)\n[![PyPI version](https://badge.fury.io/py/mara-pipelines.svg)](https://badge.fury.io/py/mara-pipelines)\n[![Slack Status](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack\u0026style=social)](https://communityinviter.com/apps/mara-users/public-invite)\n\n\n\nThis package contains a lightweight data transformation framework with a focus on transparency and complexity 
reduction. It has a number of baked-in assumptions/ principles:\n\n- Data integration pipelines as code: pipelines, tasks and commands are created using declarative Python code.\n\n- PostgreSQL as a data processing engine.\n\n- Extensive web ui. The web browser as the main tool for inspecting, running and debugging pipelines.\n\n- GNU make semantics. Nodes depend on the completion of upstream nodes. No data dependencies or data flows.\n\n- No in-app data processing: command line tools as the main tool for interacting with databases and data.\n\n- Single machine pipeline execution based on Python's [multiprocessing](https://docs.python.org/3.6/library/multiprocessing.html). No need for distributed task queues. Easy debugging and output logging.\n\n- Cost based priority queues: nodes with higher cost (based on recorded run times) are run first.\n\n\u0026nbsp;\n\n## Installation\n\nTo use the library directly, use pip:\n\n```\npip install mara-pipelines\n```\n\nor\n\n```\npip install git+https://github.com/mara/mara-pipelines.git\n```\n\nFor an example of an integration into a flask application, have a look at the [mara example project 1](https://github.com/mara/mara-example-project-1) and [mara example project 2](https://github.com/mara/mara-example-project-2).\n\nDue to the heavy use of forking, Mara Pipelines does not run natively on Windows. 
If you want to run it on Windows, then please use Docker or the [Windows Subsystem for Linux](https://en.wikipedia.org/wiki/Windows_Subsystem_for_Linux).\n\n\u0026nbsp;\n\n## Example\n\nHere is a pipeline \"demo\" consisting of three nodes that depend on each other: the task `ping_localhost`, the pipeline `sub_pipeline` and the task `sleep`:\n\n```python\nfrom mara_pipelines.commands.bash import RunBash\nfrom mara_pipelines.pipelines import Pipeline, Task\nfrom mara_pipelines.cli import run_pipeline, run_interactively\n\npipeline = Pipeline(\n    id='demo',\n    description='A small pipeline that demonstrates the interplay between pipelines, tasks and commands')\n\npipeline.add(Task(id='ping_localhost', description='Pings localhost',\n                  commands=[RunBash('ping -c 3 localhost')]))\n\nsub_pipeline = Pipeline(id='sub_pipeline', description='Pings a number of hosts')\n\nfor host in ['google', 'amazon', 'facebook']:\n    sub_pipeline.add(Task(id=f'ping_{host}', description=f'Pings {host}',\n                          commands=[RunBash(f'ping -c 3 {host}.com')]))\n\nsub_pipeline.add_dependency('ping_amazon', 'ping_facebook')\nsub_pipeline.add(Task(id='ping_foo', description='Pings foo',\n                      commands=[RunBash('ping foo')]), ['ping_amazon'])\n\npipeline.add(sub_pipeline, ['ping_localhost'])\n\npipeline.add(Task(id='sleep', description='Sleeps for 2 seconds',\n                  commands=[RunBash('sleep 2')]), ['sub_pipeline'])\n```\n\nTasks contain lists of commands, which do the actual work (in this case running bash commands that ping various hosts).\n\n\u0026nbsp;\n\nTo run the pipeline, it is recommended to configure a PostgreSQL database for storing run-time information, run output and the status of incremental processing:\n\n```python\nimport mara_db.auto_migration\nimport mara_db.config\nimport mara_db.dbs\n\nmara_db.config.databases \\\n    = lambda: {'mara': mara_db.dbs.PostgreSQLDB(host='localhost', user='root', 
database='example_etl_mara')}\n\nmara_db.auto_migration.auto_discover_models_and_migrate()\n```\n\nGiven that PostgreSQL is running and the credentials work, the output looks like this (a database with a number of tables is created):\n\n```\nCreated database \"postgresql+psycopg2://root@localhost/example_etl_mara\"\n\nCREATE TABLE data_integration_file_dependency (\n    node_path TEXT[] NOT NULL,\n    dependency_type VARCHAR NOT NULL,\n    hash VARCHAR,\n    timestamp TIMESTAMP WITHOUT TIME ZONE,\n    PRIMARY KEY (node_path, dependency_type)\n);\n\n.. more tables\n```\n\n### CLI UI\n\nThis runs a pipeline with output to stdout:\n\n```python\nfrom mara_pipelines.cli import run_pipeline\n\nrun_pipeline(pipeline)\n```\n\n![Example run cli 1](https://github.com/mara/mara-pipelines/raw/3.2.x/docs/_static/example-run-cli-1.gif)\n\n\u0026nbsp;\n\nAnd this runs a single node of pipeline `sub_pipeline` together with all the nodes that it depends on:\n\n```python\nrun_pipeline(sub_pipeline, nodes=[sub_pipeline.nodes['ping_amazon']], with_upstreams=True)\n```\n\n![Example run cli 2](https://github.com/mara/mara-pipelines/raw/3.2.x/docs/_static/example-run-cli-2.gif)\n\n\u0026nbsp;\n\n\nAnd finally, there is a menu based on [pythondialog](http://pythondialog.sourceforge.net/) that lets you navigate and run pipelines like this:\n\n```python\nfrom mara_pipelines.cli import run_interactively\n\nrun_interactively()\n```\n\n![Example run cli 3](https://github.com/mara/mara-pipelines/raw/3.2.x/docs/_static/example-run-cli-3.gif)\n\n\n\n### Web UI\n\nMore importantly, this package provides an extensive web interface. 
It can be easily integrated into any [Flask](https://flask.palletsprojects.com/) based app and the [mara example project](https://github.com/mara/mara-example-project) demonstrates how to do this using [mara-app](https://github.com/mara/mara-app).\n\nFor each pipeline, there is a page that shows\n\n- a graph of all child nodes and the dependencies between them\n- a chart of the overall run time of the pipeline and its most expensive nodes over the last 30 days (configurable)\n- a table of all the pipeline's nodes with their average run times and the resulting queuing priority\n- output and timeline for the last runs of the pipeline\n\n\n![Mara pipelines web ui 1](https://github.com/mara/mara-pipelines/raw/3.2.x/docs/_static/mara-pipelines-web-ui-1.png)\n\nFor each task, there is a page showing\n\n- the upstreams and downstreams of the task in the pipeline\n- the run times of the task in the last 30 days\n- all commands of the task\n- output of the last runs of the task\n\n![Mara pipelines web ui 2](https://github.com/mara/mara-pipelines/raw/3.2.x/docs/_static/mara-pipelines-web-ui-2.png)\n\n\nPipelines and tasks can be run from the web ui directly, which is probably one of the main features of this package:\n\n![Example run web ui](https://github.com/mara/mara-pipelines/raw/3.2.x/docs/_static/example-run-web-ui.gif)\n\n\u0026nbsp;\n\n## Getting started\n\nDocumentation is currently work in progress. 
Please use the [mara example project 1](https://github.com/mara/mara-example-project-1) and [mara example project 2](https://github.com/mara/mara-example-project-2) as a reference for getting started.\n\n## Links\n\n* Documentation: https://mara-pipelines.readthedocs.io/\n* Changes: https://mara-pipelines.readthedocs.io/en/latest/changes.html\n* PyPI Releases: https://pypi.org/project/mara-pipelines/\n* Source Code: https://github.com/mara/mara-pipelines\n* Issue Tracker: https://github.com/mara/mara-pipelines/issues\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmara%2Fmara-pipelines","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmara%2Fmara-pipelines","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmara%2Fmara-pipelines/lists"}