{"id":21284481,"url":"https://github.com/nasdaq/flask-data-pipes","last_synced_at":"2025-07-11T11:31:50.677Z","repository":{"id":45123697,"uuid":"153642596","full_name":"Nasdaq/flask-data-pipes","owner":"Nasdaq","description":"Simple, performant data pipelines.","archived":false,"fork":false,"pushed_at":"2022-01-06T22:27:27.000Z","size":302,"stargazers_count":10,"open_issues_count":1,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-06T03:11:07.291Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Nasdaq.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-10-18T15:00:22.000Z","updated_at":"2024-09-23T20:59:25.000Z","dependencies_parsed_at":"2022-09-22T17:30:55.320Z","dependency_job_id":null,"html_url":"https://github.com/Nasdaq/flask-data-pipes","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Nasdaq/flask-data-pipes","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Nasdaq%2Fflask-data-pipes","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Nasdaq%2Fflask-data-pipes/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Nasdaq%2Fflask-data-pipes/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Nasdaq%2Fflask-data-pipes/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Nasdaq","download_url":"https://codeload.github.com/Nasdaq/flask-data-pipes/tar.gz/refs/heads/master","sbom_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Nasdaq%2Fflask-data-pipes/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264795385,"owners_count":23665227,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-21T11:16:09.503Z","updated_at":"2025-07-11T11:31:50.369Z","avatar_url":"https://github.com/Nasdaq.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Flask Data Pipes\nFlask Data Pipes provides simple, performant data pipelines with a host of robust features and configurable tools. The project is currently in alpha with plans to further document and streamline dependencies.\n\n## Features\n* Like-magic ETL\n* Transparent file and table management\n* Simple logging and versioning of inbound data\n* Staged data processing, fully configurable ETL\n* File upload management\n* Decorator functions for infinite configurability\n* Low memory utilization due to straight to disk processing\n\n## Dependencies\n* Flask\n* Flask-SQLAlchemy\n* blinker\n* celery\n* marshmallow\n* inflection\n* requests\n\n(We know it’s a lot. 
For now…)\n\n## Installation and Implementation\n`pip install flask-data-pipes`\n\nCreate the Flask app with its dependencies:\n```python\nfrom flask import Flask\nfrom celery import Celery\nfrom flask_sqlalchemy import SQLAlchemy\nfrom flask_data_pipes import ETL\nimport os\n\nBASE = os.path.dirname(__file__)\n\napp = Flask(__name__)\napp.celery = Celery(app.import_name, broker='amqp://guest@localhost//')\n\ndb = SQLAlchemy(app)\netl = ETL(app, db)\n```\n\nOr use an application factory:\n```python\ndb = SQLAlchemy()\netl = ETL()\n\ndef create_app(config_filename):\n    app = Flask(__name__)\n    app.config.from_pyfile(config_filename)\n\n    db.init_app(app)\n    app.celery = Celery(app.import_name, broker='amqp://guest@localhost//')\n\n    etl.init_app(app, db)\n\n    return app\n```\n\n\nImport the data models, then send the import signal for versioning and registration:\n```python\nfrom importlib import import_module\n\nwith app.app_context():\n    for module_name in app.config['MODULES']:\n        try:\n            import_module(f'{BASE}.{module_name}.models', package=__name__)\n        except (AttributeError, ModuleNotFoundError):\n            continue\n\napp.signal.etl_tables_imported.send(app)\n```\n\n## Usage\nDefine Pipeline\n\n```python\nfrom my_app import app, etl\nfrom flask_data_pipes import extract, on_load_commit\n\nclass UserPipeline(etl.Pipeline, extract=True, transform=True, load=True):\n\n    @extract\n    def collect_all_users(self, *args, **kwargs):\n        # MyAPIClient stands in for your own API client class\n        with MyAPIClient(app.config['API_CONFIG']) as client:\n            for entry in client.get_all_users():\n                yield self.models('User', entry)\n\n    # yes, transform will happen automagically\n    # same for load\n    # although you could infinitely hook and customize\n\n    @on_load_commit\n    def alert_complete(self, meta: list):\n        for entry in meta:\n            app.logger.info(f\"Table '{self.tables(entry['model']).__tablename__}' updated successfully!\")\n```\n\n\nDefine Model\n```python\n
from my_app import etl\nfrom my_app.pipelines.users import UserPipeline\nfrom my_app.tables.users import User as UserDBTable\n\nclass User(etl.Model):\n\n    __filename__ = 'users'\n    __table__ = UserDBTable\n    __pipeline__ = UserPipeline\n\n    first = etl.fields.UppercaseString()\n    last = etl.fields.UppercaseString()\n    email = etl.fields.Method('define_email')\n    birthday = etl.fields.Date()\n    profile = etl.fields.URL()\n\n    def define_email(self, data):\n        # double-quoted f-string so the single-quoted keys parse on Python versions before 3.12\n        return f\"{data['first']}.{data['last']}@mycompany.com\"\n```\n\nRun Pipeline\n```python\nfrom my_app.pipelines.users import UserPipeline\n\nUserPipeline()\n```\n\n### ETL Stages\nStages are infinitely extensible via the ETL decorators and pre/post-processor functions, including both synchronous and asynchronous processors.\n\nDefault stages execute as follows:\n* Upload: validates the file and saves it to disk as is\n* Extract: writes JSON data to disk as received, without edits\n* Transform: uses the model declaration to transform the extracted data and write it to disk\n* Load: inserts records into the corresponding table via a raw transaction\n\n### Logging and Versioning\n\nThe `__etl_data_models` table provides the metadata for each `etl.Model` created within your application, including the stages defined and the hashes of each ETL stage, which are used to determine the pipeline version.\n\n![image](https://raw.githubusercontent.com/Nasdaq/flask-data-pipes/master/static/etl_data_models.png)\n\n\n
The `__etl_data_objects` table maintains a record of each data object processed by your pipeline, including the pipeline versions used to process the data, the file locations for each stage of the data, and the status and timestamps of all executions.\n\n![image](https://raw.githubusercontent.com/Nasdaq/flask-data-pipes/master/static/etl_data_objects.png)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnasdaq%2Fflask-data-pipes","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnasdaq%2Fflask-data-pipes","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnasdaq%2Fflask-data-pipes/lists"}