{"id":16734950,"url":"https://github.com/giacbrd/smartpipeline","last_synced_at":"2025-03-21T21:31:32.371Z","repository":{"id":43360686,"uuid":"147200043","full_name":"giacbrd/SmartPipeline","owner":"giacbrd","description":"A framework for rapid development of robust data pipelines following a simple design pattern","archived":false,"fork":false,"pushed_at":"2024-02-26T21:30:23.000Z","size":402,"stargazers_count":22,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-04-27T02:42:28.497Z","etag":null,"topics":["data-analysis","data-analytics","data-mining","data-pipelines","data-processing","data-science","dataops","design-patterns","etl","machine-learning","mlops","pipeline","pipeline-framework","pipelines","reproducibility","task-queue","workflow"],"latest_commit_sha":null,"homepage":"https://smartpipeline.readthedocs.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"lgpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/giacbrd.png","metadata":{"files":{"readme":"README.rst","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"COPYING","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-09-03T12:11:11.000Z","updated_at":"2024-06-12T16:43:59.773Z","dependencies_parsed_at":"2024-01-14T17:02:47.432Z","dependency_job_id":"f42495d8-9cf6-477f-8ca0-8a4094327a12","html_url":"https://github.com/giacbrd/SmartPipeline","commit_stats":{"total_commits":275,"total_committers":3,"mean_commits":91.66666666666667,"dds":0.2581818181818182,"last_synced_commit":"4e3f78a12abb6832c42ccc65edbee4c79bcc19f8"},"previous_names":[],"tags_count":18,"template":false,"template_full_name":null,"repository_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/giacbrd%2FSmartPipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/giacbrd%2FSmartPipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/giacbrd%2FSmartPipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/giacbrd%2FSmartPipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/giacbrd","download_url":"https://codeload.github.com/giacbrd/SmartPipeline/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244874283,"owners_count":20524576,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","data-analytics","data-mining","data-pipelines","data-processing","data-science","dataops","design-patterns","etl","machine-learning","mlops","pipeline","pipeline-framework","pipelines","reproducibility","task-queue","workflow"],"created_at":"2024-10-13T00:04:25.464Z","updated_at":"2025-03-21T21:31:31.995Z","avatar_url":"https://github.com/giacbrd.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"SmartPipeline\n-------------\n\nA framework for rapid development of robust data pipelines following a simple design pattern\n\n.. figure:: https://imgs.xkcd.com/comics/data_pipeline.png\n   :alt: pipeline comic\n\n   from https://xkcd.com\n\n.. 
image:: https://readthedocs.org/projects/smartpipeline/badge/?version=stable\n   :target: https://smartpipeline.readthedocs.io/en/stable/?badge=stable\n   :alt: Documentation Status\n\n.. image:: https://github.com/giacbrd/SmartPipeline/actions/workflows/tests.yml/badge.svg?branch=master\n   :target: https://github.com/giacbrd/SmartPipeline/actions/workflows/tests.yml\n   :alt: Tests\n\n.. image:: https://coveralls.io/repos/github/giacbrd/SmartPipeline/badge.svg?branch=master\n   :target: https://coveralls.io/github/giacbrd/SmartPipeline?branch=master\n   :alt: Tests Coverage\n\n\n.. documentation-marker\n\nSmartPipeline gives you the tools to design and formalize simple data pipelines,\nin which tasks are sequentially encapsulated in pipeline stages.\n\nPipelines are straightforward to implement,\nyet deeply customizable:\nstages can run concurrently and scale on heavy tasks,\nthey can process batches of items at once,\nand executions and errors can be monitored easily.\n\nIt is a framework for engineering sequences of data operations\nand making them concurrent, following an optimized but transparent producer-consumer pattern.\nIt is an excellent solution for fast and clean data analysis prototypes (small/medium projects and POCs)\nand also for production code, as an alternative to plain scripts.\nConsider it for problems where big task queues and workflow frameworks are overkill.\nNo dependencies are required.\n\nInstall\n~~~~~~~\n\nInstall from PyPI; no dependencies will be installed:\n\n.. code-block:: bash\n\n   pip install smartpipeline\n\nWriting your pipeline\n~~~~~~~~~~~~~~~~~~~~~\n\nSmartPipeline is designed to help the developer follow best practices;\nthe design is based on industrial experience with data products.\n\nSmartPipeline focuses on simplicity and efficiency in handling data locally,\ni.e. 
serialization and copies of the data are minimized.\n\nMain features:\n\n- Define a pipeline object as a sequence of stateful stage objects\n  and optionally set a source on which the pipeline iterates.\n- A pipeline can run indefinitely on the source or it can be used to process single items.\n- Concurrency can be set independently for each stage and single items can be processed asynchronously.\n- A stage can be designed for processing batches, i.e. sequences of consecutive items, at once.\n- Custom error handling can be set for logging and monitoring at stage level.\n\nThe following is an example of a trivial pipeline that retrieves news from a feed\nand generates text embeddings of the raw page content.\nWe define the source of the data and two stages, then we build and run the pipeline.\n\n.. code-block:: python\n\n    class FeedReader(Source):\n        def __init__(self):\n            feed = feedparser.parse(\"https://hnrss.org/newest\")\n            self.urls = (entry.link for entry in feed.entries)\n\n        # the pop method generates a new data item when called\n        def pop(self):\n            # each call of pop consumes a url to send to the pipeline\n            url = next(self.urls, None)\n            if url is not None:\n                item = Item()\n                item.data[\"url\"] = url\n                return item\n            # when all urls are consumed we stop the pipeline\n            else:\n                self.stop()\n\n\n    class NewsRetrieve(Stage):\n        def process(self, item):\n            # add the page content to each item,\n            # HTTP errors will be implicitly handled by the pipeline error manager\n            html = requests.get(item.data[\"url\"]).text\n            item.data[\"content\"] = re.sub('\u003c.*?\u003e', '', html).strip()\n            return item\n\n\n    class NewsEmbedding(BatchStage):\n        def __init__(self, size: int):\n            super().__init__(size)\n            self.model = 
SentenceTransformer(\"all-MiniLM-L6-v2\")\n\n        def process_batch(self, items):\n            # efficiently compute embeddings by batching page texts,\n            # instead of processing one page at a time\n            vectors = self.model.encode([item.data[\"content\"] for item in items])\n            for vector, item in zip(vectors, items):\n                item.data[\"vector\"] = vector\n            return items\n\n\n    pipeline = (\n        Pipeline()\n        .set_source(FeedReader())\n        # by using multi-threaded (default) concurrency we speed up multiple HTTP calls\n        .append(\"retriever\", NewsRetrieve(), concurrency=4)\n        # each batch of items to vectorize will be of size 10\n        .append(\"vectorizer\", NewsEmbedding(size=10))\n        .build()\n    )\n\n\n    for item in pipeline.run():\n        print(item)\n\n`Read the documentation \u003chttps://smartpipeline.readthedocs.io\u003e`_ for an exhaustive guide.\n\nThe `examples` folder contains full working sample pipelines.\n\nFuture improvements:\n\n- Memory profiling of stages.\n- Caching of processed items at stage level.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgiacbrd%2Fsmartpipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgiacbrd%2Fsmartpipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgiacbrd%2Fsmartpipeline/lists"}