{"id":29015314,"url":"https://github.com/abdur75648/multithreaded-pipeline-manager-python","last_synced_at":"2025-10-10T21:39:47.142Z","repository":{"id":292892255,"uuid":"982289752","full_name":"abdur75648/multithreaded-pipeline-manager-python","owner":"abdur75648","description":"Lightweight, deadlock-free multithreaded pipeline framework for fast, modular Python data and ML model workflows. Easily extensible for real-time or batch processing tasks.","archived":false,"fork":false,"pushed_at":"2025-05-18T19:59:35.000Z","size":2018,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-08-12T06:25:07.566Z","etag":null,"topics":["model-pipeline","modular","multi-process","multi-threading","multithreaded","python","python-thread-manager","python-threading"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/abdur75648.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-12T16:51:49.000Z","updated_at":"2025-06-07T11:37:27.000Z","dependencies_parsed_at":"2025-08-12T06:14:56.539Z","dependency_job_id":null,"html_url":"https://github.com/abdur75648/multithreaded-pipeline-manager-python","commit_stats":null,"previous_names":["abdur75648/multithreaded-pipeline-manager-python"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/abdur75648/multithreaded-pipeline-manager-python","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abdur75648%2Fmultithreaded-
pipeline-manager-python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abdur75648%2Fmultithreaded-pipeline-manager-python/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abdur75648%2Fmultithreaded-pipeline-manager-python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abdur75648%2Fmultithreaded-pipeline-manager-python/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/abdur75648","download_url":"https://codeload.github.com/abdur75648/multithreaded-pipeline-manager-python/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abdur75648%2Fmultithreaded-pipeline-manager-python/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279005418,"owners_count":26083883,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-10T02:00:06.843Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["model-pipeline","modular","multi-process","multi-threading","multithreaded","python","python-thread-manager","python-threading"],"created_at":"2025-06-25T21:05:45.943Z","updated_at":"2025-10-10T21:39:47.110Z","avatar_url":"https://github.com/abdur75648.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Multithreaded Pipeline Manager in 
Python\n\n![Multithreaded Pipeline Manager](assets/demo.png)\n\nA reusable, beginner-friendly code template for creating generic multithreaded data processing pipelines in Python. This repository demonstrates how to build robust pipelines that can handle pre-processing, model inference (simulated), and post-processing stages concurrently, with support for parallelizing CPU-bound sub-tasks within a stage.\n\n## Table of Contents\n\n- [Motivation](#motivation)\n- [Core Design Concepts](#core-design-concepts)\n  - [Threads vs. Processes](#threads-vs-processes)\n  - [Queues for Inter-Thread Communication](#queues-for-inter-thread-communication)\n  - [Sentinel Values for Graceful Shutdown](#sentinel-values-for-graceful-shutdown)\n  - [Stopping Events for Error Handling](#stopping-events-for-error-handling)\n  - [ThreadPoolExecutor for Parallel Sub-tasks](#threadpoolexecutor-for-parallel-sub-tasks)\n  - [Deadlock and Race Condition Prevention](#deadlock-and-race-condition-prevention)\n  - [Modularity and Reusability](#modularity-and-reusability)\n- [Repository Tour](#repository-tour)\n  - [File Structure](#file-structure)\n- [Quick Start](#quick-start)\n- [How It Works](#how-it-works)\n  - [High-Level Diagram](#high-level-diagram)\n  - [Stage-by-Stage Explanation](#stage-by-stage-explanation)\n  - [Code Walkthrough (Key Components)](#code-walkthrough-key-components)\n- [Extending to Real Workloads](#extending-to-real-workloads)\n- [Troubleshooting Cheatsheet](#troubleshooting-cheatsheet)\n- [License](#license)\n- [Contributing](#contributing)\n\n## Motivation\n\nMany data processing and machine learning tasks involve a sequence of steps (a pipeline). Running these steps sequentially can be slow, especially if some steps are I/O-bound or CPU/GPU-bound. 
Multithreading allows different stages of the pipeline to run concurrently, improving throughput and reducing overall processing time.\n\nThis repository helps you:\n- Understand the fundamentals of multithreaded pipelines.\n- Implement robust pipelines with proper error handling.\n- Parallelize CPU-bound sub-tasks using `ThreadPoolExecutor`.\n- Adapt a generic structure for real-world projects.\n\n## Core Design Concepts\n\n### Threads vs. Processes\n- **Threads** share memory and are lightweight, which makes them ideal for I/O-bound or GIL-releasing tasks.\n- **Processes** are isolated, which makes them better for CPU-bound tasks written in pure Python.\n- This template uses **threads**, assuming the heavy work is I/O or runs in C-based libraries that release the GIL.\n\n### Queues for Inter-Thread Communication\n- `queue.Queue` is used for thread-safe communication between stages.\n- Stages follow a **producer-consumer** pattern.\n- A bounded `maxsize` provides **backpressure**: a full queue blocks the producer until the consumer catches up.\n\n### Sentinel Values for Graceful Shutdown\n- A unique object like `SENTINEL = object()` signals the end of data.\n- Workers propagate `SENTINEL` downstream and exit cleanly.\n\n### Stopping Events for Error Handling\n- A shared `threading.Event` (`stop_event`) lets workers shut down gracefully on error.\n\n### ThreadPoolExecutor for Parallel Sub-tasks\n- Used within a stage (e.g. 
post-processing) to run multiple sub-tasks in parallel.\n\n### Deadlock and Race Condition Prevention\n- Clear data flow with `Queue`.\n- `robust_put()` handles blocking puts with timeout and stop signal.\n- Shared state should use `threading.Lock`.\n\n### Modularity and Reusability\n- Create new stages by subclassing `BaseWorker`.\n- `PipelineManager` handles orchestration.\n\n## Repository Tour\n\n### File Structure\n\n```\nmultithreaded-pipeline-manager-python/\n├── pipeline_core/\n│   ├── __init__.py\n│   ├── pipeline_manager.py\n│   └── utils.py\n├── demo.py\n└── README.md\n```\n\n- **`pipeline_core/pipeline_manager.py`**: `PipelineManager` and `BaseWorker` classes.\n- **`pipeline_core/utils.py`**: Utility functions (`robust_put`, `SENTINEL`, logger).\n- **`demo.py`**: Demonstrates building and running a pipeline.\n\n## Quick Start\n\n1. Clone or copy files:\n\n```bash\n# git clone https://github.com/yourusername/multithreaded-pipeline-manager-python.git\n# cd multithreaded-pipeline-manager-python\n```\n\n2. Ensure Python 3.8+ is installed.\n\n3. 
Run the demo:\n\n```bash\npython demo.py\n```\n\n## How It Works\n\n### High-Level Diagram\n\n```\nInput Data --\u003e Queue 1 --\u003e PreProcessingWorker --\u003e Queue 2 --\u003e ModelInferenceWorker --\u003e Queue 3 --\u003e PostProcessingWorker --\u003e Final Output\n                                                (Simulated)                            (with ThreadPoolExecutor for sub-tasks)\n```\n\n### Stage-by-Stage Explanation\n\n* **Data Producer**: Feeds items into the first queue.\n* **PreProcessingWorker**: Simulates data prep.\n* **ModelInferenceWorker**: Simulates model inference, may raise errors.\n* **PostProcessingWorker**: Submits sub-tasks to a thread pool, buffers and orders results.\n\n### Code Walkthrough (Key Components)\n\n#### `PipelineManager`\n\n* Manages `stop_event`, progress bar, workers, and queues.\n* `add_worker()`, `start()`, `wait_for_completion()`.\n\n#### `BaseWorker`\n\n* Implements `_run()` and requires `process_item(item)` to be defined.\n* Handles queue communication and `SENTINEL` passing.\n\n#### `robust_put` and `SENTINEL`\n\n* Handles blocked puts safely with timeouts and stop checks.\n\n#### Parallel Post-Processing\n\n* `PostProcessingWorker` uses a thread pool for parallel sub-tasks.\n* Buffers and re-orders results for consistent output.\n\n## Extending to Real Workloads\n\n* Subclass `BaseWorker` to define custom stages.\n* Use `PipelineManager` to build your pipeline.\n* Handle exceptions inside `process_item`.\n* Use `robust_put` to safely queue items.\n* Tune performance via queue sizes, worker count, etc.\n\n## Troubleshooting Cheatsheet\n\n| Symptom                   | Cause                                              | Fix                                                                 |\n| ------------------------- | -------------------------------------------------- | ------------------------------------------------------------------- |\n| Pipeline hangs            | Blocked producer or consumer died        
          | Use `robust_put`, ensure `SENTINEL` is passed, avoid circular waits |\n| Data not processed        | `SENTINEL` not propagated or workers exit early    | Ensure each stage sends and reacts to `SENTINEL` properly           |\n| Race conditions           | Shared state accessed without lock                 | Use `threading.Lock`                                                |\n| High idle CPU             | Busy-waiting threads                               | Use `queue.get(timeout=...)`                                        |\n| Errors in sub-tasks       | Exceptions in executor workers                     | Catch and handle exceptions in sub-task functions                   |\n| Progress bar not updating | Missing `pbar.update()` or wrong `num_total_items` | Ensure correct configuration                                        |\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## Contributing\n\nFeel free to fork and submit pull requests to improve functionality or documentation. Suggestions and issues are welcome.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabdur75648%2Fmultithreaded-pipeline-manager-python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabdur75648%2Fmultithreaded-pipeline-manager-python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabdur75648%2Fmultithreaded-pipeline-manager-python/lists"}