{"id":16731632,"url":"https://github.com/missinglink/pipeline","last_synced_at":"2025-07-01T08:33:49.177Z","repository":{"id":17923440,"uuid":"20892765","full_name":"missinglink/pipeline","owner":"missinglink","description":"distributed non-buffering data pipeline with built in orchestrator and flood control (alpha)","archived":false,"fork":false,"pushed_at":"2014-06-19T19:44:38.000Z","size":689,"stargazers_count":2,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-15T18:26:04.222Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/missinglink.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-06-16T16:56:50.000Z","updated_at":"2017-07-27T15:28:58.000Z","dependencies_parsed_at":"2022-08-30T08:31:43.370Z","dependency_job_id":null,"html_url":"https://github.com/missinglink/pipeline","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/missinglink/pipeline","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/missinglink%2Fpipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/missinglink%2Fpipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/missinglink%2Fpipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/missinglink%2Fpipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/missinglink","download_url":"https://codeload.github.com/missinglink/pipeline/tar.gz/refs/h
eads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/missinglink%2Fpipeline/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261255616,"owners_count":23131473,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-12T23:38:14.870Z","updated_at":"2025-06-22T07:33:07.218Z","avatar_url":"https://github.com/missinglink.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n## Pipeline\n\nA distributed non-buffering data pipeline with a built-in orchestrator and flood control.  \n\n====\n\n#### Overview\n\nA pipeline is a set of workers acting both in series and in parallel. The data is constantly moving down the pipeline from one server to the next. \n    \nThe pipeline auto-balances, turning off the tap when the pipes or sink can't handle the flow. Each worker can disconnect and reconnect if it is experiencing flooding or internal errors. This mitigates memory and disk issues and isolates them to a single node.\n\n====\n\n#### Notice\n\n**In active development** - The public API will likely change before the initial release.  \n  \n**Unpublished** - calls to `require('pipeline')` will fail as the module is not yet published to `npm`. Additionally, the name `pipeline` is already taken, so the module will be published under a different name.  
\n  \nRun `npm run symlink` to create a symlink that fixes this during development.\n    \n====\n\n#### Inspired by unix pipes and zeromq\n  \n`unix pipes` provide an amazingly easy-to-use and portable API.\n\n```bash  \n#unix pipes  \necho '{ hello: \"world\" }' | filter.sh 2\u003e\u003e error.txt | map.sh 1\u003e out.txt 2\u003e\u003e error.txt  \n```\n\n`zeromq (ØMQ)` is an asynchronous messaging library.\n\n![zeromq](http://learning-0mq-with-pyzmq.readthedocs.org/en/latest/_images/pushpull.png)\n\n====\n\n#### Pipeline\n\n`pipeline` aims to provide an `API` similar to unix pipes, with support for `TCP` sockets, while also offering:\n\n- The ability to attach **multiple processes** to `stdin`, `stdout` and `stderr`.\n- Smart **flood control** mechanisms to avoid buffering data at any branch of the pipe.\n- Role-based workflows which allow simple ways to **perform tasks in parallel** or **in series**. \n\n====\n\n#### Orchestrator\n  \n`pipeline` allows you to create worker processes which run anywhere on your network. Rather than assigning your workers fixed addresses, `pipeline` allows you to delegate addressing to a process called the `orchestrator`.  \n  \nAn example `orchestrator` is provided with the package. You can use this type of `socket` to assign work to other `Worker` processes like this:  \n  \n```javascript\nvar pipeline = require('pipeline');\n\n// describe how the roles connect: tap -\u003e filter -\u003e map -\u003e sink\nvar orchestrator = new pipeline.Orchestrator(\n  new pipeline.Pipeline()\n    .from('tap').to('filter')\n    .from('filter').to('map')\n    .from('map').to('sink')\n);\n\n// workers reach the orchestrator on this port\norchestrator.bind(5000);\n```\n    \n====\n    \n#### Worker\n\nEach worker must be assigned a `role`. A worker's network address doesn't matter and can change without causing the entire pipeline to fail.  \n  \nAs with unix sockets, the worker does not need to know where the data comes from or where it is going next; the `orchestrator` will tell it which other `Worker` sockets to connect to.  
\n  \nExample worker:  \n  \n```javascript  \nvar pipeline = require('pipeline');\n\nvar worker = new pipeline.Worker({\n  role: 'filter',\n  concurrency: 10,\n  orchestrator: { port: 5000 }\n});\n\n// receive work from upstream\nworker.on( 'data', function( msg, done ){\n\n  worker._debug( 'worker2 got message', msg );\n  \n  // worker must call done() when task is completed\n  // doSomethingAsync is a placeholder for your own asynchronous task\n  doSomethingAsync( { cmd: 'takes_time', msg: msg }, function( err, data ){  \n    \n    // send some work downstream and pass done handler to write socket\n    worker.write( data, done );\n\n  });\n\n});\n```  \n  \nThe worker will automatically handle concurrency control; when the maximum number of concurrent jobs is being executed on this process, the `stdin` socket(s) will disconnect.  \n  \nWhen the worker is again free to process data, it will automatically reconnect its `stdin` socket(s) and start processing messages again. \n  \n====  \n  \n#### Trying out the project  \n  \nYou can try out the project in its current form; while the code is not release-ready yet, there IS a functional demo that runs all the workers in child processes and pipes all their `stdout` streams to one window for easy debugging.  \n  \n```bash  \n$\u003e git clone git@github.com:missinglink/pipeline.git \u0026\u0026 cd pipeline\n$\u003e npm install  \n$\u003e npm run symlink  \n$\u003e npm start\n```\n\n==== \n\n#### Example\n  \nIn this example, we want to parse a file of 10M user records. For each `user` in the file we want to look up their Facebook profile and Twitter profile, and then save the record to the `database`.  
\n  \n###### workers  \n\n- orchestrator (singleton)  \n- file parser (1 worker)\n- facebook (10 workers)  \n- twitter (10 workers)  \n- database client (2 workers)\n  \nThen we tell the `orchestrator` how to connect them together:\n\neither **in series**:\n\n```\n         ┌─→ facebook ─→ twitter ──┐\nparser ──┼─→ facebook ─→ twitter ──┼─→ database_client\n         └─→ facebook ─→ twitter ──┘\n```\n\n```javascript\nnew pipeline.Pipeline()\n  .from('parser').to('facebook')\n  .from('facebook').to('twitter')\n  .from('twitter').to('database_client');\n```\n\nor **in parallel**:\n\n```\n         ┌─→ facebook ──┐\n         ├─→ facebook ──┤\nparser ──┤              ├─→ merger ─→ database_client\n         ├─→ twitter ───┤\n         └─→ twitter ───┘\n```\n\n```javascript\nnew pipeline.Pipeline()\n  .from('parser').to('facebook').from('facebook').to('merger')\n  .from('parser').to('twitter').from('twitter').to('merger')\n  .from('merger').to('database_client');\n```\n\n... simple as that. The pipeline will load-balance each role. Workers will slow down and speed up depending on the ability of the third-party services to fulfil the requests.\n\n**Note:** The `parser` should pause iteration when no peer sockets are connected.\n\n**Note:** The `merger` should be configured with a high concurrency value.\n\n**Note:** The `database_client` should call `worker.pause()` if the database starts to become slow or unresponsive.\n\n\n====  \n  \n#### FAQ      \n       \n**Q. How does this differ from a traditional job queue?**\n  \nA queue system has a single centralized server to store messages until worker nodes collect them.  \n  \nA queue is limited by the available memory and disk space, which can become a problem when dealing with very large datasets.\n      \n**Q. 
So you stream from the orchestrator to the available workers and then stream the responses back to the orchestrator?**\n  \nNo, the orchestrator **ONLY** tells workers where to attach their `stdin` streams; it does not do any work and does not normally see any of the data.\n\n**Q. How does the worker know which port to bind to?** \n  \nEach worker binds its `stdout` stream(s) to `INADDR_ANY` (any available port).  \n  \nThe worker then connects to the `orchestrator` and announces its `role` and the network address that peers can connect to if they wish to consume its output.  \n  \nConventional logic would suggest you `bind` your `stdin` and `connect` on your `stdout`.\n\nUsing the inverse allows the worker to `disconnect` its `stdin` socket(s) when it starts to flood while keeping the port it has bound for `stdout`.\n     \n====\n  \n... more to come\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmissinglink%2Fpipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmissinglink%2Fpipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmissinglink%2Fpipeline/lists"}