{"id":18024478,"url":"https://github.com/guo-yong-zhi/distributedtaskqueue","last_synced_at":"2025-04-04T18:28:45.820Z","repository":{"id":46557498,"uuid":"514883387","full_name":"guo-yong-zhi/DistributedTaskQueue","owner":"guo-yong-zhi","description":"distributed task queue based on bash, ssh and flock","archived":false,"fork":false,"pushed_at":"2022-09-29T05:36:22.000Z","size":65,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-02-10T03:44:47.869Z","etag":null,"topics":["flock","ssh","task-queue","task-scheduler"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/guo-yong-zhi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-07-17T15:35:25.000Z","updated_at":"2022-08-11T05:56:33.000Z","dependencies_parsed_at":"2023-01-18T18:45:14.938Z","dependency_job_id":null,"html_url":"https://github.com/guo-yong-zhi/DistributedTaskQueue","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/guo-yong-zhi%2FDistributedTaskQueue","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/guo-yong-zhi%2FDistributedTaskQueue/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/guo-yong-zhi%2FDistributedTaskQueue/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/guo-yong-zhi%2FDistributedTaskQueue/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/guo-yong-zhi","download_url":"https://
codeload.github.com/guo-yong-zhi/DistributedTaskQueue/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247228597,"owners_count":20904898,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["flock","ssh","task-queue","task-scheduler"],"created_at":"2024-10-30T07:12:57.398Z","updated_at":"2025-04-04T18:28:45.801Z","avatar_url":"https://github.com/guo-yong-zhi.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"DistributedTaskQueue\n===\nIf there are a large number of tasks and multiple distributed workers, how can we reasonably allocate them?  \nManually write m tasks to n .sh files and then execute them in n workers respectively? The disadvantages of this static scheme are obvious: 1. Cumbersome; 2. It is difficult to distribute evenly; 3. New tasks cannot be added dynamically; 4. New workers cannot be added dynamically. In short, this manual scheme is not flexible enough.  \nThe ideal solution is to implement a distributed task queue. New tasks are appended to the end of the queue, and workers consume them dynamically.  \nThere are some such tools, but they all use solutions such as Redis to implement distributed locks, so they are very complex, have many dependencies, and are difficult to install and get started.   \nHere is an installation-free, single-file solution with just over 200 lines of code. The scheme is based on `bash`, `ssh` and `python3`. Basic flow: 1.The master node keeps a task list (text file) 2. 
The worker nodes connect to the master via ssh 3.On the master, use the `flock` command to lock the task list file, and use python to edit file, and finally return the task item (as a string) to the worker. 4. Execute specific commands on the worker. python3 is required on master for string processing, but any 3rd party packages are not required. The task queue is a text file. So no special commands are required to manage tasks, and you can edit the file directly. On the master node, no special monitoring process is run, no communication port is occupied. Master knows nothing about workers, and all information is exchanged through a one-way SSH session from the worker to the master.  \n## A simple example\n### Download \nFirst, download the script to each worker.  \n```shell\ncd ~\nwget https://raw.githubusercontent.com/guo-yong-zhi/DistributedTaskQueue/main/runtask.sh\nchmod a+x ~/runtask.sh\n```\nNote: Use `~/runtask.sh -h` to view help information  \n### Create a new task file\nCreate a new task file on the master node disk, such as `~/examplelist.sh`, one task item per line. The master node will not actually execute these commands, commands will be assigned to workers. The master does not need to have a corresponding environment.  \n```shell\necho task1; sleep 3 \necho task2; sleep 3 \necho task3; sleep 3\necho task4; sleep 3 \necho task5; sleep 3 \n```\n### runtask \nThen execute the following command on each worker. New workers can join at any time.  \n```shell\n~/runtask.sh ~/examplelist.sh -m \"master@myhost\"\n```\nor, equivalently:\n```shell\n~/runtask.sh master@myhost:~/examplelist.sh\n```  \nThe positional parameter `~/examplelist.sh` is the file path on the master node, which may not exist on the worker.   \nThe keyword argument `-m` is used to specify the address of the master, please replace the string with your server. The worker must be able to log in to the master with password-free ssh. So you may need to configure ssh key.    
\nDuring the running process, the tasks are executed sequentially from top to bottom. `~/examplelist.sh` will be automatically edited and task items will be consumed (commented out) line by line. Information such as worker-id, running time, etc. will be added. An example is as follows:  \n```shell\n#LASTWORKER 1\n#echo task1; sleep 3 # worker 0 # (07-28 15:52:29 ... 07-28 15:52:33) #ok\n#echo task2; sleep 3 # worker 0 # (07-28 15:52:33 ...\n#echo task3; sleep 3 # worker 1 # (07-28 15:52:34 ...\necho task4; sleep 3 \necho task5; sleep 3 \n# worker 0: myspace-g46kh-25239-worker-0 100.122.27.103  (07-28 15:52:28)\n# worker 1: myspace-khg46-39252-worker-0 100.121.27.101  (07-28 15:52:34)\n```\n## A practical example\nAn example of training a series of deep learning models is as follows. Suppose the directory structure is like:  \n\u003e~/playground/models  \n|- resnet_family   \n|  |- resnet34  \n|  |- resnet50  \n|  |- resnet101  \n|  \n|- mbnet_family  \n|  |- mbnetv1  \n|  |- mbnetv2  \n|  |- mbnetv3  \n\n### Create a new task file\nCreate a new `~/tasklist.sh` as follows:  \n```shell\ncd ~/playground/models/resnet_family #!\ncd resnet34; train_model #:mygroup1\ncd resnet50; train_model\ncd resnet101; train_model #@2\n\ncd ../mbnet_family #!\ncd mbnetv1; train_model\ncd mbnetv2; train_model\ncd mbnetv3; train_model\n\ncd ../resnet_family #!\ncd resnet34; deploy_model #:mygroup1\ncd resnet34; test_model_on dataset1 #+mygroup1\ncd resnet34; test_model_on dataset2 #+mygroup1\ncd resnet34; test_model_on dataset3 #+mygroup1\ncd resnet34; report_test_result #:mygroup1\n```\nLines marked with `#!` are used for environment initialization, and lines without `#!` are specific training/testing tasks.  
\n`#@2` tag here specifies that the big resnet101 will be trained on worker 2 (because my worker 2 has more memory).\n```\n                                +--\u003e test_model_on dataset1 ---+\n                                |                              |\ntrain_model --\u003e deploy_model ---+--\u003e test_model_on dataset2 ---+--\u003e report_test_result\n                                |                              |\n                                +--\u003e test_model_on dataset3 ---+\n```\nTo schedule order-sensitive tasks, we use tags `#:name` (sequential) and `#+name` (parallel).  [see tags](#tags)\n### Configure environment variables  \n```shell\nexport MASTER_SERVER=\"master@myhost\"\n```  \nHere you also need to replace the string with the address of your server. It can be added to `~/.bashrc`, `~/.zshrc` or other configuration files according to the shell you use. [see Environment variables and runtime variables](#environment-variables-and-runtime-variables)  \n### runtask  \nExecute the following command on each worker. Note that `-m` can be omitted here.  \n```shell\n~/runtask.sh ~/tasklist.sh\n```\n## other instructions\n### lock\nManually editing the task file may cause conflicts when the tasks are running (unless you are sure that the worker is busy executing the current task and will not fetch the next task item), so you'd better run a lock command such as `~/runtask.sh ~/tasklist.sh --lock` before editing, and release it with `CTRL-C` after editing. It is ok to append new tasks, delete pending tasks or change their order, but it is best not to add or subtract content before tasks that have already started. (ie change their line numbers)   \n### reset\nDuring the running process, the task file will be edited and added with many comments. If you want to run the task again, execute such as `~/runtask.sh ~/tasklist.sh --reset`, and the task file will be restored to the non-running state.  
Use `--reset k` to partially reset the lines after line `k`.\n### tags \nFour types of tags are supported in task files: `#!`, `#@i`, `#:group1` and `#+group1`.    \n* `#!`    \nLines with `#!` tags will be executed by all possible workers. They run in the worker's own shell, so commands such as `cd` affect the environment, and they are often used for initialization. Lines without `#!` tags will only be executed by one worker (and then commented out), and will be executed in a subshell. Commands such as `cd` do not affect the parent environment, so these lines are typically used to run specific tasks.    \n* `#@`    \nTasks tagged with `#@i` will run only on the specified worker, where `i` is the worker-id. Multiple tags such as `#@1#@2` represent multiple alternative workers. If it is used with `#!`, such as `#!#@1#@2`, each of the listed workers will execute the task. No `#@i` tag means the task can be executed on any worker, which is equivalent to having the tags of all workers.  \n* `#:` and `#+`  \nTags `#:group1` and `#+group1` are used to bind some task lines to one group. \"group1\" can be replaced with any name you like. An empty name means the default group. All tasks have a hidden tag `#+`. A line marked with `#:group1` will not start to run until the tasks above it marked with `#:group1` or `#+group1` (tasks in the same group) have completed and succeeded (labeled with `#ok`). A line marked with `#+group1` will not start to run until the tasks above it marked with `#:group1` have completed and succeeded. That is, `#+xx` marked lines do not wait for each other, so they can be executed in parallel. A line marked with `#:` will wait until all tasks above are completed, and the tasks below it will also wait until it is completed. So, it creates a join point.\n### Environment variables and runtime variables\nEnvironment variables `MASTER_SERVER`, `WORKER_NAME`, `TASK_FILE` can be configured in the worker (not necessary for the master node). 
With environment variables configured, the corresponding parameters can be omitted when running `~/runtask.sh` on the command line. Another environment variable `WORKERID` can be read, but not set. The worker-id is usually generated automatically, but you can also set it via the command line argument `--id` or the runtime variable `newid`.  \nThere are three important runtime variables `newtask`, `newid`  and `jumpto` that can be used to set new task file and worker-id on the fly. `newtask` corresponds to the environment variable `TASK_FILE` and the positional command line parameter, and `newid` corresponds to the environment variable `WORKERID` and command line arguments `-i`, `--id`.   \nUsing these variables, you can achieve dynamic task file jumping. Here is an example (`~/tasklist1.sh`):  \n```shell\necho task1; sleep 3 \necho task2; sleep 3 \necho task3; sleep 3\necho task4; sleep 3\nexit #!#@1\nnewtask=\"~/tasklist2.sh\" #!#@2\nnewtask=\"~/tasklist3.sh\"; newid=$WORKERID #!\necho task5; sleep 3\n```\nNote that the tag `#!` is required when you want to change the execution flow. After task1 to task4 are executed, worker 1 will exit directly; worker 2 will jump to `~/tasklist2.sh`, and will be reassigned a new worker-id; other workers will jump to `~/tasklist3.sh` and keep the worker-id unchanged. task5 will never be executed.  Setting both `newtask` and `newid` is a common pattern, so here's a shorthand `jumpto`. That is, `newtask=\"~/tasklist3.sh\"; newid=$WORKERID #!` is equivalent to `jumpto=\"~/tasklist3.sh\" #!`\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fguo-yong-zhi%2Fdistributedtaskqueue","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fguo-yong-zhi%2Fdistributedtaskqueue","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fguo-yong-zhi%2Fdistributedtaskqueue/lists"}