https://github.com/byzheng/taskqueue
Task Queue in R based on PostgreSQL Database
- Host: GitHub
- URL: https://github.com/byzheng/taskqueue
- Owner: byzheng
- License: other
- Created: 2023-11-29T21:14:02.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-11-28T00:36:55.000Z (about 1 year ago)
- Last Synced: 2024-11-28T01:24:47.333Z (about 1 year ago)
- Topics: parallel-computing, postgresql, r, task-queue
- Language: R
- Homepage: https://taskqueue.bangyou.me/
- Size: 4.81 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: NEWS.md
- License: LICENSE
- Code of conduct: .github/CODE_OF_CONDUCT.md
# taskqueue
`taskqueue` implements a task queue in R for asynchronous tasks, backed by a [PostgreSQL](https://www.postgresql.org/) database. The package is only suitable for parallel computing without any communication among parallel tasks (i.e. [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel) workloads).
## Challenge of parallel computing in R
Several R packages are used for parallel computing (see [High-Performance and Parallel Computing with R](https://cran.r-project.org/web/views/HighPerformanceComputing.html)), but they are not well suited to asynchronous tasks:
* Uneven load among cores/workers: some workers finish faster than others and then sit idle waiting for the rest.
* Workers that become available after a parallel run has started cannot be utilised.
`taskqueue` is designed to utilise all available computing resources until all tasks are finished, by dynamically allocating tasks to workers.
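The benefit of dynamic allocation can be sketched with a toy simulation in plain R (no `taskqueue` involved): static chunking assigns tasks to workers up front, while a shared queue keeps giving the next task to whichever worker is free.

```r
# Toy illustration: one slow task among quick ones, split across 2 workers.
task_costs <- c(5, 1, 1, 1, 1, 1)

# Static split: worker 1 gets tasks 1-3, worker 2 gets tasks 4-6.
static_makespan <- max(sum(task_costs[1:3]), sum(task_costs[4:6]))

# Dynamic queue: each next task goes to the least-loaded worker.
loads <- c(0, 0)
for (cost in task_costs) {
  i <- which.min(loads)
  loads[i] <- loads[i] + cost
}
dynamic_makespan <- max(loads)

static_makespan   # 7: worker 2 idles while worker 1 finishes the slow chunk
dynamic_makespan  # 5: both workers stay busy until the end
```

This is only a scheduling sketch; `taskqueue` achieves the same effect by letting workers pull tasks from the database queue as they become free.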
## Installation
Install the development version from [GitHub](https://github.com/byzheng/taskqueue).
```r
devtools::install_github('byzheng/taskqueue')
```
## PostgreSQL installation and configuration
As a large number of concurrent requests may reach the database simultaneously, it is preferable to install PostgreSQL on a dedicated server that all workers can connect to. Follow the [PostgreSQL website](https://www.postgresql.org/download/) to install and configure PostgreSQL. A database should be created for each user.
On every worker, five environment variables should be added to the `.Renviron` file to connect to the database, i.e.
```
PGHOST=
PGPORT=
PGUSER=
PGPASSWORD=
PGDATABASE=
```
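To confirm the variables are actually visible to an R session on a worker, a quick check with base R (nothing specific to `taskqueue`) is:

```r
# Report any PostgreSQL connection variables missing from the environment
vars <- c("PGHOST", "PGPORT", "PGUSER", "PGPASSWORD", "PGDATABASE")
missing <- vars[!nzchar(Sys.getenv(vars))]
if (length(missing) > 0) {
  message("Missing in .Renviron: ", paste(missing, collapse = ", "))
} else {
  message("All PostgreSQL connection variables are set.")
}
```

Remember that `.Renviron` is read at R startup, so restart the session after editing it.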
## Resource
A computing resource is defined as a facility/computer which can run multiple jobs/workers.
A new resource can be added with `resource_add` and the following configurations:
* `name`: resource name
* `type`: resource type
* `host`: network name used to access the resource
* `nodename`: node name as returned by `Sys.info()` on the resource
* `workers`: maximum number of available cores on the resource
* `log_folder`: folder to store log files on the resource
Currently only `slurm` is supported, although I plan to implement other types of resource.
```r
resource_add(name = "hpc",
             type = "slurm",
             host = "hpc.example.com",
             nodename = "hpc",
             workers = 500,
             log_folder = "/home/user/log_folder/")
```
`log_folder` is important for troubleshooting; log files are split by `project`. It is better to store them on a high-speed drive because of the frequent I/O from writing log files.
## Project
`taskqueue` manages tasks by project; each project has its own resources, working directory, runtime requirements and other configurations.
A project with a unique name can be created with the `project_add` function, which also assigns common requirements (e.g. memory).
```r
project_add("test_project", memory = 20)
```
Assign a `resource` to a `project`; the resource is then used to schedule workers with configurations such as the working directory, total runtime and project-specific workers.
```r
project_resource_add(project = "test_project",
                     resource = "hpc")
```
Now tasks can be added to the database with the `task_add` function, e.g.
```r
task_add("test_project", num = 100, clean = TRUE)
```
Next, develop a function to actually perform the data processing. It should:
* take the task `id` as its first argument,
* return no value,
* save the final output to the file system and check whether the task has already been finished.
Finally, call `worker` with the project name to run the wrapper function.
```r
library(taskqueue)
fun_test <- function(i) {
    # Skip tasks whose output file already exists
    out_file <- sprintf("%s.Rds", i)
    if (file.exists(out_file)) {
        return(invisible(NULL))
    }
    # Perform the actual computing
    # ....
    # Save the final output
    saveRDS(i, out_file)
}
worker("test_project", fun_test)
```
After developing and testing the wrapper function, save it into a file (e.g. `rcode.R`) and schedule it to run with the following functions:
* Reset all (or only `failed`/`working`) tasks to `idle` status with `project_reset`
* Start the project to allow workers to consume tasks with `project_start`
* Schedule workers on resources (e.g. `worker_slurm` for a slurm cluster)
* Check task status with `task_status`
* Stop the project with `project_stop`
```r
# Reset status for all tasks in a project
project_reset("test_project")
# Start the project
project_start("test_project")
# Schedule workers on the slurm resource `hpc`
worker_slurm("test_project", "hpc", "rcode.R", modules = "sqlite/3.43.1")
# Check status of all tasks
task_status("test_project")
# Stop the project
project_stop("test_project")
```
A task has one of four statuses:
* `idle`: the task is not running.
* `working`: the task is running on one of the workers.
* `failed`: the task failed for some reason; check the log folder for troubleshooting.
* `finished`: the task finished without errors.
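A project can be considered done when no tasks remain `idle` or `working`. As a sketch, assuming the status summary is available as a data frame with `status` and `count` columns (this shape is an assumption for illustration, not documented API; check the actual output of `task_status()` in your installation):

```r
# Hypothetical helper: treat a project as done when no tasks are
# idle or working. The data frame columns `status` and `count` are
# assumed for illustration.
all_done <- function(status_df) {
  active <- status_df$count[status_df$status %in% c("idle", "working")]
  sum(active) == 0
}

# Example with a mocked status table
status_df <- data.frame(status = c("finished", "failed"), count = c(98, 2))
all_done(status_df)  # TRUE: nothing left idle or working
```

A helper like this could be called periodically after `worker_slurm` to decide when it is safe to call `project_stop`.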