https://github.com/byzheng/taskqueue

Task Queue in R based on PostgreSQL Database

parallel-computing postgresql r task-queue

Host: GitHub
URL: https://github.com/byzheng/taskqueue
Owner: byzheng
License: other
Created: 2023-11-29T21:14:02.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2024-11-28T00:36:55.000Z (about 1 year ago)
Last Synced: 2024-11-28T01:24:47.333Z (about 1 year ago)
Topics: parallel-computing, postgresql, r, task-queue
Language: R
Homepage: https://taskqueue.bangyou.me/
Size: 4.81 MB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: NEWS.md
- License: LICENSE
- Code of conduct: .github/CODE_OF_CONDUCT.md

# taskqueue

[![](https://www.r-pkg.org/badges/version/taskqueue?color=green)](https://cran.r-project.org/package=taskqueue)

![R-CMD-check](https://github.com/byzheng/taskqueue/workflows/R-CMD-check/badge.svg) 

[![](http://cranlogs.r-pkg.org/badges/grand-total/taskqueue?color=green)](https://cran.r-project.org/package=taskqueue)

[![](http://cranlogs.r-pkg.org/badges/last-month/taskqueue?color=green)](https://cran.r-project.org/package=taskqueue)

[![](http://cranlogs.r-pkg.org/badges/last-week/taskqueue?color=green)](https://cran.r-project.org/package=taskqueue)

`taskqueue` is an R package that implements a task queue for asynchronous tasks, backed by a [PostgreSQL](https://www.postgresql.org/) database. It is only suitable for parallel computing in which tasks do not communicate with each other (i.e. [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel) workloads).

## Challenge of parallel computing in R

Several R packages are available for parallel computing (see the CRAN task view [High-Performance and Parallel Computing with R](https://cran.r-project.org/web/views/HighPerformanceComputing.html)), but they are not suitable for asynchronous tasks:

* Load is uneven among cores/workers. Some workers finish sooner than others and then sit idle until the rest stop.

* Workers that become available after a parallel job has started cannot be used.

`taskqueue` is designed to use all available computing resources until all tasks are finished, by dynamically allocating tasks to workers.
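The effect of uneven load can be illustrated with a toy simulation in base R. This is only a sketch of the general idea, not how `taskqueue` is implemented: the task durations are made up, and `static_makespan` and `dynamic_makespan` are hypothetical helper names.

```r
# Hypothetical task durations (arbitrary time units): four slow tasks first
durations <- c(10, 10, 10, 10, 1, 1, 1, 1, 1, 1, 1, 1)
n_workers <- 3

# Static allocation: tasks are split into contiguous chunks up front,
# so total time is set by the most heavily loaded worker
static_makespan <- function(durations, n_workers) {
    n <- length(durations)
    chunk <- rep(seq_len(n_workers), each = ceiling(n / n_workers),
                 length.out = n)
    max(tapply(durations, chunk, sum))
}

# Dynamic allocation: each task goes to the worker that becomes free first
dynamic_makespan <- function(durations, n_workers) {
    busy_until <- numeric(n_workers)
    for (d in durations) {
        w <- which.min(busy_until)
        busy_until[w] <- busy_until[w] + d
    }
    max(busy_until)
}

static_makespan(durations, n_workers)   # 40
dynamic_makespan(durations, n_workers)  # 20
```

With static chunks, one worker receives all four slow tasks while the other two sit idle for most of the run. Handing out tasks one at a time halves the total time in this example.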

## Installation 

Install the development version from [GitHub](https://github.com/byzheng/taskqueue).

```r
devtools::install_github('byzheng/taskqueue')
```

## PostgreSQL installation and configuration

Because a large number of concurrent requests may reach the database at the same time, PostgreSQL is best installed on a dedicated server that all workers can connect to. Follow the instructions on the [PostgreSQL website](https://www.postgresql.org/download/) to install and configure PostgreSQL. A database should be created for each user.

On every worker, add five environment variables to the `.Renviron` file so that the worker can connect to the database:

```
PGHOST=
PGPORT=
PGUSER=
PGPASSWORD=
PGDATABASE=
```
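Because `.Renviron` is read when R starts, a new R session is needed before the variables become visible. A minimal sketch for checking that all five are set, using base R only; `missing_pg_vars` is a hypothetical helper and not part of `taskqueue`:

```r
# Names of the five variables listed above
pg_vars <- c("PGHOST", "PGPORT", "PGUSER", "PGPASSWORD", "PGDATABASE")

# Hypothetical helper (not part of taskqueue): return the variables that
# are unset or empty in the current R session
missing_pg_vars <- function(vars = pg_vars) {
    values <- Sys.getenv(vars, unset = "")
    vars[!nzchar(values)]
}

missing <- missing_pg_vars()
if (length(missing) > 0) {
    message("Missing in .Renviron: ", paste(missing, collapse = ", "))
} else {
    message("All PostgreSQL connection variables are set.")
}
```

This only confirms that the variables exist. It does not test that the database accepts the connection.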

## Resource

A computing resource is defined as a facility/computer which can run multiple jobs/workers.

A new resource is added with `resource_add`, which takes the following settings:

* `name`: resource name

* `type`: resource type

* `host`: network name used to access the resource

* `nodename`: node name, as reported by `Sys.info()` on the resource

* `workers`: maximum number of available cores in the resource

* `log_folder`: folder in which log files are stored on the resource

Currently only `slurm` is supported as a resource type, although I plan to implement other types of resource.

```r
resource_add(name = "hpc",
             type = "slurm",
             host = "hpc.example.com",
             nodename = "hpc",
             workers = 500,
             log_folder = "/home/user/log_folder/")
```
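The `nodename` value can be looked up on the resource itself. A minimal sketch using base R only, to be run in an R session on the resource (for example, on the cluster login node):

```r
# Sys.info() returns a named character vector describing the system;
# the "nodename" element is the name the machine is known by
node <- Sys.info()[["nodename"]]
print(node)
```

The printed string is what would be passed as `nodename` to `resource_add`.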

`log_folder` is important for troubleshooting; log files are split by project. Because log files are written frequently, it is better to place this folder on fast storage.

## Project

`taskqueue` manages tasks by project. Each project has its own resources, working directory, runtime requirements and other configurations.

Create a project with a unique name using `project_add`, and assign common requirements (e.g. memory).

```r
project_add("test_project", memory = 20)
```

Assign a resource to the project with `project_resource_add`. The resource can then be used to schedule workers, with configurations such as working directory, total runtime and project-specific workers.

```r
project_resource_add(project = "test_project",
                     resource = "hpc")
```

Tasks can now be added to the database with `task_add`, e.g.

```r
task_add("test_project", num = 100, clean = TRUE)
```

Next, develop a function that performs the actual data processing. The function

* takes the task `id` as its first argument,

* is not expected to return a value,

* saves its final output to the file system, and checks that output to decide whether the task is already finished.

Finally, call `worker` with the project name and the wrapper function to run the tasks.

```r
library(taskqueue)

fun_test <- function(i) {
    # Skip the task if its output file already exists
    out_file <- sprintf("%s.Rds", i)
    if (file.exists(out_file)) {
        return()
    }

    # Perform the actual computing
    # ....

    # Save final output
    saveRDS(i, out_file)
}

worker("test_project", fun_test)
```
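Because the wrapper decides whether a task is finished by checking for its output file, a worker that is killed while writing could leave a partial file that later looks like a finished task. One possible safeguard, sketched here with base R only, is to write to a temporary name and rename it afterwards. The name `fun_atomic`, the `out_dir` argument and the squared value are assumptions for illustration and are not part of `taskqueue`. The sketch can be run locally without a database.

```r
# Hypothetical wrapper: `out_dir` and the computed value are illustration only
fun_atomic <- function(i, out_dir = tempdir()) {
    out_file <- file.path(out_dir, sprintf("%s.Rds", i))
    # Skip tasks whose output already exists
    if (file.exists(out_file)) {
        return(invisible(NULL))
    }
    result <- i^2  # placeholder for the actual computing
    # Write under a temporary name, then rename, so that a partial file
    # never appears under the final name
    tmp_file <- paste0(out_file, ".tmp")
    saveRDS(result, tmp_file)
    file.rename(tmp_file, out_file)
    invisible(NULL)
}

# Run a few ids locally; the second pass skips every task
out_dir <- file.path(tempdir(), "taskqueue_demo")
dir.create(out_dir, showWarnings = FALSE)
for (i in 1:3) fun_atomic(i, out_dir)
for (i in 1:3) fun_atomic(i, out_dir)
```

Calling the function directly for a handful of ids like this is also a quick way to test it before passing it to `worker`.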

After developing and testing the wrapper function, save it into a file (e.g. `rcode.R`) and schedule it to run with the following functions:

* Reset all tasks, or only `failed`/`working` tasks, to `idle` status with `project_reset`

* Start the project so that workers can consume tasks with `project_start`

* Schedule tasks on resources (e.g. `worker_slurm` for a Slurm cluster)

* Check task status with `task_status`

* Stop the project with `project_stop`

```r
# Reset status for all tasks in a project
project_reset("test_project")

# Start project
project_start("test_project")

# Schedule tasks on slurm resource `hpc`
worker_slurm("test_project", "hpc", "rcode.R", modules = "sqlite/3.43.1")

# Check status of all tasks
task_status("test_project")

# Stop the project
project_stop("test_project")
```

A task has one of four statuses:

* `idle`: task is not running.

* `working`: task is still running on one of the workers.

* `failed`: task failed for some reason. Check the log folder for troubleshooting.

* `finished`: task finished without errors.
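The task life cycle implied by these statuses can be written down as a small lookup table. This is an interpretation of the README, not something taken from the package source, and `transitions` and `can_move` are hypothetical names.

```r
# Status changes as read from this README (an interpretation only)
transitions <- list(
    idle     = "working",                        # a worker picks up the task
    working  = c("finished", "failed", "idle"),  # outcome, or project_reset
    failed   = "idle",                           # project_reset
    finished = "idle"                            # project_reset on all tasks
)

# Hypothetical helper: is a change from one status to another expected?
can_move <- function(from, to) {
    to %in% transitions[[from]]
}

can_move("idle", "working")      # TRUE
can_move("finished", "working")  # FALSE
```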
