https://github.com/mengke-mk/ring

RING is a ~5k-line pure C++11 runtime system for scaling irregular applications in the PGAS programming paradigm.

Host: GitHub
URL: https://github.com/mengke-mk/ring
Owner: mengke-mk
Created: 2017-07-01T02:59:34.000Z (almost 9 years ago)
Default Branch: master
Last Pushed: 2017-07-01T03:04:30.000Z (almost 9 years ago)
Last Synced: 2024-12-28T04:17:04.160Z (over 1 year ago)
Topics: graph, numa, pgas, rdma, runtime
Language: C
Homepage:
Size: 555 KB
Stars: 2
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

## Intro

RING is a **~5k-line** pure C++11 runtime system for scaling irregular applications in the PGAS (partitioned global address space) programming paradigm.

## Dependencies

You must have a 64-bit Linux system with the following installed to build RING.

- Build

    - CMake >= 2.8.12

- Compiler

    - GCC > 4.8

- External:

    - Infiniband Verbs

    - MPI

    - pthread

    - libnuma

- Optional

    - gperftools

## Quick Start

Like Grappa, Ring is primarily designed around a "global view" model: rather than coordinating where all parallel SPMD processes are and how they divide up the data, the programmer is encouraged to think of the system **as a single large shared memory**.

### Section 1 : Hello world

Use `run()` to wrap the whole computation you want to execute. Use `all_do()` to spawn the same task on all cores, as you would in SPMD. Use `call()` to perform a remote procedure call.

`void run([]{ /*your codes*/ });`    

`void all_do([]{ /*your works*/ });`    

`void call(GlobalAddress<T> gt, [](T* t){} );`

`auto f = call(GlobalAddress<T> gt, [](T* t){} );`

`T result = f.get(); // blocks explicitly, just like a C++11 future`

`void call(int which_core, [](){} );`

`auto f = call(int which_core, [](){} );`

`T result = f.get(); // blocks explicitly, just like a C++11 future`

```c++
#include "ring.hpp"

int x;

int main(int argc, char* argv[]){
  RING_Init(&argc, &argv);
  run([]{
    // all_do() spawns this task on every core.
    all_do([]{
      x=0;
      // Only rank 0 issues the remote call.
      if(thread_rank() == 0){
        auto gx = make_global(0, 1, &x);
        // The lambda receives a local pointer to the data behind gx.
        call(gx, [](int *x){
           sync_printf("hello world", (*x));
        });//call
      }
    });//all_do
  });//run
  RING_Finalize();
  return 0;
}
```
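The `call()` / `f.get()` pair described above can be pictured with a small single-process model. This is only a sketch for intuition, not RING's implementation: `SimCore`, `SimFuture`, and `sim_call` are hypothetical names, there is no networking or threading, and "blocking" is modeled by running the target core's pending tasks until the result is ready.

```c++
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical model of a core: it only owns a queue of pending tasks.
struct SimCore {
  std::deque<std::function<void()>> inbox;
};

// Hypothetical stand-in for the future returned by call().
template <typename R>
class SimFuture {
 public:
  SimFuture(std::shared_ptr<R> slot, std::shared_ptr<bool> ready, SimCore* owner)
      : slot_(std::move(slot)), ready_(std::move(ready)), owner_(owner) {}

  // Models a blocking get(): run the owner's queued tasks until ours has finished.
  R get() {
    while (!*ready_) {
      assert(!owner_->inbox.empty());
      std::function<void()> task = std::move(owner_->inbox.front());
      owner_->inbox.pop_front();
      task();
    }
    return *slot_;
  }

 private:
  std::shared_ptr<R> slot_;
  std::shared_ptr<bool> ready_;
  SimCore* owner_;
};

// Ships fn to the target core's queue and returns a handle to its future result.
template <typename R, typename F>
SimFuture<R> sim_call(SimCore& target, F fn) {
  auto slot = std::make_shared<R>();
  auto ready = std::make_shared<bool>(false);
  target.inbox.push_back([slot, ready, fn]() {
    *slot = fn();
    *ready = true;
  });
  return SimFuture<R>(slot, ready, &target);
}
```

In this model nothing runs when `sim_call<int>(core, fn)` returns; the work happens on the target's queue, and the caller only waits at `get()`, which mirrors the explicit blocking point in the signatures above.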

### Section 2 : Global memory

In Ring, all memory on all cores is addressable by any other core, in the spirit of the **PGAS** programming paradigm.

Allocate a global array with `gmalloc()` and free it with `gfree()`.

`auto A = gmalloc<T>(size_t size); // can only be called inside run()`

`gfree(A);`

```c++
run([]{
  auto x = gmalloc<int>(1<<20);
  auto y = gmalloc<int>(1<<20);
  // Global addresses support pointer-style arithmetic.
  auto x1 = x + 230;
  auto y1 = y + 333;
  call(x1, [](int * w){
    sync_printf("hello pgas x", id());
  });
  call(y1, [](int * w){
    sync_printf("hello pgas y", id());
  });
  /* The free requests are executed only after the RPCs above have run. */
  gfree(x);
  gfree(y);
});
```
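To make an expression such as `x + 230` concrete, here is a minimal sketch of how an element index in a global array could be mapped to an owning core and a local offset. The block-cyclic layout, the `Placement` struct, and the `locate` function are assumptions chosen for illustration; the README does not say how RING actually distributes `gmalloc()` memory.

```c++
#include <cassert>
#include <cstddef>

// Hypothetical result type: which core owns an element and where it sits locally.
struct Placement {
  int core;
  std::size_t local_offset;
};

// Assumed block-cyclic layout: consecutive blocks of `block` elements are
// dealt out round-robin across `num_cores` cores.
Placement locate(std::size_t index, std::size_t block, int num_cores) {
  assert(block > 0 && num_cores > 0);
  const std::size_t cores = static_cast<std::size_t>(num_cores);
  const std::size_t block_id = index / block;        // which block holds the element
  const std::size_t local_block = block_id / cores;  // how many earlier blocks the owner holds
  Placement p;
  p.core = static_cast<int>(block_id % cores);
  p.local_offset = local_block * block + index % block;
  return p;
}
```

Under these assumptions, with 64-element blocks on 4 cores, index 230 falls in block 3 and lands on core 3 at local offset 38, while index 333 falls in block 5 and lands on core 1 at local offset 77. A remote `call()` on such an address would then run on the owning core with a plain local pointer.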

### Section 3 : Parallel for

Instead of spawning tasks individually, it is almost always better to use a parallel loop of some sort. Use `pfor()` to spawn loop iterations recursively until a threshold is hit.

`void pfor(Array A, int s, int e, [](T* t){})`  

`void pfor(Array A, int s, int e, [](i, T* t){})`  

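```c++
run([]{
  auto A = gmalloc<int>(1<<20);
  // Visit elements [0, 1024) of A in parallel.
  pfor(A, 0, 1024, [](int* t){
    auto a = id();
    // Each iteration issues a remote call to core 1.
    call(1, [a]{
      sync_printf(id());
    });//call
  });//pfor
  gfree(std::move(A));
});//run
```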
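The recursive splitting that `pfor()` is described as doing can be sketched in plain C++. This is a sequential illustration under stated assumptions, not RING's code: `sim_pfor` and its `threshold` parameter are hypothetical, and where a runtime would spawn each half as a separate task, this sketch simply recurses.

```c++
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Splits [begin, end) in half until a range is no larger than `threshold`,
// then runs the loop body over that range.
void sim_pfor(std::vector<int>& data, std::size_t begin, std::size_t end,
              std::size_t threshold,
              const std::function<void(std::size_t, int*)>& body) {
  assert(begin <= end && end <= data.size() && threshold > 0);
  if (end - begin <= threshold) {
    for (std::size_t i = begin; i < end; ++i) {
      body(i, &data[i]);
    }
    return;
  }
  const std::size_t mid = begin + (end - begin) / 2;
  // A parallel runtime could hand each half to a different worker here.
  sim_pfor(data, begin, mid, threshold, body);
  sim_pfor(data, mid, end, threshold, body);
}
```

The threshold trades scheduling overhead against load balance: a small value creates many fine-grained tasks, while a large value leaves fewer, coarser ones.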

## Benchmarks

  - GUPS, in updates per second (Ring vs. Grappa)

    - 1.23e8 vs. 8.04e7 UPS: 120 cores, 1<<20 updates

    - 5.30e7 vs. 5.12e7 UPS: 12 cores, 1<<20 updates

    - 2.36e7 vs. 2.10e7 UPS: 1 core, 1<<20 updates

    - Baseline: 2.62e8 UPS, 1 core, 1<<20 updates

  - Graph500, in traversed edges per second (Ring vs. Grappa)

    - 1.59e8 vs. 9.7e7 TEPS: 120 cores (10 nodes), scale=22 BFS

    - 1.55e7 vs. 1.64e7 TEPS: 12 cores (1 node), scale=22 BFS
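
The pairs above are easier to compare as ratios. The helper below is a small illustrative function (the name `ring_over_grappa` is made up here) that divides the Ring figure by the Grappa figure for one row.

```c++
#include <cassert>

// Ratio of Ring throughput to Grappa throughput; values above 1 favor Ring.
double ring_over_grappa(double ring, double grappa) {
  assert(grappa > 0.0);
  return ring / grappa;
}
```

Applied to the reported numbers, the GUPS ratios come out to roughly 1.53 (120 cores), 1.04 (12 cores), and 1.12 (1 core). For Graph500 they are roughly 1.64 on 120 cores and 0.95 on 12 cores, so Ring is slightly behind Grappa on the single-node BFS run.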
