
# Welcome to SCALIX!
---
**Note**: This proof-of-concept version of Scalix has reached EOL and will no longer receive
updates/fixes, except potentially on an as-needed basis for current research that depends on this
version. The insights gained since starting this project, both through active development and
real-world use in a non-trivial codebase, have been invaluable. With those insights,
development of a much-improved version, written in SYCL instead of CUDA, has begun on the `sycl`
branch. The new version will be better in nearly every way; some of the improvements include:

- SYCL is a more modern API, using ISO-compliant C++ instead of an extended C-like language with
some C++ support like CUDA. Not only does this enable the nice-to-haves of modern C++, but the
new implementation will also be platform agnostic, supporting a wider range of hardware and likely
gaining even more hardware support in the future with no additional Scalix development.
- Using the SYCL specification as a guide, Scalix will use the buffer/accessor model rather
than relying on CUDA's Unified Memory (UM) and on-demand page migration. This allows the Scalix
runtime to maintain a dependency graph and explicitly manage data in an optimal way defined by
that dependency graph. While Scalix currently scales well for relatively simple access patterns,
the UM driver heuristics break down for more complex problems, leading to inefficient data
transfers that tank performance. A minimal sketch of the buffer/accessor model is shown after this list.
- A much simpler API that is easier to use and to maintain/test. If you know SYCL, you'll feel right
at home with Scalix.
- Much better thread management. Currently, Scalix spawns three or four nested threads for a single
kernel, which is already bad enough, and those threads are spawned via `std::async` instead of a
dedicated thread pool. While the true cost of this is unclear, the implementation has a lot of room
for improvement and makes profiling in NSight incredibly difficult, as the number of CUDA streams
balloons very quickly. Going forward, Scalix will internally use a multi-threading approach akin to
an actor model; we have been able to avoid spawning any threads until absolutely necessary,
eliminating the need for async tasks that spin in some thread waiting on a set of dependent futures
before proceeding. It also enables a stateless implementation where the dependency graph is
implicitly defined via shared/unique locks around protected data.
Additionally, we will likely incorporate a thread pool for even better thread management.
- Well-thought-out API/system design and proper unit testing, static analysis, etc. from the start.
The first attempt at Scalix was developed in parallel with a research project, and oftentimes
getting something that worked, even if not ideal, was prioritized over well-thought-out solutions.
That version of Scalix works well enough for the research project, at least until the first working
release of SYCL-Scalix, so we can take our time. This should give future users more confidence that
the project is properly managed, and will make getting started with the software much easier.
- More opportunities for optimization, some implicit, some explicit:
  - We will enforce that only the minimally required memory to keep data valid is used, unless the
  user explicitly declares otherwise.
  - The way memory is handled will have very clear, documented behavior, allowing users to fine-tune
  their code for these rules.
  - Only the memory locations accessed on each device will be allocated there; these accesses are
  specified either by a few simple but commonly used strategies or by a more flexible strategy where
  the user provides a custom command that defines the accessed indices.
  - In terms of optimization of compute resources, Scalix development will focus on providing
  default and user-defined "knobs" that can be tweaked to find the optimal balance of resources.
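
For readers unfamiliar with the buffer/accessor model referenced above, here is a minimal sketch in plain SYCL 2020 (this is standard SYCL, not Scalix's future API):

```c++
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    std::vector<int> data(1024, 0);
    sycl::queue q;
    {
        // the buffer takes ownership of the host data for its lifetime
        sycl::buffer<int, 1> buf(data.data(), sycl::range<1>(data.size()));
        q.submit([&](sycl::handler& cgh) {
            // the accessor declares how this kernel uses the buffer; the
            // runtime builds its dependency graph from these declarations
            sycl::accessor acc(buf, cgh, sycl::read_write);
            cgh.parallel_for(sycl::range<1>(data.size()), [=](sycl::id<1> i) {
                acc[i] += 1;
            });
        });
    } // buffer destruction waits for the kernel and writes back to data
}
```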
---

Scalix is a data parallel computing framework designed to provide an easy-to-use
interface for designing data parallel applications that automatically scale to
the available compute resources. Data distribution and load balancing are
abstracted away from the user, allowing them to focus on the application logic.
Scalix programs are written in a "kernel style," where the user writes a single
function that is executed over a grid of threads.

Currently, Scalix is only supported on a single machine with multiple CUDA
devices, all of which must support Unified Memory. Furthermore, while not
strictly required, we recommend Linux and a CUDA device with compute
capability 6.0 or higher so that on-demand page migration can be used.
Performance without on-demand page migration has not been tested, but will
assuredly be slower.

In the future, we plan to migrate to a SYCL-based implementation combined with
MPI. This future specification will look drastically different, as we won't be
able to rely on Unified Memory, but many of the concepts will remain the same.
Specifically, the user should be able to write for an abstract "device" that may
be one or many actual compute devices.

Even further in the future we plan to define a generic specification for which
anyone can write their own implementation.

Think of this current version of Scalix as a proof of concept.

For a more detailed outline of our roadmap, see [ROADMAP.md](ROADMAP.md).

## Getting Started

For the moment, this is a header-only library, but that may change in the
future. Any program that uses Scalix must be compiled with the nvcc flag
`--extended-lambda`. The library is written in C++17. We have provided a
`CMakeLists.txt` file that defines the `scalix` interface library; linking a
target against it automatically adds the `--extended-lambda` flag for nvcc.
There are also a number of options that can be used, for example, to emulate
multiple devices to check for correctness. These options are described in the
`CMakeLists.txt` file.
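
For example, a consuming project might look like the following sketch, which assumes this repository was added as a subdirectory; the `scalix` target comes from the provided `CMakeLists.txt`, while the project and file names are illustrative:

```cmake
cmake_minimum_required(VERSION 3.18)
project(scalix_demo LANGUAGES CXX CUDA)

# provides the `scalix` interface library and its options
add_subdirectory(scalix)

add_executable(demo main.cu)

# linking against `scalix` adds --extended-lambda to nvcc automatically
target_link_libraries(demo PRIVATE scalix)
```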

### Writing a Scalix Program

In the following program we add one to all elements of a Scalix array.

```c++
#include <scalix/scalix.cuh>

int main() {
    // create a 2D array of 1000x1000 elements
    //
    // the unified memory hints for the array are set such as to maximize
    // performance for equally distributed access across all devices
    sclx::array<int, 2> arr({1000, 1000});

    // initialize the array to 0
    //
    // kernels automatically distribute the work across all devices
    // for the result array
    sclx::execute_kernel([&](sclx::kernel_handler& handler) {
        handler.launch(
            sclx::md_range_t<2>(arr.shape()),
            arr, // need to provide a result array to guide the distribution
                 // of work
            [=] __device__(const sclx::md_index_t<2>& idx,
                           /*see below for what this is*/ const auto&) {
                arr[idx] = 0;
            }
        );
    });

    // add one to all elements and print out the block thread id, this time
    // spelling out the kernel info object (sclx::kernel_info) that was left
    // as `const auto&` above
    sclx::execute_kernel([&](sclx::kernel_handler& handler) {
        handler.launch(
            sclx::md_range_t<2>(arr.shape()),
            arr,
            [=] __device__(
                const sclx::md_index_t<2>& idx,
                const sclx::kernel_info<
                    2,
                    sclx::cuda::traits::kernel::default_block_shape.rank()>&
                    info
            ) {
                printf(
                    "block thread id: %u\n",
                    static_cast<uint>(info.local_thread_id()[0])
                );
                arr[idx] += 1;
            }
        );
    });
}
```

A special type of problem one may face is one where the write pattern cannot fit
into a neat range. For example, suppose we want to generate a histogram of the
values in an array. Sequentially, we would iterate over each value, determine
the bin it belongs to, and increment that bin.

To do this in Scalix, we instead use the "index generator"-based kernel API.
The index generator in this case takes an index from the range associated with
the values and returns the corresponding bin index. Our kernel implementation
replicates the entire range over each device, but only calls the provided
functor if the write index is local to that device. Unfortunately, this means
scaling will not be ideal, and in many cases will be worse than a single-device
implementation. What it does allow, however, is avoiding expensive memory
transfers in a problem that otherwise mostly scales well using the range-based
API.

```c++
#include <scalix/scalix.cuh>

// required interface for an index generator
class histogram_index_generator {
  public:
    static constexpr uint range_rank = 1;
    static constexpr uint index_rank = 1;

    __host__ __device__ const sclx::md_range_t<range_rank>& range() const;

    __host__ __device__ const sclx::md_range_t<index_rank>& index_range() const;

    __host__ __device__ sclx::md_index_t<index_rank>
    operator()(const sclx::md_index_t<range_rank>&) const;

    ...
};

int main() {

    sclx::array<float, 1> values({1000});
    sclx::array<int, 1> histogram({100});

    histogram_index_generator index_generator(...);

    sclx::execute_kernel([&](sclx::kernel_handler& handler) {
        handler.launch(
            index_generator,
            histogram,
            [=] __device__(const sclx::md_index_t<1>& histogram_idx,
                           const auto&) {
                atomicAdd(&histogram[histogram_idx], 1);
            }
        );
    });
}
```
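
To make the interface concrete, here is a hypothetical generator that bins values assumed to lie in [0, 1) into uniform-width bins. Only `range_rank`, `index_rank`, and the three member functions are required by the interface above; the constructor, member variables, and binning rule are purely illustrative:

```c++
// hypothetical implementation of the interface above; everything beyond the
// required interface is illustrative
class uniform_histogram_index_generator {
  public:
    static constexpr uint range_rank = 1;
    static constexpr uint index_rank = 1;

    uniform_histogram_index_generator(sclx::array<float, 1> values, uint num_bins)
        : values_(values),
          range_(values.shape()),
          index_range_({num_bins}),
          num_bins_(num_bins) {}

    // the launch range: one thread per value
    __host__ __device__ const sclx::md_range_t<range_rank>& range() const {
        return range_;
    }

    // the range of generated (write) indices: one per bin
    __host__ __device__ const sclx::md_range_t<index_rank>& index_range() const {
        return index_range_;
    }

    // map a value index to the bin that value falls into, assuming
    // values in [0, 1) and uniform bin widths
    __host__ __device__ sclx::md_index_t<index_rank>
    operator()(const sclx::md_index_t<range_rank>& idx) const {
        sclx::md_index_t<index_rank> bin;
        bin[0] = static_cast<size_t>(values_[idx] * static_cast<float>(num_bins_));
        return bin;
    }

  private:
    sclx::array<float, 1> values_;
    sclx::md_range_t<range_rank> range_;
    sclx::md_range_t<index_rank> index_range_;
    uint num_bins_;
};
```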

The [examples](examples) directory contains a number of examples that
demonstrate how to use the library. For every new feature we add, we will do our
best to add an example that demonstrates its use. As we approach the first
stable release, we will start to focus on adding proper documentation.

## Performance

We have shown almost perfect strong scaling for the `distributed_access` example
in the repo, distributed across 2 GPUs. The example has two arrays of interest.
Each kernel reads 64 elements from one array and atomically adds it to an
element of the other array for each element in the result array. A portion of
these reads happen across device boundaries. Also note that the source and
result arrays exchange roles for each kernel, showing we can still get good
performance, even for arrays that read/write equally. Finally, a distributed
reduction algorithm is used to sum the values of one of the arrays together just
for fun (and to check that the solution is correct). The following times were
recorded on 1 vs 2 RTX 3090s, respectively.

```
1 GPU time: 163.1ms
2 GPU time: 87.5ms
```
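
For reference, a minimal sketch of the access pattern described above (not the actual `distributed_access` source; array sizes, element type, and the indexing rule are illustrative):

```c++
#include <scalix/scalix.cuh>

int main() {
    // sizes and element type are illustrative, not taken from the example
    constexpr uint n = 1u << 20;
    sclx::array<int, 1> source({n});
    sclx::array<int, 1> result({n});

    sclx::execute_kernel([&](sclx::kernel_handler& handler) {
        handler.launch(
            sclx::md_range_t<1>(result.shape()),
            result, // the result array guides distribution across devices
            [=] __device__(const sclx::md_index_t<1>& idx, const auto&) {
                // each thread reads 64 elements of source, some of which
                // may reside on another device, and accumulates them into
                // one element of result
                for (uint i = 0; i < 64; ++i) {
                    sclx::md_index_t<1> src_idx;
                    src_idx[0] = (idx[0] + i) % n;
                    atomicAdd(&result[idx], source[src_idx]);
                }
            }
        );
    });
}
```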