https://github.com/ornl/syclreduce

SYCL Reduce Primitive

Host: GitHub
URL: https://github.com/ornl/syclreduce
Owner: ORNL
License: lgpl-3.0
Created: 2023-05-14T21:29:30.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2023-06-19T17:36:40.000Z (almost 2 years ago)
Last Synced: 2025-01-24T17:14:45.153Z (4 months ago)
Language: C++
Size: 15.6 KB
Stars: 1
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md

# SYCL Reduce Primitive

This is a tiny package implementing what is a giant unmet need in SYCL 2020: proper reductions.

Want to sum a vector coming from every thread in a kernel launch?  Want to accumulate a couple of different kinds of diagnostic output from a kernel?  Too bad.  SYCL doesn't have full documentation on how `span<>` works, and you'll easily get lost writing your own __undefined type__ reducer.

So, instead, try this:

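    #include <syclreduce/reduce.hpp>
    ...
    namespace SR = syclreduce;

    // Value type being reduced: {count, sum, min, max}.
    struct FourInts {
        int x[4];
        FourInts() {}
        // One sample i: count = 1, and i is the sum, min, and max.
        FourInts(int i) : x{1, i,i,i} {}
        const int& operator[](int i) const { return x[i]; }
        int& operator[](int i) { return x[i]; }
    };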

    struct ReduceFour {
        using T = FourInts; // What we're creating.

        // Initial value for all reduction variables.
        // This is not called often by the library, but
        // is an important part of the mathematics.
        void identity(FourInts &a) const {
            a[0] = 0; a[1] = 0;                  // count, sum
            a[2] = ~((int)1<<(sizeof(int)*8-1)); // min starts at the largest int
            a[3] =   (int)1<<(sizeof(int)*8-1);  // max starts at the smallest int
        }

        // How to combine two reduction results.
        // Must be associative.
        // Does not need to be commutative.
        void combine(FourInts &a, const FourInts &b) const {
            a[0] += b[0]; // count
            a[1] += b[1]; // sum
            a[2] = b[2] < a[2] ? b[2] : a[2]; // min
            a[3] = b[3] < a[3] ? a[3] : b[3]; // max
        }
    };

    SR::Reducer result{ ReduceFour() };

    // Like parallel_for, but kernel functions all return
    // a FourInts (ReduceFour::T) by value.
    SR::parallel_reduce(cgh, sycl::nd_range<1>(4096, 32),
                        result, [=](sycl::nd_item<1> it) {
        const size_t tid = it.get_global_id(0);
        return FourInts(tid*31337 % 4792 + 101);
    });

    FourInts ans = result.get();
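
To sanity-check `ans`, it can help to compute the same four quantities sequentially on the host. The following is a minimal sketch in plain C++ (no SYCL needed). `FourIntsRef` and `reference_reduce` are hypothetical names used for illustration; they are not part of syclreduce.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical host-side mirror of FourInts: {count, sum, min, max}.
struct FourIntsRef {
    int x[4];
};

// Sequentially folds the value each work item returns in the example,
// tid*31337 % 4792 + 101, for tid in [0, n_items).
inline FourIntsRef reference_reduce(std::size_t n_items) {
    // Every value is at most 4892, so this bound keeps the int sum
    // from overflowing.
    assert(n_items <= 400000);
    FourIntsRef r{{0, 0, std::numeric_limits<int>::max(),
                         std::numeric_limits<int>::min()}};
    for (std::size_t tid = 0; tid < n_items; ++tid) {
        const int v = static_cast<int>(tid * 31337 % 4792 + 101);
        r.x[0] += 1;               // count
        r.x[1] += v;               // sum
        if (v < r.x[2]) r.x[2] = v; // min
        if (v > r.x[3]) r.x[3] = v; // max
    }
    return r;
}
```

Assuming the library combines results as described in this README, `reference_reduce(4096)` should agree with `ans` element by element for the launch above.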

The reduction is done internally with the following steps:

1. Reduce the values returned within each work group in parallel,
   and store each group's result in its own slot of an internal buffer.
2. Copy the buffer to the host and reduce over those results.

For a launch of `sycl::nd_range<1>(12, 4)` (12 work items in work groups of 4), this works out like the following, where each number stands for the value returned by the work item with that global id:

    out[0] = op(op(0,1), op(2,3))
    out[1] = op(op(4,5), op(6,7))
    out[2] = op(op(8,9), op(10,11))
    result = op( op(out[0], out[1]) , out[2] )
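
The combine order in that walkthrough can be mimicked on the host. The sketch below is not the library's implementation; it is a plain C++ model, under the assumption that each group is reduced pairwise in order and the per-group results are then folded left to right. `reduce_group` and `two_stage_reduce` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Stage 1 for a single work group: pairwise tree reduction that keeps
// operands in left-to-right order, so op only has to be associative.
template <typename T, typename Op>
T reduce_group(std::vector<T> vals, Op op) {
    assert(!vals.empty());
    while (vals.size() > 1) {
        std::vector<T> next;
        for (std::size_t i = 0; i + 1 < vals.size(); i += 2)
            next.push_back(op(vals[i], vals[i + 1]));
        if (vals.size() % 2 == 1) next.push_back(vals.back()); // odd leftover
        vals = std::move(next);
    }
    return vals[0];
}

// Stage 1 over every group (filling out[]), then stage 2: a sequential
// left fold over out[], standing in for the host-side pass.
template <typename T, typename Op>
T two_stage_reduce(const std::vector<T>& items, std::size_t local_size, Op op) {
    assert(local_size > 0 && !items.empty() && items.size() % local_size == 0);
    std::vector<T> out;
    for (std::size_t g = 0; g < items.size(); g += local_size) {
        std::vector<T> group(items.begin() + g, items.begin() + g + local_size);
        out.push_back(reduce_group(std::move(group), op));
    }
    T result = out[0];
    for (std::size_t g = 1; g < out.size(); ++g) result = op(result, out[g]);
    return result;
}
```

Calling `two_stage_reduce` with 12 items and `local_size = 4` produces three `out[]` entries and combines them in the order shown above. Because operands never swap sides, a non-commutative `op` such as concatenation still gives the in-order result.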

Note that you should be careful about kernel launch sizes using this method.  In particular, don't launch more than, say, 4x as many work groups as you have compute units.  Otherwise you are wasting memory and storing more intermediate reduction results than you need to.

We're assuming you already use `sycl::nd_range` to control launch sizes for this reason.
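
One way to apply that rule of thumb is to cap the global size before building the `sycl::nd_range`. The helper below is a hypothetical sketch, not part of syclreduce. It takes the compute-unit count as a plain argument; in SYCL that number can be queried with `device.get_info<sycl::info::device::max_compute_units>()`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: choose a global size so that the number of work
// groups never exceeds groups_per_cu * compute_units.
inline std::size_t capped_global_size(std::size_t n_items,
                                      std::size_t local_size,
                                      std::size_t compute_units,
                                      std::size_t groups_per_cu = 4) {
    assert(local_size > 0 && compute_units > 0 && groups_per_cu > 0);
    // Work groups needed to give every item its own work item (rounded up).
    const std::size_t needed = (n_items + local_size - 1) / local_size;
    const std::size_t cap = groups_per_cu * compute_units;
    // Launch at least one group, and never more than the cap.
    const std::size_t groups = std::max<std::size_t>(1, std::min(needed, cap));
    return groups * local_size;
}
```

When the cap takes effect there are fewer work items than inputs, so each work item would have to loop over several inputs and combine them locally before returning its value.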

# Installing

This is a header-only package.  You can just copy it into an include directory like `/usr/local/include/syclreduce/reduce.hpp`.

If you also want to test and install the cmake package description, do the following:

    mkdir build && cd build
    cmake -DCMAKE_CXX_COMPILER=`which syclcc` \
          -DCMAKE_INSTALL_PREFIX=/usr/local \
          ..
    make install

# Copyright and License

Copyright 2023 UT-Battelle LLC.  See LICENSE.md for license details.
