https://github.com/anicusan/mpisort.jl
Optimised multi-node MPI sorting algorithms in Julia
distributed julia mpi out-of-core sorting sorting-algorithms
- Host: GitHub
- URL: https://github.com/anicusan/mpisort.jl
- Owner: anicusan
- License: mit
- Created: 2022-10-13T16:32:19.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-11-09T16:51:23.000Z (over 1 year ago)
- Last Synced: 2024-04-26T15:03:03.727Z (about 1 year ago)
- Topics: distributed, julia, mpi, out-of-core, sorting, sorting-algorithms
- Language: Julia
- Homepage: https://anicusan.github.io/MPISort.jl
- Size: 226 KB
- Stars: 6
- Watchers: 1
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# MPISort
_Don't put all your eggs in one basket!_

Sorting $N$ elements spread out across $P$ processors, _with no processor being able to hold all
elements at once_ is a difficult problem, with very few open-source implementations in
[C++](https://github.com/hsundar/usort) and [Charm++](https://github.com/vipulharsh/HSS). This
library provides the `mpisort!` function for distributed MPI-based sorting algorithms following the
standard Julia `Base.sort!` signature; at the moment, one optimised algorithm is provided:

## `SIHSort`
Sampling with interpolated histograms sorting algorithm (pronounced _sigh_ sort, like anything
MPI-related), optimised for minimum inter-rank communication and memory footprint. Features:

- **Does not require that distributed data fits into the memory of a single node**. No IO either.
- Works for any comparison-based data, with additional optimisations for numeric elements.
- Optimised for minimum MPI communication; can use Julia threads on each shared-memory node.
- The node-local arrays may have different sizes; sorting will try to balance the number of elements held by each MPI rank.
- Works with any `AbstractVector`, including accelerators such as GPUs (see Note).
- Implements the standard Julia `sort!` API, and naturally works for custom data, comparisons, orderings, etc.

### Example
```julia
# File: mpisort.jl
# Run as: mpiexec -n 4 julia --threads=2 mpisort.jl

using MPI
using MPISort
using Random

# Initialise MPI, get communicator for all ranks, rank index, number of ranks
MPI.Init()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)

# Generate local array on each MPI rank - even with different number of elements
rng = Xoshiro(rank)
num_elements = 50 + rank * 2
local_array = rand(rng, 1:500, num_elements)

# Sort arrays across all MPI ranks
alg = SIHSort(comm)
sorted_local_array = mpisort!(local_array; alg=alg)

# Print each local array sequentially
for i in 0:nranks - 1
    rank == i && @show rank sorted_local_array alg.stats
    MPI.Barrier(comm)
end
```
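As an illustrative follow-up (not part of the package API - the `MPI.Allgather` check below is just a sanity test, assuming non-empty local arrays), one can verify that consecutive ranks hold globally ordered pieces:

```julia
# Sanity check (illustrative): gather each rank's first and last element and
# verify that rank i's last element is <= rank i+1's first element.
first_last = MPI.Allgather([first(sorted_local_array), last(sorted_local_array)], comm)
if rank == 0
    globally_ordered = all(first_last[2i] <= first_last[2i + 1] for i in 1:nranks - 1)
    @show globally_ordered
end
```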
**Note:** because the data is redistributed between nodes, the vector size must change - hence it
differs from the in-place `Base.sort!`. The input vector is mutated, but another vector - with
potentially different size and elements - is returned. This is the reason for the different function
signature (`mpisort!` with a return value); however, it takes exactly the same inputs as `Base.sort!`.
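For instance (an illustrative snippet, not from the package documentation), always capture and use the return value:

```julia
# Illustrative: the input is mutated, but the sorted, rebalanced data is in the
# *returned* vector, whose length may differ from the input's original length.
original_length = length(local_array)
sorted_local_array = mpisort!(local_array; alg=SIHSort(comm))
original_length == length(sorted_local_array)    # not guaranteed - ranks are rebalanced
```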
Different sorting settings:

```julia
# Automatically uses MPI.COMM_WORLD as communicator; doesn't save sorting stats
sorted_local_array = mpisort!(local_array; alg=SIHSort())

# Reverse sorting; specify communicator explicitly
sorted_local_array = mpisort!(local_array; alg=SIHSort(comm), rev=true)

# Specify key to sort by; see https://docs.julialang.org/en/v1/base/sort/
sorted_local_array = mpisort!(local_array; alg=SIHSort(), by=x->x["key"])

# Different ordering; see https://docs.julialang.org/en/v1/base/sort/#Alternate-orderings
sorted_local_array = mpisort!(local_array; alg=SIHSort(), order=Base.Order.Reverse)

# Save sorting stats
alg = SIHSort(comm)
sorted_local_array = mpisort!(local_array; alg=alg)

@show alg.stats.splitters      # `nranks - 1` elements splitting arrays between nodes
@show alg.stats.num_elements   # `nranks` integers specifying number of elements on each node

# Use different in-place local sorter
alg = SIHSort(comm, nothing)            # Default: standard Base.sort!
alg = SIHSort(comm, QuickSort)          # Specify algorithm, passed to Base.sort!(...; alg=)
alg = SIHSort(comm, v -> mysorter!(v))  # Pass any function that sorts a local vector in-place
```
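As a further illustration of the custom-data path (a sketch reusing `comm`, `rank`, `rng` and `num_elements` from the first example; the `Particle` struct is hypothetical, not part of the package):

```julia
# Hypothetical custom-data sketch: distributed sort of structs by one field.
struct Particle
    id::Int
    x::Float64
end

particles = [Particle(rank * 1_000 + i, rand(rng)) for i in 1:num_elements]
sorted_particles = mpisort!(particles; alg=SIHSort(comm), by=p -> p.x)
```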
### Communication and Memory Footprint
Only optimised collective MPI communication is used, in order: Gather, Bcast, Reduce, Bcast,
Alltoall, Allreduce, Alltoallv. I am not aware of a non-IO based algorithm with less communication
(if you do know one, please open an issue!).

If $N$ is the total number of elements spread out across $P$ MPI ranks, then the per-rank memory
footprint of `SIHSort` is:

$$ k P + k P + P + 3(P - 1) + \frac{N + \epsilon}{P} $$
where $k$ is the number of samples extracted from each node; following [1], we use:

$$ k = 2 P \log_2 P $$
Except for the final redistribution on a single new array of length $\frac{N + \epsilon}{P}$, the
memory footprint only depends on the number of nodes involved, hence it should be scalable to
thousands of MPI ranks. Anyone got a spare 200,000 nodes to benchmark this?
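As a quick worked example of the formulas above (arithmetic only, not a benchmark), take $P = 4$ ranks:

$$ k = 2 \cdot 4 \cdot \log_2 4 = 16, \qquad k P + k P + P + 3(P - 1) + \frac{N + \epsilon}{P} = 64 + 64 + 4 + 9 + \frac{N + \epsilon}{4} = 141 + \frac{N + \epsilon}{4} $$

so everything except the final $\frac{N + \epsilon}{4}$ term is a small constant independent of $N$.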
### Note on sorting multi-node GPU arrays

`SIHSort` is generic over the input array type, so it can work with GPU arrays - e.g. Julia
`CUDA.CuArray` - and benefit from MPI-configured, optimised inter-GPU interconnects.

However, to be fully performant, it needs:
- A single-node `sort` implementation - at the moment, only `CUDA.CuArray` has one; there is great potential in a `KernelAbstractions.jl` sorter, we really need one!
- A fully GPU-based `searchsortedlast` implementation; **we do not have one** yet, so we rely on a binary search where each tested element is copied to the CPU (!), which of course is not great. More great potential in some optimised `KernelAbstractions.jl` kernels!

While it works currently, it is not ideal: sorting 1,000,000 `Int32` values split across 2 MPI
ranks takes \~0.015s on my Intel i9 CPU and \~0.034s on my NVidia Quadro RTX4000 with Max-Q.
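For completeness, a minimal GPU usage sketch (assumptions: CUDA.jl is installed, MPI is CUDA-aware, and each rank can see at least one GPU; it runs, but is not yet fully performant for the reasons above):

```julia
# Hypothetical GPU sketch - assumes CUDA.jl and a CUDA-aware MPI build.
using MPI
using MPISort
using CUDA

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Assumption: map each rank onto one of the locally visible GPUs
CUDA.device!(rank % length(CUDA.devices()))

# Device-resident local array; mpisort! accepts any AbstractVector
local_array = CUDA.rand(Float32, 1_000_000)
sorted_local_array = mpisort!(local_array; alg=SIHSort(comm))
```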
### References

This algorithm builds on prior art:
- [1] _Harsh V, Kale L, Solomonik E. Histogram sort with sampling._ : followed main ideas and theoretical results, but with deterministic sampling and original communication and interpolation optimisations.
- [2] _Sundar H, Malhotra D, Biros G. Hyksort: a new variant of hypercube quicksort on distributed memory architectures._
- [3] _Shi H, Schaeffer J. Parallel sorting by regular sampling._
- [4] _Solomonik E, Kale LV. Highly scalable parallel sorting._
- [5] _John Lapeyre, integer base-2 logarithm_ - https://github.com/jlapeyre/ILog2.jl.
- [6] _Byrne S, Wilcox LC, Churavy V. MPI.jl: Julia bindings for the Message Passing Interface._ : absolute heroes who made MPI a joy to use in Julia.

# License
`MPISort.jl` is MIT-licensed. Enjoy.