Another implementation of the Apache Uniffle shuffle server: a single binary, no extra dependencies, and quick to deploy.

## Roadmap

- [ ] Support storing data in S3
- [ ] Support single-buffer flush
- [ ] Support a huge-partition limit
- [ ] Quick decommission by spilling data to remote storage such as S3
- [ ] Use Direct I/O to access file data
- [x] Support a customized protocol to interact with the Netty-based Uniffle client
- [ ] Support writing multiple replicas in pipeline mode on the server side
- [ ] Create a Grafana template for a unified metrics dashboard
- [ ] Introduce Clippy for lint validation
- [ ] Zero copy for **urpc** and memory + localfile reads/writes
- [ ] Recover state when upgrading

## Benchmark report

#### Environment

| Type | Description |
|---------------------|:------------------------------------------------------------------------|
| Software | Uniffle 0.8.0 / Hadoop 3.2.2 / Spark 3.1.2 |
| Hardware | Each machine: 96 cores, 512 GB memory, 4 × 1 TB SATA SSD, 8 GB/s network bandwidth |
| Hadoop YARN Cluster | 1 × ResourceManager + 40 × NodeManager, 4 × 1 TB SATA SSD per machine |
| Uniffle Cluster | 1 × Coordinator + 1 × Shuffle Server, 4 × 1 TB SATA SSD per machine |

#### Configuration

__Spark conf__
```properties
spark.executor.instances 400
spark.executor.cores 1
spark.executor.memory 2g
spark.shuffle.manager org.apache.spark.shuffle.RssShuffleManager
spark.rss.storage.type MEMORY_LOCALFILE
```

__Rust-based shuffle-server conf__
```toml
store_type = "MEMORY_LOCALFILE"
grpc_port = 21100
coordinator_quorum = ["xxxxx:21000"]
tags = ["riffle2", "datanode", "GRPC", "ss_v5"]

[memory_store]
capacity = "10G"
dashmap_shard_amount = 128

[localfile_store]
data_paths = ["/data1/uniffle/t1", "/data2/uniffle/t1", "/data3/uniffle/t1", "/data4/uniffle/t1"]
healthy_check_min_disks = 0
disk_max_concurrency = 2000

[hybrid_store]
memory_spill_high_watermark = 0.5
memory_spill_low_watermark = 0.2
memory_spill_max_concurrency = 1000

[metrics]
http_port = 19998
push_gateway_endpoint = "http://xxxxx/prometheus/pushgateway"

[runtime_config]
read_thread_num = 40
write_thread_num = 200
grpc_thread_num = 100
http_thread_num = 10
default_thread_num = 20
dispatch_thread_num = 10
```
`GRPC_PARALLELISM=100 WORKER_IP=10.0.0.1 RUST_LOG=info ./uniffle-worker`
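
With the server running as above, a usage sketch of submitting a workload with the Spark conf from this section. The property names and values come from the conf block above; the TeraSort class, jar, and paths are placeholders for your own job, and the Uniffle Spark client jar is assumed to already be on the application classpath:

```shell
# Hypothetical submit command; class/jar/paths are placeholders.
spark-submit \
  --conf spark.executor.instances=400 \
  --conf spark.executor.cores=1 \
  --conf spark.executor.memory=2g \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.RssShuffleManager \
  --conf spark.rss.storage.type=MEMORY_LOCALFILE \
  --class com.example.TeraSort \
  terasort.jar hdfs:///terasort/input hdfs:///terasort/output
```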

#### TeraSort Result

| type/buffer capacity | 273G (compressed) |
|---------------------------------------|:------------------:|
| vanilla spark ESS | 4.2min (1.3m/2.9m) |
| | |
| riffle(grpc) / 10g | 4.0min (1.9m/2.1m) |
| riffle(grpc) / 300g | 3.5min (1.4m/2.1m) |
| | |
| riffle(urpc) / 10g | 3.8min (1.6m/2.2m) |
| riffle(urpc) / 300g | 3.2min (1.2m/2.0m) |
| | |
| uniffle(grpc) / 10g | 4.0min (1.8m/2.2m) |
| uniffle(grpc) / 300g | 8.6min (2.7m/5.9m) |
| | |
| uniffle(netty)(default malloc) / 10g | 5.1min (2.7m/2.4m) |
| uniffle(netty)(jemalloc) / 10g | 4.5min (2.0m/2.5m) |
| uniffle(netty)(default malloc) / 300g | 4.0min (1.5m/2.5m) |
| uniffle(netty)(jemalloc) / 300g | 6.6min (1.9m/4.7m) |

> Tip: riffle's urpc implements a customized TCP stream protocol, which corresponds to the NETTY rpc type on the Java side.

## Build

`cargo build --release --features hdfs,jemalloc`

Uniffle-x currently treats all compiler warnings as errors, with some dead-code warnings excluded. If you are developing
and want to ignore the warnings for now, you can use `cargo --config 'build.rustflags=["-W", "warnings"]' build`
to restore the default behavior. However, before submitting your PR, you should fix all warnings.
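
As a quick reference, both build modes side by side (the exact set of warnings the project denies may change over time):

```shell
# Strict default: compiler warnings fail the build (some dead-code warnings excluded).
cargo build --release --features hdfs,jemalloc

# Local development only: downgrade warnings back to plain warnings while iterating.
cargo --config 'build.rustflags=["-W", "warnings"]' build
```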

## Run

`WORKER_IP={ip} RUST_LOG=info WORKER_CONFIG_PATH=./config.toml ./uniffle-worker`
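
For a first local run, the sketch below writes a minimal `config.toml` and starts the worker. It only uses keys shown elsewhere in this README (store type, gRPC port, coordinator quorum, memory capacity, one data path); hosts, the worker IP, and paths are placeholders:

```shell
# Minimal sketch: replace the coordinator host, worker IP, and data path with real values.
cat > ./config.toml <<'EOF'
store_type = "MEMORY_LOCALFILE"
grpc_port = 19999
coordinator_quorum = ["coordinator-host:21000"]

[memory_store]
capacity = "1G"

[localfile_store]
data_paths = ["/var/data/path1"]
EOF

WORKER_IP=10.0.0.1 RUST_LOG=info WORKER_CONFIG_PATH=./config.toml ./uniffle-worker
```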

### HDFS Setup

Thanks to the hdfs-native crate, there is no need to set up JAVA_HOME or related Hadoop dependencies.
If the HDFS store is enabled, the Spark client must set `spark.rss.client.remote.storage.useLocalConfAsDefault=true` (a client-side submit sketch follows the Kerberos example below).

```shell
cargo build --features hdfs --release
```

```shell
# run with Kerberos configured via environment variables
KRB5_CONFIG=/etc/krb5.conf KRB5CCNAME=/tmp/krb5cc_2002 LOG=info ./uniffle-worker
```
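
On the Spark side, a hedged sketch of passing the HDFS-related client properties at submit time. The storage-type value `MEMORY_LOCALFILE_HDFS` mirrors the server config in the next section, the job class and jar are placeholders, and the Uniffle Spark client jar is assumed to already be on the classpath:

```shell
# Sketch only: property names come from this README, class/jar names are placeholders.
spark-submit \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.RssShuffleManager \
  --conf spark.rss.storage.type=MEMORY_LOCALFILE_HDFS \
  --conf spark.rss.client.remote.storage.useLocalConfAsDefault=true \
  --class com.example.YourJob \
  your-job.jar
```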

## All config options

```toml
store_type = "MEMORY_LOCALFILE_HDFS"
grpc_port = 19999
coordinator_quorum = ["host1:port", "host2:port"]
urpc_port = 20000
http_monitor_service_port = 20010
heartbeat_interval_seconds = 2
tags = ["GRPC", "ss_v5", "GRPC_NETTY"]

[memory_store]
capacity = "1G"
buffer_ticket_timeout_sec = 300
buffer_ticket_check_interval_sec = 10
dashmap_shard_amount = 128

[localfile_store]
data_paths = ["/var/data/path1", "/var/data/path2"]
min_number_of_available_disks = 1
disk_high_watermark = 0.8
disk_low_watermark = 0.7
disk_max_concurrency = 2000
disk_write_buf_capacity = "1M"
disk_read_buf_capacity = "1M"
disk_healthy_check_interval_sec = 60

[hdfs_store]
max_concurrency = 50
partition_write_max_concurrency = 20

[hdfs_store.kerberos_security_config]
keytab_path = "/path/to/keytab"
principal = "principal@REALM"

[hybrid_store]
memory_spill_high_watermark = 0.8
memory_spill_low_watermark = 0.2
memory_single_buffer_max_spill_size = "1G"
memory_spill_to_cold_threshold_size = "128M"
memory_spill_to_localfile_concurrency = 4000
memory_spill_to_hdfs_concurrency = 500
huge_partition_memory_spill_to_hdfs_threshold_size = "64M"

[runtime_config]
read_thread_num = 100
localfile_write_thread_num = 100
hdfs_write_thread_num = 20
http_thread_num = 2
default_thread_num = 10
dispatch_thread_num = 100

[metrics]
push_gateway_endpoint = "http://example.com/metrics"
push_interval_sec = 10
labels = { env = "production", service = "my_service" }

[log]
path = "/var/log/my_service.log"
rotation = "Daily"

[app_config]
app_heartbeat_timeout_min = 5
huge_partition_marked_threshold = "1G"
huge_partition_memory_limit_percent = 0.75

[tracing]
jaeger_reporter_endpoint = "http://jaeger:14268"
jaeger_service_name = "my_service"

[health_service_config]
alive_app_number_max_limit = 100
```

## Profiling

### Heap profiling
1. Build with heap-profiling support
```shell
cargo build --release --features memory-prof
```
2. Dump the heap profile from the running server and open it with pprof
```shell
curl localhost:20010/debug/heap/profile > heap.pb.gz
go tool pprof -http="0.0.0.0:8081" heap.pb.gz
```

### CPU Profiling
1. Build with the jemalloc feature
```shell
cargo build --release --features jemalloc
```
2. Run the following command to collect a CPU profile and render a flamegraph
```shell
go tool pprof -http="0.0.0.0:8081" http://{remote_ip}:8080/debug/pprof/profile?seconds=30
```
- `{remote_ip}:8080`: the riffle server exposing the pprof endpoint.
- `0.0.0.0:8081`: the local address where `go tool pprof` serves its web UI.
- `seconds=30`: profiling lasts for 30 seconds.

Then open `http://localhost:8081/ui/flamegraph` in your browser to view the flamegraph.