{"id":14982385,"url":"https://github.com/memverge/splash","last_synced_at":"2025-09-02T02:34:26.739Z","repository":{"id":35783581,"uuid":"154596216","full_name":"MemVerge/splash","owner":"MemVerge","description":"Splash, a flexible Spark shuffle manager that supports user-defined storage backends for shuffle data storage and exchange","archived":false,"fork":false,"pushed_at":"2024-12-19T22:35:25.000Z","size":682,"stargazers_count":127,"open_issues_count":8,"forks_count":29,"subscribers_count":28,"default_branch":"master","last_synced_at":"2025-07-26T19:42:10.911Z","etag":null,"topics":["apache-spark","bigdata","disaggregation","elasticity","java","scala","shuffle","spark","storage"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MemVerge.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-10-25T02:10:12.000Z","updated_at":"2024-07-30T06:34:55.000Z","dependencies_parsed_at":"2025-01-18T22:37:59.996Z","dependency_job_id":"52958ddc-0ebf-4687-a999-3e6a914c65c1","html_url":"https://github.com/MemVerge/splash","commit_stats":{"total_commits":37,"total_committers":4,"mean_commits":9.25,"dds":0.5405405405405406,"last_synced_commit":"ae598d4ddd4c224ca56b4f09a2864fad966281b9"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/MemVerge/splash","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MemVerge%2Fsplash","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MemVerge%
2Fsplash/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MemVerge%2Fsplash/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MemVerge%2Fsplash/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MemVerge","download_url":"https://codeload.github.com/MemVerge/splash/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MemVerge%2Fsplash/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273220493,"owners_count":25066395,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-02T02:00:09.530Z","response_time":77,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","bigdata","disaggregation","elasticity","java","scala","shuffle","spark","storage"],"created_at":"2024-09-24T14:05:19.283Z","updated_at":"2025-09-02T02:34:26.712Z","avatar_url":"https://github.com/MemVerge.png","language":"Scala","readme":"# Splash\n\n[![travis-ci](https://img.shields.io/travis/MemVerge/splash/master.svg)](https://travis-ci.org/MemVerge/splash)\n[![codecov](https://img.shields.io/codecov/c/gh/MemVerge/splash/master.svg)](https://codecov.io/gh/MemVerge/splash)\n[![license](https://img.shields.io/github/license/MemVerge/splash.svg)](LICENSE)\n\nA shuffle manager for Spark that supports different storage plugins.\n\nThe 
motivation of this project is to supply a fast, flexible and reliable \nshuffle manager that allows users to plug in their preferred backend storage \nand network frameworks for holding and exchanging shuffle data. \n\nIn general, the current shuffle manager in Spark has some shortcomings.\n\n* Storing shuffle data locally limits reliability and performance. \n  * Losing a single node can break the data integrity of the entire cluster.\n  * It is difficult to containerize the application.\n  * In order to improve the shuffle read/write performance, you must upgrade \n    each server in the cluster.\n  * The overall performance of the shuffle stage is affected by the performance \n    of local disk IO when there is heavy shuffling.\n* There is no easy, general way to plug external storage into the shuffle \n  service.\n  \n\nWe want to address these issues in this shuffle manager.\n\n---\n\n* [License](#license)\n* [Deployment](#deployment)\n* [Release](#release)\n* [Upgrade](#upgrade)\n* [Service \u0026 Support](#service--support)\n* [Community](#community)\n* [Contributing](#contributing)\n* [Build](#build)\n* [Options](#options)\n* [Plugin Development](#plugin-development)\n* [Shuffle Performance Tool](#shuffle-performance-tool)\n\n---\n\n## License\n[Apache License Version 2.0](LICENSE)\n\n## Deployment\nBy default, the build targets Spark 2.3.2 (Scala 2.11) with Hadoop 2.7.  \nIf you want to generate a build for a different Spark version, you need to modify \nthese version parameters in `pom.xml`:\n* `spark.version`\n* `hadoop.version`\n* `scala.version`\n\nCheck the [Build](#build) section for how to generate your customized jar.\n\n### Spark\n* You need to include the Splash jar file in your Spark default configuration \n  or task configuration.  Make sure you choose the one that matches your \n  Spark and Scala versions.  
Typically, you only need to add two settings \n  to your `spark-defaults.conf`:\n  \n```\nspark.driver.extraClassPath /path/to/splash.jar\nspark.executor.extraClassPath /path/to/splash.jar\n```\n\n* You can include the plugin jar in the same way.\n* You can configure your Spark application to use the Splash shuffle manager \n  by adding the following option:\n\n```\nspark.shuffle.manager org.apache.spark.shuffle.SplashShuffleManager\n```\n\n* The storage plugin is configurable at the application level.  You can specify \n  different storage implementations for different applications.\n* Splash supports both on-premises and cloud deployments.\n\n## Release\n* Release numbering follows [Semantic Versioning 2.0.0](https://semver.org/#semantic-versioning-200).\n* Releases are available on the project's release page.\n\n## Upgrade\nAlthough the basic functionality of the project has been verified, the public \nAPI may still be modified as more storage plugins are developed. \nTherefore:\n* The public API may change until we reach version 1.0.0.\n\nAccording to the definition of Semantic Versioning 2.0.0, we do not promise \nbackward compatibility when the major version number changes.\n\n## Service \u0026 Support\n* Please raise your question on the project's issue page and tag it with \n  `question`.\n* Project documents are available in the `doc` folder.\n\n## Community\nYou can communicate with us in the following ways:\n* Start a new thread in [GitHub issues](https://github.com/MemVerge/splash/issues) \n  (recommended).\n* Request to join the WeChat group through [email](mailto://cedric.zhuang@memverge.com) \n  and make sure you include your WeChat ID in the mail.\n\n## Contributing\nPlease check the [Contributing](CONTRIBUTING.md) document for details.\n\n## Build\n\n* Use `mvn install` to build the project.  
Optionally, you can add \n  `-DskipTests=true` to skip the unit tests.\n\n  When the build process completes:\n  * A standard jar will be generated at: `./target/splash-\u003cversion\u003e.jar`.  This\n    jar is what you need to deploy to your Spark environment.\n  * A fat jar will be generated at: `./target/splash-\u003cversion\u003e-shaded.jar`\n  * You can find the unit test results in: `./target/surefire-reports`\n  * You can find the coverage report in: `./target/site/jacoco` \n\n* Use `mvn clean` to clean the build output.\n\n* Use `integration-test` or `mvn failsafe:integration-test -DskipIT=false`\n  to run the integration tests.  These tests connect to the actual file \n  system.  You can also modify the test source code to test your own storage \n  plugin.\n  * Once the tests complete, the results are available in: \n    `./target/failsafe-reports`\n\n* Use `mvn pmd:pmd` to run static code analysis.\n\n  * The analysis report is available at: `./target/site/pmd.html`\n\n## Options\n* `spark.shuffle.splash.storageFactory` specifies the class name of your \n  factory.  This class must implement \n  [`StorageFactory`](src/main/java/com/memverge/splash/StorageFactory.java).\n* `spark.shuffle.splash.clearShuffleOutput` is a boolean value telling the \n  shuffle manager whether to clear the shuffle output when the shuffle stage \n  completes.\n  \n## Plugin Development\nSplash uses plugins to support different types of storage systems.  You can \ndevelop your own storage plugins for the shuffle manager, and a plugin can use \ndifferent types of storage depending on what each file is used for.  
For details, \nplease check our [design document](doc/Design.md).\n\nThe Splash project is currently released with one default plugin:\n* The plugin for shared file systems such as NFS is implemented by\n  `com.memverge.splash.shared.SharedFSFactory`.\n\nThis plugin also serves as an example for developers writing their own \nstorage plugins.\n\n### Deploy Shared Folder Storage Plugin\nTaking NFS as an example, the following steps configure Splash with the shared folder plugin.\n* Update the configuration in `spark-defaults.conf`:\n\n```\n# add the Splash jar to the classpath\nspark.driver.extraClassPath /path/to/splash.jar\nspark.executor.extraClassPath /path/to/splash.jar\n\n# set shuffle manager and storage plugin\nspark.shuffle.manager org.apache.spark.shuffle.SplashShuffleManager\nspark.shuffle.splash.storageFactory com.memverge.splash.shared.SharedFSFactory\n\n# set the location of your shared folder\nspark.shuffle.splash.folder /your/share/folder\n```\n* Make sure that all your Spark nodes can access the shared folder specified in the configuration file.\n* Run a sample Spark application; you should see the application folder created in the shared folder.\n\n## Shuffle Performance Tool\n\nUse this tool to measure the performance of a storage plugin.  You can\nalso use it to compare different storage plugin implementations or to detect\nperformance regressions in a plugin.\n\nNote that this tool is built on the storage interface, so it does not require a Spark\nenvironment.\n\nIt writes shuffle output and then reads it back using the configured arguments.  
See the\nconfiguration details below:\n\n* `-h` or `--help`: display the usage\n* `-f` or `--factory`: specify the name of the storage factory\n* `-i` or `--shuffleId`: the test shuffle ID, default to 1\n* `-t` or `--tasks`: the number of concurrent tasks, default to 5\n* `-m` or `--mappers`: the number of mappers, default to 10\n* `-r` or `--reducers`: the number of reducers, default to 10\n* `-d` or `--data`: the number of data blocks, default to 1K\n* `-b` or `--blockSize`: the block/buffer size of each data block,\n  default to 256K\n* `-o` or `--overwrite`: overwrite existing outputs\n\nSample command:\n```\njava -cp target/splash-shaded.jar com.memverge.splash.ShufflePerfTool \n-d 64 -m 200 -r 200 -t 8 -o\n```\n\nSample output\n```\noverwrite, removing existing shuffle for shuffleTest-1                                        \n==========================================                                                    \nWriting 200 shuffle with 8 threads: 100% (200/200)                                            \nWrite shuffle data completed in 7440 milliseconds                                             \n    Reading index file:  0 ms                                                                 \n    storage factory:     com.memverge.splash.shared.SharedFSFactory                           \n    shuffle folder:      \\tmp\\splash\\shuffleTest-1\\shuffle \n    number of mappers:   200                                                                  \n    number of reducers:  200                                                                  \n    total shuffle size:  3GB                                                                  \n    bytes written:       3GB                                                                  \n    bytes read:          0B                                                                   \n    number of blocks:    64                                                                   \n    blocks size:         256KB    
                                                            \n    partition size:      81KB                                                                 \n    concurrent tasks:    8                                                                    \n    bandwidth:           430MB/s                                                              \n                                                                                              \n==========================================                                                    \nReading 40000 partitions with 8 threads   100% (40000/40000)                                   \nRead shuffle data completed in 35525 milliseconds                                             \n    Reading index file:  15907 ms                                                             \n    storage factory:     com.memverge.splash.shared.SharedFSFactory                           \n    shuffle folder:      \\tmp\\splash\\shuffleTest-1\\shuffle \n    number of mappers:   200                                                                  \n    number of reducers:  200                                                                  \n    total shuffle size:  3GB                                                                  \n    bytes written:       3GB                                                                  \n    bytes read:          3GB                                                                  \n    number of blocks:    64                                                                   \n    blocks size:         256KB                                                                \n    partition size:      81KB                                                                 \n    concurrent tasks:    8                                                                    \n    bandwidth:           90MB/s                                                               
\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmemverge%2Fsplash","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmemverge%2Fsplash","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmemverge%2Fsplash/lists"}