https://github.com/alibaba/SparkCube
SparkCube is an open-source project for extremely fast OLAP data analysis. SparkCube is an extension of Apache Spark.
- Host: GitHub
- URL: https://github.com/alibaba/SparkCube
- Owner: alibaba
- License: apache-2.0
- Created: 2020-03-16T08:51:09.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2023-03-06T14:16:24.000Z (about 3 years ago)
- Last Synced: 2024-11-05T04:34:03.478Z (over 1 year ago)
- Language: Scala
- Homepage:
- Size: 154 KB
- Stars: 130
- Watchers: 12
- Forks: 52
- Open Issues: 6
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
- awesome-java - SparkCube
README
# SparkCube 
SparkCube is an open-source project for extremely fast OLAP data analysis. SparkCube is an extension of [Apache Spark](http://spark.apache.org).
## Build from source
```
mvn -DskipTests package
```
The default Spark version used is 2.4.4.
## Run tests
```
mvn test
```
## Use with Apache Spark
Add the following settings to your Spark configuration.
| config | value | comment | required |
| ---- | ---- | ---- | ---- |
| spark.sql.extensions | com.alibaba.sparkcube.SparkCube | Registers the SparkCube extension. | Required |
| spark.sql.cache.tab.display | true | Shows the web UI tab in the application, typically Spark Thriftserver. | Required |
| spark.sql.cache.useDatabase | db1,db2,dbn | Comma-separated list of database names. Only tables and views from these databases are considered for cube building. | Required |
| spark.sql.cache.cacheByPartition | true/false | Whether to store the cache by partition. | Optional |
| spark.driver.extraClassPath | /path/to/this/jar | Needed for web UI resources. | Required |
With the configurations above set in your Spark Thriftserver, a "Cube Management" tab appears in the Thriftserver UI after any `SELECT` command has run. You can then create, delete, and build cubes from this web page.
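As a minimal sketch, the required settings from the table above could be collected in a properties file and passed to the Thriftserver at startup. The file name `sparkcube.conf`, the database names `db1,db2`, and the jar path are placeholders, and `SPARK_HOME` is assumed to point at a Spark 2.4.x installation. Adjust all of them for your environment.

```shell
# Write the required SparkCube settings to a properties file.
# The file name, database names, and jar path are placeholders.
cat > sparkcube.conf <<'EOF'
spark.sql.extensions          com.alibaba.sparkcube.SparkCube
spark.sql.cache.tab.display   true
spark.sql.cache.useDatabase   db1,db2
spark.driver.extraClassPath   /path/to/this/jar
EOF

# Start the Thriftserver with those settings, if SPARK_HOME is set.
# In Spark 2.4, a file given with --properties-file is read instead of
# conf/spark-defaults.conf, so merge in any other settings you rely on.
if [ -n "${SPARK_HOME:-}" ]; then
  "$SPARK_HOME/sbin/start-thriftserver.sh" --properties-file sparkcube.conf
fi
```

The optional `spark.sql.cache.cacheByPartition` setting is left out here. Add it to the same file if you want the cache stored by partition.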
After you have created an appropriate cube, you can query it from any spark-sql client using Spark SQL. A cube can be created on a table or on a view, so you can join tables in a view to build a more complex cube.
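To illustrate the view-based approach, the sketch below writes a SQL file that defines a view joining two tables. The tables `db1.orders` and `db1.customers`, their columns, and the view name `db1.order_facts` are hypothetical. The only assumption taken from the configuration table is that `db1` is listed in `spark.sql.cache.useDatabase`.

```shell
# Hypothetical schema: db1.orders and db1.customers are illustrative only.
# db1 is assumed to be listed in spark.sql.cache.useDatabase, so the view
# can be considered for cube building.
cat > create_view.sql <<'EOF'
CREATE VIEW IF NOT EXISTS db1.order_facts AS
SELECT o.order_id, o.amount, c.region
FROM db1.orders o
JOIN db1.customers c ON o.customer_id = c.customer_id;
EOF

# Run the file with any Spark SQL client, for example:
#   spark-sql -f create_view.sql
```

Once the view exists, it can be chosen as the source when creating a cube from the "Cube Management" page.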
For a more detailed tutorial on creating, building, and dropping cubes, see:
https://help.aliyun.com/document_detail/149293.html
## Learning materials
(Slides)
https://www.slidestalk.com/AliSpark/SparkRelationalCache78971
https://www.slidestalk.com/AliSpark/SparkRelationalCache2019_57927
(Blogs)
https://yq.aliyun.com/articles/703046
https://yq.aliyun.com/articles/703154
https://yq.aliyun.com/articles/713746
https://yq.aliyun.com/articles/725413
(Blogs In English)
https://community.alibabacloud.com/blog/rewriting-the-execution-plan-in-the-emr-spark-relational-cache_595267
https://www.alibabacloud.com/blog/use-emr-spark-relational-cache-to-synchronize-data-across-clusters_595301
https://www.alibabacloud.com/blog/using-data-preorganization-for-faster-queries-in-spark-on-emr_595599