Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/kwartile/spark-benchmark
Spark Benchmark suite to evaluate cluster configuration and compare the performance with other big data frameworks.
https://github.com/kwartile/spark-benchmark
apache-spark benchmark benchmarking-suite cdh cloudera-hadoop hadoop hive impala performance scala spark
Last synced: about 1 month ago
JSON representation
Spark Benchmark suite to evaluate cluster configuration and compare the performance with other big data frameworks.
- Host: GitHub
- URL: https://github.com/kwartile/spark-benchmark
- Owner: kwartile
- Created: 2017-05-23T20:00:03.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2017-05-26T20:34:01.000Z (over 7 years ago)
- Last Synced: 2023-09-14T15:54:24.770Z (over 1 year ago)
- Topics: apache-spark, benchmark, benchmarking-suite, cdh, cloudera-hadoop, hadoop, hive, impala, performance, scala, spark
- Language: Scala
- Size: 28.3 KB
- Stars: 2
- Watchers: 4
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
## Spark Benchmark
#### Overview
Spark Benchmark suite helps you evaluate Spark cluster configuration. This benchmark can also be used to compare the speed, throughput, and resource usage of Spark jobs with other big data frameworks such as Impala and Hive. It contains a set of Spark RDD based operations that performs map, filter, reduceByKey, and join operations.#### Data
The benchmark uses the dataset used for Impala performance measurement (http://docs.aws.amazon.com/emr/latest/DeveloperGuide/query-impala-generate-data.html). The dataset consists of three different files:
* Books
* Customers
* Transactions```
> head books
0|5-54687-602-6|FOREIGN-LANGUAGE-STUDY|1989-09-29|Saraiva|83.99
1|7-20527-497-2|PHILOSOPHY|1999-09-14|Kyowon|40.99
2|8-98211-350-2|JUVENILE-NONFICTION|1975-01-11|Wolters Kluwer|173.99
3|6-52228-529-3|MATHEMATICS|2010-06-26|Bungeishunju|24.99
4|8-98702-825-4|HUMOR|1990-07-15|China Publishing Group Corporate|64.99
5|3-11023-371-2|LITERARY-CRITICISM|1971-06-04|AST|137.99
``````
> head customers
Customers
0|Sophia PERKINS|1975-11-18|F|OK|[email protected]|963-341-4876
1|Brianna MURRAY|2001-11-02|F|MT|[email protected]|260-164-6277
2|James SCOTT|1997-09-17|M|UT|[email protected]|920-899-8587
3|Samuel GREEN|2013-05-22|F|CO|[email protected]|263-707-8321
4|Logan COLEMAN|1997-12-10|F|NE|[email protected]|333-318-5685
5|Matthew BENNETT|1975-12-26|M|CO|[email protected]|717-808-3733
6|Jace SPENCER|2013-10-30|M|KS|[email protected]|448-105-3939
``````
> head transactions
0|29948726|124004825|21|2000-10-03 12:08:37
1|76896577|10225228|17|2001-04-23 15:21:18
2|77394742|62037151|23|2008-02-22 11:52:36
3|23558280|21960491|29|2000-06-22 10:14:48
4|5742930|73207419|15|2004-11-26 00:46:53
5|101531051|122609274|13|2008-01-14 05:26:46
```
#### Benchmark
The benchmarks contains four tests:
RDDScan: RDDScan reads the customer file and performs a filter operation. It is equivalent to a select-where statement in SQL.
RDDAggregate: RDDAgregate operation scans the books file and perform reduceByKey to aggregate count of books by category. It then sorts the results based on the book count.
RDDTwoWayJoin: This operation performs a join between books and transactions between 2008 and 2010, aggregates the results on book category, and returns sorted results based on the total transaction amount.
RDDThreeWayJoin: This operation is similar to the above except we perform addition join with the customer table and filter the results on three states#### Supported Platform
The pom file currently includes support for Spark 1.6 on CDH 5.8. But this can be easily modified to run on Spark 2.x and other version of Cloudera, HortonWorks, and Apache distribution.#### Build
Use the standard maven command ```mvn package``` to build.#### Run
* Generate Data: Download the DBGen utlity from http://docs.aws.amazon.com/emr/latest/DeveloperGuide/query-impala-generate-data.html. Follow the instructions to generate data.
* Copy generated data to HDFS.
* Run the following command to run the benchmark. You may need to adjust the memory parameters to tune the job.
```
spark-submit --class com.kwartile.benchmark.spark. --master yarn --executor-memory --executor-cores --num-executors --conf spark.yarn.executor.memoryOverhead= perf-benchmark-1.0-SNAPSHOT-jar-with-dependencies.jar --input-path
```
#### Hive and Impala Query
You can use the following equivalent Hive/Impala query to compare the performance.```
# Scan Query
SELECT COUNT(*)
FROM customers256gb
WHERE name = 'Scarlett STEVENS';
``````
# Aggregation
SELECT category, count(*) cnt
FROM books256gb
GROUP BY category
ORDER BY cnt DESC LIMIT 10;
``````
# Two Way Join
SELECT tmp.book_category, ROUND(tmp.revenue, 2) AS revenue
FROM (
SELECT books256gb.category AS book_category, SUM(books256gb.price * transactions256gb.quantity) AS revenue
FROM books256gb JOIN transactions256gb ON (
transactions256gb.book_id = books256gb.id
AND YEAR(transactions256gb.transaction_date) BETWEEN 2008 AND 2010
)
GROUP BY books256gb.category
) tmp
ORDER BY revenue DESC LIMIT 10;
``````
# Three Way Join
SELECT tmp.book_category, ROUND(tmp.revenue, 2) AS revenue
FROM (
SELECT books256gb.category AS book_category, SUM(books256gb.price * transactions256gb.quantity) AS revenue
FROM books256gb
JOIN transactions256gb ON (
transactions256gb.book_id = books256gb.id
)
JOIN customers256gb ON (
transactions256gb.customer_id = customers256gb.id
AND customers256gb.state IN ('WA', 'CA', 'NY')
)
GROUP BY books256gb.category
) tmp
ORDER BY revenue DESC LIMIT 10;
```
### Questions & Feedback
Please contact us as [email protected] for any question or enhancement request.