https://github.com/abhioncbr/spark-notes

Notes/blogs/tutorials/talks around basics & better optimzed usage of Apache Spark
https://github.com/abhioncbr/spark-notes

apache-spark catalyst-optimizer joins scheduler shuffle

Last synced: 7 months ago
JSON representation

Notes/blogs/tutorials/talks around basics & better optimzed usage of Apache Spark

Host: GitHub
URL: https://github.com/abhioncbr/spark-notes
Owner: abhioncbr
Created: 2018-09-19T12:14:26.000Z (about 7 years ago)
Default Branch: master
Last Pushed: 2018-09-25T00:17:26.000Z (about 7 years ago)
Last Synced: 2025-01-20T07:13:37.542Z (9 months ago)
Topics: apache-spark, catalyst-optimizer, joins, scheduler, shuffle
Homepage:
Size: 7.81 KB
Stars: 2
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# spark-notes [Notes/blogs/tutorials/talks around basics & better optimzed usage Apache Spark]

1. Talk on tuning of Spark for big jobs by FB guys: [Tuning Apache Spark for Large Scale Workloads - Sital Kedia & Gaoxiang Liu -2017](https://www.youtube.com/watch?v=5dga0UT4RI8)

2. Talk on Catalyst & Tungsten framework by Sameer Agarwal: [SparkSQL: A Compiler from Queries to RDDs: Spark Summit East talk by Sameer Agarwal](https://www.youtube.com/watch?v=AoVmgzontXo)
* Key takeaways:
* Catalyst framework role in converting sql queries to logical plan to physical plan and than applying transformation to produce optimized query plan.
* Tungsten Framework introduction, sharing concept of [Volcana Iterator Model](https://paperhub.s3.amazonaws.com/dace52a42c07f7f8348b08dc2b186061.pdf) and brief introduction of [Whole-Stage Codegen](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/293651311471490/5382278320999420/latest.html)

3. Talk on Sql table bucketing support: [Hive Bucketing in Apache Spark - Tejas Patil](https://www.youtube.com/watch?v=6BD-Vv-ViBw)
* Key takeaways:
* Introduction to bucketing and how it is helping in reducing sortinf & shuffling operations done by spark sql planner.
* Comparsion of bucketing support in hive & spark and various jira's tickets around spark sql optimization for bucketing.

4. Talk on Spark's memory model & data aware cache: [A Developer’s View into Spark's Memory Model - Wenchen Fan](https://www.youtube.com/watch?v=-Aq1LMpzaKw)

5. Talk on Spark's cost based optimizer: [Cost Based Optimizer in Apache Spark 2 2 - Ron Hu & Sameer Agarwal](https://www.youtube.com/watch?v=qS_aS99TjCM)

6. Spark Scheduler: [Apache Spark Scheduler](https://www.youtube.com/watch?v=z83CvasZEzM)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/abhioncbr/spark-notes

Awesome Lists containing this project

README