{"id":20984455,"url":"https://github.com/abhioncbr/spark-notes","last_synced_at":"2025-03-13T10:45:07.279Z","repository":{"id":76998245,"uuid":"149445576","full_name":"abhioncbr/spark-notes","owner":"abhioncbr","description":"Notes/blogs/tutorials/talks around basics \u0026 better optimzed usage of Apache Spark ","archived":false,"fork":false,"pushed_at":"2018-09-25T00:17:26.000Z","size":8,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-01-20T07:13:37.542Z","etag":null,"topics":["apache-spark","catalyst-optimizer","joins","scheduler","shuffle"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/abhioncbr.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-09-19T12:14:26.000Z","updated_at":"2021-07-10T21:40:57.000Z","dependencies_parsed_at":null,"dependency_job_id":"4b351fa3-cf5b-4e2a-ae38-2f3d01e74b72","html_url":"https://github.com/abhioncbr/spark-notes","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abhioncbr%2Fspark-notes","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abhioncbr%2Fspark-notes/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abhioncbr%2Fspark-notes/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abhioncbr%2Fspark-notes/manifests","owner_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/owners/abhioncbr","download_url":"https://codeload.github.com/abhioncbr/spark-notes/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243392297,"owners_count":20283560,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","catalyst-optimizer","joins","scheduler","shuffle"],"created_at":"2024-11-19T05:53:35.092Z","updated_at":"2025-03-13T10:45:07.257Z","avatar_url":"https://github.com/abhioncbr.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# spark-notes [Notes/blogs/tutorials/talks around basics \u0026amp; better optimized usage of Apache Spark]\n\n1. Talk on tuning Spark for large jobs by Facebook engineers: [Tuning Apache Spark for Large Scale Workloads - Sital Kedia \u0026 Gaoxiang Liu -2017](https://www.youtube.com/watch?v=5dga0UT4RI8)\n\n2. 
Talk on the Catalyst \u0026 Tungsten frameworks by Sameer Agarwal: [SparkSQL: A Compiler from Queries to RDDs: Spark Summit East talk by Sameer Agarwal](https://www.youtube.com/watch?v=AoVmgzontXo)\n  * Key takeaways:\n    * The Catalyst framework's role in converting SQL queries to a logical plan and then a physical plan, applying transformations to produce an optimized query plan.\n    * Introduction to the Tungsten framework, covering the [Volcano Iterator Model](https://paperhub.s3.amazonaws.com/dace52a42c07f7f8348b08dc2b186061.pdf) and a brief introduction to [Whole-Stage Codegen](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/293651311471490/5382278320999420/latest.html)\n\n3. Talk on SQL table bucketing support: [Hive Bucketing in Apache Spark - Tejas Patil](https://www.youtube.com/watch?v=6BD-Vv-ViBw)\n  * Key takeaways:\n    * Introduction to bucketing and how it helps reduce the sorting \u0026 shuffling operations planned by the Spark SQL planner.\n    * Comparison of bucketing support in Hive \u0026 Spark, and the various JIRA tickets around Spark SQL optimization for bucketing.\n\n4. Talk on Spark's memory model \u0026 data-aware cache: [A Developer’s View into Spark's Memory Model - Wenchen Fan](https://www.youtube.com/watch?v=-Aq1LMpzaKw)\n\n5. Talk on Spark's cost-based optimizer: [Cost Based Optimizer in Apache Spark 2 2 - Ron Hu \u0026 Sameer Agarwal](https://www.youtube.com/watch?v=qS_aS99TjCM)\n\n6. Spark Scheduler: [Apache Spark Scheduler](https://www.youtube.com/watch?v=z83CvasZEzM)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabhioncbr%2Fspark-notes","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabhioncbr%2Fspark-notes","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabhioncbr%2Fspark-notes/lists"}