{"id":20823315,"url":"https://github.com/gorros/spark-scala-tips","last_synced_at":"2026-05-21T03:31:44.246Z","repository":{"id":197540844,"uuid":"142393698","full_name":"gorros/spark-scala-tips","owner":"gorros","description":"A collection of Spark (Scala) tips or best practices based on my experience.","archived":false,"fork":false,"pushed_at":"2019-03-20T11:39:28.000Z","size":15,"stargazers_count":3,"open_issues_count":1,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-12T06:42:32.388Z","etag":null,"topics":["apache-spark"],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gorros.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2018-07-26T05:39:19.000Z","updated_at":"2022-06-28T07:28:40.000Z","dependencies_parsed_at":null,"dependency_job_id":"d603db6e-84f4-4abc-8400-c3f62a2434de","html_url":"https://github.com/gorros/spark-scala-tips","commit_stats":null,"previous_names":["gorros/spark-scala-tips"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/gorros/spark-scala-tips","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gorros%2Fspark-scala-tips","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gorros%2Fspark-scala-tips/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gorros%2Fspark-scala-tips/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gorros%2Fspark-scala-tips/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gorros","downloa
d_url":"https://codeload.github.com/gorros/spark-scala-tips/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gorros%2Fspark-scala-tips/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33287426,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-21T02:57:32.698Z","status":"ssl_error","status_checked_at":"2026-05-21T02:57:31.990Z","response_time":62,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark"],"created_at":"2024-11-17T22:18:04.468Z","updated_at":"2026-05-21T03:31:44.229Z","avatar_url":"https://github.com/gorros.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# spark-scala-tips\nA collection of Spark (Scala) tips or best practices based on my experience.\n\n## Data skew\n### Problem:\nData skew is a condition in which a few partitions contain much more data than the others on average. This can happen when, for example, you process website visits and the data is partitioned by domain name.\nObviously, some sites can have many times more visitors than others. 
As a result, you will have a few large partitions and many small ones.\nAnd even though Spark processes partitions in parallel, you will get poor overall performance, since a stage is not finished until its large partitions are processed.\nTo detect this condition, monitor your application in the Spark Web UI. You will notice if there are tasks that take much longer than the others.\nThis is a sign of data skew. \n### Solution:\nI would say that there is no single solution for all cases. But being aware of data skew is the key.\nIf possible, use another partitioning key which provides a better distribution. This is not always possible, since the processing logic often defines the partitioning. \nVery often data skew occurs when you are joining dataframes, because in this case the data is partitioned by the join field.\n In this case, the best approach is to broadcast the smaller dataframe:\n ```\n import org.apache.spark.sql.functions.broadcast\n val joinedDF = broadcast(df1).join(df2, \"id\")\n ```\n Also, if the partitioning is not dictated by the processing logic, as it is in the case of a join, you can repartition dataframes by a column which provides a more even distribution.\n For example, if we are talking about site visits, then partitioning by domain name, date or even hour may result in data skew. But if you partition by \n minute or second, this will provide a fairly even distribution.\n \n \n## Use schemas for JSONs\n### Problem:\nYou probably know that Spark provides convenient methods to read and write JSON data. \nLet's focus on reading. If you have a dataset consisting of JSONs and you want to load them as a dataframe,\nthen you can do this:\n```\nval df = spark.read.json(\"path/to/jsons\")\n```\nIt is pretty straightforward. And this is probably fine as long as you are 100% sure that all JSONs have the same structure, so your dataframe will have the expected schema. 
\nBut data is not always clean and consistent, so when you process another batch of data there may be JSONs with a different schema (some fields may be missing, for example). \nThis will result in Spark inferring the wrong schema and in further processing issues.\n\n### Solution\nTo avoid this type of issue, we should \"help\" Spark with schema detection by providing the exact schema:\n```\nval df = spark.read.schema(schema).json(\"path/to/jsons\")\n```\nThis way, if a field is missing in a JSON, Spark will fill this field with null instead of omitting it altogether. \nThe schema is especially helpful if you have nested structures. \nAlso, this schema can be used if you have stringified JSONs. You can use the `from_json` function to parse JSON from a string column.\n\n\n## Working with Redshift\nThere is a very nice [library](https://github.com/databricks/spark-redshift) to load/write data from/to Redshift. The tip regarding loading data from Redshift\nis quite short. **After you load the data, persist the dataframe**:\n```\nimport org.apache.spark.storage.StorageLevel\n\nval df = getDfFromRedshift(ss, config)\ndf.persist(StorageLevel.MEMORY_AND_DISK)\n// do processing\ndf.unpersist()\n```\nAs we know, a dataframe does not contain data; it is actually a sequence of transformations which are performed only when an action is triggered. \nThis way, if Spark fails at some stage, it can reconstruct the current state from the initial data source. In our case, the initial data source is Redshift,\nand data is loaded into Spark via the `UNLOAD` command. So if you do not persist (cache) the dataframe and a calculation fails and Spark resubmits it, this will trigger\na new `UNLOAD` and therefore unnecessary load on Redshift (not to mention longer overall processing time).\n\nWhen writing data to Redshift, definitely use `CSV GZIP` as the `tempformat`. 
Here is a nice [benchmark](https://www.stitchdata.com/blog/redshift-database-benchmarks-copy-performance-of-csv-json-and-avro/) confirming that.\n\n## Working with S3\n\nWhile reading files from S3, bear in mind that, depending on the API (__s3a__ or __s3n__), the number of partitions for files on S3 will be different, because the default block sizes differ ([source](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Features)):\n\n```\n\u003cproperty\u003e\n  \u003cname\u003efs.s3a.block.size\u003c/name\u003e\n  \u003cvalue\u003e32M\u003c/value\u003e\n  \u003cdescription\u003eBlock size to use when reading files using s3a: file system.\n  \u003c/description\u003e\n\u003c/property\u003e\n``` \nand\n```\n\u003cproperty\u003e\n  \u003cname\u003efs.s3n.block.size\u003c/name\u003e\n  \u003cvalue\u003e67108864\u003c/value\u003e\n  \u003cdescription\u003eBlock size to use when reading files using the native S3\n  filesystem (s3n: URIs).\u003c/description\u003e\n\u003c/property\u003e\n```\nGenerally, I would suggest using s3a since it is the more recent API. But knowing the number of partitions will help you better configure resource allocation. [Here](http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/) you can find quite a nice example of calculating resources for a Spark application. \n\n### Problem:\nSaving a dataframe as Parquet files to S3 is quite a common use case; however, it appears to be much slower than writing the same dataframe to HDFS. As I understand it, the reason is that Spark creates a `temporary` folder where it stores the initial files and then, after all tasks are finished, moves them to the final destination (usually the folder where `temporary` is located). In the case of HDFS this is achieved by renaming files, but in the case of S3 there is no such operation, so it has to be done as a `copy` and `delete`. Also, it seems this is done by a single thread, so the larger the number of files, the bigger the difference between HDFS and S3 writes. 
\n#### Update\nIn [this](https://aws.amazon.com/blogs/big-data/improve-apache-spark-write-performance-on-apache-parquet-formats-with-the-emrfs-s3-optimized-committer/) blog post AWS describes the above issue and announces an improvement in Spark write performance for Parquet files on S3. From my experience, I still find the solution below faster for a large number of files. Nevertheless, try to use EMR 5.20 or above anyway, since EMR 5.20 also introduced Spark 2.4. \n\n### Solution\nWrite the dataframe to a temporary HDFS folder and later copy it to S3 using [s3-dist-cp](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html). If you run Spark applications on EMR, it is available as an EMR Step or as a command-line command. So you can use the following method to do that:\n```\nimport scala.sys.process._\n\ndef s3distCp(src: String, dest: String): Unit = {\n    s\"s3-dist-cp --src $src --dest $dest\".!\n}\n```\n***Note***: \nTo be able to use this method, you need the Hadoop application to be added to the cluster, and you need to run Spark in client or local mode, since s3-dist-cp is not available on slave nodes. If you want to run in cluster mode, then copy the _s3-dist-cp_ command to the slaves during bootstrap. \n\n(to be continued)\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgorros%2Fspark-scala-tips","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgorros%2Fspark-scala-tips","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgorros%2Fspark-scala-tips/lists"}