{"id":14982405,"url":"https://github.com/tupol/spark-utils","last_synced_at":"2025-08-20T03:31:14.660Z","repository":{"id":41330413,"uuid":"129492390","full_name":"tupol/spark-utils","owner":"tupol","description":"Basic framework utilities to quickly start writing production ready Apache Spark applications","archived":false,"fork":false,"pushed_at":"2024-08-25T13:38:06.000Z","size":9027,"stargazers_count":36,"open_issues_count":5,"forks_count":6,"subscribers_count":5,"default_branch":"main","last_synced_at":"2024-12-07T00:04:40.920Z","etag":null,"topics":["apache-spark","convenience","data-sink","data-source","framework","scala","spark","spark-applications","spark-streaming"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tupol.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":".github/CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":".github/CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-04-14T07:21:37.000Z","updated_at":"2024-08-25T09:51:27.000Z","dependencies_parsed_at":"2024-03-17T12:27:00.601Z","dependency_job_id":"50d4e5a1-c8b4-479f-a596-c8f5a25458f2","html_url":"https://github.com/tupol/spark-utils","commit_stats":{"total_commits":125,"total_committers":4,"mean_commits":31.25,"dds":"0.23199999999999998","last_synced_commit":"8234e3c5ea07a3631c5b41341ed9a201aa54b9c9"},"previous_names":[],"tags_count":21,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tupol%2Fspark-utils","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tupol%2Fspark-utils/t
ags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tupol%2Fspark-utils/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tupol%2Fspark-utils/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tupol","download_url":"https://codeload.github.com/tupol/spark-utils/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230388131,"owners_count":18217755,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","convenience","data-sink","data-source","framework","scala","spark","spark-applications","spark-streaming"],"created_at":"2024-09-24T14:05:21.521Z","updated_at":"2024-12-19T06:09:27.030Z","avatar_url":"https://github.com/tupol.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Spark Utils #\n\n[![Maven Central](https://img.shields.io/maven-central/v/org.tupol/spark-utils_2.12.svg)][maven-central] \u0026nbsp;\n[![GitHub](https://img.shields.io/github/license/tupol/spark-utils.svg)][license] \u0026nbsp;\n[![Travis (.org)](https://img.shields.io/travis/tupol/spark-utils.svg)][travis.org] \u0026nbsp;\n[![Codecov](https://img.shields.io/codecov/c/github/tupol/spark-utils.svg)][codecov] \u0026nbsp;\n[![Javadocs](https://www.javadoc.io/badge/org.tupol/spark-utils_2.12.svg)][javadocs] \u0026nbsp;\n[![Gitter](https://badges.gitter.im/spark-utils/spark-utils.svg)][gitter] \u0026nbsp;\n[![Twitter](https://img.shields.io/twitter/url/https/_tupol.svg?color=%2317A2F2)][twitter] \u0026nbsp;\n\n\n## 
Motivation ##\n\nOne of the biggest challenges after taking the first steps into the world of writing\n[Apache Spark][Spark] applications in [Scala][scala] is taking them to production.\n\nAn application of any kind needs to be easy to run and easy to configure.\n\nThis project is trying to help developers write Spark applications focusing mainly on the \napplication logic rather than the details of configuring the application and setting up the \nSpark context.\n\nThis project is also trying to create and encourage a friendly yet professional environment \nfor developers to help each other, so please do not be shy and join through [gitter], [twitter], \n[issue reports](https://github.com/tupol/spark-utils/issues/new/choose) or pull requests.\n\n\n## ATTENTION!\n\nAt the moment there are a lot of changes happening to the `spark-utils` project, hopefully for the better.\n\nThe latest stable versions, available through Maven Central are\n- Spark 2.4: `0.4.2` to `0.6.2`\n- Spark 3.0: `0.6.2` to `1.0.0-RC6`\n- Spark \u003e= 3.3.0: `1.0.0-RC7` +\n\nThe development version is `1.0.0-RC6`, which brings a clean separation between the configuration implementation and the\ncore, and adds a [PureConfig] based configuration module that brings the power and features of [PureConfig]\nto increase productivity even further and allow for a more mature configuration framework.\n\nThe new modules are:\n- `spark-utils-io-pureconfig` for the new [PureConfig] implementation\n\nWe completely removed the legacy scalaz based configuration framework.\n\nWe suggest considering the new `spark-utils-io-pureconfig` module for future projects.\n\nMigrating to the new `1.0.0-RC6` is quite easy, as the configuration structure was mainly preserved.\nMore details are available in the [RELEASE-NOTES](RELEASE-NOTES.md).\n\nFor now, some of the documentation related to or referenced from this project might be obsolete or outdated,\nbut as the project gets closer to the final release, there 
will be more improvements. \n\n### Test Results Matrix\n\n| Spark | Scala 2.12 | Scala 2.13 | Report 1.0.0-RC6                                    | Report 1.0.0-RC7                                    |\n|-------|:----------:|:----------:|-----------------------------------------------------|-----------------------------------------------------|\n| 3.0.3 |    YES     |    N/A     | [3.0.3](docs/test-results/test_1.0.0-RC6_3.0.3.out) | N/A                                                 |\n| 3.1.3 |    YES     |    N/A     | [3.1.3](docs/test-results/test_1.0.0-RC6_3.1.3.out) | N/A                                                 |\n| 3.2.4 |    YES     |    YES     | [3.2.4](docs/test-results/test_1.0.0-RC6_3.2.4.out) | N/A                                                 |\n| 3.3.4 |    YES     |    YES     | [3.3.4](docs/test-results/test_1.0.0-RC6_3.3.4.out) | [3.3.4](docs/test-results/test_1.0.0-RC7_3.3.4.out) |\n| 3.4.2 |    YES     |    YES     | [3.4.2](docs/test-results/test_1.0.0-RC6_3.4.2.out) | [3.4.2](docs/test-results/test_1.0.0-RC7_3.4.2.out) |\n| 3.5.1 |    YES     |    YES     | [3.5.1](docs/test-results/test_1.0.0-RC6_3.5.1.out) | [3.5.1](docs/test-results/test_1.0.0-RC7_3.5.1.out) |\n\n\n## Description ##\n\nThis project contains some basic utilities that can help set up an Apache Spark application project.\n\nThe main point is the simplicity of writing Apache Spark applications just focusing on the logic,\nwhile providing for easy configuration and argument passing.\n\nThe code sample below shows how easy it can be to write a file format converter from any acceptable \ntype, with any acceptable parsing configuration options to any acceptable format.\n\n### Batch Application\n\n```scala\nimport org.tupol.spark._\n\nobject FormatConverterExample extends SparkApp[FormatConverterContext, DataFrame] {\n  override def createContext(config: Config) = FormatConverterContext.extract(config)\n  override def run(implicit spark: SparkSession, context: 
FormatConverterContext): Try[DataFrame] = {\n    val inputData = spark.source(context.input).read\n    inputData.sink(context.output).write\n  }\n}\n```\n\nOptionally, the `SparkFun` can be used instead of `SparkApp` to make the code even more concise.\n\n```scala\nimport org.tupol.spark._\n\nobject FormatConverterExample extends \n          SparkFun[FormatConverterContext, DataFrame](FormatConverterContext.extract) {\n  override def run(implicit spark: SparkSession, context: FormatConverterContext): Try[DataFrame] = \n    spark.source(context.input).read.sink(context.output).write\n}\n```\n\n### Configuration\n\nCreating the configuration can be as simple as defining a case class to hold the configuration and\na factory, that helps extract simple and complex data types like input sources and output sinks.\n\n```scala\nimport org.tupol.spark.io._\n\ncase class FormatConverterContext(input: FormatAwareDataSourceConfiguration,\n                                  output: FormatAwareDataSinkConfiguration)\n```\n\nThere are multiple ways that the context can be easily created from configuration files.\nThis project proposes two ways:\n- the new [PureConfig] based framework\n- the legacy ScalaZ based framework\n\n#### Configuration creation based on PureConfig\n\n```scala\nimport com.typesafe.config.Config\n\nobject FormatConverterContext {\n  import pureconfig.generic.auto._\n  import org.tupol.spark.io.pureconf._\n  import org.tupol.spark.io.pureconf.readers._\n  def extract(config: Config): Try[FormatConverterContext] = config.extract[FormatConverterContext]\n}\n```\n\n### Streaming Application\n\nFor structured streaming applications the format converter might look like this:\n\n```scala\nobject StreamingFormatConverterExample extends SparkApp[StreamingFormatConverterContext, DataFrame] {\n  override def createContext(config: Config) = StreamingFormatConverterContext.extract(config)\n  override def run(implicit spark: SparkSession, context: 
StreamingFormatConverterContext): Try[DataFrame] = {\n    val inputData = spark.source(context.input).read\n    inputData.streamingSink(context.output).write.awaitTermination()\n  }\n}\n```\n\n### Configuration\n\nThe streaming configuration can be as simple as the following:\n\n```scala\nimport org.tupol.spark.io.streaming.structured._\n\ncase class StreamingFormatConverterContext(input: FormatAwareStreamingSourceConfiguration, \n                                           output: FormatAwareStreamingSinkConfiguration)\n```\n\n#### Configuration creation based on PureConfig\n\n```scala\nobject StreamingFormatConverterContext {\n  import com.typesafe.config.Config\n  import pureconfig.generic.auto._\n  import org.tupol.spark.io.pureconf._\n  import org.tupol.spark.io.pureconf.streaming.structured._\n  def extract(config: Config): Try[StreamingFormatConverterContext] = config.extract[StreamingFormatConverterContext]\n}\n```\n\nThe [`SparkRunnable`](docs/spark-runnable.md) and [`SparkApp`](docs/spark-app.md) or \n[`SparkFun`](docs/spark-fun.md) together with the \n[configuration framework](https://github.com/tupol/scala-utils/blob/master/docs/configuration-framework.md)\nprovide for easy Spark application creation with configuration that can be managed through \nconfiguration files or application parameters.\n\nThe IO frameworks for [reading](docs/data-source.md) and [writing](docs/data-sink.md) data frames \nadd extra convenience for setting up batch and structured streaming jobs that transform \nvarious types of files and streams.\n\nLast but not least, there are many utility functions that provide convenience for loading \nresources, dealing with schemas and so on.\n\nMost of the common features are also implemented as *decorators* to main Spark classes, like\n`SparkContext`, `DataFrame` and `StructType` and they are conveniently available by importing \nthe `org.tupol.spark.implicits._` package.\n\n\n## Documentation ##\nThe documentation for the 
main utilities and frameworks available:\n- [SparkApp](docs/spark-app.md), [SparkFun](docs/spark-fun.md) and [SparkRunnable](docs/spark-runnable.md)\n- [DataSource Framework](docs/data-source.md) for both batch and structured streaming applications\n- [DataSink Framework](docs/data-sink.md) for both batch and structured streaming applications\n\nLatest stable API documentation is available [here](https://www.javadoc.io/doc/org.tupol/spark-utils_2.12/0.4.2).\n\nAn extensive tutorial and walk-through can be found [here](https://github.com/tupol/spark-utils-demos/wiki).\nExtensive samples and demos can be found [here](https://github.com/tupol/spark-utils-demos).\n\nA nice example on how this library can be used can be found in the\n[`spark-tools`](https://github.com/tupol/spark-tools) project, through the implementation\nof a generic format converter and a SQL processor for both batch and structured streams.\n\n\n## Prerequisites ##\n\n* Java 8 or higher\n* Scala 2.12\n* Apache Spark 3.0.X\n\n\n## Getting Spark Utils ##\n\nSpark Utils is published to [Maven Central][maven-central] and [Spark Packages][spark-packages]:\n\n- Group id / organization: `org.tupol`\n- Artifact id / name: `spark-utils`\n- Latest stable versions:\n  - Spark 2.4: `0.4.2` to `0.6.2`\n  - Spark 3.0: `0.6.2` to `1.0.0-RC7`\n  - Spark 3.3: `1.0.0-RC7` to\n\nUsage with SBT, adding a dependency to the latest version of tools to your sbt build definition file:\n\n```scala\nlibraryDependencies += \"org.tupol\" %% \"spark-utils-io-pureconfig\" % \"1.0.0-RC6\"\n```\n\nInclude this package in your Spark Applications using `spark-shell` or `spark-submit`\n```bash\n$SPARK_HOME/bin/spark-shell --packages org.tupol:spark-utils_2.12:1.0.0-RC6\n```\n\n\n## Starting a New **`spark-utils`** Project ##\n\n***Note*** [spark-utils-g8] was not yet updated for the 1.x version.\n\nThe simplest way to start a new `spark-utils` is to make use of the \n[`spark-apps.seed.g8`][spark-utils-g8] template project.\n\nTo fill 
in manually the project options run\n```\ng8 tupol/spark-apps.seed.g8\n```\n\nThe default options look like the following:\n```\nname [My Project]:\nappname [My First App]:\norganization [my.org]:\nversion [0.0.1-SNAPSHOT]:\npackage [my.org.my_project]:\nclassname [MyFirstApp]:\nscriptname [my-first-app]:\nscalaVersion [2.12.12]:\nsparkVersion [3.2.1]:\nsparkUtilsVersion [0.4.0]:\n```\n\n\nTo fill in the options in advance\n```\ng8 tupol/spark-apps.seed.g8 --name=\"My Project\" --appname=\"My App\" --organization=\"my.org\" --force\n```\n\n\n## What's new? ##\n\n**1.0.0-RC7**\n\n- Adapt towards the latest Apache Spark versions from 3.3.x\n- Added `StreamingTrigger.AvailableNow`\n- Build with Spark 3.3.x and tested against Spark 3.3.0 to 3.5.1 \n\n**1.0.0-RC1 to 1.0.0-RC6**\n\nMajor library redesign\n- Cross compile Scala 2.12 and 2.13\n- Building with JDK 17 targeting Java 8\n- Added test java options to handle the JDK 17\n- Build with Spark 3.2.x and tested against Spark 3.x\n- Removed the legacy scalaz based configuration module\n- Added configuration module based on [PureConfig]\n- `DataSource` exposes `reader` in addition to `read`\n- `DataSink` and `DataAwareSink` expose `writer` in addition to `write`\n- Added `SparkSessionOps.streamingSource`\n- Refactored `TypesafeConfigBuilder`, which has two implementations now: `SimpleTypesafeConfigBuilder` and `FuzzyTypesafeConfigBuilder`\n- Small improvements to `SharedSparkSession`\n- Documentation improvements  \n\n\n**0.6.2**\n\n- Fixed `core` dependency to `scala-utils`; now using `scala-utils-core`\n- Refactored the `core`/`implicits` package to make the *implicits* a little more *explicit*\n\n\nFor previous versions please consult the [release notes](RELEASE-NOTES.md).\n\n\n## License ##\n\nThis code is open source software licensed under the [MIT License](LICENSE).\n\n[scala]: https://scala-lang.org/\n[spark]: https://spark.apache.org/\n[spark-utils-g8]: 
https://github.com/tupol/spark-apps.seed.g8\n[maven-central]: https://mvnrepository.com/artifact/org.tupol/spark-utils-core\n[spark-packages]: https://spark-packages.org/package/tupol/spark-utils\n[license]: https://github.com/tupol/spark-utils/blob/master/LICENSE\n[travis.org]: https://travis-ci.com/tupol/spark-utils \n[codecov]: https://codecov.io/gh/tupol/spark-utils\n[javadocs]: https://www.javadoc.io/doc/org.tupol/spark-utils_2.12\n[gitter]: https://gitter.im/spark-utils/spark-utils\n[twitter]: https://twitter.com/_tupol\n[PureConfig]: https://pureconfig.github.io/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftupol%2Fspark-utils","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftupol%2Fspark-utils","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftupol%2Fspark-utils/lists"}