{"id":18810402,"url":"https://github.com/absaoss/jdbc2s","last_synced_at":"2025-04-13T20:30:58.163Z","repository":{"id":37761689,"uuid":"237987723","full_name":"AbsaOSS/Jdbc2S","owner":"AbsaOSS","description":"A JDBC streaming source for Spark","archived":false,"fork":false,"pushed_at":"2024-02-19T09:29:39.000Z","size":121,"stargazers_count":8,"open_issues_count":8,"forks_count":2,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-04-12T07:05:53.675Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AbsaOSS.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-02-03T14:42:39.000Z","updated_at":"2023-05-29T14:46:53.000Z","dependencies_parsed_at":"2022-08-31T08:41:52.891Z","dependency_job_id":null,"html_url":"https://github.com/AbsaOSS/Jdbc2S","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2FJdbc2S","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2FJdbc2S/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2FJdbc2S/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2FJdbc2S/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AbsaOSS","download_url":"https://codeload.github.com/AbsaOSS/Jdbc2S/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223603279,"owners_count":17172073,"icon_url":
"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-07T23:20:05.916Z","updated_at":"2024-11-07T23:20:06.463Z","avatar_url":"https://github.com/AbsaOSS.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"    Copyright 2020 ABSA Group Limited\n    \n    Licensed under the Apache License, Version 2.0 (the \"License\");\n    you may not use this file except in compliance with the License.\n    You may obtain a copy of the License at\n    \n        http://www.apache.org/licenses/LICENSE-2.0\n    \n    Unless required by applicable law or agreed to in writing, software\n    distributed under the License is distributed on an \"AS IS\" BASIS,\n    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n    See the License for the specific language governing permissions and\n    limitations under the License.\n\n# Jdbc2S - JDBC Streaming Source for Spark\n\nSupport for multiple data types(e.g. dates, ints, doubles, etc) as offset trackers.\n\nCurrently only supports Spark DataSourceV1.\n\nWill be expanded to support DataSourceV2 in the future.\n\n### Coordinates for Maven POM dependency\n#### Jdbc2S for Scala 2.11\n[![Maven Central](https://maven-badges.herokuapp.com/maven-central/za.co.absa/jdbc2s_2.11/badge.svg)](https://search.maven.org/artifact/za.co.absa/jdbc2s_2.11/1.0.0/jar)\n\n## Motivation\n\nStreaming data from RDBMS is not very usual, and when it happens, it is usually through [Change Data Capture(CDC)](https://en.wikipedia.org/wiki/Change_data_capture).\n\nHowever, in some cases (e.g. 
legacy systems), an RDBMS has to be used as the source of a data\nstreaming pipeline, e.g. data ingested from mainframes into databases on an hourly basis.\n\nSpark is a popular stream processing engine, but it only supports RDBMS sources in batch mode, through a JDBC data source.\n\nThis project brings the same capabilities available on Spark JDBC batch DataFrames to the streaming world.\n\n\n## Features\n\n### Support for multiple data types as offset trackers\nSupport for multiple data types as offset trackers means that any data type can be used as an offset (date, string, int, double, custom, etc.).\n\nIMPORTANT: updates and deletions will only be identified if they also advance the offset field.\n\n#### Caveats\nThe field must be convertible to a string representation and must also be increasing, since the comparison\nbetween two offsets is not done using the `\u003c` or `\u003c=` operators but the `!=` one.\n\nThe queries, however, are done using `\u003e`, `\u003e=`, `\u003c` and `\u003c=`. The upper bound is always inclusive. The lower bound is\ninclusive when the 'start' argument is empty and exclusive whenever it is not.\n\nAs an example, the query below is executed when Spark invokes [getBatch(None,Offset)](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Source.scala#L61):\n\n```sql\nSELECT fields FROM TABLE WHERE offsetField \u003e= start_offset AND offsetField \u003c= end_offset\n```\n\nbut when the start offset is defined, i.e. 
[getBatch(Offset,Offset)](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Source.scala#L61):\n\n```sql\nSELECT fields FROM TABLE WHERE offsetField \u003e start_offset AND offsetField \u003c= end_offset\n```\n\n\n### Piggybacked on Spark JDBC batch source\nThis source works by wrapping the RDD from a batch DataFrame inside a streaming DataFrame, so there is nothing substantially new.\n\n#### Caveats\nTo simplify the implementation, this source relies on the method `internalCreateDataFrame` from [SQLContext](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L385).\nThat method, however, is package-private, so this source had to be placed in the package `org.apache.spark.sql.execution.streaming.sources`\nto be able to access it.\n\nFor more details, check [this section](https://github.com/AbsaOSS/Jdbc2S/blob/master/src/main/scala/org/apache/spark/sql/execution/streaming/sources/JDBCStreamingSourceV1.scala#L427).\n\n\n### Full support for checkpointing\nThis source supports checkpointing like any other streaming source.\n\n#### Caveats\nSpark requires offsets to have JSON representations so that they can be stored in the Write-Ahead Log in that format.\nWhen the query is restarted, the last committed offset is loaded as an instance of [SerializedOffset](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/SerializedOffset.scala).\n\nAlso, the Spark streaming engine assumes that V1 sources have the `getBatch` method invoked once the checkpointed offset is loaded,\nas explained in [this comment](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L302).\n\nThis source, however, processes all the data up to and including the last offset, so if it reprocessed the offsets provided at query\nrestart time, there would be 
duplicates. Also, it uses its own offset definition, [JDBCSingleFieldOffset](https://github.com/AbsaOSS/Jdbc2S/blob/master/src/main/scala/za/co/absa/spark/jdbc/streaming/source/offsets/JDBCSingleFieldOffset.scala).\n\nThese pieces are connected as follows: if the end offset provided to [getBatch](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Source.scala#L61)\nis of type `SerializedOffset` and there is no previous offset memoized, the incoming offset is assumed to come\nfrom the checkpoint location. In this case, the checkpoint offset is memoized and an empty DataFrame is returned.\n\nIn the next iteration, when the same method is called, the start offset will be the `SerializedOffset` instance previously used,\nbut it will already have been processed in the last batch, so in this case the algorithm proceeds normally.\n\nFor more information, check [this documentation](https://github.com/AbsaOSS/Jdbc2S/blob/master/src/main/scala/org/apache/spark/sql/execution/streaming/sources/JDBCStreamingSourceV1.scala#L285).\n\n## Usage\n\nThis source is configured through the parameters below.\n\n### Parameters\nThere are two parameters for the V1 source, one mandatory and one optional.\n\n1. 
**Mandatory**: `offset.field`, plus `offset.field.date.format` if `offset.field` is of DATE type.\n\nThis parameter specifies the name of the field to be used as the offset.\n\n```scala\nimport java.sql.Date // assumed: the date type supported by Spark encoders\n\n// assuming this is the case class used in the dataset\ncase class Transaction(user: String, value: Double, date: Date)\n\nval jdbcOptions = {\n    Map(\n      \"user\" -\u003e \"a_user\",\n      \"password\" -\u003e \"a_password\",\n      \"database\" -\u003e \"h2_db\",\n      \"driver\" -\u003e \"org.h2.Driver\",\n      \"url\" -\u003e \"jdbc:h2:mem:myDb;DB_CLOSE_DELAY=-1;DATABASE_TO_UPPER=false\",\n      \"dbtable\" -\u003e \"transactions\"\n    )\n}\n\n// 'format' is the source name (fully qualified or short), see the 'Source name' section\nval stream = spark.readStream\n    .format(format)\n    .options(jdbcOptions + (\"offset.field\" -\u003e \"date\") + (\"offset.field.date.format\" -\u003e \"YYYY-MM-DD\")) // use the field 'date' as the offset field\n    .load\n```\n\n\n2. **Optional**: `offset.start`\n\nThis parameter defines the start offset to be used when running the query. If not specified, it will be calculated from\nthe data.\n\nIMPORTANT: if the field being used as offset is not indexed, specifying the initial offset may significantly improve performance.\n\n```scala\nimport java.sql.Date // assumed: the date type supported by Spark encoders\n\n// assuming this is the case class used in the dataset\ncase class Transaction(user: String, value: Double, date: Date)\n\nval jdbcOptions = {\n    Map(\n      \"user\" -\u003e \"a_user\",\n      \"password\" -\u003e \"a_password\",\n      \"database\" -\u003e \"h2_db\",\n      \"driver\" -\u003e \"org.h2.Driver\",\n      \"url\" -\u003e \"jdbc:h2:mem:myDb;DB_CLOSE_DELAY=-1;DATABASE_TO_UPPER=false\",\n      \"dbtable\" -\u003e \"transactions\"\n    )\n}\n\n// 'format' is the source name (fully qualified or short), see the 'Source name' section\nval stream = spark.readStream\n    .format(format)\n    // runs the query starting from the 10th of January until the last date there is data available\n    .options(jdbcOptions + (\"offset.field\" -\u003e \"date\") + (\"offset.start\" -\u003e \"2020-01-10\") + (\"offset.field.date.format\" -\u003e \"YYYY-MM-DD\"))\n    .load\n```\n\n### 
Source name\nYou can refer to this source either by its fully qualified provider name or by its short name.\n\n#### Fully qualified provider name\nThe fully qualified name for the V1 source is **za.co.absa.spark.jdbc.streaming.source.providers.JDBCStreamingSourceProviderV1**.\n\nTo use it, you can do:\n\n```scala\n    val format = \"za.co.absa.spark.jdbc.streaming.source.providers.JDBCStreamingSourceProviderV1\"\n\n    val stream = spark.readStream\n      .format(format)\n      .options(params)\n      .load\n```\n\n#### Short name\nThe short name for the V1 source is `jdbc-streaming-v1`, as defined [here](https://github.com/AbsaOSS/Jdbc2S/blob/master/src/main/scala/za/co/absa/spark/jdbc/streaming/source/providers/JDBCStreamingSourceProviderV1.scala#L47).\n\nTo use it, you need to:\n\n1. Create the directory `META-INF/services` under `src/main/resources`.\n2. Add a file named `org.apache.spark.sql.sources.DataSourceRegister` to that directory.\n3. Inside that file, add `za.co.absa.spark.jdbc.streaming.source.providers.JDBCStreamingSourceProviderV1`.\n\nAfter doing that, you'll be able to do:\n\n```scala\n    val stream = spark.readStream\n      .format(\"jdbc-streaming-v1\")\n      .options(params)\n      .load\n```\n\n\n#### Examples\nExamples can be found in the package `za.co.absa.spark.jdbc.streaming.source.examples`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabsaoss%2Fjdbc2s","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabsaoss%2Fjdbc2s","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabsaoss%2Fjdbc2s/lists"}