{"id":18810362,"url":"https://github.com/absaoss/abris","last_synced_at":"2025-04-04T07:09:26.495Z","repository":{"id":37493011,"uuid":"131604608","full_name":"AbsaOSS/ABRiS","owner":"AbsaOSS","description":"Avro SerDe for Apache Spark structured APIs.","archived":false,"fork":false,"pushed_at":"2024-07-22T08:33:58.000Z","size":988,"stargazers_count":233,"open_issues_count":20,"forks_count":76,"subscribers_count":15,"default_branch":"master","last_synced_at":"2025-03-28T06:08:55.229Z","etag":null,"topics":["avro","avro-schema","kafka","schema-registry","spark"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AbsaOSS.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-04-30T14:17:48.000Z","updated_at":"2025-03-24T17:34:32.000Z","dependencies_parsed_at":"2024-02-06T09:29:46.314Z","dependency_job_id":"b562dfef-a496-4cf4-ae63-7f3faddd926a","html_url":"https://github.com/AbsaOSS/ABRiS","commit_stats":null,"previous_names":[],"tags_count":28,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2FABRiS","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2FABRiS/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2FABRiS/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2FABRiS/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AbsaOSS","download_url":"https://codel
oad.github.com/AbsaOSS/ABRiS/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247135145,"owners_count":20889421,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["avro","avro-schema","kafka","schema-registry","spark"],"created_at":"2024-11-07T23:19:57.920Z","updated_at":"2025-04-04T07:09:26.461Z","avatar_url":"https://github.com/AbsaOSS.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n\n# ABRiS - Avro Bridge for Spark\n\n- Pain free Spark/Avro integration.\n\n- Seamlessly integrate with Confluent platform, including Schema Registry with all available [naming strategies](https://docs.confluent.io/current/schema-registry/serializer-formatter.html#how-the-naming-strategies-work) and schema evolution.\n\n- Seamlessly convert your Avro records from anywhere (e.g. Kafka, Parquet, HDFS, etc) into Spark Rows. 
\n\n- Convert your Dataframes into Avro records without even specifying a schema.\n\n- Go back-and-forth Spark Avro (since Spark 2.4).\n\n\n### Coordinates for Maven POM dependency\n\n| Scala  | Abris   |\n|:------:|:-------:|\n| 2.11   | [![Maven Central](https://maven-badges.herokuapp.com/maven-central/za.co.absa/abris_2.11/badge.svg)](https://maven-badges.herokuapp.com/maven-central/za.co.absa/abris_2.11) |\n| 2.12   | [![Maven Central](https://maven-badges.herokuapp.com/maven-central/za.co.absa/abris_2.12/badge.svg)](https://maven-badges.herokuapp.com/maven-central/za.co.absa/abris_2.12) |\n| 2.13   | [![Maven Central](https://maven-badges.herokuapp.com/maven-central/za.co.absa/abris_2.13/badge.svg)](https://maven-badges.herokuapp.com/maven-central/za.co.absa/abris_2.13) |\n\n## Supported versions\n\n| Abris   |     Spark     | Scala       |\n|:-----:  |:-------------:|:-----:      |\n| 6.2.0 - 6.x.x   | 3.2.1 - 3.5.x | 2.12 / 2.13 |\n| 6.0.0 - 6.1.1   |     3.2.0     | 2.12 / 2.13 |\n| 5.0.0 - 5.x.x   | 3.0.x / 3.1.x | 2.12        |\n| 5.0.0 - 5.x.x   |     2.4.x     | 2.11 / 2.12 |\n\nFrom version 6.0.0, ABRiS supports only Spark 3.2.0 and newer, as listed in the table above.\n\nABRiS 5.0.x is still supported for older versions of Spark (see [branch-5](https://github.com/AbsaOSS/ABRiS/tree/branch-5)).\n\n## Older Versions\nThis is documentation for Abris **version 6**. 
Documentation for older versions is located in the corresponding branches:\n[branch-5](https://github.com/AbsaOSS/ABRiS/tree/branch-5),\n[branch-4](https://github.com/AbsaOSS/ABRiS/tree/branch-4),\n[branch-3.2](https://github.com/AbsaOSS/ABRiS/tree/branch-3.2).\n\n## Confluent Schema Registry Version\nAbris by default uses Confluent client version 6.2.0.\n\n## Installation\nAbris needs `spark-avro` to run, so make sure you include the `spark-avro` dependency when using Abris.\nThe `spark-avro` version should be identical to the Spark version.\n\nExample: submitting a Spark job:\n```\n./bin/spark-submit \\\n    --packages org.apache.spark:spark-avro_2.12:3.5.0,za.co.absa:abris_2.12:6.4.0 \\\n    ...rest of submit params...\n```\n\nExample: using Abris in a Maven project:\n```xml\n\u003cdependency\u003e\n    \u003cgroupId\u003eorg.apache.spark\u003c/groupId\u003e\n    \u003cartifactId\u003espark-core_2.12\u003c/artifactId\u003e\n    \u003cversion\u003e3.5.0\u003c/version\u003e\n    \u003cscope\u003eprovided\u003c/scope\u003e\n\u003c/dependency\u003e\n\u003cdependency\u003e\n    \u003cgroupId\u003eorg.apache.spark\u003c/groupId\u003e\n    \u003cartifactId\u003espark-avro_2.12\u003c/artifactId\u003e\n    \u003cversion\u003e3.5.0\u003c/version\u003e \u003c!-- version must be the same as Spark --\u003e\n\u003c/dependency\u003e\n\u003cdependency\u003e\n    \u003cgroupId\u003eza.co.absa\u003c/groupId\u003e\n    \u003cartifactId\u003eabris_2.12\u003c/artifactId\u003e\n    \u003cversion\u003e6.4.0\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\nExample: using Abris in an SBT project:\n```Scala\nlibraryDependencies ++= Seq(\n  \"org.apache.spark\" %% \"spark-core\" % \"3.5.0\" % Provided,\n  \"org.apache.spark\" %% \"spark-avro\" % \"3.5.0\",\n  \"za.co.absa\" %% \"abris\" % \"6.4.0\"\n)\n```\n\n\n## Usage\n\nIn its most basic form, the ABRiS API is almost identical to Spark's built-in support for Avro, but it provides additional functionality. 
\nMainly, it supports Schema Registry and integrates seamlessly with the Confluent Avro data format.\n\nThe API consists of two Spark SQL expressions (`to_avro` and `from_avro`) and a fluent configurator (`AbrisConfig`).\n\nUsing the configurator, you can choose from four basic config types:\n* `toSimpleAvro`, `toConfluentAvro`, `fromSimpleAvro` and `fromConfluentAvro`\n\nYou then configure what you want to do, mainly how to obtain the Avro schema.\n\nExample of usage:\n```Scala\nimport org.apache.spark.sql.functions.col\nimport za.co.absa.abris.avro.functions.from_avro\nimport za.co.absa.abris.config.AbrisConfig\n\nval abrisConfig = AbrisConfig\n  .fromConfluentAvro\n  .downloadReaderSchemaByLatestVersion\n  .andTopicNameStrategy(\"topic123\")\n  .usingSchemaRegistry(\"http://localhost:8081\")\n\n// deserialize the binary \"value\" column into a column named \"data\"\nval deserialized = dataFrame.select(from_avro(col(\"value\"), abrisConfig) as 'data)\n```\n\nDetailed instructions for many use cases are in separate documents:\n\n- [How to use Abris with vanilla avro (with examples)](documentation/vanilla-avro-documentation.md)\n- [How to use Abris with Confluent avro (with examples)](documentation/confluent-avro-documentation.md)\n- [How to use Abris in Python (with examples)](documentation/python-documentation.md)\n\nFull runnable examples can be found in the ```za.co.absa.abris.examples``` package. You can also take a look at unit tests in package ```za.co.absa.abris.avro.sql```.\n\n**IMPORTANT**: Spark dependencies have `provided` scope in the `pom.xml`, so when running the examples, please make sure that you either instruct your IDE to include dependencies with \n`provided` scope or change the scope directly.\n\n### Confluent Avro format\nThe format of Avro binary data is defined in the [Avro specification](http://avro.apache.org/docs/current/spec.html). \nThe Confluent format extends it by prepending the schema id to the actual record. 
\nThe Confluent expressions in this library expect this format: they add the id after the Avro data is generated and remove it before the data is parsed.\n\nYou can find more about Confluent and Schema Registry in the [Confluent documentation](https://docs.confluent.io/current/schema-registry/index.html).\n\n\n### Schema Registry security and other additional settings\n\nThe only mandatory Schema Registry client setting is the URL, \nbut if you need to provide more, the configurator accepts a whole map.\n\nFor example, you may want to provide `basic.auth.user.info` and `basic.auth.credentials.source` required for user authentication.\nYou can do it this way:\n\n```scala\nval registryConfig = Map(\n  AbrisConfig.SCHEMA_REGISTRY_URL -\u003e \"http://localhost:8081\",\n  \"basic.auth.credentials.source\" -\u003e \"USER_INFO\",\n  \"basic.auth.user.info\" -\u003e \"srkey:srvalue\"\n)\n\nval abrisConfig = AbrisConfig\n  .fromConfluentAvro\n  .downloadReaderSchemaByLatestVersion\n  .andTopicNameStrategy(\"topic123\")\n  .usingSchemaRegistry(registryConfig) // use the map instead of just url\n```\n\n## Other Features\n\n### Generating Avro schema from Spark data frame column\nThere is a helper method that generates a schema automatically from a Spark column. \nAssuming you have a data frame containing the column \"input\", you can generate the schema for the data in that column like this:\n```scala\nval schema = AvroSchemaUtils.toAvroSchema(dataFrame, \"input\")\n```\n\n### Using schema manager to directly download or register schema\nYou can use `SchemaManager` directly to perform operations on the Schema Registry. \nThe configuration is identical to that of the Schema Registry client.\nThe `SchemaManager` is just a wrapper around the client providing helpful methods and abstractions.\n\n```scala\nval schemaRegistryClientConfig = Map( ...configuration... 
)\nval schemaManager = SchemaManagerFactory.create(schemaRegistryClientConfig)\n\n// Downloading schema:\nval schema = schemaManager.getSchemaById(42)\n\n// Registering schema:\nval schemaString = \"{...avro schema json...}\"\nval subject = SchemaSubject.usingTopicNameStrategy(\"fooTopic\")\nval schemaId = schemaManager.register(subject, schemaString)\n\n// and more, check SchemaManager's methods\n```\n\n### De-serialisation Error Handling\nThere are three ways ABRiS handles de-serialisation errors:\n\n#### FailFast (Default)\nIf no de-serialisation handler is provided, a failure results in a Spark exception being thrown \nand the error being output. This is the default behaviour.\n\n#### SpecificRecordExceptionHandler\nThe second option requires providing a default record that is output in the event of a failure.\nThe default record acts as a flag that can be filtered out outside ABRiS, so the Spark job does not stop. \nBeware, however, that a null or empty record will also result in an error, so a record with different content should be chosen.\n\nIt can be provided like this:\n```scala\nval abrisConfig = AbrisConfig\n  .fromConfluentAvro\n  .downloadReaderSchemaByLatestVersion\n  .andTopicNameStrategy(\"topic123\")\n  .usingSchemaRegistry(registryConfig)\n  .withSchemaConverter(\"custom\")\n  .withExceptionHandler(new SpecificRecordExceptionHandler(providedDefaultRecord))\n```\n\nThis is available only for Confluent-based configurations, not for standard Avro.\n\n#### PermissiveRecordExceptionHandler\nThe third option is to use the `PermissiveRecordExceptionHandler`. In case of a deserialization failure, this handler replaces the problematic record with a fully null record, instead of throwing an exception. This allows the data processing pipeline to continue without interruption.\n\nThe main use case for this option is when you want to prioritize continuity of processing over individual record integrity. 
It's especially useful when dealing with large datasets where occasional malformed records could be tolerated.\n\nHere's how to use it:\n\n```scala\nval abrisConfig = AbrisConfig\n  .fromConfluentAvro\n  .downloadReaderSchemaByLatestVersion\n  .andTopicNameStrategy(\"topic123\")\n  .usingSchemaRegistry(registryConfig)\n  .withSchemaConverter(\"custom\")\n  .withExceptionHandler(new PermissiveRecordExceptionHandler())\n```\n\nWith this configuration, in the event of a deserialization error, the `PermissiveRecordExceptionHandler` will log a warning, substitute the malformed record with a fully null one, and allow the data processing pipeline to continue.\n\n\n### Data Conversions\nThis library also provides convenient methods to convert between Avro and Spark schemas. \n\nIf you have an Avro schema which you want to convert into a Spark SQL one - to generate your Dataframes, for instance - you can do as follows: \n\n```scala\nval avroSchema: Schema = AvroSchemaUtils.load(\"path_to_avro_schema\")\nval sqlSchema: StructType = SparkAvroConversions.toSqlType(avroSchema) \n```  \n\nYou can also do the inverse operation by running:\n\n```scala\nval sqlSchema = new StructType(new StructField ....\nval avroSchema = SparkAvroConversions.toAvroSchema(sqlSchema, avro_schema_name, avro_schema_namespace)\n```\n\n#### Custom data conversions\nIf you would like to use custom logic to convert from Avro to Spark, you can implement the `SchemaConverter` trait.\nThe custom class is loaded in ABRiS using the service provider interface (SPI), so you need to register your class in your\n`META-INF/services` resource directory. 
You can then configure the custom class with its short name or the fully qualified name.\n\n**Example**\n\nCustom schema converter implementation\n```scala\npackage za.co.absa.abris.avro.sql\nimport org.apache.avro.Schema\nimport org.apache.spark.sql.types.DataType\n\nclass CustomSchemaConverter extends SchemaConverter {\n  override val shortName: String = \"custom\"\n  override def toSqlType(avroSchema: Schema): DataType = ???\n}\n```\n\nProvider configuration file `META-INF/services/za.co.absa.abris.avro.sql.SchemaConverter`:\n```\nza.co.absa.abris.avro.sql.CustomSchemaConverter\n```\n\nAbris configuration\n```scala\nval abrisConfig = AbrisConfig\n  .fromConfluentAvro\n  .downloadReaderSchemaByLatestVersion\n  .andTopicNameStrategy(\"topic123\")\n  .usingSchemaRegistry(registryConfig)\n  .withSchemaConverter(\"custom\")\n```\n\n## Multiple schemas in one topic\nThe naming strategies RecordName and TopicRecordName allow one topic to receive different payloads, \ni.e. payloads containing different schemas that do not have to be compatible, \nas explained [here](https://docs.confluent.io/current/schema-registry/docs/serializer-formatter.html#subject-name-strategy).\n\nWhen you read such data from Kafka, it is stored as a binary column in a dataframe, \nbut once you convert it to Spark types it cannot remain in one dataframe, \nbecause all rows in a dataframe must have the same schema.\n\nSo if you have multiple incompatible types of Avro data in a dataframe, you must first split them into several dataframes,\none for each schema. 
Then you can use Abris and convert the avro data.\n\n## How to measure code coverage\n```shell\n./mvn clean verify -Pcode-coverage,scala-2.12\nor\n./mvn clean verify -Pcode-coverage,scala-2.13\n```\nCode coverage reports will be generated on paths:\n```\n{local-path}\\ABRiS\\target\\jacoco\n```\n\n---\n\n    Copyright 2018 ABSA Group Limited\n    \n    Licensed under the Apache License, Version 2.0 (the \"License\");\n    you may not use this file except in compliance with the License.\n    You may obtain a copy of the License at\n    \n        http://www.apache.org/licenses/LICENSE-2.0\n    \n    Unless required by applicable law or agreed to in writing, software\n    distributed under the License is distributed on an \"AS IS\" BASIS,\n    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n    See the License for the specific language governing permissions and\n    limitations under the License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabsaoss%2Fabris","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabsaoss%2Fabris","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabsaoss%2Fabris/lists"}