{"id":13794990,"url":"https://github.com/databricks/spark-avro","last_synced_at":"2025-05-12T21:32:54.499Z","repository":{"id":21331948,"uuid":"24648767","full_name":"databricks/spark-avro","owner":"databricks","description":"Avro Data Source for Apache Spark","archived":true,"fork":false,"pushed_at":"2018-12-19T19:32:32.000Z","size":405,"stargazers_count":539,"open_issues_count":77,"forks_count":310,"subscribers_count":70,"default_branch":"branch-4.0","last_synced_at":"2024-08-04T23:09:10.092Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://databricks.com/","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/databricks.png","metadata":{"files":{"readme":"README-for-old-spark-versions.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-09-30T17:50:58.000Z","updated_at":"2024-08-03T10:45:24.000Z","dependencies_parsed_at":"2022-08-20T19:00:22.942Z","dependency_job_id":null,"html_url":"https://github.com/databricks/spark-avro","commit_stats":null,"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Fspark-avro","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Fspark-avro/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Fspark-avro/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Fspark-avro/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/databricks","download_url":"https://codeload.github.com/databricks/spark-avro/tar.gz/refs/heads/branch-4.0","host":{"name":
"GitHub","url":"https://github.com","kind":"github","repositories_count":225157000,"owners_count":17429698,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T23:00:50.865Z","updated_at":"2024-11-18T09:31:31.826Z","avatar_url":"https://github.com/databricks.png","language":"Scala","readme":"# Avro Data Source for Apache Spark (2.3.x or earlier)\n\nDatabricks has donated this library to the Apache Spark project, as of [Spark 2.4.0](https://spark.apache.org/releases/spark-release-2-4-0.html). Databricks customers can also use this library directly on the [Databricks Unified Analytics Platform](https://www.databricks.com) without any additional dependency configurations. The rest of this file is the README for older versions.\n\n\nA library for reading and writing Avro data from [Spark SQL](http://spark.apache.org/docs/latest/sql-programming-guide.html).\n\n[![Build Status](https://travis-ci.org/databricks/spark-avro.svg?branch=master)](https://travis-ci.org/databricks/spark-avro)\n[![codecov.io](http://codecov.io/github/databricks/spark-avro/coverage.svg?branch=master)](http://codecov.io/github/databricks/spark-avro?branch=master)\n\n## Requirements\n\nThis documentation is for version 4.0.0 of this library, which supports Spark 2.2. 
For\ndocumentation on earlier versions of this library, see the links below.\n\nThis library has different versions for Spark 1.2, 1.3, 1.4+, 2.0 - 2.1, and 2.2:\n\n| Spark Version | Compatible version of Avro Data Source for Spark |\n| ------------- | ------------------------------------------------ |\n| `1.2`         | `0.2.0`                                          |\n| `1.3`         | [`1.0.0`](https://github.com/databricks/spark-avro/tree/v1.0.0) |\n| `1.4+`        | [`2.0.1`](https://github.com/databricks/spark-avro/tree/v2.0.1) |\n| `2.0 - 2.1`   | [`3.2.0`](https://github.com/databricks/spark-avro/tree/v3.2.0) |\n| `2.2`         | `4.0.0` (this version)                           |\n\n## Linking\n\nThis library is cross-published for Scala 2.11, so 2.10 users should replace 2.11 with 2.10 in the commands listed below.\n\nYou can link against this library in your program at the following coordinates:\n\n**Using SBT:**\n\n```\nlibraryDependencies += \"com.databricks\" %% \"spark-avro\" % \"4.0.0\"\n```\n\n**Using Maven:**\n\n```xml\n\u003cdependency\u003e\n    \u003cgroupId\u003ecom.databricks\u003c/groupId\u003e\n    \u003cartifactId\u003espark-avro_2.11\u003c/artifactId\u003e\n    \u003cversion\u003e4.0.0\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n### With `spark-shell` or `spark-submit`\n\nThis library can also be added to Spark jobs launched through `spark-shell` or `spark-submit` by using the `--packages` command line option.\nFor example, to include it when starting the spark shell:\n\n```\n$ bin/spark-shell --packages com.databricks:spark-avro_2.11:4.0.0\n```\n\nUnlike using `--jars`, using `--packages` ensures that this library and its dependencies will be added to the classpath. 
The `--packages` argument can also be used with `bin/spark-submit`.\n\n## Features\n\nAvro Data Source for Spark supports reading and writing of Avro data from Spark SQL.\n\n- **Automatic schema conversion:** It supports most conversions between Spark SQL and Avro records, making Avro a first-class citizen in Spark.\n- **Partitioning:** This library allows developers to easily read and write partitioned data\nwithout any extra configuration. Just pass the columns you want to partition on, just like you would for Parquet.\n- **Compression:**  You can specify the type of compression to use when writing Avro out to\ndisk. The supported types are `uncompressed`, `snappy`, and `deflate`. You can also specify the deflate level.\n- **Specifying record names:** You can specify the record name and namespace to use by passing a map of parameters with `recordName` and `recordNamespace`.\n\n## Supported types for Avro -\u003e Spark SQL conversion\n\nThis library supports reading all Avro types. It uses the following mapping from Avro types to Spark SQL types:\n\n| Avro type | Spark SQL type |\n| --------- |----------------|\n| boolean   | BooleanType    |\n| int       | IntegerType    |\n| long      | LongType       |\n| float     | FloatType      |\n| double    | DoubleType     |\n| bytes     | BinaryType     |\n| string    | StringType     |\n| record    | StructType     |\n| enum      | StringType     |\n| array     | ArrayType      |\n| map       | MapType        |\n| fixed     | BinaryType     |\n| union     | See below      |\n\nIn addition to the types listed above, it supports reading `union` types. The following three types are considered basic `union` types:\n\n1. `union(int, long)` will be mapped to `LongType`.\n2. `union(float, double)` will be mapped to `DoubleType`.\n3. `union(something, null)`, where `something` is any supported Avro type. 
This will be mapped to the same Spark SQL type as that of `something`, with `nullable` set to `true`.\n\nAll other `union` types are considered complex. They will be mapped to `StructType` where field names are `member0`, `member1`, etc., in accordance with members of the `union`. This is consistent with the behavior when converting between Avro and Parquet.\n\nAt the moment, it ignores docs, aliases and other properties present in the Avro file.\n\n## Supported types for Spark SQL -\u003e Avro conversion\n\nThis library supports writing of all Spark SQL types into Avro. For most types, the mapping from Spark types to Avro types is straightforward (e.g. IntegerType gets converted to int); however, there are a few special cases which are listed below:\n\n| Spark SQL type | Avro type |\n| ---------------|-----------|\n| ByteType       | int       |\n| ShortType      | int       |\n| DecimalType    | string    |\n| BinaryType     | bytes     |\n| TimestampType  | long      |\n| DateType       | long      |\n| StructType     | record    |\n\n## Examples\n\nThe recommended way to read or write Avro data from Spark SQL is by using Spark's DataFrame APIs, which are available in Scala, Java, Python, and R.\n\nThese examples use an Avro file available for download\n[here](https://github.com/databricks/spark-avro/raw/master/src/test/resources/episodes.avro):\n\n### Scala API\n\n```scala\n// import needed for the .avro method to be added\nimport com.databricks.spark.avro._\nimport org.apache.spark.sql.SparkSession\n\nval spark = SparkSession.builder().master(\"local\").getOrCreate()\n\n// The Avro records get converted to Spark types, filtered, and\n// then written back out as Avro records\nval df = spark.read.avro(\"src/test/resources/episodes.avro\")\ndf.filter(\"doctor \u003e 5\").write.avro(\"/tmp/output\")\n```\n\nAlternatively you can specify the format to use instead:\n\n```scala\nval spark = SparkSession.builder().master(\"local\").getOrCreate()\nval df = spark.read\n 
   .format(\"com.databricks.spark.avro\")\n    .load(\"src/test/resources/episodes.avro\")\n\ndf.filter(\"doctor \u003e 5\").write.format(\"com.databricks.spark.avro\").save(\"/tmp/output\")\n```\n\nYou can specify a custom Avro schema:\n\n```scala\nimport java.io.File\n\nimport org.apache.avro.Schema\nimport org.apache.spark.sql.SparkSession\n\nval schema = new Schema.Parser().parse(new File(\"user.avsc\"))\nval spark = SparkSession.builder().master(\"local\").getOrCreate()\nspark\n  .read\n  .format(\"com.databricks.spark.avro\")\n  .option(\"avroSchema\", schema.toString)\n  .load(\"src/test/resources/episodes.avro\").show()\n```\n\nYou can also specify Avro compression options:\n\n```scala\nimport com.databricks.spark.avro._\nimport org.apache.spark.sql.SparkSession\n\nval spark = SparkSession.builder().master(\"local\").getOrCreate()\n\n// configuration to use deflate compression\nspark.conf.set(\"spark.sql.avro.compression.codec\", \"deflate\")\nspark.conf.set(\"spark.sql.avro.deflate.level\", \"5\")\n\nval df = spark.read.avro(\"src/test/resources/episodes.avro\")\n\n// writes out compressed Avro records\ndf.write.avro(\"/tmp/output\")\n```\n\nYou can write partitioned Avro records like this:\n\n```scala\nimport com.databricks.spark.avro._\nimport org.apache.spark.sql.SparkSession\n\nval spark = SparkSession.builder().master(\"local\").getOrCreate()\n\nval df = spark.createDataFrame(\n  Seq(\n    (2012, 8, \"Batman\", 9.8),\n    (2012, 8, \"Hero\", 8.7),\n    (2012, 7, \"Robot\", 5.5),\n    (2011, 7, \"Git\", 2.0))\n  ).toDF(\"year\", \"month\", \"title\", \"rating\")\n\ndf.write.partitionBy(\"year\", \"month\").avro(\"/tmp/output\")\n```\n\nYou can specify the record name and namespace like this:\n\n```scala\nimport com.databricks.spark.avro._\nimport org.apache.spark.sql.SparkSession\n\nval spark = SparkSession.builder().master(\"local\").getOrCreate()\nval df = spark.read.avro(\"src/test/resources/episodes.avro\")\n\nval name = \"AvroTest\"\nval namespace = 
\"com.databricks.spark.avro\"\nval parameters = Map(\"recordName\" -\u003e name, \"recordNamespace\" -\u003e namespace)\n\ndf.write.options(parameters).avro(\"/tmp/output\")\n```\n\n### Java API\n\n```java\nimport org.apache.spark.sql.*;\nimport org.apache.spark.sql.functions;\n\nSparkSession spark = SparkSession.builder().master(\"local\").getOrCreate();\n\n// Creates a DataFrame from a specified file\nDataset\u003cRow\u003e df = spark.read().format(\"com.databricks.spark.avro\")\n  .load(\"src/test/resources/episodes.avro\");\n\n// Saves the subset of the Avro records read in\ndf.filter(functions.expr(\"doctor \u003e 5\")).write()\n  .format(\"com.databricks.spark.avro\")\n  .save(\"/tmp/output\");\n```\n\n### Python API\n\n```python\n# Creates a DataFrame from a specified directory\ndf = spark.read.format(\"com.databricks.spark.avro\").load(\"src/test/resources/episodes.avro\")\n\n# Saves the subset of the Avro records read in\nsubset = df.where(\"doctor \u003e 5\")\nsubset.write.format(\"com.databricks.spark.avro\").save(\"/tmp/output\")\n```\n\n### SQL API\nAvro data can be queried in pure SQL by registering the data as a temporary table.\n\n```sql\nCREATE TEMPORARY TABLE episodes\nUSING com.databricks.spark.avro\nOPTIONS (path \"src/test/resources/episodes.avro\")\n```\n\n## Building From Source\nThis library is built with [SBT](http://www.scala-sbt.org/0.13/docs/Command-Line-Reference.html),\nwhich is automatically downloaded by the included shell script. To build a JAR file, simply run\n`build/sbt package` from the project root.\n\n## Testing\nTo run the tests, run `build/sbt test`. If you are working on performance\nimprovements, you can generate a sample Avro file and check how long it takes to read it\nusing the following commands:\n\n```\nbuild/sbt \"test:run-main com.databricks.spark.avro.AvroFileGenerator NUMBER_OF_RECORDS NUMBER_OF_FILES\"\n```\n\nThis will create sample Avro files in `target/avroForBenchmark/`. 
You can specify the number of records\nfor each file, as well as the overall number of files.\n\n```\nbuild/sbt \"test:run-main com.databricks.spark.avro.AvroReadBenchmark\"\n```\n\nruns `count()` on the data inside `target/avroForBenchmark/` and tells you how long the operation took.\n\nSimilarly, you can benchmark how long it takes to write a DataFrame as an Avro file with\n\n```\nbuild/sbt \"test:run-main com.databricks.spark.avro.AvroWriteBenchmark NUMBER_OF_ROWS\"\n```\n\nwhere `NUMBER_OF_ROWS` is an optional parameter that specifies the number of rows in the DataFrame that will be written.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabricks%2Fspark-avro","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatabricks%2Fspark-avro","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabricks%2Fspark-avro/lists"}