{"id":30294434,"url":"https://github.com/linkedin/spark-tfrecord","last_synced_at":"2025-08-17T01:34:54.834Z","repository":{"id":38388782,"uuid":"253932867","full_name":"linkedin/spark-tfrecord","owner":"linkedin","description":"Read and write Tensorflow TFRecord data from Apache Spark.","archived":false,"fork":false,"pushed_at":"2024-04-22T05:36:44.000Z","size":102,"stargazers_count":293,"open_issues_count":17,"forks_count":56,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-07-21T05:56:04.474Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-2-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/linkedin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2020-04-07T23:09:35.000Z","updated_at":"2025-06-19T13:00:11.000Z","dependencies_parsed_at":"2024-04-22T06:46:10.712Z","dependency_job_id":null,"html_url":"https://github.com/linkedin/spark-tfrecord","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/linkedin/spark-tfrecord","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linkedin%2Fspark-tfrecord","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linkedin%2Fspark-tfrecord/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linkedin%2Fspark-tfrecord/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linkedin%2Fspark-tfrecord/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/linkedin","down
load_url":"https://codeload.github.com/linkedin/spark-tfrecord/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linkedin%2Fspark-tfrecord/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270796216,"owners_count":24647319,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-16T02:00:11.002Z","response_time":91,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-08-17T01:34:54.357Z","updated_at":"2025-08-17T01:34:54.820Z","avatar_url":"https://github.com/linkedin.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Spark-TFRecord\n\nA library for reading and writing [Tensorflow TFRecord](https://www.tensorflow.org/how_tos/reading_data/) data from [Apache Spark](http://spark.apache.org/).\nThe implementation is based on [Spark Tensorflow Connector](https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-connector), but it is rewritten in Spark FileFormat trait to provide the partitioning function.\n\n## Including the library\n\nThe artifacts are published to [bintray](https://bintray.com/linkedin/maven/spark-tfrecord) and [maven central](https://search.maven.org/search?q=spark-tfrecord) repositories.\n\n- Version 0.1.x targets Spark 2.3 and Scala 2.11\n- Version 0.2.x targets Spark 2.4 and both Scala 2.11 and 2.12\n- 
Version 0.3.x targets Spark 3.0 and Scala 2.12\n- Version 0.4.x targets Spark 3.2 and Scala 2.12\n- Version 0.5.x targets Spark 3.2 and Scala 2.13\n- Version 0.6.x targets Spark 3.4 and both Scala 2.12 and 2.13\n- Version 0.7.x targets Spark 3.5 and both Scala 2.12 and 2.13\n\nTo use the package, please include the dependency as follows:\n\n```xml\n\u003cdependency\u003e\n  \u003cgroupId\u003ecom.linkedin.sparktfrecord\u003c/groupId\u003e\n  \u003cartifactId\u003espark-tfrecord_2.12\u003c/artifactId\u003e\n  \u003cversion\u003eyour.version\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n## Building the library\nThe library can be built with Maven 3.3.9 or newer as shown below:\n\n```sh\n# Build Spark-TFRecord\ngit clone https://github.com/linkedin/spark-tfrecord.git\ncd spark-tfrecord\nmvn -Pscala-2.12 clean install\n\n# One can specify the spark version and tensorflow hadoop version, for example\nmvn -Pscala-2.12 clean install -Dspark.version=3.0.0 -Dtensorflow.hadoop.version=1.15.0\n```\n\n## Using Spark Shell\nRun this library in Spark using the `--jars` command line option in `spark-shell`, `pyspark` or `spark-submit`. For example:\n\n```sh\n$SPARK_HOME/bin/spark-shell --jars target/spark-tfrecord_2.12-0.3.0.jar\n```\n\n## Features\nThis library allows reading TensorFlow records from a local or distributed filesystem as [Spark DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html).\nWhen reading TensorFlow records into a Spark DataFrame, the API accepts several options:\n* `load`: input path to TensorFlow records. Like Spark, it accepts standard Hadoop globbing expressions.\n* `schema`: schema of TensorFlow records. Optional schema defined using Spark StructType. If not provided, the schema is inferred from TensorFlow records.\n* `recordType`: input format of TensorFlow records. By default it is Example. 
Possible values are:\n  * `Example`: TensorFlow [Example](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/example.proto) records\n  * `SequenceExample`: TensorFlow [SequenceExample](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/example.proto) records\n  * `ByteArray`: `Array[Byte]` type in Scala.\n\nWhen writing a Spark DataFrame to TensorFlow records, the API accepts several options:\n* `save`: output path to TensorFlow records on a local or distributed filesystem.\n* `codec`: codec used for compression. While reading compressed TensorFlow records, `codec` can be inferred automatically, so this option is not required for reading.\n* `recordType`: output format of TensorFlow records. By default it is Example. Possible values are:\n  * `Example`: TensorFlow [Example](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/example.proto) records\n  * `SequenceExample`: TensorFlow [SequenceExample](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/example.proto) records\n  * `ByteArray`: `Array[Byte]` type in Scala. Intended for writing objects other than TensorFlow Example or SequenceExample. For example, [protos](https://developers.google.com/protocol-buffers) can be transformed to byte arrays using `.toByteArray`.\n\nThe writer supports the `partitionBy` operation, so the following command partitions the output by \"partitionColumn\".\n```\ndf.write.mode(SaveMode.Overwrite).partitionBy(\"partitionColumn\").format(\"tfrecord\").option(\"recordType\", \"Example\").save(output_dir)\n```\nNote that we use `format(\"tfrecord\")` instead of `format(\"tfrecords\")`. 
So if you migrate from Spark-Tensorflow-Connector, make sure this is changed accordingly.\n\n## Schema inference\nThis library supports automatic schema inference when reading TensorFlow records into Spark DataFrames.\nSchema inference is expensive since it requires an extra pass through the data.\n\nThe schema inference rules are described in the table below:\n\n| TFRecordType             | Feature Type  | Inferred Spark Data Type  |\n| ------------------------ |:--------------|:--------------------------|\n| Example, SequenceExample | Int64List     | LongType if all lists have length=1, else ArrayType(LongType) |\n| Example, SequenceExample | FloatList     | FloatType if all lists have length=1, else ArrayType(FloatType) |\n| Example, SequenceExample | BytesList     | StringType if all lists have length=1, else ArrayType(StringType) |\n| SequenceExample          | FeatureList of Int64List | ArrayType(ArrayType(LongType)) |\n| SequenceExample          | FeatureList of FloatList | ArrayType(ArrayType(FloatType)) |\n| SequenceExample          | FeatureList of BytesList | ArrayType(ArrayType(StringType)) |\n\n## Supported data types\n\nThe supported Spark data types are listed in the table below:\n\n| Type            | Spark DataTypes                          |\n| --------------- |:------------------------------------------|\n| Scalar          | IntegerType, LongType, FloatType, DoubleType, DecimalType, StringType, BinaryType |\n| Array           | VectorType, ArrayType of IntegerType, LongType, FloatType, DoubleType, DecimalType, BinaryType, or StringType |\n| Array of Arrays | ArrayType of ArrayType of IntegerType, LongType, FloatType, DoubleType, DecimalType, BinaryType, or StringType |\n\n## Usage Examples\n\n### Python API\n\n#### TF record Import/export\n\nRun PySpark with the spark_connector in the jars argument as shown below:\n\n`$SPARK_HOME/bin/pyspark --jars target/spark-tfrecord_2.12-0.3.0.jar`\n\nThe following Python code snippet demonstrates usage on 
test data.\n\n```python\nfrom pyspark.sql.types import *\n\npath = \"test-output.tfrecord\"\n\nfields = [StructField(\"id\", IntegerType()), StructField(\"IntegerCol\", IntegerType()),\n          StructField(\"LongCol\", LongType()), StructField(\"FloatCol\", FloatType()),\n          StructField(\"DoubleCol\", DoubleType()), StructField(\"VectorCol\", ArrayType(DoubleType(), True)),\n          StructField(\"StringCol\", StringType())]\nschema = StructType(fields)\ntest_rows = [[11, 1, 23, 10.0, 14.0, [1.0, 2.0], \"r1\"], [21, 2, 24, 12.0, 15.0, [2.0, 2.0], \"r2\"]]\nrdd = spark.sparkContext.parallelize(test_rows)\ndf = spark.createDataFrame(rdd, schema)\ndf.write.mode(\"overwrite\").format(\"tfrecord\").option(\"recordType\", \"Example\").save(path)\ndf = spark.read.format(\"tfrecord\").option(\"recordType\", \"Example\").load(path)\ndf.show()\n```\n\n### Scala API\nRun Spark shell with the spark_connector in the jars argument as shown below:\n```sh\n$SPARK_HOME/bin/spark-shell --jars target/spark-tfrecord_2.12-0.3.0.jar\n```\n\nThe following Scala code snippet demonstrates usage on test data.\n\n```scala\nimport org.apache.commons.io.FileUtils\nimport org.apache.spark.sql.{ DataFrame, Row }\nimport org.apache.spark.sql.catalyst.expressions.GenericRow\nimport org.apache.spark.sql.types._\n\nval path = \"test-output.tfrecord\"\nval testRows: Array[Row] = Array(\nnew GenericRow(Array[Any](11, 1, 23L, 10.0F, 14.0, List(1.0, 2.0), \"r1\")),\nnew GenericRow(Array[Any](21, 2, 24L, 12.0F, 15.0, List(2.0, 2.0), \"r2\")))\nval schema = StructType(List(StructField(\"id\", IntegerType),\n                             StructField(\"IntegerCol\", IntegerType),\n                             StructField(\"LongCol\", LongType),\n                             StructField(\"FloatCol\", FloatType),\n                             StructField(\"DoubleCol\", DoubleType),\n                             StructField(\"VectorCol\", ArrayType(DoubleType, true)),\n                             
StructField(\"StringCol\", StringType)))\n\nval rdd = spark.sparkContext.parallelize(testRows)\n\n//Save DataFrame as TFRecords\nval df: DataFrame = spark.createDataFrame(rdd, schema)\ndf.write.format(\"tfrecord\").option(\"recordType\", \"Example\").save(path)\n\n//Read TFRecords into DataFrame.\n//The DataFrame schema is inferred from the TFRecords if no custom schema is provided.\nval importedDf1: DataFrame = spark.read.format(\"tfrecord\").option(\"recordType\", \"Example\").load(path)\nimportedDf1.show()\n\n//Read TFRecords into DataFrame using custom schema\nval importedDf2: DataFrame = spark.read.format(\"tfrecord\").schema(schema).load(path)\nimportedDf2.show()\n```\n\n#### Use partitionBy\nThe following example shows how to use `partitionBy`, which is not supported by [Spark Tensorflow Connector](https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-connector).\n\n```scala\n\n// launch spark-shell with the following command:\n// $SPARK_HOME/bin/spark-shell --jars target/spark-tfrecord_2.12-0.3.0.jar\n\nimport org.apache.spark.sql.SaveMode\n\nval df = Seq((8, \"bat\"),(8, \"abc\"), (1, \"xyz\"), (2, \"aaa\")).toDF(\"number\", \"word\")\ndf.show\n\n// scala\u003e df.show\n// +------+----+\n// |number|word|\n// +------+----+\n// |     8| bat|\n// |     8| abc|\n// |     1| xyz|\n// |     2| aaa|\n// +------+----+\n\nval tf_output_dir = \"/tmp/tfrecord-test\"\n\n// dump the tfrecords to files.\ndf.repartition(3, col(\"number\")).write.mode(SaveMode.Overwrite).partitionBy(\"number\").format(\"tfrecord\").option(\"recordType\", \"Example\").save(tf_output_dir)\n\n// ls /tmp/tfrecord-test\n// _SUCCESS        number=1        number=2        number=8\n\n// read back the tfrecords from files.\nval new_df = spark.read.format(\"tfrecord\").option(\"recordType\", \"Example\").load(tf_output_dir)\nnew_df.show\n\n// scala\u003e new_df.show\n// +----+------+\n// |word|number|\n// +----+------+\n// | bat|     8|\n// | abc|     8|\n// | xyz|     1|\n// | 
aaa|     2|\n// +----+------+\n```\n\n## Contributing\n\nPlease read [CONTRIBUTING.md](CONTRIBUTING.md) for details on our code of conduct and the process for submitting pull requests.\n\n## License\n\nThis project is licensed under the BSD 2-Clause License. See the [LICENSE.md](LICENSE.md) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flinkedin%2Fspark-tfrecord","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flinkedin%2Fspark-tfrecord","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flinkedin%2Fspark-tfrecord/lists"}