{"id":13454901,"url":"https://github.com/databricks/tensorframes","last_synced_at":"2025-05-15T08:05:19.019Z","repository":{"id":5636657,"uuid":"53160128","full_name":"databricks/tensorframes","owner":"databricks","description":"[DEPRECATED] Tensorflow wrapper for DataFrames on Apache Spark","archived":false,"fork":false,"pushed_at":"2024-07-30T20:59:38.000Z","size":1836,"stargazers_count":748,"open_issues_count":54,"forks_count":161,"subscribers_count":77,"default_branch":"master","last_synced_at":"2025-05-08T05:19:08.623Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/databricks.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-03-04T19:25:19.000Z","updated_at":"2025-04-04T03:59:17.000Z","dependencies_parsed_at":"2024-06-16T05:32:55.352Z","dependency_job_id":"d0c6b6a0-ab95-4c3c-80dc-043ba8d7f479","html_url":"https://github.com/databricks/tensorframes","commit_stats":{"total_commits":193,"total_committers":19,"mean_commits":"10.157894736842104","dds":"0.26943005181347146","last_synced_commit":"a6c753613c010b724a225b6ea97151e4f25d00df"},"previous_names":["tjhunter/tensorframes"],"tags_count":12,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Ftensorframes","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Ftensorframes/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Ftensorframes/
releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Ftensorframes/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/databricks","download_url":"https://codeload.github.com/databricks/tensorframes/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254301427,"owners_count":22047903,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T08:00:59.155Z","updated_at":"2025-05-15T08:05:18.969Z","avatar_url":"https://github.com/databricks.png","language":"Scala","readme":"![build](https://travis-ci.org/databricks/tensorframes.svg)\n\n# TensorFrames (Deprecated)\n\n\u003e **Note**:  TensorFrames is deprecated. You can use [`pandas UDF`](https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs) instead.\n\nExperimental [TensorFlow](https://www.tensorflow.org/) binding for Scala and \n[Apache Spark](http://spark.apache.org/).\n\nTensorFrames (TensorFlow on Spark DataFrames) lets you manipulate Apache Spark's DataFrames with \nTensorFlow programs.\n\n\u003e This package is experimental and is provided as a technical preview only. 
While the \n\u003e interfaces are all implemented and working, there are still some areas of low performance.\n\nSupported platforms:\n\n\u003e This package only officially supports Linux 64-bit platforms as a target.\n\u003e Contributions are welcome for other platforms.\n\nSee the file `project/Dependencies.scala` for adding your own platform.\n\nOfficially, TensorFrames supports Spark 2.4+ and Scala 2.11.\n\nSee the [user guide](https://github.com/databricks/tensorframes/wiki/TensorFrames-user-guide) for\n extensive information about the API.\n\nFor questions, see the [TensorFrames mailing list](https://groups.google.com/forum/#!forum/tensorframes).\n\nTensorFrames is available as a\n [Spark package](http://spark-packages.org/package/databricks/tensorframes).\n\n## Requirements\n\n - A working version of Apache Spark (2.4 or greater)\n\n - Java 8+\n \n - (Optional) Python 2.7+/3.6+ if you want to use the Python interface.\n \n - (Optional) the Python TensorFlow package if you want to use the Python interface. 
See the \n [official instructions](https://www.tensorflow.org/install/)\n  on how to get the latest release of TensorFlow.\n\n - (Optional) pandas \u003e= 0.19.1 if you want to use the Python interface\n\nAdditionally, for development, you need the following dependencies:\n\n - protoc 3.x\n\n - nose \u003e= 1.3 \n\n\n## How to run in Python\n\nAssuming that `SPARK_HOME` is set, you can use PySpark like any other Spark package.\n\n```bash\n$SPARK_HOME/bin/pyspark --packages databricks:tensorframes:0.6.0-s_2.11\n```\n\nHere is a small program that uses TensorFlow to add 3 to an existing column.\n\n```python\nimport tensorflow as tf\nimport tensorframes as tfs\nfrom pyspark.sql import Row\n\ndata = [Row(x=float(x)) for x in range(10)]\ndf = sqlContext.createDataFrame(data)\nwith tf.Graph().as_default() as g:\n    # The TensorFlow placeholder that corresponds to column 'x'.\n    # The shape of the placeholder is automatically inferred from the DataFrame.\n    x = tfs.block(df, \"x\")\n    # The output that adds 3 to x\n    z = tf.add(x, 3, name='z')\n    # The resulting dataframe\n    df2 = tfs.map_blocks(z, df)\n\n# The transform is lazy, as for most DataFrame operations. 
This will trigger it:\ndf2.collect()\n\n# Notice that z is an extra column next to x\n\n# [Row(z=3.0, x=0.0),\n#  Row(z=4.0, x=1.0),\n#  Row(z=5.0, x=2.0),\n#  Row(z=6.0, x=3.0),\n#  Row(z=7.0, x=4.0),\n#  Row(z=8.0, x=5.0),\n#  Row(z=9.0, x=6.0),\n#  Row(z=10.0, x=7.0),\n#  Row(z=11.0, x=8.0),\n#  Row(z=12.0, x=9.0)]\n```\n\nThe second example shows the block-wise reducing operations: we compute the sum of a field containing \nvectors of numbers, working with blocks of rows for more efficient processing.\n\n```python\n# Build a DataFrame of vectors\ndata = [Row(y=[float(y), float(-y)]) for y in range(10)]\ndf = sqlContext.createDataFrame(data)\n# Because the dataframe contains vectors, we need to analyze it first to find the\n# dimensions of the vectors.\ndf2 = tfs.analyze(df)\n\n# The information gathered by TensorFrames can be printed to check the content:\ntfs.print_schema(df2)\n# root\n#  |-- y: array (nullable = false) double[?,2]\n\n# Let's use the analyzed dataframe to compute the sum and the elementwise minimum \n# of all the vectors:\n# First, let's make a copy of the 'y' column. This will be very cheap in Spark 2.0+\ndf3 = df2.select(df2.y, df2.y.alias(\"z\"))\nwith tf.Graph().as_default() as g:\n    # The placeholders. Note the special names that end with '_input':\n    y_input = tfs.block(df3, 'y', tf_name=\"y_input\")\n    z_input = tfs.block(df3, 'z', tf_name=\"z_input\")\n    y = tf.reduce_sum(y_input, [0], name='y')\n    z = tf.reduce_min(z_input, [0], name='z')\n    # The reduced results\n    (data_sum, data_min) = tfs.reduce_blocks([y, z], df3)\n\n# The final results are numpy arrays:\nprint(data_sum)\n# [45., -45.]\nprint(data_min)\n# [0., -9.]\n```\n\n*Notes*\n\nNote the scoping of the graphs above. This is important because TensorFrames finds which \nDataFrame column to feed to TensorFlow based on the placeholders of the graph. 
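\n\nThe block-wise reduction above can be sketched in plain Python with NumPy (no Spark required). This is only an illustration of the idea, not the TensorFrames implementation: each block of rows yields a partial result, and the partial results are then combined:\n\n```python\nimport numpy as np\n\n# Hypothetical stand-in for the DataFrame of 2-element vectors above,\n# split into two blocks of rows the way Spark partitions data.\nrows = [[float(y), float(-y)] for y in range(10)]\nblocks = [np.array(rows[:5]), np.array(rows[5:])]\n\n# Per-block partial results (what each task computes)...\npartial_sums = [b.sum(axis=0) for b in blocks]\npartial_mins = [b.min(axis=0) for b in blocks]\n\n# ...are then combined into the final answers, matching the output above.\ndata_sum = np.sum(partial_sums, axis=0)  # [45., -45.]\ndata_min = np.min(partial_mins, axis=0)  # [0., -9.]\n```\n\n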
Also, it is \n good practice to keep small graphs when sending them to Spark.\n \nFor small tensors (scalars and vectors), TensorFrames usually infers the shapes of the \ntensors without requiring a preliminary analysis. If it cannot do it, an error message will \nindicate that you need to run the DataFrame through `tfs.analyze()` first.\n\nLook at the Python documentation of the TensorFrames package to see what methods are available.\n\n\n## How to run in Scala\n\nThe Scala support is a bit more limited than Python's. In Scala, operations can be loaded from \n an existing graph defined in the Protocol Buffers format, or using a simple Scala DSL. The\n Scala DSL only features a subset of TensorFlow transforms. It is very easy to extend\n though, so other transforms will be added without much effort in the future.\n\nYou simply use the published package:\n\n```bash\n$SPARK_HOME/bin/spark-shell --packages databricks:tensorframes:0.6.0-s_2.11\n```\n\nHere is the same program as before:\n\n```scala\nimport org.tensorframes.{dsl =\u003e tf}\nimport org.tensorframes.dsl.Implicits._\n\nval df = spark.createDataFrame(Seq(1.0-\u003e1.1, 2.0-\u003e2.2)).toDF(\"a\", \"b\")\n\n// As in Python, scoping is recommended to prevent name collisions.\nval df2 = tf.withGraph {\n    val a = df.block(\"a\")\n    // Unlike Python, the Scala syntax is more flexible:\n    val out = a + 3.0 named \"out\"\n    // The 'mapBlocks' method is added to DataFrames using implicits.\n    df.mapBlocks(out).select(\"a\", \"out\")\n}\n\n// The transform is still lazy at this point; let's execute it with collect:\ndf2.collect()\n// res0: Array[org.apache.spark.sql.Row] = Array([1.0,4.0], [2.0,5.0])   \n```\n\n## How to compile and install for developers\n\nIt is recommended that you use a [Conda environment](https://conda.io/docs/user-guide/tasks/manage-environments.html) to guarantee that the build environment\ncan be reproduced. 
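\n\nNote that the `conda create` command below reads `$PYTHON_VERSION`, which is never defined in this README; a minimal sketch, assuming you pick Python 3.7 (one of the versions recommended below):\n\n```bash\n# PYTHON_VERSION is consumed by the conda create command that follows.\nexport PYTHON_VERSION=3.7\n```\n\n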
Once you have installed Conda, you can set up the environment from\nthe root of the project:\n\n```bash\nconda create -q -n tensorframes-environment python=$PYTHON_VERSION\n```\n\nThis will create an environment for your project. We recommend using Python version 3.7 or 2.7.13.\nAfter the environment is created, you can activate it and install all dependencies as follows:\n\n```bash\nconda activate tensorframes-environment\npip install --user -r python/requirements.txt\n```\n\nYou also need to compile the Scala code. The recommended procedure is to use the assembly:\n\n```bash\nbuild/sbt tfs_testing/assembly\n# Builds the Spark package:\nbuild/sbt distribution/spDist\n```\n\nAssuming that `SPARK_HOME` is set and that you are in the root directory of the project:\n\n```bash\n$SPARK_HOME/bin/spark-shell --jars $PWD/target/testing/scala-2.11/tensorframes-assembly-0.6.1-SNAPSHOT.jar\n```\n\nIf you want to run the Python version:\n \n```bash\nPYTHONPATH=$PWD/target/testing/scala-2.11/tensorframes-assembly-0.6.1-SNAPSHOT.jar \\\n$SPARK_HOME/bin/pyspark --jars $PWD/target/testing/scala-2.11/tensorframes-assembly-0.6.1-SNAPSHOT.jar\n```\n\n## Acknowledgements\n\nBefore TensorFlow released its Java API, this project was built on the great\n[javacpp](https://github.com/bytedeco/javacpp) project, which implements the low-level bindings\nbetween TensorFlow and the Java virtual machine.\n\nMany thanks to Google for the release of TensorFlow.\n","funding_links":[],"categories":["Libraries","Distributed Machine Learning"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabricks%2Ftensorframes","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatabricks%2Ftensorframes","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabricks%2Ftensorframes/lists"}