https://github.com/hablapps/sparkoptics
Optics for Spark DataFrames
https://github.com/hablapps/sparkoptics
dataframe dataframes optics scala spark spark-sql
Last synced: 6 months ago
JSON representation
Optics for Spark DataFrames
- Host: GitHub
- URL: https://github.com/hablapps/sparkoptics
- Owner: hablapps
- License: apache-2.0
- Created: 2019-03-01T13:34:29.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2021-03-05T09:39:29.000Z (almost 5 years ago)
- Last Synced: 2025-06-06T11:08:08.157Z (7 months ago)
- Topics: dataframe, dataframes, optics, scala, spark, spark-sql
- Language: Scala
- Homepage:
- Size: 58.6 KB
- Stars: 47
- Watchers: 13
- Forks: 6
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
[](https://travis-ci.com/hablapps/sparkOptics)
[](https://maven-badges.herokuapp.com/maven-central/org.hablapps/spark-optics_2.11)
[](https://gitter.im/hablapps/sparkOptics?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge)
[](https://mybinder.org/v2/gh/hablapps/sparkOptics/binder?filepath=%2Fnotebooks%2FSparkLenses.ipynb)
# Spark-optics
Modify your complex structures in spark-sql dataframes with optics.
## Getting Started
Need to set an inner element in a complex structure?
```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
val df: DataFrame = ???
import org.hablapps.sparkOptics._
df.select(Lens("field.subfield")(df.schema).set(lit(13)):_*)
```
Want to try it right now, click on the binder icon to lunch a interactive notebook.
### Installing
Compiled for Scala 2.11 with Spark 2.3 and Scala 2.12 with Spark 2.4. Tested with spark 2.3, 2.4 in scala 2.11 and for 3.0.1 in scala 2.12.
```sbtshell
libraryDependencies += "org.hablapps" %% "spark-optics" % "1.0.0"
```
Spark lens doesn't have any dependencies beyond Spark itself.
### Implemented optics
#### Lens
Optic to focus in a column of a provided schema.
#### Protolens
Optic equivalent to a lens, but without an specific schema, so that it can be applied to any dataframe that contains the specified column.
### Motivation and larger example
Working with complex structures in Spark Sql's DataFrames can be hard.
A common case if you are working with complex structures is to modify the inner elements of the structure. For instance:
```scala
case class Street(number: Int, name: String)
case class Address(city: String, street: Street)
case class Company(name: String, address: Address)
case class Employee(name: String, company: Company)
val employee = Employee("john", Company("awesome inc", Address("london", Street(23, "high street"))))
val df = List(employee).toDS.toDF
```
```
root
|-- name: string (nullable = true)
|-- company: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- street: struct (nullable = true)
| | | |-- number: integer (nullable = false)
| | | |-- name: string (nullable = true)
```
In order to modify an inner element, like changing the name of the street, we need to do something like this:
```scala
val mDF = df.select(df("name"),struct(
df("company.name").as("name"),
struct(
df("company.address.city").as("city"),
struct(
df("company.address.street.number").as("number"),
upper(df("company.address.street.name")).as("name")
).as("street")
).as("address")
).as("company"))
mDF.printSchema
val longCodeEmployee = mDF.as[Employee].head
```
```
root
|-- name: string (nullable = true)
|-- company: struct (nullable = false)
| |-- name: string (nullable = true)
| |-- address: struct (nullable = false)
| | |-- city: string (nullable = true)
| | |-- street: struct (nullable = false)
| | | |-- number: integer (nullable = true)
| | | |-- name: string (nullable = true)
mDF: DataFrame = [name: string, company: struct>>]
longCodeEmployee: Employee = Employee(
"john",
Company("awesome inc", Address("london", Street(23, "HIGH STREET"))))
```
This can be simplified by using spark-optics. It allows you to focus in the element that you want to modify,
and the optics will recreate the structure for you.
```scala
import org.hablapps.sparkOptics._
import org.apache.spark.sql.functions._
val df: DataFrame = List(employee).toDF
val streetNameLens = Lens("company.address.street.name")(df.schema)
val modifiedDF = df.select(streetNameLens.modify(upper):_*)
modifiedDF.printSchema
modifiedDF.as[Employee].head
```
```
root
|-- name: string (nullable = true)
|-- company: struct (nullable = false)
| |-- name: string (nullable = true)
| |-- address: struct (nullable = false)
| | |-- city: string (nullable = true)
| | |-- street: struct (nullable = false)
| | | |-- number: integer (nullable = true)
| | | |-- name: string (nullable = true)
streetNameLens: Lens = Lens(company.address.street.name)
modifiedDF: DataFrame = [name: string, company: struct>>]
res19_3: Employee = Employee(
"john",
Company("awesome inc", Address("london", Street(23, "HIGH STREET")))
)
```