https://github.com/izeigerman/twinkle
The collection of helpers and utils for Apache Spark
apache-spark scala spark
- Host: GitHub
- URL: https://github.com/izeigerman/twinkle
- Owner: izeigerman
- License: apache-2.0
- Created: 2017-06-29T00:16:05.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2017-11-10T20:15:50.000Z (over 7 years ago)
- Last Synced: 2024-12-15T14:23:42.429Z (about 2 months ago)
- Topics: apache-spark, scala, spark
- Language: Scala
- Size: 25.4 KB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Twinkle
Twinkle is a collection of tools and utilities that make Apache Spark easier to use in some scenarios.
## DataFrame Utils
### Resolve the column ambiguity
In some cases a DataFrame can end up with multiple columns that share the same name, for example as the result of a `join` operation. Twinkle has a [solution](https://github.com/izeigerman/twinkle/blob/master/twinkle/src/main/scala/twinkle/dataframe/AmbiguousColumnsUtils.scala) for this:
```scala
val df1 = spark.createDataFrame(Seq(
  (0, "value1")
)).toDF("id", "column1")

val df2 = spark.createDataFrame(Seq(
  (0, "value2")
)).toDF("id", "column2")

// Joining on an expression keeps both "id" columns in the result.
val joined = df1.join(df2, df1("id") === df2("id"), "inner")

import twinkle._

joined.resolveAmbiguity().show()
// or
joined.renameAmbiguousColumns(2 -> "id2").show()
```
Result:
```
+---+-------+-------+
| id|column1|column2|
+---+-------+-------+
|  0| value1| value2|
+---+-------+-------+

or

+---+-------+---+-------+
| id|column1|id2|column2|
+---+-------+---+-------+
|  0| value1|  0| value2|
+---+-------+---+-------+
```
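To make the two result shapes above easier to reason about, here is a minimal sketch in plain Scala that works on a list of column names only. It is not Twinkle's implementation: `AmbiguitySketch`, `duplicatePositions`, `renameByPosition` and `dropDuplicates` are hypothetical names introduced for illustration. It also assumes that the index in `2 -> "id2"` is a 0-based column position, which is inferred from the output shown above and not confirmed by the library's documentation.

```scala
// Hypothetical helpers for illustration only; none of these names exist in Twinkle or Spark.
object AmbiguitySketch {

  /** 0-based positions of every column whose name already appeared earlier in the list. */
  def duplicatePositions(columns: Seq[String]): Seq[Int] =
    columns.zipWithIndex.collect {
      case (name, idx) if columns.indexOf(name) < idx => idx
    }

  /** Renames columns by 0-based position and leaves all other names unchanged. */
  def renameByPosition(columns: Seq[String], renames: (Int, String)*): Seq[String] = {
    val byIndex = renames.toMap
    columns.zipWithIndex.map { case (name, idx) => byIndex.getOrElse(idx, name) }
  }

  /** Keeps only the first occurrence of each column name, preserving order. */
  def dropDuplicates(columns: Seq[String]): Seq[String] = columns.distinct
}

object AmbiguitySketchDemo extends App {
  import AmbiguitySketch._

  // Column names of the joined DataFrame from the example above.
  val joinedColumns = Seq("id", "column1", "id", "column2")

  println(duplicatePositions(joinedColumns))
  println(renameByPosition(joinedColumns, 2 -> "id2"))
  println(dropDuplicates(joinedColumns))
}
```

With the column list from the join example, `renameByPosition` yields the same header as the second result table and `dropDuplicates` yields the header of the first. Working by position sidesteps the problem that a duplicated name cannot identify a single column.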