https://github.com/izeigerman/twinkle

The collection of helpers and utils for Apache Spark
apache-spark scala spark

Host: GitHub
URL: https://github.com/izeigerman/twinkle
Owner: izeigerman
License: apache-2.0
Created: 2017-06-29T00:16:05.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2017-11-10T20:15:50.000Z (over 7 years ago)
Last Synced: 2024-12-15T14:23:42.429Z (about 2 months ago)
Topics: apache-spark, scala, spark
Language: Scala
Size: 25.4 KB
Stars: 1
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Twinkle

Twinkle is a collection of tools and utilities that make Apache Spark easier to use in some scenarios.

## DataFrame Utils

### Resolve the column ambiguity

In some cases you can end up with a DataFrame that contains multiple columns with the same name. For example, this can happen as a result of a `join` operation. Twinkle has a [solution](https://github.com/izeigerman/twinkle/blob/master/twinkle/src/main/scala/twinkle/dataframe/AmbiguousColumnsUtils.scala) for this:

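```scala
// Two DataFrames that share the column name "id"
val df1 = spark.createDataFrame(Seq(
  (0, "value1")
)).toDF("id", "column1")

val df2 = spark.createDataFrame(Seq(
  (0, "value2")
)).toDF("id", "column2")

// The join result contains two columns named "id"
val joined = df1.join(df2, df1("id") === df2("id"), "inner")

import twinkle._

// Keep a single "id" column
joined.resolveAmbiguity().show()
// or rename the second "id" column (index 2) to "id2"
joined.renameAmbiguousColumns(2 -> "id2").show()
```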

Result:

```
+---+-------+-------+
| id|column1|column2|
+---+-------+-------+
|  0| value1| value2|
+---+-------+-------+

or

+---+-------+---+-------+
| id|column1|id2|column2|
+---+-------+---+-------+
|  0| value1|  0| value2|
+---+-------+---+-------+
```
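
To illustrate the renaming idea without Twinkle, here is a minimal sketch in plain Scala that makes a list of column names unique by appending an occurrence number to each repeat. This is not Twinkle's implementation: the object `ColumnNames` and the function `disambiguate` are hypothetical names chosen for this example, and the suffix scheme is an assumption.

```scala
// Hypothetical helper, not part of Twinkle: makes a sequence of
// column names unique by suffixing repeated names with their
// occurrence number (the 2nd "id" becomes "id2", the 3rd "id3", ...).
object ColumnNames {
  def disambiguate(names: Seq[String]): Seq[String] = {
    // Thread a name -> occurrence-count map through the fold so that
    // the function stays free of mutable state.
    val (_, renamed) =
      names.foldLeft((Map.empty[String, Int], Vector.empty[String])) {
        case ((counts, acc), name) =>
          val occurrence = counts.getOrElse(name, 0) + 1
          val unique = if (occurrence == 1) name else s"$name$occurrence"
          (counts.updated(name, occurrence), acc :+ unique)
      }
    renamed
  }

  def main(args: Array[String]): Unit = {
    // Column names as they appear in the joined DataFrame above.
    println(disambiguate(Seq("id", "column1", "id", "column2")))
    // prints: Vector(id, column1, id2, column2)
  }
}
```

With a Spark DataFrame, the result could be applied as `joined.toDF(ColumnNames.disambiguate(joined.columns.toSeq): _*)`. Note that this sketch does not guard against a generated name such as `id2` colliding with an existing column of that name.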