https://github.com/ankushkhanna/spark-common
Spark Commons, some hacks to simplify programming with Spark.
- Host: GitHub
- URL: https://github.com/ankushkhanna/spark-common
- Owner: AnkushKhanna
- Created: 2016-06-01T09:32:14.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2017-11-09T17:18:57.000Z (over 7 years ago)
- Last Synced: 2024-11-16T12:34:43.562Z (3 months ago)
- Topics: spark, transfomer
- Language: Scala
- Size: 18.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
### Chaining multiple transformers:
Whenever I needed more than one transformation, I had to chain Transformers by hand or build a Pipeline. A Pipeline is the go-to approach, but it sometimes became cumbersome while experimenting with different transformations.
For example, I had to maintain intermediate transformation columns and pass column names between transformers.
So I built a [Transform](https://github.com/AnkushKhanna/spark-common/blob/master/src/main/scala/common/transfomration/Transform.scala#L7-L14) class,
which is extended by the most common transformers I used: Tokenizer, Hashing, and TFIDF. To chain Tokenizer and Hashing, we can now use:
```
val transform = new Transform with TTokenize with THashing
```
To add TFIDF the transformer would look like:
```
val transform = new Transform with TTokenize with THashing with TIDF
```
This works from left to right: Tokenizer is applied first, then Hashing, and finally TFIDF. This makes life easier when trying out different Transformer combinations, without the headache of maintaining intermediate columns.
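
One way this kind of left-to-right chaining can be achieved in Scala is the stackable trait pattern, where each mixed-in trait first delegates to `super` and then applies its own step. The sketch below is only an illustration under that assumption: it uses no Spark, and the names `Step`, `Tokenize`, `Hashing`, `Idf`, and `ChainDemo` are hypothetical stand-ins rather than the repository's actual `Transform`, `TTokenize`, `THashing`, and `TIDF` code.

```scala
// Hypothetical stand-ins for illustration; not the repository's classes.
class Step {
  // Base case: nothing has been applied yet, so return the stages unchanged.
  def run(stages: List[String]): List[String] = stages
}

trait Tokenize extends Step {
  // Run everything mixed in to the left first, then record this stage.
  override def run(stages: List[String]): List[String] =
    super.run(stages) :+ "tokenize"
}

trait Hashing extends Step {
  override def run(stages: List[String]): List[String] =
    super.run(stages) :+ "hashing"
}

trait Idf extends Step {
  override def run(stages: List[String]): List[String] =
    super.run(stages) :+ "idf"
}

object ChainDemo {
  // Returns the order in which the mixed-in stages were applied.
  def order(): List[String] = {
    val chain = new Step with Tokenize with Hashing with Idf
    chain.run(Nil)
  }
}
```

Scala linearizes the mixins so that the rightmost trait is called first, but because each trait calls `super.run` before appending its own stage, the stages end up applied in the order they are written. Here `ChainDemo.order()` yields `List("tokenize", "hashing", "idf")`.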
[See source code](https://github.com/AnkushKhanna/spark-common/blob/master/src/main/scala/common/transfomration/Transform.scala)
[See usage code](https://github.com/AnkushKhanna/spark-common/blob/master/src/test/scala/common/transfomration/TransformTest.scala)
To extend this class with further transformations, see the [source code extension](https://github.com/AnkushKhanna/spark-common/blob/master/src/main/scala/common/transfomration/Transform.scala#L27-L38).