https://github.com/ashwanthkumar/mrcube
Scalding CUBE operators
- Host: GitHub
- URL: https://github.com/ashwanthkumar/mrcube
- Owner: ashwanthkumar
- Created: 2013-08-21T16:16:54.000Z (about 11 years ago)
- Default Branch: master
- Last Pushed: 2015-01-04T12:27:12.000Z (almost 10 years ago)
- Last Synced: 2024-04-14T09:19:01.913Z (7 months ago)
- Language: Scala
- Size: 180 KB
- Stars: 4
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
[![Build Status](https://snap-ci.com/ashwanthkumar/mrcube/branch/master/build_image)](https://snap-ci.com/ashwanthkumar/mrcube/branch/master)
# Scalding MRCube
Scalding CUBE operators
Naive `cubify` and `rollup` methods on `RichPipe`.

For each input tuple:
- `cubify` generates 2^n tuples, where n is the number of fields being cubed
- `rollup` generates n+1 tuples, where n is the number of fields being rolled up

### Dev
```bash
$ git clone https://github.com/ashwanthkumar/mrcube.git
$ cd mrcube
$ sbt test
```

### Dependencies
For Maven,
```xml
<dependency>
  <groupId>in.ashwanthkumar</groupId>
  <artifactId>mrcube_2.10</artifactId>
  <version>0.12.0</version>
</dependency>
```
For SBT,
```sbt
libraryDependencies += "in.ashwanthkumar" %% "mrcube" % "0.12.0"
```

### Cubify
If the input tuple is ``("ipod", "miami", "2012", "200000")``, the job generates the following output:
```
("ipod", "miami", "2012", "1", "200000.0")
("ipod", "miami", "null", "1", "200000.0")
("ipod", "null", "null", "1", "200000.0")
("ipod", "null", "2012", "1", "200000.0")
("null", "null", "2012", "1", "200000.0")
("null", "miami", "null", "1", "200000.0")
("null", "miami", "2012", "1", "200000.0")
("null", "null", "null", "1", "200000.0")
```

Instead of `"null"`, you can pass a custom string to `cubify`.
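
The expansion behind the eight rows above can be sketched in plain Scala, without Scalding. This is only an illustration of the idea, not the library's implementation: `CubifySketch` and its `cubify` helper are hypothetical names, and the order of the generated rows is not meant to match the job's output.

```scala
// Illustration only: a plain-Scala model of the cube expansion,
// not the code used by mrcube.
object CubifySketch {
  // Hypothetical helper: every dimension is either kept or replaced by
  // `nullValue`, so n dimensions yield 2^n combinations.
  def cubify(dims: Seq[String], nullValue: String = "null"): Seq[Seq[String]] =
    dims.foldLeft(Seq(Seq.empty[String])) { (acc, d) =>
      acc.flatMap(prefix => Seq(prefix :+ d, prefix :+ nullValue))
    }

  def main(args: Array[String]): Unit = {
    // Three dimensions, with "*" as the custom placeholder
    val rows = cubify(Seq("ipod", "miami", "2012"), nullValue = "*")
    rows.foreach(r => println(r.mkString("(", ", ", ")")))
  }
}
```

In this sketch each expanded row would then be grouped and aggregated, which is the role of `groupBy` in the Scalding job that follows.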
```scala
import com.twitter.scalding._
import in.ashwanthkumar.mrcube._

class CubifyJob(args: Args) extends Job(args) {
  Csv(args("input"), fields = ('product, 'location, 'year, 'sales))
    .read
    // Expand each tuple over the three dimensions (2^3 = 8 tuples)
    .cubify(('product, 'location, 'year))
    // Per group: count the rows into 'size and sum 'sales
    .groupBy('product, 'location, 'year) { _.size('size).sum[Int]('sales) }
    .write(Csv(args("output")))
}
```

### Rollup
If the input tuple is ``("ipod", "miami", "2012", "200000")``, the job generates the following output:
```
("ipod", "miami", "2012", "1", "200000.0")
("null", "miami", "2012", "1", "200000.0")
("null", "null", "2012", "1", "200000.0")
("null", "null", "null", "1", "200000.0")
```

Similarly, instead of `"null"`, you can pass a custom string to `rollup`.
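
The four rows above follow a simple pattern that can be modelled in plain Scala. As before, this is a sketch under assumptions rather than the library's code: `RollupSketch` and its `rollup` helper are hypothetical, and the choice to replace fields starting from the leftmost one is inferred from the example output, not from the implementation.

```scala
// Illustration only: a plain-Scala model of the rollup expansion,
// not the code used by mrcube.
object RollupSketch {
  // Hypothetical helper: level k replaces the first k dimensions with
  // `nullValue`, so n dimensions yield n + 1 rows (k = 0 .. n).
  def rollup(dims: Seq[String], nullValue: String = "null"): Seq[Seq[String]] =
    (0 to dims.length).map { k =>
      Seq.fill(k)(nullValue) ++ dims.drop(k)
    }

  def main(args: Array[String]): Unit = {
    // Three dimensions, with "*" as the custom placeholder
    rollup(Seq("ipod", "miami", "2012"), nullValue = "*")
      .foreach(r => println(r.mkString("(", ", ", ")")))
  }
}
```

Compared with the cube expansion, the number of rows grows linearly with the number of fields instead of exponentially.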
```scala
import com.twitter.scalding._
import in.ashwanthkumar.mrcube._

class RollupJob(args: Args) extends Job(args) {
  Csv(args("input"), fields = ('product, 'location, 'year, 'sales))
    .read
    // Expand each tuple over the three dimensions (3 + 1 = 4 tuples)
    .rollup(('product, 'location, 'year))
    // Per group: count the rows into 'size and sum 'sales
    .groupBy('product, 'location, 'year) { _.size('size).sum[Int]('sales) }
    .write(Csv(args("output")))
}
```

### References
1. [Distributed Cube Materialization on Holistic Measures](http://arnab.org/files/mrcube.pdf) by [Dr. Arnab Nandi](http://arnab.org/) et al.
2. CUBE Operator in Pig - [PIG-2167](https://issues.apache.org/jira/browse/PIG-2167)

### License
Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0