https://github.com/ashwanthkumar/scalding-dataflow

[![Build Status](https://snap-ci.com/ashwanthkumar/scalding-dataflow/branch/master/build_image)](https://snap-ci.com/ashwanthkumar/scalding-dataflow/branch/master)

# scalding-dataflow
Scalding Runner for the Google Dataflow SDK. This project is a work in progress; use it at your own risk.

## Usage

You can use it in your own SBT projects:
### build.sbt
```sbt
resolvers += Resolver.sonatypeRepo("snapshots")

// For the latest version, check the most recent run of the build pipeline
libraryDependencies += "in.ashwanthkumar" %% "scalding-dataflow" % "1.0.23-SNAPSHOT"
```

### pom.xml
```xml
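<dependency>
  <groupId>in.ashwanthkumar</groupId>
  <artifactId>scalding-dataflow_2.10</artifactId>
  <version>1.0.23</version>
</dependency>
....
<repositories>
  <repository>
    <id>oss.sonatype.org-snapshot</id>
    <url>http://oss.sonatype.org/content/repositories/snapshots</url>
    <releases>
      <enabled>false</enabled>
    </releases>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>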
```

Pass the following options to the program (_WordCount_) when running it:

`--runner=ScaldingPipelineRunner --name=Main-Test --mode=local`

```java
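// Dataflow SDK 1.x imports used by this snippet (place the statements below in your main method)
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;

// Build the pipeline options from the command-line flags
PipelineOptions options = PipelineOptionsFactory
    .fromArgs(args)
    .withValidation()
    .create();
Pipeline pipeline = Pipeline.create(options);

pipeline.apply(TextIO.Read.from("kinglear.txt").named("Source"))
    // Count how often each distinct input line occurs
    .apply(Count.<String>perElement())
    // Format each (line, count) pair as a tab-separated string
    .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {
        @Override
        public void processElement(ProcessContext c) throws Exception {
            KV<String, Long> kv = c.element();
            c.output(String.format("%s\t%d", kv.getKey(), kv.getValue()));
        }
    }))
    .apply(TextIO.Write.to("out.txt").named("Sink"));

pipeline.run();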
```
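
For intuition, the pipeline above counts how often each distinct input element occurs (each element is one line of `kinglear.txt`, since `TextIO` reads the file line by line) and emits one `element<TAB>count` string per distinct element. A minimal plain-Java sketch of that same computation, without the Dataflow SDK or this runner, might look like the following. The class `PerElementCountSketch` and the method `countPerElement` are hypothetical names made up for illustration, and the in-memory list stands in for the input file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: not part of scalding-dataflow or the Dataflow SDK.
public class PerElementCountSketch {

    // Counts occurrences of each distinct element and formats them as "element\tcount",
    // mirroring Count.perElement() followed by the formatting DoFn.
    static List<String> countPerElement(List<String> elements) {
        // LinkedHashMap keeps first-seen order so this sketch is deterministic;
        // a real pipeline makes no ordering guarantee.
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String element : elements) {
            counts.merge(element, 1L, Long::sum);
        }
        List<String> formatted = new ArrayList<>();
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            formatted.add(String.format("%s\t%d", entry.getKey(), entry.getValue()));
        }
        return formatted;
    }

    public static void main(String[] args) {
        // Stand-in for the lines of the input file.
        List<String> lines = Arrays.asList("to be", "or not", "to be");
        for (String line : countPerElement(lines)) {
            System.out.println(line);
        }
    }
}
```

Note that the unit being counted is a whole line, not a word; splitting lines into words would need an extra step before the count.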

If you want to run it on HDFS (experimental), change `--mode=local` to `--mode=hdfs`.
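
Because the two modes differ only in the value of the `--mode` flag, the argument list can be assembled in one place. The sketch below is a hypothetical helper (the class `RunnerArgsSketch` and the method `forMode` are not part of this project) that builds the flags documented above as a `String[]` that could be handed to `PipelineOptionsFactory.fromArgs`. It assumes `local` and `hdfs` are the only accepted mode values, since those are the only ones this README mentions.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: not part of scalding-dataflow.
public class RunnerArgsSketch {

    // Assumption: these are the only modes, as they are the only ones the README names.
    private static final List<String> KNOWN_MODES = Arrays.asList("local", "hdfs");

    // Builds the command-line flags for the given pipeline name and mode.
    static String[] forMode(String name, String mode) {
        if (!KNOWN_MODES.contains(mode)) {
            throw new IllegalArgumentException("Unknown mode: " + mode);
        }
        return new String[] {
            "--runner=ScaldingPipelineRunner",
            "--name=" + name,
            "--mode=" + mode
        };
    }

    public static void main(String[] args) {
        // Print the flag strings for both modes.
        System.out.println(String.join(" ", forMode("Main-Test", "local")));
        System.out.println(String.join(" ", forMode("Main-Test", "hdfs")));
    }
}
```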

## Todos
### Translators
- [x] ParDo.Bound
- [x] Filter
- [x] Keys
- [x] Values
- [x] KvSwap
- [x] ParDo.Bound with sideInputs
- [x] Combine
- [x] Flatten
- [ ] ParDo.BoundMulti
- [x] Combine.GroupedValues
- [x] Combine.PerKey
- [ ] View.AsSingleton
- [ ] View.AsIterable
- [ ] Window.Bound

### IO
- [x] Text
- [ ] Custom Cascading Scheme
- [ ] Iterable of Items
- [ ] Google SDK's Coder for SerDe

### Scalding
- [x] Move to TypedPipes
- [ ] Test it in Hadoop mode