https://github.com/ashwanthkumar/scalding-dataflow
Scalding Runner for Google Dataflow
- Host: GitHub
- URL: https://github.com/ashwanthkumar/scalding-dataflow
- Owner: ashwanthkumar
- Created: 2015-10-01T13:58:16.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2015-10-11T05:04:27.000Z (about 9 years ago)
- Last Synced: 2024-05-21T09:27:56.275Z (6 months ago)
- Language: Scala
- Size: 320 KB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
[![Build Status](https://snap-ci.com/ashwanthkumar/scalding-dataflow/branch/master/build_image)](https://snap-ci.com/ashwanthkumar/scalding-dataflow/branch/master)
# scalding-dataflow
Scalding Runner for Google Dataflow SDK. This project is a work in progress; try it at your own risk.

## Usage
You can use it in your own SBT projects:
### build.sbt
```sbt
resolvers += Resolver.sonatypeRepo("snapshots")

// For a more recent version, check the latest run of the build pipeline
libraryDependencies += "in.ashwanthkumar" %% "scalding-dataflow" % "1.0.23-SNAPSHOT"
```

### pom.xml
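```xml
<dependency>
  <groupId>in.ashwanthkumar</groupId>
  <artifactId>scalding-dataflow_2.10</artifactId>
  <version>1.0.23</version>
</dependency>
....
<repository>
  <id>oss.sonatype.org-snapshot</id>
  <url>http://oss.sonatype.org/content/repositories/snapshots</url>
  <releases><enabled>false</enabled></releases>
  <snapshots><enabled>true</enabled></snapshots>
</repository>
```

Pass the following options to the program (_WordCount_) when running it: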
`--runner=ScaldingPipelineRunner --name=Main-Test --mode=local`
```java
// Build the options from command-line args such as --runner=ScaldingPipelineRunner
PipelineOptions options = PipelineOptionsFactory
    .fromArgs(args)
    .withValidation()
    .create();
Pipeline pipeline = Pipeline.create(options);

pipeline.apply(TextIO.Read.from("kinglear.txt").named("Source"))
    .apply(Count.<String>perElement())
    .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {
      @Override
      public void processElement(ProcessContext c) throws Exception {
        // Each element is a (value, count) pair; emit it as a tab-separated line
        KV<String, Long> kv = c.element();
        c.output(String.format("%s\t%d", kv.getKey(), kv.getValue()));
      }
    }))
    .apply(TextIO.Write.to("out.txt").named("Sink"));

pipeline.run();
```

To run it on HDFS (experimental), change `--mode=local` to `--mode=hdfs`.
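
To make the expected contents of `out.txt` concrete, here is a minimal plain-Java sketch of what the pipeline above computes. It does not use the Dataflow SDK or this runner; the class name `LineCountSketch`, the method `countPerElement`, and the sample input are hypothetical and exist only for illustration. Because `TextIO.Read` emits one element per line and the pipeline has no tokenizing step, `Count.perElement()` counts identical lines rather than individual words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the pipeline: count each distinct input line
// and format every (line, count) pair as "line<TAB>count".
public class LineCountSketch {

    static List<String> countPerElement(List<String> lines) {
        // TreeMap is used only to make the sketch's output order deterministic;
        // the real pipeline gives no ordering guarantee.
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            counts.merge(line, 1L, Long::sum);
        }
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            // Same format string as the DoFn in the pipeline
            rows.add(String.format("%s\t%d", entry.getKey(), entry.getValue()));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Assumed sample input standing in for the lines of kinglear.txt
        List<String> lines = Arrays.asList("to be", "or not", "to be");
        for (String row : countPerElement(lines)) {
            System.out.println(row);
        }
    }
}
```

To count words instead of lines, the pipeline would need an extra step that splits each line into tokens before `Count.perElement()`.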
## Todos
### Translators
- [x] ParDo.Bound
- [x] Filter
- [x] Keys
- [x] Values
- [x] KvSwap
- [x] ParDo.Bound with sideInputs
- [x] Combine
- [x] Flatten
- [ ] ParDo.BoundMulti
- [x] Combine.GroupedValues
- [x] Combine.PerKey
- [ ] View.AsSingleton
- [ ] View.AsIterable
- [ ] Window.Bound

### IO
- [x] Text
- [ ] Custom Cascading Scheme
- [ ] Iterable of Items
- [ ] Google SDK's Coder for SerDe

### Scalding
- [x] Move to TypedPipes
- [ ] Test it on Hadoop Mode