{"id":18649748,"url":"https://github.com/ashwanthkumar/scalding-dataflow","last_synced_at":"2025-11-05T10:30:30.036Z","repository":{"id":140451182,"uuid":"43499490","full_name":"ashwanthkumar/scalding-dataflow","owner":"ashwanthkumar","description":"Scalding Runner for Google Dataflow","archived":false,"fork":false,"pushed_at":"2015-10-11T05:04:27.000Z","size":328,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-05-21T09:27:56.275Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ashwanthkumar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-10-01T13:58:16.000Z","updated_at":"2016-02-07T07:08:58.000Z","dependencies_parsed_at":"2023-03-13T12:11:22.007Z","dependency_job_id":null,"html_url":"https://github.com/ashwanthkumar/scalding-dataflow","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashwanthkumar%2Fscalding-dataflow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashwanthkumar%2Fscalding-dataflow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashwanthkumar%2Fscalding-dataflow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashwanthkumar%2Fscalding-dataflow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ashwanthkumar","download_url":"https://codeload.github.com/ashwanthkumar/scalding-dataflow/tar.gz/refs/heads/m
aster","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239456405,"owners_count":19641843,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-07T06:40:30.623Z","updated_at":"2025-11-05T10:30:29.972Z","avatar_url":"https://github.com/ashwanthkumar.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Build Status](https://snap-ci.com/ashwanthkumar/scalding-dataflow/branch/master/build_image)](https://snap-ci.com/ashwanthkumar/scalding-dataflow/branch/master)\n\n# scalding-dataflow\nScalding Runner for Google Dataflow SDK. 
This project is a WIP; try it at your own risk.\n\n## Usage\n\nYou can use it in your own SBT or Maven projects.\n### build.sbt\n```sbt\nresolvers += Resolver.sonatypeRepo(\"snapshots\")\n\n// For the latest version, check the most recent run of the build pipeline\nlibraryDependencies += \"in.ashwanthkumar\" %% \"scalding-dataflow\" % \"1.0.23-SNAPSHOT\"\n```\n\n### pom.xml\n```xml\n  \u003cdependency\u003e\n    \u003cgroupId\u003ein.ashwanthkumar\u003c/groupId\u003e\n    \u003cartifactId\u003escalding-dataflow_2.10\u003c/artifactId\u003e\n    \u003c!-- For the latest version, check the most recent run of the build pipeline --\u003e\n    \u003cversion\u003e1.0.23-SNAPSHOT\u003c/version\u003e\n  \u003c/dependency\u003e\n\n  ....\n\n  \u003crepositories\u003e\n    \u003crepository\u003e\n      \u003cid\u003eoss.sonatype.org-snapshot\u003c/id\u003e\n      \u003curl\u003ehttp://oss.sonatype.org/content/repositories/snapshots\u003c/url\u003e\n      \u003creleases\u003e\n        \u003cenabled\u003efalse\u003c/enabled\u003e\n      \u003c/releases\u003e\n      \u003csnapshots\u003e\n        \u003cenabled\u003etrue\u003c/enabled\u003e\n      \u003c/snapshots\u003e\n    \u003c/repository\u003e\n  \u003c/repositories\u003e\n```\n\nPass the following options to the program (_WordCount_) when running it:\n\n`--runner=ScaldingPipelineRunner --name=Main-Test --mode=local`\n\n```java\n  PipelineOptions options = PipelineOptionsFactory\n    .fromArgs(args)\n    .withValidation()\n    .create();\n  Pipeline pipeline = Pipeline.create(options);\n\n  pipeline.apply(TextIO.Read.from(\"kinglear.txt\").named(\"Source\"))\n    .apply(Count.\u003cString\u003eperElement())\n    .apply(ParDo.of(new DoFn\u003cKV\u003cString, Long\u003e, String\u003e() {\n      @Override\n      public void processElement(ProcessContext c) throws Exception {\n        KV\u003cString, Long\u003e kv = c.element();\n        c.output(String.format(\"%s\\t%d\", kv.getKey(), kv.getValue()));\n      }\n    }))\n    
.apply(TextIO.Write.to(\"out.txt\").named(\"Sink\"));\n\n  pipeline.run();\n```\n\nTo run it on HDFS (experimental), change `mode=local` to `mode=hdfs`.\n\n## Todos\n### Translators\n- [x] ParDo.Bound\n- [x] Filter\n- [x] Keys\n- [x] Values\n- [x] KvSwap\n- [x] ParDo.Bound with sideInputs\n- [x] Combine\n- [x] Flatten\n- [ ] ParDo.BoundMulti\n- [x] Combine.GroupedValues\n- [x] Combine.PerKey\n- [ ] View.AsSingleton\n- [ ] View.AsIterable\n- [ ] Window.Bound\n\n### IO\n- [x] Text\n- [ ] Custom Cascading Scheme\n- [ ] Iterable of Items\n- [ ] Google SDK's Coder for SerDe\n\n### Scalding\n- [x] Move to TypedPipes\n- [ ] Test it on Hadoop Mode\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashwanthkumar%2Fscalding-dataflow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fashwanthkumar%2Fscalding-dataflow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashwanthkumar%2Fscalding-dataflow/lists"}