Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/twitter/summingbird
Streaming MapReduce with Scalding and Storm
- Host: GitHub
- URL: https://github.com/twitter/summingbird
- Owner: twitter
- License: apache-2.0
- Archived: true
- Created: 2012-09-25T22:38:35.000Z (over 12 years ago)
- Default Branch: develop
- Last Pushed: 2022-01-19T17:31:02.000Z (about 3 years ago)
- Last Synced: 2024-09-26T22:02:09.491Z (4 months ago)
- Language: Scala
- Homepage: https://twitter.com/summingbird
- Size: 6.31 MB
- Stars: 2,139
- Watchers: 291
- Forks: 267
- Open Issues: 163
Metadata Files:
- Readme: README.md
- Changelog: CHANGES.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
- awesome-streaming - summingbird - library that lets you write MapReduce programs that look like native Scala or Java collection transformations and execute them on a number of well-known distributed MapReduce platforms, including Storm and Scalding. (Table of Contents / DSL)
- awesome-bdccai-tools - Twitter **Summingbird**
README
## Summingbird
[![status: retired](https://opensource.twitter.dev/status/retired.svg)](https://opensource.twitter.dev/status/#retired)
[![Build Status](https://secure.travis-ci.org/twitter/summingbird.png)](http://travis-ci.org/twitter/summingbird)
[![Codecov branch](https://img.shields.io/codecov/c/github/twitter/summingbird/develop.svg?maxAge=3600)](https://codecov.io/github/twitter/summingbird)
[![Latest version](https://index.scala-lang.org/twitter/summingbird/summingbird-core/latest.svg?color=orange)](https://index.scala-lang.org/twitter/summingbird/summingbird-core)
[![Chat](https://badges.gitter.im/twitter/summingbird.svg)](https://gitter.im/twitter/summingbird?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

Summingbird is a library that lets you write MapReduce programs that look like native Scala or Java collection transformations and execute them on a number of well-known distributed MapReduce platforms, including [Storm](https://github.com/nathanmarz/storm) and [Scalding](https://github.com/twitter/scalding).
![Summingbird Logo](https://raw.github.com/twitter/summingbird/develop/logo/summingbird_logo.png)
While a word-counting aggregation in pure Scala might look like this:
```scala
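import scala.collection.mutable.{ Map => MutableMap }

// toWords is assumed to split a sentence into words (defined elsewhere).
def wordCount(source: Iterable[String], store: MutableMap[String, Long]): Unit =
  source.flatMap { sentence =>
    toWords(sentence).map(_ -> 1L)
  }.foreach { case (k, v) => store.update(k, store.getOrElse(k, 0L) + v) }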
```

Counting words in Summingbird looks like this:
```scala
def wordCount[P <: Platform[P]]
  (source: Producer[P, String], store: P#Store[String, Long]) =
    source.flatMap { sentence =>
      toWords(sentence).map(_ -> 1L)
    }.sumByKey(store)
```

The logic is exactly the same, and the code is almost the same. The main difference is that you can execute the Summingbird program in "batch mode" (using [Scalding](https://github.com/twitter/scalding)), in "realtime mode" (using [Storm](https://github.com/nathanmarz/storm)), or on both Scalding and Storm in a hybrid batch/realtime mode that offers your application very attractive fault-tolerance properties.
Summingbird provides you with the primitives you need to build rock solid production systems.
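
To make the hybrid mode more concrete, here is a minimal plain-Scala sketch of why summing by key suits both modes. It does not use the Summingbird API. Because addition is associative, counts computed by a batch job over older data and counts accumulated in realtime over newer data can be merged key by key. The names `toWords`, `countWords`, and `mergeCounts` are hypothetical helpers introduced only for illustration.

```scala
object HybridCountSketch {
  // Hypothetical tokenizer: lowercases and splits on whitespace.
  def toWords(sentence: String): Seq[String] =
    sentence.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  // Counts words in a collection of sentences (stands in for one "mode").
  def countWords(sentences: Iterable[String]): Map[String, Long] =
    sentences
      .flatMap(s => toWords(s).map(_ -> 1L))
      .groupBy(_._1)
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }

  // Merges two partial results by summing the values for each key.
  def mergeCounts(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0L) + v) }

  def main(args: Array[String]): Unit = {
    // "Batch" result over older sentences, "realtime" result over newer ones.
    val batch = countWords(List("the quick brown fox", "the lazy dog"))
    val realtime = countWords(List("the fox jumps"))
    println(mergeCounts(batch, realtime))
  }
}
```

Merging the two partial maps gives the same totals as counting all three sentences in one pass, which is the property a hybrid batch/realtime deployment relies on.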
## Getting Started: Word Count with Twitter
The `summingbird-example` project allows you to run the wordcount program above on a sample of Twitter data using a local Storm topology and memcache instance. You can find the actual job definition in [ExampleJob.scala](https://github.com/twitter/summingbird/blob/develop/summingbird-example/src/main/scala/com/twitter/summingbird/example/ExampleJob.scala).
First, make sure you have `memcached` installed locally. If you don't and you're on OS X, you can get it by installing [Homebrew](http://brew.sh/) and running this command in a shell:
```bash
brew install memcached
```

When this is finished, run the `memcached` command in a separate terminal.
Now you'll need to set up access to the Twitter Streaming API. [This blog post](http://tugdualgrall.blogspot.com/2012/11/couchbase-create-large-dataset-using.html) has a great walkthrough, so open that page, head over to https://dev.twitter.com/ and get your various keys and tokens. Once you have these, clone the Summingbird repository:
```bash
git clone https://github.com/twitter/summingbird.git
cd summingbird
```

Then open [StormRunner.scala](https://github.com/twitter/summingbird/blob/develop/summingbird-example/src/main/scala/com/twitter/summingbird/example/StormRunner.scala) in your editor. Replace the dummy values in the `config` variable with your auth tokens:
```scala
lazy val config = new ConfigurationBuilder()
  .setOAuthConsumerKey("mykey")
  .setOAuthConsumerSecret("mysecret")
  .setOAuthAccessToken("token")
  .setOAuthAccessTokenSecret("tokensecret")
  .setJSONStoreEnabled(true) // required for JSON serialization
  .build
```

You're all ready to go! Now it's time to unleash Storm on your Twitter stream. Make sure the `memcached` terminal is still open, then start Storm from the `summingbird` directory:
```bash
./sbt "summingbird-example/run --local"
```

Storm should puke out a bunch of output, then stabilize and hang. This means that Storm is updating your local memcache instance with counts of every word that it sees in each tweet.
To query the aggregate results in Memcached, you'll need to open an SBT repl in a new terminal:
```bash
./sbt summingbird-example/console
```

At the launched repl, run the following:
```scala
scala> import com.twitter.summingbird.example._
import com.twitter.summingbird.example._

scala> StormRunner.lookup("i")
res0: Option[Long] = Some(5)
scala> StormRunner.lookup("i")
res1: Option[Long] = Some(52)
```

Boom. Counts for the word `"i"` are growing in realtime.
See the [wiki page](https://github.com/twitter/summingbird/wiki/Getting-started-with-summingbird-example) for a more detailed explanation of the configuration required to get this job up and running and some ideas for where to go next.
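
If you would rather watch a count grow without retyping the lookup, a small polling helper can do it. This is only a sketch: `LookupSampler` and `sample` are hypothetical names, and it assumes the lookup is a function from `String` to `Option[Long]`, as the REPL session above suggests for `StormRunner.lookup`.

```scala
object LookupSampler {
  // Calls `lookup` for `word` `times` times, pausing `delayMs` between calls,
  // and returns the observed values in call order.
  def sample(lookup: String => Option[Long], word: String,
             times: Int, delayMs: Long): List[Option[Long]] =
    (1 to times).toList.map { i =>
      val value = lookup(word)
      if (i < times) Thread.sleep(delayMs) // no pause after the final call
      value
    }
}
```

In the example console, something like `LookupSampler.sample(w => StormRunner.lookup(w), "i", 5, 1000L)` should return five readings taken one second apart, provided `StormRunner.lookup` has the shape assumed here.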
## Documentation
To learn more and find links to tutorials and information around the web, check out the [Summingbird Wiki](https://github.com/twitter/summingbird/wiki).
The latest ScalaDocs are hosted on Summingbird's [GitHub Project Page](http://twitter.github.io/summingbird).
## Contact
Discussion occurs primarily on the [Summingbird mailing list](https://groups.google.com/forum/#!forum/summingbird). Issues should be reported on the GitHub issue tracker. Simpler issues appropriate for first-time contributors looking to help out are tagged "newbie".
IRC: freenode channel #summingbird
Follow [@summingbird](https://twitter.com/summingbird) on Twitter for updates.
Please feel free to use the beautiful [Summingbird logo](https://drive.google.com/folderview?id=0B3i3pDi3yVgNMHV0TXVkTGZteWM&usp=sharing) artwork anywhere.
## Get Involved + Code of Conduct
Pull requests and bug reports are always welcome!

We use a lightweight form of project governance inspired by the one used by Apache projects.
Please see [Contributing and Committership](https://github.com/twitter/analytics-infra-governance#contributing-and-committership) for our code of conduct and our pull request review process.
The TL;DR is send us a pull request, iterate on the feedback + discussion, and get a +1 from a [Committer](COMMITTERS.md) in order to get your PR accepted.

The current list of active committers (who can +1 a pull request) can be found here: [Committers](COMMITTERS.md)
A list of contributors to the project can be found here: [Contributors](https://github.com/twitter/summingbird/graphs/contributors)
## Maven
Summingbird modules are published on Maven Central. The current groupId and version for all modules are, respectively, `"com.twitter"` and `0.9.1`.
Current published artifacts are:
* `summingbird-core_2.11`
* `summingbird-core_2.10`
* `summingbird-batch_2.11`
* `summingbird-batch_2.10`
* `summingbird-client_2.11`
* `summingbird-client_2.10`
* `summingbird-storm_2.11`
* `summingbird-storm_2.10`
* `summingbird-scalding_2.11`
* `summingbird-scalding_2.10`
* `summingbird-builder_2.11`
* `summingbird-builder_2.10`

The suffix denotes the Scala version.
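
As a sketch of how these coordinates might be used from sbt, assuming a project whose `scalaVersion` is a 2.10.x or 2.11.x release, the `%%` operator appends the matching Scala suffix to the artifact name. The choice of modules below is only illustrative.

```scala
// build.sbt (sketch): %% appends the Scala binary version suffix (e.g. _2.11).
libraryDependencies ++= Seq(
  "com.twitter" %% "summingbird-core"  % "0.9.1",
  "com.twitter" %% "summingbird-storm" % "0.9.1"
)
```

With `scalaVersion` set to a 2.11.x release, the first entry resolves to `summingbird-core_2.11` at version `0.9.1`.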
## Authors (alphabetically)
* Oscar Boykin
* Ian O'Connell
* Sam Ritchie
* Ashutosh Singhal

## License
Copyright 2013 Twitter, Inc.
Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0