{"id":13440387,"url":"https://github.com/twitter/scalding","last_synced_at":"2025-10-05T23:21:47.311Z","repository":{"id":2198248,"uuid":"3146558","full_name":"twitter/scalding","owner":"twitter","description":"A Scala API for Cascading","archived":false,"fork":false,"pushed_at":"2023-05-28T19:18:59.000Z","size":19978,"stargazers_count":3515,"open_issues_count":316,"forks_count":708,"subscribers_count":318,"default_branch":"develop","last_synced_at":"2025-05-07T01:45:55.202Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://twitter.com/scalding","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/twitter.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGES.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2012-01-10T16:22:08.000Z","updated_at":"2025-04-15T19:47:10.000Z","dependencies_parsed_at":"2022-08-06T12:01:18.023Z","dependency_job_id":"f612417c-887e-4c57-a760-5c8fb7ba10cb","html_url":"https://github.com/twitter/scalding","commit_stats":{"total_commits":2955,"total_committers":221,"mean_commits":"13.371040723981901","dds":0.8988155668358714,"last_synced_commit":"a0516e0f43b966a04244e8e8bad80438d61773ea"},"previous_names":[],"tags_count":94,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/twitter%2Fscalding","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/twitter%2Fscalding/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/twitter%2Fscalding/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/twitter%2Fscalding/manifests","owner_url":"htt
ps://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/twitter","download_url":"https://codeload.github.com/twitter/scalding/tar.gz/refs/heads/develop","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254059520,"owners_count":22007771,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T03:01:22.319Z","updated_at":"2025-10-05T23:21:42.272Z","avatar_url":"https://github.com/twitter.png","language":"Scala","readme":"# Scalding\n\n[![Build status](https://github.com/twitter/scalding/actions/workflows/CI.yml/badge.svg?branch=develop)](https://github.com/twitter/scalding/actions)\n[![Coverage Status](https://img.shields.io/codecov/c/github/twitter/scalding/develop.svg?maxAge=3600)](https://codecov.io/github/twitter/scalding)\n[![Latest version](https://index.scala-lang.org/twitter/scalding/scalding-core/latest.svg?color=orange)](https://index.scala-lang.org/twitter/scalding/scalding-core)\n[![Chat](https://badges.gitter.im/twitter/scalding.svg)](https://gitter.im/twitter/scalding?utm_source=badge\u0026utm_medium=badge\u0026utm_campaign=pr-badge\u0026utm_content=badge)\n\nScalding is a Scala library that makes it easy to specify Hadoop MapReduce jobs. Scalding is built on top of [Cascading](http://www.cascading.org/), a Java library that abstracts away low-level Hadoop details. 
Scalding is comparable to [Pig](http://pig.apache.org/), but offers tight integration with Scala, bringing advantages of Scala to your MapReduce jobs.\n\n![Scalding Logo](https://raw.github.com/twitter/scalding/develop/logo/scalding.png)\n\n## Word Count\n\nHadoop is a distributed system for counting words. Here is how it's done in Scalding.\n\n```scala\npackage com.twitter.scalding.examples\n\nimport com.twitter.scalding._\nimport com.twitter.scalding.source.TypedText\n\nclass WordCountJob(args: Args) extends Job(args) {\n  TypedPipe.from(TextLine(args(\"input\")))\n    .flatMap { line =\u003e tokenize(line) }\n    .groupBy { word =\u003e word } // use each word for a key\n    .size // in each group, get the size\n    .write(TypedText.tsv[(String, Long)](args(\"output\")))\n\n  // Split a piece of text into individual words.\n  def tokenize(text: String): Array[String] = {\n    // Lowercase each word and remove punctuation.\n    text.toLowerCase.replaceAll(\"[^a-zA-Z0-9\\\\s]\", \"\").split(\"\\\\s+\")\n  }\n}\n```\n\nNotice that the `tokenize` function, which is standard Scala, integrates naturally with the rest of the MapReduce job. This is a very powerful feature of Scalding. (Compare it to the use of UDFs in Pig.)\n\nYou can find more example code under [examples/](https://github.com/twitter/scalding/tree/master/scalding-commons/src/main/scala/com/twitter/scalding/examples). If you're interested in comparing Scalding to other languages, see our [Rosetta Code page](https://github.com/twitter/scalding/wiki/Rosetta-Code), which has several MapReduce tasks in Scalding and other frameworks (e.g., Pig and Hadoop Streaming).\n\n## Documentation and Getting Started\n\n* [**Getting Started**](https://github.com/twitter/scalding/wiki/Getting-Started) page on the [Scalding Wiki](https://github.com/twitter/scalding/wiki)\n* [Scalding Scaladocs](http://twitter.github.com/scalding) provide details beyond the API References. 
Prefer using this as it's always up to date.\n* [**REPL in Wonderland**](tutorial/WONDERLAND.md) a hands-on tour of the scalding REPL requiring only git and java installed.\n* [**Runnable tutorials**](https://github.com/twitter/scalding/tree/master/tutorial) in the source.\n* The API Reference, including many example Scalding snippets:\n  * [Type-safe API Reference](https://github.com/twitter/scalding/wiki/Type-safe-api-reference)\n  * [Fields-based API Reference](https://github.com/twitter/scalding/wiki/Fields-based-API-Reference)\n* The Matrix Library provides a way of working with key-attribute-value scalding pipes:\n  * The [Introduction to Matrix Library](https://github.com/twitter/scalding/wiki/Introduction-to-Matrix-Library) contains an overview and a \"getting started\" example\n  * The [Matrix API Reference](https://github.com/twitter/scalding/wiki/Matrix-API-Reference) contains the Matrix Library API reference with examples\n* [**Introduction to Scalding Execution**](https://github.com/twitter/scalding/wiki/Calling-Scalding-from-inside-your-application) contains general rules and examples of calling Scalding from inside another application.\n\nPlease feel free to use the beautiful [Scalding logo](https://drive.google.com/folderview?id=0B3i3pDi3yVgNbm9pMUdDcHFKVEk\u0026usp=sharing) artwork anywhere.\n\n## Contact\nFor user questions or scalding development (internals, extending, release planning):\n\u003chttps://groups.google.com/forum/#!forum/scalding-dev\u003e (Google search also works as a first step)\n\nIn the remote possibility that there exist bugs in this code, please report them to:\n\u003chttps://github.com/twitter/scalding/issues\u003e\n\nFollow [@Scalding](http://twitter.com/scalding) on Twitter for updates.\n\nChat: [![Gitter](https://badges.gitter.im/twitter/scalding.svg)](https://gitter.im/twitter/scalding?utm_source=badge\u0026utm_medium=badge\u0026utm_campaign=pr-badge)\n\n## Get Involved + Code of Conduct\nPull requests and bug reports are 
always welcome!\n\nWe use a lightweight form of project governance inspired by the one used by Apache projects.\nPlease see [Contributing and Committership](https://github.com/twitter/analytics-infra-governance#contributing-and-committership) for our code of conduct and our pull request review process.\nThe TL;DR: send us a pull request, iterate on the feedback and discussion, and get a +1 from a [Committer](COMMITTERS.md) to get your PR accepted.\n\nThe current list of active committers (who can +1 a pull request) can be found here: [Committers](COMMITTERS.md)\n\nA list of contributors to the project can be found here: [Contributors](https://github.com/twitter/scalding/graphs/contributors)\n\n## Building\nThere is a script (called sbt) in the root that loads the correct sbt version to build:\n\n1. ```./sbt update``` (takes 2 minutes or more)\n2. ```./sbt test```\n3. ```./sbt assembly``` (needed to make the jar used by the scald.rb script)\n\nThe test suite takes a while to run. 
When you're in sbt, here's a shortcut to run just one test:\n\n```\u003e test-only com.twitter.scalding.FileSourceTest```\n\nPlease refer to the [FAQ page](https://github.com/twitter/scalding/wiki/Frequently-asked-questions#issues-with-sbt) if you encounter problems when using sbt.\n\nWe use GitHub Actions to verify the build:\n[![Build Status](https://github.com/twitter/scalding/actions/workflows/CI.yml/badge.svg?branch=develop)](https://github.com/twitter/scalding/actions)\n\nWe use [Coveralls](https://coveralls.io/r/twitter/scalding) for code coverage results:\n[![Coverage Status](https://coveralls.io/repos/twitter/scalding/badge.png?branch=develop)](https://coveralls.io/r/twitter/scalding?branch=develop)\n\nScalding modules are available from Maven Central.\n\nThe current group ID and version for all modules are, respectively, `\"com.twitter\"` and `0.17.2`.\n\nCurrent published artifacts are:\n\n* `scalding-core_2.11`, `scalding-core_2.12`\n* `scalding-args_2.11`, `scalding-args_2.12`\n* `scalding-date_2.11`, `scalding-date_2.12`\n* `scalding-commons_2.11`, `scalding-commons_2.12`\n* `scalding-avro_2.11`, `scalding-avro_2.12`\n* `scalding-parquet_2.11`, `scalding-parquet_2.12`\n* `scalding-repl_2.11`, `scalding-repl_2.12`\n\nThe suffix denotes the Scala version.\n\n## Adopters\n\n* eBay\n* Etsy\n* Sharethrough\n* Snowplow Analytics\n* SoundCloud\n* Twitter\n\nTo see a full list of users or to add yourself, see the [wiki](https://github.com/twitter/scalding/wiki/Powered-By)\n\n## Authors:\n* Avi Bryant \u003chttp://twitter.com/avibryant\u003e\n* Oscar Boykin \u003chttp://twitter.com/posco\u003e\n* Argyris Zymnis \u003chttp://twitter.com/argyris\u003e\n\nThanks for assistance and contributions:\n\n* Sam Ritchie \u003chttp://twitter.com/sritchie\u003e\n* Aaron Siegel: \u003chttp://twitter.com/asiegel\u003e\n* Ian O'Connell \u003chttp://twitter.com/0x138\u003e\n* Alex Levenson \u003chttp://twitter.com/THISWILLWORK\u003e\n* Jonathan Coveney 
\u003chttp://twitter.com/jco\u003e\n* Kevin Lin \u003chttp://twitter.com/reconditesea\u003e\n* Brad Greenlee: \u003chttp://twitter.com/bgreenlee\u003e\n* Edwin Chen \u003chttp://twitter.com/edchedch\u003e\n* Arkajit Dey: \u003chttp://twitter.com/arkajit\u003e\n* Krishnan Raman: \u003chttp://twitter.com/dxbydt_jasq\u003e\n* Flavian Vasile \u003chttp://twitter.com/flavianv\u003e\n* Chris Wensel \u003chttp://twitter.com/cwensel\u003e\n* Ning Liang \u003chttp://twitter.com/ningliang\u003e\n* Dmitriy Ryaboy \u003chttp://twitter.com/squarecog\u003e\n* Dong Wang \u003chttp://twitter.com/dongwang218\u003e\n* Josh Attenberg \u003chttp://twitter.com/jattenberg\u003e\n* Juliet Hougland \u003chttps://twitter.com/j_houg\u003e\n* Eddie Xie \u003chttps://twitter.com/eddiex\u003e\n\nA full list of [contributors](https://github.com/twitter/scalding/graphs/contributors) can be found on GitHub.\n\n## License\n\nCopyright 2016 Twitter, Inc.\n\nLicensed under the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0)\n\n","funding_links":[],"categories":["Scala","Table of Contents","Distributed Programming","Big Data","`Distributed Programming `","[](https://github.com/josephmisiti/awesome-machine-learning/blob/master/README.md#scala)Scala","大数据"],"sub_categories":["General-Purpose Machine Learning","Big Data","Spring Cloud框架"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftwitter%2Fscalding","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftwitter%2Fscalding","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftwitter%2Fscalding/lists"}