![](/media/logo.png)

A fast-to-iterate, fast-to-run, Go-based toolkit for ETL and feature extraction on Hadoop.

Use [crunch-starter](https://github.com/jondot/crunch-starter) for a boilerplate project to kickstart a production
setup.


## Quick Start

Crunch is optimized to be a big-bang-for-the-buck library, yet almost
every aspect is extensible.

Let's say you have a log of semi-structured and deeply nested JSON, where each
line contains a record.

You would like to:

1. Parse JSON records
2. Extract fields
3. Clean up/process fields
4. Extract features - run custom code on field values and
   output the result as new field(s)

![](/media/crunch.gif)


So here's a detailed view:

```go
// Describe your row
transform := crunch.NewTransformer()
row := crunch.NewRow()
// Use "field_name type".
// Types are Hive types.
row.FieldWithValue("ev_smp int", "1.0")
// If no type is given, 'string' is assumed.
row.FieldWithDefault("ip", "0.0.0.0", makeQuery("head.x-forwarded-for"), transform.AsIs)
row.FieldWithDefault("ev_ts", "", makeQuery("action.timestamp"), transform.AsIs)
row.FieldWithDefault("ev_source", "", makeQuery("action.source"), transform.AsIs)
row.Feature("doing ip to location", []string{"country", "city"},
  func(r crunch.DataReader, row *crunch.Row) []string {
    // call your "standard" Go code for doing ip2location
    return ip2location(row["ip"])
  })

// By default, this builds a Hadoop-compatible streamer process that understands JSON (stdin[JSON] to stdout[TSV]).
// It also plugs in Crunch's CLI utility functions (use -help).
crunch.ProcessJson(row)
```

Build your processor:

```
$ go build my_processor.go
```

Generate a Pig driver that uses `my_processor`, and a Hive table
creation DDL:

```
$ ./my_processor -crunch.stubs="."
```

You can now ship your binary and scripts (crunch.hql, crunch.pig) to
your cluster.

On your cluster, you can now set up your table with Hive and run an ETL job with Pig:

```
$ hive -f crunch.hql
$ pig -stop_on_failure --param inurl=s3://inbucket/logs/dt=20140304 --param outurl=s3://outbucket/success/dt=20140304 crunch.pig
```

## Row Setup

The row setup is the most important part of the processor.

Make a row:

```go
transform := crunch.NewTransformer()
row := crunch.NewRow()
```

And start describing fields in it:

```go
row.FieldWithDefault("name type", "default-value", <lookup function>, <transform function>)
```

A field description is:

* A `name type` pair, where types are Hive types.
* A default value (for `FieldWithDefault`; there are variants of this -- see the API docs).
* A lookup function (the 'Extract' part of ETL) - see one in the
  example processor. It outputs an `interface{}`.
* A transform function, which eventually should represent that
  `interface{}` as a string, though its contents can vary based on semantics (JSON, int values, dates, etc.).
## The Processor

Crunch comes with a built-in processor rig that packs its API into
a ready-made processor:

```go
crunch.ProcessJson(row)
```

This processor reads JSON and outputs Hadoop-streaming TSV that is compatible with [Pig STREAM](https://pig.apache.org/docs/r0.11.1/basic.html#STREAM) (which we use later), based on your row description and functions.

It also injects the following commands into your binary:

```
$ ./simple_processor -help
Usage of ./simple_processor:
  -crunch.cpuprofile="": Turn on CPU profiling and write to the specified file.
  -crunch.hivetemplate="": Custom Hive template for stub generation.
  -crunch.pigtemplate="": Custom Pig template for stub generation.
  -crunch.stubs="": Generate stubs and output to given path, and exit.
```
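The streaming contract sketched below -- one JSON record per stdin line in, one tab-separated row per stdout line out -- is what makes such a processor pluggable into Hadoop streaming and Pig STREAM. This is a simplification for illustration, not Crunch's internals; the field list is hardcoded and `processLine` is a hypothetical helper:

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// processLine flattens one JSON record into a TSV row for the given
// fields; ok is false when the line is not valid JSON.
func processLine(fields []string, line []byte) (row string, ok bool) {
	var rec map[string]interface{}
	if err := json.Unmarshal(line, &rec); err != nil {
		return "", false
	}
	cols := make([]string, len(fields))
	for i, f := range fields {
		cols[i] = fmt.Sprintf("%v", rec[f])
	}
	return strings.Join(cols, "\t"), true
}

func main() {
	fields := []string{"ip", "ev_source"} // illustrative field order
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		if row, ok := processLine(fields, sc.Bytes()); ok {
			fmt.Println(row) // Pig STREAM consumes these TSV lines
		}
	}
}
```

Because the process only talks line-oriented stdin/stdout, Hadoop can run it anywhere on the cluster without caring that it is a Go binary.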
## Building a binary

Since Go packs all dependencies into your binary, the result makes a great
delivery package for Hadoop.

Simply take a starter processor from `/examples`, base your own processor on it, and build it:

```
$ go build simple_processor.go
$ ./simple_processor -crunch.stubs="."
Generated crunch.pig
Generated crunch.hql
```

The resulting binary should be ready for action, using Pig (see the next
section).

## Generating Pig and Hive stubs

Crunch injects useful commands into your processor; one of them supports
script generation to create your Hive table and your Pig job.

```
$ ./simple_processor -crunch.stubs="."
Generated crunch.pig
Generated crunch.hql
```

You can use your own templates with the `-crunch.hivetemplate` and `-crunch.pigtemplate` flags, as long as you include a `%%schema%%` (and `%%process%%` for the Pig script) special pragma so that Crunch can replace it with the actual Pig or Hive schema.

## Extending Crunch

[this section is WIP]

Crunch is packaged into use cases accessible from the crunch package, `crunch.ProcessJson` to name one.

However, beneath the use-case facade lies an extensible API that lets
you control Crunch at any granularity.

Some detailed examples can be seen in `/examples/detailed_processor.go`.


# Contributing

Fork, implement, add tests, send a pull request, and get my everlasting thanks and a respectable place here :).


# Copyright

Copyright (c) 2014 [Dotan Nahum](http://gplus.to/dotan) [@jondot](http://twitter.com/jondot). See MIT-LICENSE for further details.