{"id":13393716,"url":"https://github.com/bcongdon/corral","last_synced_at":"2025-04-04T20:15:25.056Z","repository":{"id":40660079,"uuid":"127805115","full_name":"bcongdon/corral","owner":"bcongdon","description":"🐎 A serverless MapReduce framework written for AWS Lambda","archived":false,"fork":false,"pushed_at":"2021-12-08T20:51:11.000Z","size":1502,"stargazers_count":693,"open_issues_count":6,"forks_count":40,"subscribers_count":21,"default_branch":"master","last_synced_at":"2025-03-28T19:11:52.792Z","etag":null,"topics":["aws-lambda","mapreduce","mapreduce-framework","serverless"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bcongdon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-04-02T19:51:55.000Z","updated_at":"2025-03-24T15:56:48.000Z","dependencies_parsed_at":"2022-08-27T01:11:04.086Z","dependency_job_id":null,"html_url":"https://github.com/bcongdon/corral","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bcongdon%2Fcorral","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bcongdon%2Fcorral/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bcongdon%2Fcorral/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bcongdon%2Fcorral/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bcongdon","download_url":"https://codeload.github.com/bcongdon/corral/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https
://github.com","kind":"github","repositories_count":247242681,"owners_count":20907134,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws-lambda","mapreduce","mapreduce-framework","serverless"],"created_at":"2024-07-30T17:00:59.146Z","updated_at":"2025-04-04T20:15:25.033Z","avatar_url":"https://github.com/bcongdon.png","language":"Go","funding_links":[],"categories":["Go"],"sub_categories":[],"readme":"# 🐎 corral\n\n\u003e Serverless MapReduce\n\n[![Build Status](https://travis-ci.org/bcongdon/corral.svg?branch=master)](https://travis-ci.org/bcongdon/corral)\n[![Go Report Card](https://goreportcard.com/badge/github.com/bcongdon/corral)](https://goreportcard.com/report/github.com/bcongdon/corral)\n[![codecov](https://codecov.io/gh/bcongdon/corral/branch/master/graph/badge.svg)](https://codecov.io/gh/bcongdon/corral)\n[![GoDoc](https://godoc.org/github.com/bcongdon/corral?status.svg)](https://godoc.org/github.com/bcongdon/corral)\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"img/logo.svg\" width=\"50%\"/\u003e\n\u003c/p\u003e\n\nCorral is a MapReduce framework designed to be deployed to serverless platforms, like [AWS Lambda](https://aws.amazon.com/lambda/).\nIt presents a lightweight alternative to Hadoop MapReduce. Much of the design philosophy was inspired by Yelp's [mrjob](https://pythonhosted.org/mrjob/) --\ncorral retains mrjob's ease-of-use while gaining the type safety and speed of Go.\n\nCorral's runtime model consists of stateless, transient executors controlled by a central driver. 
Currently, the best environment for deployment is AWS Lambda,\nbut corral is modular enough that support for other serverless platforms can be added as support for Go in cloud functions improves.\n\nCorral is best suited for data-intensive but computationally inexpensive tasks, such as ETL jobs.\n\nMore details about corral's internals can be found in [this blog post](https://benjamincongdon.me/blog/2018/05/02/Introducing-Corral-A-Serverless-MapReduce-Framework/).\n\n**Contents:**\n---\n\u003c!-- START doctoc generated TOC please keep comment here to allow auto update --\u003e\n\u003c!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE --\u003e\n\n\n- [Examples](#examples)\n- [Deploying in Lambda](#deploying-in-lambda)\n  - [AWS Credentials](#aws-credentials)\n- [Configuration](#configuration)\n  - [Configuration Settings](#configuration-settings)\n    - [Framework Settings](#framework-settings)\n    - [Lambda Settings](#lambda-settings)\n  - [Command Line Flags](#command-line-flags)\n  - [Environment Variables](#environment-variables)\n  - [Config Files](#config-files)\n- [Architecture](#architecture)\n  - [Input Files / Splits](#input-files--splits)\n  - [Mappers](#mappers)\n  - [Partition / Shuffle](#partition--shuffle)\n  - [Reducers / Output](#reducers--output)\n- [Contributing](#contributing)\n  - [Running Tests](#running-tests)\n- [License](#license)\n- [Previous Work / Attributions](#previous-work--attributions)\n\n\u003c!-- END doctoc generated TOC please keep comment here to allow auto update --\u003e\n\n## Examples\n\nEvery good MapReduce framework needs a WordCount™ example. 
Here's how to write a \"word count\" in corral:\n\n```golang\npackage main\n\nimport (\n\t\"strconv\"\n\t\"strings\"\n\n\t\"github.com/bcongdon/corral\"\n)\n\ntype wordCount struct{}\n\n// Map emits one record per whitespace-separated word in the input line.\nfunc (w wordCount) Map(key, value string, emitter corral.Emitter) {\n\tfor _, word := range strings.Fields(value) {\n\t\temitter.Emit(word, \"\")\n\t}\n}\n\n// Reduce counts the values received for a word and emits the total.\nfunc (w wordCount) Reduce(key string, values corral.ValueIterator, emitter corral.Emitter) {\n\tcount := 0\n\tfor range values.Iter() {\n\t\tcount++\n\t}\n\temitter.Emit(key, strconv.Itoa(count))\n}\n\nfunc main() {\n\twc := wordCount{}\n\tjob := corral.NewJob(wc, wc)\n\n\tdriver := corral.NewDriver(job)\n\tdriver.Main()\n}\n```\n\nThis can be invoked locally by building/running the above source and adding input files as arguments:\n\n```sh\ngo run word_count.go /path/to/some_file.txt\n```\n\nBy default, job output will be stored relative to the current directory.\n\nWe can also read input from and write output to S3 by pointing to S3 buckets or files:\n```\ngo run word_count.go --out s3://my-output-bucket/ s3://my-input-bucket/*\n```\n\nMore comprehensive examples can be found in [the examples folder](https://github.com/bcongdon/corral/tree/master/examples).\n\n## Deploying in Lambda\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"img/word_count.gif\" width=\"100%\"/\u003e\n\u003c/p\u003e\n\nNo formal deployment step needs to be run to deploy a corral application to Lambda. 
Instead, add the `--lambda` flag to an invocation of a corral app, and the project code will be automatically recompiled for Lambda and uploaded.\n\nFor example:\n```\n./word_count --lambda s3://my-input-bucket/* --out s3://my-output-bucket\n```\n\nNote that you must use `s3` for input/output directories, as local data files will not be present in the Lambda environment.\n\n**NOTE**: Because corral recompiles application code to target Lambda, the command with the `--lambda` flag must be invoked from the root directory of your application's source code.\n\n### AWS Credentials\n\nAWS credentials are automatically loaded from the environment. See [this page](https://docs.aws.amazon.com/sdk-for-go/v1/developer-guide/sessions.html) for details.\n\nAs per the AWS documentation, AWS credentials are loaded in order from:\n\n1. Environment variables\n1. Shared credentials file\n1. IAM role (if executing in AWS Lambda or EC2)\n\nIn short, set up credentials in `.aws/credentials` as one would with any other AWS-powered service. If you have more than one profile in `.aws/credentials`, make sure to set the `AWS_PROFILE` environment variable to select the profile to be used.\n\n## Configuration\n\nThere are a number of ways to specify configuration for corral applications. To hard-code configuration, there are a variety of [Options](https://godoc.org/github.com/bcongdon/corral#Option) that may be used when instantiating a Job.\n\nConfiguration values are resolved in the following order, with priority given to the first location in which a value is set:\n\n1. Hard-coded job [Options](https://godoc.org/github.com/bcongdon/corral#Option).\n1. Command line flags\n1. Environment variables\n1. Configuration file\n1. Default values\n\n### Configuration Settings\n\nBelow are the config settings that may be changed.\n\n#### Framework Settings\n* `splitSize` (int64) - The maximum size (in bytes) of any single file input split. 
(Default: 100Mb)\n* `mapBinSize` (int64) - The maximum size (in bytes) of the combined input size to a mapper. (Default: 512Mb)\n* `reduceBinSize` (int64) - The maximum size (in bytes) of the combined input size to a reducer. This is an \"expected\" maximum, assuming uniform key distribution. (Default: 512Mb)\n* `maxConcurrency` (int) - The maximum number of executors (local, Lambda, or otherwise) that may run concurrently. (Default: `100`)\n* `workingLocation` (string) - The location (local or S3) to use for writing intermediate and output data.\n* `verbose` (bool) - Enables debug logging if set to `true`\n\n#### Lambda Settings\n* `lambdaFunctionName` (string) - The name to use for created Lambda functions. (Default: `corral_function`)\n* `lambdaManageRole` (bool) - Whether corral should manage creating an IAM role for Lambda execution. (Default: `true`)\n* `lambdaRoleARN` (string) - If `lambdaManageRole` is disabled, the ARN specified in `lambdaRoleARN` is used as the Lambda function's executor role.\n* `lambdaTimeout` (int64) - The timeout (maximum function duration) in seconds of created Lambda functions. See [AWS lambda docs](https://docs.aws.amazon.com/lambda/latest/dg/resource-model.html) for details. (Default: `180`)\n* `lambdaMemory` (int64) - The maximum memory that a Lambda function may use. See [AWS lambda docs](https://docs.aws.amazon.com/lambda/latest/dg/resource-model.html) for details. (Default: `1500`)\n\n### Command Line Flags\n\nThe following flags are available at runtime as command-line flags:\n```\n      --lambda            Use lambda backend\n      --memprofile file   Write memory profile to file\n  -o, --out directory     Output directory (can be local or in S3)\n      --undeploy          Undeploy the Lambda function and IAM permissions without running the driver\n  -v, --verbose           Output verbose logs\n```\n\n### Environment Variables\n\nCorral leverages [Viper](https://github.com/spf13/viper) for specifying config. 
Any of the above configuration settings can be set as environment variables by upper-casing the setting name and prepending `CORRAL_`.\n\nFor example, `lambdaFunctionName` can be configured using an env var by setting `CORRAL_LAMBDAFUNCTIONNAME`.\n\n### Config Files\n\nCorral will read settings from a file called `corralrc`. Corral checks to see if this file exists in the current directory (`.`). It can also read global settings from `$HOME/.corral/corralrc`.\n\nReference the \"Configuration Settings\" section for the configuration keys that may be used.\n\nConfig files can be in JSON, YAML, or TOML format. See [Viper](https://github.com/spf13/viper) for more details.\n\n## Architecture\n\nBelow is a high-level diagram describing the MapReduce architecture corral uses.\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"img/architecture.svg\" width=\"80%\"/\u003e\n\u003c/p\u003e\n\n### Input Files / Splits\n\nInput files are split byte-wise into contiguous chunks of maximum size `splitSize`. These splits are packed into \"input bins\" of maximum size `mapBinSize`. The bin packing algorithm tries to assign contiguous chunks of a single file to the same mapper, but this behavior is not guaranteed.\n\nThere is a one-to-one correspondence between an \"input bin\" and the data that a mapper reads. That is, each mapper is assigned to process exactly one input bin. For jobs that run on Lambda, you should tune `mapBinSize`, `splitSize`, and `lambdaTimeout` accordingly so that mappers are able to process their entire input before timing out.\n\nInput data is streamed into the mapper, so the entire input data needn't fit in memory.\n\n### Mappers\n\nInput data is fed into the map function line-by-line. 
Input splits are calculated byte-wise, but this is rectified during the Map phase into a logical split \"by line\" (to prevent partial reads, or the loss of records that span input splits).\n\nMappers may maintain state if desired (though this is not encouraged).\n\n### Partition / Shuffle\n\nKey/value pairs emitted during the map stage are written to intermediate files. Keys are partitioned into one of `N` buckets, where `N` is the number of reducers. As a result, each mapper may write to as many as `N` separate files.\n\nThis results in a set of files labeled `map-binX-Y` where `X` is a number between 0 and N-1, and `Y` is the mapper's ID (a number between 0 and the number of mappers).\n\n### Reducers / Output\n\nCurrently, reducer input must be able to fit in memory. This is because keys are only partitioned, not sorted. The reducer performs an in-memory per-key partition.\n\nReducers receive per-key values in an arbitrary order. It is guaranteed that all values for a given key will be provided in a single call to Reduce by-key.\n\nValues emitted from a reducer will be stored in tab separated format (i.e. `KEY\\tVALUE`) in files labeled `output-X` where `X` is the reducer's ID (a number between 0 and the number of reducers).\n\nReducers may maintain state if desired (though this is not encouraged).\n\n## Contributing\n\nContributions to corral are more than welcomed! In general, the preference is to discuss potential changes in the issues before changes are made.\n\nMore information is included in [CONTRIBUTING.md](CONTRIBUTING.md).\n\n### Running Tests\n\nTo run tests, run the following command in the root project directory:\n\n```\ngo test ./...\n```\n\nNote that some tests (i.e. the tests of `corfs`) require AWS credentials to be present.\n\nThe main corral repository has TravisCI set up. If you fork this repo, you can enable TravisCI on your fork. 
You will need to set the following environment variables for all the tests to work:\n\n* `AWS_ACCESS_KEY_ID`: Credentials access key\n* `AWS_SECRET_ACCESS_KEY`: Credentials secret key\n* `AWS_DEFAULT_REGION`: Region to use for S3 tests\n* `AWS_TEST_BUCKET`: The S3 bucket to use for tests (just the name; i.e. `testBucket` instead of `s3://testBucket`)\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details\n\n## Previous Work / Attributions\n\n- [lambda-refarch-mapreduce](https://github.com/awslabs/lambda-refarch-mapreduce) - Python/Node.JS reference MapReduce Architecture\n    - Uses a \"recursive\" style reducer instead of parallel reducers\n    - Requires that all reducer output can fit in memory of a single lambda function\n- [mrjob](https://github.com/Yelp/mrjob)\n    - Excellent Python library for writing MapReduce jobs for Hadoop, EMR/Dataproc, and others\n- [dmrgo](https://github.com/dgryski/dmrgo)\n    - mrjob-inspired Go MapReduce library\n- [Zappa](https://github.com/Miserlou/Zappa)\n\t- Serverless Python toolkit. Inspired much of the way that corral does automatic Lambda deployment\n- Logo: [Fence by Vitaliy Gorbachev from the Noun Project](https://thenounproject.com/search/?q=fence\u0026i=1291185)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbcongdon%2Fcorral","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbcongdon%2Fcorral","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbcongdon%2Fcorral/lists"}