{"id":37135620,"url":"https://github.com/in4it/gomap","last_synced_at":"2026-01-14T15:48:46.416Z","repository":{"id":118634690,"uuid":"260924829","full_name":"in4it/gomap","owner":"in4it","description":"Run your MapReduce workloads as a single binary on a single machine with multiple CPUs and high memory. Pricing of a lot of small machines vs heavy machines is the same on most cloud providers.","archived":false,"fork":false,"pushed_at":"2020-06-05T12:32:35.000Z","size":81,"stargazers_count":21,"open_issues_count":0,"forks_count":8,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-06-20T11:57:12.949Z","etag":null,"topics":["bigdata","cloud","mapreduce"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/in4it.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-05-03T13:22:49.000Z","updated_at":"2022-09-16T22:16:40.000Z","dependencies_parsed_at":null,"dependency_job_id":"f6f88ff0-9b16-466b-836c-0316ee4b5843","html_url":"https://github.com/in4it/gomap","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/in4it/gomap","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/in4it%2Fgomap","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/in4it%2Fgomap/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/in4it%2Fgomap/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/in4it%2Fgomap/manifests","owner
_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/in4it","download_url":"https://codeload.github.com/in4it/gomap/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/in4it%2Fgomap/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28425040,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T15:24:48.085Z","status":"ssl_error","status_checked_at":"2026-01-14T15:23:41.940Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigdata","cloud","mapreduce"],"created_at":"2026-01-14T15:48:45.790Z","updated_at":"2026-01-14T15:48:46.400Z","avatar_url":"https://github.com/in4it.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# gomap\n[![Travis Status for in4it/gomap](https://travis-ci.org/in4it/gomap.svg?branch=master)](https://travis-ci.org/in4it/gomap)\n[![godoc for in4it/gomap](https://godoc.org/github.com/in4it/gomap?status.svg)](https://pkg.go.dev/github.com/in4it/gomap/pkg/context?tab=doc)\n\nRun your MapReduce workloads as a single binary on a single machine with multiple CPUs and high memory. 
Pricing of a lot of small machines vs heavy machines is the same on most cloud providers.\n\n# Usage\n\n## Import\nContext to start using gomap:\n```\nimport \"github.com/in4it/gomap/pkg/context\"\n```\nUtils and types (for conversions):\n```\nimport (\n  \"github.com/in4it/gomap/pkg/utils\"\n  \"github.com/in4it/gomap/pkg/types\"\n)\n```\n\n## WordCount Example\n\n```go\npackage main\n\nimport (\n  \"fmt\"\n  \"os\"\n  \"strings\"\n\n  \"github.com/in4it/gomap/pkg/context\"\n  \"github.com/in4it/gomap/pkg/utils\"\n  \"github.com/in4it/gomap/pkg/types\"\n)\n\n// Print a wordcount of an input file\nfunc main() {\n\tc := context.New()\n\terr := c.Read(\"testdata/sentences.txt\").FlatMap(func(str types.RawInput) []types.RawOutput {\n\t\treturn utils.StringArrayToRawOutput(strings.Split(string(str), \" \"))\n\t}).MapToKV(func(input types.RawInput) (types.RawOutput, types.RawOutput) {\n\t\treturn utils.RawInputToRawOutput(input), utils.StringToRawOutput(\"1\")\n\t}).ReduceByKey(func(a, b types.RawInput) types.RawOutput {\n\t\treturn utils.IntToRawOutput(utils.RawInputToInt(a) + utils.RawInputToInt(b))\n\t}).Run().Print()\n\n\tif err != nil {\n\t\tfmt.Printf(\"Error: %s\", err)\n\t\tos.Exit(1)\n\t}\n}\n```\n\n## Parquet example\n```go\npackage main\n\nimport (\n  \"github.com/in4it/gomap/pkg/context\"\n  \"github.com/in4it/gomap/pkg/utils\"\n  \"github.com/in4it/gomap/pkg/types\"\n)\n\n// define parquet schema\ntype ParquetLine struct {\n  Word  string `parquet:\"name=word, type=UTF8\"`\n  Count int64  `parquet:\"name=count, type=INT64\"`\n}\n\n// Group the rows of a parquet input by word\nfunc main() {\n\tc := context.New()\n\terr := c.ReadParquet(\"s3://bucket/directory/\", new(ParquetLine)).MapToKV(func(input types.RawInput) (types.RawOutput, types.RawOutput) {\n\t\tvar line ParquetLine\n\t\terr := utils.RawDecode(input, \u0026line)\n\t\tif err != nil {\n\t\t\tpanic(err)\n\t\t}\n\t\treturn utils.StringToRawOutput(line.Word), utils.RawEncode([]ParquetLine{line})\n\t}).ReduceByKey(func(a, b types.RawInput) 
types.RawOutput {\n\t\tvar line1 []ParquetLine\n\t\tvar line2 []ParquetLine\n\t\terr := utils.RawDecode(a, \u0026line1)\n\t\tif err != nil {\n\t\t\tpanic(err)\n\t\t}\n\t\terr = utils.RawDecode(b, \u0026line2)\n\t\tif err != nil {\n\t\t\tpanic(err)\n\t\t}\n\t\treturn utils.RawEncode(append(line1, line2...))\n\t}).Run().Foreach(func(key, value types.RawOutput) {\n\t\tvar lines []ParquetLine\n\t\terr := utils.RawDecode(value, \u0026lines)\n\t\tif err != nil {\n\t\t\tpanic(err)\n\t\t}\n\t\t//\n\t\t// you can now use string(key) and lines ([]ParquetLine)\n\t\t//\n\t})\n\n\tif err != nil {\n\t\tpanic(err)\n\t}\n}\n```\n\n## Memory usage and spill to disk\nIf you don't want to keep the full data set in memory, you can specify a buffer limit. A buffer is kept between steps (Map, FlatMap, ReduceByKey, ...). By configuring a different writer, you can influence the memory usage.\n\n### Default writer (MemoryWriter)\n```go\n\tc := New()\n\tc.SetConfig(Config{\n\t\tbufferWriter: writers.NewMemoryWriter(),\n\t})\n```\n\n### Memory and Disk Writer (MemoryAndDiskWriter)\n```go\n\tc := New()\n\tc.SetConfig(Config{\n\t\t// argument expects bytes. after 5 MB, the buffer will start spilling to disk.\n\t\tbufferWriter: writers.NewMemoryAndDiskWriter(1024 /* kb */ * 1024 /* mb */ * 5),\n\t})\n```\n\n## Currently implemented functions\n| Function | Description |\n| -------- | ----------- |\n| Map | Transform a value |\n| FlatMap | Transform and flatten a value into a slice |\n| MapToKV | Transform a value into a key-value pair |\n| ReduceByKey | Group values by key and apply a reduce function |\n| Foreach | Loop over the output of unique keys in a key-value result |\n| Filter | Filter values |\n| Print | Print output |\n| Get | Get output values |\n| GetKV | Get output keys and values |\n\n## Current inputs\n* Text files (local \u0026 S3 using s3:// prefix)\n* Parquet (local \u0026 S3 using s3:// prefix)\n\n## Concurrency\nMultiple input files are split into goroutines. 
If you have multiple cores, the goroutines can run in parallel.\n\n# Run gomap on AWS\nYou can run gomap on AWS on a spot instance using the launcher.\n\n## Configuration\n\nExample launch specification (if the AMI is not supplied, the launcher uses the latest Ubuntu Bionic AMI):\n```\n{\n    \"IamInstanceProfile\": {\n      \"Arn\": \"arn:aws:iam::1234567890:instance-profile/gomap\"\n    },\n    \"InstanceType\": \"r4.large\",\n    \"NetworkInterfaces\": [\n      {\n        \"DeviceIndex\": 0,\n        \"Groups\": [\"sg-0123456789\"],\n        \"SubnetId\": \"subnet-01234567890\"\n      }\n    ]\n}\n```\n\nNote: the instance profile should have S3 and CloudWatch Logs access.\n\n## Run\n\nDownload the wordcount and launch binaries from the release page, then run:\n```\naws s3 cp wordcount-linux-amd64 s3://yourbucket/binaries/wordcount\n./launch -launchSpecification launchspec.json -region eu-west-1 -cmd \"./wordcount -input s3://yourbucket/inputfile.txt\" -executable s3://yourbucket/binaries/wordcount\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fin4it%2Fgomap","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fin4it%2Fgomap","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fin4it%2Fgomap/lists"}