{"id":13412558,"url":"https://github.com/chrislusf/gleam","last_synced_at":"2025-05-13T19:14:08.907Z","repository":{"id":37587749,"uuid":"66631967","full_name":"chrislusf/gleam","owner":"chrislusf","description":"Fast, efficient, and scalable distributed map/reduce system, DAG execution, in memory or on disk, written in pure Go, runs standalone or distributedly.","archived":false,"fork":false,"pushed_at":"2025-04-20T21:47:11.000Z","size":8783,"stargazers_count":3515,"open_issues_count":40,"forks_count":291,"subscribers_count":143,"default_branch":"master","last_synced_at":"2025-04-27T20:01:48.854Z","etag":null,"topics":["distributed-computing","distributed-systems","golang","map-reduce"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chrislusf.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2016-08-26T08:44:48.000Z","updated_at":"2025-04-27T05:56:16.000Z","dependencies_parsed_at":"2024-04-28T05:34:13.840Z","dependency_job_id":"ee770370-0a24-4693-bf17-bec64bd2c898","html_url":"https://github.com/chrislusf/gleam","commit_stats":{"total_commits":712,"total_committers":30,"mean_commits":"23.733333333333334","dds":0.1938202247191011,"last_synced_commit":"21a93f5696c27bda4563ed5a593b2c9cabd913f4"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chrislusf%2Fgleam","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chrislusf%2Fgleam/tags",
"releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chrislusf%2Fgleam/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chrislusf%2Fgleam/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chrislusf","download_url":"https://codeload.github.com/chrislusf/gleam/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254010813,"owners_count":21998995,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["distributed-computing","distributed-systems","golang","map-reduce"],"created_at":"2024-07-30T20:01:26.078Z","updated_at":"2025-05-13T19:14:08.882Z","avatar_url":"https://github.com/chrislusf.png","language":"Go","funding_links":["https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick\u0026hosted_button_id=EEECLJ8QGTTPC"],"categories":["Distributed Systems","Go","Go (134)","Relational Databases","分布式系统","golang","分佈式系統","\u003cspan id=\"分布式系统-distributed-systems\"\u003e分布式系统 Distributed Systems\u003c/span\u003e"],"sub_categories":["Search and Analytic Databases","Advanced Console UIs","SQL 查询语句构建库","检索及分析资料库","高级控制台界面","高級控制台界面","\u003cspan id=\"高级控制台用户界面-advanced-console-uis\"\u003e高级控制台用户界面 Advanced Console UIs\u003c/span\u003e"],"readme":"# Gleam\n[![Build 
Status](https://travis-ci.org/chrislusf/gleam.svg?branch=master)](https://travis-ci.org/chrislusf/gleam)\n[![GoDoc](https://godoc.org/github.com/chrislusf/gleam/flow?status.svg)](https://godoc.org/github.com/chrislusf/gleam/flow)\n[![Wiki](https://img.shields.io/badge/docs-wiki-blue.svg)](https://github.com/chrislusf/gleam/wiki)\n[![Go Report Card](https://goreportcard.com/badge/github.com/chrislusf/gleam)](https://goreportcard.com/report/github.com/chrislusf/gleam)\n[![codecov](https://codecov.io/gh/chrislusf/gleam/branch/master/graph/badge.svg)](https://codecov.io/gh/chrislusf/gleam)\n\nGleam is a high-performance, efficient distributed execution system that is also\nsimple, generic, flexible, and easy to customize.\n\nGleam is built in Go, and the user-defined computation can be written in Go,\nUnix pipe tools, or any streaming program.\n\n### High Performance\n\n* Pure Go mappers and reducers have high performance and concurrency.\n* Data flows through memory, optionally to disk.\n* Multiple map reduce steps are merged together for better performance.\n\n\n### Memory Efficient\n\n* Gleam does not have the common GC problems that plague other languages. Each executor runs in a separate OS process. The memory is managed by the OS. One machine can host many more executors.\n* Gleam master and agent servers are memory efficient, consuming about 10 MB of memory.\n* Gleam tries to automatically adjust the required memory size based on data size hints, avoiding manual trial-and-error memory tuning.\n\n### Flexible\n* The Gleam flow can run standalone or distributed.\n* Adjustable between in-memory mode and OnDisk mode.\n\n### Easy to Customize\n* The Go code is much simpler to read than Scala, Java, or C++.\n\n# One Flow, Multiple Ways to Execute\nGleam code defines the flow, specifying each dataset (vertex) and computation step (edge), and builds up a directed\nacyclic graph (DAG). There are multiple ways to execute the DAG.\n\nThe default way is to run locally. 
This works in most cases.\n\nHere we mostly talk about the distributed mode.\n\n## Distributed Mode\nThe distributed mode involves several components: Master, Agent, Executor, and Driver.\n\n### Gleam Driver\n\n* The Driver is the program users write. It defines the flow and talks to the Master, Agents, and Executors.\n\n### Gleam Master\n\n* The Master is one single server that collects resource information from Agents.\n* It stores only transient resource information, so it can be restarted.\n* When the Driver program starts, it asks the Master for available Executors on Agents.\n\n### Gleam Agent\n\n* Agents run on any machine that can run computations.\n* Agents periodically send resource usage updates to the Master.\n* When the Driver program has Executors assigned, it talks to the Agents to start the Executors.\n* Agents also manage the datasets generated by each Executor.\n\n### Gleam Executor\n* Executors are started by Agents. They read inputs from external sources or previous datasets, process them, and output to a new dataset.\n\n### Dataset\n\n* The datasets are managed by Agents. 
By default, the data flows only through memory and the network, without touching slow disk.\n* Optionally the data can be persisted to disk.\n\nBy keeping the data in memory, the flow can have back pressure and can support stream computation naturally.\n\n# Documentation\n* [Gleam Wiki](https://github.com/chrislusf/gleam/wiki)\n* [Installation](https://github.com/chrislusf/gleam/wiki/Installation)\n* [Gleam Flow API GoDoc](https://godoc.org/github.com/chrislusf/gleam/flow)\n* [gleam-dev on Slack](https://join.slack.com/t/gleam-dev/shared_invite/enQtMzIzMjYxMTg0MDgxLWFhYjhhM2E5NDVhNDA1OWM0NjZjMWQ0ZGY5NGJkZDZkNzU3OTUzNzNhZmNhYzIxNjc1ZmU1MzMyMzk4NTk4ZGM)\n\n# Standalone Example\n\n## Word Count\n\n#### Word Count\n\nFirst, register the Go functions.\nEach registration returns a mapper or reducer function ID, which can then be passed to the flow.\n\n```go\npackage main\n\nimport (\n\t\"flag\"\n\t\"strings\"\n\n\t\"github.com/chrislusf/gleam/distributed\"\n\t\"github.com/chrislusf/gleam/flow\"\n\t\"github.com/chrislusf/gleam/gio\"\n\t\"github.com/chrislusf/gleam/plugins/file\"\n)\n\nvar (\n\tisDistributed = flag.Bool(\"distributed\", false, \"run in distributed or not\")\n\tTokenize      = gio.RegisterMapper(tokenize)\n\tAppendOne     = gio.RegisterMapper(appendOne)\n\tSum           = gio.RegisterReducer(sum)\n)\n\nfunc main() {\n\n\tgio.Init()   // If the command line invokes the mapper or reducer, execute it and exit.\n\tflag.Parse() // optional, since gio.Init() will call this also.\n\n\tf := flow.New(\"top5 words in passwd\").\n\t\tRead(file.Txt(\"/etc/passwd\", 2)).  // read a txt file, partitioned into 2 shards\n\t\tMap(\"tokenize\", Tokenize).    // invoke the registered \"tokenize\" mapper function.\n\t\tMap(\"appendOne\", AppendOne).  // invoke the registered \"appendOne\" mapper function.\n\t\tReduceByKey(\"sum\", Sum).         
// invoke the registered \"sum\" reducer function.\n\t\tSort(\"sortBySum\", flow.OrderBy(2, true)).\n\t\tTop(\"top5\", 5, flow.OrderBy(2, false)).\n\t\tPrintlnf(\"%s\\t%d\")\n\n\tif *isDistributed {\n\t\tf.Run(distributed.Option())\n\t} else {\n\t\tf.Run()\n\t}\n\n}\n\nfunc tokenize(row []interface{}) error {\n\tline := gio.ToString(row[0])\n\tfor _, s := range strings.FieldsFunc(line, func(r rune) bool {\n\t\treturn !('A' \u003c= r \u0026\u0026 r \u003c= 'Z' || 'a' \u003c= r \u0026\u0026 r \u003c= 'z' || '0' \u003c= r \u0026\u0026 r \u003c= '9')\n\t}) {\n\t\tgio.Emit(s)\n\t}\n\treturn nil\n}\n\nfunc appendOne(row []interface{}) error {\n\trow = append(row, 1)\n\tgio.Emit(row...)\n\treturn nil\n}\n\nfunc sum(x, y interface{}) (interface{}, error) {\n\treturn gio.ToInt64(x) + gio.ToInt64(y), nil\n}\n\n```\n\nNow you can execute the binary directly, or pass the \"-distributed\" option to run in distributed mode.\nDistributed mode needs a simple setup, described later.\n\nA more elaborate example, using the predefined mappers and reducers, is here:\nhttps://github.com/chrislusf/gleam/blob/master/examples/word_count_in_go/word_count_in_go.go\n\n\n#### Word Count by Unix Pipe Tools\nHere is another way to do something similar with Unix pipe tools.\n\nUnix pipes are easy to use for sequential pipelines, but limited for fan-out, and even more limited for fan-in.\n\nWith Gleam, fan-in and fan-out parallel pipes become very easy.\n\n```go\npackage main\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/chrislusf/gleam/flow\"\n\t\"github.com/chrislusf/gleam/gio\"\n\t\"github.com/chrislusf/gleam/gio/mapper\"\n\t\"github.com/chrislusf/gleam/plugins/file\"\n\t\"github.com/chrislusf/gleam/util\"\n)\n\nfunc main() {\n\n\tgio.Init()\n\n\tflow.New(\"word count by unix pipes\").\n\t\tRead(file.Txt(\"/etc/passwd\", 2)).\n\t\tMap(\"tokenize\", mapper.Tokenize).\n\t\tPipe(\"lowercase\", \"tr 'A-Z' 'a-z'\").\n\t\tPipe(\"sort\", \"sort\").\n\t\tPipe(\"uniq\", \"uniq -c\").\n\t\tOutputRow(func(row *util.Row) error 
{\n\n\t\t\tfmt.Printf(\"%s\\n\", gio.ToString(row.K[0]))\n\n\t\t\treturn nil\n\t\t}).Run()\n\n}\n```\n\nThis example uses OutputRow() to process the output row directly.\n\n## Join two CSV files\n\nAssume file \"a.csv\" has fields \"a1, a2, a3, a4, a5\" \nand file \"b.csv\" has fields \"b1, b2, b3\". \nWe want to join the rows where a1 = b2, \nand the output format should be \"a1, a4, b3\".\n\n```go\npackage main\n\nimport (\n\t. \"github.com/chrislusf/gleam/flow\"\n\t\"github.com/chrislusf/gleam/gio\"\n\t\"github.com/chrislusf/gleam/plugins/file\"\n)\n\nfunc main() {\n\n\tgio.Init()\n\n\tf := New(\"join a.csv and b.csv by a1=b2\")\n\ta := f.Read(file.Csv(\"a.csv\", 1)).Select(\"select\", Field(1, 4)) // a1, a4\n\tb := f.Read(file.Csv(\"b.csv\", 1)).Select(\"select\", Field(2, 3)) // b2, b3\n\n\ta.Join(\"joinByKey\", b).Printlnf(\"%s,%s,%s\").Run() // a1, a4, b3\n\n}\n\n```\n\n# Distributed Computing\n## Set Up Gleam Cluster Locally\nStart a gleam master and several gleam agents:\n```sh\n# start \"gleam master\" on a server\n\u003e go get github.com/chrislusf/gleam/distributed/gleam\n\u003e gleam master --address=\":45326\"\n\n# start up \"gleam agent\" on some different servers or ports\n\u003e gleam agent --dir=2 --port 45327 --host=127.0.0.1\n\u003e gleam agent --dir=3 --port 45328 --host=127.0.0.1\n```\n\n## Set Up Gleam Cluster on Kubernetes\n\nInstall [Kubernetes tools](https://kubernetes.io/docs/tasks/tools/).\nAt the very least you will need a local K8s cluster, Docker, and kubectl.\nDocker Desktop [provides all of this out of the box](https://www.docker.com/products/docker-desktop/).\n\n### Install Skaffold\n\nChoose the appropriate binary [here](https://skaffold.dev/docs/install/#standalone-binary).\nFor example, on macOS ARM64:\n\n```sh\ncurl -Lo skaffold https://storage.googleapis.com/skaffold/releases/latest/skaffold-darwin-arm64 \u0026\u0026 \\\nsudo install skaffold /usr/local/bin/\n```\n\n### Run Latest Version\n\n```sh\ncd ./k8s\nskaffold run 
--profile base \n```\n\nUse `skaffold delete --profile base` to bring the cluster down.\n\n### Alternatively, Build \u0026 Run a Local Version\n\nYou can build a local copy of gleam for development with hot reloading:\n\n```sh\ncd ./k8s\nskaffold dev --profile dev \n```\n\n## Change Execution Mode\n\nAfter the flow is defined, the Run() function can be executed in local mode or distributed mode.\n\n```go\n  f := flow.New(\"\")\n  ...\n  // 1. local mode\n  f.Run()\n\n  // 2. distributed mode\n  // (requires: import \"github.com/chrislusf/gleam/distributed\")\n  f.Run(distributed.Option())\n  // or, with an explicit master address:\n  f.Run(distributed.Option().SetMaster(\"master_ip:45326\"))\n\n```\n\n# Important Features\n\n* Fault-tolerant [OnDisk()](https://godoc.org/github.com/chrislusf/gleam/flow#Dataset.OnDisk).\n* Read data from Local, HDFS, or S3.\n* Data Sources\n  * [Cassandra](https://github.com/chrislusf/gleam/tree/master/plugins/cassandra), with [example](https://github.com/chrislusf/gleam/tree/master/examples/cassandra_reader)\n  * [Kafka](https://github.com/chrislusf/gleam/tree/master/plugins/kafka) [example](https://github.com/chrislusf/gleam/tree/master/examples/kafka_reader)\n  * [Parquet files](https://github.com/chrislusf/gleam/tree/master/plugins/file/parquet) [example](https://github.com/chrislusf/gleam/tree/master/examples/parquet)\n  * [ORC files](https://github.com/chrislusf/gleam/tree/master/plugins/file/orc) [example](https://github.com/chrislusf/gleam/tree/master/examples/orc)\n  * [CSV files](https://github.com/chrislusf/gleam/tree/master/plugins/file/csv) [example](https://github.com/chrislusf/gleam/tree/master/examples/csv)\n  * [TSV files](https://github.com/chrislusf/gleam/tree/master/plugins/file/tsv) \n  * [TXT files](https://github.com/chrislusf/gleam/tree/master/plugins/file/txt) \n  * Raw Socket [example]()\n\n# Status\nGleam is just beginning. Here are a few todo items. 
Any help is welcome!\n* [Add new plugin to read external data](https://github.com/chrislusf/gleam/wiki/Add-New-Source).\n* Add windowing functions similar to Apache Beam/Flink. (in progress)\n* Add schema support for each dataset.\n* Support using SQL as a flow step, similar to LINQ.\n* Add dataset metadata for better caching of often re-calculated data.\n\nEspecially Need Help Now:\n* Go implementation to read Parquet files.\n\nPlease start using it and give feedback. Help is needed, and anything is welcome. Small things count: fixing documentation, adding a logo, adding a Docker image, blogging about it, sharing it, etc.\n\n[![](https://www.paypalobjects.com/en_US/i/btn/btn_donateCC_LG.gif)](https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick\u0026hosted_button_id=EEECLJ8QGTTPC)\n\n## License\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not use this file except in compliance with the License.\nYou may obtain a copy of the License at\n\n    http://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchrislusf%2Fgleam","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchrislusf%2Fgleam","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchrislusf%2Fgleam/lists"}