{"id":13645715,"url":"https://github.com/grailbio/bigmachine","last_synced_at":"2025-04-21T14:32:04.209Z","repository":{"id":54812274,"uuid":"207667174","full_name":"grailbio/bigmachine","owner":"grailbio","description":"Bigmachine is a library for self-managing serverless computing in Go","archived":false,"fork":false,"pushed_at":"2023-05-16T18:18:07.000Z","size":650,"stargazers_count":200,"open_issues_count":9,"forks_count":20,"subscribers_count":19,"default_branch":"master","last_synced_at":"2025-03-23T01:02:32.723Z","etag":null,"topics":["cloud","distributed-systems","golang","parallel-computing","serverless-framework"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/grailbio.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-09-10T21:34:18.000Z","updated_at":"2024-12-02T18:26:08.000Z","dependencies_parsed_at":"2024-01-14T09:57:23.862Z","dependency_job_id":"1310d60f-7082-44f5-8873-9b20f6f31370","html_url":"https://github.com/grailbio/bigmachine","commit_stats":null,"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/grailbio%2Fbigmachine","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/grailbio%2Fbigmachine/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/grailbio%2Fbigmachine/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/grailbio%2Fbigmachine/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/ow
ners/grailbio","download_url":"https://codeload.github.com/grailbio/bigmachine/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250070259,"owners_count":21369844,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cloud","distributed-systems","golang","parallel-computing","serverless-framework"],"created_at":"2024-08-02T01:02:40.374Z","updated_at":"2025-04-21T14:32:03.710Z","avatar_url":"https://github.com/grailbio.png","language":"Go","funding_links":[],"categories":["Go"],"sub_categories":[],"readme":"# Bigmachine\n\nBigmachine is a toolkit for building self-managing serverless applications\nin [Go](https://golang.org/).\nBigmachine provides an API that lets a driver process\nform an ad-hoc cluster of machines to\nwhich user code is transparently distributed.\n\nUser code is exposed through services,\nwhich are stateful Go objects associated with each machine.\nServices expose one or more Go methods that may\nbe dispatched remotely.\nUser services can call remote user services;\nthe driver process may also make service calls.\n\nPrograms built using Bigmachine are agnostic\nto the underlying machine implementation,\nallowing distributed systems to be easily tested\nthrough an [in-process implementation](https://godoc.org/github.com/grailbio/bigmachine/testsystem),\nor inspected during development using [local Unix processes](https://godoc.org/github.com/grailbio/bigmachine#Local).\n\nBigmachine currently supports instantiating clusters of\n[EC2 
machines](https://godoc.org/github.com/grailbio/bigmachine/ec2system);\nother systems may be implemented with a [relatively compact Go interface](https://godoc.org/github.com/grailbio/bigmachine#System).\n\n- API documentation: [godoc.org/github.com/grailbio/bigmachine](https://godoc.org/github.com/grailbio/bigmachine)\n- Issue tracker: [github.com/grailbio/bigmachine/issues](https://github.com/grailbio/bigmachine/issues)\n- [![CI](https://github.com/grailbio/bigmachine/workflows/CI/badge.svg)](https://github.com/grailbio/bigmachine/actions?query=workflow%3ACI)\n- Implementation notes: [github.com/grailbio/bigmachine/blob/master/docs/impl.md](https://github.com/grailbio/bigmachine/blob/master/docs/impl.md)\n\nHelp wanted!\n- [GCP compute engine backend](https://github.com/grailbio/bigmachine/issues/1)\n- [Azure VM backend](https://github.com/grailbio/bigmachine/issues/2)\n\n# A walkthrough of a simple Bigmachine program\n\nCommand [bigpi](https://github.com/grailbio/bigmachine/blob/master/cmd/bigpi/bigpi.go)\nis a relatively silly use of cluster computing,\nbut illustrative nonetheless.\nBigpi estimates the value of $\\pi$\nby sampling $N$ random coordinates inside of the unit square,\ncounting how many $C \\le N$ fall inside of the unit circle.\nOur estimate is then $\\pi = 4*C/N$.\n\nThis is inherently parallelizable:\nwe can generate samples across a large number of nodes,\nand then when we're done,\nthey can be summed up to produce our estimate of $\\pi$.\n\nTo do this in Bigmachine,\nwe first define a service that samples some $n$ points\nand reports how many fell inside the unit circle.\n\n```\ntype circlePI struct{}\n\n// Sample generates n points inside the unit square and reports\n// how many of these fall inside the unit circle.\nfunc (circlePI) Sample(ctx context.Context, n uint64, m *uint64) error {\n\tr := rand.New(rand.NewSource(rand.Int63()))\n\tfor i := uint64(0); i \u003c n; i++ {\n\t\tif i%1e7 == 0 {\n\t\t\tlog.Printf(\"%d/%d\", i, 
n)\n\t\t}\n\t\tx, y := r.Float64(), r.Float64()\n\t\tif (x-0.5)*(x-0.5)+(y-0.5)*(y-0.5) \u003c 0.25 {\n\t\t\t*m++\n\t\t}\n\t}\n\treturn nil\n}\n```\n\nThe only notable aspect of this code is the signature of `Sample`,\nwhich follows the schema below:\nmethods that follow this convention may be dispatched remotely by Bigmachine,\nas we shall see soon.\n\n```\nfunc (service) Name(ctx context.Context, arg argtype, reply *replytype) error\n```\n\nNext follows the program's `func main`.\nFirst, we do the regular kind of setup a main might:\ndefine some flags,\nparse them,\nset up logging.\nAfterwards, a driver must call\n[`driver.Start`](https://godoc.org/github.com/grailbio/bigmachine/driver#Start),\nwhich initializes Bigmachine\nand sets up the process so that it may be bootstrapped properly on remote nodes.\n([Package driver](https://godoc.org/github.com/grailbio/bigmachine/driver)\nprovides high-level facilities for configuring and bootstrapping Bigmachine;\nadventurous users may use the lower-level facilities in\n[package bigmachine](https://godoc.org/github.com/grailbio/bigmachine)\nto accomplish the same.)\n`driver.Start()` returns a [`*bigmachine.B`](https://godoc.org/github.com/grailbio/bigmachine#B)\nwhich can be used to start new machines.\n\n```\nfunc main() {\n\tvar (\n\t\tnsamples = flag.Int(\"n\", 1e10, \"number of samples to make\")\n\t\tnmachine = flag.Int(\"nmach\", 5, \"number of machines to provision for the task\")\n\t)\n\tlog.AddFlags()\n\tflag.Parse()\n\tb := driver.Start()\n\tdefer b.Shutdown()\n```\n\nNext,\nwe start a number of machines (as configured by flag nmach),\nwait for them to finish launching,\nand then distribute our sampling among them,\nusing a simple \"scatter-gather\" RPC pattern.\nFirst, let's look at the code that starts the machines\nand waits for them to be ready.\n\n```\n// Start the desired number of machines,\n// each with the circlePI service.\nmachines, err := b.Start(ctx, *nmachine, bigmachine.Services{\n\t\"PI\": 
circlePI{},\n})\nif err != nil {\n\tlog.Fatal(err)\n}\nlog.Print(\"waiting for machines to come online\")\nfor _, m := range machines {\n\t\u003c-m.Wait(bigmachine.Running)\n\tlog.Printf(\"machine %s %s\", m.Addr, m.State())\n\tif err := m.Err(); err != nil {\n\t\tlog.Fatal(err)\n\t}\n}\nlog.Print(\"all machines are ready\")\n```\n\nMachines are started with [`(*B).Start`](https://godoc.org/github.com/grailbio/bigmachine#B.Start),\nto which we provide the set of services that should be installed on each machine.\n(The service object provided is serialized and initialized on the remote machine,\nso it may include any desired parameters.)\nStart returns a slice of\n[`Machine`](https://godoc.org/github.com/grailbio/bigmachine#Machine)\ninstances representing each machine that was launched.\nMachines can be in a number of\n[states](https://godoc.org/github.com/grailbio/bigmachine#State).\nIn this case,\nwe keep it simple and just wait for them to enter their running states,\nafter which the underlying machines are fully bootstrapped and the services\nhave been installed and initialized.\nAt this point,\nall of the machines are ready to receive RPC calls.\n\nThe remainder of `main` distributes a portion of\nthe total samples to be taken to each machine,\nwaits for them to complete,\nand then prints with the precision warranted by the number of samples taken.\nNote that this code further subdivides the work by calling PI.Sample\nonce for each processor available on the underlying machines\nas defined by [`Machine.Maxprocs`](https://godoc.org/github.com/grailbio/bigmachine#Machine.Maxprocs),\nwhich depends on the physical machine configuration.\n\n\n```\n// Number of samples per machine\nnumPerMachine := uint64(*nsamples) / uint64(*nmachine)\n\n// Divide the total number of samples among all the processors on\n// each machine. 
Aggregate the counts and then report the estimate.\nvar total uint64\nvar cores int\ng, ctx := errgroup.WithContext(ctx)\nfor _, m := range machines {\n\tm := m\n\tfor i := 0; i \u003c m.Maxprocs; i++ {\n\t\tcores++\n\t\tg.Go(func() error {\n\t\t\tvar count uint64\n\t\t\terr := m.Call(ctx, \"PI.Sample\", numPerMachine/uint64(m.Maxprocs), \u0026count)\n\t\t\tif err == nil {\n\t\t\t\tatomic.AddUint64(\u0026total, count)\n\t\t\t}\n\t\t\treturn err\n\t\t})\n\t}\n}\nlog.Printf(\"distributing work among %d cores\", cores)\nif err := g.Wait(); err != nil {\n\tlog.Fatal(err)\n}\nlog.Printf(\"total=%d nsamples=%d\", total, *nsamples)\nvar (\n\tpi   = big.NewRat(int64(4*total), int64(*nsamples))\n\tprec = int(math.Log(float64(*nsamples)) / math.Log(10))\n)\nfmt.Printf(\"π = %s\\n\", pi.FloatString(prec))\n```\n\nWe can now build and run our binary like an ordinary Go binary.\n\n```\n$ go build\n$ ./bigpi\n2019/10/01 16:31:20 waiting for machines to come online\n2019/10/01 16:31:24 machine https://localhost:42409/ RUNNING\n2019/10/01 16:31:24 machine https://localhost:44187/ RUNNING\n2019/10/01 16:31:24 machine https://localhost:41618/ RUNNING\n2019/10/01 16:31:24 machine https://localhost:41134/ RUNNING\n2019/10/01 16:31:24 machine https://localhost:34078/ RUNNING\n2019/10/01 16:31:24 all machines are ready\n2019/10/01 16:31:24 distributing work among 5 cores\n2019/10/01 16:32:05 total=7853881995 nsamples=10000000000\nπ = 3.1415527980\n```\n\nHere,\nBigmachine distributed computation across logical machines,\neach corresponding to a single core on the host system.\nEach machine ran in its own Unix process (with its own address space),\nand RPC happened through mutually authenticated HTTP/2 connections.\n\n[Package driver](https://godoc.org/github.com/grailbio/bigmachine/driver)\nprovides some convenient flags that help configure the Bigmachine runtime.\nUsing these, we can configure Bigmachine to launch machines into EC2 instead:\n\n```\n$ ./bigpi 
-bigm.system=ec2\n2019/10/01 16:38:10 waiting for machines to come online\n2019/10/01 16:38:43 machine https://ec2-54-244-211-104.us-west-2.compute.amazonaws.com/ RUNNING\n2019/10/01 16:38:43 machine https://ec2-54-189-82-173.us-west-2.compute.amazonaws.com/ RUNNING\n2019/10/01 16:38:43 machine https://ec2-34-221-143-119.us-west-2.compute.amazonaws.com/ RUNNING\n...\n2019/10/01 16:38:43 all machines are ready\n2019/10/01 16:38:43 distributing work among 5 cores\n2019/10/01 16:40:19 total=7853881995 nsamples=10000000000\nπ = 3.1415527980\n```\n\nOnce the program is running,\nwe can use standard Go tooling to examine its behavior.\nFor example,\n[expvars](https://golang.org/pkg/expvar/)\nare aggregated across all of the machines managed by Bigmachine,\nand the various profiles (CPU, memory, contention, etc.)\nare available as merged profiles through `/debug/bigmachine/pprof`.\nFor example,\nin the first version of `bigpi`,\nthe CPU profile highlighted a problem:\nwe were using the global `rand.Float64` which requires a lock;\nthe resulting contention was easily identifiable through the CPU profile:\n\n```\n$ go tool pprof localhost:3333/debug/bigmachine/pprof/profile\nFetching profile over HTTP from http://localhost:3333/debug/bigmachine/pprof/profile\nSaved profile in /Users/marius/pprof/pprof.045821636.samples.cpu.001.pb.gz\nFile: 045821636\nType: cpu\nTime: Mar 16, 2018 at 3:17pm (PDT)\nDuration: 2.51mins, Total samples = 16.80mins (669.32%)\nEntering interactive mode (type \"help\" for commands, \"o\" for options)\n(pprof) top\nShowing nodes accounting for 779.47s, 77.31% of 1008.18s total\nDropped 51 nodes (cum \u003c= 5.04s)\nShowing top 10 nodes out of 58\n      flat  flat%   sum%        cum   cum%\n   333.11s 33.04% 33.04%    333.11s 33.04%  runtime.procyield\n   116.71s 11.58% 44.62%    469.55s 46.57%  runtime.lock\n    76.35s  7.57% 52.19%    347.21s 34.44%  sync.(*Mutex).Lock\n    65.79s  6.53% 58.72%     65.79s  6.53%  runtime.futex\n    41.48s  4.11% 
62.83%    202.05s 20.04%  sync.(*Mutex).Unlock\n    34.10s  3.38% 66.21%    364.36s 36.14%  runtime.findrunnable\n       33s  3.27% 69.49%        33s  3.27%  runtime.cansemacquire\n    32.72s  3.25% 72.73%     51.01s  5.06%  runtime.runqgrab\n    24.88s  2.47% 75.20%     57.72s  5.73%  runtime.unlock\n    21.33s  2.12% 77.31%     21.33s  2.12%  math/rand.(*rngSource).Uint64\n```\n\nAnd after the fix,\nit looks much healthier:\n\n```\n$ go tool pprof localhost:3333/debug/bigmachine/pprof/profile\n...\n      flat  flat%   sum%        cum   cum%\n    29.09s 35.29% 35.29%     82.43s   100%  main.circlePI.Sample\n    22.95s 27.84% 63.12%     52.16s 63.27%  math/rand.(*Rand).Float64\n    16.09s 19.52% 82.64%     16.09s 19.52%  math/rand.(*rngSource).Uint64\n     9.05s 10.98% 93.62%     25.14s 30.49%  math/rand.(*rngSource).Int63\n     4.07s  4.94% 98.56%     29.21s 35.43%  math/rand.(*Rand).Int63\n     1.17s  1.42%   100%      1.17s  1.42%  math/rand.New\n         0     0%   100%     82.43s   100%  github.com/grailbio/bigmachine/rpc.(*Server).ServeHTTP\n         0     0%   100%     82.43s   100%  github.com/grailbio/bigmachine/rpc.(*Server).ServeHTTP.func2\n         0     0%   100%     82.43s   100%  golang.org/x/net/http2.(*serverConn).runHandler\n         0     0%   100%     82.43s   100%  net/http.(*ServeMux).ServeHTTP\n```\n\n# GOOS, GOARCH, and Bigmachine\n\nWhen using Bigmachine's\n[EC2 machine implementation](https://godoc.org/github.com/grailbio/bigmachine/ec2system),\nthe process is bootstrapped onto remote EC2 instances.\nCurrently,\nthe only supported GOOS/GOARCH combination for these is linux/amd64.\nBecause of this,\nthe driver program must also be linux/amd64.\nHowever,\nBigmachine also understands the\n[fatbin format](https://godoc.org/github.com/grailbio/base/fatbin),\nso that users can compile fat binaries using the gofat tool.\nFor example,\nthe above can be run on a macOS driver if the binary is built using gofat instead of 'go':\n\n```\nmacOS $ 
GO111MODULE=on go get github.com/grailbio/base/cmd/gofat\ngo: finding github.com/grailbio/base/cmd/gofat latest\ngo: finding github.com/grailbio/base/cmd latest\nmacOS $ gofat build\nmacOS $ ./bigpi -bigm.system=ec2\n...\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgrailbio%2Fbigmachine","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgrailbio%2Fbigmachine","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgrailbio%2Fbigmachine/lists"}